DeepSeek's cache-driven API advantage

The March 2025 operating disclosure, which showed a theoretical 545% cost-profit ratio on a single day's traffic, suggests that the gross margin profile on DeepSeek's API revenue is unusually high for an infrastructure business.

This disclosure points to DeepSeek behaving less like a GPU reseller and more like a software layer that happens to sell inference. The important detail is not just low headline pricing; it is that the model architecture, cache-based pricing, and open-model distribution all push serving cost down faster than price. That leaves room for very high gross margins on direct API traffic, even while the company keeps cutting prices to win share.

  • Most inference businesses look like heavy infrastructure, where gross margin is capped by GPU costs: Fireworks runs near 50% gross margin and targets 60%, while Together runs near 45%. DeepSeek stands out because its model activates only 37B of 671B parameters per token, uses FP8 and sparse attention, and charges less when a request hits the disk cache instead of requiring fresh compute.
  • The pricing design matters as much as the model. DeepSeek prices repeated prompt prefixes at roughly one tenth of the fresh-input rate, which nudges developers toward workflows that are cheaper for DeepSeek to serve. In practice, a long system prompt, codebase, or document context can be reused across many calls: billing still accrues, but the compute intensity of later calls is much lower.
  • That margin profile helps explain why DeepSeek can publish open weights and still monetize. Many companies will download the model and run it elsewhere, often through hosts like Together or Fireworks, but the easiest path for startups and agent builders is still the hosted API. If direct API economics stay this strong, DeepSeek can afford to treat open source as top-of-funnel marketing rather than revenue leakage.
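The efficiency claim in the first bullet reduces to simple arithmetic: with a sparse mixture-of-experts design, per-token compute scales roughly with the active parameter count, not the total. A minimal sketch using the 37B-active / 671B-total figures cited above (the dense-equivalent framing is an illustrative approximation, not a serving-cost model):

```python
# Back-of-envelope active-parameter ratio for a sparse MoE model,
# using the 37B-active / 671B-total figures from the bullet above.
TOTAL_PARAMS_B = 671   # total parameters, billions
ACTIVE_PARAMS_B = 37   # parameters activated per token, billions

# Forward-pass FLOPs scale roughly with active params, so per-token
# compute is close to that of a ~37B dense model.
active_ratio = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
print(f"{active_ratio:.1%} of weights touched per token")
```

This ratio (about 5.5%) is why a 671B-parameter model can be priced against much smaller dense competitors.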
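The cache-pricing effect in the second bullet can be made concrete with a blended-cost calculation. The dollar rates below are hypothetical placeholders chosen only to reflect the roughly ten-to-one ratio between fresh and cached input described above, not DeepSeek's actual price sheet:

```python
# Sketch of cache-aware input billing. Rates are illustrative
# (hypothetical), preserving the ~10x fresh-vs-cached gap.
FRESH_INPUT = 1.00   # $ per million input tokens on a cache miss
CACHED_INPUT = 0.10  # $ per million input tokens on a cache hit

def blended_input_cost(tokens_m: float, hit_rate: float) -> float:
    """Input cost when a fraction `hit_rate` of tokens hit the prefix cache."""
    hit_tokens = tokens_m * hit_rate
    miss_tokens = tokens_m - hit_tokens
    return hit_tokens * CACHED_INPUT + miss_tokens * FRESH_INPUT

# An agent replaying a long system prompt: 90% of input tokens cached.
print(blended_input_cost(10, 0.9))   # 10M tokens -> $1.90 vs $10.00 uncached
```

The developer's bill drops by over 80%, while the provider's compute cost on cached tokens drops even further, which is exactly the margin dynamic the bullet describes.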

The next step is a market where model labs with the best efficiency curves set prices for everyone else. If DeepSeek keeps lowering cost per token faster than rivals, it can force the whole open model stack toward thinner infrastructure margins while preserving strong economics on its own API and becoming the default reasoning engine inside a wide range of developer tools and agents.