Fireworks AI
Cloud platform for running, tuning, and scaling open-source large language, vision, and multimodal models with low-latency inference

Revenue: $315.00M (2026)
Valuation: $4.00B (2025)
Funding: $331.00M (2025)

Details
Headquarters: San Mateo, CA
CEO: Lin Qiao
Website
Milestones
Founding year: 2022

Revenue

Sacra estimates that Fireworks AI hit $315M in annualized revenue in February 2026, up 416% year-over-year and up from about $305M at the end of 2025.

The customer base grew from roughly 1,000 companies at the time of the Series B to more than 10,000 companies by October 2025. Named customers include Cursor, Perplexity, Notion, Sourcegraph, Uber, DoorDash, Shopify, and Upwork, spanning code assistance, conversational AI, enterprise search, and agentic workflows.

Blended annualized revenue per company works out to roughly $28,000 across the full base, though revenue is likely concentrated among a smaller number of large production deployments.
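As a sanity check, the customer count implied by the blended figure can be backed out from the two numbers above. This is illustrative arithmetic only; the exact customer count is not disclosed.

```python
# Back out the customer count implied by the estimates above.
annualized_revenue = 315_000_000   # Sacra's Feb 2026 annualized revenue estimate
blended_per_company = 28_000       # stated blended revenue per company

implied_companies = annualized_revenue / blended_per_company
print(f"Implied customer base: {implied_companies:,.0f}")  # ~11,250
```

An implied base of roughly 11,250 companies is consistent with the "more than 10,000 companies" disclosed in October 2025.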

The company's gross margin sits at approximately 50%, below the 70%-plus typical of subscription software businesses, due to the GPU infrastructure costs embedded in its cost of goods sold. Fireworks has told investors it is targeting 60% gross margins through continued GPU optimization and improved utilization efficiency.

Valuation & Funding

Fireworks AI raised a $250M Series C in October 2025 at a $4 billion post-money valuation, more than seven times its valuation from a year prior.

The round was led by Lightspeed Venture Partners, Index Ventures, and Evantic, with participation from existing investor Sequoia Capital. It included a $230M primary round and a $20M secondary transaction from the same investors.

Before the Series C, Fireworks raised a $52M Series B in July 2024 at a $552M valuation, led by Sequoia Capital, with participation from NVIDIA, AMD, Databricks Ventures, MongoDB Ventures, and Benchmark. Earlier rounds included a $25M Series A in early 2024 and a seed round, bringing total funding raised to over $327M as of October 2025.

Business Model

Fireworks operates as a B2B managed infrastructure platform with a usage-based monetization model layered across multiple product surfaces.

The pricing structure maps to the customer lifecycle. Serverless inference is billed per token, fine-tuning is billed per training token, reinforcement fine-tuning is billed per GPU-hour, and on-demand dedicated deployments are billed per GPU-second or GPU-hour. Reserved capacity is contracted separately, typically with longer commitments and negotiated pricing. This lets Fireworks capture revenue at nearly every stage of a customer's AI development workflow: experimentation, production serving, model adaptation, and scaled deployment.
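To make the lifecycle pricing concrete, here is a minimal cost-model sketch across the four usage-billed surfaces. All rates below are invented placeholders for illustration, not published Fireworks prices, which vary by model and GPU type.

```python
# Hypothetical monthly bill across Fireworks' usage-billed surfaces.
# Every rate here is an illustrative assumption, not published pricing.
RATES = {
    "serverless_per_1m_tokens": 0.90,   # assumed $/1M inference tokens
    "fine_tune_per_1m_tokens": 0.50,    # assumed $/1M training tokens
    "rft_per_gpu_hour": 9.00,           # assumed $/GPU-hour (reinforcement fine-tuning)
    "dedicated_per_gpu_hour": 4.00,     # assumed $/GPU-hour (on-demand dedicated)
}

def monthly_bill(serverless_tokens_m, train_tokens_m, rft_gpu_hours, dedicated_gpu_hours):
    """Sum usage-based charges across the lifecycle stages described above."""
    return (
        serverless_tokens_m * RATES["serverless_per_1m_tokens"]
        + train_tokens_m * RATES["fine_tune_per_1m_tokens"]
        + rft_gpu_hours * RATES["rft_per_gpu_hour"]
        + dedicated_gpu_hours * RATES["dedicated_per_gpu_hour"]
    )

# Example: a team serving 5B tokens, training on 200M tokens,
# running 40 RFT GPU-hours, and one dedicated GPU for a month (720h)
print(monthly_bill(5_000, 200, 40, 720))
```

The point of the sketch is structural: a single customer generates line items on several meters at once, which is how Fireworks monetizes the whole workflow rather than only serving volume.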

Go-to-market is bottom-up at entry and top-down at expansion. Developers can start immediately with self-serve API keys and pay-as-you-go billing. Larger customers can move into negotiated enterprise relationships with higher rate limits, reserved capacity, account management, custom optimization, and private deployment options. Fireworks also has a field and partner sales motion, including an AWS Strategic Collaboration Agreement with funded proofs-of-concept and a startup acceleration program, which gives it access to enterprise buyers through existing procurement channels rather than requiring a standalone vendor evaluation.

The core economic logic is that proprietary inference optimization translates infrastructure engineering into pricing power. If Fireworks can serve the same model faster and at higher throughput than a customer could achieve through self-hosting or a generic cloud inference stack, it can charge a premium while still undercutting the total cost of the alternative. FireAttention and FireOptimizer are the mechanism by which Fireworks earns margin above raw GPU resale economics.

Multi-LoRA is important to the business model because it improves utilization economics. By consolidating many fine-tuned variants onto a single base-model deployment, Fireworks reduces the compute cost per variant while keeping customers on-platform and increasing the likelihood that they fine-tune more frequently. It makes customization cheaper without removing Fireworks from the process.
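The utilization argument can be sketched numerically. The GPU counts and prices below are assumptions chosen only to illustrate the consolidation effect, not measured figures.

```python
# Illustrative utilization math for multi-LoRA serving (assumed numbers).
gpu_hour_cost = 4.00    # assumed dedicated GPU-hour price
variants = 20           # fine-tuned LoRA adapters of one base model
hours = 24              # one day of serving

# Naive approach: one dedicated deployment per fine-tuned variant
naive_cost = variants * gpu_hour_cost * hours

# Multi-LoRA: all adapters share a single base-model deployment
# (assumes the shared deployment fits on ~2 GPUs; illustrative)
shared_gpus = 2
shared_cost = shared_gpus * gpu_hour_cost * hours

print(naive_cost / shared_cost)  # cost multiple avoided by consolidation
```

Under these assumptions, consolidation cuts compute cost per variant by an order of magnitude, which is why cheap per-variant serving encourages customers to fine-tune more, not less.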

The gross margin profile, approximately 50%, with a stated target of 60%, reflects the reality that Fireworks is not a pure software business. GPU procurement, capacity planning, and regional infrastructure are real cost inputs. The path to margin expansion runs through better GPU utilization, hardware efficiency gains on newer architectures like Blackwell, and a shift in revenue mix toward higher-value dedicated deployments, fine-tuning, and enterprise contracts rather than commodity serverless token volume.
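The margin-expansion math is straightforward: holding price constant, better utilization lowers GPU cost per unit of revenue. The sketch below uses a simplified, assumed inverse relationship between utilization gains and COGS, purely to show the sensitivity.

```python
# Toy gross-margin sensitivity to GPU utilization (illustrative only).
revenue = 100.0        # normalized revenue
cogs_today = 50.0      # implies the ~50% gross margin cited above

def gross_margin(utilization_gain):
    """Assume GPU COGS scales inversely with utilization improvement."""
    cogs = cogs_today / (1 + utilization_gain)
    return 1 - cogs / revenue

# Under this assumption, a ~25% utilization improvement
# reaches the stated 60% gross-margin target
print(round(gross_margin(0.25), 2))
```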

The strategic architecture is organized around open-model neutrality. Fireworks does not bet on any single foundation model. It bets on being the best place to run whichever open model is winning at a given moment. That makes the platform resilient to model turnover. When DeepSeek or a new Llama release displaces an older model, Fireworks benefits from the transition rather than being threatened by it, as long as it maintains day-zero support for new releases.

Competition

The inference market has segmented into at least four distinct competitive layers, and Fireworks faces pressure from each.

Managed open-model platforms

Together AI is Fireworks' closest direct competitor. Together offers serverless inference, dedicated model inference, dedicated container inference, batch processing, and GPU clusters, a breadth that lets customers span from quick API access to full infrastructure relationships without switching vendors. Together raised a $305M Series B in February 2025 and has been generating more than $150M in annualized revenue, with named customers including Cursor. Its Instant Clusters and Blackwell-based GPU offerings extend beyond API serving into infrastructure procurement, which is a threat in large enterprise accounts where capacity guarantees matter as much as model-serving ergonomics.

Baseten competes less as a commodity model catalog and more as an enterprise inference engineering platform. Its pitch centers on custom and proprietary models, compound AI systems, and a configurable runtime that layers optimizations on top of open-source serving engines including TensorRT, SGLang, vLLM, and TGI. Baseten explicitly offers self-hosted and hybrid deployment inside customer VPCs, which gives it an advantage in compliance-heavy accounts where security isolation is a procurement requirement. Baseten raised a $300M Series E at a $5 billion valuation in February 2026, increasing its ability to compete on enterprise GTM, support, and capacity commitments.

Replicate competes at the developer and community end of the market, with a broad model catalog and easy experimentation workflows. It is less enterprise-direct than Fireworks but competes for developer mindshare at the top of the funnel, where early adoption decisions can later shape platform standardization.

Vertically integrated silicon players

Groq, Cerebras, and SambaNova compete by combining custom silicon with inference APIs, attacking latency and cost from the hardware layer rather than through software optimization on commodity GPUs. Groq's partnership with Meta to accelerate the official Llama API gives it unusually strong distribution and first-party credibility around open-model access, a direct challenge to Fireworks' own day-zero model support positioning. Cerebras and SambaNova rank highly on specific models and metrics, making inference look increasingly like a hardware-led market in performance-sensitive workloads. These players pressure Fireworks at the premium end of latency-sensitive use cases and raise the bar for how much proprietary advantage a GPU-based software optimization layer can sustainably capture.

Hyperscaler bundling

AWS Bedrock, Google Vertex AI, and Microsoft Azure are the most structurally threatening competitors because they collapse model access, infrastructure, governance, and enterprise contracting into a single platform. Bedrock's custom model import now supports customized open-source architectures including Qwen variants, letting enterprises run adapted open models through a unified Bedrock API without adopting a standalone inference vendor. For procurement-heavy buyers, that can be sufficient even if Fireworks offers better raw performance, because Bedrock benefits from existing AWS security posture, billing relationships, and vendor consolidation. Fireworks' AWS Strategic Collaboration Agreement and its availability inside Microsoft Foundry are partly defensive responses to this dynamic, plugging into incumbent procurement channels rather than competing against them. Databricks Model Serving poses a similar bundling threat for customers with strong data platform footprints, where inference can be absorbed into an existing governed workflow rather than evaluated as a standalone service.

Open-source commoditization

The most structural long-term pressure on Fireworks comes from the improvement of open-source inference frameworks themselves. Baseten openly builds on SGLang, vLLM, and TensorRT. Snowflake has released Arctic Inference as an open-source vLLM plugin. NVIDIA is pushing NIM as a packaging and distribution layer for enterprise inference. As these components improve, the proprietary advantage embedded in FireAttention and FireOptimizer faces compression unless Fireworks keeps extending its stack faster than the ecosystem catches up. OpenPipe and Predibase attack the fine-tuning and post-training layer from the workflow and tooling side, while OpenRouter aggregates inference across providers at the routing layer, each narrowing a different slice of Fireworks' value proposition.

TAM Expansion

New products and modalities

Fireworks' clearest expansion path is moving from text inference into a full production AI runtime spanning multimodal inputs, voice, embeddings, reranking, and agentic orchestration.

The Voice Agent Platform is the most concrete new surface. It bundles transcription, language models, tool calling, and streaming deployment into a single stack targeting sub-500ms response latency, with documented use cases in drive-thrus, contact centers, and field operations. That shifts Fireworks from developer tooling into budget categories historically owned by contact center software, BPO services, and enterprise telephony, a materially larger spending pool than API inference alone.

Reinforcement fine-tuning for agentic tasks is another high-upside wedge. By letting customers train models to improve on tool use, coding, and multi-step reasoning through programmatic evaluation, Fireworks becomes the adaptation layer for domain-specific agents that can outperform generic frontier APIs on specialized workflows. That expands the addressable budget from inference spend into the broader enterprise automation and workflow software market.

Customer base expansion

Fireworks started with AI-native developers and has moved progressively upmarket. Its enterprise product now includes zero data retention by default, SSO, audit logs, data residency controls, HIPAA and SOC2 compliance posture, and airgapped EKS deployments, capabilities that open regulated verticals including healthcare, financial services, and government-adjacent workloads that were previously inaccessible.

The more important expansion dynamic is moving from point workloads to platform-of-record status inside existing accounts. A customer that starts with serverless inference for a single feature can expand into dedicated deployments, fine-tuning, reinforcement fine-tuning, embeddings for RAG stacks, voice agents, and cross-region reserved capacity. Hebbia's usage pattern, prioritizing fast access to newly released open models, unified API surface, high-concurrency latency guarantees, and enterprise-grade data handling, illustrates how a single inference relationship can anchor a broader infrastructure dependency over time.

The AWS alliance accelerates this motion by letting Fireworks reach buyers inside existing AWS workflows through funded proofs-of-concept and startup acceleration support, rather than requiring a standalone vendor evaluation.

Geographic expansion

Fireworks now operates a global multi-region fleet with deployments in Frankfurt, Iceland, Tokyo, and across US, Europe, and APAC regions. That infrastructure enables it to sell to customers with latency requirements or data residency mandates outside the United States, a prerequisite for standardization in European enterprises subject to GDPR and in APAC markets with local hosting requirements.

Geographic expansion is not purely a sales motion for Fireworks; it is also a product capability. Regional deployment options are part of the enterprise product, and the ability to offer data residency guarantees in specific jurisdictions differentiates Fireworks from inference providers that operate primarily from US-based infrastructure.

Vertical integration up the stack

Fireworks' highest-upside path is moving from GPU-efficient inference into higher-value control points above and beside raw serving. FireOptimizer, prompt caching, direct routing, reserved capacity, and reinforcement fine-tuning all let Fireworks capture value from latency optimization, quality optimization, and workload-specific adaptation, not just token throughput.

Enterprise buyers increasingly care about application-level outcomes such as faster coding copilots, lower call-center latency, and higher tool-call accuracy, rather than who exposes the cheapest model endpoint. If Fireworks keeps owning the adaptation layer between open models and production use cases, its addressable market expands from inference spend into performance engineering, post-training, and application-enablement budgets. The acquisition of Hathora to deepen real-time and global compute orchestration is consistent with this trajectory.

Risks

Inference commoditization: As open-source serving frameworks like vLLM and SGLang improve and quality-adjusted AI prices continue to fall, the proprietary performance advantage in FireAttention and FireOptimizer is likely to compress. Fireworks must continue expanding into tuning, agents, voice, and enterprise governance to avoid a scenario in which its value proposition narrows to GPU resale economics with 50% gross margins and no durable differentiation.

Hyperscaler capture: Fireworks benefits from its AWS and Microsoft partnerships, but those same relationships create structural dependency risk if inference, agent deployment, observability, and security are absorbed into larger cloud-native control planes like Bedrock AgentCore and Azure Foundry. If enterprise buyers increasingly prefer single-vendor procurement and existing cloud relationships over specialized inference vendors, Fireworks could be pushed toward optimization add-on status rather than serving as the primary AI platform.

Hardware concentration: Fireworks' performance claims and margin improvement thesis are closely tied to access to leading-edge NVIDIA hardware, including H100, H200, and Blackwell B200 GPUs. The company does not own its GPU fleet but sources capacity from third parties, which creates exposure to allocation constraints, supply bottlenecks, and hardware transition timing, particularly as NVIDIA has entered the inference market through its acquisition of Lepton and launch of a competing GPU cloud marketplace.


DISCLAIMERS

This report is for information purposes only and is not to be used or considered as an offer or the solicitation of an offer to sell or to buy or subscribe for securities or other financial instruments. Nothing in this report constitutes investment, legal, accounting or tax advice or a representation that any investment or strategy is suitable or appropriate to your individual circumstances or otherwise constitutes a personal trade recommendation to you.

This research report has been prepared solely by Sacra and should not be considered a product of any person or entity that makes such report available, if any.

Information and opinions presented in the sections of the report were obtained or derived from sources Sacra believes are reliable, but Sacra makes no representation as to their accuracy or completeness. Past performance should not be taken as an indication or guarantee of future performance, and no representation or warranty, express or implied, is made regarding future performance. Information, opinions and estimates contained in this report reflect a determination at its original date of publication by Sacra and are subject to change without notice.

Sacra accepts no liability for loss arising from the use of the material presented in this report, except that this exclusion of liability does not apply to the extent that liability arises under specific statutes or regulations applicable to Sacra. Sacra may have issued, and may in the future issue, other reports that are inconsistent with, and reach different conclusions from, the information presented in this report. Those reports reflect different assumptions, views and analytical methods of the analysts who prepared them and Sacra is under no obligation to ensure that such other reports are brought to the attention of any recipient of this report.

All rights reserved. All material presented in this report, unless specifically indicated otherwise is under copyright to Sacra. Sacra reserves any and all intellectual property rights in the report. All trademarks, service marks and logos used in this report are trademarks or service marks or registered trademarks or service marks of Sacra. Any modification, copying, displaying, distributing, transmitting, publishing, licensing, creating derivative works from, or selling any report is strictly prohibited. None of the material, nor its content, nor any copy of it, may be altered in any way, transmitted to, copied or distributed to any other party, without the prior express written permission of Sacra. Any unauthorized duplication, redistribution or disclosure of this report will result in prosecution.