Cartesia funding, news & analysis

Cartesia

Real-time expressive text-to-speech API and voice AI model for low-latency conversational agents

#ai

Funding: $191.00M (2025)

Details

Headquarters: San Francisco, CA
CEO: Karan Goel
Website: cartesia.ai

Milestones

Founding year: 2023

Valuation & Funding

Cartesia has raised $191M in total funding across multiple rounds.

The company raised a $27M seed round in December 2024, backed by Index Ventures. In March 2025, Kleiner Perkins led a $64M Series A, with participation from Index Ventures, Lightspeed Venture Partners, NVIDIA, A* Capital, Factory, Greycroft, Dell Technologies Capital, Samsung Ventures, Conviction, General Catalyst, and SV Angel. In October 2025, Cartesia closed a $100M Series A-II round, bringing total capital raised to $191M.
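As a quick arithmetic check on the figures above, the following minimal sketch sums the three stated rounds and prints the running total after each one. It uses only the round sizes given in this section; the dictionary keys are labels chosen here for readability.

```python
# Round sizes in millions of USD, as stated in the paragraph above.
rounds_musd = {
    "Seed (Dec 2024)": 27,
    "Series A (Mar 2025)": 64,
    "Series A-II (Oct 2025)": 100,
}


def cumulative_totals(rounds):
    """Return (round label, running total in $M) pairs in insertion order."""
    total = 0
    out = []
    for name, amount in rounds.items():
        total += amount
        out.append((name, total))
    return out


for name, running in cumulative_totals(rounds_musd):
    print(f"{name}: ${running}M raised to date")
```

The running total after the Series A is $91M, and the Series A-II brings it to the $191M cited above.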

Product

Cartesia is a voice infrastructure platform built around three core products: Sonic for text-to-speech, Ink for speech-to-text, and Line for building and deploying complete voice agents.

The simplest way to understand Cartesia is as the layer that gives an AI brain a mouth and ears with latency low enough for a real phone conversation.

Sonic takes text and returns audio with a time-to-first-audio of under 90 milliseconds, meaning the gap between when a voice agent finishes processing and when a caller hears the first sound is shorter than a human blink. That matters because voice agents can feel robotic and frustrating when there's a noticeable pause before they respond. Sonic-3 supports 40-plus languages and gives developers fine-grained control over speed, volume, and emotional tone, so a healthcare scheduling bot can sound calm and reassuring while a sales agent sounds upbeat, all via API parameters.
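Time-to-first-audio is straightforward to measure against any streaming TTS client: start a timer, request audio, and record when the first chunk arrives. The sketch below illustrates the measurement only. `fake_tts_stream` is a hypothetical stand-in that simulates a delay, not Cartesia's API, and the 50 ms delay is an arbitrary value chosen for the demonstration.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_tts_stream(text: str, first_chunk_delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS client (not a real API)."""
    time.sleep(first_chunk_delay_s)  # simulated model and network delay
    for word in text.split():
        yield word.encode("utf-8")  # pretend each word is one audio chunk


def time_to_first_audio(chunks: Iterable[bytes]) -> Tuple[float, int]:
    """Return (seconds until the first chunk arrived, total chunk count)."""
    start = time.perf_counter()
    first_chunk_at = None
    count = 0
    for _ in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start
        count += 1
    if first_chunk_at is None:
        raise ValueError("stream produced no audio")
    return first_chunk_at, count


ttfa_s, n_chunks = time_to_first_audio(fake_tts_stream("Your appointment is confirmed"))
print(f"time to first audio: {ttfa_s * 1000:.0f} ms over {n_chunks} chunks")
```

In a real deployment the same timer would wrap the provider's streaming call, and the result could be compared against the sub-90 ms figure cited above.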

Ink is Cartesia's streaming speech-to-text model, designed specifically for the messy conditions of live phone calls: background noise, telephony compression artifacts, accents, and the disfluencies of natural speech. Unlike batch transcription tools built for clean audio files, Ink is optimized to return a transcript quickly enough that the reasoning layer can begin formulating a response before the caller has fully finished speaking.

Line is where Cartesia moves from selling model endpoints to selling a complete voice-agent development and deployment environment. A developer starts in the Playground UI: they write a system prompt, pick a voice, set an opening greeting, and optionally add background sound. From there, they move into code, editing a main Python file to add business logic, tool calls, database lookups, or custom prompts, and deploy via CLI. Line handles the audio orchestration, telephony routing, concurrency management, call logging, evaluations, and rollbacks. The result is that a team with an existing LLM-based chatbot can bring it into voice without rebuilding its reasoning layer from scratch.

Cartesia also offers a voice tooling layer on top of the core models. Instant voice cloning lets a developer upload a short audio sample and generate a custom voice identity immediately. Professional voice cloning trains a higher-fidelity model on a larger sample set, suitable for brand personas or virtual avatars. Voice localization takes an existing cloned voice and adapts it to sound native in a target language, rather than keeping the original accent while reading foreign text. A voice changer transforms uploaded audio into a different voice identity, useful for content production, dubbing workflows, or agent prototyping.

The platform is built for production rather than demos. Cartesia offers snapshot versioning so teams can prototype on a rolling base model and then pin a specific release in production to prevent unexpected quality changes. Enterprise deployments can run on-premises or in a private VPC, with SOC 2 Type II, HIPAA, PCI, and GDPR compliance, zero data retention options, SSO, and custom SLAs.
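The prototype-on-rolling, pin-in-production pattern can be expressed as a small configuration rule. The sketch below is an illustration under stated assumptions: the model identifiers (`example-tts` and `example-tts@2025-01-15`) and the environment names are hypothetical placeholders, not Cartesia's actual model IDs or configuration format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelConfig:
    rolling_alias: str                      # follows the latest release
    pinned_snapshot: Optional[str] = None   # fixed release, never changes


def resolve_model(cfg: ModelConfig, environment: str) -> str:
    """Pick the model ID to request for a given environment."""
    if environment == "production":
        # Refuse to ship on a rolling alias: quality could shift unannounced.
        if cfg.pinned_snapshot is None:
            raise ValueError("production requires a pinned snapshot")
        return cfg.pinned_snapshot
    return cfg.rolling_alias


# Hypothetical identifiers, for illustration only.
cfg = ModelConfig(rolling_alias="example-tts", pinned_snapshot="example-tts@2025-01-15")
print(resolve_model(cfg, "development"))
print(resolve_model(cfg, "production"))
```

Failing loudly when no snapshot is pinned is one possible design choice; a team could equally fall back to the rolling alias with a warning.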

Customers like Vapi and Retell use Cartesia as the default speech layer inside their own voice-agent orchestration platforms, meaning Cartesia's infrastructure powers many downstream applications built by those platforms' customers. ServiceNow, Maven AGI, Forethought, and healthcare operators like Assort Health and Arini use Cartesia directly for enterprise voice workflows.

Business Model

Cartesia operates as a B2B developer-first infrastructure platform with a hybrid usage-based and subscription monetization model.

The go-to-market motion is bottom-up: developers discover Cartesia through API documentation, the Playground prototyping environment, and ecosystem integrations with platforms like Vapi, Retell, LiveKit, and Together AI. They start on a free or low-cost plan, build a working voice agent, and then expand as call volume grows.

The subscription tiers (Pro, Startup, Scale, and Enterprise) function mainly as access gates and included-credit bundles rather than as the main source of revenue. Monetization scales with consumption: TTS at one credit per character, STT at one credit per second of audio, professional voice cloning at one million credits to train a voice plus 1.5 credits per character generated, and Line telephony at $0.014 per minute on paid tiers.
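To make the consumption math concrete, here is a minimal estimator built from the rates quoted above. Two points are assumptions rather than facts from this report: that the 1.5-credit rate replaces (rather than adds to) the standard one-credit rate for characters generated with a professional cloned voice, and the usage figures in the example, which are invented. The dollar price of a credit is not stated here, so credits are left unconverted.

```python
from dataclasses import dataclass
from typing import Dict

# Rates as quoted in the paragraph above.
TTS_CREDITS_PER_CHAR = 1.0
STT_CREDITS_PER_SECOND = 1.0
PRO_CLONE_TRAINING_CREDITS = 1_000_000
PRO_CLONE_CREDITS_PER_CHAR = 1.5  # assumed to replace the standard TTS rate
LINE_USD_PER_MINUTE = 0.014


@dataclass
class MonthlyUsage:
    tts_chars: int = 0           # characters spoken with standard voices
    stt_seconds: int = 0         # seconds of audio transcribed
    pro_clone_chars: int = 0     # characters spoken with a professional clone
    pro_voices_trained: int = 0  # professional voices trained this month
    line_minutes: float = 0.0    # Line telephony minutes


def estimate(usage: MonthlyUsage) -> Dict[str, float]:
    """Return total credits consumed and Line telephony cost in USD."""
    credits = (
        usage.tts_chars * TTS_CREDITS_PER_CHAR
        + usage.stt_seconds * STT_CREDITS_PER_SECOND
        + usage.pro_clone_chars * PRO_CLONE_CREDITS_PER_CHAR
        + usage.pro_voices_trained * PRO_CLONE_TRAINING_CREDITS
    )
    return {"credits": credits, "line_usd": usage.line_minutes * LINE_USD_PER_MINUTE}


# Invented example: 10,000 call minutes, one branded voice trained this month.
example = MonthlyUsage(
    tts_chars=2_000_000,
    stt_seconds=600_000,
    pro_clone_chars=400_000,
    pro_voices_trained=1,
    line_minutes=10_000,
)
result = estimate(example)
print(f"credits: {result['credits']:,.0f}, Line telephony: ${result['line_usd']:.2f}")
```

Under these assumptions the example month consumes 4.2 million credits plus about $140 of telephony. The one-off training charge accounts for roughly a quarter of the credits, which illustrates why recurring TTS and STT volume, not tier fees, drives spend over time.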

That structure creates a land-and-expand motion. A startup lands on a Startup plan to prototype a customer support bot. As the bot goes live and call volume grows, TTS and STT spend compounds. The team then adds professional voice cloning for a branded persona, upgrades to Scale for higher concurrency, and eventually adopts Line for deployment, observability, and call analytics, at which point Cartesia captures a larger share of the customer's voice infrastructure stack.

The B2B2B channel through platforms like Vapi and Together AI is a second distribution motion that runs in parallel. When Vapi selects Cartesia as a default TTS provider, every application built on Vapi that uses the default voice configuration becomes a Cartesia revenue source without requiring a direct sales relationship. Together AI's 300,000-plus developer base similarly gives Cartesia distribution that would be expensive to replicate through direct acquisition alone.

The vertical integration of Sonic, Ink, and Line into a single owned stack is the primary structural differentiator in the business model. A competitor selling only TTS competes on cents per character and faces price pressure. Cartesia's argument is that voice quality in production is a system property, determined by the handoff between telephony, transcription, reasoning, and synthesis, and that owning all of those layers lets it optimize across boundaries that multi-vendor stacks cannot. That integration also raises switching costs: a customer who has adopted Line for deployment, observability, and telephony is not replacing a single model endpoint when they consider switching; they are replacing an operational control plane.

The enterprise tier adds a third monetization layer through compliance, security, and deployment flexibility. SOC 2 Type II, HIPAA, PCI, on-prem deployment, custom SLAs, and zero data retention options convert Cartesia from a developer tool into a procurement-eligible enterprise vendor, enabling larger contract values and longer sales cycles with healthcare, financial services, and regulated enterprise buyers.

Competition

The voice AI infrastructure market has shifted from a simple TTS API comparison to a multi-layer competitive landscape where the primary point of competition is control of the real-time conversational stack end-to-end.

Full-stack bundlers

ElevenLabs is Cartesia's most visible direct rival. It competes on both voice quality and platform breadth, offering low-latency Flash and Turbo TTS models, broad language coverage, instant and professional voice cloning, and an agent platform now branded ElevenAgents. ElevenLabs cut conversational AI pricing to ten cents per minute in early 2026, framing the discount as enabled by controlling both the research and product layers. That move narrows the historical distinction between a best-of-breed voice model and an all-in-one agent stack, and it directly pressures Cartesia's cost advantage, which Sacra has previously estimated at roughly 5x cheaper than ElevenLabs on a per-minute basis. ElevenLabs' main advantage is its broader consumer and developer mindshare; Cartesia's main counter is its tighter optimization around real-time conversational latency and enterprise deployment.

Deepgram started in speech-to-text and has expanded into a unified speech infrastructure play with its Aura-2 TTS model, Flux conversational STT, and a Voice Agent API. Deepgram's pitch is enterprise speech infrastructure with shared runtime across STT and TTS, sub-200ms latency claims, and cross-sell leverage from existing STT relationships in contact center and IVR environments. Its installed base and procurement familiarity in regulated enterprise accounts make it a strong competitor in deals where buyers prefer fewer vendors. Cartesia's launch of Ink was a direct response to this dynamic. Without a credible STT offering, Cartesia was structurally disadvantaged in any account where Deepgram could offer a single-vendor story.

Native speech-to-speech models

OpenAI's Realtime API represents the clearest architectural threat to Cartesia's modular positioning. Rather than chaining separate STT and TTS models, the Realtime API processes and generates audio through a single multimodal model, bypassing the component layer entirely. For developers already standardized on OpenAI's ecosystem, the simplicity of one API for reasoning and voice is a strong pull. OpenAI's current constraint is that the Realtime API uses preset voices for safety reasons, which limits custom brand-voice use cases where Cartesia's cloning and localization tools matter. If OpenAI extends voice customization, the architectural threat becomes more acute.

Hume competes less on raw latency and more on emotional intelligence and prosody control. Its EVI product unifies real-time voice interaction, while its Octave TTS model focuses on context-sensitive expression and affective adaptation. Hume is most threatening in coaching, wellness, companion, and premium assistant use cases where emotional naturalness matters more than transactional speed. If those capabilities migrate into mainstream support and sales agents, Hume could pressure Cartesia's quality narrative even in its core verticals.

Specialist TTS rivals and pricing disruptors

Rime's Arcana v3 targets the same developer and enterprise voice-agent workloads as Cartesia with aggressive pricing, HIPAA and SOC 2 compliance materials, and cloud, VPC, or on-prem deployment options. Because Rime positions as a premium TTS substitute inside Vapi and Retell-style stacks rather than a full orchestration layer, it can win accounts where buyers treat TTS as a swappable module, exactly the dynamic that makes Cartesia's vertical integration into Line strategically important.

PlayHT and Smallest.ai compete primarily on price and multilingual coverage. PlayHT's Play 3.0 Mini targets streaming TTS with 30-plus language support and websocket delivery. Smallest.ai pitches aggressive per-character pricing for outbound sales and SMB voice agents where unit economics dominate quality considerations. Neither is a direct threat in enterprise voice-agent orchestration, but both can attract developer experimentation and erode Cartesia's self-serve funnel.

Resemble AI differentiates on security and provenance, bundling TTS, voice agents, STT, voice changer, and deepfake detection into a single commercial package with on-prem enterprise options. In media, regulated enterprise, and trust-sensitive deployments, Resemble's provenance and detection capabilities give it a differentiated angle that Cartesia does not currently match.

Channel dynamics and platform encroachment

Vapi and Retell sit above Cartesia in the stack as voice-agent orchestration platforms, and both support multiple TTS providers with explicit fallback routing across vendors. That makes them powerful distribution partners but also a source of structural risk. Vapi's selection of Cartesia as a default provider drives significant downstream traffic, yet as Cartesia moves upward into orchestration with Line, it increasingly competes with the same platforms it relies on for distribution. Those platforms can respond by multi-sourcing aggressively, negotiating harder on price, or routing traffic toward alternatives. Cartesia's defense is that owning Sonic, Ink, and Line together enables on-prem and air-gapped enterprise deployments that orchestration-only platforms cannot match.
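Fallback routing across vendors is, at its core, an ordered list of providers tried in turn, which is why an orchestration platform can redirect traffic with little effort. The sketch below shows the general pattern only. The provider functions are hypothetical stubs, and nothing here reflects how Vapi or Retell actually implement routing.

```python
from typing import Callable, List, Sequence, Tuple

Provider = Tuple[str, Callable[[str], bytes]]


class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider raised an error."""


def synthesize_with_fallback(text: str, providers: Sequence[Provider]) -> Tuple[str, bytes]:
    """Try providers in priority order; return (provider name, audio bytes)."""
    errors: List[str] = []
    for name, synthesize in providers:
        try:
            return name, synthesize(text)
        except Exception as exc:  # any failure moves on to the next vendor
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stubs standing in for real vendor clients.
def default_vendor(text: str) -> bytes:
    raise TimeoutError("simulated outage")


def backup_vendor(text: str) -> bytes:
    return b"audio:" + text.encode("utf-8")


used, audio = synthesize_with_fallback(
    "Thanks for calling.",
    [("default", default_vendor), ("backup", backup_vendor)],
)
print(f"served by {used}, {len(audio)} bytes")
```

Because the default slot is just the first entry in the list, reordering it is a configuration change for the platform, which is the leverage the paragraph above describes.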

TAM Expansion

New products and vertical integration

Cartesia's expansion from a single TTS API into a three-product stack, Sonic, Ink, and Line, is the clearest near-term TAM expansion already underway. Each layer captures a different budget line: Sonic addresses voice generation spend, Ink addresses transcription spend, and Line addresses deployment, orchestration, and observability spend that previously went to internal engineering or third-party platforms like Vapi and Retell.

The on-prem and on-device deployment capability opens a further expansion path into environments where cloud-only voice models are hard to adopt. Automotive, field operations, hospitality, kiosks, and defense-adjacent workflows represent use cases where latency, privacy, and offline operation requirements make cloud APIs impractical. Cartesia's stated ability to deploy models on-device and in private VPCs allows it to address those markets without requiring a separate product line.

Customer base expansion

Cartesia's current customer references span voice-native infrastructure buyers like Vapi, Retell, LiveKit, and Together AI, as well as enterprise workflow vendors including ServiceNow, Forethought, and Maven AGI, and vertical software operators in healthcare like Assort Health and Arini, sales like Rox, and multilingual workforce communication like toby.

Healthcare is a particularly large expansion surface. Phone-based workflows in scheduling, patient intake, eligibility verification, appointment reminders, and care navigation remain heavily labor-intensive, and Cartesia's HIPAA compliance and telephony-optimized voice quality give it an enterprise pitch in that vertical. Forethought's reference to over one billion monthly customer interactions on its platform indicates the scale of contact center volume that Cartesia could capture as it moves deeper into enterprise CX infrastructure.

The QA and observability layer is an adjacent expansion opportunity. Cartesia's partnership with Cekura for automated voice agent testing points to a broader platform play around evaluation, monitoring, and quality assurance for production voice systems. Owning that layer would increase switching costs and give Cartesia more data to improve its underlying models, a self-reinforcing dynamic that pure model vendors cannot replicate.

Geographic expansion

Cartesia has built out dedicated regional infrastructure and product positioning for India, Western Europe, and Asia Pacific. The India expansion is supported by localization across Hindi, Tamil, Bengali, Telugu, Gujarati, Kannada, Malayalam, Marathi, and Punjabi, languages that represent hundreds of millions of potential voice-agent interactions in customer service, financial services, and healthcare. Western Europe is supported by GDPR compliance and on-prem deployment flexibility, which reduces the primary regulatory friction for enterprise buyers in that region.

Multilingual voice localization, where Cartesia adapts a cloned voice to sound native in a target language rather than keeping the original accent, is the key product capability enabling geographic expansion. Without localization, multilingual deployment produces voice agents that sound foreign to local callers, which undermines the naturalness that makes voice agents effective. With it, a single enterprise customer can deploy a consistent brand voice across multiple markets without rebuilding their voice identity for each language.

Risks

Platform compression: OpenAI's Realtime API and similar native speech-to-speech systems bypass the modular STT-plus-TTS architecture that Cartesia is built around, processing and generating audio through a single model rather than chaining separate components. If enough developers decide that a bundled multimodal voice model is sufficient for their use case, Cartesia's component-level latency and quality advantages become less relevant as a buying criterion.

Channel dependence: A meaningful share of Cartesia's distribution runs through ecosystem partners like Vapi, Retell, LiveKit, and Together AI, which mediate provider selection and support fallback routing across multiple TTS vendors. As Cartesia climbs the stack with Line and competes more directly with those same orchestration platforms, partners have both the incentive and the technical capability to steer traffic toward alternatives or build their own speech layers over time.

Cloning liability: Cartesia's expansion into instant and professional voice cloning puts it in a regulatory environment where the FTC and FCC have both signaled enforcement focus on voice impersonation and AI-generated robocall abuse. As Cartesia sells deeper into outbound calling and enterprise communications, misuse by downstream customers or gaps in consent verification could create regulatory, reputational, and enterprise-trust costs that compound faster than the revenue those products generate.

DISCLAIMERS

This report is for information purposes only and is not to be used or considered as an offer or the solicitation of an offer to sell or to buy or subscribe for securities or other financial instruments. Nothing in this report constitutes investment, legal, accounting or tax advice or a representation that any investment or strategy is suitable or appropriate to your individual circumstances or otherwise constitutes a personal trade recommendation to you.

This research report has been prepared solely by Sacra and should not be considered a product of any person or entity that makes such report available, if any.

Information and opinions presented in the sections of the report were obtained or derived from sources Sacra believes are reliable, but Sacra makes no representation as to their accuracy or completeness. Past performance should not be taken as an indication or guarantee of future performance, and no representation or warranty, express or implied, is made regarding future performance. Information, opinions and estimates contained in this report reflect a determination at its original date of publication by Sacra and are subject to change without notice.

Sacra accepts no liability for loss arising from the use of the material presented in this report, except that this exclusion of liability does not apply to the extent that liability arises under specific statutes or regulations applicable to Sacra. Sacra may have issued, and may in the future issue, other reports that are inconsistent with, and reach different conclusions from, the information presented in this report. Those reports reflect different assumptions, views and analytical methods of the analysts who prepared them and Sacra is under no obligation to ensure that such other reports are brought to the attention of any recipient of this report.

All rights reserved. All material presented in this report, unless specifically indicated otherwise is under copyright to Sacra. Sacra reserves any and all intellectual property rights in the report. All trademarks, service marks and logos used in this report are trademarks or service marks or registered trademarks or service marks of Sacra. Any modification, copying, displaying, distributing, transmitting, publishing, licensing, creating derivative works from, or selling any report is strictly prohibited. None of the material, nor its content, nor any copy of it, may be altered in any way, transmitted to, copied or distributed to any other party, without the prior express written permission of Sacra. Any unauthorized duplication, redistribution or disclosure of this report will result in prosecution.