Stack War for Voice Agents

ElevenLabs at $90M ARR

All of these tools are now converging on a common feature set of speech-to-text, text-to-speech, and real-time conversational AI.

Convergence means voice infrastructure is starting to sell as a bundle, not a single model. A buyer building an AI receptionist or support bot no longer wants one vendor for transcription, another for voice output, and a third for turn-taking. They want one stack that can listen, decide, speak, and be monitored in real time. That shifts competition from the best point solution to the best full workflow, where latency, reliability, and procurement simplicity matter as much as raw model quality.

Deepgram began as speech recognition, then added Aura text-to-speech and a Voice Agent API that combines speech-to-text, text-to-speech, and orchestration. Cartesia followed the same path from Sonic text-to-speech into Ink speech-to-text and then Line for building and deploying complete agents. The product map is becoming symmetrical across rivals.

This is happening because the real customer job is a live conversation. In production, a voice agent must detect when a caller stops speaking, transcribe noisy telephony audio, send the text to an LLM, generate audio fast enough to avoid awkward pauses, and handle interruptions. Owning more of that chain lets vendors cut latency and reduce integration pain.
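The chain described above can be sketched as a single turn of a voice agent. This is a minimal illustration, not any vendor's API: every name here (`is_end_of_turn`, `run_turn`, the `transcribe`, `generate_reply`, `synthesize`, and `caller_is_speaking` callables) is hypothetical, empty frames stand in for silence, and the per-stage timings only show where latency accumulates when each stage is a separate hop.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class TurnResult:
    transcript: str
    reply_text: str
    audio_chunks_played: int
    interrupted: bool
    stage_ms: Dict[str, float] = field(default_factory=dict)


def is_end_of_turn(frames: List[bytes], silence_frames_needed: int = 3) -> bool:
    """Toy endpointing (assumption): the caller has stopped speaking
    when the last N frames are empty, i.e. silent."""
    if len(frames) < silence_frames_needed:
        return False
    return all(len(f) == 0 for f in frames[-silence_frames_needed:])


def _timed(name: str, stage_ms: Dict[str, float], fn, *args):
    """Run one stage and record its wall-clock duration in milliseconds."""
    start = time.perf_counter()
    out = fn(*args)
    stage_ms[name] = (time.perf_counter() - start) * 1000.0
    return out


def run_turn(
    frames: List[bytes],
    transcribe: Callable[[List[bytes]], str],      # hypothetical STT stage
    generate_reply: Callable[[str], str],          # hypothetical LLM stage
    synthesize: Callable[[str], List[bytes]],      # hypothetical TTS stage
    caller_is_speaking: Callable[[int], bool],     # hypothetical barge-in check
) -> Optional[TurnResult]:
    # 1. Wait until the caller has finished speaking.
    if not is_end_of_turn(frames):
        return None

    stage_ms: Dict[str, float] = {}
    # 2. Transcribe, 3. decide, 4. synthesize, timing each hop.
    transcript = _timed("stt", stage_ms, transcribe, frames)
    reply_text = _timed("llm", stage_ms, generate_reply, transcript)
    chunks = _timed("tts", stage_ms, synthesize, reply_text)

    # 5. Play audio chunk by chunk and stop as soon as the caller interrupts.
    played = 0
    interrupted = False
    for i, _chunk in enumerate(chunks):
        if caller_is_speaking(i):
            interrupted = True
            break
        played += 1

    return TurnResult(transcript, reply_text, played, interrupted, stage_ms)


if __name__ == "__main__":
    result = run_turn(
        frames=[b"hi", b"there", b"", b"", b""],
        transcribe=lambda fs: "what are your hours",
        generate_reply=lambda text: "We are open nine to five.",
        synthesize=lambda text: [w.encode() for w in text.split()],
        caller_is_speaking=lambda i: i == 3,  # caller barges in at the 4th chunk
    )
    print(result)
```

In this sketch the total response delay is roughly the sum of the `stt`, `llm`, and `tts` entries in `stage_ms`. When each stage comes from a different vendor, every entry also carries a network round trip, which is one way to read the claim that owning more of the chain cuts latency.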

For ElevenLabs, adding speech-to-text and agents is both offensive and defensive. It opens a larger spend pool than voice generation alone, and it prevents platforms like Deepgram or Cartesia from owning the customer relationship with a one-vendor story. ElevenLabs already offers Scribe for speech-to-text and Agents for conversational AI on top of its core text-to-speech base.

The next phase is a stack war over who becomes the default runtime for voice agents. The winners will be the vendors that can package fast transcription, natural speech, interruption handling, testing, and enterprise deployment into one dependable system, then use that wedge to move from a developer tool into core call-center and customer-interaction infrastructure.
