ElevenLabs expands into full voice stack

All of these tools are now converging on the common feature set of speech-to-text, text-to-speech, and real-time conversational AI.

Convergence means voice infrastructure no longer stops at model APIs and is moving up into full call-handling systems. ElevenLabs started in premium text-to-speech, Deepgram in transcription, and Cartesia in low-latency voice generation, but each is now bundling the adjacent layers needed to run an actual agent, which shifts the developer's choice from the best single model to the best integrated runtime for latency, control, and cost.

ElevenLabs now spans the whole stack. It offers low-latency text-to-speech, real-time speech-to-text via Scribe v2 Realtime, and a Conversational AI platform that combines speech-to-text, an LLM, text-to-speech, and turn-taking into a deployable voice agent product.

Deepgram made the same move from transcription into a unified stack. Its docs now position Speech to Text, Text to Speech, and a Voice Agent API together, with built-in barge-in detection, turn-taking prediction, and orchestration, so teams do not have to stitch separate vendors together.

Cartesia is following the same pattern with Sonic for text-to-speech, Ink for speech-to-text, and Line for complete voice agents. That matters because once every vendor offers the same core components, competition shifts to whose voices sound better, whose stack responds faster, and whose all-in workflow is cheaper to operate.

The next phase is fewer point solutions and more bundled voice stacks sold on production outcomes. Providers that can own transcription, synthesis, turn timing, telephony, and monitoring in one system will be best positioned to win enterprise voice agent budgets, while standalone model vendors get pushed toward lower-margin commodity pricing.
