ElevenLabs expands into full voice stack
Convergence means voice infrastructure is no longer stopping at model APIs; it is moving up into full call-handling systems. ElevenLabs started in premium text-to-speech, Deepgram in transcription, and Cartesia in low-latency voice generation, but each now bundles the adjacent layers needed to run an actual agent. That shifts the developer's choice from the best single model to the best integrated runtime for latency, control, and cost.
- ElevenLabs now spans the whole stack. It offers low-latency text-to-speech, realtime speech-to-text via Scribe v2 Realtime, and a Conversational AI platform that combines speech-to-text, an LLM, text-to-speech, and turn-taking into a deployable voice agent product.
- Deepgram made the same move from transcription into a unified stack. Its docs now position Speech to Text, Text to Speech, and a Voice Agent API together, with built-in barge-in detection, turn-taking prediction, and orchestration, so teams do not have to stitch separate vendors together.
- Cartesia is following the same pattern with Sonic for text-to-speech, Ink for speech-to-text, and Line for complete voice agents. That matters because once every vendor offers the same core boxes, competition shifts to whose voices sound better, whose stack responds faster, and whose all-in workflow is cheaper to operate.
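
To make the "core boxes" concrete, here is a minimal Python sketch of the pipeline these platforms bundle: speech-to-text, an LLM, text-to-speech, and a crude turn-taking rule. Every name in it (`SpeechToText`, `LanguageModel`, `TextToSpeech`, `VoiceAgent`, `handle_turn`, and the toy stand-in classes) is hypothetical and chosen for illustration; none of it is ElevenLabs', Deepgram's, or Cartesia's actual API, and real systems stream audio rather than passing whole byte strings.

```python
from dataclasses import dataclass
from typing import Protocol


# Hypothetical interfaces for the three model layers; not any vendor's SDK.
class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LanguageModel(Protocol):
    def reply(self, transcript: str) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class VoiceAgent:
    """Runs one conversational turn: caller audio in, agent audio out."""

    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def handle_turn(self, caller_audio: bytes) -> bytes:
        transcript = self.stt.transcribe(caller_audio)
        # Naive turn-taking stand-in: stay silent if nothing was said.
        if not transcript.strip():
            return b""
        return self.tts.synthesize(self.llm.reply(transcript))


# Toy stand-ins so the sketch runs without any external service.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class CannedLLM:
    def reply(self, transcript: str) -> str:
        return f"You said: {transcript}"


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


agent = VoiceAgent(stt=EchoSTT(), llm=CannedLLM(), tts=BytesTTS())
```

In a stitched-together setup, each of the three fields could come from a different vendor, and the glue in `handle_turn` is the team's own code to maintain. The bundled platforms described above move that glue, plus barge-in and turn timing, behind a single product.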
The next phase is fewer point solutions and more bundled voice stacks sold on production outcomes. Providers that can own transcription, synthesis, turn timing, telephony, and monitoring in one system will be best positioned to win enterprise voice agent budgets, while standalone model vendors get pushed toward lower-margin commodity pricing.