Voice AI as Building Blocks

Diving deeper into the Sandbar company report: speech-to-text and text-to-speech are becoming building blocks rather than moats.

This shifts the battle away from raw voice AI and toward owning the workflow around it. If speech-to-text and text-to-speech can be rented as APIs, a wearable like Sandbar cannot win just by turning speech into text or text into a voice. It has to win on where it fits into daily behavior, how fast it feels, and whether its notes, memory, and follow-up loop become more useful than a phone or existing meeting software.

  • Voice startups already mix and match providers. Vapi lets developers swap in Deepgram for transcription and ElevenLabs for voice, then charges its own orchestration fee on top. That is what building-block software looks like in practice: the speech layer is a component inside a larger product, not the whole product.
  • The winning products often capture value one layer up. Gong used call recording and transcription to build a $285M ARR system of record for sales activity, and Outreach reached about $250M ARR by baking call intelligence into a broader workflow product. The moat came from workflow ownership, not the transcript alone.
  • The suppliers are still valuable, but for different reasons. ElevenLabs grew from an estimated $90M ARR in October 2024 to about $330M by December 2025 by selling premium voice quality and moving into higher-level audio tools, while Deepgram positions itself around speech APIs and real-time voice understanding. That shows infrastructure can scale, but it competes on cost, latency, and quality rather than exclusivity.
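
The mix-and-match pattern in the first bullet can be made concrete with a minimal Python sketch. Everything here is an illustrative assumption, not Vapi's, Deepgram's, or ElevenLabs' actual API: `Transcriber`, `Synthesizer`, the stub providers, and `VoiceNotes` are hypothetical names, and the stubs simply treat audio bytes as UTF-8 text. The point is only that the product layer holds the note history while the speech vendors behind it stay interchangeable.

```python
from dataclasses import dataclass, field
from typing import List, Protocol


class Transcriber(Protocol):
    """Any speech-to-text vendor, seen from the product's side."""
    def transcribe(self, audio: bytes) -> str: ...


class Synthesizer(Protocol):
    """Any text-to-speech vendor, seen from the product's side."""
    def synthesize(self, text: str) -> bytes: ...


class StubTranscriber:
    """Hypothetical stand-in for a rented speech-to-text API."""
    def __init__(self, vendor: str) -> None:
        self.vendor = vendor

    def transcribe(self, audio: bytes) -> str:
        # Assumption for the sketch: the "audio" is already UTF-8 text.
        return audio.decode("utf-8")


class StubSynthesizer:
    """Hypothetical stand-in for a rented text-to-speech API."""
    def __init__(self, vendor: str) -> None:
        self.vendor = vendor

    def synthesize(self, text: str) -> bytes:
        # Assumption for the sketch: the "voice" is just the encoded text.
        return text.encode("utf-8")


@dataclass
class VoiceNotes:
    """Product layer: owns the note history, rents the speech layer."""
    stt: Transcriber
    tts: Synthesizer
    history: List[str] = field(default_factory=list)

    def capture(self, audio: bytes) -> bytes:
        note = self.stt.transcribe(audio)
        self.history.append(note)  # the workflow data stays with the product
        return self.tts.synthesize(f"Saved note {len(self.history)}: {note}")


app = VoiceNotes(stt=StubTranscriber("provider-a"), tts=StubSynthesizer("voice-a"))
app.capture(b"call the supplier")
app.stt = StubTranscriber("provider-b")  # swap the vendor; the history is untouched
app.capture(b"send the follow-up")
print(app.history)  # ['call the supplier', 'send the follow-up']
```

Swapping `stt` changes nothing the user sees, which is the sense in which the speech layer is a building block: whatever is hard to replace in this sketch lives in `history` and the workflow around it, not in the vendor call.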

From here, voice products will keep getting easier to launch and harder to defend. The durable winners will be the ones that either become the default infrastructure inside many apps, or package commodity voice models into a product that owns a recurring user workflow, data history, and distribution point that cheaper copycats cannot easily dislodge.