Multi-Model Routing for LLMs

Towaki Takikawa, CEO and co-founder of Outerport, on the rise of DevOps for LLMs

Interview
In a future where huge models dominate everything, running these models will be very expensive - certainly more expensive than running smaller models.

Inference cost is what keeps the model market from collapsing into one giant model. Bigger models need more GPU memory, more memory movement, and longer load times, so the price gap shows up both in API bills and in self-hosted serving. That is why practical stacks already mix frontier models for hard reasoning with smaller or specialized models for retrieval, classification, extraction, and workflow steps where speed and cost matter more than maximum intelligence.

  • The cost problem is not just tokens. Large models are huge files that must be loaded from storage into CPU memory and then into GPU memory. In the Outerport interview, Takikawa notes that even a relatively small 15GB model can take up to a minute to load, and multi-model pipelines compound that delay and the GPU time wasted waiting.
  • Teams already operate this way in practice. Heyday describes running a mix of small embedding models and larger fine-tuned Llama- and Mixtral-based models on GPUs, and using OpenAI only for certain tasks. That is a concrete example of segmenting work by task value, latency requirements, and cost tolerance.
  • The pricing ladder reinforces the split. OpenAI prices GPT-4.1 above GPT-4.1 mini and nano, and notes that efficiency work was needed to lower inference costs. At the same time, company research on Anthropic highlights smaller, cheaper, specialized models as a structural risk to frontier-model vendors.
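The load-time point above can be sanity-checked with back-of-the-envelope arithmetic. This sketch is illustrative only: the effective bandwidth figure is an assumption for a storage-to-GPU path, not a number from the interview.

```python
def load_time_seconds(model_size_gb: float, effective_bandwidth_gbps: float) -> float:
    """Naive load-time estimate: checkpoint size divided by the
    effective end-to-end bandwidth (disk -> CPU memory -> GPU memory)."""
    return model_size_gb / effective_bandwidth_gbps

# Assumed ~0.25 GB/s effective path (placeholder, not a measured figure):
# a 15 GB checkpoint would take on the order of a minute to load.
print(load_time_seconds(15.0, 0.25))  # 60.0 seconds
```

The effective bandwidth is usually far below raw NVMe or PCIe peak because of deserialization and copies between stages, which is why caching and keeping warm copies in CPU memory matter so much for multi-model serving.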

The next phase of the market looks like a routing layer, not a winner-take-all model layer. More applications will send easy, repetitive, high-volume work to cheap local or smaller hosted models and reserve the largest models for rare, high-stakes steps. That shift makes deployment, caching, and model-swapping infrastructure more central, not less.
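The routing idea can be sketched in a few lines. Everything here is hypothetical: the model names, prices, and task categories are placeholders, not real vendor quotes, and a production router would also weigh latency budgets and confidence signals.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Model:
    name: str
    cost_per_1k_tokens: float  # USD, illustrative placeholder pricing

# Hypothetical fleet: a cheap small/local model and an expensive frontier model.
SMALL = Model("small-local", 0.0002)
LARGE = Model("frontier-api", 0.03)

def route(task_type: str, stakes: str) -> Model:
    """Send high-volume, routine work to the small model; reserve the
    frontier model for rare, high-stakes reasoning steps."""
    routine = {"classification", "extraction", "retrieval", "embedding"}
    if task_type in routine:
        return SMALL
    if stakes == "high":
        return LARGE
    return SMALL

# Routine extraction goes to the cheap model; a high-stakes reasoning
# step pays for the frontier model.
print(route("extraction", "low").name)   # small-local
print(route("reasoning", "high").name)   # frontier-api
```

Even this toy policy captures the economics: if most traffic is routine, average cost per request tracks the small model's price, while the large model's price applies only to the rare steps that justify it.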