Model Weights as Live Infrastructure

Towaki Takikawa, CEO and co-founder of Outerport, on the rise of DevOps for LLMs

Interview
The growth in computer processing speed has significantly outpaced memory speed, which creates an inherent imbalance.

This bottleneck makes model loading and swapping a durable infrastructure problem, not a temporary quirk of early LLMs. Outerport sits in the path between storage, CPU RAM, and accelerator memory, where large model files spend real time moving before they can answer a request. In the interview, Takikawa notes that even a 15GB model can take up to a minute to load, and that multi-model workflows compound the delay one model at a time.
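The arithmetic behind a minute-long load is easy to sketch. A minimal estimate, where the 250 MB/s storage bandwidth is an assumed, illustrative figure rather than one from the interview:

```python
def load_seconds(model_gb: float, bandwidth_gb_s: float) -> float:
    """Lower bound on load time: bytes moved divided by sustained bandwidth."""
    return model_gb / bandwidth_gb_s

# A 15 GB model pulled over an assumed ~250 MB/s storage link needs about
# a minute, consistent with the load times described in the interview.
print(load_seconds(15, 0.25))  # -> 60.0
```

Real loads are slower still, since deserialization and allocation add overhead on top of the raw transfer.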

  • The mismatch is concrete. An Nvidia H100 delivers about 3.35 TB/s of GPU memory bandwidth, but the operational problem Outerport describes starts earlier: moving weights from disk or object storage into CPU memory, and then into GPU memory, on cloud machines where storage-to-CPU bandwidth is often the real choke point.
  • This is why LLM deployment looks less like shipping normal code and more like orchestrating giant binary assets. The interview cites 7B models at roughly 17GB versus older ResNet-class models at about 170MB, files large enough that Argo, Flux, and similar software deployment tools were not designed for their startup times and movement costs.
  • New chips can ease the memory wall, but they do not remove the deployment layer. Cerebras built its pitch around extreme on-chip memory bandwidth and single-chip model loading, first for training and later for inference. Yet that still creates a heterogeneous hardware world in which a neutral software layer that can load models across GPUs, ASICs, and edge devices becomes more valuable, not less.
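The per-hop imbalance in the points above can be put in rough numbers. The HBM figure comes from the text; the storage and PCIe bandwidths below are assumed, order-of-magnitude values, so treat the output as a sketch of proportions rather than a benchmark:

```python
# Time to move a 17 GB model (a 7B model, per the interview) through each hop.
MODEL_GB = 17
hops_gb_s = {
    "object storage -> CPU RAM": 0.5,   # assumed cloud storage pull rate
    "CPU RAM -> GPU over PCIe":  25.0,  # assumed effective host-to-device rate
    "within GPU HBM (H100)":     3350.0,  # ~3.35 TB/s, from the text
}
for hop, bw in hops_gb_s.items():
    print(f"{hop}: {MODEL_GB / bw:.3f} s")
```

Under these assumptions the first hop takes tens of seconds while the on-GPU hop takes milliseconds, which is why the choke point sits at the storage end, far from the accelerator.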

As inference spreads from single-model chatbots to multi-step agent and media pipelines, the winning stack will treat model weights like live infrastructure that must be staged, swapped, and updated continuously. That pushes deployment software closer to a universal control plane for heterogeneous AI hardware, whether the accelerator is an Nvidia GPU, a Cerebras wafer, or custom edge silicon.
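A toy sketch of what staging and swapping could look like: hot weights are kept in a fixed CPU RAM budget with LRU eviction, so a model swap becomes a RAM-to-VRAM copy instead of a fresh pull from object storage. The names and API here are illustrative, not Outerport's actual design:

```python
from collections import OrderedDict

class WeightCache:
    """Hypothetical staging tier: models held in CPU RAM, evicted LRU-first
    when the configured capacity would be exceeded."""

    def __init__(self, capacity_gb: float):
        self.capacity_gb = capacity_gb
        self.staged: "OrderedDict[str, float]" = OrderedDict()  # name -> size in GB

    def stage(self, name: str, size_gb: float) -> None:
        if name in self.staged:
            self.staged.move_to_end(name)  # mark as most recently used
            return
        # Evict least-recently-used models until the new one fits.
        while self.staged and sum(self.staged.values()) + size_gb > self.capacity_gb:
            self.staged.popitem(last=False)
        self.staged[name] = size_gb

cache = WeightCache(capacity_gb=32)
cache.stage("llama-7b", 17)
cache.stage("flux-dev", 24)   # does not fit alongside llama-7b, which is evicted
print(list(cache.staged))     # -> ['flux-dev']
```

The real problem adds layers this sketch ignores: pinned memory for fast host-to-device copies, concurrent requests, and multiple accelerator types, which is exactly where a dedicated control plane earns its keep.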