Hot-Swapping Giant LLM Weights
Towaki Takikawa, CEO and co-founder of Outerport, on the rise of DevOps for LLMs
This gap is why model deployment is becoming its own infrastructure category, not just another feature inside existing DevOps tools. Web software rollout tools assume code and containers are small enough to swap quickly. AI deployment has to move 10–20 GB model files from storage into CPU memory and then GPU memory, where even a small model can take up to a minute to load, so live updates need different machinery.
The missing piece is not versioning in the abstract; it is hot-swapping giant model weights on running hardware. Outerport describes a daemon that keeps multiple models' weights resident in CPU memory, then moves them into GPU memory when needed, so teams can avoid full container replacement and long cold starts.
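The idea can be sketched as a small cache manager: every registered model stays resident in host (CPU) memory, and a limited number of "slots" represent what currently fits on the GPU. This is a minimal illustration, not Outerport's actual API; the class and method names (`WeightDaemon`, `register`, `acquire`) and the LRU policy are assumptions for the sake of the sketch.

```python
from collections import OrderedDict

class WeightDaemon:
    """Hypothetical hot-swap daemon: weights stay resident in CPU memory;
    a small LRU set is promoted into GPU memory on demand."""

    def __init__(self, gpu_slots=1):
        self.cpu_pool = {}               # model name -> weights, always resident in host RAM
        self.gpu_cache = OrderedDict()   # LRU subset currently promoted to the GPU
        self.gpu_slots = gpu_slots

    def register(self, name, weights):
        # The slow read from storage happens once, here, off the serving path.
        self.cpu_pool[name] = weights

    def acquire(self, name):
        # Fast path: the model is already on the GPU.
        if name in self.gpu_cache:
            self.gpu_cache.move_to_end(name)
            return self.gpu_cache[name]
        # GPU full: evict the least-recently-used model (its weights
        # remain in cpu_pool, so re-promoting it later is cheap).
        if len(self.gpu_cache) >= self.gpu_slots:
            self.gpu_cache.popitem(last=False)
        # Promote: a CPU-to-GPU copy (seconds) instead of a cold start
        # from storage (tens of seconds).
        promoted = ("on_gpu", self.cpu_pool[name])
        self.gpu_cache[name] = promoted
        return promoted
```

The key design choice is the two-tier residency: eviction only drops a model from GPU memory, never from CPU memory, so switching back is a memory copy rather than a container restart.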
The closest existing products solve adjacent problems. MLflow and Databricks help track models, artifacts, training runs, and deployment workflows. Replicate gives developers an API to run and fine-tune open-source models. But those platforms are primarily about packaging, hosting, and access, not zero-disruption model handoffs on already-running GPU infrastructure.
This matters most in compound AI systems, where several models run in sequence. In a workflow with multiple LLMs, diffusion models, databases, and APIs, each extra model load adds delay and GPU cost. That is why image generation workflows like ComfyUI adopted model chaining early, and why deployment tooling becomes strategic as enterprise stacks get more multi-model.
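The cost difference is easy to see with rough arithmetic. The numbers below are illustrative assumptions (a ~45 s cold load from storage versus a ~2 s CPU-to-GPU promotion), not measured figures, and the stage names are made up; the point is that per-request swap overhead scales with pipeline length.

```python
# Assumed, illustrative timings for one model switch on a single-GPU host.
COLD_LOAD_S = 45.0  # read weights from storage, then copy to GPU
PROMOTE_S = 2.0     # weights already resident in CPU RAM; copy to GPU only

def pipeline_swap_overhead(stages, resident_in_cpu):
    """Seconds spent just moving weights for one pass through a pipeline,
    assuming the GPU holds one model at a time so every stage is a swap."""
    per_swap = PROMOTE_S if resident_in_cpu else COLD_LOAD_S
    return per_swap * len(stages)

stages = ["llm-planner", "diffusion-renderer", "llm-captioner"]
cold = pipeline_swap_overhead(stages, resident_in_cpu=False)  # 135.0 s
warm = pipeline_swap_overhead(stages, resident_in_cpu=True)   # 6.0 s
```

Under these assumptions, a three-stage pipeline pays over two minutes of pure load time per request with cold starts, versus a few seconds when every model is already resident in CPU memory.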
The market is heading toward a split. Simple AI apps will keep using hosted APIs and higher-level platforms, while enterprises running custom or open models will need infrastructure that treats model weights like first-class production artifacts. As multi-model workflows spread, the winning deployment layer will be the one that makes swapping models feel as routine as rolling out code.