Nyckel's Shared Inference Economics
Oscar Beijbom, co-founder and CTO of Nyckel, on the opportunities in the AI/ML tooling market
This reveals that Nyckel is building its economics around shared inference rather than one model per customer. The expensive part is the large base models, which sit in GPU memory so they can answer instantly instead of loading on each request. By letting every customer use the same always-on backbone and adding tiny customer-specific layers on top, Nyckel keeps latency low without paying the full hosting cost for each account separately.
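A minimal sketch of that split, with made-up names and dimensions (not Nyckel's actual architecture): one frozen backbone is shared by everyone, while each customer owns only a tiny linear head on top of its embeddings.

```python
import numpy as np

EMBED_DIM = 512

def shared_backbone(x: np.ndarray) -> np.ndarray:
    """Stand-in for a large pre-trained model kept warm on a GPU.
    Here: a fixed random projection into an embedding vector."""
    rng = np.random.default_rng(0)  # fixed seed = frozen weights
    W = rng.standard_normal((x.shape[-1], EMBED_DIM))
    return np.tanh(x @ W)

class CustomerHead:
    """Per-customer layer: only EMBED_DIM * n_classes parameters."""
    def __init__(self, n_classes: int, seed: int):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((EMBED_DIM, n_classes)) * 0.01

    def predict(self, embedding: np.ndarray) -> int:
        return int(np.argmax(embedding @ self.W))

# One warm backbone, many cheap heads.
x = np.ones((1, 64))
emb = shared_backbone(x)
heads = {f"customer_{i}": CustomerHead(n_classes=3, seed=i) for i in range(1000)}
print(heads["customer_7"].predict(emb[0]))
```

The backbone forward pass is the expensive step; routing its output through any given customer's head is a single small matrix multiply.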
-
Keeping a model warm means paying to keep compute allocated even when traffic is uneven. AWS describes this directly for serverless inference: cold starts appear when compute spins down, and provisioned concurrency keeps endpoints warm so they can answer in milliseconds. The same logic applies to GPU-hosted model serving.
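A back-of-the-envelope sketch with purely illustrative numbers (hourly rate, customer count, and pool size are assumptions, not Nyckel's or AWS's figures) shows why one shared warm pool beats a warm endpoint per customer:

```python
# Hypothetical rates: compare one always-on GPU per customer
# against a small shared warm pool serving everyone.

GPU_HOURLY_USD = 2.00      # assumed hosted-GPU rate
HOURS_PER_MONTH = 730
CUSTOMERS = 500

# Dedicated: every customer pays for an always-on GPU, even when idle.
dedicated_monthly = CUSTOMERS * GPU_HOURLY_USD * HOURS_PER_MONTH

# Shared: a small warm pool serves all customers; the tiny
# per-customer heads add negligible compute on top.
WARM_POOL_GPUS = 4
shared_monthly = WARM_POOL_GPUS * GPU_HOURLY_USD * HOURS_PER_MONTH

print(f"dedicated: ${dedicated_monthly:,.0f}/mo")  # $730,000/mo
print(f"shared:    ${shared_monthly:,.0f}/mo")     # $5,840/mo
print(f"ratio:     {dedicated_monthly / shared_monthly:.0f}x")  # 125x
```

The exact figures are invented, but the structure of the saving is not: idle warm capacity multiplies with customer count in the dedicated model and does not in the shared one.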
-
Nyckel says customers upload a small labeled dataset, often around 100 examples, and the system trains and deploys in seconds. That works because the heavy lifting is done by shared pre-trained nets, while the customer-specific part is a much smaller model that is cheap to create and run.
-
The contrast with full fine-tuning is economic. OpenAI prices fine-tuned models separately for training and inference, and its docs show that tuning produces a distinct output model ID. Each tuned model is a more dedicated artifact per use case, which is why per-customer customization can push costs up fast if the whole large model must be specialized.
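A parameter-count sketch with illustrative numbers (the 7B backbone, class count, and customer count are assumptions) makes the difference concrete: full fine-tuning multiplies the backbone's size by the number of customers, while shared inference pays for the backbone once.

```python
# Illustrative sizes: specializing the whole model per customer
# vs. hosting one backbone plus a small head per customer.

BACKBONE_PARAMS = 7_000_000_000   # assumed 7B-parameter base model
EMBED_DIM, N_CLASSES = 512, 10
head_params = EMBED_DIM * N_CLASSES + N_CLASSES  # weights + biases = 5,130

CUSTOMERS = 1_000
full_finetune_total = CUSTOMERS * BACKBONE_PARAMS         # dedicated artifacts
shared_total = BACKBONE_PARAMS + CUSTOMERS * head_params  # one backbone + heads

print(f"full fine-tune: {full_finetune_total:,} params to host")
print(f"shared + heads: {shared_total:,} params to host")
```

Under these assumptions, a thousand customers cost a thousand backbones in the dedicated model but barely more than one backbone in the shared model.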
This architecture points toward AI application companies separating the stack into one shared foundation-model layer and one very cheap personalization layer. The winners in AI tooling are likely to be the companies that hide that split from users while turning shared model utilization into better margins, faster response times, and simpler self-serve deployment.