Gimlet Labs
Applied research lab that builds serverless inference for AI agents and develops autonomous kernel-generation, compiler, and scheduling technologies for heterogeneous AI hardware

Details

Funding: $92M (2026)
Headquarters: San Francisco, CA
CEO: Zain Asgar
Founded: 2023

Valuation & Funding

Gimlet Labs' most recent funding round was an $80 million Series A announced on March 23, 2026, led by Menlo Ventures, with participation from Eclipse Ventures, Factory, Prosperity7, and Triatomic.

Before the Series A, Gimlet Labs raised a $12 million seed round led by Factory, announced on October 22, 2025 alongside the company's public launch.

Notable angels and individual backers include Bill Coughran, Nick McKeown, Raghu Raghuram, Lip-Bu Tan, and Dylan Field.

Total funding raised stands at $92 million across both rounds.

Product

Gimlet Labs is an AI inference platform built on the view that modern AI agents are not a single monolithic workload but a chain of distinct computational jobs, each of which should run on the hardware best suited for it.

A developer can bring in an existing agentic workflow, for example a coding agent that ingests a user prompt, retrieves relevant code, runs a large-context prefill, iterates through decode steps, executes tool calls in a sandbox, and assembles a final response. Instead of forcing all of that onto one type of chip, Gimlet models the workflow as a multi-stage compute graph and routes each fragment to the hardware where it performs best. Developers can point existing PyTorch, LangChain, or LangGraph workflows at Gimlet's managed API, or import models via Hugging Face Transformers, while Gimlet handles compilation, scheduling, and production serving.
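As a rough sketch of what "pointing an existing workflow at a managed API" typically looks like in practice: many inference clouds expose an OpenAI-compatible chat endpoint, so existing client code only changes its base URL. The endpoint URL, model name, and environment variable below are placeholders, not Gimlet's documented interface.

```python
# Hypothetical sketch: redirecting an existing agent step to a managed
# inference endpoint. Assumes an OpenAI-compatible chat API, a common
# convention among inference clouds; the base URL and model id are
# placeholders, not Gimlet's actual API surface.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-inference.dev/v1",  # placeholder endpoint
    api_key=os.environ["INFERENCE_API_KEY"],
)

# The provider decides where each phase (prefill, decode, tool calls) runs;
# the calling code is unchanged from any other chat-completions client.
response = client.chat.completions.create(
    model="example-coding-model",  # placeholder model id
    messages=[
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Refactor this function to be iterative."},
    ],
)
print(response.choices[0].message.content)
```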

The stack has three layers. The workload orchestrator translates an agent into a compute graph and dynamically distributes fragments across available hardware under latency and throughput constraints. An MLIR-based compiler applies general and device-aware optimizations and lowers fragments to implementations tailored to specific accelerators. kforge, Gimlet's autonomous kernel generation toolkit, generates optimized low-level kernels directly from PyTorch across CUDA, ROCm, and Metal using a multi-agent system with correctness checks and search over candidate implementations.
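To make the orchestrator layer concrete, here is a toy version of "translate an agent into a compute graph and place fragments under latency constraints." This is not Gimlet's scheduler; the stage names, hardware classes, and latency estimates are invented for illustration, and a real system would also model transfer costs, throughput targets, and contention.

```python
# Toy illustration of routing a multi-stage agent workflow across
# heterogeneous hardware. Not Gimlet's implementation: stage names,
# hardware classes, and latency numbers are invented.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    # Estimated latency (ms) per hardware class for this stage.
    latency_ms: dict[str, float]

# A coding-agent workflow as an ordered chain of stages.
WORKFLOW = [
    Stage("retrieval",      {"cpu": 40.0,  "gpu": 35.0,  "sram_asic": 50.0}),
    Stage("prefill",        {"cpu": 900.0, "gpu": 120.0, "sram_asic": 200.0}),
    Stage("decode",         {"cpu": 800.0, "gpu": 90.0,  "sram_asic": 45.0}),
    Stage("tool_execution", {"cpu": 60.0,  "gpu": 70.0,  "sram_asic": 95.0}),
]

def route(workflow: list[Stage], budget_ms: float) -> dict[str, str]:
    """Greedily pick the fastest hardware per stage, then check the
    end-to-end plan against a latency budget."""
    placement, total = {}, 0.0
    for stage in workflow:
        device = min(stage.latency_ms, key=stage.latency_ms.get)
        placement[stage.name] = device
        total += stage.latency_ms[device]
    if total > budget_ms:
        raise RuntimeError(f"plan misses budget: {total:.0f}ms > {budget_ms:.0f}ms")
    return placement

print(route(WORKFLOW, budget_ms=400.0))
# {'retrieval': 'gpu', 'prefill': 'gpu', 'decode': 'sram_asic', 'tool_execution': 'cpu'}
```

Even this toy version shows the core claim: once the workflow is a graph of stages with per-device cost estimates, no single chip wins every stage.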

Gimlet runs across NVIDIA, AMD, Intel, ARM, Cerebras, and d-Matrix accelerators, treating them as one inference system inside multi-silicon datacenters that Gimlet manages itself. For customers with their own infrastructure, the same software stack can be deployed into private datacenters, making Gimlet available as either a managed API or an infrastructure software layer installed in-house.

The company says the platform can speed up inference by three to ten times for the same cost and power. It attributes those gains to fine-grained decomposition that can split work at the stage, layer, or operator level, including techniques like running speculative decoding on SRAM-centric hardware while keeping other phases on GPU.
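The speculative-decoding split mentioned above is worth unpacking, since it is the clearest example of stage-level decomposition. In speculative decoding, a cheap draft model proposes several tokens and a larger target model verifies them in one batched pass; the draft phase rewards low-latency SRAM-centric hardware while verification stays on GPU. The following is a minimal token-level sketch of the algorithm, not Gimlet's implementation; the "models" are deterministic stand-ins over integer token ids.

```python
# Toy speculative decoding loop illustrating the hardware split described
# in the text: a cheap draft model proposes k tokens (the phase suited to
# low-latency SRAM-centric hardware), and a target model verifies them
# (the phase kept on GPU). Both "models" are deterministic stand-ins.

def draft_next(token: int) -> int:
    """Cheap draft model: would run on the low-latency accelerator."""
    return (token * 31 + 7) % 1000

def target_next(token: int) -> int:
    """Target model: would run on GPU. Agrees with the draft except when
    the context token is divisible by 5, forcing occasional rejections."""
    nxt = draft_next(token)
    return nxt if token % 5 else (nxt + 1) % 1000

def speculative_decode(prompt_token: int, n_tokens: int, k: int = 4) -> list[int]:
    out, cur = [], prompt_token
    while len(out) < n_tokens:
        # Draft phase: propose k tokens autoregressively (cheap device).
        proposals, t = [], cur
        for _ in range(k):
            t = draft_next(t)
            proposals.append(t)
        # Verify phase: check each proposal against what the target
        # would have produced, accepting the matching prefix (GPU).
        t = cur
        for p in proposals:
            expected = target_next(t)
            if p == expected:
                out.append(p); t = p            # accept the draft token
            else:
                out.append(expected); t = expected
                break                           # reject: resume drafting here
        cur = t
    return out[:n_tokens]

print(speculative_decode(prompt_token=42, n_tokens=8))
```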

Business Model

Gimlet Labs sells to frontier labs, hyperscalers, AI-native companies, and large enterprises running latency-sensitive agentic workloads. Its go-to-market is high-touch and technically consultative, with founders and engineering staff directly involved in onboarding sophisticated buyers focused on infrastructure performance and datacenter design.

Monetization comes through two channels: consumption of Gimlet Cloud, the managed inference API, and enterprise deployments of the same software stack into customer-owned datacenters. The managed cloud is usage-linked, metered against inference workloads running on Gimlet's infrastructure. On-prem and private deployments align more closely with enterprise licensing and strategic capacity agreements.

The economic pitch is outcome-based, even if billing is not explicitly structured that way. Gimlet's value to customers is lower latency, higher throughput, better performance per watt, and better utilization of owned or available hardware. That gives the company pricing power closer to performance infrastructure than commodity compute. If Gimlet materially improves a customer's latency and throughput, it can capture part of that value even when raw hardware costs are similar.

The cost structure is heavier than classic SaaS. Gimlet operates specialized multi-silicon datacenters, employs hard-to-hire talent across compilers, kernels, and distributed systems, and carries meaningful infrastructure costs. Gross margins are likely lower than pure software peers, but vertical control across orchestration, compiler, and kernel generation means the company does not depend on external serving stacks for the hardest optimization work.

The key self-reinforcing dynamic is between hardware breadth and workload quality. More hardware integrations give Gimlet more ways to route work efficiently, improving customer economics and attracting larger, more demanding workloads. Those workloads, in turn, improve the scheduling and compiler heuristics that make Gimlet more valuable to the next hardware vendor trying to win inference share.

Competition

The inference market is shifting from single-model API serving to compound agentic workflows, and competition spans inference clouds, compiler tooling, and vertically integrated silicon stacks.

Inference clouds

The closest commercial rivals are inference platforms like Fireworks AI, Baseten, Together AI, Modal, and RunPod, which are extending from model serving into compound and agentic workloads.

Fireworks AI competes in disaggregated serving, prompt-aware routing, and long-session optimization, the same areas where Gimlet Labs is building its product, while also offering broader model access and a clearer path from experimentation to production. Baseten's Chains framework lets each step in a compound AI system use its own hardware and autoscaling policy, which overlaps with Gimlet's multi-stage graph approach, though Baseten is optimized primarily for GPU-centric cloud capacity rather than heterogeneous accelerators. Together AI and RunPod compete more on distribution and price than on infrastructure design, and their lower adoption friction can appeal to budget-sensitive teams that do not need Gimlet's heterogeneity.

Modal is a meaningful rival for the long tail of agent developers. Its code-first serverless model, sub-second GPU cold starts, and sandboxed code execution can be sufficient for startups building custom agent backends, where developer experience matters more than infrastructure specialization.

Compiler and heterogeneity-first tooling

Modular, with its Mojo language and hardware-retargetable compilation stack, is the most prominent independent compiler-layer rival, and its positioning around hardware portability overlaps directly with Gimlet Labs' MLIR-based compiler thesis.

Luminal is a closer startup analog. It markets compiled inference, hardware-aware optimization for GPUs and ASICs, dynamic scheduling across heterogeneous compute nodes, and both managed serverless and on-prem deployment. The overlap with Gimlet Labs is high, and if the category matures, Luminal could reduce Gimlet's differentiation for sophisticated buyers. Kernelize approaches portability from the kernel and toolchain layer, offering a Triton-based platform for portable AI inference with chip-specific extensions, which could weaken Gimlet's kernel-generation differentiation by making heterogeneous bring-up easier for competing inference stacks.

Tiny Corp, the startup behind the tinygrad framework, works in the broader area of running AI across different chip types and competes for some of the same developer mindshare in multi-hardware portability.

Vertical integration from silicon

NVIDIA is the main structural risk. Its Dynamo framework offers disaggregated prefill and decode, dynamic GPU scheduling, and LLM-aware request routing across major inference backends, productizing the orchestration and scheduling layers that Gimlet Labs is building and bundling them into the dominant accelerator ecosystem. TensorRT-LLM adds aggressive inference optimization. If customers can get enough of the benefit inside an NVIDIA-first stack, Gimlet's addressable wedge narrows to workloads that require non-NVIDIA heterogeneity.

Groq offers the opposite architecture, purpose-built LPU-based inference with predictable latency and, increasingly, agentic orchestration through its Compound product. Where Gimlet Labs argues no single chip is universally best, Groq argues a tightly integrated chip-plus-cloud stack can outperform general-purpose alternatives for many production use cases. AWS, Google Cloud, and Azure add procurement-led substitution risk by bundling model serving with custom silicon like Trainium2 and TPU v6e inside existing cloud relationships that enterprises already use.

TAM Expansion

Gimlet Labs sits at the intersection of two expanding markets, agentic AI infrastructure and heterogeneous compute orchestration, with multiple paths to expand beyond its current frontier-lab and hyperscaler base.

New products

kforge, Gimlet's autonomous kernel generation toolkit, is already a distinct product surface that can extend beyond the inference cloud into a standalone hardware-enablement layer. AI chip startups, enterprise infrastructure teams, and model developers all need portable optimization as hardware fragments across NVIDIA, AMD, Intel, Apple, and newer accelerator vendors. kforge's ability to autoport workloads to new devices without code changes addresses that need independently of whether a customer uses Gimlet Cloud.
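The core loop behind this kind of autonomous kernel work is generate-and-verify: produce candidate implementations of an op, reject any that fail a numerical check against a reference, and keep the fastest survivor. The sketch below illustrates that loop generically with NumPy stand-ins; kforge's actual pipeline (multi-agent generation targeting CUDA, ROCm, and Metal) is not shown.

```python
# Generic illustration of the generate-and-verify loop behind autonomous
# kernel search. Candidates here are plain NumPy softmax variants, not
# real device kernels, and this is not kforge's actual mechanism.
import time
import numpy as np

def softmax_reference(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def softmax_naive(x):
    # Row-at-a-time loop: correct but slow.
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        e = np.exp(x[i] - x[i].max())
        out[i] = e / e.sum()
    return out

def softmax_unstable(x):
    # No max subtraction: fast but overflows on large inputs.
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def search(candidates, x):
    best_name, best_time = None, float("inf")
    expected = softmax_reference(x)
    for name, fn in candidates:
        if not np.allclose(fn(x), expected, atol=1e-6):  # correctness gate
            print(f"{name}: rejected (incorrect)")
            continue
        t0 = time.perf_counter(); fn(x); dt = time.perf_counter() - t0
        print(f"{name}: {dt * 1e3:.2f} ms")
        if dt < best_time:
            best_name, best_time = name, dt
    return best_name

x = np.random.default_rng(0).normal(scale=200.0, size=(256, 1024))
print("winner:", search([("naive", softmax_naive),
                         ("unstable", softmax_unstable),
                         ("vectorized", softmax_reference)], x))
```

The large input scale makes the unstable candidate overflow and fail the correctness gate, which is the point: search without verification is useless for kernels.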

The company's MLIR-based compiler and SLA-aware scheduling research also point to a licensable infrastructure software path, selling the compiler and scheduler layer to enterprises, model labs, hardware vendors, and OEMs that want the optimization stack without fully adopting the managed cloud. That would add a software revenue stream alongside the consumption-based cloud business.

Customer base expansion

The current customer base skews toward frontier labs and hyperscalers, but the larger long-term market is any enterprise running latency-sensitive agentic workloads in production, including coding agents, customer support systems, internal copilots, multimodal search, and retrieval-augmented workflows. As agentic AI adoption rises across industries, the infrastructure constraints Gimlet Labs targets become relevant to a broader set of buyers than the frontier-scale operators it serves today.

Hardware vendors are a second customer class, not just partners. Chip companies like Cerebras and d-Matrix need software that makes their accelerators usable in production workloads, and Gimlet's compiler and kernel generation stack can serve as the enablement layer that helps emerging hardware compete for inference share against NVIDIA-heavy incumbency. That creates a revenue channel through enablement software, reference deployments, and joint benchmarking outside the traditional inference cloud model.

Geographic and vertical expansion

Gimlet's product architecture is inherently global. The buyers that most need heterogeneous inference operate across North America, Europe, the Middle East, and Asia. Sovereign and regional AI infrastructure efforts are a natural adjacency because regions that cannot access the newest homogeneous GPU fleets at scale are often the ones most likely to assemble inference capacity from mixed silicon, older GPU generations, and non-NVIDIA accelerators.

Deeper vertical integration into physical infrastructure is also an expansion path. Gimlet Labs is researching headless hardware architectures using DPUs paired to accelerators and building a new type of datacenter fabric to connect mixed accelerators over high-speed networks. If that continues, the company can capture more of the value chain, from software scheduling and compilation to reference rack design, integrated inference appliances, and datacenter architecture services, moving from an infrastructure software vendor toward a full-stack AI systems company.

Risks

NVIDIA absorption: NVIDIA is productizing disaggregated prefill and decode, dynamic GPU scheduling, and LLM-aware routing through Dynamo and TensorRT-LLM, and has acquired companies like OctoAI and Run:ai in adjacent orchestration layers. Gimlet Labs' compiler, scheduler, and kernel generation stack could be subsumed into the accelerator ecosystem faster than the company can differentiate on cross-vendor heterogeneity.

Capital intensity: Gimlet Labs operates specialized multi-silicon datacenters, employs some of the hardest-to-hire profiles in tech, and is researching new datacenter fabric architectures that connect hardware not designed to interoperate. Scaling the business therefore likely requires sustained capital deployment at a pace and cost structure that diverges from the software-like margins investors typically associate with AI infrastructure companies.

Customer concentration: Gimlet Labs' early revenue base appears concentrated among a small number of frontier labs, hyperscalers, and large-scale inference operators, all sophisticated buyers with the engineering resources to build orchestration and compiler capabilities internally. If even one or two anchor accounts internalize the heterogeneous scheduling layer or shift to a vertically integrated silicon stack, the company's revenue base could be disproportionately affected.

DISCLAIMERS

This report is for information purposes only and is not to be used or considered as an offer or the solicitation of an offer to sell or to buy or subscribe for securities or other financial instruments. Nothing in this report constitutes investment, legal, accounting or tax advice or a representation that any investment or strategy is suitable or appropriate to your individual circumstances or otherwise constitutes a personal trade recommendation to you.

This research report has been prepared solely by Sacra and should not be considered a product of any person or entity that makes such report available, if any.

Information and opinions presented in the sections of the report were obtained or derived from sources Sacra believes are reliable, but Sacra makes no representation as to their accuracy or completeness. Past performance should not be taken as an indication or guarantee of future performance, and no representation or warranty, express or implied, is made regarding future performance. Information, opinions and estimates contained in this report reflect a determination at its original date of publication by Sacra and are subject to change without notice.

Sacra accepts no liability for loss arising from the use of the material presented in this report, except that this exclusion of liability does not apply to the extent that liability arises under specific statutes or regulations applicable to Sacra. Sacra may have issued, and may in the future issue, other reports that are inconsistent with, and reach different conclusions from, the information presented in this report. Those reports reflect different assumptions, views and analytical methods of the analysts who prepared them and Sacra is under no obligation to ensure that such other reports are brought to the attention of any recipient of this report.

All rights reserved. All material presented in this report, unless specifically indicated otherwise is under copyright to Sacra. Sacra reserves any and all intellectual property rights in the report. All trademarks, service marks and logos used in this report are trademarks or service marks or registered trademarks or service marks of Sacra. Any modification, copying, displaying, distributing, transmitting, publishing, licensing, creating derivative works from, or selling any report is strictly prohibited. None of the material, nor its content, nor any copy of it, may be altered in any way, transmitted to, copied or distributed to any other party, without the prior express written permission of Sacra. Any unauthorized duplication, redistribution or disclosure of this report will result in prosecution.