Full Stack Approach to AI Video
Cristóbal Valenzuela, CEO of Runway, on the state of generative AI in video
Video complexity is what lets Runway defend itself as more than a wrapper on a single foundation model. Text models mostly turn one stream of tokens into another, but video products must keep objects stable across frames, handle decoding and streaming, and let users change only one region, motion path, or camera move without breaking the rest of the clip. That is why Runway built its own rendering and deployment stack, not just the model layer.
- A video request is rarely a single generation call. Users may want background removal, inpainting, camera control, subtitle timing, or character consistency across scenes. Each job touches different parts of the clip and has different compute and product requirements, unlike text, where many tasks can ride on one general model path.
- The hard part is not just making frames; it is making editable footage. Runway's earlier tools automated rotoscoping, inpainting, transcription, and noise removal, a focus that came from watching editors spend most of their time on repetitive frame-by-frame work. That pushed the company toward a full-stack system that understands video at the pixel level and can ship fast enough for creative feedback loops.
- That complexity also shapes the market. Horizontal AI products can bundle text-to-video as one feature, while Runway competes by owning the production workflow, from model to browser editor to collaboration. The payoff is visible in products like Gen-3, which added camera direction and scene-consistent characters and helped scale Runway from $25M ARR in 2023 to $70M in 2024 and $90M by mid-2025.
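To put the ARR figures above in proportion, here is a minimal sketch of the growth arithmetic. It uses only the three numbers quoted in the text; the variable and function names are illustrative. The second interval covers roughly half a year, so it is not directly comparable to the full-year change before it.

```python
# ARR in millions of USD, as quoted in the text above.
arr = {"2023": 25.0, "2024": 70.0, "mid-2025": 90.0}


def growth_pct(start: float, end: float) -> float:
    """Percentage change from start to end."""
    return (end - start) / start * 100.0


# Full-year change from 2023 to 2024.
growth_2023_2024 = growth_pct(arr["2023"], arr["2024"])

# Change from 2024 to mid-2025 (a partial-year interval).
growth_2024_mid2025 = growth_pct(arr["2024"], arr["mid-2025"])

print(f"2023 -> 2024: {growth_2023_2024:.0f}%")
print(f"2024 -> mid-2025: {growth_2024_mid2025:.0f}%")
```

On these figures, ARR grew 180% from 2023 to 2024 and roughly 29% more from 2024 to mid-2025.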
The next phase of AI video will be won by whoever turns complex model output into reliable creative control. As costs fall, the bottleneck shifts from raw generation to workflow quality: how quickly a team can iterate, keep scenes coherent, and turn rough generations into usable shots. That favors companies with their own video stack, product surface, and distribution into real editing work.