What are the key similarities and differences between using generative AI to create videos versus text and images?

Working with language models is slightly different from working with generative models for images and videos. Video is notably harder because of temporal consistency: you have to maintain the relationship between objects as they move from frame to frame.

The human eye has been trained to detect the slightest imperfection in a video frame. If you're generating a video from scratch or editing a video with the help of an automated system, the final result needs to work really well to retain the magical illusion of movement across frames. Those subtleties are one of the biggest challenges when working with video models.

There is also the question of how quickly video models can be turned into products. Video adds complexity around decoding, encoding, and streaming, plus a multitude of small optimizations that have to happen. In addition, the unit economics need to make sense, since video is traditionally a more expensive medium to work with than text tokens.

Natural language has seen more rapid improvements, but images and video are now catching up. I expect video to be pretty much the center of generative model research over the next couple of years.