Hassaan Raza, CEO of Tavus, on building the AI avatar developer platform

Jan-Erik Asplund

Background

Hassaan Raza is the CEO and co-founder of Tavus. We talked with Hassaan about the major use cases driving adoption of AI avatar videos, using custom models rather than foundation models, and the research challenges in generating highly realistic digital humans.

Questions

  1. What’s the big use case that you see driving adoption of AI replicas or talking head videos?
  2. Being able to have your videos instantly translated into dozens of languages has become a key selling point for AI avatar platforms. Can you talk about how big of an impact translation will have as a feature? What use case do you see it being most useful for—marketing, sales, learning & development, etc?
  3. How do you think about the gross margins for a company like Tavus? We've already seen GPT-4 fall 85% in price after roughly a year. Do you see compute costs coming down for you in a similar way?
  4. What do you make of the comparison that folks have made between where AI infrastructure companies are today and where CDNs were during the early web? Is there a parallel there in terms of how the price of all models is going down?
  5. With AI avatars, the once-human actor has become a configurable part of the video player. How is Tavus approaching the redesign of the traditional video player and creation interface to support granular control over avatar appearance, branding, and settings?
  6. How do you expect AI talking heads, text-to-video and video editing via editing text to impact the overall volume and frequency of video content creation by businesses?
  7. Today, people might use Synthesia to generate a talking head, generate B-roll via Stable Diffusion, edit it in Descript, and turn it into clips for social media with Opus Clip. What do these platforms have in common and how are they different? Do you anticipate AI talking head platforms expanding into some of these use cases that are adjacent to video creation?
  8. Can you talk about the “face” vs. the “body” in AI talking heads and text-to-video? How much more difficult is it to generate a believable face that avoids the uncanny valley, and what’s the current state of the art here?
  9. What role do you see text-to-video models playing in Tavus' product roadmap, and how might they intersect with your AI avatar technology to enable end-to-end video generation?
  10. Tavus uses its own models for talking head generation and for lip-syncing and dubbing. Can you talk about the advantages there over using a foundational model? In what way does using your own model allow you to improve the results over time and make Tavus models the best for avatar generation?
  11. Say everything goes right for Tavus over the next 5 years. What does Tavus look like and how is the world different as a result?

Interview

What’s the big use case that you see driving adoption of AI replicas or talking head videos?

Right now it's in industries that are already used to using video and know its value. Salespeople were early adopters, and that market continues to grow. Sales leaders are always trying to figure out what the next thing is that's really going to work. We’ve also seen adoption by influencer marketing and social media advertising platforms. More recently, we're moving into some really interesting new use cases, like doctor-patient communication. We're seeing a lot more use cases where people are trying to connect with one another in a more human or more localized way.

There are a lot of different early adopters, but part of our mission right now is to make this as easily applicable to different use cases as possible, because I think we haven't even figured out what the absolutely transformative use cases are going to be.

There's certainly really high usage in learning and development, sales, marketing, but is that gonna be the transformative, world-bending use case? I think we're still in the early stages of figuring that out.

Our vision for the future is that AI replicas become part of all key workloads. We see the idea of extending your likeness as being something that's ever-present in every interaction you have digitally.

Being able to have your videos instantly translated into dozens of languages has become a key selling point for AI avatar platforms. Can you talk about how big of an impact translation will have as a feature? What use case do you see it being most useful for—marketing, sales, learning & development, etc?

Yes, absolutely. I think translation and dubbing is a really interesting space. It especially helps incumbents expand their market. Anyone who has a video library can instantly expand their market with dubbing. For example, TED can translate all their content into different languages. Customer support, education, entertainment: all of these allow you to reach a much wider audience.

We definitely see this being an absolutely massive area. It all sort of comes down to how you can extend your likeness beyond the skills, the knowledge, and the scale that you might be able to achieve yourself.

How do you think about the gross margins for a company like Tavus? We've already seen GPT-4 fall 85% in price after roughly a year. Do you see compute costs coming down for you in a similar way?

Yeah. Part of our research focus is optimization. Because we're supporting so many different customers, our goal is to offer continued scale pricing discounts. As our customer volume grows, we're able to get them scale pricing. But also on the model side, the research side, we're really focused on always optimizing architecture, reducing complexity in the architectures as new methodologies come out.

For example, Gaussians are a more advanced yet less complex architecture compared to NeRFs. So we're constantly reevaluating how we structure our models. As you make certain components of a pipeline or a machine learning model more robust, other components that you had created to compensate for instability can be removed.

So naturally, what ends up happening is you reduce training time, you reduce inference time, and that directly affects what it actually costs to produce a replica or produce a training video. Things like quantization and architecture complexity reduction have direct impacts on price. It's something we're really focused on because, again, our goal is to get this into the hands of as many developers as possible so that they can power as many unique, really awesome experiences as possible.
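
As a rough illustration of why quantization bears on cost, the sketch below stores a handful of weights as 8-bit integers instead of 32-bit floats and measures the storage saved and the rounding error introduced. It is a generic textbook-style example, not a description of Tavus's models: the weight values and the function names `quantize_int8` and `dequantize_int8` are made up for this illustration.

```python
# Illustrative sketch only: symmetric int8 quantization of a small list of weights.
# The weights and helper names are hypothetical and do not reflect Tavus's pipeline.
import struct


def quantize_int8(weights):
    """Map floats to integers in [-127, 127] plus one shared float scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate floats from the int8 values and the scale."""
    return [q * scale for q in quantized]


weights = [0.42, -1.30, 0.07, 0.95, -0.61, 1.27, -0.02, 0.33]
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)

# Compare storage for 32-bit floats against 8-bit integers.
float32_bytes = len(struct.pack(f"{len(weights)}f", *weights))
int8_bytes = len(struct.pack(f"{len(quantized)}b", *quantized))
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(f"float32 storage: {float32_bytes} bytes")
print(f"int8 storage:    {int8_bytes} bytes (plus one float scale)")
print(f"largest rounding error: {max_error:.4f}")
```

The int8 copy takes a quarter of the space, and each restored weight differs from the original by at most half of `scale`. Production systems typically quantize per layer or per channel and evaluate the effect on output quality, which this sketch does not attempt.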

What do you make of the comparison that folks have made between where AI infrastructure companies are today and where CDNs were during the early web? Is there a parallel there in terms of how the price of all models is going down?

I think there's a little bit of that, but at Tavus, our strategy is to democratize access to these really cutting edge foundational models. It’s an ongoing question of how you can replicate and clone humans most accurately, and how you can make the experiences even more immersive.

It's different in the sense that there's still an absolutely insane amount of research to be done, and the product experience is an ever-evolving one.

Today, you might have a replica or a generated video that is maybe 95% of the way there, but it's still constrained to a certain use case. There's still a ton to be done on full face expressions and on being able to manipulate more parts of the body. All of those things are a long tail of things that need to happen.

And so eventually some of this can be commoditized. You can just look at companies like AWS and GCP, which continue to be really large companies. I think there is an aspect of that that will be true for foundational model companies.

It’s a commodity in some sense, but then there's also a ton of experiences, additional components, features and functionality that'll be essential for being able to leverage AI life at scale.

We actually think it's a completely new computer-human interface, and therefore there's so much to be done where you can have unique advantages in how naturally it converses, how quickly it does so, and so on.

There is commoditization that will happen, and there will be price erosion to some degree, but also there will be clear differentiating advantages that one model will have over another that will continue to remain true and be better suited to certain use cases over others.

With AI avatars, the once-human actor has become a configurable part of the video player. How is Tavus approaching the redesign of the traditional video player and creation interface to support granular control over avatar appearance, branding, and settings?

Our approach generally is that we're experts on the model and our developer customers come to us to get the bleeding edge of video models. We believe our customers are the experts on the front end. But I think what we do see is our technology is changing the way that people interact with video players.

So for example, language switching. We've seen players that now allow you to select not just the closed captioning, but to have the video spoken to you in a different language. We're already starting to see changes where, because you can do this sort of personalized video generation, you can actually select that. We’re also seeing that, because you can now personalize in different ways, certain clips that are more relevant to you can be highlighted.

Certainly, although we're not a front end company (we're more so a model research company, and our job is to be the research house for all the different companies), I think naturally we see an evolution of the experience in video players.

How do you expect AI talking heads, text-to-video and video editing via editing text to impact the overall volume and frequency of video content creation by businesses?

I think there are two parts to it. We know that people have challenges with recording themselves: they're camera shy, they have a fear of public speaking, they're concerned about their appearance. So by getting one good replica, you can essentially lower the barrier to creating video, and more video. We're already seeing this to be true.

Some of the enterprise customers that we're working with that specifically use our replica technology to bring this into their products are seeing increased engagement, more videos produced from people using the replicas because they're like, "Oh, I can just do another take, another take, and it takes me no effort. I can just do another script." And so certainly, I think the barrier to creating video content is lowered. The other piece of it is around the cost.

Naturally, I think our job is to reduce costs, but the instantaneous nature of it will also be improved. You might not need to pre-translate all of your videos. You can maybe just do it live. And so naturally, that will change.

Today, people might use Synthesia to generate a talking head, generate B-roll via Stable Diffusion, edit it in Descript, and turn it into clips for social media with Opus Clip. What do these platforms have in common and how are they different? Do you anticipate AI talking head platforms expanding into some of these use cases that are adjacent to video creation?

We do see a future where there's some consolidation of tooling for sure because I think that point solutions for just doing the one thing will make less sense. I think generally the market is moving towards that. And so that's our bet. It's why we're like, okay, we want replicas to be present in all these suites, and we're gonna be the platform provider to them.

Now, will these other features and functionalities become APIs as well? It depends on the complexity of them.

I think that digital twins are a uniquely complex problem that has a very, very large research effort behind it. It just depends on the complexity of the problem. For less complex problems, companies choose to just build them because it might be easier or more customizable, whereas more complex problems that require setting up research teams, like replica generation and human cloning, are better fits for API-driven workloads.

Can you talk about the “face” vs. the “body” in AI talking heads and text-to-video? How much more difficult is it to generate a believable face that avoids the uncanny valley, and what’s the current state of the art here?

I think that we're at the point where videos that are generated with some constraints are incredibly realistic.

This isn't generally true for all models; there are a few models for which this is true. Tavus is one of them.

What still is an active ongoing research focus is capturing and then generating more of those small nuances that make for a very natural experience.

That's what we're really focused on: what makes a video feel natural? What makes it human-like? There are subtle things about production. If it's overly produced, it has a disconnect. If it feels like you're talking to someone in a studio, you question whether that person actually went to a studio to create that video for you.

The research focus includes capturing small nuances in your face, how your face moves naturally, how you gesture, and bringing that to life with the content you want to generate. I think more facial control, more expressive emotions, and how you move are active research areas that contribute. For certain use cases and content, it's imperceptible if you don't know what to look for.

What role do you see text-to-video models playing in Tavus' product roadmap, and how might they intersect with your AI avatar technology to enable end-to-end video generation?

I think those two worlds do converge, but not in the way that you might think.

Sora is a generalized video model. A lot of the focus is on generalized video models that can do a good job of producing a scene, but being able to accurately replicate you, your emotion, and how you look is a unique problem that isn't necessarily solved with generalized large video models.

Also, the computational intensity is very different. Sora is very computationally intense and might take a lot of power to generate 15 seconds, whereas our models are built to be basically real-time.

Architecturally, they are different things, but where they come together is that you can use a 3D model to create a video of you speaking and then put it into a scene. That's where you start to have this convergence: using a generalized video model to create scenes where reproduction doesn't matter and using replica models for higher reproducibility, then combining them together.

Tavus uses its own models for talking head generation and for lip-syncing and dubbing. Can you talk about the advantages there over using a foundational model? In what way does using your own model allow you to improve the results over time and make Tavus models the best for avatar generation?

The models do get better with more scale. Part of that is the training data itself and driving down costs, but a lot of it is about understanding what it means to create a human-like video and the nuances within that, which is a massive research area. There's so much that goes into it, and it's not a single model.

Even our pipeline today is not just a single model doing everything. It's multiple components, like something looking at eye gaze, another at gestures. The value is in creating something that is truly realistic, and what it takes to get there.

What we're working on right now, which you'll see coming soon, makes it more evident that natural endpoints need to be achieved for a convincing experience. From a research perspective, there are many unique architectures and many unique areas of research required to produce great results.

Say everything goes right for Tavus over the next 5 years. What does Tavus look like and how is the world different as a result?

I think that if we look at five years out, Tavus becomes the backbone for digital twin experiences across the ecosystem. It becomes commonplace to have your digital replica taking meetings for you and acting on your behalf. It becomes standard to have digital twins talking to each other and available in all applications. This will transform how we ultimately communicate with computers and each other. That's the future we see, and Tavus is going to bring that future.

Disclaimers

This transcript is for information purposes only and does not constitute advice of any type or trade recommendation and should not form the basis of any investment decision. Sacra accepts no liability for the transcript or for any errors, omissions or inaccuracies in respect of it. The views of the experts expressed in the transcript are those of the experts and they are not endorsed by, nor do they represent the opinion of Sacra. Sacra reserves all copyright, intellectual property rights in the transcript. Any modification, copying, displaying, distributing, transmitting, publishing, licensing, creating derivative works from, or selling any transcript is strictly prohibited.
