
Background
Coco Mao is co-founder and CEO at OpenArt. We chatted about how the AI image generation market has evolved since Midjourney launched in 2022, OpenArt's growth from $1M to $10M ARR in one year, and how OpenArt positions itself among other generative AI art tools like OpenAI's Sora, Pika, Runway, and Photoroom.
Key points from Sacra AI:
- The explosion of generative AI image generators that followed Midjourney ($200M in ARR in 2023) - tools like Ideogram, Playground AI, and Craiyon that grew fast in a rapidly expanding market - now faces diminishing differentiation and margin pressure as the market saturates with tools of increasingly similar capabilities. “[We grew] from $1 million ARR to over $10 million ARR within 12 months. Last year, I would have described OpenArt as a platform for AI image generation and editing. People and investors often ask how we're different from Midjourney or Ideogram. In terms of core functionality and technology, all the products are quite similar in my opinion.”
- In generative AI, the market is bifurcating between (1) foundation model-based companies like Sora, Pika, and Runway that build for power-users that want deep control over AI generations, (2) product companies like OpenArt and Photoroom that use the best open-source models available for businesses and creators that want a push-button experience for generating social media content and product images. “Tools like Sora, Pika, or Runway function more like manual cars - you need to do substantial work, generating clips individually, adding audio, and managing many elements. You have extensive control, but it requires significant effort. People who can't ‘drive’, meaning they struggle to develop a simple idea into a video script, [need] self-driving cars, helping with script writing, storyboard generation, and video creation.”
- Creating a workflow that brings together many tools—fine-tuning of open-source models (Stability AI, Flux), character persistence (img2img), voice generation (ElevenLabs), image-to-video (Kling, Hailuo), and others—builds value on top of the foundation models, even as they individually improve, by allowing users to orchestrate end-to-end creations without extensive prompt engineering. “The difference between early 2024 and late 2024 [in open source] is substantial… Videos are becoming smoother, with fewer errors like extra limbs… Image models previously struggled with text generation, but newer models like Flux perform much better… Recently, models have emerged that can generate videos complete with audio and music integration… This evolution makes it crucial for platforms like OpenArt to provide value beyond what better models alone can offer. If we simply relied on model capabilities, users might bypass platforms entirely when improved models become available.”
Questions
- Who were your first customers at OpenArt and what do people typically use it for?
- What are the key challenges or problems in AI video generation today that you’re looking to solve with OpenArt? How do you think about OpenArt’s unique angle on the problem?
- You've talked about minimizing prompt engineering requirements when it comes to helping people generate high-quality images. How do you think about the importance of this, both with OpenArt as it is now and as it relates to video?
- What does people’s workflow for creating a video story look like today?
- How do you think about where the AI image generation market is going between products like Midjourney, DALL-E, Grok, Playground, Photoroom and others alongside OpenArt?
- Today, companies like Photoroom and OpenArt are not competitors—Photoroom is B2B and e-commerce-centric. Do you see these categories staying separate, or converging over time as they all add image generation and editing, eventually colliding into some AI-native Photoshop?
- We've seen some common AI-powered features emerge across tools in addition to text to image like inpainting, outpainting, edit & browse by vibe / stylistic similarity and more. What do you see as the core primitives like this for video?
- OpenArt is built on top of many open-source models like Flux, Stable Diffusion etc. Why did you choose to build on top of an open source model? How does that impact your degrees of freedom in terms of what you can build going forward with OpenArt, the images you can generate, the features you can offer and your cost structure & pricing?
- Instagram, Snapchat and TikTok have filters and lenses and increasingly powerful creation and editing workflows. What is the opportunity for social platforms and the risk to a company like OpenArt in those platforms increasingly building AI tooling into their workflows?
- Are image models improving at the same rate as text-based models and how should we reason about their trajectory? What will be possible with image models that's not possible today?
- Can you talk a bit about pricing and how you monetize? How is the OpenArt subscription different from that of a tool like Midjourney?
- What is ComfyUI, why did OpenArt integrate with it, and how do you think about its complementarity or competitiveness going forward?
- Assuming everything you're talking about goes well, how do you see the world changing in the next five years, and what role do you see OpenArt playing in that future?
Interview
Who were your first customers at OpenArt and what do people typically use it for?
OpenArt as a platform has gone through many iterations, and my description of it differs from last year to this year. As an early-stage startup just over two years old, we're still working to find solid product-market fit. Last year, we experienced significant growth, expanding from under $1 million ARR to over $10 million ARR within 12 months. Despite this growth, several issues still concern me, which is why we're currently undergoing a micro-pivot.
Last year, I would have described OpenArt as a platform for AI image generation and editing. People and investors often ask how we're different from Midjourney or Ideogram. In terms of core functionality and technology, all the products are quite similar in my opinion. Our growth came from an expanding market, smart go-to-market strategies, and our focus on making AI image generation easier to use. Those factors drove our success — even though, truthfully, we aren’t as differentiated as we might like to be. Many people were searching for these types of image solutions, and generative AI has really opened up possibilities by offering more unique, personalized, and customized options. We managed to create a good product, but due to the lack of differentiation and a clear long-term plan, our team engaged in some deep reflection last year. What emerged from this process was a new vision: to transform OpenArt into a platform for visual storytelling.
Storytelling is deeply ingrained in human nature. From our earliest days, humans created folktales, drew murals on walls, and found various ways to tell stories. This fundamental human drive to share stories has evolved over time. Text-based storytelling has been democratized - most people can read and write, allowing anyone with a story to share it. However, visual storytelling hasn't experienced the same democratization until now.
Previously, creating visual stories, whether in book or video format, required high production values and professional knowledge or resources. But AI is making visual storytelling accessible for the first time. Last week, we sponsored the MIT AI Film Hackathon, where students used OpenArt along with other tools to create amazing videos that traditionally would have taken an entire studio months to produce. One project, almost a Pixar-level short film, was produced by five NYU film school students in just two days.
Our vision is to transform simple ideas directly into full visual stories, whether in video format or potentially other formats in the future. We want to enable the seamless conversion from text to visual storytelling.
What are the key challenges or problems in AI video generation today that you’re looking to solve with OpenArt? How do you think about OpenArt’s unique angle on the problem?
Let me use an analogy: In real life, when you want to get from point A to point B, you have multiple options. If you can't drive, the easiest solution is calling Uber or Lyft - you simply request a ride and reach your destination. If you want more control, you can drive a car yourself. You could use a self-driving car like Tesla for a middle-ground approach, or you could drive a manual car for maximum control.
This analogy is relevant to our situation. Consider point A as your simple idea and point B as a full-fledged video story. Currently, tools like Sora, Pika, or Runway function more like manual cars - you need to do substantial work, generating clips individually, adding audio, and managing many elements. You have extensive control, but it requires significant effort.
Many people who can't "drive" - meaning they struggle to develop a simple idea into a visual story - need assistance. More products are emerging with AI assistance, similar to self-driving cars, helping with script writing, storyboard generation, and video creation. However, most people trying to get from A to B need the equivalent of Uber or Lyft - they need a simple solution.
That's the missing piece we're trying to figure out: how to become the Uber and Lyft for people transforming their simple ideas into full-fledged videos. We'll still provide control for those who want to edit, but to truly democratize this process, we need to create the simplest possible solution for the majority of users.
You've talked about minimizing prompt engineering requirements when it comes to helping people generate high-quality images. How do you think about the importance of this, both with OpenArt as it is now and as it relates to video?
I can give both a short and long answer. The short answer is that most platforms today offer prompting assistance - you enter something simple, and the platform enhances your prompt for you.
The models are getting better and better, so even short prompts can produce good results.
For the long answer: since we focus on visual storytelling, images are just one intermediate step. We don't want users to diligently write prompts for each frame and then convert them into videos. Instead, we want to automate the whole process.
Let's say you have a simple story idea - you can describe this idea, and we'll automate all the scripting, image generation, and video generation work to create a complete video. By focusing on the storytelling aspect, all these intermediate steps can be automated away.
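To make that concrete, here is a minimal sketch of what such an idea-to-video pipeline could look like. Every function below is a hypothetical placeholder standing in for a model call (an LLM for scripting, text-to-image for storyboarding, image-to-video for animation, audio for post-production) - it illustrates the orchestration, not OpenArt's actual API:

```python
# Hypothetical idea-to-video pipeline. Each function is a placeholder
# for a model call; none of this is OpenArt's real API.

def write_script(idea: str) -> list[str]:
    # Placeholder: an LLM would expand the idea into storyboard beats.
    return [f"Scene {i + 1} of: {idea}" for i in range(3)]

def generate_keyframe(beat: str) -> str:
    # Placeholder: a text-to-image model would render this beat.
    return f"image({beat})"

def animate(keyframe: str) -> str:
    # Placeholder: an image-to-video model would animate the keyframe.
    return f"clip({keyframe})"

def add_audio_and_stitch(clips: list[str]) -> str:
    # Placeholder: add narration/music, then concatenate the clips.
    return " + ".join(clips) + " + soundtrack"

def idea_to_video(idea: str) -> str:
    beats = write_script(idea)                         # scripting
    keyframes = [generate_keyframe(b) for b in beats]  # storyboarding
    clips = [animate(k) for k in keyframes]            # video
    return add_audio_and_stitch(clips)                 # post-production

print(idea_to_video("a dog who dreams of flying"))
```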
What does people’s workflow for creating a video story look like today?
Today's visual storytelling process is largely manual and labor-intensive, consisting of several distinct steps:
- Script Development - Content creators typically use ChatGPT or develop scripts independently
- Storyboard Generation - Major frames are created using platforms like OpenArt or similar image generation tools
- Video Transformation - These key frames are converted into video clips using our image-to-video feature
- Post-Production - Clips are combined and enhanced with sound, music, and additional elements
Besides the complicated workflow, there are other challenges too. For example, character consistency throughout a narrative is one of the most persistent hurdles for creators. Recognizing this pain point, we developed our consistent character feature, which earned the #1 spot on Product Hunt a few days ago. This technology is the result of extensive research and development aimed at solving a fundamental problem in image creation. Before this, creators had to rely on various workarounds to maintain visual continuity across scenes. Now, users can preserve character consistency throughout their projects, resulting in more authentic and natural storytelling.
Looking ahead to the rest of 2025, our vision tackles both automation and fundamental creative challenges. Character consistency was just our first breakthrough. We're now focused on the remaining creative obstacles, like coherent story arcs, and on removing technical barriers with simplified workflows and professional-quality sound - all while making these tools accessible to everyone, regardless of film experience. Through these efforts, we're democratizing content creation for a new generation of storytellers.
How do you think about where the AI image generation market is going between products like Midjourney, DALL-E, Grok, Playground, Photoroom and others alongside OpenArt?
Recently, I've been reading a book called "Play Bigger," and one concept I particularly value is about category definition - whether a company defines a category or simply follows within an existing one.
I applaud Midjourney for defining the AI image generation category. While DALL-E and other platforms exist, Midjourney really established AI image generation and made it widely known to the public.
OpenArt benefited significantly from this new category of AI image generation, which helped drive our growth last year. However, operating in a category defined by someone else means you're just one of many players, potentially a smaller one.
My vision for OpenArt is to become a category-defining company ourselves.
We're focusing on defining the visual storytelling category, where users can transform text stories into visual stories. Currently, this manifests in video format, but future iterations could include more innovative formats such as interactive videos or even games - essentially, any transformation from text to visual storytelling.
Today, companies like Photoroom and OpenArt are not competitors—Photoroom is B2B and e-commerce-centric. Do you see these categories staying separate, or converging over time as they all add image generation and editing, eventually colliding into some AI-native Photoshop?
Photoroom achieved its scale and success because it clearly defined its category: creative tools for e-commerce. Similarly, we want to define our category as creative tools for storytellers.
Regarding our target audience in the B2B space, I'll share our internal breakdown of personas: content creators, fantasy enthusiasts, SMBs, film studios, and others.
This year, we're heavily focusing on social media content creators - helping them easily produce stories they can share on social media to increase their followers and engagement.
We also serve fantasy enthusiasts - people who play RPGs, create fan art, and love anime. While these first two segments are more B2C-oriented, there are significant B2B opportunities with SMB marketing agencies creating various types of ads.
There's also a lot of interesting content we can help small and medium-sized businesses create for social media to increase their engagement. Just yesterday, I interviewed a user named Dan who works as a TV producer. He uses OpenArt for many of his personal creative projects because TV channels often have copyright concerns to navigate.
For his personal creative projects, Dan created interesting promotional material for a local coffee shop. He took a photo at the coffee shop and used OpenArt to train a model of himself, which he incorporated into the photo. The image showed him holding a candle, and he used the video feature to animate the entire composition.
When he posted it on social media, the brand was delighted and reposted it, generating significant engagement. The entire process took him only twenty-five minutes, which is remarkably quick compared to traditional methods. This demonstrates how small and medium-sized businesses can efficiently create engaging content.
Regarding film studios, their current process involves development, pre-production, production, post-production, and distribution marketing. While AI technology may not yet be advanced enough to significantly impact production and post-production, many film producers are currently using platforms similar to OpenArt to build storyboards and create concept movies to secure funding for their actual films.
We've seen some common AI-powered features emerge across tools in addition to text to image like inpainting, outpainting, edit & browse by vibe / stylistic similarity and more. What do you see as the core primitives like this for video?
Starting as an image platform gives us a significant advantage in maintaining control and consistency.
We've discovered that generating images first and then converting them into videos has been crucial for achieving high-quality results.
This approach mirrors traditional filmmaking, where large projects typically begin with storyboarding before full video production. Images provide more precise control - you can position each frame exactly where you want it within the video sequence.
Many filmmakers follow this process: they create and perfect individual images, ensure each frame is exactly what they want, make necessary edits, and only then proceed with video production. This workflow advantage stems from our foundation as an image-based platform.
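A rough sketch of that image-first discipline, again with hypothetical function names rather than OpenArt's real API: iterate cheaply on the keyframe until it is exactly right, and only then spend compute on video generation.

```python
# Hypothetical image-first workflow: lock the keyframe before
# animating. Function names are illustrative placeholders.

def generate_image(prompt: str, seed: int) -> str:
    return f"frame(prompt={prompt!r}, seed={seed})"

def image_to_video(frame: str, motion: str) -> str:
    return f"clip(start={frame}, motion={motion!r})"

def approved(frame: str) -> bool:
    # Stand-in for a human reviewing the storyboard frame.
    return True

prompt, seed = "knight at dawn, storyboard style", 0
frame = generate_image(prompt, seed)
while not approved(frame):   # cheap iteration on a single image...
    seed += 1
    frame = generate_image(prompt, seed)

clip = image_to_video(frame, "slow push-in, mist drifting")  # ...then animate
```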
OpenArt is built on top of many open-source models like Flux, Stable Diffusion etc. Why did you choose to build on top of an open source model? How does that impact your degrees of freedom in terms of what you can build going forward with OpenArt, the images you can generate, the features you can offer and your cost structure & pricing?
There are two aspects to consider. First, regarding our approach to models: we don't build our own foundational models, though we do offer fine-tuning services, which is essentially personalization.
Model training is extremely resource-intensive. As a relatively neutral platform, we can instead integrate with various state-of-the-art models from different providers while building strong relationships with them - on the business side, that neutrality opens up many opportunities.
As for open source, OpenArt wouldn't exist without it.
The open-source nature of models and technology (with proper licensing when needed) allows us to stay at the forefront of developments. New open-source technologies emerge monthly, and since we're solving specific problems, we can seamlessly swap in better models behind the scenes.
Users simply notice improved quality without needing to understand the technical details. This leverage of open-source technology has made OpenArt a fast-moving, iterative platform that consistently maintains the highest quality.
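One way to picture "seamlessly swap in better models behind the scenes" is a thin registry between product features and whichever model currently backs them. The sketch below is an assumption about how such a layer could work, not a description of OpenArt's actual architecture:

```python
# Hypothetical model registry: features call a stable capability name,
# and the registry maps it to the current best model, so a better
# open-source checkpoint can be swapped in without touching product code.
from typing import Callable

REGISTRY: dict[str, tuple[str, Callable[[str], str]]] = {}

def register(capability: str, model_id: str, fn: Callable[[str], str]) -> None:
    REGISTRY[capability] = (model_id, fn)

def generate(capability: str, prompt: str) -> str:
    model_id, fn = REGISTRY[capability]
    return fn(prompt)

# Early 2024: text-to-image backed by an SDXL-class model...
register("text_to_image", "sdxl-1.0", lambda p: f"sdxl renders {p!r}")

# ...later: swap in a Flux-class model; callers never change.
register("text_to_image", "flux-dev", lambda p: f"flux renders {p!r}")

print(generate("text_to_image", "a lighthouse in fog"))
```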
Instagram, Snapchat and TikTok have filters and lenses and increasingly powerful creation and editing workflows. What is the opportunity for social platforms and the risk to a company like OpenArt in those platforms increasingly building AI tooling into their workflows?
This is a great question that I've thought about extensively. While there's definitely an opportunity to build the next TikTok in this generative AI wave, I don't believe right now is the time for many reasons - but that time is coming. We hope OpenArt can be the one to build it.
Current platforms are what I'd call AI-assisted social media - they're adding AI technologies and features to enhance their existing content. Because they need to support traditional formats, the content remains relatively similar. The video I showed you could be posted on YouTube or TikTok without much difference.
However, a fundamental format change could create an opportunity for a truly AI-native platform. While I can't predict exactly what this format will be, one possibility is much more interactive video content. On YouTube today, everyone watches the exact same video.
But imagine if audiences could see themselves in the videos or make choices at different decision points to direct the story's path. Since AI makes creating these films much cheaper than traditional media production, we could develop more AI-native creation and consumption formats. That breakthrough could lead to revolutionary AI-native social media.
Are image models improving at the same rate as text-based models and how should we reason about their trajectory? What will be possible with image models that's not possible today?
Several key developments are happening with AI media models. First, the quality and consistency are improving significantly. Videos are becoming smoother, with fewer errors like extra limbs, and the difference between early 2024 and late 2024 is already substantial.
Second, personalization is becoming increasingly important. On OpenArt, one of our most popular features allows users to fine-tune their own models or create consistent characters. This trend toward personalization is reflected across all foundational models, which now include fine-tuning features to meet users' customization needs.
Third, multimodal integration is advancing rapidly. For example, image models previously struggled with text generation, but newer models like Flux perform much better because they incorporate transformer architecture that effectively combines text and images. Recently, models have emerged that can generate videos complete with audio and music integration.
This evolution makes it crucial for platforms like OpenArt to provide value beyond what better models alone can offer. If we simply relied on model capabilities, users might bypass platforms entirely when improved models become available. By positioning OpenArt as a solution for story visualization, where users maintain their characters, stories, and templates, we benefit from advancing technology while remaining essential to our users. We're focused on solving a specific problem space rather than just implementing new technology.
Visual storytelling is one way to describe the space, but a more detailed explanation would be: visualizing and transforming textual stories into video or other visual formats.
Can you talk a bit about pricing and how you monetize? How is the OpenArt subscription different from that of a tool like Midjourney?
We have different subscription tiers that provide varying benefits throughout the membership journey.
Our approach is more user-friendly, with a credit system where we clearly indicate how many credits each model requires.
This pricing structure is important for us because AI software is fundamentally different from traditional software - it's much more resource-intensive compared to traditional SaaS. However, while infrastructure companies are in a race to the bottom, we can maintain better margins at the application layer.
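As a toy illustration of credit-based metering - the model names and credit costs below are invented, not OpenArt's actual rates - each model is priced in credits that roughly track its compute cost, and a generation debits the user's balance:

```python
# Invented credit costs for illustration only; video generation is
# priced higher because it is far more resource-intensive.
CREDIT_COST = {
    "fast-image-model": 1,
    "premium-image-model": 4,
    "image-to-video-model": 20,
}

def charge(balance: int, model: str) -> int:
    cost = CREDIT_COST[model]
    if balance < cost:
        raise ValueError(f"{model} needs {cost} credits, only {balance} left")
    return balance - cost

balance = 50  # e.g. a monthly subscription allotment
balance = charge(balance, "premium-image-model")
balance = charge(balance, "image-to-video-model")
print(balance)  # 26 credits remaining
```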
What is ComfyUI, why did OpenArt integrate with it, and how do you think about its complementarity or competitiveness going forward?
As you know, we've explored different directions over the past two years, making many micro-pivots that people don't even see. ComfyUI workflows were one direction we explored.
We saw great potential in people creating different workflows and even the possibility of a workflow ecosystem where advanced users could create workflows that newer users could access through a more friendly interface. However, we didn't pursue that direction further.
As you pointed out, ComfyUI workflows are more suited to advanced users. Remember my A-to-B analogy? ComfyUI users fall into a fourth category: they essentially buy all the engines and wheels to assemble a car themselves before driving.
I've seen many advanced users achieve impressive results with ComfyUI, and ComfyUI is fully open source. Our backend uses a lot of ComfyUI workflows to support our features, and I see a lot of B2B and enterprise opportunities for the ComfyUI team!
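For readers unfamiliar with it: a ComfyUI workflow is a graph of nodes, and ComfyUI's API format expresses that graph as JSON mapping node ids to node types and their inputs. Below is a minimal text-to-image graph submitted to a locally running server on ComfyUI's default port; the checkpoint filename is a placeholder.

```python
# Minimal text-to-image workflow in ComfyUI's API format. Node class
# names are standard ComfyUI nodes; ["1", 0] means "output 0 of node 1".
import json
import urllib.request

workflow = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "some-model.safetensors"}},  # placeholder
    "2": {"class_type": "CLIPTextEncode",
          "inputs": {"text": "a watercolor lighthouse", "clip": ["1", 1]}},
    "3": {"class_type": "CLIPTextEncode",
          "inputs": {"text": "blurry, low quality", "clip": ["1", 1]}},
    "4": {"class_type": "EmptyLatentImage",
          "inputs": {"width": 1024, "height": 1024, "batch_size": 1}},
    "5": {"class_type": "KSampler",
          "inputs": {"model": ["1", 0], "positive": ["2", 0],
                     "negative": ["3", 0], "latent_image": ["4", 0],
                     "seed": 42, "steps": 20, "cfg": 7.0,
                     "sampler_name": "euler", "scheduler": "normal",
                     "denoise": 1.0}},
    "6": {"class_type": "VAEDecode",
          "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
    "7": {"class_type": "SaveImage",
          "inputs": {"images": ["6", 0], "filename_prefix": "openart"}},
}

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",  # ComfyUI's default local endpoint
    data=json.dumps({"prompt": workflow}).encode(),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
```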
Assuming everything you're talking about goes well, how do you see the world changing in the next five years, and what role do you see OpenArt playing in that future?
Canva really inspired me because they democratized visual communication. They enabled small and medium-sized businesses that previously had to hire designers or use Photoshop to easily create posters by themselves using numerous templates. Similar to how Canva democratized visual communication, I hope to see OpenArt democratize visual storytelling.
Imagine having a simple idea for a Pixar-style movie about someone who grew up in a poor country and came to the United States to pursue their American dream. You could express this idea, even in voice format, and instantly generate a full-fledged animated movie. If you don't like certain parts, you can edit them. I want to democratize this technology to the point where even a five-year-old can simply express their idea and immediately create their own movie.
As I mentioned, there's opportunity for more AI-native social media platforms. OpenArt currently focuses on being a tool, which is the right approach. Even Instagram started as a tool with its famous filters.
Similar to TikTok, which started with lip-syncing, all social media platforms began with specific tools. There's a famous saying: "Come for the tools, stay for the network." Currently, we're focusing on the tooling side, but we hope to expand into a content platform in the future. The priority is to nail down the tooling first.
Disclaimers
This transcript is for information purposes only and does not constitute advice of any type or trade recommendation and should not form the basis of any investment decision. Sacra accepts no liability for the transcript or for any errors, omissions or inaccuracies in respect of it. The views of the experts expressed in the transcript are those of the experts and they are not endorsed by, nor do they represent the opinion of Sacra. Sacra reserves all copyright, intellectual property rights in the transcript. Any modification, copying, displaying, distributing, transmitting, publishing, licensing, creating derivative works from, or selling any transcript is strictly prohibited.