How have usage-based pricing and the recession affected data labeling, and can few-shot learning help?

The data annotation space isn’t going to keep growing like it did earlier, because pre-trained models are going to get better and better. The rise of those companies, Scale in particular, was tied to the autonomous driving industry, which was extremely data-hungry.

There are two trends pulling in different directions. The first is that a self-driving car operating in an open world faces a combinatorial number of edge cases. What I mean by that specifically is, imagine you struggle with people who are dressed in some sort of glossy rain gear. That might be a problem if your AI is a computer vision system. You might also struggle if it's raining. You might also struggle if it's dusk, or if the sun is shining right into the camera.

Then you get a hundred of those kinds of edge cases, and you have all the combinations of those hundred. That's a combinatorial problem, and that’s a good thing for Scale because it means these companies are going to keep sending it a practically endless amount of data. That's one trend.
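To put rough numbers on that combinatorial growth, here is a minimal Python sketch. It assumes each edge case is an independent on/off condition, which is a simplification, since not every combination occurs on real roads. The condition names come from the examples above, and the helper names `scenario_count` and `pair_count` are hypothetical.

```python
from itertools import product
from math import comb

# Edge-case conditions mentioned in the text; each is treated as an
# independent on/off flag (an assumption made only for illustration).
conditions = ["glossy rain gear", "rain", "dusk", "sun glare"]

# Enumerate every on/off combination of the four conditions.
scenes = list(product([False, True], repeat=len(conditions)))


def scenario_count(n: int) -> int:
    """Number of on/off combinations of n independent conditions."""
    return 2 ** n


def pair_count(n: int) -> int:
    """Number of ways two of n conditions can co-occur."""
    return comb(n, 2)


if __name__ == "__main__":
    print(len(scenes))          # combinations of the 4 conditions above
    print(pair_count(100))      # pairs among 100 edge cases
    print(scenario_count(100))  # all on/off combinations of 100 edge cases
```

Even when only pairs of conditions are counted, a hundred edge cases produce thousands of distinct situations, and the full set of combinations is far too large to label exhaustively.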

The other trend is that, as ML architectures and pre-training get better, models generalize better. You have far fewer of these edge cases, the situations where performance degrades catastrophically.

I think the autonomous vehicle industry is struggling to grow. Well, Cruise and Waymo may pull it off, but clearly, there's not as much money being put into it. So if I were running a pure data annotation company, I would probably want to go into other parts of the stack, because a lot of the stuff that Scale was offering, like offshore labelers for hire, APIs to define instructions and versioning, labeling, and annotation, you don't need anymore.

If you are building in that future world, and that’s where we see Nyckel, you have the domain expert, like the person building the product, the product manager, or the developer, labeling the data themselves. They know the data intimately, and they can get rid of so much overhead in terms of writing instructions and setting up payment incentives to keep outsourced data labelers motivated.