Do companies bring their own data to Nyckel, train models with it and then take those models to production?

Yes, that's right. They all bring their own data. I believe very strongly that this is important. Here is why:

Before you put any ML model in production, you should at least test it on your own data, because in the current state of AI, as glorious as GPT-3 may seem, domain shift is still always an issue. My position is always that you should provide enough data to convince yourself that it is working, and that's maybe 10 or 100 or so data points.

What we realized with Nyckel is that once you have 100 points, you can use cross-validation and actually train on those points as well. You can split the data up into chunks, train on a subset, and evaluate on the rest. Now you've got two things at once: you fine-tune the model to your data and you convince yourself that it works for your data.
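The chunk-train-evaluate loop described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Nyckel's actual pipeline: the data is synthetic, the "model" is a toy nearest-centroid classifier standing in for fine-tuning, and the names `k_fold_indices`, `train_centroids`, `predict`, and `cross_validate` are hypothetical.

```python
import random
from collections import defaultdict


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k roughly equal chunks."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def train_centroids(points, labels):
    """Toy 'training' step: average the feature vectors of each class."""
    sums, counts = {}, defaultdict(int)
    for x, y in zip(points, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids, x):
    """Assign x to the class whose centroid is nearest (squared distance)."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], x)),
    )


def cross_validate(points, labels, k=5):
    """Train on k-1 chunks, score on the held-out chunk, average over chunks."""
    scores = []
    for held_out in k_fold_indices(len(points), k):
        held = set(held_out)
        train_idx = [i for i in range(len(points)) if i not in held]
        model = train_centroids(
            [points[i] for i in train_idx], [labels[i] for i in train_idx]
        )
        correct = sum(predict(model, points[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


# 100 synthetic, well-separated points standing in for annotated samples.
rng = random.Random(1)
points = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
points += [[rng.gauss(6, 1), rng.gauss(6, 1)] for _ in range(50)]
labels = ["a"] * 50 + ["b"] * 50

accuracy = cross_validate(points, labels, k=5)
print(f"cross-validated accuracy: {accuracy:.2f}")
```

Every point is used for training in some folds and for evaluation in exactly one, which is how 100 samples can serve both purposes at once.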

That's why we tell our customers that they have to upload their own data. They have to actually sit down and annotate it. Some of them have their own labels from some database, but more often than not, they just annotate in the UI. It takes 20 minutes or so for 100 samples, and then we train in a few seconds, deploy it immediately, and they're done.

I think one of my biggest pet peeves with certain aspects of GPT-3 is this: if you're doing generative modeling, where you want something to start generating content for you, maybe it's okay to use it directly. But if you're using it for classification, you need a data engine on top of it. You need some way to check if it's working for you. It is not really sustainable to just put it into a prompt and hope that it works for other situations. You need a way to say: here's what I want it to do, now go do it.
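As a rough illustration of "some way to check if it's working for you", the sketch below scores any classifier function against a small set of labeled examples and lists what it got wrong. Everything in it is hypothetical: `evaluate` is an illustrative helper, `keyword_classifier` is a stand-in for a prompted model, and the four examples are made up.

```python
def evaluate(classify, labeled_examples):
    """Run classify on every example; return (accuracy, list of mistakes)."""
    mistakes = []
    for text, expected in labeled_examples:
        predicted = classify(text)
        if predicted != expected:
            mistakes.append((text, expected, predicted))
    accuracy = 1 - len(mistakes) / len(labeled_examples)
    return accuracy, mistakes


def keyword_classifier(text):
    """Hypothetical stand-in for a prompted model: labels by one keyword."""
    return "billing" if "invoice" in text.lower() else "other"


# Made-up labeled examples, standing in for data you annotated yourself.
examples = [
    ("My invoice is wrong", "billing"),
    ("I was charged twice", "billing"),
    ("The app crashes on login", "other"),
    ("Please resend the invoice", "billing"),
]

accuracy, mistakes = evaluate(keyword_classifier, examples)
print(f"accuracy: {accuracy:.2f}")
for text, expected, predicted in mistakes:
    print(f"missed: {text!r} (expected {expected}, got {predicted})")
```

The point of the sketch is the mistakes list: a classifier that looks fine on the cases you thought of can still miss phrasings you did not, and only labeled data from your own domain exposes that.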