Tahoe-100M: 50x Larger Dataset

Tahoe Therapeutics

Tahoe-100M is 50 times larger than all previously available public drug-perturbation data combined.


The real advantage is not just more data. It is a step change in how many drug and cell combinations can be observed directly instead of inferred from small public studies. Tahoe-100M covers 100 million cells, roughly 60,000 drug-cell interactions, 50 cancer models, and 1,100 to 1,200 drug perturbations. That breadth makes it useful for training virtual cell models that learn response patterns across many genetic backgrounds rather than from a few narrow experiments.

1 arcinstitute 2 sacra

Before Tahoe, most public single-cell perturbation resources were either fragmented collections of dozens of studies or much smaller, focused experiments. scPerturb harmonized 44 public single-cell perturbation datasets, while Arc described Tahoe-100M as 50x larger than all public drug-perturbed data combined. That gap helps explain why model builders treated data availability as the bottleneck.

1 arcinstitute 3 nature

What matters in practice is interaction density, not just raw cell count. Arc says Tahoe-100M maps about 60,000 drug-cell interactions across 50 cancer cell lines, so researchers can compare how the same molecule shifts gene expression in many tumor contexts. That is exactly the supervision needed to predict which patient subgroups may respond differently.

1 arcinstitute 2 sacra
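
As a rough sanity check on the figures quoted above, the sketch below multiplies the stated number of cell lines by the stated range of drug perturbations and compares the result with the roughly 60,000 interactions reported. It also estimates the average number of cells per drug-cell pair. The function names are hypothetical, and the per-pair average assumes cells are spread evenly across interactions, which the sources do not claim.

```python
# Back-of-envelope arithmetic using only the figures quoted in the text.
# Assumption (not from the sources): cells are spread evenly across interactions.

TOTAL_CELLS = 100_000_000        # "100 million cells"
CELL_LINES = 50                  # "50 cancer models"
PERTURBATIONS_LOW = 1_100        # "1,100 to 1,200 drug perturbations"
PERTURBATIONS_HIGH = 1_200
REPORTED_INTERACTIONS = 60_000   # "roughly 60,000 drug-cell interactions"


def interaction_range(cell_lines: int, pert_low: int, pert_high: int) -> tuple[int, int]:
    """Drug-cell pairs implied if every perturbation is applied to every cell line."""
    return cell_lines * pert_low, cell_lines * pert_high


def mean_cells_per_interaction(total_cells: int, interactions: int) -> float:
    """Average cells profiled per drug-cell pair under the even-spread assumption."""
    return total_cells / interactions


if __name__ == "__main__":
    low, high = interaction_range(CELL_LINES, PERTURBATIONS_LOW, PERTURBATIONS_HIGH)
    print(f"Implied interactions: {low:,} to {high:,}")
    print(f"Reported interactions: about {REPORTED_INTERACTIONS:,}")
    avg = mean_cells_per_interaction(TOTAL_CELLS, REPORTED_INTERACTIONS)
    print(f"Average cells per interaction: about {avg:,.0f}")
```

Under these assumptions the implied range of 55,000 to 60,000 pairs is consistent with the reported figure, and each drug-cell pair would be backed by well over a thousand cells on average.
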
This also explains the business model. Once a wet-lab system can generate perturbation maps at this scale, its output behaves like model-training infrastructure. The same dataset can support pharma partnerships, external model builders like Arc, and Tahoe's own internal drug programs. Competitors such as Xaira and Recursion are also racing to build proprietary data moats.

2 sacra

The next phase is a shift from large datasets to foundation datasets. As Tahoe moves toward billion-cell scale and denser perturbation coverage, the winners in virtual biology are likely to be the companies that own the best real-world training data, not just the best model architecture, because better supervision is what makes predictions reliable enough for drug development decisions.

2 sacra 5 arcinstitute