Preprint details how CZ CELLxGENE Discover uses collaborative curation to balance scale and data quality
A bioRxiv preprint from the CZ CELLxGENE team describes a submission model in which data contributors partner with dedicated curators, enabling the resource to grow rapidly while maintaining metadata quality for AI-scale analysis.
A preprint posted to bioRxiv on 5 June 2026 describes the collaborative submission model underpinning CZ CELLxGENE Discover, a community data resource that aggregates single-cell and spatial transcriptomics datasets across studies for large-scale biomedical research and AI model development.
The authors, drawn from the Chan Zuckerberg Initiative and collaborating institutions, outline how a fundamental tension in building community resources (the desire for a large data corpus versus the need for high-quality, richly annotated metadata) has been addressed by partnering data contributors directly with dedicated resource curators. Rather than requiring contributors to conform to standards independently, curators work alongside submitters to harmonise data and metadata, reducing errors and improving downstream usability.
The preprint reports that this model has enabled CELLxGENE Discover to become a widely used infrastructure resource, supporting large-scale re-analysis and the training of foundation models in genomics. The authors discuss practical lessons for other community resources facing the same scale-versus-quality tension.
The work has not yet been peer-reviewed. It will be of primary interest to researchers building or contributing to genomic data infrastructures, and to those developing or evaluating AI models trained on single-cell data.
Sources
Read the original reporting — these are the public sources this summary draws from.
- Primary source (preprint): bioRxiv (Cold Spring Harbor Laboratory) · 2026-06-05 · A collaborative submission model for building high-quality data resources at scale through partnership