
📊 Genomic Datasets

Curated, open-access datasets for training and evaluating genomic AI models: variants, sequences, clinical trials, and more.

🧬
deepcog-ai · ✓ Verified
GenomeIndia-10K-WGS

Whole-genome sequences from 10,000 Indian individuals across 100+ ethnic groups. Includes SNP, INDEL, and structural variant annotations. De-identified and ethics-approved.

Tags: Whole Genome, Population, VCF
Samples: 10,247
Variant pairs: 2.4M
Size: 4.8 TB
License: Apache 2.0
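
Since this dataset is tagged VCF and its description mentions SNP, INDEL, and structural variant annotations, the following is a minimal sketch of how records in a VCF-style file could be tallied by variant type. It relies only on the standard tab-separated VCF column order (REF is the fourth column, ALT the fifth) and assumes nothing about this dataset's actual file layout. The function names and the sample records are hypothetical and purely illustrative.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify a single REF/ALT pair using allele lengths only."""
    if alt in (".", "*"):
        return "OTHER"  # missing or spanning-deletion allele
    if alt.startswith("<") or "[" in alt or "]" in alt:
        return "SV"  # symbolic (e.g. <DEL>) or breakend allele
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) != len(alt):
        return "INDEL"
    return "MNP"  # same length, more than one base


def count_variant_types(lines):
    """Count variant types over an iterable of VCF text lines."""
    counts = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, header, and blank lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4]
        for alt in alts.split(","):  # multi-allelic sites list several ALTs
            kind = classify_variant(ref, alt)
            counts[kind] = counts.get(kind, 0) + 1
    return counts


# Hypothetical records, not taken from the dataset.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t10177\t.\tA\tAC\t.\tPASS\t.",
    "chr1\t10352\t.\tT\tG\t.\tPASS\t.",
    "chr2\t500\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL",
]
print(count_variant_types(example))
```

For files at the multi-terabyte scale listed here, a dedicated VCF library and indexed, compressed access would likely be a better fit than line-by-line string splitting; the sketch only illustrates the classification logic.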
🔬
deepcog-ai · ✓ Verified
ClinVar-Pathogenic-2024

Curated subset of ClinVar with 450,000 pathogenic and likely-pathogenic variants, enriched with functional evidence, ACMG classifications, and literature links.

Tags: Variant, Clinical, JSON
Variants: 450K
Genes: 82K
Size: 12 GB
License: CC BY 4.0
📄
deepcog-ai · aiims-delhi
BioMed-Papers-42M

42 million biomedical research papers from PubMed, PMC Open Access, and preprints. Preprocessed for LLM training with structured metadata, abstracts, and full texts where available.

Tags: Literature, Pretraining, JSONL
Papers: 42M
Coverage: 1995–2024
Size: 820 GB
License: Mixed
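
Because this corpus is tagged JSONL and includes full texts only "where available", a consumer would typically stream it one record at a time rather than load it whole. The sketch below shows one way to do that and to count how many records carry a full text. The field names `title`, `abstract`, and `full_text` are assumptions made for illustration; the dataset's real schema is not stated on this page.

```python
import io
import json


def iter_jsonl(handle):
    """Yield one parsed object per non-empty line of a JSONL stream."""
    for line_no, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no}: invalid JSON") from exc


def count_with_full_text(records, field="full_text"):
    """Return (total records, records with a non-empty `field`)."""
    total = with_text = 0
    for rec in records:
        total += 1
        if rec.get(field):  # hypothetical field name
            with_text += 1
    return total, with_text


# Hypothetical two-record sample standing in for a real shard.
sample = io.StringIO(
    '{"title": "A", "abstract": "x", "full_text": "body"}\n'
    '{"title": "B", "abstract": "y"}\n'
)
print(count_with_full_text(iter_jsonl(sample)))
```

Streaming through a generator keeps memory use roughly constant regardless of shard size, which matters for a corpus listed at 820 GB.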
💊
deepcog-ai
DrugTarget-SMILES-2M

2 million drug-target interaction pairs with SMILES notation, protein sequences, binding affinities, ADMET properties, and clinical outcome data from ChEMBL and BindingDB.

Tags: Drug Discovery, SMILES, CSV
Pairs: 2.1M
Compounds: 340K
Size: 8.3 GB
License: Apache 2.0
🧫
iit-madras-bioai · ✓ Verified
HistoPath-India-2M

2 million annotated histopathology images from Indian cancer centers, covering 18 cancer types. Includes WHO grading, tumor boundaries, and pathologist consensus labels.

Tags: Pathology, Vision, TIFF
Images: 2.1M
Cancer types: 18
Size: 14 TB
License: CC BY 4.0
πŸ†
deepcog-ai Β· βœ“ Benchmark
GeneTuring-Benchmark-v2

The GeneTuring benchmark suite evaluates genomic AI models across 12 tasks, including variant classification, gene function prediction, and clinical report generation.

Tags: Benchmark, Evaluation, JSON
Tasks: 12
Test cases: 48K
Size: 2.1 GB
License: Apache 2.0