Disease & Association Databases

Open Targets

The Open Targets Platform aggregates target-disease-drug evidence from 20+ data sources and scores each association on a 0–1 overall scale. It provides pre-computed Parquet datasets covering targets, diseases, evidence, and drugs, which makes it the richest single source for drug target validation.

Dataset	Key Columns	Description
`targets`	`id`, `approvedSymbol`, `approvedName`, `biotype`, `tractability`	Target catalog (~63K entries)
`associations`	`targetId`, `diseaseId`, `score`, `datatypeScores`	Target-disease association scores
`drugs`	`id`, `name`, `drugType`, `mechanismsOfAction`, `indications`	Drug catalog with MOA

Disease identifiers: EFO IDs (EFO_0000311); join via disease_map.efo_id

Target identifiers: Ensembl gene IDs (ENSG00000146648); join via protein_map.ensembl_gene_id

uv run bioingest download opentargets --all
-- Athena: top disease associations for EGFR
SELECT diseaseId, score FROM bioingest.opentargets__associations
WHERE targetId = 'ENSG00000146648' ORDER BY score DESC LIMIT 20;
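
The identifier joins described above can be tried locally before running them in Athena. What follows is a minimal sketch using Python's built-in `sqlite3` module with a few made-up rows. The column layouts of `protein_map` and `disease_map` beyond the join keys are assumptions, and the scores, the disease name, and the ID `EFO_9999999` are placeholders rather than real Open Targets values.

```python
import sqlite3

# In-memory stand-ins for the Athena tables; every row is an illustrative placeholder.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE associations (targetId TEXT, diseaseId TEXT, score REAL);
CREATE TABLE protein_map (ensembl_gene_id TEXT, gene_name TEXT);
CREATE TABLE disease_map (efo_id TEXT, mondo_name TEXT);

INSERT INTO associations VALUES
  ('ENSG00000146648', 'EFO_0000311', 0.9),
  ('ENSG00000146648', 'EFO_9999999', 0.4);  -- placeholder ID with no disease_map row
INSERT INTO protein_map VALUES ('ENSG00000146648', 'EGFR');
INSERT INTO disease_map VALUES ('EFO_0000311', 'example disease');
""")

# Resolve the target by gene symbol and attach a disease name where one is mapped.
# LEFT JOIN keeps associations whose EFO ID is missing from disease_map.
rows = conn.execute("""
SELECT p.gene_name, a.diseaseId, d.mondo_name, a.score
FROM associations AS a
JOIN protein_map AS p ON p.ensembl_gene_id = a.targetId
LEFT JOIN disease_map AS d ON d.efo_id = a.diseaseId
WHERE p.gene_name = 'EGFR'
ORDER BY a.score DESC
""").fetchall()

for row in rows:
    print(row)
```

Using a left join on `disease_map` is a deliberate choice in this sketch: unmapped diseases show up with an empty name instead of silently disappearing from the result.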

DISEASES 2.0

JensenLab's gene-disease associations from three channels: curated knowledge, experimental evidence, and text-mining. Each association includes a confidence score (0–5 scale). Useful for broad disease enrichment beyond Open Targets.

Dataset	Key Columns	Description
`knowledge`	`gene_id`, `gene_name`, `disease_id`, `disease_name`, `confidence`	Curated associations
`experiments`	`gene_id`, `gene_name`, `disease_id`, `disease_name`, `confidence`	Experimental evidence
`textmining`	`gene_id`, `gene_name`, `disease_id`, `disease_name`, `zscore`	Text-mined co-mentions

Disease identifiers: DOID (DOID:9351); join via disease_map.doid

Gene identifiers: Ensembl gene ID or HGNC symbol; join via protein_map.ensembl_gene_id or protein_map.gene_name

uv run bioingest download diseases --all
-- Athena
SELECT gene_name, disease_name, confidence FROM bioingest.diseases__knowledge
WHERE gene_name = 'EGFR' ORDER BY confidence DESC;
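
Because the same gene-disease pair can appear in more than one channel, it can help to collapse the channels into one record per pair. The sketch below does this in plain Python for the `knowledge` and `experiments` channels, keeping the highest confidence and noting which channels support the pair. The text-mining channel is left out because the table above lists a `zscore` column for it rather than `confidence`. The helper name `merge_channels`, the DOIDs, and the scores are hypothetical.

```python
from collections import defaultdict

def merge_channels(channels):
    """Collapse per-channel rows into one record per (gene_name, disease_id) pair.

    `channels` maps a channel name to rows of (gene_name, disease_id, confidence).
    """
    merged = defaultdict(lambda: {"confidence": 0.0, "channels": set()})
    for channel, rows in channels.items():
        for gene_name, disease_id, confidence in rows:
            entry = merged[(gene_name, disease_id)]
            # Keep the strongest confidence seen across channels.
            entry["confidence"] = max(entry["confidence"], confidence)
            entry["channels"].add(channel)
    return dict(merged)

# Placeholder rows, not real DISEASES data.
channels = {
    "knowledge": [("EGFR", "DOID:0000001", 4.0)],
    "experiments": [
        ("EGFR", "DOID:0000001", 2.5),
        ("EGFR", "DOID:0000002", 1.0),
    ],
}

merged = merge_channels(channels)
for (gene, disease), info in sorted(merged.items()):
    print(gene, disease, info["confidence"], sorted(info["channels"]))
```

Taking the maximum is only one possible policy; a stricter pipeline might require support from at least two channels before keeping a pair.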

ClinVar

NCBI's archive of relationships between human variation and observed health phenotypes. It reports pathogenicity classifications (pathogenic, likely pathogenic, variant of uncertain significance (VUS), benign) for variants in the context of specific conditions.

Dataset	Key Columns	Description
`variant_summary`	`GeneSymbol`, `ClinicalSignificance`, `PhenotypeIDS`, `Assembly`, `Chromosome`, `Start`, `ReferenceAllele`, `AlternateAllele`	All submitted variant interpretations
`gene_condition_source_id`	`GeneSymbol`, `ConceptID`, `DiseaseName`, `SourceName`	Gene-condition mapping with MedGen/OMIM IDs

Disease identifiers: MedGen CUI, OMIM ID; join via disease_map.medgen_id

Gene identifiers: HGNC symbol; join via protein_map.gene_name

uv run bioingest download clinvar --all
-- Athena: pathogenic variants in BRCA1
SELECT GeneSymbol, ClinicalSignificance, PhenotypeIDS FROM bioingest.clinvar__variant_summary
WHERE GeneSymbol = 'BRCA1' AND ClinicalSignificance LIKE '%Pathogenic%';
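
The `LIKE '%Pathogenic%'` filter above is a substring match, and wherever `LIKE` is case-sensitive it would skip labels such as `Likely pathogenic`. A stricter check can be sketched in Python by splitting a compound label into its parts and comparing each part exactly. The label strings used here are assumptions based on commonly seen ClinVar values, and `is_pathogenic` is a hypothetical helper, not part of bioingest.

```python
import re

# Labels treated as pathogenic in this sketch (compared case-insensitively).
PATHOGENIC_TERMS = {"pathogenic", "likely pathogenic"}

def is_pathogenic(clinical_significance: str) -> bool:
    """Return True if any component of the label is (likely) pathogenic.

    Compound labels such as 'Pathogenic/Likely pathogenic' are split on
    '/', ';' and ',' before each part is compared exactly.
    """
    parts = re.split(r"[/;,]", clinical_significance)
    return any(part.strip().lower() in PATHOGENIC_TERMS for part in parts)

# Assumed example labels, not pulled from a live ClinVar release.
labels = [
    "Pathogenic",
    "Likely pathogenic",
    "Pathogenic/Likely pathogenic",
    "Uncertain significance",
    "Conflicting interpretations of pathogenicity",
    "Likely benign",
]
flags = {label: is_pathogenic(label) for label in labels}
for label, flag in flags.items():
    print(f"{label}: {flag}")
```

Exact comparison of each part also keeps labels that merely contain the word, such as conflicting-interpretation entries, out of the pathogenic set.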

ClinicalTrials.gov

US National Library of Medicine registry of clinical studies. Provides structured data on trial design, endpoints, interventions, conditions, sponsor, and status. Useful for identifying which biomarkers are being tested in active trials.

Dataset	Key Columns	Description
`all_studies`	`nctId`, `briefTitle`, `conditions`, `interventions`, `primaryOutcomes`, `phase`, `status`	Full trial registry (~500K studies)
`biomarker_studies`	`nctId`, `briefTitle`, `conditions`, `interventions`, `biomarkers`	Filtered subset with biomarker mentions

Disease identifiers: MeSH terms in conditions; join via disease_map.mesh_id

Trial identifiers: NCT ID (NCT04128436)

uv run bioingest download clinical_trials --all
-- Athena
SELECT nctId, briefTitle, conditions, phase FROM bioingest.clinical_trials__all_studies
WHERE conditions LIKE '%lung cancer%' AND interventions LIKE '%pembrolizumab%';

TTD (Therapeutic Target Database)

Curated database of therapeutic protein and nucleic acid targets, drugs directed at each target, and corresponding diseases. Includes clinical status of drug-target pairs. Also provides the raw data for drug_map.

Dataset	Key Columns	Description
`targets`	`TargetID`, `Name`, `Type`, `UniProtID`, `Function`	Therapeutic targets
`drugs`	`DrugID`, `Name`, `HighestStatus`, `Targets`, `Indications`	Drug pipeline data

Disease identifiers: ICD codes, free-text indications; manual mapping to disease_map

Target identifiers: TTD target ID, UniProt accession; join via protein_map.uniprot_id

uv run bioingest download ttd --all
-- Athena
SELECT Name, HighestStatus, Indications FROM bioingest.ttd__drugs WHERE Targets LIKE '%EGFR%';

MarkerDB

Curated collection of molecular biomarkers linked to conditions, body fluids, and clinical applications. Covers protein, chemical, and genetic markers with evidence levels.

Dataset	Key Columns	Description
`proteins`	`biomarker_name`, `uniprot_id`, `condition`, `biofluid`, `application`	Protein biomarkers
`chemicals`	`biomarker_name`, `hmdb_id`, `condition`, `biofluid`	Chemical/metabolite biomarkers
`conditions`	`condition_name`, `markers`, `category`	Condition → marker index

Disease identifiers: Free-text condition names; text match to disease_map.mondo_name

Protein identifiers: UniProt accession; join via protein_map.uniprot_id

uv run bioingest download markerdb --all
-- Athena
SELECT biomarker_name, condition, biofluid, application FROM bioingest.markerdb__proteins
WHERE condition LIKE '%cardiovascular%';
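
Matching free-text condition names against disease_map.mondo_name tends to fail on differences in case, punctuation, and spacing. One way to reduce that is to normalize both sides before comparing, as in the minimal Python sketch below. The helpers `normalize` and `match_conditions` are hypothetical, and the names in the example are illustrative rather than taken from MarkerDB or disease_map.

```python
import re

def normalize(name: str) -> str:
    """Lowercase, turn runs of non-alphanumerics into one space, and trim."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()

def match_conditions(conditions, mondo_names):
    """Map each free-text condition to a matching mondo_name, or None."""
    lookup = {normalize(name): name for name in mondo_names}
    return {cond: lookup.get(normalize(cond)) for cond in conditions}

# Illustrative inputs only.
mondo_names = ["Alzheimer disease", "type 2 diabetes mellitus"]
conditions = ["Alzheimer Disease", "Type-2 diabetes mellitus ", "Heart failure"]

matches = match_conditions(conditions, mondo_names)
for condition, mondo_name in matches.items():
    print(repr(condition), "->", mondo_name)
```

Conditions that come back as `None` are the ones that still need synonym lookup or manual curation; exact matching after normalization will not catch different wordings of the same disease.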