Disease & Association Databases

Open Targets

The Open Targets Platform aggregates target-disease-drug evidence from 20+ data sources and scores each target-disease association on a 0–1 scale. It provides pre-computed Parquet datasets covering targets, diseases, evidence, and drugs, which makes it the richest single source in this group for drug target validation.

| Dataset | Key Columns | Description |
|---|---|---|
| targets | id, approvedSymbol, approvedName, biotype, tractability | Target catalog (~63K entries) |
| associations | targetId, diseaseId, score, datatypeScores | Target-disease association scores |
| drugs | id, name, drugType, mechanismsOfAction, indications | Drug catalog with mechanisms of action (MOA) |

Disease identifiers: EFO IDs (e.g. EFO_0000311); join via disease_map.efo_id
Target identifiers: Ensembl gene IDs (e.g. ENSG00000146648); join via protein_map.ensembl_gene_id

uv run bioingest download opentargets --all

-- Athena: top disease associations for EGFR
SELECT diseaseId, score FROM bioingest.opentargets__associations
WHERE targetId = 'ENSG00000146648' ORDER BY score DESC LIMIT 20;
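The association query returns only opaque disease IDs, so in practice it is usually paired with the disease_map join described above. The following is a minimal sketch of that join, run against an in-memory SQLite database instead of Athena so that it is self-contained. It assumes disease_map is a flat table with one row per disease and efo_id and mondo_name columns; the disease IDs, names, and scores are made-up placeholders, not Open Targets data.

```python
import sqlite3

# In-memory stand-in for the Athena tables. The disease_map layout is an
# assumption and every disease row below is an invented placeholder.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE opentargets__associations (targetId TEXT, diseaseId TEXT, score REAL);
CREATE TABLE disease_map (efo_id TEXT, mondo_name TEXT);
INSERT INTO opentargets__associations VALUES
  ('ENSG00000146648', 'EFO_PLACEHOLDER_1', 0.9),
  ('ENSG00000146648', 'EFO_PLACEHOLDER_2', 0.4),
  ('ENSG00000146648', 'MONDO_PLACEHOLDER', 0.2),
  ('ENSG_PLACEHOLDER', 'EFO_PLACEHOLDER_1', 0.7);
INSERT INTO disease_map VALUES
  ('EFO_PLACEHOLDER_1', 'placeholder disease A'),
  ('EFO_PLACEHOLDER_2', 'placeholder disease B');
""")

# LEFT JOIN keeps associations whose diseaseId has no disease_map row,
# so unmapped diseases show up with a NULL name instead of disappearing.
rows = conn.execute(
    """
    SELECT a.diseaseId, d.mondo_name, a.score
    FROM opentargets__associations AS a
    LEFT JOIN disease_map AS d ON d.efo_id = a.diseaseId
    WHERE a.targetId = ?
    ORDER BY a.score DESC
    LIMIT 20
    """,
    ("ENSG00000146648",),
).fetchall()

for disease_id, name, score in rows:
    print(disease_id, name, score)
```

The same SELECT should carry over to Athena once the table names are prefixed with the bioingest schema and the `?` placeholder (SQLite syntax) is replaced by a literal. A LEFT JOIN is used because Open Targets disease IDs are not exclusively EFO terms, so some rows may have no match on efo_id.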

DISEASES 2.0

JensenLab's gene-disease associations, drawn from three channels: curated knowledge, experimental evidence, and text mining. Each association carries a confidence score on a 0–5 scale. It is useful for broad disease enrichment beyond what Open Targets covers.

| Dataset | Key Columns | Description |
|---|---|---|
| knowledge | gene_id, gene_name, disease_id, disease_name, confidence | Curated associations |
| experiments | gene_id, gene_name, disease_id, disease_name, confidence | Experimental evidence |
| textmining | gene_id, gene_name, disease_id, disease_name, zscore | Text-mined co-mentions |

Disease identifiers: DOID (e.g. DOID:9351); join via disease_map.doid
Gene identifiers: Ensembl gene ID or HGNC symbol; join via protein_map.ensembl_gene_id or protein_map.gene_name

uv run bioingest download diseases --all

-- Athena: curated disease associations for EGFR
SELECT gene_name, disease_name, confidence FROM bioingest.diseases__knowledge
WHERE gene_name = 'EGFR' ORDER BY confidence DESC;

ClinVar

NCBI's archive of relationships between human variation and observed health phenotypes. It reports pathogenicity classifications (pathogenic, likely pathogenic, uncertain significance (VUS), benign) for variants in the context of specific conditions.

| Dataset | Key Columns | Description |
|---|---|---|
| variant_summary | GeneSymbol, ClinicalSignificance, PhenotypeIDS, Assembly, Chromosome, Start, ReferenceAllele, AlternateAllele | All submitted variant interpretations |
| gene_condition_source_id | GeneSymbol, ConceptID, DiseaseName, SourceName | Gene-condition mapping with MedGen/OMIM IDs |

Disease identifiers: MedGen CUI, OMIM ID; join via disease_map.medgen_id
Gene identifiers: HGNC symbol; join via protein_map.gene_name

uv run bioingest download clinvar --all

-- Athena: pathogenic variants in BRCA1
-- variant_summary repeats each variant once per genome assembly, so restrict to one
SELECT GeneSymbol, ClinicalSignificance, PhenotypeIDS FROM bioingest.clinvar__variant_summary
WHERE GeneSymbol = 'BRCA1' AND ClinicalSignificance LIKE '%Pathogenic%' AND Assembly = 'GRCh38';
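PhenotypeIDS packs several database cross-references into one string, so the MedGen IDs have to be pulled out before they can be joined to disease_map.medgen_id. The sketch below does this with a regular expression. It assumes the tokens look like `MedGen:<id>` and are separated by commas, pipes, or semicolons, and that disease_map.medgen_id stores the bare ID without the prefix; the helper name and the example string are hypothetical, with invented IDs.

```python
import re

# Assumed token shape: the prefix "MedGen:" followed by an alphanumeric ID.
MEDGEN_TOKEN = re.compile(r"MedGen:([A-Za-z0-9]+)")


def medgen_ids(phenotype_ids):
    """Return the distinct MedGen IDs in a PhenotypeIDS value, in first-seen order."""
    seen = {}
    for match in MEDGEN_TOKEN.finditer(phenotype_ids or ""):
        seen.setdefault(match.group(1), None)  # dict keys preserve insertion order
    return list(seen)


# Invented example value; the IDs are placeholders, not real ClinVar data.
example = "MONDO:MONDO:0000001,MedGen:C0000001,OMIM:000001|MedGen:CN000002;MedGen:C0000001"
print(medgen_ids(example))
```

Matching on the `MedGen:` prefix rather than splitting on delimiters keeps the helper indifferent to which separator a given row uses. Empty or missing values yield an empty list.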

ClinicalTrials.gov

The US National Library of Medicine's registry of clinical studies. It provides structured data on trial design, endpoints, interventions, conditions, sponsor, and status, and is useful for identifying which biomarkers are being tested in active trials.

| Dataset | Key Columns | Description |
|---|---|---|
| all_studies | nctId, briefTitle, conditions, interventions, primaryOutcomes, phase, status | Full trial registry (~500K studies) |
| biomarker_studies | nctId, briefTitle, conditions, interventions, biomarkers | Filtered subset with biomarker mentions |

Disease identifiers: MeSH terms in conditions; join via disease_map.mesh_id
Trial identifiers: NCT ID (e.g. NCT04128436)

uv run bioingest download clinical_trials --all

-- Athena: lung cancer trials involving pembrolizumab (LIKE is case-sensitive, so lowercase first)
SELECT nctId, briefTitle, conditions, phase FROM bioingest.clinical_trials__all_studies
WHERE lower(conditions) LIKE '%lung cancer%' AND lower(interventions) LIKE '%pembrolizumab%';
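When trial references arrive as free text (notes, abstracts, spreadsheets), the NCT IDs need to be extracted before they can be matched against nctId. A minimal sketch, relying only on the fact that an NCT ID is the prefix NCT followed by eight digits; the function name is hypothetical.

```python
import re

# "NCT" followed by exactly eight digits, bounded so longer digit runs are rejected.
NCT_ID = re.compile(r"\bNCT\d{8}\b")


def find_nct_ids(text):
    """Return the distinct NCT IDs mentioned in text, sorted."""
    return sorted(set(NCT_ID.findall(text)))


print(find_nct_ids("See NCT04128436 and NCT04128436; NCT123 is not a valid ID."))
```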

TTD (Therapeutic Target Database)

A curated database of therapeutic protein and nucleic acid targets, the drugs directed at each target, and the corresponding diseases. It includes the clinical status of each drug-target pair and also supplies the raw data for drug_map.

| Dataset | Key Columns | Description |
|---|---|---|
| targets | TargetID, Name, Type, UniProtID, Function | Therapeutic targets |
| drugs | DrugID, Name, HighestStatus, Targets, Indications | Drug pipeline data |

Disease identifiers: ICD codes, free-text indications; map manually to disease_map
Target identifiers: TTD target ID, UniProt accession; join via protein_map.uniprot_id

uv run bioingest download ttd --all

-- Athena: drugs whose target list mentions EGFR
SELECT Name, HighestStatus, Indications FROM bioingest.ttd__drugs WHERE Targets LIKE '%EGFR%';

MarkerDB

A curated collection of molecular biomarkers linked to conditions, body fluids, and clinical applications. It covers protein, chemical, and genetic markers, each with an evidence level.

| Dataset | Key Columns | Description |
|---|---|---|
| proteins | biomarker_name, uniprot_id, condition, biofluid, application | Protein biomarkers |
| chemicals | biomarker_name, hmdb_id, condition, biofluid | Chemical/metabolite biomarkers |
| conditions | condition_name, markers, category | Condition-to-marker index |

Disease identifiers: free-text condition names; match by text against disease_map.mondo_name
Protein identifiers: UniProt accession; join via protein_map.uniprot_id

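uv run bioingest download markerdb --all

-- Athena: protein biomarkers for cardiovascular conditions (LIKE is case-sensitive, so lowercase first)
SELECT biomarker_name, condition, biofluid, application FROM bioingest.markerdb__proteins
WHERE lower(condition) LIKE '%cardiovascular%';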
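Because MarkerDB conditions are free text, matching them to disease_map.mondo_name generally needs some normalization first. The sketch below shows one simple approach: lowercase both sides, strip accents and punctuation, and compare the results exactly. The function names are hypothetical and the example names are illustrative inputs, not rows taken from either table.

```python
import re
import unicodedata


def normalize_name(name):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def match_conditions(conditions, mondo_names):
    """Map each condition to the mondo_name with the same normalized form, else None."""
    index = {normalize_name(n): n for n in mondo_names}
    return {c: index.get(normalize_name(c)) for c in conditions}


# Illustrative inputs only.
mondo_names = ["Crohn disease", "type 2 diabetes mellitus"]
conditions = ["Crohn Disease", "Type-2 Diabetes Mellitus", "Heart failure"]
print(match_conditions(conditions, mondo_names))
```

Exact matching on the normalized form is deliberately conservative: it handles case and punctuation differences but not synonyms or possessives (for example "Crohn's disease" would not match "Crohn disease"), so unmatched conditions come back as None and still need manual review.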