Disease & Association Databases
Open Targets
The Open Targets Platform aggregates target-disease-drug evidence from 20+ data sources and assigns each association an overall score between 0 and 1. It provides pre-computed Parquet datasets covering targets, diseases, evidence, and drugs, which makes it the richest single source listed here for drug target validation.
| Dataset | Key Columns | Description |
|---|---|---|
| targets | id, approvedSymbol, approvedName, biotype, tractability | Target catalog (~63K entries) |
| associations | targetId, diseaseId, score, datatypeScores | Target-disease association scores |
| drugs | id, name, drugType, mechanismsOfAction, indications | Drug catalog with MOA |
Disease identifiers: EFO IDs (EFO_0000311) — join via disease_map.efo_id
Target identifiers: Ensembl gene IDs (ENSG00000146648) — join via protein_map.ensembl_gene_id
```bash
uv run bioingest download opentargets --all
```

```sql
-- Athena: top disease associations for EGFR
SELECT diseaseId, score FROM bioingest.opentargets__associations
WHERE targetId = 'ENSG00000146648' ORDER BY score DESC LIMIT 20;
```
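The identifier joins described above (EFO ID to `disease_map.efo_id`, Ensembl gene ID to `protein_map.ensembl_gene_id`) can be tried locally before running anything on Athena. The following is a minimal sketch using Python's built-in `sqlite3` module. The table layouts are simplified assumptions, the IDs containing `TOY` are hypothetical, and the scores are invented for illustration rather than taken from Open Targets.

```python
import sqlite3

# Toy in-memory tables. Layouts and rows are illustrative assumptions;
# the scores are invented, not real Open Targets values.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE associations (targetId TEXT, diseaseId TEXT, score REAL);
CREATE TABLE disease_map (efo_id TEXT, mondo_name TEXT);
CREATE TABLE protein_map (ensembl_gene_id TEXT, gene_name TEXT);
INSERT INTO associations VALUES
  ('ENSG00000146648', 'EFO_0000311', 0.50),
  ('ENSG00000146648', 'EFO_TOY_0001', 0.75),
  ('ENSG_TOY_0002', 'EFO_0000311', 0.90);
INSERT INTO disease_map VALUES
  ('EFO_0000311', 'cancer'),
  ('EFO_TOY_0001', 'toy disease');
INSERT INTO protein_map VALUES
  ('ENSG00000146648', 'EGFR'),
  ('ENSG_TOY_0002', 'TOYGENE');
""")


def top_associations(conn, gene_name, limit=20):
    """Return (gene_name, disease name, score) rows, highest score first."""
    query = """
        SELECT p.gene_name, d.mondo_name, a.score
        FROM associations AS a
        JOIN protein_map AS p ON a.targetId = p.ensembl_gene_id
        JOIN disease_map AS d ON a.diseaseId = d.efo_id
        WHERE p.gene_name = ?
        ORDER BY a.score DESC
        LIMIT ?
    """
    return conn.execute(query, (gene_name, limit)).fetchall()


rows = top_associations(conn, "EGFR")
for row in rows:
    print(row)
```

The same two `JOIN` clauses should carry over to an Athena query against `bioingest.opentargets__associations`, assuming the map tables are available in the same database.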
DISEASES 2.0
JensenLab's gene-disease associations from three channels: curated knowledge, experimental evidence, and text-mining. Each association includes a confidence score (0–5 scale). Useful for broad disease enrichment beyond Open Targets.
| Dataset | Key Columns | Description |
|---|---|---|
| knowledge | gene_id, gene_name, disease_id, disease_name, confidence | Curated associations |
| experiments | gene_id, gene_name, disease_id, disease_name, confidence | Experimental evidence |
| textmining | gene_id, gene_name, disease_id, disease_name, zscore | Text-mined co-mentions |
Disease identifiers: DOID (DOID:9351) — join via disease_map.doid
Gene identifiers: Ensembl gene ID or HGNC symbol — join via protein_map.ensembl_gene_id or protein_map.gene_name
```bash
uv run bioingest download diseases --all
```

```sql
-- Athena: curated disease associations for EGFR
SELECT gene_name, disease_name, confidence FROM bioingest.diseases__knowledge
WHERE gene_name = 'EGFR' ORDER BY confidence DESC;
```
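Because a DISEASES gene identifier may be either an Ensembl gene ID or an HGNC symbol, a join needs to pick the matching `protein_map` column for each value. A minimal sketch, assuming human Ensembl gene IDs (the prefix `ENSG` followed by 11 digits) and treating everything else as a symbol; the function name `join_column` is hypothetical:

```python
import re

# Human Ensembl gene IDs: "ENSG" followed by 11 digits.
ENSEMBL_GENE_RE = re.compile(r"^ENSG\d{11}$")


def join_column(gene_id):
    """Pick the protein_map column to join on for a DISEASES gene identifier."""
    if ENSEMBL_GENE_RE.match(gene_id.strip()):
        return "ensembl_gene_id"
    # Anything else is assumed to be an HGNC symbol.
    return "gene_name"


for value in ("ENSG00000146648", "EGFR"):
    print(value, "->", join_column(value))
```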
ClinVar
NCBI's archive of relationships between human genetic variation and observed health phenotypes. Reports pathogenicity classifications (pathogenic, likely pathogenic, uncertain significance (VUS), benign) for variants in the context of specific conditions.
| Dataset | Key Columns | Description |
|---|---|---|
| variant_summary | GeneSymbol, ClinicalSignificance, PhenotypeIDS, Assembly, Chromosome, Start, ReferenceAllele, AlternateAllele | All submitted variant interpretations |
| gene_condition_source_id | GeneSymbol, ConceptID, DiseaseName, SourceName | Gene-condition mapping with MedGen/OMIM IDs |
Disease identifiers: MedGen CUI, OMIM ID — join via disease_map.medgen_id
Gene identifiers: HGNC symbol — join via protein_map.gene_name
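```bash
uv run bioingest download clinvar --all
```

```sql
-- Athena: pathogenic variants in BRCA1
-- Note: upstream variant_summary lists a variant once per Assembly (GRCh37, GRCh38);
-- add "AND Assembly = 'GRCh38'" if duplicate rows are a problem.
SELECT GeneSymbol, ClinicalSignificance, PhenotypeIDS FROM bioingest.clinvar__variant_summary
WHERE GeneSymbol = 'BRCA1' AND ClinicalSignificance LIKE '%Pathogenic%';
```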
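To join on `disease_map.medgen_id`, the MedGen CUIs first have to be pulled out of the packed `PhenotypeIDS` string. The sketch below assumes the layout commonly seen in ClinVar's `variant_summary` file: conditions separated by `|` or `;`, cross-references within a condition separated by `,`, and each cross-reference written as `Source:ID`. The delimiters should be checked against the ingested data, and the IDs in the example string are invented.

```python
import re


def parse_phenotype_ids(field):
    """Split a PhenotypeIDS string into one {source: [ids]} dict per condition."""
    conditions = []
    # Assumed layout: conditions split on '|' or ';', cross-references on ','.
    for chunk in re.split(r"[|;]", field):
        xrefs = {}
        for ref in chunk.split(","):
            # Split on the first ':' only, so "MONDO:MONDO:0000001" keeps its full ID.
            source, sep, ident = ref.strip().partition(":")
            if sep and ident:
                xrefs.setdefault(source, []).append(ident)
        if xrefs:
            conditions.append(xrefs)
    return conditions


def medgen_ids(field):
    """Collect every MedGen CUI mentioned in a PhenotypeIDS string."""
    return [cui for cond in parse_phenotype_ids(field) for cui in cond.get("MedGen", [])]


# Invented identifiers, for illustration only.
example = "MedGen:C0000001,OMIM:000001|MedGen:C0000002"
print(parse_phenotype_ids(example))
print(medgen_ids(example))
```

Entries without a `Source:ID` shape (for example a bare placeholder such as `na`) are skipped rather than raising an error.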
ClinicalTrials.gov
US National Library of Medicine registry of clinical studies. Provides structured data on trial design, endpoints, interventions, conditions, sponsor, and status. Useful for identifying which biomarkers are being tested in active trials.
| Dataset | Key Columns | Description |
|---|---|---|
| all_studies | nctId, briefTitle, conditions, interventions, primaryOutcomes, phase, status | Full trial registry (~500K studies) |
| biomarker_studies | nctId, briefTitle, conditions, interventions, biomarkers | Filtered subset with biomarker mentions |
Disease identifiers: MeSH terms in conditions — join via disease_map.mesh_id
Trial identifiers: NCT ID (NCT04128436)
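```bash
uv run bioingest download clinical_trials --all
```

```sql
-- Athena: lung cancer trials that involve pembrolizumab
-- LIKE is case-sensitive in Athena, so lower() is applied to match "Lung Cancer" as well
SELECT nctId, briefTitle, conditions, phase FROM bioingest.clinical_trials__all_studies
WHERE lower(conditions) LIKE '%lung cancer%' AND lower(interventions) LIKE '%pembrolizumab%';
```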
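NCT IDs have a fixed shape (the prefix `NCT` followed by eight digits), so they can be extracted from free text such as notes or publication abstracts and then used to look up rows by `nctId`. A minimal sketch; the helper name `extract_nct_ids` is hypothetical:

```python
import re

# "NCT" followed by exactly eight digits, bounded on both sides.
NCT_RE = re.compile(r"\bNCT\d{8}\b")


def extract_nct_ids(text):
    """Return unique NCT IDs in order of first appearance."""
    # dict.fromkeys keeps insertion order while dropping repeats.
    return list(dict.fromkeys(NCT_RE.findall(text)))


ids = extract_nct_ids("See NCT04128436 and again NCT04128436; NCT123 is too short.")
print(ids)
```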
TTD (Therapeutic Target Database)
Curated database of therapeutic protein and nucleic acid targets, drugs directed at each target, and corresponding diseases. Includes clinical status of drug-target pairs. Also provides the raw data for drug_map.
| Dataset | Key Columns | Description |
|---|---|---|
| targets | TargetID, Name, Type, UniProtID, Function | Therapeutic targets |
| drugs | DrugID, Name, HighestStatus, Targets, Indications | Drug pipeline data |
Disease identifiers: ICD codes, free-text indications — manual mapping to disease_map
Target identifiers: TTD target ID, UniProt accession — join via protein_map.uniprot_id
```bash
uv run bioingest download ttd --all
```

```sql
-- Athena: drugs whose target list mentions EGFR
SELECT Name, HighestStatus, Indications FROM bioingest.ttd__drugs
WHERE Targets LIKE '%EGFR%';
```
MarkerDB
Curated collection of molecular biomarkers linked to conditions, body fluids, and clinical applications. Covers protein, chemical, and genetic markers with evidence levels.
| Dataset | Key Columns | Description |
|---|---|---|
| proteins | biomarker_name, uniprot_id, condition, biofluid, application | Protein biomarkers |
| chemicals | biomarker_name, hmdb_id, condition, biofluid | Chemical/metabolite biomarkers |
| conditions | condition_name, markers, category | Condition → marker index |
Disease identifiers: Free-text condition names — text match to disease_map.mondo_name
Protein identifiers: UniProt accession — join via protein_map.uniprot_id
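Because MarkerDB conditions are free text, matching them to `disease_map.mondo_name` usually needs some normalization first. The sketch below shows one simple approach, assuming exact matching after lowercasing and punctuation stripping is good enough as a first pass; the helper names and the example condition strings are made up for illustration, and fuzzier matching or synonym lists may be needed in practice.

```python
import re


def normalize(name):
    """Lowercase, turn runs of non-alphanumerics into spaces, trim the ends."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", name.lower()).split())


def match_conditions(conditions, mondo_names):
    """Map each free-text condition to the mondo_name sharing its normalized form, else None."""
    index = {normalize(n): n for n in mondo_names}
    return {c: index.get(normalize(c)) for c in conditions}


# Made-up inputs, for illustration only.
matches = match_conditions(
    ["Type-2 Diabetes Mellitus", "Breast  Cancer", "unlisted condition"],
    ["type 2 diabetes mellitus", "breast cancer"],
)
for condition, mondo_name in matches.items():
    print(repr(condition), "->", mondo_name)
```

Conditions that map to `None` are the ones left for manual review. Note that this normalization drops non-ASCII letters, which may matter for eponymous disease names.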