Skip to content

Drug & Chemistry Databases

ChEMBL

EMBL-EBI's manually curated bioactivity database containing binding, functional, and ADMET data for drug-like molecules. Provides the chembl_target_id column in protein_map via its UniProt mapping file. The full SQLite dump enables complex queries on assay results and compound properties.

Dataset Key Columns Description
uniprot_mapping UniProt_Accession, ChEMBL_ID, Target_Name, Target_Type UniProt → ChEMBL target ID mapping
chembl_sqlite Full relational DB: compound_structures, activities, assays, target_dictionary Complete ChEMBL database

Identifiers: ChEMBL target ID (CHEMBL203), ChEMBL compound ID (CHEMBL941) Join key: uniprot_mapping.UniProt_Accessionprotein_map.uniprot_id; also protein_map.chembl_target_id

uv run bioingest download chembl --all
# Athena — targets with ChEMBL IDs
SELECT UniProt_Accession, ChEMBL_ID, Target_Name FROM bioingest.chembl__uniprot_mapping
WHERE Target_Name LIKE '%kinase%' LIMIT 50;

PubChem

NCBI's open chemistry database with >100M compounds. The bioingest datasets focus on compound-target interactions and physicochemical properties, complementing ChEMBL with additional bioactivity data from depositor assays.

Dataset Key Columns Description
compound_target cid, gene_symbol, uniprot_id, activity_type, activity_value, assay_id Compound-protein interactions
compound_properties cid, molecular_weight, xlogp, tpsa, rotatable_bonds, inchi Physicochemical descriptors

Identifiers: PubChem CID (2244 = aspirin), UniProt accession, gene symbol Join key: compound_target.uniprot_idprotein_map.uniprot_id

uv run bioingest download pubchem --all
# Athena — compounds active against EGFR
SELECT cid, activity_type, activity_value FROM bioingest.pubchem__compound_target
WHERE uniprot_id = 'P00533' AND activity_value < 100;

OpenFDA

FDA's public drug adverse event and labeling data. The drug_event dataset contains FAERS (FDA Adverse Event Reporting System) reports linking drugs to reported reactions. Useful for safety signal detection and post-market surveillance analysis.

Dataset Key Columns Description
drug_event safetyreportid, drug_name, reaction, serious, receivedate, patient_age Adverse event reports (~28M)
drug_label brand_name, generic_name, indications_and_usage, warnings, active_ingredient Structured product labels

Identifiers: Drug names (free text), NDC codes Join strategy: Text match drug_namedrug_map.drug_name; no direct protein join (drug-level data)

uv run bioingest download openfda --all
# Athena — adverse events for pembrolizumab
SELECT reaction, COUNT(*) as n FROM bioingest.openfda__drug_event
WHERE drug_name LIKE '%pembrolizumab%' GROUP BY reaction ORDER BY n DESC LIMIT 20;