Drug & Chemistry Databases¶
ChEMBL¶
EMBL-EBI's manually curated bioactivity database containing binding, functional, and ADMET data for drug-like molecules. Provides the chembl_target_id column in protein_map via its UniProt mapping file. The full SQLite dump enables complex queries on assay results and compound properties.
| Dataset | Key Columns | Description |
|---|---|---|
uniprot_mapping |
UniProt_Accession, ChEMBL_ID, Target_Name, Target_Type |
UniProt → ChEMBL target ID mapping |
chembl_sqlite |
Full relational DB: compound_structures, activities, assays, target_dictionary |
Complete ChEMBL database |
Identifiers: ChEMBL target ID (CHEMBL203), ChEMBL compound ID (CHEMBL941)
Join key: uniprot_mapping.UniProt_Accession → protein_map.uniprot_id; also protein_map.chembl_target_id
uv run bioingest download chembl --all
# Athena — targets with ChEMBL IDs
SELECT UniProt_Accession, ChEMBL_ID, Target_Name FROM bioingest.chembl__uniprot_mapping
WHERE Target_Name LIKE '%kinase%' LIMIT 50;
PubChem¶
NCBI's open chemistry database with >100M compounds. The bioingest datasets focus on compound-target interactions and physicochemical properties, complementing ChEMBL with additional bioactivity data from depositor assays.
| Dataset | Key Columns | Description |
|---|---|---|
compound_target |
cid, gene_symbol, uniprot_id, activity_type, activity_value, assay_id |
Compound-protein interactions |
compound_properties |
cid, molecular_weight, xlogp, tpsa, rotatable_bonds, inchi |
Physicochemical descriptors |
Identifiers: PubChem CID (2244 = aspirin), UniProt accession, gene symbol
Join key: compound_target.uniprot_id → protein_map.uniprot_id
uv run bioingest download pubchem --all
# Athena — compounds active against EGFR
SELECT cid, activity_type, activity_value FROM bioingest.pubchem__compound_target
WHERE uniprot_id = 'P00533' AND activity_value < 100;
OpenFDA¶
FDA's public drug adverse event and labeling data. The drug_event dataset contains FAERS (FDA Adverse Event Reporting System) reports linking drugs to reported reactions. Useful for safety signal detection and post-market surveillance analysis.
| Dataset | Key Columns | Description |
|---|---|---|
drug_event |
safetyreportid, drug_name, reaction, serious, receivedate, patient_age |
Adverse event reports (~28M) |
drug_label |
brand_name, generic_name, indications_and_usage, warnings, active_ingredient |
Structured product labels |
Identifiers: Drug names (free text), NDC codes
Join strategy: Text match drug_name → drug_map.drug_name; no direct protein join (drug-level data)