Mapping Tables¶
The mapping layer links all 28 sources through shared identifiers. Three Parquet tables serve as the bridge; six Athena views join sources automatically.
protein_map (26,499 rows)¶
Built from UniProt idmapping_human.tsv + swissprot_tsv + ChEMBL mapping.
| Column | Type | Example | Source |
|---|---|---|---|
uniprot_id |
string | P00533 |
UniProt accession (canonical) |
gene_name |
string | EGFR |
UniProt gene_names primary |
protein_name |
string | Epidermal growth factor receptor |
UniProt protein_names |
ensembl_gene_id |
string | ENSG00000146648 |
idmapping_human |
ensembl_protein_id |
string | ENSP00000275493 |
idmapping_human |
string_id |
string | 9606.ENSP00000275493 |
idmapping_human (STRING type) |
entrez_gene_id |
string | 1956 |
idmapping_human |
hgnc_id |
string | HGNC:3236 |
idmapping_human |
chembl_target_id |
string | CHEMBL203 |
chembl_uniprot_mapping |
disease_map (31,884 rows)¶
Built from mondo.obo cross-references.
| Column | Type | Example | Source |
|---|---|---|---|
mondo_id |
string | MONDO:0005015 |
MONDO term ID (canonical) |
mondo_name |
string | diabetes mellitus |
MONDO label |
disease_category |
string | metabolic |
Classified from ancestors |
efo_id |
string | EFO:0000400 |
MONDO xref |
doid |
string | DOID:9351 |
MONDO xref |
mesh_id |
string | MESH:D003920 |
MONDO xref |
icd10_code |
string | ICD10CM:E08-E13 |
MONDO xref |
medgen_id |
string | MEDGEN:8350 |
MONDO xref |
meddra_id |
string | MEDDRA:10012601 |
MONDO xref |
synonyms |
string | DM\|sugar diabetes |
Pipe-delimited |
Disease categories: cancer, cardiovascular, neurodegenerative, metabolic, autoimmune, infectious, respiratory, other.
drug_map (42,939 rows)¶
Built from TTD drug download file.
| Column | Type | Example | Source |
|---|---|---|---|
drug_id |
string | D00001 |
TTD drug ID |
drug_name |
string | Imatinib |
TTD DRUGNAME field |
targets |
string | T12345\|T67890 |
Pipe-delimited TTD target IDs |
indications |
string | CML\|GIST |
Pipe-delimited indications |
source |
string | TTD |
Always "TTD" |
Build All + Graph¶
bioingest map build-all # builds all 3 tables + mapping graph JSONL
bioingest map stats # show row counts
bioingest map publish # upload Parquet to S3
bioingest map create-views # deploy unified Athena views
Cross-Source Join Example¶
-- Find all proteins associated with Alzheimer's across ALL sources
SELECT uniprot_id, gene_name, source, score
FROM bioingest.unified_protein_disease
WHERE mondo_name LIKE '%Alzheimer%'
ORDER BY score DESC;
The 6 Unified Athena Views¶
1. unified_protein_disease¶
Joins gene-disease associations from Open Targets, DISEASES 2.0 (knowledge + experiments), ClinVar, and MarkerDB.
SELECT uniprot_id, gene_name, mondo_id, mondo_name, disease_category, source, score
FROM bioingest.unified_protein_disease
WHERE gene_name = 'EGFR';
2. unified_ppi¶
Protein-protein interactions from STRING (combined_score ≥ 700).
SELECT protein1_uniprot, protein1_name, protein2_uniprot, protein2_name, score
FROM bioingest.unified_ppi
WHERE protein1_name = 'EGFR' OR protein2_name = 'EGFR';
3. unified_protein_pathway¶
Protein-pathway membership from Reactome (human only).
SELECT uniprot_id, gene_name, pathway_id, pathway_name
FROM bioingest.unified_protein_pathway
WHERE gene_name = 'TP53';
4. unified_drug_target¶
Drug-target links from ChEMBL and TTD.
SELECT uniprot_id, gene_name, chembl_id, target_type, source
FROM bioingest.unified_drug_target
WHERE gene_name = 'EGFR';
5. unified_expression¶
Tissue expression from GTEx (TPM > 1.0).
SELECT uniprot_id, gene_name, tissue, tpm
FROM bioingest.unified_expression
WHERE gene_name = 'INS'
ORDER BY tpm DESC;
6. protein_annotations¶
Summary view with all IDs + association counts per protein.
SELECT * FROM bioingest.protein_annotations
WHERE gene_name = 'BRCA1';
-- Returns: all IDs + disease_associations count + interactions count + pathways count
Mapping Graph (201K edges)¶
The mapping graph encodes identity relationships as triples for Neptune/Neo4j.
| Relationship | Meaning | Source |
|---|---|---|
SAME_AS |
Two IDs refer to the same entity | UniProt idmapping, ChEMBL mapping, MONDO xrefs |
BROADER_THAN |
Parent encompasses child | MONDO is_a hierarchy |
bioingest map build-all # generates data/mapping/mapping_graph.jsonl
bioingest map push-graph # push to Neptune (requires NEPTUNE_ENDPOINT)
Triple format (JSONL):
{"source_id": "P00533", "source_ns": "uniprot", "target_id": "ENSG00000146648", "target_ns": "ensembl_gene", "rel_type": "SAME_AS", "evidence": "uniprot_idmapping"}
{"source_id": "MONDO:0004992", "source_ns": "mondo", "target_id": "MONDO:0005015", "target_ns": "mondo", "rel_type": "BROADER_THAN", "evidence": "mondo_is_a"}
Node labels assigned by namespace: uniprot→Protein, ensembl_gene→Gene, string→Protein, chembl→Target, mondo→Disease, efo→Disease, mesh→Disease.