Skip to content

Mapping Tables

The mapping layer links all 28 sources through shared identifiers. Three Parquet tables serve as the bridge; six Athena views join sources automatically.

protein_map (26,499 rows)

Built from UniProt idmapping_human.tsv + swissprot_tsv + ChEMBL mapping.

Column Type Example Source
uniprot_id string P00533 UniProt accession (canonical)
gene_name string EGFR UniProt gene_names primary
protein_name string Epidermal growth factor receptor UniProt protein_names
ensembl_gene_id string ENSG00000146648 idmapping_human
ensembl_protein_id string ENSP00000275493 idmapping_human
string_id string 9606.ENSP00000275493 idmapping_human (STRING type)
entrez_gene_id string 1956 idmapping_human
hgnc_id string HGNC:3236 idmapping_human
chembl_target_id string CHEMBL203 chembl_uniprot_mapping
bioingest map build-proteins   # build just protein_map

disease_map (31,884 rows)

Built from mondo.obo cross-references.

Column Type Example Source
mondo_id string MONDO:0005015 MONDO term ID (canonical)
mondo_name string diabetes mellitus MONDO label
disease_category string metabolic Classified from ancestors
efo_id string EFO:0000400 MONDO xref
doid string DOID:9351 MONDO xref
mesh_id string MESH:D003920 MONDO xref
icd10_code string ICD10CM:E08-E13 MONDO xref
medgen_id string MEDGEN:8350 MONDO xref
meddra_id string MEDDRA:10012601 MONDO xref
synonyms string DM\|sugar diabetes Pipe-delimited

Disease categories: cancer, cardiovascular, neurodegenerative, metabolic, autoimmune, infectious, respiratory, other.

bioingest map build-diseases   # build just disease_map

drug_map (42,939 rows)

Built from TTD drug download file.

Column Type Example Source
drug_id string D00001 TTD drug ID
drug_name string Imatinib TTD DRUGNAME field
targets string T12345\|T67890 Pipe-delimited TTD target IDs
indications string CML\|GIST Pipe-delimited indications
source string TTD Always "TTD"
bioingest map build-drugs      # build just drug_map

Build All + Graph

bioingest map build-all        # builds all 3 tables + mapping graph JSONL
bioingest map stats            # show row counts
bioingest map publish          # upload Parquet to S3
bioingest map create-views     # deploy unified Athena views

Cross-Source Join Example

-- Find all proteins associated with Alzheimer's across ALL sources
SELECT uniprot_id, gene_name, source, score
FROM bioingest.unified_protein_disease
WHERE mondo_name LIKE '%Alzheimer%'
ORDER BY score DESC;

The 6 Unified Athena Views

1. unified_protein_disease

Joins gene-disease associations from Open Targets, DISEASES 2.0 (knowledge + experiments), ClinVar, and MarkerDB.

SELECT uniprot_id, gene_name, mondo_id, mondo_name, disease_category, source, score
FROM bioingest.unified_protein_disease
WHERE gene_name = 'EGFR';

2. unified_ppi

Protein-protein interactions from STRING (combined_score ≥ 700).

SELECT protein1_uniprot, protein1_name, protein2_uniprot, protein2_name, score
FROM bioingest.unified_ppi
WHERE protein1_name = 'EGFR' OR protein2_name = 'EGFR';

3. unified_protein_pathway

Protein-pathway membership from Reactome (human only).

SELECT uniprot_id, gene_name, pathway_id, pathway_name
FROM bioingest.unified_protein_pathway
WHERE gene_name = 'TP53';

4. unified_drug_target

Drug-target links from ChEMBL and TTD.

SELECT uniprot_id, gene_name, chembl_id, target_type, source
FROM bioingest.unified_drug_target
WHERE gene_name = 'EGFR';

5. unified_expression

Tissue expression from GTEx (TPM > 1.0).

SELECT uniprot_id, gene_name, tissue, tpm
FROM bioingest.unified_expression
WHERE gene_name = 'INS'
ORDER BY tpm DESC;

6. protein_annotations

Summary view with all IDs + association counts per protein.

SELECT * FROM bioingest.protein_annotations
WHERE gene_name = 'BRCA1';
-- Returns: all IDs + disease_associations count + interactions count + pathways count

Mapping Graph (201K edges)

The mapping graph encodes identity relationships as triples for Neptune/Neo4j.

Relationship Meaning Source
SAME_AS Two IDs refer to the same entity UniProt idmapping, ChEMBL mapping, MONDO xrefs
BROADER_THAN Parent encompasses child MONDO is_a hierarchy
bioingest map build-all          # generates data/mapping/mapping_graph.jsonl
bioingest map push-graph         # push to Neptune (requires NEPTUNE_ENDPOINT)

Triple format (JSONL):

{"source_id": "P00533", "source_ns": "uniprot", "target_id": "ENSG00000146648", "target_ns": "ensembl_gene", "rel_type": "SAME_AS", "evidence": "uniprot_idmapping"}
{"source_id": "MONDO:0004992", "source_ns": "mondo", "target_id": "MONDO:0005015", "target_ns": "mondo", "rel_type": "BROADER_THAN", "evidence": "mondo_is_a"}

Node labels assigned by namespace: uniprot→Protein, ensembl_gene→Gene, string→Protein, chembl→Target, mondo→Disease, efo→Disease, mesh→Disease.