Data Model
This page shows how all 28 sources connect through the three mapping tables: protein_map, disease_map, and drug_map.
Network Diagram
graph LR
%% Hub nodes
PM[("🧬 protein_map<br/>26,499 proteins")]:::hub
DM[("🏥 disease_map<br/>31,884 diseases")]:::hub
DrM[("💊 drug_map<br/>42,939 drugs")]:::hub
%% Protein sources (direct UniProt join)
UP[UniProt]:::direct --> PM
RT[Reactome]:::direct --> PM
CH[ChEMBL]:::direct --> PM
IP[InterPro]:::direct --> PM
CP[Complex Portal]:::direct --> PM
SL[SomaLogic]:::direct --> PM
MB[MarkerDB]:::direct --> PM
HPA[HPA/Olink]:::direct --> PM
PDB[PDB/SIFTS]:::direct --> PM
%% Protein sources (Ensembl bridge)
OT[Open Targets]:::bridge --> PM
ST[STRING]:::bridge --> PM
GT[GTEx]:::bridge --> PM
DIS[DISEASES 2.0]:::bridge --> PM
EN[Ensembl]:::bridge --> PM
GN[gnomAD]:::bridge --> PM
%% Disease sources
OT --> DM
DIS --> DM
CV[ClinVar]:::direct --> DM
CT[ClinicalTrials]:::direct --> DM
FDA[OpenFDA]:::direct --> DM
ICD[ICD-10-CM]:::direct --> DM
UKB[UK Biobank]:::direct --> DM
MH[MeSH]:::direct --> DM
%% Ontology hierarchy feeds disease_map
MO[MONDO]:::onto --> DM
DO[Disease Ontology]:::onto --> DM
EF[EFO]:::onto --> DM
GO[Gene Ontology]:::onto --> PM
%% Drug sources
CH --> DrM
TTD[TTD]:::direct --> DrM
PC[PubChem]:::direct --> DrM
FDA --> DrM
%% Competitors (fuzzy match via gene name)
MSD[MSD]:::comp -.-> PM
QX[Quanterix]:::comp -.-> PM
AL[Alamar]:::comp -.-> PM
NM[Nomic]:::comp -.-> PM
%% LLM-extracted KG
KG[("🤖 LLM-extracted<br/>Knowledge Graph")]:::kg --> PM
KG --> DM
%% Cross-links between hubs
PM <--> DM
PM <--> DrM
DM <--> DrM
%% Styles
classDef hub fill:#1a73e8,stroke:#fff,color:#fff,font-weight:bold
classDef direct fill:#34a853,stroke:#333,color:#fff
classDef bridge fill:#fbbc04,stroke:#333,color:#000
classDef onto fill:#9c27b0,stroke:#333,color:#fff
classDef comp fill:#ff6d00,stroke:#333,color:#fff
classDef kg fill:#e91e63,stroke:#333,color:#fff
Legend
| Color | Meaning | Join type |
|---|---|---|
| 🟢 Green | Direct UniProt/MONDO ID in the data | Exact ID match |
| 🟡 Yellow | Uses Ensembl IDs (bridged via protein_map) | ID mapping lookup |
| 🟣 Purple | Ontology (provides the hierarchy + xrefs) | Parsed from OBO |
| 🟠 Orange | Competitor data (protein names, not IDs) | Fuzzy name match |
| 🔴 Pink | LLM-extracted from papers | Entity resolution |
| 🔵 Blue | Mapping hub table | Central join point |
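
The three protein-side join types in the legend (exact ID match, ID mapping lookup, fuzzy name match) can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the rows, the IDs, the function names, and the 0.8 similarity cutoff are all assumptions, and the fuzzy step uses Python's standard `difflib` as a stand-in for whatever matcher is really used.

```python
import difflib

# Toy protein_map rows; every ID and name here is made up for illustration.
PROTEIN_MAP = [
    {"uniprot_id": "P00001", "ensembl_gene_id": "ENSG_TOY_1", "gene_name": "GENEA"},
    {"uniprot_id": "P00002", "ensembl_gene_id": "ENSG_TOY_2", "gene_name": "GENEB1"},
]

# One lookup index per identifier type.
BY_UNIPROT = {r["uniprot_id"]: r for r in PROTEIN_MAP}
BY_ENSEMBL = {r["ensembl_gene_id"]: r for r in PROTEIN_MAP}
BY_GENE = {r["gene_name"]: r for r in PROTEIN_MAP}


def resolve_direct(uniprot_id):
    """Green sources: the record already carries a UniProt ID (exact match)."""
    return uniprot_id if uniprot_id in BY_UNIPROT else None


def resolve_bridge(ensembl_gene_id):
    """Yellow sources: translate an Ensembl ID through protein_map."""
    row = BY_ENSEMBL.get(ensembl_gene_id)
    return row["uniprot_id"] if row else None


def resolve_fuzzy(name, cutoff=0.8):
    """Orange sources: match a free-text name to the closest known gene name."""
    hits = difflib.get_close_matches(name.upper(), list(BY_GENE), n=1, cutoff=cutoff)
    return BY_GENE[hits[0]]["uniprot_id"] if hits else None


print(resolve_direct("P00001"))
print(resolve_bridge("ENSG_TOY_2"))
print(resolve_fuzzy("geneb-1"))
```

All three functions return a UniProt ID or `None`, so downstream code can treat every source the same way once its records are resolved.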
Join Paths
Protein → Disease (6 sources)
graph LR
P[Protein<br/>UniProt ID] --> PM[protein_map]
PM -->|ensembl_gene_id| OT[Open Targets]
PM -->|ensembl_protein_id| DIS[DISEASES 2.0]
PM -->|gene_name| CV[ClinVar]
PM -->|uniprot_id| MB[MarkerDB]
OT -->|efo_id| DM[disease_map]
DIS -->|doid| DM
CV -->|medgen_id| DM
MB -->|disease_name| DM
DM --> D[Disease<br/>MONDO ID]
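
As a rough sketch of one path in the diagram above (protein_map → Open Targets → disease_map), the two hops can be written as plain dictionary lookups. The rows are toy placeholders, `mondo_id` is an assumed column name for the MONDO identifier, and `diseases_for_protein` is a hypothetical helper.

```python
# Toy rows; all identifiers are placeholders, and mondo_id is an assumed column name.
PROTEIN_MAP = {"P00001": {"ensembl_gene_id": "ENSG_TOY_1"}}
OPEN_TARGETS = [
    {"ensembl_gene_id": "ENSG_TOY_1", "efo_id": "EFO_TOY_1"},
    {"ensembl_gene_id": "ENSG_TOY_1", "efo_id": "EFO_TOY_UNMAPPED"},
    {"ensembl_gene_id": "ENSG_TOY_9", "efo_id": "EFO_TOY_1"},
]
DISEASE_MAP = [{"mondo_id": "MONDO_TOY_1", "efo_id": "EFO_TOY_1"}]

# Index disease_map by the key that Open Targets uses.
EFO_TO_MONDO = {r["efo_id"]: r["mondo_id"] for r in DISEASE_MAP}


def diseases_for_protein(uniprot_id):
    """Hop 1: UniProt ID -> Ensembl gene ID. Hop 2: EFO ID -> MONDO ID."""
    protein = PROTEIN_MAP.get(uniprot_id)
    if protein is None:
        return []
    gene = protein["ensembl_gene_id"]
    mondo_ids = {
        EFO_TO_MONDO[row["efo_id"]]
        for row in OPEN_TARGETS
        if row["ensembl_gene_id"] == gene and row["efo_id"] in EFO_TO_MONDO
    }
    return sorted(mondo_ids)


print(diseases_for_protein("P00001"))
```

In this sketch, associations whose EFO ID has no disease_map entry are silently dropped. Whether the real pipeline drops or logs them is not stated here.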
Protein → Drug (4 sources)
graph LR
P[Protein] --> PM[protein_map]
PM -->|uniprot_id| CH[ChEMBL]
PM -->|uniprot_id| TTD[TTD]
PM -->|entrez_gene_id| PC[PubChem]
CH --> DrM[drug_map]
TTD --> DrM
PC --> DrM
DrM --> D[Drug]
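
Because several sources can point at the same drug, one plausible way to combine them is to resolve each source's identifier through drug_map and record which sources support each drug. The sketch below assumes drug_map columns named `drug_id`, `chembl_id`, `ttd_id`, and `pubchem_cid`. These names, the toy rows, and the helper `drugs_for_protein` are illustrative only.

```python
from collections import defaultdict

# Toy rows; column names and identifiers are assumptions for illustration.
DRUG_MAP = [
    {"drug_id": "DRUG_TOY_1", "chembl_id": "CHEMBL_TOY_1",
     "ttd_id": "TTD_TOY_1", "pubchem_cid": "CID_TOY_1"},
    {"drug_id": "DRUG_TOY_2", "chembl_id": "CHEMBL_TOY_2",
     "ttd_id": None, "pubchem_cid": None},
]
PROTEIN_MAP = {"P00001": {"entrez_gene_id": "GENEID_TOY_1"}}

# Source records as (target key, source-specific drug ID) pairs.
CHEMBL = [("P00001", "CHEMBL_TOY_1"), ("P00001", "CHEMBL_TOY_2")]
TTD = [("P00001", "TTD_TOY_1")]
PUBCHEM = [("GENEID_TOY_1", "CID_TOY_1")]  # keyed by Entrez gene ID


def index(column):
    """Map one source-specific ID column of drug_map to the unified drug_id."""
    return {r[column]: r["drug_id"] for r in DRUG_MAP if r[column] is not None}


CHEMBL_IDX, TTD_IDX, PUBCHEM_IDX = index("chembl_id"), index("ttd_id"), index("pubchem_cid")


def drugs_for_protein(uniprot_id):
    """Return {drug_id: [supporting sources]} for one protein."""
    evidence = defaultdict(set)
    for target, chembl_id in CHEMBL:
        if target == uniprot_id and chembl_id in CHEMBL_IDX:
            evidence[CHEMBL_IDX[chembl_id]].add("ChEMBL")
    for target, ttd_id in TTD:
        if target == uniprot_id and ttd_id in TTD_IDX:
            evidence[TTD_IDX[ttd_id]].add("TTD")
    # PubChem is joined through entrez_gene_id rather than uniprot_id.
    entrez = PROTEIN_MAP.get(uniprot_id, {}).get("entrez_gene_id")
    for gene_id, cid in PUBCHEM:
        if gene_id == entrez and cid in PUBCHEM_IDX:
            evidence[PUBCHEM_IDX[cid]].add("PubChem")
    return {drug: sorted(sources) for drug, sources in evidence.items()}


print(drugs_for_protein("P00001"))
```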
Protein → Protein (PPI, 3 sources)
graph LR
P1[Protein A] --> PM[protein_map]
PM -->|string_id| ST[STRING<br/>score ≥ 700]
PM -->|uniprot_id| CP[Complex Portal<br/>same complex]
PM -->|uniprot_id| KG[LLM KG<br/>INTERACTS_WITH]
ST --> PM2[protein_map]
CP --> PM2
KG --> PM2
PM2 --> P2[Protein B]
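
The PPI path can be sketched as a union of edges from each source, keyed by an unordered pair of UniProt IDs. The example covers only STRING (filtered at the score ≥ 700 threshold shown in the diagram) and Complex Portal (proteins sharing a complex). The rows, the complex ID, and the function names are toy assumptions, and the LLM KG source is left out for brevity.

```python
from itertools import combinations

STRING_MIN_SCORE = 700  # threshold taken from the diagram above

# Toy rows; all identifiers are placeholders.
PROTEIN_MAP = [
    {"uniprot_id": "P00001", "string_id": "STRING_TOY_1"},
    {"uniprot_id": "P00002", "string_id": "STRING_TOY_2"},
    {"uniprot_id": "P00003", "string_id": "STRING_TOY_3"},
]
STRING_EDGES = [
    ("STRING_TOY_1", "STRING_TOY_2", 900),
    ("STRING_TOY_1", "STRING_TOY_3", 400),  # below threshold, dropped
]
COMPLEXES = {"COMPLEX_TOY_1": ["P00002", "P00003"]}

STRING_TO_UNIPROT = {r["string_id"]: r["uniprot_id"] for r in PROTEIN_MAP}


def pair(a, b):
    """Canonical key for an undirected edge."""
    return tuple(sorted((a, b)))


def unified_ppi():
    """Return {(protein_a, protein_b): {sources}} across both sources."""
    edges = {}
    for a, b, score in STRING_EDGES:
        if score >= STRING_MIN_SCORE and a in STRING_TO_UNIPROT and b in STRING_TO_UNIPROT:
            key = pair(STRING_TO_UNIPROT[a], STRING_TO_UNIPROT[b])
            edges.setdefault(key, set()).add("STRING")
    for members in COMPLEXES.values():
        # Every pair of proteins in the same complex counts as an interaction.
        for a, b in combinations(members, 2):
            edges.setdefault(pair(a, b), set()).add("Complex Portal")
    return edges


print(unified_ppi())
```

Sorting each pair before using it as a key means A–B and B–A collapse into one edge, so evidence from different sources accumulates on the same entry.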
Identifier Flow
flowchart TD
subgraph "Raw Data (28 sources)"
A[UniProt IDs]
B[Ensembl IDs]
C[Gene Names]
D[MONDO/EFO/DOID]
E[MeSH/ICD-10]
F[ChEMBL/PubChem]
end
subgraph "Mapping Layer"
PM[protein_map<br/>9 columns]
DM[disease_map<br/>10 columns]
DrM[drug_map<br/>5 columns]
end
subgraph "Unified Query"
V1[unified_protein_disease]
V2[unified_ppi]
V3[unified_drug_target]
V4[unified_expression]
end
subgraph "Knowledge Graph"
N[Neptune<br/>201K mapping edges<br/>+ LLM extractions]
end
A --> PM
B --> PM
C --> PM
D --> DM
E --> DM
F --> DrM
PM --> V1
PM --> V2
PM --> V3
PM --> V4
DM --> V1
DrM --> V3
PM --> N
DM --> N
DrM --> N
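
To make the "Unified Query" layer concrete, here is a hedged sketch of how a view such as unified_protein_disease could be assembled from two sources using an in-memory SQLite database. The table layouts, the `mondo_id` and `source` columns, and the toy rows are assumptions. The real view definition and database engine are not shown in this page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Assumed, simplified schemas; the real tables have more columns.
    CREATE TABLE protein_map (uniprot_id TEXT, ensembl_gene_id TEXT, ensembl_protein_id TEXT);
    CREATE TABLE disease_map (mondo_id TEXT, efo_id TEXT, doid TEXT);
    CREATE TABLE open_targets (ensembl_gene_id TEXT, efo_id TEXT);
    CREATE TABLE diseases (ensembl_protein_id TEXT, doid TEXT);

    -- Toy rows with placeholder identifiers.
    INSERT INTO protein_map VALUES ('P00001', 'ENSG_TOY_1', 'ENSP_TOY_1');
    INSERT INTO disease_map VALUES ('MONDO_TOY_1', 'EFO_TOY_1', 'DOID_TOY_1');
    INSERT INTO disease_map VALUES ('MONDO_TOY_2', 'EFO_TOY_2', 'DOID_TOY_2');
    INSERT INTO open_targets VALUES ('ENSG_TOY_1', 'EFO_TOY_1');
    INSERT INTO diseases VALUES ('ENSP_TOY_1', 'DOID_TOY_1');
    INSERT INTO diseases VALUES ('ENSP_TOY_1', 'DOID_TOY_2');

    -- Each branch joins one source to both mapping tables on that source's own keys.
    CREATE VIEW unified_protein_disease AS
    SELECT pm.uniprot_id, dm.mondo_id, 'open_targets' AS source
    FROM open_targets AS ot
    JOIN protein_map AS pm ON pm.ensembl_gene_id = ot.ensembl_gene_id
    JOIN disease_map AS dm ON dm.efo_id = ot.efo_id
    UNION ALL
    SELECT pm.uniprot_id, dm.mondo_id, 'diseases' AS source
    FROM diseases AS d
    JOIN protein_map AS pm ON pm.ensembl_protein_id = d.ensembl_protein_id
    JOIN disease_map AS dm ON dm.doid = d.doid;
    """
)

rows = conn.execute(
    "SELECT uniprot_id, mondo_id, source FROM unified_protein_disease "
    "ORDER BY mondo_id, source"
).fetchall()
for row in rows:
    print(row)
```

Keeping a `source` column in the view lets a query distinguish an association backed by one source from one backed by several.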