Protein & Target Databases
UniProt
The Universal Protein Resource. It supplies the reviewed (Swiss-Prot) human proteome and acts as the master cross-reference hub, linking UniProt accessions to Ensembl, STRING, HGNC, Entrez, and ChEMBL IDs. The idmapping_human dataset is the backbone of protein_map.
| Dataset | Key Columns | Description |
|---|---|---|
| swissprot_tsv | Entry, Gene Names, Protein names, Organism, Length, EC number | Reviewed human proteome (~20K entries) |
| idmapping_human | UniProtKB-AC, ID_type, ID | Cross-reference table (UniProt → Ensembl, STRING, HGNC, etc.) |
Identifiers: UniProt accession (P00533), gene name (EGFR)
Join key: protein_map.uniprot_id
uv run bioingest download uniprot --all
# Athena
SELECT entry, gene_names, protein_names FROM bioingest.uniprot__swissprot_tsv WHERE gene_names LIKE '%EGFR%';
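The long-format idmapping_human table (one row per accession, ID type, and ID) can be pivoted into one record per accession, which is roughly the shape a protein_map-style lookup needs. The following is a minimal sketch, not the bioingest implementation: `pivot_idmapping` is a hypothetical helper, the rows are illustrative, and the ID_type labels (`STRING`, `HGNC`, `GeneID`, `ChEMBL`) are assumed to follow UniProt's idmapping naming.

```python
from collections import defaultdict


def pivot_idmapping(rows, wanted=("STRING", "HGNC", "GeneID", "ChEMBL")):
    """Collapse (accession, id_type, id) rows into {accession: {id_type: [ids]}}."""
    table = defaultdict(lambda: defaultdict(list))
    for accession, id_type, external_id in rows:
        if id_type in wanted:
            # One accession can map to several IDs of the same type, so keep a list.
            table[accession][id_type].append(external_id)
    return {acc: dict(ids) for acc, ids in table.items()}


# Illustrative rows in the three-column layout (UniProtKB-AC, ID_type, ID).
sample_rows = [
    ("P00533", "Gene_Name", "EGFR"),
    ("P00533", "STRING", "9606.ENSP00000275493"),
    ("P00533", "HGNC", "HGNC:3236"),
    ("P00533", "GeneID", "1956"),
]

protein_map = pivot_idmapping(sample_rows)
print(protein_map["P00533"]["STRING"])
```

Keeping a list per ID type avoids silently dropping one-to-many mappings; a real pipeline would still need a rule for choosing a single canonical ID where one is required.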
STRING
Protein-protein interaction network derived from experimental data, text-mining, co-expression, and genomic context. Provides combined confidence scores (0–1000) for each interaction pair. Human network contains ~12M scored interactions.
| Dataset | Key Columns | Description |
|---|---|---|
| protein_links | protein1, protein2, combined_score | All interactions with combined score |
| protein_links_detailed | protein1, protein2, experimental, database, textmining, coexpression, combined_score | Per-evidence-channel scores |
| protein_info | protein_external_id, preferred_name, annotation | Protein metadata |
Identifiers: STRING ID (9606.ENSP00000275493)
Join key: protein_map.string_id
uv run bioingest download string --all
# Athena — top interactors for EGFR
SELECT protein1, protein2, combined_score FROM bioingest.string__protein_links
WHERE protein1 = '9606.ENSP00000275493' AND combined_score >= 700 ORDER BY combined_score DESC;
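The 700 cutoff in the query above matches what STRING usually labels "high confidence". The sketch below shows one way to turn a combined score into a label and to split a STRING ID into its taxon and protein parts. Both helpers are hypothetical, and the 150/400/700/900 cutoffs are assumed from the labels used in STRING's web interface rather than taken from this project.

```python
def confidence_band(combined_score):
    """Map a 0-1000 STRING combined score to a confidence label."""
    if not 0 <= combined_score <= 1000:
        raise ValueError(f"score out of range: {combined_score}")
    if combined_score >= 900:
        return "highest"
    if combined_score >= 700:
        return "high"
    if combined_score >= 400:
        return "medium"
    if combined_score >= 150:
        return "low"
    return "below cutoff"


def split_string_id(string_id):
    """Split '9606.ENSP00000275493' into (taxon_id, protein_id)."""
    taxon, _, protein = string_id.partition(".")
    if not taxon.isdigit() or not protein:
        raise ValueError(f"unexpected STRING ID: {string_id!r}")
    return int(taxon), protein


print(confidence_band(700))
print(split_string_id("9606.ENSP00000275493"))
```

Dividing the combined score by 1000 gives the 0 to 1 scale that STRING reports elsewhere, which can be easier to compare against other probability-like scores.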
InterPro
Protein domain and family classification integrating Pfam, PROSITE, SMART, CDD, and other member databases. Maps each protein to its constituent domains, enabling functional annotation at the domain level.
| Dataset | Key Columns | Description |
|---|---|---|
| protein2ipr | uniprot_id, interpro_id, interpro_name, start, end | Protein-to-domain mappings |
| entry_list | interpro_id, type, name | InterPro entry catalog (domain/family/repeat) |
| interpro2go | interpro_id, go_id | InterPro to Gene Ontology mappings |
Identifiers: UniProt accession, InterPro ID (IPR000719)
Join key: protein_map.uniprot_id (direct match on protein2ipr.uniprot_id)
uv run bioingest download interpro --datasets protein2ipr entry_list
# Athena — domains for EGFR
SELECT interpro_id, interpro_name FROM bioingest.interpro__protein2ipr WHERE uniprot_id = 'P00533';
Complex Portal
EBI-curated catalog of stable macromolecular complexes with stoichiometry, function, and disease annotations. Each complex lists its component proteins with UniProt accessions.
| Dataset | Key Columns | Description |
|---|---|---|
| complexes | Complex ac, Recommended name, Taxonomy identifier, Identifiers (cross-references), Participants | Human protein complexes |
Identifiers: Complex Portal ID (CPX-1), UniProt accessions in Participants column
Join key: Parse UniProt accessions from Participants → protein_map.uniprot_id
uv run bioingest download complex_portal
# Athena
SELECT "Complex ac", "Recommended name", "Participants" FROM bioingest.complex_portal__complexes
WHERE "Participants" LIKE '%P00533%';
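Because the join key has to be parsed out of the Participants column, a small extraction step is needed before joining to protein_map. The sketch below assumes each cell is a `|`-separated list of identifiers with a stoichiometry in parentheses (for example `P00533(2)`), and that non-protein participants such as ChEBI IDs should be skipped. `participant_accessions` is a hypothetical helper and the sample cell is made up for illustration; the regular expression is UniProt's documented accession pattern.

```python
import re

# UniProt's documented accession pattern (6- or 10-character accessions).
UNIPROT_AC = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def participant_accessions(participants):
    """Return the unique UniProt accessions in a Participants cell, in order."""
    found = []
    for token in participants.split("|"):
        # Drop a trailing stoichiometry such as "(2)" and any isoform/chain suffix.
        identifier = token.strip().split("(")[0].split("-")[0]
        if UNIPROT_AC.fullmatch(identifier) and identifier not in found:
            found.append(identifier)
    return found


print(participant_accessions("P00533(2)|P01133(2)|CHEBI:30616(0)"))
```

Matching whole tokens is stricter than the `LIKE '%P00533%'` filter in the query, which can also hit an accession embedded in a longer identifier.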
PDB / SIFTS
SIFTS (Structure Integration with Function, Taxonomy and Sequences) maps PDB structures to UniProt residues, Pfam domains, SCOP folds, and Gene Ontology terms. It links 3D structural information to sequence-level annotations.
| Dataset | Key Columns | Description |
|---|---|---|
| sifts_mappings | PDB, CHAIN, SP_PRIMARY, RES_BEG, RES_END, PDB_BEG, PDB_END | PDB chain → UniProt residue ranges |
Identifiers: PDB ID (1ATP), UniProt accession (SP_PRIMARY)
Join key: sifts_mappings.SP_PRIMARY → protein_map.uniprot_id
uv run bioingest download pdb_complexes
# Athena — structures for EGFR
SELECT PDB, CHAIN, RES_BEG, RES_END FROM bioingest.pdb_complexes__sifts_mappings WHERE SP_PRIMARY = 'P00533';
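A protein like EGFR maps to many PDB chains, so it often helps to rank them by how much sequence each one covers. The sketch below sums the mapped segment lengths per (PDB, CHAIN) pair. It assumes RES_BEG and RES_END are inclusive integer residue indices; `rank_structures` is a hypothetical helper, and the rows are synthetic placeholders shaped like the query result above, not real SIFTS records.

```python
def rank_structures(rows):
    """Rank (pdb, chain) pairs by total mapped residues, longest first."""
    totals = {}
    for pdb, chain, res_beg, res_end in rows:
        # A chain can appear in several segments; sum their inclusive lengths.
        key = (pdb, chain)
        totals[key] = totals.get(key, 0) + (int(res_end) - int(res_beg) + 1)
    # Sort by descending length, then by (pdb, chain) for a stable order.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Synthetic rows: (PDB, CHAIN, RES_BEG, RES_END).
rows = [
    ("xxx1", "A", 1, 300),
    ("xxx2", "A", 1, 120),
    ("xxx2", "A", 150, 400),
    ("xxx3", "B", 10, 50),
]

ranked = rank_structures(rows)
print(ranked[0])
```

Mapped length is only a rough proxy for usefulness; resolution and experimental method are not in this table and would need another source.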
AlphaFold
DeepMind's predicted protein structures covering nearly all known proteins. Provides per-residue confidence scores (pLDDT) and predicted aligned error (PAE). Complements experimental PDB structures for proteins without crystal/cryo-EM data.
| Dataset | Key Columns | Description |
|---|---|---|
| predictions | uniprot_id, af_id, plddt_mean, model_url | AlphaFold structure predictions per protein |
Identifiers: UniProt accession, AlphaFold ID (AF-P00533-F1)
Join key: protein_map.uniprot_id
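The AlphaFold ID can be derived from the UniProt accession, and plddt_mean can be bucketed into the confidence bands commonly used for pLDDT. The sketch below is illustrative only: both helpers are hypothetical, the band cutoffs (90, 70, 50) are assumed from the AlphaFold database's conventions, and how values exactly on a boundary are treated is a choice made here.

```python
def alphafold_id(uniprot_id, fragment=1):
    """Build an AlphaFold identifier such as 'AF-P00533-F1'."""
    return f"AF-{uniprot_id}-F{fragment}"


def plddt_band(plddt):
    """Classify a pLDDT value (0-100) into a coarse confidence band."""
    if not 0 <= plddt <= 100:
        raise ValueError(f"pLDDT out of range: {plddt}")
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"


print(alphafold_id("P00533"))
print(plddt_band(82.5))
```

Because plddt_mean averages over all residues, a protein with a well-predicted domain and long disordered tails can land in a low band; per-residue scores are a better guide for region-level decisions.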