
Protein & Target Databases

UniProt

The Universal Protein Resource — the canonical reviewed human proteome (Swiss-Prot) and the master cross-reference hub linking UniProt accessions to Ensembl, STRING, HGNC, Entrez, and ChEMBL IDs. The idmapping_human dataset is the backbone of protein_map.

| Dataset | Key Columns | Description |
|---|---|---|
| swissprot_tsv | Entry, Gene Names, Protein names, Organism, Length, EC number | Reviewed human proteome (~20K entries) |
| idmapping_human | UniProtKB-AC, ID_type, ID | Cross-reference table (UniProt → Ensembl, STRING, HGNC, etc.) |

Identifiers: UniProt accession (P00533), gene name (EGFR)
Join key: protein_map.uniprot_id

uv run bioingest download uniprot --all
# Athena
SELECT entry, gene_names, protein_names FROM bioingest.uniprot__swissprot_tsv WHERE gene_names LIKE '%EGFR%';
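
Because idmapping_human is a long table with one row per (UniProtKB-AC, ID_type, ID) triple, building a wide table such as protein_map means pivoting it to one record per accession. The following is a minimal sketch of that pivot, not the actual bioingest implementation. The sample rows are illustrative, and the ID_type labels and output column names other than uniprot_id and string_id are assumptions.

```python
from collections import defaultdict

# Illustrative rows in the long (UniProtKB-AC, ID_type, ID) layout.
rows = [
    ("P00533", "Gene_Name", "EGFR"),
    ("P00533", "STRING", "9606.ENSP00000275493"),
    ("P00533", "Ensembl", "ENSG00000146648"),
    ("P00533", "HGNC", "HGNC:3236"),
    ("P00533", "GeneID", "1956"),
]

# Hypothetical mapping from ID_type labels to wide column names.
WANTED = {
    "STRING": "string_id",
    "Ensembl": "ensembl_gene_id",
    "HGNC": "hgnc_id",
    "GeneID": "entrez_id",
}


def pivot_idmapping(rows, wanted=WANTED):
    """Pivot long cross-reference rows into one dict per UniProt accession."""
    wide = defaultdict(dict)
    for accession, id_type, external_id in rows:
        column = wanted.get(id_type)
        if column is None:
            continue  # ID type not needed in the wide table
        # Keep the first ID seen when an accession has several of one type.
        wide[accession].setdefault(column, external_id)
    return {acc: {"uniprot_id": acc, **cols} for acc, cols in wide.items()}


protein_map = pivot_idmapping(rows)
print(protein_map["P00533"])
```

Keeping only the first ID per type is a simplification. One accession can map to several Ensembl or STRING IDs, so a production pivot would need an explicit rule for those cases.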

STRING

Protein-protein interaction network derived from experimental data, text-mining, co-expression, and genomic context. Provides combined confidence scores (0–1000) for each interaction pair. Human network contains ~12M scored interactions.

| Dataset | Key Columns | Description |
|---|---|---|
| protein_links | protein1, protein2, combined_score | All interactions with combined score |
| protein_links_detailed | protein1, protein2, experimental, database, textmining, coexpression, combined_score | Per-evidence-channel scores |
| protein_info | protein_external_id, preferred_name, annotation | Protein metadata |

Identifiers: STRING ID (9606.ENSP00000275493)
Join key: protein_map.string_id

uv run bioingest download string --all
# Athena — top interactors for EGFR
SELECT protein1, protein2, combined_score FROM bioingest.string__protein_links
WHERE protein1 = '9606.ENSP00000275493' AND combined_score >= 700 ORDER BY combined_score DESC;
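
The combined score is an integer on a 0–1000 scale, and the 700 cutoff used in the query corresponds to what STRING calls high confidence. The sketch below shows the same filter client-side, along with splitting a STRING ID into its taxon prefix and Ensembl protein ID. The helper names and the partner IDs in the sample edges are hypothetical.

```python
def string_confidence(combined_score: int) -> float:
    """Rescale a STRING combined score (0-1000) to the 0-1 range."""
    if not 0 <= combined_score <= 1000:
        raise ValueError(f"score out of range: {combined_score}")
    return combined_score / 1000


def split_string_id(string_id: str) -> tuple[str, str]:
    """Split '9606.ENSP00000275493' into ('9606', 'ENSP00000275493')."""
    taxon, sep, protein = string_id.partition(".")
    if not sep or not protein:
        raise ValueError(f"not a STRING ID: {string_id!r}")
    return taxon, protein


def high_confidence(edges, min_score=700):
    """Keep edges at or above min_score, strongest first."""
    kept = [edge for edge in edges if edge[2] >= min_score]
    return sorted(kept, key=lambda edge: edge[2], reverse=True)


# Hypothetical partner IDs; only the EGFR ID comes from the text above.
edges = [
    ("9606.ENSP00000275493", "9606.PARTNER_A", 650),
    ("9606.ENSP00000275493", "9606.PARTNER_B", 999),
    ("9606.ENSP00000275493", "9606.PARTNER_C", 700),
]
for protein1, protein2, score in high_confidence(edges):
    print(protein2, string_confidence(score))
```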

InterPro

Protein domain and family classification integrating Pfam, PROSITE, SMART, CDD, and other member databases. Maps each protein to its constituent domains, enabling functional annotation at the domain level.

| Dataset | Key Columns | Description |
|---|---|---|
| protein2ipr | uniprot_id, interpro_id, interpro_name, start, end | Protein-to-domain mappings |
| entry_list | interpro_id, type, name | InterPro entry catalog (domain/family/repeat) |
| interpro2go | interpro_id, go_id | InterPro to Gene Ontology mappings |

Identifiers: UniProt accession, InterPro ID (IPR000719)
Join key: protein_map.uniprot_id (direct match on protein2ipr.uniprot_id)

uv run bioingest download interpro --datasets protein2ipr entry_list
# Athena — domains for EGFR
SELECT interpro_id, interpro_name FROM bioingest.interpro__protein2ipr WHERE uniprot_id = 'P00533';
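
Domain-level annotation typically means chaining protein2ipr to interpro2go through interpro_id. The sketch below shows that two-step lookup in memory, under the assumption that both tables have been loaded as plain Python rows. The sample rows, including the residue positions, the GO pairs, and the IPR_NO_GO placeholder, are illustrative rather than taken from the datasets.

```python
from collections import defaultdict

# Illustrative protein2ipr rows (positions are placeholders).
protein2ipr = [
    {"uniprot_id": "P00533", "interpro_id": "IPR000719",
     "interpro_name": "Protein kinase domain", "start": 712, "end": 979},
    {"uniprot_id": "P00533", "interpro_id": "IPR_NO_GO",
     "interpro_name": "Hypothetical entry without GO terms", "start": 1, "end": 50},
]

# Illustrative interpro2go rows as (interpro_id, go_id) pairs.
interpro2go = [
    ("IPR000719", "GO:0004672"),
    ("IPR000719", "GO:0005524"),
]


def go_terms_by_protein(protein2ipr, interpro2go):
    """Collect GO terms per protein by joining on interpro_id."""
    go_by_entry = defaultdict(set)
    for interpro_id, go_id in interpro2go:
        go_by_entry[interpro_id].add(go_id)

    terms = defaultdict(set)
    for row in protein2ipr:
        # Entries with no GO mapping simply contribute nothing.
        terms[row["uniprot_id"]] |= go_by_entry.get(row["interpro_id"], set())
    return {acc: sorted(go_ids) for acc, go_ids in terms.items()}


print(go_terms_by_protein(protein2ipr, interpro2go))
```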

Complex Portal

EBI-curated catalog of stable macromolecular complexes with stoichiometry, function, and disease annotations. Each complex lists its component proteins with UniProt accessions.

| Dataset | Key Columns | Description |
|---|---|---|
| complexes | Complex ac, Recommended name, Taxonomy identifier, Identifiers (cross-references), Participants | Human protein complexes |

Identifiers: Complex Portal ID (CPX-1), UniProt accessions in Participants column
Join key: Parse UniProt accessions from Participants → protein_map.uniprot_id

uv run bioingest download complex_portal
# Athena
SELECT "Complex ac", "Recommended name", "Participants" FROM bioingest.complex_portal__complexes
WHERE "Participants" LIKE '%P00533%';
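
A LIKE match finds the complexes, but joining to protein_map requires extracting the individual accessions from the Participants string. The sketch below assumes the field is a `|`-separated list in which each entry carries its stoichiometry in parentheses, for example `P00533(2)|P01133(2)`, and that non-protein members such as ChEBI IDs or nested complex IDs can appear alongside proteins. The function name is hypothetical. The pattern is the accession format published by UniProt.

```python
import re

# UniProt accession pattern (6- or 10-character accessions).
UNIPROT_AC = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def parse_participants(field: str) -> list[str]:
    """Return the distinct UniProt accessions in a Participants field."""
    accessions = []
    for token in field.split("|"):
        identifier = token.split("(", 1)[0].strip()  # drop "(stoichiometry)"
        base = identifier.split("-", 1)[0]           # drop isoform/chain suffix
        # fullmatch rejects ChEBI IDs, nested CPX IDs, and other non-proteins.
        if UNIPROT_AC.fullmatch(base) and base not in accessions:
            accessions.append(base)
    return accessions


print(parse_participants("P00533(2)|P01133(2)|CHEBI:15422(0)"))
```

Collapsing isoform suffixes onto the base accession is a design choice that matches a protein-level join key. Keep the suffix instead if isoform-specific complexes matter.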

PDB / SIFTS

Structure Integration with Function, Taxonomy, and Sequence — maps PDB structures to UniProt residues, Pfam domains, SCOP folds, and Gene Ontology. Enables linking 3D structural information to sequence-level annotations.

| Dataset | Key Columns | Description |
|---|---|---|
| sifts_mappings | PDB, CHAIN, SP_PRIMARY, RES_BEG, RES_END, PDB_BEG, PDB_END | PDB chain → UniProt residue ranges |

Identifiers: PDB ID (1ATP), UniProt accession (SP_PRIMARY)
Join key: sifts_mappings.SP_PRIMARY → protein_map.uniprot_id

uv run bioingest download pdb_complexes
# Athena — structures for EGFR
SELECT PDB, CHAIN, RES_BEG, RES_END FROM bioingest.pdb_complexes__sifts_mappings WHERE SP_PRIMARY = 'P00533';
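
A protein such as EGFR can return many rows here, often several segments per chain, so a common next step is to rank chains by how many residues they map. The sketch below sums segment lengths per (PDB, CHAIN), assuming RES_BEG and RES_END are 1-based inclusive indices and that segments within a chain do not overlap. The PDB IDs and ranges in the sample rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical sifts_mappings rows for a single UniProt accession.
rows = [
    {"PDB": "1abc", "CHAIN": "A", "SP_PRIMARY": "P00533", "RES_BEG": 1, "RES_END": 300},
    {"PDB": "1abc", "CHAIN": "A", "SP_PRIMARY": "P00533", "RES_BEG": 310, "RES_END": 330},
    {"PDB": "2xyz", "CHAIN": "B", "SP_PRIMARY": "P00533", "RES_BEG": 5, "RES_END": 54},
]


def mapped_residues(rows):
    """Sum mapped residues per (PDB, CHAIN), treating ranges as inclusive."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row["PDB"], row["CHAIN"])] += row["RES_END"] - row["RES_BEG"] + 1
    return dict(totals)


def rank_chains(rows):
    """Return (pdb, chain, residues) tuples, largest mapping first."""
    totals = mapped_residues(rows)
    return sorted(
        ((pdb, chain, n) for (pdb, chain), n in totals.items()),
        key=lambda item: item[2],
        reverse=True,
    )


for pdb, chain, n in rank_chains(rows):
    print(pdb, chain, n)
```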

AlphaFold

DeepMind's predicted protein structures covering nearly all known proteins. Provides per-residue confidence scores (pLDDT, on a 0–100 scale) and predicted aligned error (PAE). Complements experimental PDB structures for proteins that lack an experimentally determined structure.

| Dataset | Key Columns | Description |
|---|---|---|
| predictions | uniprot_id, af_id, plddt_mean, model_url | AlphaFold structure predictions per protein |

Identifiers: UniProt accession, AlphaFold ID (AF-P00533-F1)
Join key: protein_map.uniprot_id

uv run bioingest download alphafold
# Athena
SELECT uniprot_id, af_id, plddt_mean FROM bioingest.alphafold__predictions WHERE uniprot_id = 'P00533';
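
pLDDT is usually read in bands: above 90 very high, 70–90 confident, 50–70 low, and below 50 very low. The sketch below applies those bands to a plddt_mean value and builds an AlphaFold ID in the AF-P00533-F1 pattern shown above. The function names are hypothetical, the handling of exact boundary values is an assumption, and a mean is only a coarse summary of a per-residue score.

```python
def plddt_band(plddt: float) -> str:
    """Map a pLDDT value (0-100) to its conventional confidence band."""
    if not 0 <= plddt <= 100:
        raise ValueError(f"pLDDT out of range: {plddt}")
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"  # boundary values fall into the lower band here


def alphafold_id(uniprot_id: str, fragment: int = 1) -> str:
    """Build an ID in the AF-<accession>-F<fragment> pattern."""
    return f"AF-{uniprot_id}-F{fragment}"


print(alphafold_id("P00533"), plddt_band(75.2))
```

A protein with a low mean can still contain well-predicted domains, so per-residue scores from the model file are the better guide when a specific region matters.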