Protein & Target Databases

UniProt

The Universal Protein Resource. It supplies the canonical reviewed human proteome (Swiss-Prot) and serves as the master cross-reference hub linking UniProt accessions to Ensembl, STRING, HGNC, Entrez Gene, and ChEMBL IDs. The idmapping_human dataset is the backbone of protein_map.

Dataset	Key Columns	Description
`swissprot_tsv`	`Entry`, `Gene Names`, `Protein names`, `Organism`, `Length`, `EC number`	Reviewed human proteome (~20K entries)
`idmapping_human`	`UniProtKB-AC`, `ID_type`, `ID`	Cross-reference table (UniProt → Ensembl, STRING, HGNC, etc.)

Identifiers: UniProt accession (P00533), gene name (EGFR)

Join key: protein_map.uniprot_id

uv run bioingest download uniprot --all
# Athena
SELECT entry, gene_names, protein_names FROM bioingest.uniprot__swissprot_tsv WHERE gene_names LIKE '%EGFR%';
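
The idmapping_human table is in long format (one row per accession, ID type, and external ID), so building a per-protein lookup means pivoting it. The following is a minimal sketch of that pivot, not the actual protein_map build. The function name `pivot_idmapping` and the sample rows are illustrative assumptions, and real ID values may carry version suffixes.

```python
from collections import defaultdict


def pivot_idmapping(rows, wanted=("Ensembl", "STRING", "HGNC")):
    """Collapse long-format (accession, id_type, external_id) rows
    into {accession: {id_type: [external_id, ...]}}.

    Hypothetical helper for illustration; ID types outside `wanted` are skipped.
    """
    mapping = defaultdict(lambda: defaultdict(list))
    for accession, id_type, external_id in rows:
        if id_type in wanted:
            mapping[accession][id_type].append(external_id)
    # Convert the nested defaultdicts to plain dicts for predictable lookups.
    return {acc: dict(types) for acc, types in mapping.items()}


# Illustrative rows shaped like (UniProtKB-AC, ID_type, ID).
rows = [
    ("P00533", "Ensembl", "ENSG00000146648"),
    ("P00533", "STRING", "9606.ENSP00000275493"),
    ("P00533", "HGNC", "HGNC:3236"),
    ("P00533", "GeneID", "1956"),
]
egfr = pivot_idmapping(rows)["P00533"]
print(egfr["STRING"])
```

Values are kept as lists because one accession can map to several IDs of the same type.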

STRING

Protein-protein interaction network derived from experimental data, curated databases, text mining, co-expression, and genomic context. Each interaction pair carries a combined confidence score (0–1000). The human network contains ~12M scored interactions.

Dataset	Key Columns	Description
`protein_links`	`protein1`, `protein2`, `combined_score`	All interactions with combined score
`protein_links_detailed`	`protein1`, `protein2`, `experimental`, `database`, `textmining`, `coexpression`, `combined_score`	Per-evidence-channel scores
`protein_info`	`protein_external_id`, `preferred_name`, `annotation`	Protein metadata

Identifiers: STRING ID (9606.ENSP00000275493)

Join key: protein_map.string_id

uv run bioingest download string --all
# Athena — top interactors for EGFR
SELECT protein1, protein2, combined_score FROM bioingest.string__protein_links
WHERE protein1 = '9606.ENSP00000275493' AND combined_score >= 700 ORDER BY combined_score DESC;
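
The query above filters at 700, which corresponds to STRING's conventional "high confidence" cutoff. A small sketch of how the ID and score could be interpreted downstream follows. The helper names are assumptions for illustration. The thresholds (150, 400, 700, 900) follow STRING's usual low, medium, high, and highest confidence levels.

```python
def split_string_id(string_id):
    """Split a STRING ID such as '9606.ENSP00000275493' into
    (NCBI taxon ID, Ensembl protein ID)."""
    taxon, _, protein = string_id.partition(".")
    return int(taxon), protein


def confidence_band(combined_score):
    """Map a 0-1000 combined score to a confidence label."""
    if combined_score >= 900:
        return "highest"
    if combined_score >= 700:
        return "high"
    if combined_score >= 400:
        return "medium"
    if combined_score >= 150:
        return "low"
    return "below cutoff"


taxon_id, ensp = split_string_id("9606.ENSP00000275493")
print(taxon_id, ensp, confidence_band(700), 700 / 1000)
```

Dividing the integer score by 1000 gives the 0 to 1 scale used on the STRING website.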

InterPro

Protein domain and family classification integrating Pfam, PROSITE, SMART, CDD, and other member databases. Maps each protein to its constituent domains, enabling functional annotation at the domain level.

Dataset	Key Columns	Description
`protein2ipr`	`uniprot_id`, `interpro_id`, `interpro_name`, `start`, `end`	Protein-to-domain mappings
`entry_list`	`interpro_id`, `type`, `name`	InterPro entry catalog (domain/family/repeat)
`interpro2go`	`interpro_id`, `go_id`	InterPro to Gene Ontology mappings

Identifiers: UniProt accession, InterPro ID (IPR000719)

Join key: protein_map.uniprot_id (direct match on protein2ipr.uniprot_id)

uv run bioingest download interpro --datasets protein2ipr entry_list
# Athena — domains for EGFR
SELECT interpro_id, interpro_name FROM bioingest.interpro__protein2ipr WHERE uniprot_id = 'P00533';
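
Because protein2ipr carries start and end coordinates, a residue of interest (for example a variant position) can be placed in its domain. The sketch below assumes 1-based inclusive coordinates. The function name, the coordinates, and the second ID are placeholders for illustration, not values taken from InterPro.

```python
def domains_at(position, domain_rows):
    """Return InterPro IDs whose [start, end] range covers `position`.

    `domain_rows` is an iterable of (interpro_id, start, end) tuples,
    assumed to use 1-based inclusive residue coordinates.
    """
    return [ipr for ipr, start, end in domain_rows if start <= position <= end]


# Placeholder rows for one protein; coordinates are illustrative only.
rows = [
    ("IPR000719", 712, 979),
    ("IPR_PLACEHOLDER", 57, 168),
]
print(domains_at(858, rows))
```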

Complex Portal

EBI-curated catalog of stable macromolecular complexes with stoichiometry, function, and disease annotations. Each complex lists its component proteins with UniProt accessions.

Dataset	Key Columns	Description
`complexes`	`Complex ac`, `Recommended name`, `Taxonomy identifier`, `Identifiers (cross-references)`, `Participants`	Human protein complexes

Identifiers: Complex Portal ID (CPX-1), UniProt accessions in Participants column

Join key: Parse UniProt accessions from Participants → protein_map.uniprot_id

uv run bioingest download complex_portal
# Athena
SELECT "Complex ac", "Recommended name", "Participants" FROM bioingest.complex_portal__complexes
WHERE "Participants" LIKE '%P00533%';
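
The join key requires parsing the Participants column. A minimal sketch follows, assuming the cell is pipe-separated with stoichiometry in parentheses (for example `P00533(2)|P01133(2)`), as in the upstream Complex Portal TSV export. The helper name and the sample cell are illustrative. Non-protein participants such as ChEBI entries are skipped, and isoform or chain suffixes after a hyphen are dropped.

```python
import re

# UniProt's documented accession pattern (6- or 10-character accessions).
UNIPROT_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# One participant token: identifier followed by a stoichiometry in parentheses.
TOKEN_RE = re.compile(r"(?P<id>[^()]+)\((?P<n>\d+)\)")


def parse_participants(cell):
    """Return {uniprot_accession: stoichiometry} from a Participants cell."""
    result = {}
    for token in cell.split("|"):
        match = TOKEN_RE.fullmatch(token.strip())
        if match is None:
            continue  # unexpected token shape; skip rather than guess
        base = match.group("id").split("-")[0]  # drop isoform/chain suffix
        if UNIPROT_RE.fullmatch(base):
            result[base] = result.get(base, 0) + int(match.group("n"))
    return result


print(parse_participants("P00533(2)|P01133(2)|CHEBI:29105(1)"))
```

Matching whole accessions this way avoids the partial hits that a `LIKE '%...%'` substring filter can return.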

PDB / SIFTS

Structure Integration with Function, Taxonomy, and Sequence — maps PDB structures to UniProt residues, Pfam domains, SCOP folds, and Gene Ontology. Enables linking 3D structural information to sequence-level annotations.

Dataset	Key Columns	Description
`sifts_mappings`	`PDB`, `CHAIN`, `SP_PRIMARY`, `RES_BEG`, `RES_END`, `PDB_BEG`, `PDB_END`	PDB chain → UniProt residue ranges

Identifiers: PDB ID (1ATP), UniProt accession (SP_PRIMARY)

Join key: sifts_mappings.SP_PRIMARY → protein_map.uniprot_id

uv run bioingest download pdb_complexes
# Athena — structures for EGFR
SELECT PDB, CHAIN, RES_BEG, RES_END FROM bioingest.pdb_complexes__sifts_mappings WHERE SP_PRIMARY = 'P00533';
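
A common follow-up is estimating how much of a protein is structurally covered across all its PDB chains. The sketch below merges overlapping residue ranges and reports the covered fraction. It assumes the ranges are in UniProt coordinates: the upstream SIFTS file carries these as SP_BEG and SP_END, and whether this ingest keeps those columns is an assumption. The function names and sample ranges are illustrative.

```python
def merge_ranges(ranges):
    """Merge overlapping or adjacent 1-based inclusive (start, end) ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


def covered_fraction(ranges, protein_length):
    """Fraction of residues covered by at least one range."""
    covered = sum(end - start + 1 for start, end in merge_ranges(ranges))
    return covered / protein_length


# Illustrative ranges from several chains of one hypothetical 100-residue protein.
ranges = [(1, 10), (5, 20), (30, 40), (21, 25)]
print(merge_ranges(ranges), covered_fraction(ranges, 100))
```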

AlphaFold

Protein structures predicted by DeepMind's AlphaFold, covering nearly all proteins catalogued in UniProt. Provides per-residue confidence scores (pLDDT) and predicted aligned error (PAE). Complements experimental PDB structures for proteins without X-ray, NMR, or cryo-EM data.

Dataset	Key Columns	Description
`predictions`	`uniprot_id`, `af_id`, `plddt_mean`, `model_url`	AlphaFold structure predictions per protein

Identifiers: UniProt accession, AlphaFold ID (AF-P00533-F1)

Join key: protein_map.uniprot_id

uv run bioingest download alphafold
# Athena
SELECT uniprot_id, af_id, plddt_mean FROM bioingest.alphafold__predictions WHERE uniprot_id = 'P00533';
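
To interpret these fields, the sketch below splits an AlphaFold ID into accession and fragment number and maps a pLDDT value to the confidence bands used by the AlphaFold database (above 90, 70 to 90, 50 to 70, below 50). The helper names are assumptions, and the ID pattern only covers the plain `AF-<accession>-F<n>` form shown above.

```python
import re

AF_ID_RE = re.compile(r"AF-(?P<accession>[A-Z0-9]+)-F(?P<fragment>\d+)")


def parse_af_id(af_id):
    """Split 'AF-P00533-F1' into (UniProt accession, fragment number)."""
    match = AF_ID_RE.fullmatch(af_id)
    if match is None:
        raise ValueError(f"unrecognized AlphaFold ID: {af_id!r}")
    return match.group("accession"), int(match.group("fragment"))


def plddt_band(plddt):
    """Map a pLDDT value (0-100) to a confidence band.

    Values exactly on a boundary fall into the lower band here;
    that tie-breaking is a choice made for this sketch.
    """
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"


print(parse_af_id("AF-P00533-F1"), plddt_band(82.5))
```

A mean pLDDT summarizes the whole chain, so a protein with a well-predicted domain and long disordered tails can still score low overall.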