Data Explorer

All BioIngest data is available through Git LFS (clone the repo) or directly from S3.

Quick Access

Clone with Data | Browse S3 | Launch JupyterLab | Launch RStudio | Open Athena SQL

Git LFS — easiest way to get the data

# Requires the Git LFS client; run `git lfs install` once before cloning
git clone https://github.com/Olink-Proteomics/bioingest.git
cd bioingest/data/bulk/
ls  # all datasets available locally
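
If the Git LFS client was not active when the repository was cloned, the files under data/bulk/ are small text pointers rather than the actual data. A minimal sketch for checking this, assuming the standard Git LFS pointer format (a first line starting with `version https://git-lfs.github.com/spec/v1`); the helper names `is_lfs_pointer` and `find_unfetched` are illustrative and not part of BioIngest:

```python
from pathlib import Path

# First line of every Git LFS pointer file, per the Git LFS pointer spec.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if `path` looks like an unfetched Git LFS pointer file."""
    with path.open("rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_unfetched(bulk_dir: Path) -> list[Path]:
    """List files under `bulk_dir` that are still LFS pointers (hypothetical helper)."""
    return sorted(p for p in bulk_dir.rglob("*") if p.is_file() and is_lfs_pointer(p))


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    for pointer in find_unfetched(Path("data/bulk")):
        print(f"not fetched: {pointer}")
```

If any files are reported, running `git lfs pull` inside the clone fetches the real content.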

Browse Data

Proteins & Targets

| Dataset | Description | LFS (raw) | S3 (parquet) |
|---|---|---|---|
| uniprot/swissprot_tsv.tsv | 570K reviewed protein entries | GitHub | S3 |
| uniprot/idmapping_human.tsv | UniProt cross-references | GitHub | S3 |
| opentargets/targets/ | 63K drug targets | GitHub | S3 |
| markerdb/proteins.tsv | 4K biomarker proteins | GitHub | S3 |
| complex_portal/homo_sapiens.tsv | Human protein complexes | GitHub | S3 |
| chembl/ | ChEMBL-UniProt mapping | GitHub | S3 |
| pdb_complexes/ | PDB-UniProt mapping | GitHub | S3 |

Disease & Pathway Associations

| Dataset | Description | LFS (raw) | S3 (parquet) |
|---|---|---|---|
| diseases/ | Disease-gene associations | GitHub | S3 |
| reactome/ | 2.5M pathway mappings | GitHub | S3 |
| ttd/ | Therapeutic targets & drugs | GitHub | S3 |
| ukb_disease_assoc/ | UK Biobank associations | GitHub | S3 |

Ontologies

| Dataset | Description | LFS (raw) | S3 (parquet) |
|---|---|---|---|
| mondo/mondo.obo | 47K disease terms | GitHub | S3 |
| disease_ontology/ | 12K disease terms (DO) | GitHub | S3 |
| efo/efo.obo | 50K experimental factors | GitHub | S3 |
| gene_ontology/ | 45K GO terms | GitHub | S3 |
| mesh/ | MeSH descriptors | GitHub | S3 |
| icd/ | ICD-10-CM codes | GitHub | S3 |

Competitors

| Dataset | Description | LFS (raw) | S3 (parquet) |
|---|---|---|---|
| competitors_alamar/ | Alamar NULISAseq panels | GitHub | S3 |
| competitors_msd/msd_assays.tsv | MSD assay specs | GitHub | S3 |
| competitors_nomic/ | Nomic Omni 1000 targets | GitHub | S3 |
| competitors_quanterix/ | Quanterix Simoa specs | GitHub | S3 |
| somalogic/ | SomaScan menus + panels | GitHub | S3 |

Olink & HPA Data

| Dataset | Description | LFS (raw) | S3 (parquet) |
|---|---|---|---|
| hpa_olink/proteinatlas_full.tsv.zip | HPA full proteomics | GitHub | S3 |
| hpa_olink/rna_tissue_consensus.tsv.zip | RNA tissue expression | GitHub | S3 |
| hpa_olink/rna_single_cell_type.tsv.zip | Single-cell RNA | GitHub | S3 |

| Dataset | Description | Link |
|---|---|---|
| competitors_alamar/ | Alamar NULISAseq panel targets | Download |
| competitors_msd__msd_assays/ | MSD assay specs (LOD, range) | Download |
| competitors_nomic__omni_1000_targets/ | Nomic Omni 1000 target list | Download |
| competitors_quanterix__quanterix_assays/ | Quanterix Simoa assay specs | Download |
| somalogic__somascan_11k_menu/ | SomaScan 11K menu | Download |
| somalogic__somascan_7k_menu/ | SomaScan 7K menu | Download |
| somalogic__somascan_5k_menu/ | SomaScan 5K menu | Download |

Internal / Curated (local files)

These are populated by placing files in data/bulk/{source_id}/ and running bioingest download {source_id}:

| Source ID | Description |
|---|---|
| olink_released_library | Olink released assay library |
| olink_development_status | Assays in development |
| maximus_screening | Maximus screening data |
| maximus_assay_list | Maximus assay list |
| maximus_lod_detectability | Maximus LOD/detectability |
| clinical_biomarkers | Clinical biomarker data |
| ms_publications | Mass spec publications |
| publication_markers | Publication-derived markers |
| competitors_library | Competitor product library |
| marker_reports | Marker reports |
| focus_panel_cvd | Focus Panel CVD |
| ukb_metrics | UK Biobank metrics |
| ms_thermo | MS Thermo data |
| biosimilars | Biosimilars data |
| detectability_frequency_ht | Detectability frequency (HT) |
| royalty_antibodies | Royalty antibodies |
| mab_pab_assays | mAb/pAb assays |
| mab_in_development | mAb in development |
| splenocyte_availability | Splenocyte availability |
| kol_wishlist | KOL wishlist |
| showcases | Showcases |
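
The two steps for internal sources (place the file, then run the CLI) can be sketched as follows. This is a minimal sketch, assuming it is run from the repository root and that `bioingest download` takes the source ID as its only argument, as written above; the helpers `stage_source_file` and `ingest_command` and the example file name are hypothetical:

```python
import shutil
import subprocess
from pathlib import Path


def stage_source_file(repo_root: Path, source_id: str, src_file: Path) -> Path:
    """Copy `src_file` into data/bulk/{source_id}/ and return the new path."""
    target_dir = repo_root / "data" / "bulk" / source_id
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src_file, target_dir / src_file.name))


def ingest_command(source_id: str) -> list[str]:
    """Build the CLI call described above: bioingest download {source_id}."""
    return ["bioingest", "download", source_id]


if __name__ == "__main__":
    # Hypothetical example: the file name and location are placeholders.
    staged = stage_source_file(Path("."), "kol_wishlist", Path("kol_wishlist.xlsx"))
    print(f"staged {staged}")
    subprocess.run(ingest_command("kol_wishlist"), check=True)
```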

Download Data

Download All Data

CLI Download (fastest)

# Download everything (~5 GB)
aws s3 sync s3://bioingest-datalake-357836458011/parquet/ ./all_data/ --profile dsinternal

# Download one source
aws s3 sync s3://bioingest-datalake-357836458011/parquet/uniprot__swissprot_tsv/ ./uniprot/ --profile dsinternal
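
The Parquet prefixes shown on this page appear to follow a `{source}__{dataset}` pattern (for example `uniprot__swissprot_tsv` and `competitors_msd__msd_assays`). A minimal sketch that builds the sync command for one dataset under that assumption; the pattern is inferred from those examples rather than documented, and `parquet_prefix` and `sync_command` are hypothetical helpers:

```python
BUCKET = "bioingest-datalake-357836458011"  # bucket name used in the commands above
PROFILE = "dsinternal"                      # AWS CLI profile used in the commands above


def parquet_prefix(source: str, dataset: str) -> str:
    """Build the S3 prefix, assuming the {source}__{dataset} naming seen above."""
    return f"s3://{BUCKET}/parquet/{source}__{dataset}/"


def sync_command(source: str, dataset: str, dest: str) -> list[str]:
    """Return the `aws s3 sync` call for one dataset as an argument list."""
    return ["aws", "s3", "sync", parquet_prefix(source, dataset), dest,
            "--profile", PROFILE]


if __name__ == "__main__":
    print(" ".join(sync_command("uniprot", "swissprot_tsv", "./uniprot/")))
```

Listing the bucket with `aws s3 ls s3://bioingest-datalake-357836458011/parquet/ --profile dsinternal` shows the actual prefix names, which is the safer check for sources stored as a directory rather than a single file.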

Query Examples

Python (in JupyterLab)

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()
df = pq.read_table("s3://bioingest-datalake-357836458011/parquet/uniprot__swissprot_tsv/", filesystem=fs).to_pandas()
df[df["gene_names"].str.contains("EGFR", na=False)]
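
A substring match on `gene_names` also returns any entry whose gene names merely contain the text `EGFR`. A minimal sketch of an exact-name match, assuming `gene_names` holds space-separated names as in UniProt's TSV export; `has_gene` is a hypothetical helper and the sample values are made up for illustration:

```python
def has_gene(gene_names, symbol):
    """Return True if `symbol` is one of the space-separated names in `gene_names`."""
    if not isinstance(gene_names, str):  # missing values (None/NaN) never match
        return False
    return symbol in gene_names.split()


# Illustrative values only, not taken from the dataset.
rows = ["EGFR ERBB ERBB1 HER1", "EGFRAP", None]
print([has_gene(r, "EGFR") for r in rows])  # [True, False, False]
```

With the DataFrame above, this could be applied as `df[df["gene_names"].map(lambda g: has_gene(g, "EGFR"))]`.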

R (in RStudio)

library(arrow)
library(dplyr)

open_dataset("s3://bioingest-datalake-357836458011/parquet/uniprot__swissprot_tsv/") |>
  filter(grepl("EGFR", gene_names)) |>
  collect()

SQL (in Athena)

SELECT entry, gene_names, protein_names
FROM bioingest.uniprot__swissprot_tsv
WHERE gene_names LIKE '%EGFR%';