Data Explorer
All BioIngest data is available through Git LFS (clone the repo) or directly from S3.
Quick Access
Clone with Data
Browse S3
Launch JupyterLab
Launch RStudio
Open Athena SQL
Git LFS: the easiest way to get the data
git clone https://github.com/Olink-Proteomics/bioingest.git
cd bioingest/data/bulk/
ls # all datasets available locally
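If Git LFS is not installed, or the clone skipped the LFS download, the files under data/bulk/ may be small text pointers rather than the datasets themselves. The sketch below is one way to check for that. It relies on the fact that a Git LFS pointer file begins with the line `version https://git-lfs.github.com/spec/v1`; the helper names `is_lfs_pointer` and `unfetched_files` are hypothetical and not part of BioIngest.

```python
from pathlib import Path

# First line of every Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if the file starts with the LFS pointer header."""
    with path.open("rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def unfetched_files(root: Path) -> list[Path]:
    """List files under root that are still LFS pointers (hypothetical helper)."""
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*") if p.is_file() and is_lfs_pointer(p))


if __name__ == "__main__":
    # Path assumes the script runs from the directory that contains the clone.
    for pointer in unfetched_files(Path("bioingest/data/bulk")):
        print(pointer)
```

If this prints any paths, running `git lfs pull` inside the repository should replace the pointers with the real files.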
Browse Data
Proteins & Targets
| Dataset | Description | LFS (raw) | S3 (parquet) |
| --- | --- | --- | --- |
| uniprot/swissprot_tsv.tsv | 570K reviewed protein entries | GitHub | S3 |
| uniprot/idmapping_human.tsv | UniProt cross-references | GitHub | S3 |
| opentargets/targets/ | 63K drug targets | GitHub | S3 |
| markerdb/proteins.tsv | 4K biomarker proteins | GitHub | S3 |
| complex_portal/homo_sapiens.tsv | Human protein complexes | GitHub | S3 |
| chembl/ | ChEMBL-UniProt mapping | GitHub | S3 |
| pdb_complexes/ | PDB-UniProt mapping | GitHub | S3 |
Disease & Pathway Associations
| Dataset | Description | LFS (raw) | S3 (parquet) |
| --- | --- | --- | --- |
| diseases/ | Disease-gene associations | GitHub | S3 |
| reactome/ | 2.5M pathway mappings | GitHub | S3 |
| ttd/ | Therapeutic targets & drugs | GitHub | S3 |
| ukb_disease_assoc/ | UK Biobank associations | GitHub | S3 |
Ontologies
| Dataset | Description | LFS (raw) | S3 (parquet) |
| --- | --- | --- | --- |
| mondo/mondo.obo | 47K disease terms | GitHub | S3 |
| disease_ontology/ | 12K disease terms (DO) | GitHub | S3 |
| efo/efo.obo | 50K experimental factors | GitHub | S3 |
| gene_ontology/ | 45K GO terms | GitHub | S3 |
| mesh/ | MeSH descriptors | GitHub | S3 |
| icd/ | ICD-10-CM codes | GitHub | S3 |
Competitors
| Dataset | Description | LFS (raw) | S3 (parquet) |
| --- | --- | --- | --- |
| competitors_alamar/ | Alamar NULISAseq panels | GitHub | S3 |
| competitors_msd/msd_assays.tsv | MSD assay specs | GitHub | S3 |
| competitors_nomic/ | Nomic Omni 1000 targets | GitHub | S3 |
| competitors_quanterix/ | Quanterix Simoa specs | GitHub | S3 |
| somalogic/ | SomaScan menus + panels | GitHub | S3 |
Olink & HPA Data
| Dataset | Description | LFS (raw) | S3 (parquet) |
| --- | --- | --- | --- |
| hpa_olink/proteinatlas_full.tsv.zip | HPA full proteomics | GitHub | S3 |
| hpa_olink/rna_tissue_consensus.tsv.zip | RNA tissue expression | GitHub | S3 |
| hpa_olink/rna_single_cell_type.tsv.zip | Single-cell RNA | GitHub | S3 |
| competitors_alamar/ | Alamar NULISAseq panel targets | Download | |
| competitors_msd__msd_assays/ | MSD assay specs (LOD, range) | Download | |
| competitors_nomic__omni_1000_targets/ | Nomic Omni 1000 target list | Download | |
| competitors_quanterix__quanterix_assays/ | Quanterix Simoa assay specs | Download | |
| somalogic__somascan_11k_menu/ | SomaScan 11K menu | Download | |
| somalogic__somascan_7k_menu/ | SomaScan 7K menu | Download | |
| somalogic__somascan_5k_menu/ | SomaScan 5K menu | Download | |
Internal / Curated (local files)
These are populated by placing files in data/bulk/{source_id}/ and running bioingest download {source_id}:
| Source ID | Description |
| --- | --- |
| olink_released_library | Olink released assay library |
| olink_development_status | Assays in development |
| maximus_screening | Maximus screening data |
| maximus_assay_list | Maximus assay list |
| maximus_lod_detectability | Maximus LOD/detectability |
| clinical_biomarkers | Clinical biomarker data |
| ms_publications | Mass spec publications |
| publication_markers | Publication-derived markers |
| competitors_library | Competitor product library |
| marker_reports | Marker reports |
| focus_panel_cvd | Focus Panel CVD |
| ukb_metrics | UK Biobank metrics |
| ms_thermo | MS Thermo data |
| biosimilars | Biosimilars data |
| detectability_frequency_ht | Detectability frequency (HT) |
| royalty_antibodies | Royalty antibodies |
| mab_pab_assays | mAb/pAb assays |
| mab_in_development | mAb in development |
| splenocyte_availability | Splenocyte availability |
| kol_wishlist | KOL wishlist |
| showcases | Showcases |
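The two-step workflow for these local sources (place the file, then run the ingest command) can be sketched as a small helper. This is a minimal illustration, not part of BioIngest: the function name `stage_source_file` and the `repo_root` parameter are hypothetical, and the command is only built from the `bioingest download {source_id}` form quoted above rather than executed.

```python
import shutil
from pathlib import Path


def stage_source_file(src: Path, source_id: str, repo_root: Path = Path(".")) -> list[str]:
    """Copy src into data/bulk/{source_id}/ and return the ingest command.

    Hypothetical helper: the directory layout and command come from the
    text above; nothing here calls into BioIngest itself.
    """
    target_dir = repo_root / "data" / "bulk" / source_id
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, target_dir / src.name)
    # Returned as an argument list so the caller can decide when to run it.
    return ["bioingest", "download", source_id]
```

A caller might then run the returned command with `subprocess.run(cmd, check=True)` from the repository root, assuming the `bioingest` CLI is on the PATH.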
Download Data
Download All Data
CLI Download (fastest)
# Download everything (~5 GB)
aws s3 sync s3://bioingest-datalake-357836458011/parquet/ ./all_data/ --profile dsinternal
# Download one source
aws s3 sync s3://bioingest-datalake-357836458011/parquet/uniprot__swissprot_tsv/ ./uniprot/ --profile dsinternal
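The parquet prefix in the single-source command appears to be derived from the raw path by joining its components with a double underscore and dropping the file extension (`uniprot/swissprot_tsv.tsv` becomes `uniprot__swissprot_tsv`). The sketch below encodes that pattern. The rule is an inference from the examples on this page, not a documented guarantee, and `parquet_uri` and `sync_command` are hypothetical helper names. It may not hold for directory-level entries such as `chembl/`, so browsing S3 remains the reliable way to confirm a prefix.

```python
from pathlib import PurePosixPath

BUCKET = "bioingest-datalake-357836458011"


def parquet_uri(raw_path: str) -> str:
    """Guess the S3 parquet prefix for a raw dataset path (assumed naming rule)."""
    parts = PurePosixPath(raw_path.strip("/")).parts
    # Drop every extension on the file name, e.g. ".tsv" or ".tsv.zip".
    stem = parts[-1].split(".", 1)[0]
    name = "__".join([*parts[:-1], stem])
    return f"s3://{BUCKET}/parquet/{name}/"


def sync_command(raw_path: str, dest: str, profile: str = "dsinternal") -> list[str]:
    """Build the aws s3 sync argument list shown above for one source."""
    return ["aws", "s3", "sync", parquet_uri(raw_path), dest, "--profile", profile]


if __name__ == "__main__":
    print(" ".join(sync_command("uniprot/swissprot_tsv.tsv", "./uniprot/")))
```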
Query Examples
Python (in JupyterLab)
import pyarrow.parquet as pq
import s3fs
fs = s3fs.S3FileSystem()
df = pq.read_table("s3://bioingest-datalake-357836458011/parquet/uniprot__swissprot_tsv/", filesystem=fs).to_pandas()
df[df["gene_names"].str.contains("EGFR", na=False)]
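The `str.contains("EGFR")` filter above is a substring match, so it also keeps rows whose gene names merely contain those letters. When an exact symbol is wanted, one option is to split the column into tokens first. The sketch below assumes `gene_names` holds whitespace-separated symbols, which should be checked against the actual table. `rows_with_gene` is a hypothetical helper, and the toy frame uses made-up values purely for illustration.

```python
import pandas as pd


def rows_with_gene(df: pd.DataFrame, symbol: str, column: str = "gene_names") -> pd.DataFrame:
    """Keep rows where symbol is one of the whitespace-separated names in column."""
    # Missing values become empty strings, which split into an empty token list.
    tokens = df[column].fillna("").str.split()
    mask = tokens.apply(lambda names: symbol in names)
    return df[mask]


# Toy frame standing in for the Swiss-Prot table; entries and names are illustrative only.
sample = pd.DataFrame(
    {
        "entry": ["ROW_A", "ROW_B", "ROW_C"],
        "gene_names": ["EGFR ERBB", "EGFRX", None],
    }
)
print(rows_with_gene(sample, "EGFR")["entry"].tolist())
```

On the toy frame, only the row that lists `EGFR` as a whole token is kept. A substring match would also have returned the `EGFRX` row.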
R (in RStudio)
library(arrow)
library(dplyr)
open_dataset("s3://bioingest-datalake-357836458011/parquet/uniprot__swissprot_tsv/") |>
filter(grepl("EGFR", gene_names)) |>
collect()
SQL (in Athena)
SELECT entry, gene_names, protein_names
FROM bioingest.uniprot__swissprot_tsv
WHERE gene_names LIKE '%EGFR%';