
Getting Started

Two Commands

BioIngest ships as two entry points:

Command         What it does
bioingest       Interactive TUI: guided workflow with menus
bioingest-cli   Non-interactive CLI: scriptable commands for automation

uv run bioingest              # launch interactive workflow navigator
uv run bioingest-cli --help   # see all CLI commands

The TUI calls the same commands under the hood — pick whichever fits your workflow.

Install

git clone https://github.com/Olink-Proteomics/bioingest.git
cd bioingest
uv sync                        # core (download + publish)
uv sync --extra pipeline       # + KG construction (Neo4j, LLM, PDF)
uv sync --extra scrape         # + web scraping (Firecrawl)
uv sync --extra notebook       # + Jupyter exploration

The Pipeline

flowchart LR
    F[① Fetch] --> T[② Transform] --> P[③ Publish] --> Q[④ Query]
    F -.- F1[download / scrape / pull]
    T -.- T1[scripts / normalize]
    P -.- P1[S3 Parquet + Athena]
    Q -.- Q1[SQL / Jupyter / TUI]

① Fetch — Acquire raw data

Download from 18 public databases, scrape competitor sites, or pull individual records.

# Download everything (skips files > 2 GB)
bioingest-cli download-all --all --max-size 2gb

# Single source
bioingest-cli download uniprot

# Scrape competitor assay specs
bioingest-cli scrape competitors quanterix

# Pull individual records
bioingest-cli pull pubmed --query "EGFR biomarker" --limit 50

② Transform — Edit before publishing

Run Python scripts to filter, merge, or reshape data in data/bulk/ before it goes to S3.

# Example: filter UniProt to human-only
python scripts/filter_uniprot.py

# Normalize cached API responses into the warehouse
bioingest-cli normalize

The interactive TUI (② Transform / Edit) lets you run scripts, preview files, and browse local data without memorizing paths.
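
The contents of scripts/filter_uniprot.py are not shown here, so the following is only a minimal sketch of what a human-only filter could look like. It assumes a tab-separated UniProt export with an "Organism (ID)" column holding NCBI taxonomy IDs; the function name and column name are illustrative, not taken from the BioIngest codebase.

```python
import csv
import io

HUMAN_TAXON_ID = "9606"  # NCBI taxonomy ID for Homo sapiens


def filter_human(src, dst, organism_col="Organism (ID)"):
    """Copy rows whose organism column is the human taxon ID.

    src and dst are text file objects; returns the number of rows kept.
    The column name is an assumption about the export layout.
    """
    reader = csv.DictReader(src, delimiter="\t")
    writer = csv.DictWriter(
        dst, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    kept = 0
    for row in reader:
        if row[organism_col].strip() == HUMAN_TAXON_ID:
            writer.writerow(row)
            kept += 1
    return kept


if __name__ == "__main__":
    # Tiny in-memory sample: one human entry and one mouse entry.
    sample = "Entry\tOrganism (ID)\nP00533\t9606\nQ01279\t10090\n"
    out = io.StringIO()
    print(filter_human(io.StringIO(sample), out), "row(s) kept")
    print(out.getvalue(), end="")
```

For real files, pass objects opened with open(path, newline="") and write to a new file under data/bulk/ rather than overwriting the input, so the raw download stays available for re-runs.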

③ Publish — Push to S3 & Athena

Convert local files to Parquet and create queryable Athena tables.

bioingest-cli publish --crawl              # all sources + update Glue catalog
bioingest-cli publish --source uniprot     # just one source
bioingest-cli publish --dry-run            # preview without uploading

See Publishing for format handling and infrastructure details.

④ Query — Explore your data lake

bioingest-cli query target EGFR            # query normalized records

Or use Athena SQL, Jupyter, or the TUI's query prompt. See Data Explorer for access methods and dataset catalog.
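
Because bioingest-cli is non-interactive, the stages above can be chained from a script. The sketch below runs three of the commands shown on this page in order and stops at the first failure. It assumes bioingest-cli is on PATH and returns a non-zero exit code on error, which this page does not state; run_pipeline and STEPS are hypothetical names.

```python
import subprocess

# Commands copied from the examples on this page; swap in other sources or flags.
STEPS = [
    ["bioingest-cli", "download", "uniprot"],   # stage 1: fetch
    ["bioingest-cli", "normalize"],             # stage 2: transform
    ["bioingest-cli", "publish", "--dry-run"],  # stage 3: preview without uploading
]


def run_pipeline(steps=STEPS, runner=subprocess.run):
    """Run each command in order and return the list of commands that completed.

    With check=True, subprocess.run raises CalledProcessError on a non-zero
    exit code, so later stages never run after a failed one.
    """
    done = []
    for cmd in steps:
        runner(cmd, check=True)
        done.append(cmd)
    return done


if __name__ == "__main__":
    run_pipeline()
```

The runner parameter exists only so the sequence can be exercised with a stub instead of the real CLI. If the tool is only installed in the project environment, prefix each command with "uv", "run".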

Interactive TUI

Launch with uv run bioingest:

BioIngest — Data Pipeline
─────────────────────────
  ① Fetch Data           download, scrape, or pull raw data
  ② Transform / Edit     run scripts, preview files, normalize
  ③ Publish to S3/Athena convert to Parquet, upload, create tables
  ④ Query / Explore      Athena SQL, Jupyter, browse tables
  ⑤ KG Ingestion         build knowledge graph from PDFs, PubMed, bioRxiv
  ⑥ Pipeline Status      overview of local data + AWS state
  Quit

⑤ KG Ingestion

  Ingest PubMed Abstracts         fetch + extract entities from PubMed
  Ingest bioRxiv Preprints        fetch + extract from bioRxiv API
  Ingest PMC Full-Text            fetch + extract from PMC open access
  Ingest Local PDFs               multi-strategy PDF extraction + LLM
  Ingest Local TSV/CSV            convert tabular data to KG triples
  Load Structured Data to Neptune STRING + Reactome + DISEASES → Neptune directly
  Score Evidence on Neptune Edges cross-source validation on existing edges
  Bulk Ingest (20 queries)        2000 papers across 20 biomarker queries
  View Last Run Report            show results from last ingestion run
  ← Back

② Transform (additions)

  ...existing items...
  Build Mapping Tables            protein/disease/drug cross-reference maps
  Push Mapping Graph to Neptune   201K SAME_AS + BROADER_THAN edges
  ← Back

⑥ Pipeline Status

  Local Data
    Sources: 20  |  Files: 94

  Mapping Tables
    protein_map: 26,499 rows
    disease_map: 31,884 rows
    drug_map:    42,939 rows

  Graph Triples: 201,961 (mapping_graph.jsonl)

  Last Ingestion Run
    Run: 20260607_165759
    Docs: 49 | Nodes: 461 | Rels: 394 | Errors: 0

  AWS Data Lake
    Connected  |  Tables: 66  |  Bucket: bioingest-datalake-...

Each menu option has guided prompts — no need to remember command flags.

Minimal .env

# AWS (for publish + query)
AWS_PROFILE=dsinternal
AWS_REGION=eu-north-1

# PubMed (optional, increases rate limit)
ENTREZ_API_KEY=your-key

# Firecrawl (optional, for scrape commands)
FIRECRAWL_API_KEY=fc-your-key

# KG construction (optional)
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=secret
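
Before running publish, scrape, or KG commands it can help to check which of these variables are actually set. The following is a hypothetical preflight sketch, not part of BioIngest: it parses simple KEY=value lines and groups the keys by the comments in the example above. ENTREZ_API_KEY is left out because it only raises the rate limit.

```python
from pathlib import Path

# Grouping follows the comments in the example .env; the feature labels are illustrative.
KEYS_BY_FEATURE = {
    "publish + query": ["AWS_PROFILE", "AWS_REGION"],
    "scrape": ["FIRECRAWL_API_KEY"],
    "KG construction": ["NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD"],
}


def parse_env(text):
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values):
    """Return {feature: [unset keys]} for features that are not fully configured."""
    report = {}
    for feature, keys in KEYS_BY_FEATURE.items():
        unset = [k for k in keys if not values.get(k)]
        if unset:
            report[feature] = unset
    return report


if __name__ == "__main__":
    env_file = Path(".env")
    values = parse_env(env_file.read_text()) if env_file.exists() else {}
    for feature, unset in missing_keys(values).items():
        print(f"{feature}: missing {', '.join(unset)}")
```

This parser handles only the plain KEY=value form used above; quoted values and export prefixes are out of scope.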

Next Steps