Getting Started
Two Commands
BioIngest ships as two entry points:
| Command | What it does |
|---|---|
| bioingest | Interactive TUI: guided workflow with menus |
| bioingest-cli | Non-interactive CLI: scriptable commands for automation |
uv run bioingest # launch interactive workflow navigator
uv run bioingest-cli --help # see all CLI commands
The TUI calls the same commands under the hood — pick whichever fits your workflow.
Install
git clone https://github.com/Olink-Proteomics/bioingest.git
cd bioingest
uv sync # core (download + publish)
uv sync --extra pipeline # + KG construction (Neo4j, LLM, PDF)
uv sync --extra scrape # + web scraping (Firecrawl)
uv sync --extra notebook # + Jupyter exploration
The Pipeline
flowchart LR
F[① Fetch] --> T[② Transform] --> P[③ Publish] --> Q[④ Query]
F -.- F1[download / scrape / pull]
T -.- T1[scripts / normalize]
P -.- P1[S3 Parquet + Athena]
Q -.- Q1[SQL / Jupyter / TUI]
① Fetch: Acquire raw data
Download from 18 public databases, scrape competitor sites, or pull individual records.
# Download everything (skips files > 2 GB)
bioingest-cli download-all --all --max-size 2gb
# Single source
bioingest-cli download uniprot
# Scrape competitor assay specs
bioingest-cli scrape competitors quanterix
# Pull individual records
bioingest-cli pull pubmed --query "EGFR biomarker" --limit 50
② Transform: Edit before publishing
Run Python scripts to filter, merge, or reshape data in data/bulk/ before it goes to S3.
# Example: filter UniProt to human-only
python scripts/filter_uniprot.py
# Normalize cached API responses into the warehouse
bioingest-cli normalize
The interactive TUI (② Transform / Edit) lets you run scripts, preview files, and browse local data without memorizing paths.
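To illustrate the kind of script this step runs, here is a minimal sketch of a human-only filter. It is not the repository's `scripts/filter_uniprot.py`: the input layout (a tab-separated export with an `Organism (ID)` column) and the function name `filter_human` are assumptions made for the example. The one fixed fact is that 9606 is the NCBI Taxonomy ID for Homo sapiens.

```python
import csv
from pathlib import Path

HUMAN_TAXON_ID = "9606"  # NCBI Taxonomy ID for Homo sapiens


def filter_human(src: Path, dst: Path, taxon_column: str = "Organism (ID)") -> int:
    """Write rows of a TSV whose taxon column is 9606 to dst; return the number kept."""
    kept = 0
    with src.open(newline="", encoding="utf-8") as fin, \
            dst.open("w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin, delimiter="\t")
        # Fail early if the assumed column is absent from the header row.
        if not reader.fieldnames or taxon_column not in reader.fieldnames:
            raise ValueError(f"column {taxon_column!r} not found in {src}")
        writer = csv.DictWriter(
            fout, fieldnames=reader.fieldnames, delimiter="\t",
            lineterminator="\n", extrasaction="ignore",
        )
        writer.writeheader()
        for row in reader:
            # Short rows yield None for missing fields, so fall back to "".
            if (row.get(taxon_column) or "").strip() == HUMAN_TAXON_ID:
                writer.writerow(row)
                kept += 1
    return kept
```

Point `src` and `dst` at files under data/bulk/ (the exact file names depend on what the download step produced) and run the script before publishing, so that only the filtered file is converted to Parquet.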
③ Publish: Push to S3 & Athena
Convert local files to Parquet and create queryable Athena tables.
bioingest-cli publish --crawl # all sources + update Glue catalog
bioingest-cli publish --source uniprot # just one source
bioingest-cli publish --dry-run # preview without uploading
See Publishing for format handling and infrastructure details.
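For automation, the fetch, normalize, and publish commands can be chained from a script. The sketch below shows one way to do that with `subprocess`. It uses only commands and flags that appear on this page, but combining `--source` with `--dry-run` in one call is an assumption, as is running `normalize` after a bulk download. The helper names `build_steps` and `run_pipeline` are illustrative and not part of BioIngest.

```python
import subprocess

CLI = ["uv", "run", "bioingest-cli"]  # same launcher as `uv run bioingest-cli --help`


def build_steps(source: str, dry_run: bool = True) -> list[list[str]]:
    """Return the CLI invocations for one source, in pipeline order."""
    publish = CLI + ["publish", "--source", source]
    if dry_run:
        publish.append("--dry-run")  # preview without uploading
    return [
        CLI + ["download", source],  # fetch
        CLI + ["normalize"],         # transform (drop if not needed for this source)
        publish,                     # publish
    ]


def run_pipeline(source: str, dry_run: bool = True) -> None:
    """Run each step in order and stop at the first failing command."""
    for cmd in build_steps(source, dry_run):
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("uniprot")` echoes each command before running it. Because `check=True` raises on a non-zero exit code, a failed download never reaches the publish step. The default of `dry_run=True` keeps the sketch from uploading anything unless you pass `dry_run=False`.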
④ Query: Explore your data lake
Query the published tables with Athena SQL, Jupyter, or the TUI's query prompt. See Data Explorer for access methods and the dataset catalog.
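For scripted access, Athena queries can also be submitted from Python with boto3. The following is a sketch under stated assumptions rather than BioIngest's own query code: the database name, table name, and results location are placeholders you must replace, and the helper names are hypothetical. The boto3 calls themselves (`client("athena")` and `start_query_execution`) are the standard Athena API.

```python
def athena_request(sql: str, database: str, output_location: str) -> dict:
    """Build the keyword arguments for boto3's Athena start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def submit(sql: str, database: str, output_location: str,
           region: str = "eu-north-1") -> str:
    """Submit a query and return its execution ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so athena_request works without boto3 installed

    athena = boto3.client("athena", region_name=region)
    response = athena.start_query_execution(
        **athena_request(sql, database, output_location)
    )
    return response["QueryExecutionId"]
```

A call might look like `submit("SELECT * FROM my_table LIMIT 10", "my_database", "s3://my-results-bucket/athena/")`, where all three arguments are placeholders. The region default matches the `AWS_REGION` value in the sample .env on this page. The returned ID can then be passed to `get_query_execution` and `get_query_results` to poll for completion and read the rows.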
Interactive TUI
Launch with uv run bioingest:
BioIngest — Data Pipeline
─────────────────────────
① Fetch Data             download, scrape, or pull raw data
② Transform / Edit       run scripts, preview files, normalize
③ Publish to S3/Athena   convert to Parquet, upload, create tables
④ Query / Explore        Athena SQL, Jupyter, browse tables
⑤ KG Ingestion           build knowledge graph from PDFs, PubMed, bioRxiv
⑥ Pipeline Status        overview of local data + AWS state
Quit
⑤ KG Ingestion
Ingest PubMed Abstracts           fetch + extract entities from PubMed
Ingest bioRxiv Preprints          fetch + extract from bioRxiv API
Ingest PMC Full-Text              fetch + extract from PMC open access
Ingest Local PDFs                 multi-strategy PDF extraction + LLM
Ingest Local TSV/CSV              convert tabular data to KG triples
Load Structured Data to Neptune   STRING + Reactome + DISEASES → Neptune directly
Score Evidence on Neptune Edges   cross-source validation on existing edges
Bulk Ingest (20 queries)          2000 papers across 20 biomarker queries
View Last Run Report              show results from last ingestion run
← Back
② Transform (additions)
...existing items...
Build Mapping Tables            protein/disease/drug cross-reference maps
Push Mapping Graph to Neptune   201K SAME_AS + BROADER_THAN edges
← Back
⑥ Pipeline Status
Local Data
Sources: 20 | Files: 94
Mapping Tables
protein_map: 26,499 rows
disease_map: 31,884 rows
drug_map: 42,939 rows
Graph Triples: 201,961 (mapping_graph.jsonl)
Last Ingestion Run
Run: 20260607_165759
Docs: 49 | Nodes: 461 | Rels: 394 | Errors: 0
AWS Data Lake
Connected | Tables: 66 | Bucket: bioingest-datalake-...
Each menu option has guided prompts — no need to remember command flags.
Minimal .env
# AWS (for publish + query)
AWS_PROFILE=dsinternal
AWS_REGION=eu-north-1
# PubMed (optional, increases rate limit)
ENTREZ_API_KEY=your-key
# Firecrawl (optional, for scrape commands)
FIRECRAWL_API_KEY=fc-your-key
# KG construction (optional)
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=secret
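Before running a stage, it can help to check which of these variables are actually set. The sketch below parses a .env file with the standard library and reports missing variables per feature. The grouping follows the comments in the sample above, and the helper names are illustrative. It does not claim to reflect how BioIngest itself loads configuration.

```python
import os
from pathlib import Path

# Groups follow the comments in the sample .env; the labels are descriptive only.
FEATURES = {
    "publish + query": ["AWS_PROFILE", "AWS_REGION"],
    "PubMed rate limit": ["ENTREZ_API_KEY"],
    "scrape commands": ["FIRECRAWL_API_KEY"],
    "KG construction": ["NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD"],
}


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split on the first "=" only
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_by_feature(env: dict[str, str]) -> dict[str, list[str]]:
    """Map each feature to the variables that are unset or empty."""
    return {
        feature: [name for name in names if not env.get(name)]
        for feature, names in FEATURES.items()
    }


if __name__ == "__main__":
    env_file = Path(".env")
    file_env = parse_env(env_file.read_text()) if env_file.exists() else {}
    # Values already exported in the shell take precedence over the file.
    for feature, missing in missing_by_feature({**file_env, **os.environ}).items():
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{feature}: {status}")
```

Only the AWS group matters for publish and query. The other three groups are optional, so a "missing" line for them is informational rather than an error.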
Next Steps
- Publishing — Format conversion, S3 upload, Athena infrastructure
- Data Explorer — Browse all datasets, download links, query examples
- Ingestion Pipeline — KG extraction from documents
- Ontology Builder — Building ontology graphs
- Configuration — Full environment variables reference