Development

Setup

git clone https://github.com/Olink-Proteomics/bioingest.git
cd bioingest
uv sync --extra pipeline --extra scrape

Running Tests

uv run pytest                    # all tests
uv run pytest tests/test_pipeline.py -v   # pipeline tests only
uv run pytest tests/test_ontology_builder.py -v  # ontology tests only

Adding a New Data Source

1. Create bioingest/connectors/my_source.py with a class that inherits from Connector.
2. Implement list_datasets() and download().
3. Add an entry for the source to bioingest/sources.json.
4. Register the connector in bioingest/connectors/__init__.py.
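
As a rough illustration of the connector steps, the sketch below defines a stand-in Connector base class and a hypothetical MySourceConnector. The method signatures, the in-memory catalogue, and the dataset ID are all assumptions made for the example. The real base class in bioingest/connectors may expect different arguments and return types, so check it before copying anything.

```python
from abc import ABC, abstractmethod
from pathlib import Path
import tempfile


class Connector(ABC):
    """Stand-in for bioingest's Connector base class (real signatures may differ)."""

    @abstractmethod
    def list_datasets(self) -> list[str]:
        """Return the IDs of the datasets this source offers."""

    @abstractmethod
    def download(self, dataset_id: str, dest_dir: Path) -> Path:
        """Fetch one dataset into dest_dir and return the written file path."""


class MySourceConnector(Connector):
    """Hypothetical connector that serves a fixed, in-memory catalogue."""

    # Placeholder data; a real connector would query the remote source instead.
    _CATALOGUE = {"demo-001": "id,value\n1,42\n"}

    def list_datasets(self) -> list[str]:
        return sorted(self._CATALOGUE)

    def download(self, dataset_id: str, dest_dir: Path) -> Path:
        if dataset_id not in self._CATALOGUE:
            raise KeyError(f"unknown dataset: {dataset_id}")
        dest_dir.mkdir(parents=True, exist_ok=True)
        target = dest_dir / f"{dataset_id}.csv"
        target.write_text(self._CATALOGUE[dataset_id], encoding="utf-8")
        return target


if __name__ == "__main__":
    connector = MySourceConnector()
    with tempfile.TemporaryDirectory() as tmp:
        for dataset_id in connector.list_datasets():
            print(connector.download(dataset_id, Path(tmp)))
```

The sources.json entry and the registration in __init__.py are not shown, because their expected shape is defined by the project itself.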

Adding a New LLM Backend

1. Create a class implementing LLMInterface in bioingest/pipeline/factories/llm_factory.py.
2. Add a branch in get_llm() that returns it for the new service name.
3. Document the environment variables the backend requires.
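
The backend steps might look roughly like the sketch below. LLMInterface and get_llm() here are simplified stand-ins, not the project's actual definitions. The generate() method, the EchoLLM class, the "echo" service name, and the ECHO_API_KEY variable are all hypothetical names chosen for illustration.

```python
import os
from abc import ABC, abstractmethod


class LLMInterface(ABC):
    """Stand-in for the interface in llm_factory.py (real methods may differ)."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Return the model's completion for a prompt."""


class EchoLLM(LLMInterface):
    """Hypothetical backend that echoes the prompt; replace with real client calls."""

    def __init__(self, api_key: str) -> None:
        self.api_key = api_key

    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"


def get_llm(service: str) -> LLMInterface:
    """Return the backend registered under the given service name."""
    if service == "echo":  # the new branch for the new backend
        # ECHO_API_KEY is an assumed variable name; document whichever one you use.
        api_key = os.environ.get("ECHO_API_KEY")
        if not api_key:
            raise RuntimeError("ECHO_API_KEY must be set for the 'echo' backend")
        return EchoLLM(api_key)
    raise ValueError(f"unknown LLM service: {service!r}")
```

Failing fast on a missing environment variable keeps configuration errors at startup rather than in the middle of a long ingestion run.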

Project Extras

Extra	What it adds	Install
`pipeline`	neo4j, pymupdf, langchain-ollama	`uv sync --extra pipeline`
`scrape`	firecrawl-py	`uv sync --extra scrape`
`notebook`	jupyterlab, pandas, pyathena	`uv sync --extra notebook`

Architecture Decision: Why bioingest owns ingestion

See Architecture for the full rationale. In short:

- Ingestion is batch: periodic, heavy compute (LLMs, PDF parsing).
- Querying is live: always-on, low latency.
- Different dependencies: ingestion needs pymupdf and LLM libraries; querying needs FastAPI and Redis.
- bioingest already has the data: it downloads from 18+ sources and has all the raw files.
- One pipeline: download → extract → build KG → publish, all in one tool.