Development

Setup

git clone https://github.com/Olink-Proteomics/bioingest.git
cd bioingest
uv sync --extra pipeline --extra scrape

Running Tests

uv run pytest                    # all tests
uv run pytest tests/test_pipeline.py -v   # pipeline tests only
uv run pytest tests/test_ontology_builder.py -v  # ontology tests only

Adding a New Data Source

  1. Create bioingest/connectors/my_source.py with a class that inherits from Connector
  2. Implement list_datasets() and download()
  3. Add an entry for the source to bioingest/sources.json
  4. Register the connector in bioingest/connectors/__init__.py
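
The first two steps can be sketched as follows. This is a minimal illustration, not the project's actual code: the Connector base class here is a local stand-in, the method signatures (argument names, return types) are assumptions, and MySourceConnector with its in-memory catalog is hypothetical. Check the real base class in bioingest/connectors before copying anything.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class Connector(ABC):
    """Stand-in for bioingest's Connector base class (real signatures may differ)."""

    @abstractmethod
    def list_datasets(self) -> list[str]:
        """Return the identifiers of the datasets this source offers."""

    @abstractmethod
    def download(self, dataset_id: str, dest_dir: Path) -> Path:
        """Fetch one dataset into dest_dir and return the written file path."""


class MySourceConnector(Connector):
    """Hypothetical connector; a real one would call the source's API."""

    # Hypothetical in-memory catalog standing in for a remote source.
    _CATALOG = {"demo-001": "gene,score\nTP53,0.9\n"}

    def list_datasets(self) -> list[str]:
        return sorted(self._CATALOG)

    def download(self, dataset_id: str, dest_dir: Path) -> Path:
        if dataset_id not in self._CATALOG:
            raise KeyError(f"unknown dataset: {dataset_id}")
        dest_dir.mkdir(parents=True, exist_ok=True)
        target = dest_dir / f"{dataset_id}.csv"
        target.write_text(self._CATALOG[dataset_id], encoding="utf-8")
        return target


if __name__ == "__main__":
    connector = MySourceConnector()
    with tempfile.TemporaryDirectory() as tmp:
        for dataset_id in connector.list_datasets():
            print(connector.download(dataset_id, Path(tmp)))
```

Keeping list_datasets() free of side effects and having download() return the written path makes a connector easy to exercise in a test without touching the network.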

Adding a New LLM Backend

  1. Create a class implementing LLMInterface in bioingest/pipeline/factories/llm_factory.py
  2. Add a branch in get_llm() for the new service name
  3. Document the required env vars
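
As a rough sketch of those steps, assuming LLMInterface exposes a single text-generation method: the interface below is a local stand-in, and the generate() method name, the EchoLLM class, the "echo" service name, and the ECHO_LLM_PREFIX variable are all hypothetical. The real interface and get_llm() in llm_factory.py may look different.

```python
import os
from abc import ABC, abstractmethod


class LLMInterface(ABC):
    """Stand-in for the interface in llm_factory.py (method name is assumed)."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Return the model's completion for the prompt."""


class EchoLLM(LLMInterface):
    """Hypothetical backend that echoes the prompt; for illustration only."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix

    def generate(self, prompt: str) -> str:
        return f"{self.prefix}{prompt}"


def get_llm(service: str) -> LLMInterface:
    """Return the backend registered under the given service name."""
    if service == "echo":
        # ECHO_LLM_PREFIX is a hypothetical env var; document real ones (step 3).
        return EchoLLM(prefix=os.environ.get("ECHO_LLM_PREFIX", "echo: "))
    raise ValueError(f"unknown LLM service: {service!r}")
```

Reading configuration from the environment inside the get_llm() branch keeps credentials out of the code and gives step 3 a single place to document.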

Project Extras

Extra      What it adds                        Install
pipeline   neo4j, pymupdf, langchain-ollama    uv sync --extra pipeline
scrape     firecrawl-py                        uv sync --extra scrape
notebook   jupyterlab, pandas, pyathena        uv sync --extra notebook

Architecture Decision: Why bioingest owns ingestion

See Architecture for the full rationale. In short:

  • Ingestion is batch — periodic, heavy compute (LLMs, PDF parsing)
  • Querying is live — always-on, low latency
  • Different dependencies — ingestion needs pymupdf, LLM libs; querying needs FastAPI, Redis
  • bioingest already has the data — downloads from 18+ sources, has all raw files
  • One pipeline — download → extract → build KG → publish, all in one tool
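
To make the "one pipeline" point concrete, here is a toy sketch of four stages run in sequence over a shared state. It does not reflect bioingest's real module layout: every function name, state key, and value below is hypothetical.

```python
from typing import Any, Callable

State = dict[str, Any]
Stage = Callable[[State], State]


def download(state: State) -> State:
    # Hypothetical: pretend one raw file was fetched from a source.
    return {**state, "raw_files": ["example.pdf"]}


def extract(state: State) -> State:
    # Hypothetical: produce one extracted record per raw file.
    records = [{"source": name} for name in state["raw_files"]]
    return {**state, "records": records}


def build_kg(state: State) -> State:
    # Hypothetical: summarize the graph as a node count.
    return {**state, "graph": {"nodes": len(state["records"])}}


def publish(state: State) -> State:
    # Hypothetical: mark the graph as published.
    return {**state, "published": True}


def run_pipeline(stages: list[Stage]) -> State:
    """Run each stage in order, passing the accumulated state along."""
    state: State = {}
    for stage in stages:
        state = stage(state)
    return state


if __name__ == "__main__":
    print(run_pipeline([download, extract, build_kg, publish]))
```

Because each stage only reads what earlier stages produced, the whole batch can run as one periodic job, separate from the always-on query service.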