Development
Setup
git clone https://github.com/Olink-Proteomics/bioingest.git
cd bioingest
uv sync --extra pipeline --extra scrape
Running Tests
uv run pytest # all tests
uv run pytest tests/test_pipeline.py -v # pipeline tests only
uv run pytest tests/test_ontology_builder.py -v # ontology tests only
Adding a New Data Source
- Create `bioingest/connectors/my_source.py` inheriting from `Connector`
- Implement `list_datasets()` and `download()`
- Add entry to `bioingest/sources.json`
- Register in `bioingest/connectors/__init__.py`
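The first two steps above can be sketched as follows. This is a minimal sketch, not bioingest's actual API: the `Connector` base class is a stand-in defined inline so the example runs by itself, and the method signatures, the `MySourceConnector` name, the dataset IDs, and the return types are all assumptions. In the real package you would inherit from the project's own `Connector` and match its signatures.

```python
from abc import ABC, abstractmethod
from pathlib import Path
import tempfile


class Connector(ABC):
    """Stand-in for bioingest's Connector base class (real signatures may differ)."""

    @abstractmethod
    def list_datasets(self) -> list[str]: ...

    @abstractmethod
    def download(self, dataset_id: str, dest: Path) -> Path: ...


class MySourceConnector(Connector):
    """Hypothetical connector for an imaginary source with two datasets."""

    # Hard-coded catalogue; a real connector would query the remote source.
    _DATASETS = {
        "demo-001": "id,value\n1,0.5\n",
        "demo-002": "id,value\n2,1.5\n",
    }

    def list_datasets(self) -> list[str]:
        return sorted(self._DATASETS)

    def download(self, dataset_id: str, dest: Path) -> Path:
        if dataset_id not in self._DATASETS:
            raise KeyError(f"unknown dataset: {dataset_id}")
        dest.mkdir(parents=True, exist_ok=True)
        target = dest / f"{dataset_id}.csv"
        # Write the payload to disk; a real connector would fetch it remotely.
        target.write_text(self._DATASETS[dataset_id])
        return target


if __name__ == "__main__":
    connector = MySourceConnector()
    with tempfile.TemporaryDirectory() as tmp:
        for dataset_id in connector.list_datasets():
            print(connector.download(dataset_id, Path(tmp)))
```

Keeping `list_datasets()` cheap and side-effect free, and letting `download()` return the path it wrote, makes a connector easy to exercise in a unit test with a temporary directory.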
Adding a New LLM Backend
- Create a class implementing `LLMInterface` in `bioingest/pipeline/factories/llm_factory.py`
- Add a branch in `get_llm()` for the new service name
- Document the required env vars
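A rough sketch of what those steps might look like is shown here. The `LLMInterface` definition is a stand-in so the example is self-contained; its `generate` method, the `EchoLLM` class, the `"echo"` service name, and the `ECHO_API_KEY` variable are all hypothetical, and the real `get_llm()` may take different arguments.

```python
import os
from abc import ABC, abstractmethod


class LLMInterface(ABC):
    """Stand-in for bioingest's LLMInterface (the real method names may differ)."""

    @abstractmethod
    def generate(self, prompt: str) -> str: ...


class EchoLLM(LLMInterface):
    """Hypothetical backend that echoes the prompt; swap in a real client call."""

    def __init__(self, api_key: str) -> None:
        self.api_key = api_key

    def generate(self, prompt: str) -> str:
        return f"[echo] {prompt}"


def get_llm(service: str) -> LLMInterface:
    """Return a backend instance for the given service name."""
    if service == "echo":  # the new branch for the new service name
        # ECHO_API_KEY is a made-up name; document whichever vars you require.
        api_key = os.environ.get("ECHO_API_KEY")
        if not api_key:
            raise RuntimeError("ECHO_API_KEY must be set for the 'echo' backend")
        return EchoLLM(api_key)
    raise ValueError(f"unknown LLM service: {service!r}")
```

Failing fast with a clear message when a required variable is missing doubles as documentation of the env vars the backend needs.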
Project Extras
| Extra | What it adds | Install |
|---|---|---|
| `pipeline` | neo4j, pymupdf, langchain-ollama | `uv sync --extra pipeline` |
| `scrape` | firecrawl-py | `uv sync --extra scrape` |
| `notebook` | jupyterlab, pandas, pyathena | `uv sync --extra notebook` |
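To check which extras are actually installed in the current environment, something like the following could be used. It is only a sketch: the import names are assumptions (a package's import name can differ from its distribution name), and `EXTRA_MODULES` and `missing_modules` are illustrative names, not part of bioingest.

```python
from importlib.util import find_spec

# Assumed import names for the packages each extra adds; verify against the
# installed distributions, since import and distribution names can differ.
EXTRA_MODULES = {
    "pipeline": ["neo4j", "pymupdf", "langchain_ollama"],
    "scrape": ["firecrawl"],
    "notebook": ["jupyterlab", "pandas", "pyathena"],
}


def missing_modules(extra: str) -> list[str]:
    """Return the modules of an extra that cannot be found in this environment."""
    return [name for name in EXTRA_MODULES[extra] if find_spec(name) is None]


if __name__ == "__main__":
    for extra in EXTRA_MODULES:
        missing = missing_modules(extra)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{extra}: {status}")
```

`find_spec` only looks a module up without importing it, so the check stays fast even for heavy dependencies.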
Architecture Decision: Why bioingest owns ingestion
See Architecture for the full rationale. In short:
- Ingestion is batch — periodic, heavy compute (LLMs, PDF parsing)
- Querying is live — always-on, low latency
- Different dependencies — ingestion needs pymupdf, LLM libs; querying needs FastAPI, Redis
- bioingest already has the data — downloads from 18+ sources, has all raw files
- One pipeline — download → extract → build KG → publish, all in one tool
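The "one pipeline" point can be illustrated with a toy sketch in which each stage consumes the previous stage's output. Every function name, the `state` dictionary, and the placeholder values are hypothetical; this shows the shape of a single batch run, not how bioingest implements it.

```python
from typing import Callable

# Each stage takes the accumulated state and returns an extended copy.
Stage = Callable[[dict], dict]


def download(state: dict) -> dict:
    return {**state, "raw_files": ["paper.pdf"]}  # placeholder file list


def extract(state: dict) -> dict:
    entities = [f"entity-from-{name}" for name in state["raw_files"]]
    return {**state, "entities": entities}


def build_kg(state: dict) -> dict:
    return {**state, "graph_nodes": len(state["entities"])}


def publish(state: dict) -> dict:
    return {**state, "published": True}


PIPELINE: list[Stage] = [download, extract, build_kg, publish]


def run(stages: list[Stage]) -> dict:
    """Run the stages in order, threading the state through each one."""
    state: dict = {}
    for stage in stages:
        state = stage(state)
    return state


if __name__ == "__main__":
    print(run(PIPELINE))
```

Because every stage reads what the previous one produced, running them in one batch tool avoids handing raw files across a service boundary.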