System Design
Package Structure
bioingest/
├── cli.py # CLI entry point
├── config.py # Base settings (.env loader)
├── connectors/ # 18 source-specific downloaders
├── download.py # Resumable file downloader
├── publish.py # S3/Parquet/Athena publisher
├── ontology_builder.py # OBO parsing + LLM ontology extraction
├── scraper.py # Firecrawl web scraping
└── pipeline/ # KG construction pipeline
    ├── config.py # Pipeline config (Neo4j + LLM + AWS)
    ├── models.py # Chunk, ProcessedDocument
    ├── extract.py # Content extraction (PDF, TSV, HTML, TXT)
    ├── kg.py # LLM extraction + Neo4j writer
    ├── sources.py # Remote fetchers (PubMed, bioRxiv, PMC)
    └── factories/
        ├── llm_factory.py # Bedrock, Ollama, SageMaker
        └── embedding_factory.py
Data Flow
flowchart TD
subgraph acquire["1. Acquire"]
DL[Bulk Download<br/>18 sources]
PM[PubMed API]
BR[bioRxiv API]
PMC[PMC API]
LP[Local PDFs/TSVs]
end
subgraph extract["2. Extract"]
CE[Content Extraction<br/>PDF→text, TSV→markdown]
CH[Text Chunking<br/>sentence-boundary-aware]
end
subgraph transform["3. Transform"]
LLM[LLM Extraction<br/>entities + relationships]
OBO[OBO Parser<br/>deterministic ontology]
end
subgraph load["4. Load"]
N4J[Neo4j Writer<br/>batched MERGE]
S3[S3 Publisher<br/>Parquet + Athena]
JL[JSONL Export<br/>graphrag_api compatible]
end
DL --> CE
PM --> CH
BR --> CH
PMC --> CH
LP --> CE
CE --> CH
CH --> LLM
DL --> OBO
LLM --> N4J
LLM --> JL
OBO --> N4J
OBO --> JL
DL --> S3
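
The "Text Chunking" step in the flow above is described as sentence-boundary-aware. A minimal sketch of that idea follows, assuming a naive regex sentence splitter and a character budget per chunk. The function name `chunk_text` and the `max_chars` parameter are hypothetical and are not taken from the BioIngest code. The splitter will mis-split abbreviations such as "et al.", so a real pipeline would likely need something more robust.

```python
import re

# Naive sentence boundary: whitespace that follows '.', '!' or '?'.
# This is an assumption for illustration, not BioIngest's actual splitter.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized
    chunk instead of being cut mid-sentence.
    """
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Because chunks only ever break between sentences, each chunk passed on to LLM extraction contains complete statements, at the cost of somewhat uneven chunk sizes.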
Relationship to graphrag_api
| Concern | BioIngest | graphrag_api |
| --- | --- | --- |
| Data acquisition | ✅ Download, scrape, fetch | ❌ |
| KG construction | ✅ LLM extraction → Neo4j | ❌ (deprecated) |
| Ontology building | ✅ OBO + LLM | ❌ |
| Query API | ❌ | ✅ FastAPI |
| Query agents | ❌ | ✅ Cypher gen, search |
| Frontend | ❌ | ✅ React UI |
| Infrastructure | S3/Athena (CDK) | ECS/Neptune/Redis (CDK) |
Design Principles
- Local-first — all data stored locally with checksums and manifests
- Resumable — interrupted downloads/ingestions resume where they left off
- Idempotent — MERGE operations mean re-running is safe
- Dry-run — preview any operation without side effects
- Pluggable LLMs — Bedrock, Ollama, SageMaker with one flag
- Zero server dependencies — CLI tool, no API server needed to ingest
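
The Idempotent and Dry-run principles above, together with the batched MERGE writer from the data flow, can be sketched roughly as follows. This is an illustration under assumptions, not the actual `kg.py` implementation: the `:Entity` label, the `id` merge key, the row shape, and the names `write_entities` and `batched` are all hypothetical. The `run` argument is any callable that takes a query string and a parameter dict, for example the `run` method of a Neo4j driver session.

```python
from typing import Any, Callable, Iterable, Iterator

Row = dict[str, Any]

# Upserts one batch of entities. MERGE matches on `id`, so re-running the
# same batch updates properties instead of creating duplicate nodes.
# The :Entity label and the row shape {"id": ..., "props": {...}} are
# assumptions made for this sketch.
MERGE_ENTITIES = (
    "UNWIND $rows AS row "
    "MERGE (e:Entity {id: row.id}) "
    "SET e += row.props"
)


def batched(rows: Iterable[Row], size: int) -> Iterator[list[Row]]:
    """Yield successive lists of at most `size` rows."""
    batch: list[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def write_entities(
    rows: Iterable[Row],
    run: Callable[[str, dict[str, Any]], Any],
    batch_size: int = 500,
    dry_run: bool = False,
) -> int:
    """Send rows in batches via run(query, params) and return the batch count.

    With dry_run=True the batches are counted but nothing is sent.
    """
    count = 0
    for batch in batched(rows, batch_size):
        if not dry_run:
            run(MERGE_ENTITIES, {"rows": batch})
        count += 1
    return count
```

Calling `write_entities` twice with the same rows leaves the graph in the same state as calling it once, which is what makes re-running an interrupted ingestion safe. Passing `dry_run=True` reports how many batches would be written without touching the database.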