System Design

Package Structure

bioingest/
├── cli.py                    # CLI entry point
├── config.py                 # Base settings (.env loader)
├── connectors/               # 18 source-specific downloaders
├── download.py               # Resumable file downloader
├── publish.py                # S3/Parquet/Athena publisher
├── ontology_builder.py       # OBO parsing + LLM ontology extraction
├── scraper.py                # Firecrawl web scraping
└── pipeline/                 # KG construction pipeline
    ├── config.py             # Pipeline config (Neo4j + LLM + AWS)
    ├── models.py             # Chunk, ProcessedDocument
    ├── extract.py            # Content extraction (PDF, TSV, HTML, TXT)
    ├── kg.py                 # LLM extraction + Neo4j writer
    ├── sources.py            # Remote fetchers (PubMed, bioRxiv, PMC)
    └── factories/
        ├── llm_factory.py    # Bedrock, Ollama, SageMaker
        └── embedding_factory.py

Data Flow

flowchart TD
    subgraph acquire["1. Acquire"]
        DL[Bulk Download<br/>18 sources]
        PM[PubMed API]
        BR[bioRxiv API]
        PMC[PMC API]
        LP[Local PDFs/TSVs]
    end

    subgraph extract["2. Extract"]
        CE[Content Extraction<br/>PDF→text, TSV→markdown]
        CH[Text Chunking<br/>sentence-boundary-aware]
    end

    subgraph transform["3. Transform"]
        LLM[LLM Extraction<br/>entities + relationships]
        OBO[OBO Parser<br/>deterministic ontology]
    end

    subgraph load["4. Load"]
        N4J[Neo4j Writer<br/>batched MERGE]
        S3[S3 Publisher<br/>Parquet + Athena]
        JL[JSONL Export<br/>graphrag_api compatible]
    end

    DL --> CE
    PM --> CH
    BR --> CH
    PMC --> CH
    LP --> CE
    CE --> CH
    CH --> LLM
    DL --> OBO
    LLM --> N4J
    LLM --> JL
    OBO --> N4J
    OBO --> JL
    DL --> S3
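
The "Text Chunking, sentence-boundary-aware" step in the flow above can be illustrated with a minimal sketch. This is not BioIngest's actual chunker: the function name `chunk_text`, the `max_chars` parameter, and the naive punctuation-based sentence splitter are all assumptions made for illustration. The idea it shows is packing whole sentences into a chunk until the size limit would be exceeded, so no chunk ends mid-sentence.

```python
import re

# Naive sentence splitter (assumption): break after ., !, or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk
    rather than being cut mid-sentence.
    """
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the chunk and start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "BRCA1 repairs DNA. TP53 regulates the cell cycle. EGFR drives growth."
for chunk in chunk_text(sample, max_chars=50):
    print(chunk)
```

With `max_chars=50`, the first two sentences fit in one chunk and the third starts a new one. A real chunker would likely need a more careful splitter, since abbreviations such as "e.g." or "Fig." defeat the punctuation rule used here.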

Relationship to graphrag_api

| Concern | BioIngest | graphrag_api |
|---|---|---|
| Data acquisition | ✅ Download, scrape, fetch | |
| KG construction | ✅ LLM extraction → Neo4j | ❌ (deprecated) |
| Ontology building | ✅ OBO + LLM | |
| Query API | | ✅ FastAPI |
| Query agents | | ✅ Cypher gen, search |
| Frontend | | ✅ React UI |
| Infrastructure | S3/Athena (CDK) | ECS/Neptune/Redis (CDK) |

Design Principles

  • Local-first — all data stored locally with checksums and manifests
  • Resumable — interrupted downloads/ingestions resume where they left off
  • Idempotent — MERGE operations mean re-running is safe
  • Dry-run — preview any operation without side effects
  • Pluggable LLMs — Bedrock, Ollama, SageMaker with one flag
  • Zero server dependencies — CLI tool, no API server needed to ingest
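
The idempotent and dry-run principles above can be sketched together. The code below is a hedged illustration, not the writer in `pipeline/kg.py`: the `Entity` label, the `id` and `name` properties, the Cypher text, and the names `write_entities`, `batched`, and `run_query` are all hypothetical. `run_query` stands in for any callable that executes a parameterized Cypher statement, such as a Neo4j driver session's `run` method.

```python
from typing import Any, Callable, Iterator

# Hypothetical Cypher: MERGE matches on the key and creates the node only
# if it is missing, so replaying the same batch does not duplicate nodes.
MERGE_ENTITIES = (
    "UNWIND $rows AS row "
    "MERGE (e:Entity {id: row.id}) "
    "SET e.name = row.name"
)

Row = dict[str, Any]


def batched(rows: list[Row], size: int) -> Iterator[list[Row]]:
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def write_entities(
    run_query: Callable[..., Any],
    rows: list[Row],
    batch_size: int = 500,
    dry_run: bool = False,
) -> int:
    """Send rows in batches and return the batch count.

    In dry-run mode the batches are counted but run_query is never called,
    so nothing is written.
    """
    batches = 0
    for batch in batched(rows, batch_size):
        if not dry_run:
            run_query(MERGE_ENTITIES, rows=batch)
        batches += 1
    return batches


# Stand-in for a database session: records the size of each batch it receives.
calls: list[int] = []


def fake_run(query: str, **params: Any) -> None:
    calls.append(len(params["rows"]))


rows = [{"id": f"E{i}", "name": f"entity {i}"} for i in range(5)]
print(write_entities(fake_run, rows, batch_size=2, dry_run=True))  # 3, nothing sent
print(write_entities(fake_run, rows, batch_size=2))                # 3
print(calls)                                                       # [2, 2, 1]
```

Because the statement uses `MERGE` rather than `CREATE`, running `write_entities` twice with the same rows leaves the graph in the same state as running it once, which is what makes a resumed or repeated ingestion safe.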