Ingestion Pipeline

The `bioingest ingest` command uses an LLM to extract entities and relationships from documents and writes them to Neo4j.

Sources

| Source | Command | What it fetches |
|---|---|---|
| PubMed | `ingest pubmed -q "..."` | Abstracts via NCBI Entrez |
| bioRxiv | `ingest biorxiv -q "..."` | Preprints via bioRxiv API |
| PMC | `ingest pmc -q "..."` | Full-text OA articles via Entrez |
| Local files | `ingest path/` | PDF, TSV, CSV, TXT, Markdown |

Pipeline Stages

flowchart LR
    A[Source] --> B[Extract Text]
    B --> C[Chunk]
    C --> D[LLM Extract]
    D --> E[Neo4j Write]

    B -.- B1["PDF→pymupdf<br/>TSV→markdown table<br/>HTML→strip tags"]
    C -.- C1["~3000 chars/chunk<br/>sentence boundaries<br/>200 char overlap"]
    D -.- D1["Bedrock / Ollama / SageMaker<br/>→ entities + relationships"]
    E -.- E1["Batched MERGE<br/>idempotent writes"]
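
The Chunk stage in the diagram (about 3000 characters per chunk, split on sentence boundaries, with a 200-character overlap) can be illustrated with a minimal sketch. This is not the tool's actual implementation: the function name `chunk_text`, its parameters, and the naive regex sentence splitter are all assumptions made for illustration.

```python
import re


def chunk_text(text: str, target_chars: int = 3000, overlap_chars: int = 200) -> list[str]:
    """Split text into roughly target_chars-sized chunks at sentence boundaries.

    Hypothetical helper: bioingest's real chunker may differ.
    """
    # Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > target_chars:
            # Adding this sentence would exceed the target, so close the chunk.
            chunks.append(current)
            # Seed the next chunk with the tail of the previous one (the overlap).
            tail = current[-overlap_chars:] if overlap_chars > 0 else ""
            current = f"{tail} {sentence}".strip()
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

In this sketch the overlap is taken by character count, so the carried-over tail may begin mid-sentence, and a single sentence longer than `target_chars` is kept whole rather than split.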

LLM Backends

| Backend | Flag | Notes |
|---|---|---|
| AWS Bedrock | `--service bedrock` | Default. Llama 3, Claude, Mistral |
| Ollama | `--service local` | No cloud. Any local model |
| SageMaker | `--service sagemaker` | Custom endpoints |

# Override model
bioingest ingest pubmed -q "IL-6" --service bedrock --model-id us.anthropic.claude-3-5-sonnet-20241022-v2:0
bioingest ingest pubmed -q "IL-6" --service local --model-id llama3.1:8b-instruct-q8_0

Dry-Run Mode

Preview the extracted entities and relationships without writing to Neo4j:

bioingest ingest pubmed -q "BRCA1" --max-results 5 --dry-run

The command writes JSONL (one JSON object per line) to stdout:

{"_type": "node", "id": "brca1", "name": "BRCA1", "type": "Protein", "doc_id": "pubmed_12345"}
{"_type": "relationship", "source_id": "brca1", "target_id": "breast_cancer", "type": "ASSOCIATED_WITH"}
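
Because each record carries a `_type` field, the stream is straightforward to post-process. The sketch below separates nodes from relationships and lists relationship endpoints that have no node record in the same stream. It assumes only the fields visible in the sample above; the helper name `split_records` is hypothetical and not part of bioingest.

```python
import json


def split_records(lines):
    """Separate dry-run JSONL lines into node and relationship records."""
    nodes, relationships = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get("_type") == "node":
            nodes.append(record)
        elif record.get("_type") == "relationship":
            relationships.append(record)
    return nodes, relationships


# The two sample lines shown above, verbatim.
sample = [
    '{"_type": "node", "id": "brca1", "name": "BRCA1", "type": "Protein", "doc_id": "pubmed_12345"}',
    '{"_type": "relationship", "source_id": "brca1", "target_id": "breast_cancer", "type": "ASSOCIATED_WITH"}',
]
nodes, relationships = split_records(sample)

# Relationship endpoints with no matching node record in this stream.
known_ids = {n["id"] for n in nodes}
endpoints = {r[key] for r in relationships for key in ("source_id", "target_id")}
dangling = sorted(endpoints - known_ids)
print(len(nodes), len(relationships), dangling)  # 1 1 ['breast_cancer']
```

`split_records` accepts any iterable of lines, so passing `sys.stdin` or an open file handle should work the same way as the in-memory list used here.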

Examples

# PubMed → Neo4j (Bedrock)
bioingest ingest pubmed -q "IL-6 inflammation biomarker" --max-results 200 --database olink3

# bioRxiv → Neo4j (local Ollama)
bioingest ingest biorxiv -q "proximity extension assay" --service local

# PMC full-text → JSONL
bioingest ingest pmc -q "BRCA1 breast cancer" --dry-run > extracted.jsonl

# Local directory
bioingest ingest data/bulk/ttd/ --database olink3

# Single PDF
bioingest ingest ~/Downloads/paper.pdf --service bedrock
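
The "Batched MERGE, idempotent writes" step in the pipeline diagram can be pictured with a short sketch that turns node records (in the dry-run JSONL shape) into batched Cypher statements. This is an assumption about how such a write could look, not bioingest's actual queries: the function `node_merge_batches`, the batch size, and the choice of `id` as the merge key are all hypothetical. Cypher does not accept a label as a query parameter, so the sketch validates each label and interpolates it into the query text.

```python
import re
from collections import defaultdict

# Only allow plain identifiers as labels, since labels are interpolated into the query.
_LABEL_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def node_merge_batches(nodes, batch_size=500):
    """Yield (cypher, params) pairs that MERGE nodes in batches, one label at a time.

    Hypothetical helper: the merge key ('id') and properties are assumptions.
    """
    by_label = defaultdict(list)
    for node in nodes:
        label = node["type"]
        if not _LABEL_RE.match(label):
            raise ValueError(f"unsafe label: {label!r}")
        by_label[label].append({"id": node["id"], "name": node.get("name")})

    for label, rows in by_label.items():
        # MERGE matches an existing node with this id or creates it,
        # so re-running the same batch does not create duplicates.
        query = (
            "UNWIND $rows AS row "
            f"MERGE (n:`{label}` {{id: row.id}}) "
            "SET n.name = row.name"
        )
        for start in range(0, len(rows), batch_size):
            yield query, {"rows": rows[start:start + batch_size]}
```

With the official Neo4j Python driver, each pair could then be executed as `session.run(query, params)` inside a session opened with `driver.session(database="olink3")`; the connection details are left out because they depend on the deployment.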