Ingestion Pipeline

The bioingest ingest command extracts entities and relationships from documents using LLMs and writes them to Neo4j.

Sources

| Source | Command | What it fetches |
|---|---|---|
| PubMed | ingest pubmed -q "..." | Abstracts via NCBI Entrez |
| bioRxiv | ingest biorxiv -q "..." | Preprints via bioRxiv API |
| PMC | ingest pmc -q "..." | Full-text OA articles via Entrez |
| Local files | ingest path/ | PDF, TSV, CSV, TXT, Markdown |

Pipeline Stages

flowchart LR
    A[Source] --> B[Extract Text]
    B --> C[Chunk]
    C --> D[LLM Extract]
    D --> E[Neo4j Write]

    B -.- B1["PDF→pymupdf<br/>TSV→markdown table<br/>HTML→strip tags"]
    C -.- C1["~3000 chars/chunk<br/>sentence boundaries<br/>200 char overlap"]
    D -.- D1["Bedrock / Ollama / SageMaker<br/>→ entities + relationships"]
    E -.- E1["Batched MERGE<br/>idempotent writes"]
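The chunking parameters in the flowchart (about 3000 characters per chunk, breaks on sentence boundaries, 200 characters of overlap) can be illustrated with a minimal sketch. This is not bioingest's actual chunker: the function name `chunk_text`, its parameters, and the naive regex sentence splitter are all assumptions made for illustration.

```python
import re


def chunk_text(text: str, target: int = 3000, overlap: int = 200) -> list[str]:
    """Hypothetical chunker: ~`target` chars per chunk, split between sentences."""
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > target:
            # Adding this sentence would exceed the target, so close the chunk.
            chunks.append(current)
            # Seed the next chunk with the last `overlap` characters of the
            # previous one (the tail may start mid-word in this simple version).
            tail = current[-overlap:] if overlap > 0 else ""
            current = f"{tail} {sentence}" if tail else sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

In this sketch a single sentence longer than `target` is kept whole rather than cut, so chunk sizes are approximate, which matches the "~3000" wording in the diagram.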

LLM Backends

| Backend | Flag | Notes |
|---|---|---|
| AWS Bedrock | --service bedrock | Default. Llama 3, Claude, Mistral |
| Ollama | --service local | No cloud. Any local model |
| SageMaker | --service sagemaker | Custom endpoints |

# Override the model with --model-id
bioingest ingest pubmed -q "IL-6" --service bedrock --model-id us.anthropic.claude-3-5-sonnet-20241022-v2:0
bioingest ingest pubmed -q "IL-6" --service local --model-id llama3.1:8b-instruct-q8_0

Dry-Run Mode

Use --dry-run to preview extraction results without writing to Neo4j:

bioingest ingest pubmed -q "BRCA1" --max-results 5 --dry-run

Dry-run mode writes JSONL to stdout, one node or relationship per line:

{"_type": "node", "id": "brca1", "name": "BRCA1", "type": "Protein", "doc_id": "pubmed_12345"}
{"_type": "relationship", "source_id": "brca1", "target_id": "breast_cancer", "type": "ASSOCIATED_WITH"}

Examples

# PubMed → Neo4j (Bedrock)
bioingest ingest pubmed -q "IL-6 inflammation biomarker" --max-results 200 --database olink3

# bioRxiv → Neo4j (local Ollama)
bioingest ingest biorxiv -q "proximity extension assay" --service local

# PMC full-text → JSONL
bioingest ingest pmc -q "BRCA1 breast cancer" --dry-run > extracted.jsonl

# Local directory
bioingest ingest data/bulk/ttd/ --database olink3

# Single PDF
bioingest ingest ~/Downloads/paper.pdf --service bedrock
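
The pipeline diagram describes the final stage as "Batched MERGE" with idempotent writes. A rough sketch of that idea is shown below, assuming a generic `Entity` label keyed on `id`; the label, the Cypher text, the batch size, and the function names are illustrative assumptions, not the queries bioingest actually runs.

```python
from itertools import islice

# Assumed Cypher: MERGE matches an existing node by id or creates it, so
# re-running the same batch does not produce duplicates.
NODE_QUERY = """
UNWIND $rows AS row
MERGE (n:Entity {id: row.id})
SET n.name = row.name, n.type = row.type
"""


def batched(rows, size):
    """Yield successive lists of at most `size` items from `rows`."""
    iterator = iter(rows)
    while batch := list(islice(iterator, size)):
        yield batch


def write_nodes(run, nodes, batch_size=500):
    """Send nodes in batches through `run(query, rows)`; return the batch count.

    `run` is any callable, so the batching logic can be exercised without a
    database connection.
    """
    count = 0
    for batch in batched(nodes, batch_size):
        run(NODE_QUERY, batch)
        count += 1
    return count
```

With the official Neo4j Python driver, `run` could be something like `lambda query, rows: session.run(query, rows=rows)`, since `Session.run` accepts query parameters as keyword arguments. Sending many rows per query with `UNWIND` reduces round trips compared with one `MERGE` per statement.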