Ingestion Pipeline
The bioingest ingest command extracts entities and relationships from documents using LLMs and writes them to Neo4j.
Sources
| Source | Command | What it fetches |
|---|---|---|
| PubMed | ingest pubmed -q "..." | Abstracts via NCBI Entrez |
| bioRxiv | ingest biorxiv -q "..." | Preprints via bioRxiv API |
| PMC | ingest pmc -q "..." | Full-text OA articles via Entrez |
| Local files | ingest path/ | PDF, TSV, CSV, TXT, Markdown |
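Tabular local files have to become plain text before they can be chunked; the Pipeline Stages diagram lists TSV→markdown table for that step. A minimal sketch of such a conversion, assuming the first row is a header. The function name tsv_to_markdown is hypothetical and this is not bioingest's actual code:

```python
import csv
import io


def tsv_to_markdown(tsv_text: str) -> str:
    """Render tab-separated text as a markdown pipe table (illustrative only)."""
    # Skip completely empty lines so they do not become blank table rows.
    rows = [r for r in csv.reader(io.StringIO(tsv_text), delimiter="\t") if r]
    if not rows:
        return ""
    header, *body = rows
    width = len(header)

    def fmt(cells):
        # Pad or truncate to the header width and escape pipes so that
        # cell content cannot break the table layout.
        padded = (list(cells) + [""] * width)[:width]
        return "| " + " | ".join(c.replace("|", r"\|") for c in padded) + " |"

    lines = [fmt(header), "|" + "---|" * width]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample rows, only to show the output shape.
    print(tsv_to_markdown("gene\tscore\nIL6\t0.9\n"))
```

Rendering rows as a markdown table keeps column names next to their values, which may help the LLM attribute each value to the right field.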
Pipeline Stages
flowchart LR
A[Source] --> B[Extract Text]
B --> C[Chunk]
C --> D[LLM Extract]
D --> E[Neo4j Write]
B -.- B1["PDF→pymupdf<br/>TSV→markdown table<br/>HTML→strip tags"]
C -.- C1["~3000 chars/chunk<br/>sentence boundaries<br/>200 char overlap"]
D -.- D1["Bedrock / Ollama / SageMaker<br/>→ entities + relationships"]
E -.- E1["Batched MERGE<br/>idempotent writes"]
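The Chunk stage in the diagram (about 3000 characters per chunk, split on sentence boundaries, 200 characters of overlap) can be sketched as follows. This is an illustration under stated assumptions, not bioingest's implementation: the function name chunk_text, the regex-based sentence splitter, and the choice to take the overlap from the raw tail of the previous chunk are all hypothetical.

```python
import re


def chunk_text(text: str, max_chars: int = 3000, overlap: int = 200) -> list[str]:
    """Greedy sentence-boundary chunking with a character overlap."""
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            # Seed the next chunk with the tail of the one just closed.
            # The tail is cut by character count, so it may start mid-word.
            tail = current[-overlap:] if overlap > 0 else ""
            current = f"{tail} {sentence}".strip()
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "Alpha beta gamma. " * 50
    for chunk in chunk_text(sample, max_chars=100, overlap=20):
        print(len(chunk), repr(chunk[:30]))
```

In this sketch a single sentence longer than max_chars is kept whole rather than split, so chunk sizes are approximate, in line with the "~3000" in the diagram. The overlap lets a relationship that spans a chunk boundary appear in full in at least one chunk.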
LLM Backends
| Backend | Flag | Notes |
|---|---|---|
| AWS Bedrock | --service bedrock | Default. Llama 3, Claude, Mistral |
| Ollama | --service local | No cloud. Any local model |
| SageMaker | --service sagemaker | Custom endpoints |
# Override model
bioingest ingest pubmed -q "IL-6" --service bedrock --model-id us.anthropic.claude-3-5-sonnet-20241022-v2:0
bioingest ingest pubmed -q "IL-6" --service local --model-id llama3.1:8b-instruct-q8_0
Dry-Run Mode
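Preview extraction without writing to Neo4j by adding the --dry-run flag. The extracted records are printed to stdout as JSONL, one node or relationship per line: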
{"_type": "node", "id": "brca1", "name": "BRCA1", "type": "Protein", "doc_id": "pubmed_12345"}
{"_type": "relationship", "source_id": "brca1", "target_id": "breast_cancer", "type": "ASSOCIATED_WITH"}
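Records in this shape map naturally onto the batched, idempotent MERGE writes of the final pipeline stage. The sketch below groups dry-run records by type and builds UNWIND/MERGE Cypher for each group. It is a hedged illustration, not bioingest's writer: the helper names, the batch size of 500, and the choice of id as the merge key and type as the node label are all assumptions.

```python
import json
import re
from collections import defaultdict
from itertools import islice

# Labels and relationship types cannot be passed as Cypher parameters, so
# they are interpolated into the query text and must be validated first.
NAME_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def load_records(lines):
    """Split dry-run JSONL lines into nodes and relationships, keyed by type."""
    nodes, rels = defaultdict(list), defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        rec = json.loads(line)
        kind = rec.pop("_type")
        if kind == "node":
            nodes[rec["type"]].append(rec)
        elif kind == "relationship":
            rels[rec["type"]].append(rec)
    return nodes, rels


def checked(name):
    if not NAME_RE.match(name):
        raise ValueError(f"unsafe label or relationship type: {name!r}")
    return name


def node_query(label):
    # MERGE on id means re-running the same batch creates no duplicates.
    return (f"UNWIND $rows AS row MERGE (n:`{checked(label)}` {{id: row.id}}) "
            "SET n += row")


def rel_query(rel_type):
    return ("UNWIND $rows AS row "
            "MATCH (a {id: row.source_id}), (b {id: row.target_id}) "
            f"MERGE (a)-[:`{checked(rel_type)}`]->(b)")


def batched(rows, size=500):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch
```

With the official neo4j Python driver, each batch could then be sent as session.run(node_query(label), rows=batch). Node batches should be written before relationship batches so that the MATCH in rel_query finds both endpoints.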
Examples
# PubMed → Neo4j (Bedrock)
bioingest ingest pubmed -q "IL-6 inflammation biomarker" --max-results 200 --database olink3
# bioRxiv → Neo4j (local Ollama)
bioingest ingest biorxiv -q "proximity extension assay" --service local
# PMC full-text → JSONL
bioingest ingest pmc -q "BRCA1 breast cancer" --dry-run > extracted.jsonl
# Local directory
bioingest ingest data/bulk/ttd/ --database olink3
# Single PDF
bioingest ingest ~/Downloads/paper.pdf --service bedrock