Knowledge Graph Pipeline

The `ingest` command runs the full knowledge graph (KG) construction pipeline: document fetching, text extraction, chunking, LLM entity extraction, entity resolution, graph database writes, and embedding.

uv sync --extra pipeline   # install Neo4j, pymupdf, langchain-ollama

Pipeline Stages

┌────────┐    ┌──────────┐    ┌────────┐    ┌────────┐    ┌──────────┐    ┌────────┐    ┌────────┐
│ fetch  │───▶│ extract  │───▶│ chunk  │───▶│  kg    │───▶│ resolve  │───▶│ write  │───▶│ embed  │
└────────┘    └──────────┘    └────────┘    └────────┘    └──────────┘    └────────┘    └────────┘
     │              │              │              │              │              │              │
 PubMed/       PDF→text       512 tokens     LLM extracts    Dedup +       Neo4j or      Vector
 bioRxiv/     (pymupdf →      64 overlap     entities +     UniProt/      Neptune        index
 PMC/local    pdftotext →     sentence-      relationships  MONDO         batched       (pgvector)
              vision LLM)     aware                         enrichment    MERGE

Stage	What it does	Output
`fetch`	Pull docs from PubMed/bioRxiv/PMC API or read local files	Raw documents
`extract`	Multi-strategy PDF extraction (pymupdf → pdftotext → vision LLM)	Plain text
`chunk`	Token-based splitting (512 tokens, 64 overlap, sentence boundaries)	Chunks
`kg`	LLM extraction of entities + relationships per chunk (async parallel)	Entity/Rel JSON
`resolve`	Entity dedup (synonyms, abbreviation expansion, fuzzy matching)	Canonical entities
`write`	Batched MERGE/CREATE into Neo4j or Neptune with provenance	Graph nodes/edges
`embed`	Generate chunk embeddings → Neo4j vector index or Aurora pgvector	Vectors
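
The `chunk` stage's parameters (512 tokens, 64 overlap, sentence boundaries) can be illustrated with a minimal sketch. It assumes whitespace-separated words stand in for tokens and uses a naive regex sentence splitter. The pipeline's actual tokenizer is not documented here, and the name `chunk_text` is hypothetical.

```python
import re


def chunk_text(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Split text into sentence-aligned chunks of roughly max_tokens tokens.

    Tokens are approximated by whitespace-separated words (an assumption).
    Each new chunk starts with trailing sentences of the previous chunk
    totalling at most `overlap` tokens. A single sentence longer than the
    budget is kept whole rather than cut mid-sentence.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []  # sentences in the chunk being built
    current_len = 0          # token count of `current`

    for sentence in sentences:
        n = len(sentence.split())
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry over trailing sentences that fit in the overlap budget.
            carried: list[str] = []
            carried_len = 0
            for prev in reversed(current):
                prev_len = len(prev.split())
                if carried_len + prev_len > overlap:
                    break
                carried.insert(0, prev)
                carried_len += prev_len
            current, current_len = carried, carried_len
        current.append(sentence)
        current_len += n

    if current:
        chunks.append(" ".join(current))
    return chunks
```

Carrying whole sentences into the next chunk keeps the overlap readable for the LLM, at the cost of the overlap being at most, rather than exactly, 64 tokens.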

Running the Pipeline

CLI

# PubMed → Neo4j
uv run bioingest ingest pubmed -q "IL-6 inflammation" --max-results 200 --concurrency 5

# Local PDFs → Neptune
export NEPTUNE_ENDPOINT=cluster.xxx.neptune.amazonaws.com
uv run bioingest ingest ~/papers/ --target neptune --service bedrock --concurrency 10

# Dry run (JSONL output, no DB writes)
uv run bioingest ingest pmc -q "BRCA1 breast cancer" --dry-run

# Stage-by-stage debugging
uv run bioingest ingest ~/paper.pdf --stage chunk       # stop after chunking
uv run bioingest ingest ~/paper.pdf --stage kg --dry-run  # extract KG, print JSONL
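
The effect of `--stage` can be pictured as truncating the ordered stage list. The sketch below assumes stages always run in the order given in the stage table; the helper name `stages_up_to` is hypothetical and not part of bioingest.

```python
from typing import Optional

# Stage order as listed in the Pipeline Stages table.
STAGES = ["fetch", "extract", "chunk", "kg", "resolve", "write", "embed"]


def stages_up_to(stop_after: Optional[str] = None) -> list[str]:
    """Return the stages that run when the pipeline stops after `stop_after`."""
    if stop_after is None:
        return list(STAGES)
    if stop_after not in STAGES:
        raise ValueError(f"unknown stage {stop_after!r}; expected one of {STAGES}")
    return STAGES[: STAGES.index(stop_after) + 1]


print(stages_up_to("chunk"))  # ['fetch', 'extract', 'chunk']
```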

Programmatic API

import asyncio
from bioingest.pipeline.bridge import ingest, PipelineResult


async def main() -> None:
    # `await` is only valid inside a coroutine, so the call is wrapped in main().
    result: PipelineResult = await ingest(
        source_type="pubmed",
        query="EGFR biomarker",
        max_results=100,
        database="olink3",
        model="bedrock:us.amazon.nova-micro-v1:0",
        target="neptune",
        neptune_endpoint="cluster.xxx.neptune.amazonaws.com",
        concurrency=5,
        progress_callback=lambda **kw: print(kw),
        cancellation_event=asyncio.Event(),
    )
    # result.status: "completed" | "failed" | "cancelled"
    # result.processed_count, result.errors, result.stage_timings
    print(result.status, result.processed_count)


asyncio.run(main())
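
How `progress_callback` and `cancellation_event` might interact can be sketched with a stand-in coroutine. This is not the bioingest implementation: `fake_ingest` is hypothetical, the callback keyword names (`stage`, `done`, `total`) are assumptions, and it is assumed that the pipeline checks the event between documents.

```python
import asyncio
from typing import Callable


async def fake_ingest(
    n_docs: int,
    progress_callback: Callable[..., None],
    cancellation_event: asyncio.Event,
) -> dict:
    """Stand-in for ingest(): processes documents until cancelled."""
    processed = 0
    for _ in range(n_docs):
        if cancellation_event.is_set():
            return {"status": "cancelled", "processed_count": processed}
        await asyncio.sleep(0)  # placeholder for real per-document work
        processed += 1
        progress_callback(stage="kg", done=processed, total=n_docs)
    return {"status": "completed", "processed_count": processed}


async def main() -> dict:
    cancel = asyncio.Event()

    def on_progress(**kw) -> None:
        if kw["done"] == 3:  # request cancellation after three documents
            cancel.set()

    return await fake_ingest(10, on_progress, cancel)


result = asyncio.run(main())
print(result)  # {'status': 'cancelled', 'processed_count': 3}
```

Keeping a reference to the event outside the call is what lets a UI button or timeout handler stop a long ingest cleanly.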

Graph Targets

Target	Graph writes	Embedding writes	Config
`neo4j`	Bolt driver, batched MERGE	Neo4j vector index property	`NEO4J_URI`, `NEO4J_USER`, `NEO4J_PASSWORD`
`neptune`	OpenCypher via boto3, batched UNWIND	Aurora pgvector (auto)	`NEPTUNE_ENDPOINT`, `AURORA_ENDPOINT`
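
Both targets write in batches. A minimal sketch of the batching pattern follows, assuming a single `Entity` label keyed by `id` and a batch size of 500; neither the schema nor the batch size is taken from bioingest, and the `example:` identifiers are placeholders.

```python
from typing import Iterable, Iterator

# Assumed schema: one `Entity` label keyed by `id` (illustrative only).
MERGE_ENTITIES = """
UNWIND $rows AS row
MERGE (e:Entity {id: row.id})
SET e.name = row.name, e.type = row.type
"""


def batched(rows: Iterable[dict], size: int = 500) -> Iterator[list[dict]]:
    """Yield successive lists of at most `size` rows."""
    batch: list[dict] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Placeholder rows; real rows would come from the resolve stage.
rows = [{"id": f"example:{i}", "name": f"entity-{i}", "type": "Protein"} for i in range(1200)]
for batch in batched(rows, size=500):
    # With the Neo4j Python driver this would be sent as:
    #   session.run(MERGE_ENTITIES, rows=batch)
    print(len(batch))  # 500, 500, 200
```

Sending one `UNWIND` statement per batch avoids a network round trip per node, and `MERGE` keeps repeated ingests from creating duplicate nodes.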

Entity Types Extracted

Type	Examples
Protein	EGFR, p53, IL-6
Disease	Alzheimer's disease, breast cancer
Drug	imatinib, trastuzumab
Pathway	MAPK signaling, apoptosis
Biomarker	PSA, troponin I
Gene	BRCA1, TP53
CellType	T cell, macrophage
Tissue	liver, brain cortex
Organism	Homo sapiens, mouse

Relationship Types

Relationship	Meaning	Example
`ASSOCIATES_WITH`	General association	EGFR → lung cancer
`INTERACTS_WITH`	Physical/functional PPI	EGFR → ERBB2
`TREATS`	Drug treats disease	gefitinib → NSCLC
`UPREGULATES`	Increases expression	TNF → IL-6
`DOWNREGULATES`	Decreases expression	rapamycin → mTOR
`EXPRESSED_IN`	Tissue expression	albumin → liver
`PARTICIPATES_IN`	Pathway membership	EGFR → MAPK signaling
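
The entity and relationship vocabularies above can be encoded as Python types so that extracted JSON is checked before it reaches the graph. This is a sketch only: the field names (`head`, `head_type`, `relation`, `tail`, `tail_type`) are hypothetical, since the pipeline's Entity/Rel JSON layout is not documented here.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(str, Enum):
    PROTEIN = "Protein"
    DISEASE = "Disease"
    DRUG = "Drug"
    PATHWAY = "Pathway"
    BIOMARKER = "Biomarker"
    GENE = "Gene"
    CELL_TYPE = "CellType"
    TISSUE = "Tissue"
    ORGANISM = "Organism"


class RelationType(str, Enum):
    ASSOCIATES_WITH = "ASSOCIATES_WITH"
    INTERACTS_WITH = "INTERACTS_WITH"
    TREATS = "TREATS"
    UPREGULATES = "UPREGULATES"
    DOWNREGULATES = "DOWNREGULATES"
    EXPRESSED_IN = "EXPRESSED_IN"
    PARTICIPATES_IN = "PARTICIPATES_IN"


@dataclass(frozen=True)
class Entity:
    name: str
    type: EntityType


@dataclass(frozen=True)
class Relationship:
    head: Entity
    relation: RelationType
    tail: Entity


def parse_relationship(raw: dict) -> Relationship:
    """Build a Relationship from a dict; unknown type names raise ValueError."""
    return Relationship(
        head=Entity(raw["head"], EntityType(raw["head_type"])),
        relation=RelationType(raw["relation"]),
        tail=Entity(raw["tail"], EntityType(raw["tail_type"])),
    )


rel = parse_relationship({
    "head": "gefitinib", "head_type": "Drug",
    "relation": "TREATS",
    "tail": "NSCLC", "tail_type": "Disease",
})
print(rel.relation.value)  # TREATS
```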

How the Mapping Graph Connects to LLM Extractions

The mapping graph (201K SAME_AS + BROADER_THAN edges) provides identity resolution for LLM-extracted entities:

┌─────────────────────┐         ┌───────────────────────────┐
│  LLM Extraction     │         │  Mapping Graph            │
│                     │         │                           │
│  "EGFR" (Protein)  │───resolve───▶ uniprot:P00533        │
│  "lung cancer"     │───resolve───▶ mondo:MONDO:0008903   │
│  "gefitinib"       │───resolve───▶ chembl:CHEMBL939      │
└─────────────────────┘         └───────────────────────────┘
                                          │
                                    SAME_AS edges
                                          │
                                          ▼
                               ┌───────────────────┐
                               │ Structured sources │
                               │ Open Targets       │
                               │ STRING             │
                               │ Reactome           │
                               └───────────────────┘
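
The resolve step in the diagram can be sketched as a lookup with a fuzzy fallback. The table below holds only the three mappings shown above; the function name `resolve`, the normalization rule, and the 0.85 similarity cutoff are assumptions, and the real mapping graph is queried rather than held in a dict.

```python
import difflib
from typing import Optional

# The three mappings from the diagram; a real mapping graph is far larger.
MAPPING = {
    ("egfr", "Protein"): "uniprot:P00533",
    ("lung cancer", "Disease"): "mondo:MONDO:0008903",
    ("gefitinib", "Drug"): "chembl:CHEMBL939",
}


def resolve(name: str, entity_type: str, cutoff: float = 0.85) -> Optional[str]:
    """Return a canonical identifier for an extracted mention, or None."""
    key = " ".join(name.lower().split())  # case-fold and collapse whitespace
    if (key, entity_type) in MAPPING:
        return MAPPING[(key, entity_type)]
    # Fall back to fuzzy matching among names of the same entity type.
    candidates = [n for (n, t) in MAPPING if t == entity_type]
    close = difflib.get_close_matches(key, candidates, n=1, cutoff=cutoff)
    return MAPPING[(close[0], entity_type)] if close else None


print(resolve("EGFR", "Protein"))    # uniprot:P00533
print(resolve("gefitinb", "Drug"))   # chembl:CHEMBL939 (typo caught by fuzzy match)
print(resolve("aspirin", "Drug"))    # None
```

Restricting fuzzy candidates to the same entity type reduces the chance that a gene symbol is matched to a similarly spelled drug or disease.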

Validation Pipeline

LLM-extracted relationships are scored against structured data:

LLM Relationship	Validated Against	Method
`ASSOCIATES_WITH` (Protein→Disease)	Open Targets, DISEASES 2.0, ClinVar	Check gene-disease pair exists with score > threshold
`INTERACTS_WITH` (Protein→Protein)	STRING (score > 400)	Look up protein pair in STRING links
`TREATS` (Drug→Disease)	ChEMBL, TTD	Verify drug has known mechanism
`EXPRESSED_IN` (Gene→Tissue)	GTEx (TPM > 1)	Verify measurable expression
`PARTICIPATES_IN` (Protein→Pathway)	Reactome	Check UniProt→pathway mapping

bioingest map validate-kg   # score Neptune relationships against structured evidence
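
As an illustration of the `INTERACTS_WITH` check (STRING score > 400), the sketch below looks up an unordered protein pair in an in-memory table. The scores are made-up placeholders, real STRING links are keyed by STRING protein identifiers rather than gene symbols, and `validate_interaction` is a hypothetical name, not the `validate-kg` implementation.

```python
# Placeholder scores for illustration; real values come from STRING link data.
STRING_SCORES = {
    frozenset({"EGFR", "ERBB2"}): 999,
    frozenset({"EGFR", "ALB"}): 150,
}


def validate_interaction(a: str, b: str, threshold: int = 400) -> dict:
    """Score a Protein→Protein pair against STRING-style combined scores."""
    score = STRING_SCORES.get(frozenset({a, b}))  # pair order does not matter
    return {
        "pair": (a, b),
        "score": score,
        "supported": score is not None and score > threshold,
    }


print(validate_interaction("ERBB2", "EGFR")["supported"])  # True
print(validate_interaction("EGFR", "ALB")["supported"])    # False
```

Returning the raw score alongside the boolean leaves room to distinguish "no evidence found" (`score` is `None`) from "evidence below threshold".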

CLI Options Reference

Flag	Effect
`--query, -q`	Search query (required for pubmed/biorxiv/pmc)
`--max-results`	Max documents to fetch (default: 50)
`--service`	LLM backend: `bedrock`, `local` (Ollama), `sagemaker`
`--model-id`	Override model (e.g., `us.amazon.nova-micro-v1:0`)
`--database`	Neo4j database name (default: `olink3`)
`--target`	Graph backend: `neo4j` or `neptune`
`--neptune-endpoint`	Neptune cluster endpoint
`--dry-run`	Extract as JSONL without writing to DB
`--stage`	Stop after stage: `fetch`, `extract`, `chunk`, `kg`, `resolve`, `write`, `embed`
`--force`	Re-process already-ingested documents
`--concurrency`	Parallel LLM calls (default: 1, recommended: 5-10)
`--extensions`	File types for directories (e.g., `.tsv,.pdf`)
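
One common way to bound parallel LLM calls, as `--concurrency` does, is an `asyncio.Semaphore`. Whether bioingest uses this mechanism is not stated here; the sketch below is an assumption, and `extract_kg` is a stand-in for the real LLM request.

```python
import asyncio


async def extract_all(chunks: list[str], concurrency: int = 5) -> list[dict]:
    """Run a stand-in extraction over all chunks with bounded parallelism."""
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def extract_kg(chunk: str) -> dict:
        nonlocal in_flight, peak
        async with semaphore:  # at most `concurrency` holders at a time
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # placeholder for the LLM request
            in_flight -= 1
            return {"chunk": chunk, "entities": []}

    # gather() returns results in the same order as the input chunks.
    results = await asyncio.gather(*(extract_kg(c) for c in chunks))
    assert peak <= concurrency  # the semaphore held the limit
    return list(results)


results = asyncio.run(extract_all([f"chunk {i}" for i in range(20)], concurrency=5))
print(len(results))  # 20
```

With the default of 1 the calls run one at a time; raising the limit trades throughput against the LLM backend's rate limits.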