Knowledge Graph Pipeline

The ingest command runs the full knowledge graph (KG) construction pipeline, from document fetching through LLM entity extraction to graph database writes.

uv sync --extra pipeline   # install Neo4j, pymupdf, langchain-ollama

Pipeline Stages

┌────────┐    ┌──────────┐    ┌────────┐    ┌────────┐    ┌──────────┐    ┌────────┐    ┌────────┐
│ fetch  │───▶│ extract  │───▶│ chunk  │───▶│  kg    │───▶│ resolve  │───▶│ write  │───▶│ embed  │
└────────┘    └──────────┘    └────────┘    └────────┘    └──────────┘    └────────┘    └────────┘
     │              │              │              │              │              │              │
 PubMed/       PDF→text       512 tokens     LLM extracts    Dedup +       Neo4j or      Vector
 bioRxiv/     (pymupdf →      64 overlap     entities +     UniProt/      Neptune        index
 PMC/local    pdftotext →     sentence-      relationships  MONDO         batched       (pgvector)
              vision LLM)     aware                         enrichment    MERGE

| Stage | What it does | Output |
|---|---|---|
| fetch | Pull docs from PubMed/bioRxiv/PMC API or read local files | Raw documents |
| extract | Multi-strategy PDF extraction (pymupdf → pdftotext → vision LLM) | Plain text |
| chunk | Token-based splitting (512 tokens, 64 overlap, sentence boundaries) | Chunks |
| kg | LLM extraction of entities + relationships per chunk (async parallel) | Entity/Rel JSON |
| resolve | Entity dedup (synonyms, abbreviation expansion, fuzzy matching) | Canonical entities |
| write | Batched MERGE/CREATE into Neo4j or Neptune with provenance | Graph nodes/edges |
| embed | Generate chunk embeddings → Neo4j vector index or Aurora pgvector | Vectors |
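
The chunk stage's behaviour (a 512-token budget, 64 tokens of overlap, breaks on sentence boundaries) can be illustrated roughly as follows. This is a minimal sketch, not the pipeline's implementation: `chunk_text` is a hypothetical name, tokens are approximated by whitespace-separated words rather than a real tokenizer, and sentences are split with a naive regex.

```python
import re


def chunk_text(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Group whole sentences into chunks of at most max_tokens tokens.

    Assumption: a "token" is a whitespace-separated word. A single sentence
    longer than max_tokens is kept whole, so the limit is soft in that case.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.split() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[list[str]] = []  # sentences (as token lists) in the open chunk
    count = 0                      # tokens in the open chunk

    for tokens in sentences:
        if current and count + len(tokens) > max_tokens:
            chunks.append(" ".join(t for s in current for t in s))
            # Carry trailing whole sentences forward, up to `overlap` tokens.
            carried: list[list[str]] = []
            carried_count = 0
            for s in reversed(current):
                if carried_count + len(s) > overlap:
                    break
                carried.insert(0, s)
                carried_count += len(s)
            # Drop the carry if it would push the next chunk over budget.
            if carried_count + len(tokens) > max_tokens:
                carried, carried_count = [], 0
            current, count = carried, carried_count
        current.append(tokens)
        count += len(tokens)

    if current:
        chunks.append(" ".join(t for s in current for t in s))
    return chunks
```

Carrying whole sentences instead of a fixed token window keeps every chunk starting at a sentence boundary, at the cost of the overlap sometimes being shorter than 64 tokens.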

Running the Pipeline

CLI

# PubMed → Neo4j
uv run bioingest ingest pubmed -q "IL-6 inflammation" --max-results 200 --concurrency 5

# Local PDFs → Neptune
export NEPTUNE_ENDPOINT=cluster.xxx.neptune.amazonaws.com
uv run bioingest ingest ~/papers/ --target neptune --service bedrock --concurrency 10

# Dry run (JSONL output, no DB writes)
uv run bioingest ingest pmc -q "BRCA1 breast cancer" --dry-run

# Stage-by-stage debugging
uv run bioingest ingest ~/paper.pdf --stage chunk       # stop after chunking
uv run bioingest ingest ~/paper.pdf --stage kg --dry-run  # extract KG, print JSONL

Programmatic API

import asyncio

from bioingest.pipeline.bridge import ingest, PipelineResult


async def main() -> None:
    # `await` is only valid inside a coroutine, so the call lives in main().
    result: PipelineResult = await ingest(
        source_type="pubmed",
        query="EGFR biomarker",
        max_results=100,
        database="olink3",
        model="bedrock:us.amazon.nova-micro-v1:0",
        target="neptune",
        neptune_endpoint="cluster.xxx.neptune.amazonaws.com",
        concurrency=5,
        progress_callback=lambda **kw: print(kw),
        cancellation_event=asyncio.Event(),
    )
    # result.status: "completed" | "failed" | "cancelled"
    # result.processed_count, result.errors, result.stage_timings
    print(result.status, result.processed_count)


asyncio.run(main())

Graph Targets

| Target | Graph writes | Embedding writes | Config |
|---|---|---|---|
| neo4j | Bolt driver, batched MERGE | Neo4j vector index property | NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD |
| neptune | OpenCypher via boto3, batched UNWIND | Aurora pgvector (auto) | NEPTUNE_ENDPOINT, AURORA_ENDPOINT |
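
Both targets write in batches. The general pattern of pairing one parameterised UNWIND ... MERGE statement with fixed-size batches of rows can be sketched as below. The Cypher text, the `Entity` label, the row fields (`id`, `name`, `type`), and the helper names are assumptions for illustration, not the schema bioingest actually writes.

```python
from itertools import islice
from typing import Iterable, Iterator

# Hypothetical statement: one round trip upserts a whole batch of rows.
MERGE_ENTITIES = (
    "UNWIND $rows AS row "
    "MERGE (e:Entity {id: row.id}) "
    "SET e.name = row.name, e.type = row.type"
)


def batched(rows: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch


def build_statements(rows: Iterable[dict], batch_size: int = 500) -> list[tuple[str, dict]]:
    """Pair the MERGE statement with one parameter dict per batch."""
    return [(MERGE_ENTITIES, {"rows": batch}) for batch in batched(rows, batch_size)]
```

Each `(query, parameters)` pair could then be handed to whichever client the target uses, for example `session.run(query, parameters)` with the official Neo4j Python driver. MERGE keeps the write idempotent, so re-running a batch does not duplicate nodes.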

Entity Types Extracted

| Type | Examples |
|---|---|
| Protein | EGFR, p53, IL-6 |
| Disease | Alzheimer's disease, breast cancer |
| Drug | imatinib, trastuzumab |
| Pathway | MAPK signaling, apoptosis |
| Biomarker | PSA, troponin I |
| Gene | BRCA1, TP53 |
| CellType | T cell, macrophage |
| Tissue | liver, brain cortex |
| Organism | Homo sapiens, mouse |

Relationship Types

| Relationship | Meaning | Example |
|---|---|---|
| ASSOCIATES_WITH | General association | EGFR → lung cancer |
| INTERACTS_WITH | Physical/functional PPI | EGFR → ERBB2 |
| TREATS | Drug treats disease | gefitinib → NSCLC |
| UPREGULATES | Increases expression | TNF → IL-6 |
| DOWNREGULATES | Decreases expression | rapamycin → mTOR |
| EXPRESSED_IN | Tissue expression | albumin → liver |
| PARTICIPATES_IN | Pathway membership | EGFR → MAPK signaling |
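
Because the LLM emits free-form JSON, it can help to filter its output against the closed vocabularies listed above before anything reaches the graph. The following is a minimal sketch of such a filter. The record shape (`source`, `target`, `type` keys) and the function name are hypothetical, and the real Entity/Rel JSON produced by the kg stage may differ.

```python
from dataclasses import dataclass

ENTITY_TYPES = {
    "Protein", "Disease", "Drug", "Pathway", "Biomarker",
    "Gene", "CellType", "Tissue", "Organism",
}
RELATIONSHIP_TYPES = {
    "ASSOCIATES_WITH", "INTERACTS_WITH", "TREATS", "UPREGULATES",
    "DOWNREGULATES", "EXPRESSED_IN", "PARTICIPATES_IN",
}


@dataclass(frozen=True)
class Relationship:
    source: str
    target: str
    type: str


def parse_relationships(records: list[dict]) -> tuple[list[Relationship], list[dict]]:
    """Split raw records into accepted relationships and rejected records."""
    valid: list[Relationship] = []
    rejected: list[dict] = []
    for rec in records:
        # Accept only known relationship types with non-empty endpoints.
        if rec.get("type") in RELATIONSHIP_TYPES and rec.get("source") and rec.get("target"):
            valid.append(Relationship(rec["source"], rec["target"], rec["type"]))
        else:
            rejected.append(rec)
    return valid, rejected
```

Returning the rejected records instead of silently dropping them makes it easier to spot types the model invents, such as a relationship label that is not in the table.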

How Mapping Graph Connects to LLM Extractions

The mapping graph (201K SAME_AS + BROADER_THAN edges) provides identity resolution for LLM-extracted entities:

┌─────────────────────┐         ┌───────────────────────────┐
│  LLM Extraction     │         │  Mapping Graph            │
│                     │         │                           │
│  "EGFR" (Protein)  │───resolve───▶ uniprot:P00533        │
│  "lung cancer"     │───resolve───▶ mondo:MONDO:0008903   │
│  "gefitinib"       │───resolve───▶ chembl:CHEMBL939      │
└─────────────────────┘         └───────────────────────────┘
                                    SAME_AS edges
                               ┌───────────────────┐
                               │ Structured sources │
                               │ Open Targets       │
                               │ STRING             │
                               │ Reactome           │
                               └───────────────────┘
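
The resolve step in the diagram can be pictured as a lookup that tries an exact match, then a synonym table, then fuzzy matching. This is a minimal sketch using only the standard library: the identifiers are the three shown in the diagram, while the in-memory dictionaries, the synonym entries, the 0.85 cutoff, and the `resolve` function are illustrative assumptions and not how the mapping graph is actually queried.

```python
import difflib
from typing import Optional

# Canonical IDs taken from the diagram; a real run would query the mapping graph.
CANONICAL = {
    "egfr": "uniprot:P00533",
    "lung cancer": "mondo:MONDO:0008903",
    "gefitinib": "chembl:CHEMBL939",
}
# Illustrative synonym table (ERBB1 and HER1 are alternative names for EGFR).
SYNONYMS = {"erbb1": "egfr", "her1": "egfr"}


def resolve(name: str, cutoff: float = 0.85) -> Optional[str]:
    """Map an extracted surface form to a canonical ID, or None if unresolved."""
    key = " ".join(name.lower().split())  # normalise case and whitespace
    key = SYNONYMS.get(key, key)          # synonym expansion
    if key in CANONICAL:
        return CANONICAL[key]
    # Fuzzy fallback: closest known name above the similarity cutoff.
    close = difflib.get_close_matches(key, list(CANONICAL), n=1, cutoff=cutoff)
    return CANONICAL[close[0]] if close else None
```

A high cutoff is a deliberate choice in this sketch: short biomedical symbols often differ by a single character while naming different entities, so an unresolved name is safer than a wrong merge.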

Validation pipeline

LLM-extracted relationships are scored against structured data:

| LLM Relationship | Validated Against | Method |
|---|---|---|
| ASSOCIATES_WITH (Protein→Disease) | Open Targets, DISEASES 2.0, ClinVar | Check gene-disease pair exists with score > threshold |
| INTERACTS_WITH (Protein→Protein) | STRING (score > 400) | Look up protein pair in STRING links |
| TREATS (Drug→Disease) | ChEMBL, TTD | Verify drug has known mechanism |
| EXPRESSED_IN (Gene→Tissue) | GTEx (TPM > 1) | Verify measurable expression |
| PARTICIPATES_IN (Protein→Pathway) | Reactome | Check UniProt→pathway mapping |

bioingest map validate-kg   # score Neptune relationships against structured evidence
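
As an illustration of one row of the table, the INTERACTS_WITH check against STRING might look like the sketch below. It assumes STRING combined scores have already been loaded into a dictionary keyed by unordered protein pairs. The function name, the dictionary shape, and the placeholder scores in the usage lines are hypothetical, and this is not the `validate-kg` implementation.

```python
from typing import Mapping

STRING_THRESHOLD = 400  # from the table: STRING score > 400


def validate_interaction(
    pair: tuple[str, str],
    string_scores: Mapping[frozenset, int],
    threshold: int = STRING_THRESHOLD,
) -> bool:
    """Return True if the unordered pair has a STRING score above the threshold."""
    # frozenset makes the lookup direction-independent: (A, B) == (B, A).
    score = string_scores.get(frozenset(pair))
    return score is not None and score > threshold


# Placeholder identifiers and scores, for illustration only.
scores = {frozenset({"PROT_A", "PROT_B"}): 900, frozenset({"PROT_A", "PROT_C"}): 150}
supported = validate_interaction(("PROT_B", "PROT_A"), scores)    # True
unsupported = validate_interaction(("PROT_A", "PROT_C"), scores)  # False
```

Keying on an unordered pair matters because the LLM may extract an interaction in either direction, while a protein-protein interaction score is symmetric.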

CLI Options Reference

| Flag | Effect |
|---|---|
| --query, -q | Search query (required for pubmed/biorxiv/pmc) |
| --max-results | Max documents to fetch (default: 50) |
| --service | LLM backend: bedrock, local (Ollama), sagemaker |
| --model-id | Override model (e.g., us.amazon.nova-micro-v1:0) |
| --database | Neo4j database name (default: olink3) |
| --target | Graph backend: neo4j or neptune |
| --neptune-endpoint | Neptune cluster endpoint |
| --dry-run | Extract as JSONL without writing to DB |
| --stage | Stop after stage: fetch, extract, chunk, kg, resolve, write, embed |
| --force | Re-process already-ingested documents |
| --concurrency | Parallel LLM calls (default: 1, recommended: 5-10) |
| --extensions | File types for directories (e.g., .tsv,.pdf) |
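
To make the effect of --concurrency concrete, bounded parallelism over LLM calls is commonly implemented with a semaphore. The following is a generic asyncio sketch of that pattern. `run_bounded` and the `worker` callable are hypothetical names, and the pipeline's internals may be structured differently.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def run_bounded(
    items: Sequence[str],
    worker: Callable[[str], Awaitable[str]],
    concurrency: int = 5,
) -> list[str]:
    """Run worker(item) for every item with at most `concurrency` in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def guarded(item: str) -> str:
        # Wait for a free slot before starting the (LLM) call.
        async with sem:
            return await worker(item)

    # gather preserves input order in its results.
    return await asyncio.gather(*(guarded(item) for item in items))
```

With `concurrency=1` the calls run strictly one after another, which matches the default. Raising the limit trades faster throughput for more simultaneous load on the LLM backend.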