Knowledge Graph Pipeline
The ingest command runs the full knowledge graph (KG) construction pipeline: document fetching, LLM entity extraction, graph database writes, and embedding.
uv sync --extra pipeline # install Neo4j, pymupdf, langchain-ollama
Pipeline Stages
┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│  fetch  │───▶│ extract │───▶│  chunk  │───▶│   kg    │───▶│ resolve │───▶│  write  │───▶│  embed  │
└─────────┘    └─────────┘    └─────────┘    └─────────┘    └─────────┘    └─────────┘    └─────────┘
     │              │              │              │              │              │              │
 PubMed/        PDF→text       512 tokens     LLM extracts   Dedup +        Neo4j or       Vector
 bioRxiv/       (pymupdf →     64 overlap     entities +     UniProt/       Neptune        index
 PMC/local      pdftotext →    sentence-      relationships  MONDO          batched        (pgvector)
                vision LLM)    aware                         enrichment     MERGE
| Stage | What it does | Output |
|---|---|---|
| fetch | Pull docs from the PubMed/bioRxiv/PMC APIs or read local files | Raw documents |
| extract | Multi-strategy PDF extraction (pymupdf → pdftotext → vision LLM) | Plain text |
| chunk | Token-based splitting (512 tokens, 64 overlap, sentence boundaries) | Chunks |
| kg | LLM extraction of entities + relationships per chunk (async parallel) | Entity/Rel JSON |
| resolve | Entity dedup (synonyms, abbreviation expansion, fuzzy matching) | Canonical entities |
| write | Batched MERGE/CREATE into Neo4j or Neptune with provenance | Graph nodes/edges |
| embed | Generate chunk embeddings → Neo4j vector index or Aurora pgvector | Vectors |
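The sliding window used by the chunk stage can be sketched as follows. This is a minimal illustration, not the pipeline's implementation: it assumes tokens are already available as a list (the actual tokenizer is not specified here), it ignores sentence boundaries, and `chunk_tokens` is a hypothetical name.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 512, overlap: int = 64) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap  # each window starts this many tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [512, 512, 104]
```

Consecutive chunks share their boundary tokens, so an entity mention that straddles a cut still appears whole in at least one chunk.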
Running the Pipeline
CLI
# PubMed → Neo4j
uv run bioingest ingest pubmed -q "IL-6 inflammation" --max-results 200 --concurrency 5
# Local PDFs → Neptune
export NEPTUNE_ENDPOINT=cluster.xxx.neptune.amazonaws.com
uv run bioingest ingest ~/papers/ --target neptune --service bedrock --concurrency 10
# Dry run (JSONL output, no DB writes)
uv run bioingest ingest pmc -q "BRCA1 breast cancer" --dry-run
# Stage-by-stage debugging
uv run bioingest ingest ~/paper.pdf --stage chunk # stop after chunking
uv run bioingest ingest ~/paper.pdf --stage kg --dry-run # extract KG, print JSONL
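`--stage` stops the run after the named stage. Assuming the stages execute in the order shown in the diagram, the set of stages that run for a given flag value can be sketched like this; `stages_to_run` is a hypothetical helper, not part of bioingest.

```python
from typing import List, Optional

# Stage order as listed in the pipeline diagram.
STAGES = ["fetch", "extract", "chunk", "kg", "resolve", "write", "embed"]


def stages_to_run(stop_after: Optional[str] = None) -> List[str]:
    """Return the stages executed when the run stops after `stop_after`."""
    if stop_after is None:
        return list(STAGES)  # no flag: run everything
    if stop_after not in STAGES:
        raise ValueError(f"unknown stage: {stop_after!r}")
    return STAGES[: STAGES.index(stop_after) + 1]


print(stages_to_run("chunk"))  # ['fetch', 'extract', 'chunk']
```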
Programmatic API
import asyncio

from bioingest.pipeline.bridge import ingest, PipelineResult


async def main() -> None:
    # ingest() is a coroutine, so it must be awaited inside an async function
    result: PipelineResult = await ingest(
        source_type="pubmed",
        query="EGFR biomarker",
        max_results=100,
        database="olink3",
        model="bedrock:us.amazon.nova-micro-v1:0",
        target="neptune",
        neptune_endpoint="cluster.xxx.neptune.amazonaws.com",
        concurrency=5,
        progress_callback=lambda **kw: print(kw),
        cancellation_event=asyncio.Event(),
    )
    # result.status: "completed" | "failed" | "cancelled"
    # result.processed_count, result.errors, result.stage_timings
    print(result.status)


asyncio.run(main())
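How `concurrency`, `progress_callback`, and `cancellation_event` might interact can be illustrated with a stand-in coroutine. Everything below is an assumption for illustration: `fake_ingest` and `process_chunk` are hypothetical and do not reflect bioingest internals. The sketch only shows a common asyncio pattern, a semaphore-bounded set of workers that check an event before starting each item.

```python
import asyncio


async def process_chunk(i: int) -> int:
    await asyncio.sleep(0.01)  # stands in for one LLM call
    return i


async def fake_ingest(n_chunks, concurrency, progress_callback, cancellation_event):
    sem = asyncio.Semaphore(concurrency)  # at most `concurrency` calls in flight
    done = 0

    async def worker(i):
        nonlocal done
        async with sem:
            if cancellation_event.is_set():
                return None  # skip remaining work once cancellation is requested
            result = await process_chunk(i)
            done += 1
            progress_callback(processed=done, total=n_chunks)
            return result

    results = await asyncio.gather(*(worker(i) for i in range(n_chunks)))
    status = "cancelled" if cancellation_event.is_set() else "completed"
    return status, [r for r in results if r is not None]


async def main():
    cancel = asyncio.Event()
    events = []

    def on_progress(**kw):
        events.append(kw)
        if kw["processed"] == 5:
            cancel.set()  # request cancellation after five chunks

    return await fake_ingest(20, 5, on_progress, cancel), events


(status, processed), events = asyncio.run(main())
print(status, len(processed))
```

Calls already in flight when the event is set are allowed to finish, so the processed count can be slightly higher than five; only work that has not started is skipped.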
Graph Targets
| Target | Graph writes | Embedding writes | Config |
|---|---|---|---|
| neo4j | Bolt driver, batched MERGE | Neo4j vector index property | NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD |
| neptune | OpenCypher via boto3, batched UNWIND | Aurora pgvector (auto) | NEPTUNE_ENDPOINT, AURORA_ENDPOINT |
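Both targets write in batches. One common way to do this in Cypher is to pass a list of rows as a query parameter and `UNWIND` it into `MERGE` statements. The sketch below only builds the query and its parameter batches, so it runs without a database; the `Entity` label, the property names, and the batch size of 1000 are assumptions, not bioingest's actual schema.

```python
from typing import Dict, Iterator, List

# One round trip merges a whole batch; MERGE keeps the write idempotent.
MERGE_ENTITIES = """
UNWIND $rows AS row
MERGE (e:Entity {id: row.id})
SET e.name = row.name, e.type = row.type
"""


def batches(rows: List[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield consecutive slices of at most `size` rows."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]


rows = [{"id": f"ent:{i}", "name": f"entity {i}", "type": "Protein"} for i in range(2500)]
calls = [(MERGE_ENTITIES, {"rows": batch}) for batch in batches(rows, 1000)]
print([len(params["rows"]) for _, params in calls])  # [1000, 1000, 500]
```

With the Neo4j Python driver, each `(query, params)` pair could be passed to `session.run(query, params)`.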
Entity Types
| Type | Examples |
|---|---|
| Protein | EGFR, p53, IL-6 |
| Disease | Alzheimer's disease, breast cancer |
| Drug | imatinib, trastuzumab |
| Pathway | MAPK signaling, apoptosis |
| Biomarker | PSA, troponin I |
| Gene | BRCA1, TP53 |
| CellType | T cell, macrophage |
| Tissue | liver, brain cortex |
| Organism | Homo sapiens, mouse |
Relationship Types
| Relationship | Meaning | Example |
|---|---|---|
| ASSOCIATES_WITH | General association | EGFR → lung cancer |
| INTERACTS_WITH | Physical/functional PPI | EGFR → ERBB2 |
| TREATS | Drug treats disease | gefitinib → NSCLC |
| UPREGULATES | Increases expression | TNF → IL-6 |
| DOWNREGULATES | Decreases expression | rapamycin → mTOR |
| EXPRESSED_IN | Tissue expression | albumin → liver |
| PARTICIPATES_IN | Pathway membership | EGFR → MAPK signaling |
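The relationship vocabulary above can be enforced when parsing LLM output, so that a hallucinated relationship type is rejected rather than written to the graph. A minimal sketch, assuming each extracted relationship arrives as a dict with `source`, `target`, and `type` keys (the actual JSON shape is not documented here); `Relationship` and `parse_relationship` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class RelType(str, Enum):
    ASSOCIATES_WITH = "ASSOCIATES_WITH"
    INTERACTS_WITH = "INTERACTS_WITH"
    TREATS = "TREATS"
    UPREGULATES = "UPREGULATES"
    DOWNREGULATES = "DOWNREGULATES"
    EXPRESSED_IN = "EXPRESSED_IN"
    PARTICIPATES_IN = "PARTICIPATES_IN"


@dataclass(frozen=True)
class Relationship:
    source: str
    target: str
    type: RelType


def parse_relationship(raw: dict) -> Relationship:
    """Build a Relationship; RelType(...) raises ValueError for unknown types."""
    return Relationship(raw["source"], raw["target"], RelType(raw["type"]))


rel = parse_relationship({"source": "gefitinib", "target": "NSCLC", "type": "TREATS"})
print(rel.type.value)  # TREATS
```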
The mapping graph (201K SAME_AS + BROADER_THAN edges) provides identity resolution for LLM-extracted entities:
┌─────────────────────┐               ┌───────────────────────────┐
│ LLM Extraction      │               │ Mapping Graph             │
│                     │               │                           │
│ "EGFR" (Protein)    │───resolve───▶ │ uniprot:P00533            │
│ "lung cancer"       │───resolve───▶ │ mondo:MONDO:0008903       │
│ "gefitinib"         │───resolve───▶ │ chembl:CHEMBL939          │
└─────────────────────┘               └───────────────────────────┘
                                                    │
                                              SAME_AS edges
                                                    │
                                                    ▼
                                          ┌────────────────────┐
                                          │ Structured sources │
                                          │ Open Targets       │
                                          │ STRING             │
                                          │ Reactome           │
                                          └────────────────────┘
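In code, resolution amounts to a lookup from a normalized surface form to a canonical identifier. The sketch below hard-codes the three mappings from the diagram; a real resolver would query the mapping graph and would also need to handle synonyms and fuzzy matches. `MAPPING`, `normalize`, and `resolve` are hypothetical names.

```python
from typing import Optional

# The three example mappings shown in the diagram.
MAPPING = {
    "egfr": "uniprot:P00533",
    "lung cancer": "mondo:MONDO:0008903",
    "gefitinib": "chembl:CHEMBL939",
}


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(name.lower().split())


def resolve(name: str) -> Optional[str]:
    """Return the canonical identifier, or None when the entity is unknown."""
    return MAPPING.get(normalize(name))


print(resolve("  Lung   Cancer "))  # mondo:MONDO:0008903
print(resolve("unknown protein"))   # None
```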
Validation Pipeline
LLM-extracted relationships are scored against structured data:
| LLM relationship | Validated against | Method |
|---|---|---|
| ASSOCIATES_WITH (Protein→Disease) | Open Targets, DISEASES 2.0, ClinVar | Check gene-disease pair exists with score > threshold |
| INTERACTS_WITH (Protein→Protein) | STRING (score > 400) | Look up protein pair in STRING links |
| TREATS (Drug→Disease) | ChEMBL, TTD | Verify drug has known mechanism |
| EXPRESSED_IN (Gene→Tissue) | GTEx (TPM > 1) | Verify measurable expression |
| PARTICIPATES_IN (Protein→Pathway) | Reactome | Check UniProt→pathway mapping |
bioingest map validate-kg # score Neptune relationships against structured evidence
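As an illustration of the INTERACTS_WITH check, validation reduces to an order-independent lookup of a protein pair followed by a score cutoff. The scores below are made-up placeholders, not real STRING values, and `validate_interaction` is a hypothetical helper rather than the function behind `validate-kg`.

```python
from typing import Dict, FrozenSet

STRING_THRESHOLD = 400  # cutoff stated for INTERACTS_WITH

# Placeholder scores for illustration only; NOT real STRING data.
toy_links: Dict[FrozenSet[str], int] = {
    frozenset({"EGFR", "ERBB2"}): 900,
    frozenset({"EGFR", "ALB"}): 150,
}


def validate_interaction(
    a: str,
    b: str,
    links: Dict[FrozenSet[str], int],
    threshold: int = STRING_THRESHOLD,
) -> bool:
    """True when the pair is present with a score strictly above the threshold."""
    score = links.get(frozenset({a, b}), 0)  # frozenset makes the lookup symmetric
    return score > threshold


print(validate_interaction("ERBB2", "EGFR", toy_links))  # True
print(validate_interaction("EGFR", "ALB", toy_links))    # False
```

Pairs missing from the lookup are treated as unsupported rather than as errors, which keeps novel LLM-extracted relationships in the graph but unvalidated.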
CLI Options Reference
| Flag | Effect |
|---|---|
| --query, -q | Search query (required for pubmed/biorxiv/pmc) |
| --max-results | Max documents to fetch (default: 50) |
| --service | LLM backend: bedrock, local (Ollama), sagemaker |
| --model-id | Override model (e.g., us.amazon.nova-micro-v1:0) |
| --database | Neo4j database name (default: olink3) |
| --target | Graph backend: neo4j or neptune |
| --neptune-endpoint | Neptune cluster endpoint |
| --dry-run | Extract as JSONL without writing to DB |
| --stage | Stop after stage: fetch, extract, chunk, kg, resolve, write, embed |
| --force | Re-process already-ingested documents |
| --concurrency | Parallel LLM calls (default: 1, recommended: 5-10) |
| --extensions | File types for directories (e.g., .tsv,.pdf) |