LLM Ontology Extraction

Extract ontology terms and hierarchical relationships from any supported document using an LLM.

Usage

# Single file
bioingest ontology extract data/bulk/markerdb/proteins.tsv

# Directory
bioingest ontology extract data/bulk/competitors_msd/ --llm-service bedrock

# Extract and write to Neo4j
bioingest ontology extract data/bulk/ttd/ --neo4j

# Use local Ollama
bioingest ontology extract notes.md --llm-service local --model-id llama3.1:8b-instruct-q8_0

How it Works

1. Read file: TSV/CSV → markdown table, PDF → text via pymupdf, text → read directly
2. Chunk: split at paragraph boundaries (~4000 chars per chunk)
3. LLM prompt: ask for ontology terms with types and hierarchical relationships
4. Build graph: assemble terms and relationships into an OntologyGraph
5. Export/write: JSONL output or direct Neo4j write
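
The chunking step can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the function name `chunk_text`, the `max_chars` parameter, and the choice to keep an oversized paragraph whole are all assumptions.

```python
import re


def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Group paragraphs into chunks of roughly max_chars characters.

    Hypothetical helper: paragraphs are never split, so a single
    paragraph longer than max_chars becomes its own oversized chunk.
    """
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    for para in paragraphs:
        # The +2 accounts for the blank line used to rejoin paragraphs.
        added = len(para) + (2 if current else 0)
        if current and current_len + added > max_chars:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
            added = len(para)
        current.append(para)
        current_len += added

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting only at paragraph boundaries keeps each table row group or prose paragraph intact within one prompt, at the cost of chunk sizes that are approximate rather than exact.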

Extracted Structure

The LLM extracts:

Terms:

{"id": "egfr", "name": "EGFR", "type": "Protein", "definition": "Epidermal growth factor receptor", "synonyms": ["ErbB1", "HER1"]}
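
To consume term records like the one above from the JSONL output, a small typed loader can help. The sketch below assumes one JSON object per line with the fields shown; the `Term` class and `parse_term` function are hypothetical names, and treating `definition` and `synonyms` as optional is an assumption.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Term:
    """Hypothetical container mirroring the term fields shown above."""
    id: str
    name: str
    type: str
    definition: str = ""
    synonyms: list[str] = field(default_factory=list)


def parse_term(line: str) -> Term:
    """Parse one JSONL line into a Term (id, name, type assumed required)."""
    record = json.loads(line)
    return Term(
        id=record["id"],
        name=record["name"],
        type=record["type"],
        definition=record.get("definition", ""),
        synonyms=list(record.get("synonyms", [])),
    )


line = (
    '{"id": "egfr", "name": "EGFR", "type": "Protein", '
    '"definition": "Epidermal growth factor receptor", '
    '"synonyms": ["ErbB1", "HER1"]}'
)
term = parse_term(line)
print(term.name, term.synonyms)
```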

Relationships (types):

| Type | Meaning |
| --- | --- |
| IS_A | Hierarchical classification |
| PART_OF | Compositional relationship |
| REGULATES | Regulatory interaction |
| ASSOCIATED_WITH | General association |
| MEASURED_BY | Assay/measurement relationship |
| TREATS | Therapeutic relationship |
| CAUSES | Causal relationship |
| BIOMARKER_FOR | Biomarker indication |
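
One way to picture how these relationship types constrain graph assembly is the sketch below. It is an illustration only: `SketchGraph` is a hypothetical stand-in and not the tool's `OntologyGraph`, the (source, type, target) edge shape is an assumption, and the `ProteinFamily` term type is an illustrative value.

```python
from dataclasses import dataclass, field

# The eight relationship types listed in the table above.
RELATIONSHIP_TYPES = {
    "IS_A", "PART_OF", "REGULATES", "ASSOCIATED_WITH",
    "MEASURED_BY", "TREATS", "CAUSES", "BIOMARKER_FOR",
}


@dataclass
class SketchGraph:
    """Hypothetical in-memory graph: terms keyed by id plus typed edges."""
    terms: dict[str, dict] = field(default_factory=dict)
    edges: list[tuple[str, str, str]] = field(default_factory=list)

    def add_term(self, term: dict) -> None:
        self.terms[term["id"]] = term

    def add_relationship(self, source: str, rel_type: str, target: str) -> bool:
        """Add an edge; reject unknown types and dangling endpoints."""
        if rel_type not in RELATIONSHIP_TYPES:
            return False
        if source not in self.terms or target not in self.terms:
            return False
        self.edges.append((source, rel_type, target))
        return True

    def parents(self, term_id: str) -> list[str]:
        """Return the direct IS_A parents of a term."""
        return [t for s, r, t in self.edges if s == term_id and r == "IS_A"]


graph = SketchGraph()
graph.add_term({"id": "egfr", "name": "EGFR", "type": "Protein"})
graph.add_term({
    "id": "receptor_tyrosine_kinase",
    "name": "Receptor tyrosine kinase",
    "type": "ProteinFamily",  # illustrative type, not from the source
})
accepted = graph.add_relationship("egfr", "IS_A", "receptor_tyrosine_kinase")
rejected = graph.add_relationship("egfr", "BINDS", "receptor_tyrosine_kinase")
```

Rejecting types outside the fixed vocabulary is a design choice in this sketch; it guards against an LLM inventing relationship labels that downstream queries would not expect.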

Supported File Types

| Format | How it's processed |
| --- | --- |
| `.tsv`, `.csv` | Converted to markdown table for LLM context |
| `.pdf` | Text extracted with pymupdf |
| `.txt`, `.md` | Read directly |
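
The TSV/CSV conversion can be approximated with the standard library alone. The sketch below is an assumption about the general approach, not the tool's implementation: `delimited_to_markdown` is a hypothetical name, and the first row is assumed to be the header.

```python
import csv
import io


def delimited_to_markdown(text: str, delimiter: str = "\t") -> str:
    """Render delimited text as a markdown pipe table (first row = header)."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return ""
    header, body = rows[0], rows[1:]

    def fmt(cells: list[str]) -> str:
        # Pad or truncate to the header width and escape literal pipes.
        padded = (list(cells) + [""] * len(header))[: len(header)]
        return "| " + " | ".join(c.replace("|", "\\|") for c in padded) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    # Skip completely empty rows, which csv.reader yields for blank lines.
    lines.extend(fmt(row) for row in body if row)
    return "\n".join(lines)


table = delimited_to_markdown(
    "symbol\tname\nEGFR\tEpidermal growth factor receptor\n"
)
print(table)
```

For `.csv` input, the same function would be called with `delimiter=","`.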

Options

| Flag | Effect |
| --- | --- |
| `--llm-service` | LLM backend: `bedrock`, `local`, `sagemaker` |
| `--model-id` | Override model |
| `--output`, `-o` | Output directory for JSONL |
| `--neo4j` | Also write to Neo4j (uses `NEO4J_*` env vars) |
| `--graphrag-path` | Path to graphrag_api repo (auto-detected) |

Example: Building a Custom Ontology

# 1. Download competitor data
bioingest download competitors_msd

# 2. Extract ontology terms from assay specs
bioingest ontology extract data/bulk/competitors_msd/msd_assays.tsv --neo4j

# 3. Check what was created
bioingest ontology status