LLM Ontology Extraction

Extract ontology terms and hierarchical relationships from any document using an LLM.

Usage

# Single file
bioingest ontology extract data/bulk/markerdb/proteins.tsv

# Directory
bioingest ontology extract data/bulk/competitors_msd/ --llm-service bedrock

# Extract and write to Neo4j
bioingest ontology extract data/bulk/ttd/ --neo4j

# Use local Ollama
bioingest ontology extract notes.md --llm-service local --model-id llama3.1:8b-instruct-q8_0

How it Works

  1. Read file — TSV/CSV → markdown table, PDF → pymupdf, text → direct
  2. Chunk — split at paragraph boundaries (~4000 chars)
  3. LLM prompt — asks for ontology terms with types and hierarchical relationships
  4. Build graph — assemble terms and relationships into an OntologyGraph
  5. Export/write — JSONL output or direct Neo4j write
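
The chunking step (step 2) can be illustrated with a minimal Python sketch. It is not bioingest's actual implementation: the function name `chunk_text`, the blank-line paragraph separator, and the choice to keep an oversized paragraph whole are all assumptions made for illustration.

```python
def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_chars characters.

    Assumption: paragraphs are separated by blank lines.
    """
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            # The paragraph still fits, so keep growing the current chunk.
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single oversized paragraph is kept whole rather than cut mid-sentence.
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Splitting only at paragraph boundaries keeps each term and its surrounding context in the same prompt, at the cost of chunks that vary in size.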

Extracted Structure

The LLM extracts:

Terms:

{"id": "egfr", "name": "EGFR", "type": "Protein", "definition": "Epidermal growth factor receptor", "synonyms": ["ErbB1", "HER1"]}

Relationships (types):

Type              Meaning
IS_A              Hierarchical classification
PART_OF           Compositional relationship
REGULATES         Regulatory interaction
ASSOCIATED_WITH   General association
MEASURED_BY       Assay/measurement relationship
TREATS            Therapeutic relationship
CAUSES            Causal relationship
BIOMARKER_FOR     Biomarker indication
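
Because LLM output is not guaranteed to stay within these eight types, a validation pass before building the graph may be useful. The sketch below assumes a relationship record of the form `{"source": ..., "target": ..., "type": ...}`; that shape is not shown in this document, so it is hypothetical, as is the helper name `validate_relationship`.

```python
ALLOWED_TYPES = {
    "IS_A", "PART_OF", "REGULATES", "ASSOCIATED_WITH",
    "MEASURED_BY", "TREATS", "CAUSES", "BIOMARKER_FOR",
}


def validate_relationship(rel: dict, term_ids: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the record looks usable.

    Assumption: records carry "source", "target" and "type" keys.
    """
    problems: list[str] = []
    if rel.get("type") not in ALLOWED_TYPES:
        problems.append(f"unknown relationship type: {rel.get('type')!r}")
    for key in ("source", "target"):
        # Both endpoints should refer to terms that were actually extracted.
        if rel.get(key) not in term_ids:
            problems.append(f"{key} does not match a known term id: {rel.get(key)!r}")
    return problems
```

Returning the problems instead of raising lets a caller log and skip bad records without discarding the rest of a chunk.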

Supported File Types

Format       How it's processed
.tsv, .csv   Converted to markdown table for LLM context
.pdf         Text extracted with pymupdf
.txt, .md    Read directly
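
As an illustration of the TSV/CSV conversion, the following sketch turns delimited text into a markdown table using only the Python standard library. The function name `delimited_to_markdown` and details such as padding short rows and escaping pipe characters are assumptions; the tool's own conversion may differ.

```python
import csv
import io


def delimited_to_markdown(text: str, delimiter: str = "\t") -> str:
    """Render delimited text as a markdown table, treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return ""
    header, *body = rows

    def fmt(cells: list[str]) -> str:
        # Pad or truncate each row to the header width and escape literal pipes.
        cells = (list(cells) + [""] * len(header))[: len(header)]
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)
```

For a CSV file, pass `delimiter=","`.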

Options

Flag              Effect
--llm-service     LLM backend: bedrock, local, sagemaker
--model-id        Override model
--output, -o      Output directory for JSONL
--neo4j           Also write to Neo4j (uses NEO4J_* env vars)
--graphrag-path   Path to graphrag_api repo (auto-detected)

Example: Building a Custom Ontology

# 1. Download competitor data
bioingest download competitors_msd

# 2. Extract ontology terms from assay specs
bioingest ontology extract data/bulk/competitors_msd/msd_assays.tsv --neo4j

# 3. Check what was created
bioingest ontology status