LLM Ontology Extraction
Extract ontology terms and hierarchical relationships from a document, or a directory of documents, using an LLM.
Usage
# Single file
bioingest ontology extract data/bulk/markerdb/proteins.tsv
# Directory
bioingest ontology extract data/bulk/competitors_msd/ --llm-service bedrock
# Extract and write to Neo4j
bioingest ontology extract data/bulk/ttd/ --neo4j
# Use local Ollama
bioingest ontology extract notes.md --llm-service local --model-id llama3.1:8b-instruct-q8_0
How it Works
- Read file — TSV/CSV → markdown table, PDF → pymupdf, text → direct
- Chunk — split at paragraph boundaries (~4000 chars)
- LLM prompt — asks for ontology terms with types and hierarchical relationships
- Build graph — assemble terms and relationships into an OntologyGraph
- Export/write — JSONL output or direct Neo4j write
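The chunking step above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the function name `chunk_text`, the blank-line paragraph rule, and the greedy packing strategy are all assumptions; only the ~4000-character target comes from the description above.

```python
import re


def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most ~max_chars characters.

    Hypothetical helper: paragraphs are never split, so a single paragraph
    longer than max_chars becomes its own oversized chunk.
    """
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # The paragraph fits, or it is the first paragraph of a new chunk.
            current = candidate
        else:
            # Adding the paragraph would overflow: close the chunk, start a new one.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Splitting only at paragraph boundaries keeps each term and its surrounding context in the same prompt, at the cost of somewhat uneven chunk sizes.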
Extracted Structure
The LLM extracts:
Terms:
{"id": "egfr", "name": "EGFR", "type": "Protein", "definition": "Epidermal growth factor receptor", "synonyms": ["ErbB1", "HER1"]}
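When consuming terms like the one above, it can help to validate each record before using it. The sketch below is one possible way to do that. The `Term` dataclass and `parse_term` function are hypothetical names, and treating `id`, `name`, and `type` as required is an assumption rather than documented behavior.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Term:
    """Hypothetical container mirroring the fields of the example term."""
    id: str
    name: str
    type: str
    definition: str = ""
    synonyms: list[str] = field(default_factory=list)


def parse_term(line: str) -> Term:
    """Parse one JSON object into a Term, rejecting records without core fields."""
    data = json.loads(line)
    # Assumption: id, name and type are mandatory; the rest are optional.
    missing = [key for key in ("id", "name", "type") if not data.get(key)]
    if missing:
        raise ValueError(f"term is missing required fields: {missing}")
    return Term(
        id=data["id"],
        name=data["name"],
        type=data["type"],
        definition=data.get("definition", ""),
        synonyms=list(data.get("synonyms") or []),
    )


term = parse_term(
    '{"id": "egfr", "name": "EGFR", "type": "Protein", '
    '"definition": "Epidermal growth factor receptor", '
    '"synonyms": ["ErbB1", "HER1"]}'
)
```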
Relationships (types):
| Type | Meaning |
|---|---|
| IS_A | Hierarchical classification |
| PART_OF | Compositional relationship |
| REGULATES | Regulatory interaction |
| ASSOCIATED_WITH | General association |
| MEASURED_BY | Assay/measurement relationship |
| TREATS | Therapeutic relationship |
| CAUSES | Causal relationship |
| BIOMARKER_FOR | Biomarker indication |
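The relationship types in the table can be treated as a closed vocabulary when assembling a graph. The sketch below is illustrative only: `RelationType` and `SimpleOntologyGraph` are hypothetical names (this is not the tool's `OntologyGraph`), and the source, type, target triple layout is an assumption, since the relationship record format is not shown here.

```python
from dataclasses import dataclass, field
from enum import Enum


class RelationType(str, Enum):
    """The eight relationship types listed in the table above."""
    IS_A = "IS_A"
    PART_OF = "PART_OF"
    REGULATES = "REGULATES"
    ASSOCIATED_WITH = "ASSOCIATED_WITH"
    MEASURED_BY = "MEASURED_BY"
    TREATS = "TREATS"
    CAUSES = "CAUSES"
    BIOMARKER_FOR = "BIOMARKER_FOR"


@dataclass
class SimpleOntologyGraph:
    """Hypothetical in-memory graph: term ids mapped to names, plus typed edges."""
    terms: dict[str, str] = field(default_factory=dict)
    edges: set[tuple[str, str, str]] = field(default_factory=set)

    def add_term(self, term_id: str, name: str) -> None:
        # Keep the first name seen for an id, so repeats across chunks are harmless.
        self.terms.setdefault(term_id, name)

    def add_relationship(self, source: str, rel_type: str, target: str) -> bool:
        """Add an edge; return False for unknown types or unknown endpoints."""
        try:
            rel = RelationType(rel_type)
        except ValueError:
            return False
        if source not in self.terms or target not in self.terms:
            return False
        self.edges.add((source, rel.value, target))
        return True

    def parents(self, term_id: str) -> list[str]:
        """Return the ids that term_id points to through IS_A edges."""
        return sorted(
            target
            for source, rel, target in self.edges
            if source == term_id and rel == RelationType.IS_A.value
        )


graph = SimpleOntologyGraph()
graph.add_term("egfr", "EGFR")
graph.add_term("rtk", "Receptor tyrosine kinase")  # illustrative parent term
graph.add_relationship("egfr", "IS_A", "rtk")
```

Rejecting edges with unknown types or endpoints is one way to guard against malformed LLM output; a real pipeline might log such edges instead of dropping them silently.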
Supported File Types
| Format | How it's processed |
|---|---|
| .tsv, .csv | Converted to markdown table for LLM context |
| .pdf | Text extracted with pymupdf |
| .txt, .md | Read directly |
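As a rough illustration of the TSV/CSV conversion, the following sketch renders delimited text as a markdown table using only the standard library. The function name `delimited_to_markdown` and details such as pipe escaping and padding of short rows are assumptions, not the tool's documented behavior.

```python
import csv
import io


def delimited_to_markdown(text: str, delimiter: str = "\t") -> str:
    """Render delimited text as a markdown table, using the first row as header."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row: list[str]) -> str:
        # Escape pipes so cell content cannot break the table layout.
        cells = [cell.strip().replace("|", "\\|") for cell in row]
        # Pad short rows so every line has the same number of columns.
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "|" + "|".join(["---"] * width) + "|"]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)


table = delimited_to_markdown("gene\tname\nEGFR\tEpidermal growth factor receptor\n")
```

For a `.csv` file the same function would be called with `delimiter=","`.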
Options
| Flag | Effect |
|---|---|
| --llm-service | LLM backend: bedrock, local, sagemaker |
| --model-id | Override model |
| --output, -o | Output directory for JSONL |
| --neo4j | Also write to Neo4j (uses NEO4J_* env vars) |
| --graphrag-path | Path to graphrag_api repo (auto-detected) |