OBO Parsing (Structured Mode)

Deterministic parsing of OBO ontology files. No LLM is needed, so parsing is fast and reproducible.

Usage

# Export as JSONL (for graphrag_api bulk loader)
bioingest ontology export

# Write directly to Neo4j
bioingest ontology build --uri bolt://localhost:7687 --user neo4j --password secret --database olink3

What Gets Parsed

From each OBO [Term] block:

| Field | Example | Stored as |
| --- | --- | --- |
| id | MONDO:0000002 | Node ID |
| name | cardiovascular disease | Node name |
| def | "A disease of..." | Node definition |
| synonym (EXACT) | "CVD" | Node synonyms list |
| xref | DOID:1287 | Node xrefs + XREF edge |
| is_a | MONDO:0000001 | IS_A edge |
| relationship | part_of MONDO:0000001 | Named edge |

Obsolete terms (is_obsolete: true) are excluded.
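
To make the field mapping concrete, here is a minimal sketch of a [Term] stanza parser using only the Python standard library. It is not bioingest's implementation: the function names (`parse_terms`, `to_record`), the record layout, and the choice to keep relationship names exactly as written in the file are assumptions made for illustration. It reads "synonym (EXACT)" as "keep only EXACT-scoped synonyms", which is also an interpretation.

```python
import re
from collections import defaultdict

# Small illustrative OBO snippet (hypothetical, not taken from a real release).
SAMPLE_OBO = '''\
format-version: 1.2

[Term]
id: MONDO:0000002
name: cardiovascular disease
def: "A disease of the cardiovascular system." []
synonym: "CVD" EXACT []
xref: DOID:1287
is_a: MONDO:0000001 ! disease
relationship: part_of MONDO:0000001 ! disease

[Term]
id: MONDO:0000099
name: retired term
is_obsolete: true

[Typedef]
id: part_of
name: part of
'''

QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')


def strip_comment(value):
    """Drop a trailing ' ! human readable label' comment."""
    return value.split(" ! ", 1)[0].strip()


def first_quoted(value):
    """Return the first double-quoted string in value, or value itself."""
    m = QUOTED.search(value)
    return m.group(1) if m else value


def exact_synonyms(values):
    """Keep the text of synonyms whose scope token is EXACT."""
    out = []
    for v in values:
        m = QUOTED.match(v)
        if not m:
            continue
        rest = v[m.end():].split()
        if rest and rest[0] == "EXACT":
            out.append(m.group(1))
    return out


def to_record(t):
    """Map one parsed stanza (tag -> list of values) to a node plus edges."""
    xrefs = [strip_comment(v).split()[0] for v in t["xref"] if v]
    edges = [("IS_A", strip_comment(v)) for v in t["is_a"]]
    for v in t["relationship"]:
        rel, target = strip_comment(v).split()[:2]
        edges.append((rel, target))  # edge named after the relation (assumed casing)
    edges.extend(("XREF", x) for x in xrefs)
    return {
        "id": t["id"][0],
        "name": t["name"][0] if t["name"] else None,
        "definition": first_quoted(t["def"][0]) if t["def"] else None,
        "synonyms": exact_synonyms(t["synonym"]),
        "xrefs": xrefs,
        "edges": edges,
    }


def parse_terms(text):
    """Return records for all non-obsolete [Term] stanzas in text."""
    stanzas = []
    current = None  # tag -> values for the [Term] being read, else None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            if current is not None:
                stanzas.append(current)
            current = defaultdict(list) if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue  # header lines, blank lines, non-Term stanzas
        tag, _, value = line.partition(":")
        current[tag.strip()].append(value.strip())
    if current is not None:
        stanzas.append(current)
    return [to_record(t) for t in stanzas if "true" not in t.get("is_obsolete", [])]
```

Calling `parse_terms(SAMPLE_OBO)` returns a single record for MONDO:0000002; the obsolete term and the [Typedef] stanza are skipped.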

JSONL Export Format

The exported files are compatible with graphrag_api's consolidated_state/ format:

ontology_nodes.jsonl (one JSON object per line; pretty-printed here for readability):

{
  "id": "MONDO:0000002",
  "name": "cardiovascular disease",
  "type": "Disease",
  "definition": "A disease of the cardiovascular system.",
  "synonyms": ["CVD"],
  "xrefs": ["DOID:1287", "EFO:0000319"],
  "source": "mondo",
  "_doc_id": "ontology_mondo",
  "_chunk_id": "obo_MONDO:0000002",
  "publication_count": 0
}

ontology_relationships.jsonl (one JSON object per line; pretty-printed here for readability):

{
  "source_id": "MONDO:0000002",
  "target_id": "MONDO:0000001",
  "type": "IS_A",
  "occurrence_count": 1,
  "confidence": 1.0,
  "consolidated": true,
  "evidence_doc_ids": ["ontology_mondo"]
}
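
As a rough illustration of the two record shapes, the sketch below builds one node and one relationship and serializes them as JSONL. The helper names (`node_record`, `relationship_record`, `to_jsonl`) are hypothetical, and the `_doc_id` and `_chunk_id` patterns ("ontology_" + source, "obo_" + term ID) are inferred from the examples above rather than confirmed by the tool. The constant fields are copied from the examples.

```python
import json


def node_record(term_id, name, definition, synonyms, xrefs, source, node_type):
    """Build a node dict shaped like the ontology_nodes.jsonl example."""
    return {
        "id": term_id,
        "name": name,
        "type": node_type,
        "definition": definition,
        "synonyms": list(synonyms),
        "xrefs": list(xrefs),
        "source": source,
        "_doc_id": f"ontology_{source}",   # pattern inferred from the example
        "_chunk_id": f"obo_{term_id}",     # pattern inferred from the example
        "publication_count": 0,
    }


def relationship_record(source_id, target_id, rel_type, source):
    """Build an edge dict shaped like the ontology_relationships.jsonl example."""
    return {
        "source_id": source_id,
        "target_id": target_id,
        "type": rel_type,
        "occurrence_count": 1,
        "confidence": 1.0,
        "consolidated": True,
        "evidence_doc_ids": [f"ontology_{source}"],
    }


def to_jsonl(records):
    """Serialize records as JSONL: exactly one compact JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


node = node_record(
    "MONDO:0000002",
    "cardiovascular disease",
    "A disease of the cardiovascular system.",
    ["CVD"],
    ["DOID:1287", "EFO:0000319"],
    source="mondo",
    node_type="Disease",
)
edge = relationship_record("MONDO:0000002", "MONDO:0000001", "IS_A", source="mondo")

nodes_jsonl = to_jsonl([node])
edges_jsonl = to_jsonl([edge])
```

Each string can be written to its file as-is. `json.dumps` never emits raw newlines inside an object, so every record stays on one line.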

Version Tracking

Each build creates an OntologyVersion node:

(:OntologyVersion {
  version_id: "a3f2b1c9...",   // SHA-256 of all term IDs
  timestamp: "2026-05-15T...",
  node_count: 154000,
  sources: ["mondo", "disease_ontology", "efo", "gene_ontology"]
})

Re-running the build with the same OBO files is idempotent and yields the same version_id. Updated files produce a new version_id.
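
One way a content-derived version_id like this could be computed is sketched below. The source only states that the ID is a SHA-256 of all term IDs; deduplicating, sorting, and joining the IDs with newlines are assumptions added here so that the result does not depend on file or iteration order. The function name `compute_version_id` is hypothetical.

```python
import hashlib


def compute_version_id(term_ids):
    """Hash a collection of term IDs into a hex SHA-256 digest.

    Sorting and deduplicating first (an assumption, not documented behavior)
    makes the digest independent of the order in which terms were read.
    """
    digest = hashlib.sha256()
    for term_id in sorted(set(term_ids)):
        digest.update(term_id.encode("utf-8"))
        digest.update(b"\n")  # separator so adjacent IDs cannot run together
    return digest.hexdigest()


# Same IDs in a different order give the same version.
v1 = compute_version_id(["MONDO:0000001", "MONDO:0000002"])
v2 = compute_version_id(["MONDO:0000002", "MONDO:0000001"])

# Adding a term changes the version.
v3 = compute_version_id(["MONDO:0000001", "MONDO:0000002", "MONDO:0000003"])
```

Under this scheme, a rebuild from unchanged files reproduces the same digest, while any added or removed term ID yields a different one.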