
Local File Ingestion

Ingest any local file or directory into the knowledge graph.

# Single file
bioingest ingest paper.pdf
bioingest ingest data/bulk/markerdb/proteins.tsv

# Directory (recursive)
bioingest ingest data/bulk/ttd/
bioingest ingest ~/papers/ --extensions .pdf

Supported Formats

Format     Extension      Extraction Strategy
PDF        .pdf           pymupdf page-by-page text extraction
TSV        .tsv           Convert to markdown table
CSV        .csv           Convert to markdown table
Text       .txt           Direct read
Markdown   .md            Direct read
HTML       .html, .htm    Strip tags, remove script/style
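As an illustration of the HTML strategy, here is a minimal sketch of stripping tags while discarding script and style content, using only Python's standard `html.parser`. The names `_TextExtractor` and `html_to_text` are hypothetical and chosen for this example; bioingest's actual implementation may differ.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and ignores anything inside <script> or <style>.

    Hypothetical helper for illustration; not taken from bioingest.
    """

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, one fragment per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = "<html><head><script>var x = 1;</script></head><body><p>BRCA1 is a gene.</p></body></html>"
print(html_to_text(page))
```

The sketch joins text fragments with newlines; a real extractor might prefer to preserve paragraph boundaries more carefully.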

Directory Processing

When given a directory, bioingest walks it recursively and ingests every file whose extension matches. Use --extensions to restrict the set:

# Default extensions: .tsv, .csv, .txt, .md, .pdf
bioingest ingest data/bulk/

# Only specific types
bioingest ingest data/bulk/ --extensions .tsv,.csv
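The discovery step can be pictured roughly as follows. This is a sketch under stated assumptions, not bioingest's code: `find_files` and `parse_extensions` are hypothetical names, the default list is copied from the comment above, and case-insensitive extension matching is an assumption.

```python
from pathlib import Path

# Default extensions as listed in the documentation above.
DEFAULT_EXTENSIONS = (".tsv", ".csv", ".txt", ".md", ".pdf")


def parse_extensions(value: str) -> tuple:
    """Parse a comma-separated value such as '.tsv,.csv' (hypothetical helper)."""
    parts = [p.strip().lower() for p in value.split(",") if p.strip()]
    # Tolerate entries written without the leading dot.
    return tuple(p if p.startswith(".") else "." + p for p in parts)


def find_files(root, extensions=DEFAULT_EXTENSIONS):
    """Return matching files under root, searched recursively and sorted.

    Assumption: a path that is itself a file is returned as-is, mirroring
    the single-file usage shown earlier.
    """
    root = Path(root).expanduser()
    if root.is_file():
        return [root]
    wanted = {e.lower() for e in extensions}
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )
```

Calling `find_files("data/bulk/", parse_extensions(".tsv,.csv"))` would correspond to the second command above.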

Table Handling

TSV and CSV files are converted to markdown tables before LLM extraction. The explicit header row and column separators give the LLM better context about which column each value belongs to:

Input TSV:
Gene    Disease         Score
BRCA1   Breast Cancer   0.95
TP53    Lung Cancer     0.88

Becomes:
| Gene | Disease | Score |
| --- | --- | --- |
| BRCA1 | Breast Cancer | 0.95 |
| TP53 | Lung Cancer | 0.88 |

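The LLM then extracts entities from cells and relationships from row associations. In the first row above, for example, BRCA1 and Breast Cancer are candidate entities, and sharing a row suggests a relationship between them.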
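The conversion itself can be sketched with Python's standard `csv` module. This is an illustrative version only: the function name `delimited_to_markdown` is hypothetical, and the padding of short rows and escaping of pipe characters are assumptions rather than documented bioingest behavior.

```python
import csv
import io


def delimited_to_markdown(text: str, delimiter: str = "\t") -> str:
    """Convert delimited text (TSV by default) into a markdown table."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    width = len(header)

    def fmt(cells):
        # Pad or truncate to the header's column count, and escape pipes
        # so cell text cannot break the table layout.
        cells = (list(cells) + [""] * width)[:width]
        return "| " + " | ".join(c.strip().replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(fmt(r) for r in body)
    return "\n".join(lines)


sample = "Gene\tDisease\tScore\nBRCA1\tBreast Cancer\t0.95\nTP53\tLung Cancer\t0.88\n"
print(delimited_to_markdown(sample))
```

Run on the sample TSV, this prints the same four-line markdown table shown above. Passing `delimiter=","` handles CSV input the same way.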

Examples

# Ingest all TTD data
bioingest ingest data/bulk/ttd/ --database olink3

# Ingest competitor data
bioingest ingest data/bulk/competitors_msd/msd_assays.tsv --service bedrock

# Dry-run a PDF
bioingest ingest ~/Downloads/paper.pdf --dry-run