dcolinmorgan/bioingest

bioRxiv Ingestion

Fetches preprints from bioRxiv using the public content API.

bioingest ingest biorxiv --query "proximity extension assay" --max-results 50

How it works

Queries the bioRxiv details API for recent papers (last 365 days by default)
Filters the results client-side by matching the query against each title and abstract
Turns each matching abstract into a single chunk, which is passed to LLM extraction
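
The filtering and chunking steps above can be sketched in a few lines of Python. This is an illustration only, not BioIngest's actual code: the `Chunk` type and both function names are hypothetical, the record keys (`doi`, `title`, `abstract`) are assumed from the shape of the details API response, and a case-insensitive substring match is an assumption about how the query is compared.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """One unit of text handed to the extraction step (hypothetical type)."""
    doi: str
    text: str


def matches_query(record: dict, query: str) -> bool:
    """Case-insensitive substring match against title and abstract (assumed rule)."""
    haystack = f"{record.get('title', '')} {record.get('abstract', '')}".lower()
    return query.lower() in haystack


def abstracts_to_chunks(records: list[dict], query: str, max_results: int) -> list[Chunk]:
    """Keep matching records, in order, and emit one chunk per abstract."""
    chunks: list[Chunk] = []
    for record in records:
        if len(chunks) >= max_results:
            break  # mirrors the --max-results cap
        if matches_query(record, query):
            chunks.append(Chunk(doi=record.get("doi", ""), text=record.get("abstract", "")))
    return chunks
```

Because matching happens after download, a narrow query can scan many records before `max_results` is reached, which is why the date window matters for run time.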

API Details

Endpoint: https://api.biorxiv.org/details/biorxiv/{from}/{to}/{cursor}
Rate limit: 1 request/second (self-imposed)
Search: Client-side text matching (bioRxiv has no server-side search)
Coverage: Last 365 days by default
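
As a rough sketch of how the endpoint template, the 365-day window, and the one-request-per-second throttle fit together, the Python below builds the URL and pages through results. The function names are hypothetical, and several details are assumptions rather than confirmed behavior of BioIngest: dates formatted as `YYYY-MM-DD`, records returned under a `collection` key, and the cursor advancing by the number of records received.

```python
import json
import time
import urllib.request
from datetime import date, timedelta

BASE_URL = "https://api.biorxiv.org/details/biorxiv"


def default_window(today: date, days: int = 365) -> tuple[date, date]:
    """Return the (from, to) dates covering the last `days` days."""
    return today - timedelta(days=days), today


def details_url(start: date, end: date, cursor: int = 0) -> str:
    """Fill the {from}/{to}/{cursor} path segments, using ISO dates (assumed format)."""
    return f"{BASE_URL}/{start.isoformat()}/{end.isoformat()}/{cursor}"


def iter_records(start: date, end: date, delay: float = 1.0):
    """Yield records page by page, pausing `delay` seconds between requests."""
    cursor = 0
    while True:
        with urllib.request.urlopen(details_url(start, end, cursor), timeout=30) as resp:
            page = json.load(resp)
        collection = page.get("collection", [])  # assumed response key
        if not collection:
            break  # an empty page is treated as the end of the results
        yield from collection
        cursor += len(collection)  # assumed: cursor is a record offset
        time.sleep(delay)  # self-imposed limit of 1 request/second
```

Only `iter_records` touches the network; `default_window` and `details_url` are pure helpers, so the URL construction can be checked offline.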

Example

# Fetch preprints about proteomics methods
bioingest ingest biorxiv -q "proteomics" --max-results 100 --database olink3

# Preview without Neo4j
bioingest ingest biorxiv -q "Olink" --dry-run

Limitations

No server-side full-text search: matching relies on title and abstract text only
Only abstracts are fetched, not the full PDF text
For full-text extraction from bioRxiv PDFs, download them first and use local file ingestion