bioRxiv Ingestion
Fetches preprints from bioRxiv using the public content API.
How it works
- Queries the bioRxiv details API for recent papers (last 365 days)
- Filters results client-side by matching the query against each paper's title and abstract
- Each matching abstract becomes a single chunk, which is then passed to LLM extraction
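The filter-and-chunk steps above can be sketched as follows. This is a minimal illustration, not bioingest's actual code: the `Chunk` class and the `matches` and `to_chunks` functions are hypothetical names, the case-insensitive substring match is an assumed matching rule, and the record keys (`doi`, `title`, `abstract`) are assumed to follow the fields returned by the bioRxiv details API.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """One abstract, ready to hand to the extraction step (hypothetical type)."""
    doi: str
    title: str
    text: str


def matches(record: dict, query: str) -> bool:
    """Case-insensitive substring match against title and abstract (assumed rule)."""
    needle = query.casefold()
    haystack = f"{record.get('title', '')} {record.get('abstract', '')}".casefold()
    return needle in haystack


def to_chunks(records, query: str, max_results: int = 100) -> list:
    """Keep matching records and turn each abstract into exactly one chunk."""
    chunks = []
    for record in records:
        if not matches(record, query):
            continue
        chunks.append(
            Chunk(
                doi=record.get("doi", ""),
                title=record.get("title", ""),
                text=record.get("abstract", ""),
            )
        )
        # Stop early once the requested number of results is reached.
        if len(chunks) >= max_results:
            break
    return chunks
```

Because matching happens after download, a narrow query still costs as many API requests as a broad one; only the number of chunks sent to extraction changes.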
API Details
- Endpoint: https://api.biorxiv.org/details/biorxiv/{from}/{to}/{cursor}
- Rate limit: 1 request/second (self-imposed)
- Search: Client-side text matching (bioRxiv has no server-side search)
- Coverage: Last 365 days by default
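As a rough sketch of how the endpoint, the date window, and the self-imposed rate limit fit together, the snippet below pages through the details API one request at a time. The function names (`details_url`, `default_window`, `fetch_json`, `iter_records`) are hypothetical and not taken from bioingest. It assumes `{from}` and `{to}` are `YYYY-MM-DD` dates, that `{cursor}` is a zero-based record offset, and that each response carries its records under a `collection` key.

```python
import json
import time
import urllib.request
from datetime import date, timedelta

BASE_URL = "https://api.biorxiv.org/details/biorxiv"


def details_url(start: date, end: date, cursor: int = 0) -> str:
    """Fill in the {from}/{to}/{cursor} path segments of the endpoint."""
    return f"{BASE_URL}/{start.isoformat()}/{end.isoformat()}/{cursor}"


def default_window(today: date, days: int = 365) -> tuple:
    """Return the (from, to) dates covering the last `days` days."""
    return today - timedelta(days=days), today


def fetch_json(url: str) -> dict:
    """Perform one GET request and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def iter_records(start, end, fetch=fetch_json, delay=1.0, sleep=time.sleep):
    """Yield records page by page, pausing `delay` seconds between requests."""
    cursor = 0
    while True:
        payload = fetch(details_url(start, end, cursor))
        batch = payload.get("collection", [])  # assumed response key
        if not batch:
            return
        yield from batch
        cursor += len(batch)  # assumed: cursor is a record offset
        sleep(delay)  # self-imposed limit of 1 request/second
```

Passing `fetch` and `sleep` as parameters keeps the pagination logic testable without network access or real delays.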
Example
# Fetch preprints about proteomics methods
bioingest ingest biorxiv -q "proteomics" --max-results 100 --database olink3
# Preview without Neo4j
bioingest ingest biorxiv -q "Olink" --dry-run
Limitations
- No server-side full-text search: results depend entirely on client-side title/abstract matching
- Only abstracts are fetched, not the full PDF text
- For full-text extraction from bioRxiv PDFs, download them first and use local file ingestion