bioRxiv Ingestion

Fetches preprints from bioRxiv using the public content API.

bioingest ingest biorxiv --query "proximity extension assay" --max-results 50

How it works

  1. Queries the bioRxiv details API for papers posted in the last 365 days
  2. Filters the results client-side by matching the query against each paper's title and abstract
  3. Turns each matching abstract into a single chunk and passes it to LLM extraction
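
The steps above can be sketched in Python as follows. This is only an illustration, not the tool's actual implementation: the function names (date_window, matches_query, to_chunks) are hypothetical, the matching rule is assumed to be a case-insensitive substring test, and the record fields (doi, title, abstract) are assumed to follow the bioRxiv API response.

```python
from datetime import date, timedelta


def date_window(today: date, days: int = 365) -> tuple[str, str]:
    """Return (from, to) as ISO dates covering the last `days` days."""
    return (today - timedelta(days=days)).isoformat(), today.isoformat()


def matches_query(query: str, paper: dict) -> bool:
    """Assumed rule: case-insensitive substring match on title + abstract."""
    haystack = f"{paper.get('title', '')} {paper.get('abstract', '')}".lower()
    return query.lower() in haystack


def to_chunks(papers: list[dict], query: str, max_results: int) -> list[dict]:
    """Keep matching papers and emit one chunk per abstract."""
    chunks: list[dict] = []
    for paper in papers:
        if len(chunks) >= max_results:
            break  # stop once the requested number of results is reached
        if matches_query(query, paper):
            # One abstract becomes one chunk, ready for LLM extraction.
            chunks.append({"doi": paper.get("doi"), "text": paper.get("abstract", "")})
    return chunks
```

Because filtering happens after the records are fetched, a narrow query may still require paging through the whole date window before max-results matches are found.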

API Details

  • Endpoint: https://api.biorxiv.org/details/biorxiv/{from}/{to}/{cursor}
  • Rate limit: 1 request/second (self-imposed)
  • Search: Client-side text matching (the bioRxiv API has no server-side keyword search)
  • Coverage: Last 365 days by default
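
A minimal sketch of how the endpoint and the self-imposed rate limit could fit together is shown below. The names build_url and iter_records are hypothetical, and the HTTP call is injected as fetch_json so the pacing logic can be read (and tested) without network access. The sketch assumes that each response carries its records under a "collection" key and that the cursor is the zero-based offset of the next record.

```python
import time
from typing import Callable, Iterator

BASE = "https://api.biorxiv.org/details/biorxiv"


def build_url(start: str, end: str, cursor: int = 0) -> str:
    """Fill the {from}/{to}/{cursor} slots of the details endpoint."""
    return f"{BASE}/{start}/{end}/{cursor}"


def iter_records(
    fetch_json: Callable[[str], dict],
    start: str,
    end: str,
    min_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield records page by page, waiting at least `min_interval` seconds between requests."""
    cursor = 0
    last_call = None
    while True:
        if last_call is not None:
            wait = min_interval - (time.monotonic() - last_call)
            if wait > 0:
                sleep(wait)  # self-imposed limit: 1 request/second by default
        last_call = time.monotonic()
        page = fetch_json(build_url(start, end, cursor))
        records = page.get("collection", [])  # assumed response key
        if not records:
            return  # an empty page ends the walk
        yield from records
        cursor += len(records)  # assumed: cursor is a record offset
```

In real use, fetch_json could be a small wrapper around urllib.request.urlopen and json.load; injecting it keeps the rate limiting independent of the HTTP client.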

Example

# Fetch preprints about proteomics methods
bioingest ingest biorxiv -q "proteomics" --max-results 100 --database olink3

# Preview without Neo4j
bioingest ingest biorxiv -q "Olink" --dry-run

Limitations

  • No server-side full-text search: matching relies on titles and abstracts only
  • Only abstracts are fetched, not full PDF text
  • Workaround: for full-text extraction from bioRxiv PDFs, download them first and use local file ingestion