
PubMed & PMC Ingestion

PubMed Abstracts

Fetches abstracts from NCBI PubMed using the Entrez API.

bioingest ingest pubmed --query "EGFR biomarker proteomics" --max-results 200

How it works

  1. Entrez.esearch — search PubMed for matching PMIDs
  2. Entrez.efetch — batch-fetch article XML (100 per batch)
  3. Parse title, abstract, publication year
  4. Each abstract becomes one chunk → LLM extraction
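The search, batch-fetch, and parse steps above can be sketched roughly as follows. This is an illustrative sketch, not bioingest's actual code: the names `batched`, `parse_pubmed_xml`, and `fetch_abstracts` are hypothetical, and it assumes Biopython's `Bio.Entrez` module is the Entrez client in use. Only the batch size of 100 and the parsed fields come from the description above.

```python
import xml.etree.ElementTree as ET

BATCH_SIZE = 100  # batch size stated in the steps above


def batched(pmids, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` PMIDs."""
    for start in range(0, len(pmids), size):
        yield pmids[start:start + size]


def parse_pubmed_xml(xml_text):
    """Extract PMID, title, abstract, and publication year from PubMed XML."""
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        title_el = article.find(".//ArticleTitle")
        title = "".join(title_el.itertext()).strip() if title_el is not None else ""
        # Structured abstracts have several AbstractText parts; join them.
        parts = ["".join(el.itertext()).strip() for el in article.iter("AbstractText")]
        records.append({
            "pmid": article.findtext("MedlineCitation/PMID"),
            "title": title,
            "abstract": " ".join(p for p in parts if p),
            "year": article.findtext(".//PubDate/Year"),  # None if only MedlineDate is given
        })
    return records


def fetch_abstracts(query, email, max_results=200):
    """Hypothetical end-to-end helper: esearch, then efetch in batches."""
    from Bio import Entrez  # Biopython; imported lazily so the parser works without it

    Entrez.email = email
    handle = Entrez.esearch(db="pubmed", term=query, retmax=max_results)
    pmids = Entrez.read(handle)["IdList"]
    handle.close()

    records = []
    for batch in batched(pmids):
        handle = Entrez.efetch(db="pubmed", id=",".join(batch), retmode="xml")
        records.extend(parse_pubmed_xml(handle.read()))
        handle.close()
    return records
```

Each returned record would then correspond to one chunk passed on to LLM extraction.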

Configuration

ENTREZ_EMAIL=your@email.com     # Required by NCBI
ENTREZ_API_KEY=your-key         # Optional, raises the rate limit from 3 to 10 req/sec
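
As a rough illustration of how these two variables might be consumed, the sketch below reads them from the environment and derives the request rate. The function name `load_entrez_config` and the returned dictionary keys are hypothetical. Only the variable names and the 3 and 10 req/sec figures come from this page.

```python
import os


def load_entrez_config(env=None):
    """Read Entrez settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env

    email = env.get("ENTREZ_EMAIL")
    if not email:
        # NCBI requires a contact email, so fail early rather than mid-ingest.
        raise RuntimeError("ENTREZ_EMAIL must be set")

    api_key = env.get("ENTREZ_API_KEY") or None
    return {
        "email": email,
        "api_key": api_key,
        # 10 req/sec with a key, 3 req/sec without, as documented here.
        "requests_per_second": 10 if api_key else 3,
    }
```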

Rate Limits

  • Without API key: 3 requests/second
  • With API key: 10 requests/second
  • Automatic retry (3 attempts) on transient failures
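
One way to honor these limits is a minimum-interval limiter combined with a bounded retry loop. The sketch below is an assumption about how that could look, not the tool's implementation: `RateLimiter` and `with_retry` are hypothetical names, and the exponential backoff and the choice of `OSError` as the "transient" error class are illustrative (the page only states 3 attempts).

```python
import time


class RateLimiter:
    """Enforce a minimum interval between calls (1 / requests_per_second)."""

    def __init__(self, requests_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def with_retry(func, attempts=3, backoff=0.5, transient=(OSError,), sleep=time.sleep):
    """Call func(), retrying on transient errors up to `attempts` total tries."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except transient:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(backoff * 2 ** (attempt - 1))  # assumed backoff: 0.5 s, then 1 s
```

A caller would invoke `limiter.wait()` before each Entrez request and wrap the request itself in `with_retry`. Network errors raised by `urllib` derive from `OSError`, which is why it serves as the default here.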

PMC Full-Text

Fetches full-text open-access articles from PubMed Central.

bioingest ingest pmc --query "Olink proteomics cardiovascular" --max-results 20

How it works

  1. Entrez.esearch on PMC database with open access[filter]
  2. Entrez.efetch — fetch full XML for each article
  3. Parse sections (abstract, body paragraphs)
  4. Each section becomes a separate chunk → LLM extraction
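
The section-parsing step can be illustrated with a small sketch. It assumes the fetched XML follows the JATS layout PMC uses (an `abstract` element in the front matter and top-level `sec` elements in `body`). The function name `pmc_sections` and the chunk dictionary shape are hypothetical, and real articles may need more handling (for example, body text outside any `sec`).

```python
import xml.etree.ElementTree as ET


def _text(element):
    """Return all text inside an element with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def pmc_sections(xml_text):
    """Split PMC article XML into one chunk per abstract or top-level section."""
    root = ET.fromstring(xml_text)
    chunks = []
    for article in root.iter("article"):
        abstract = article.find(".//abstract")
        if abstract is not None:
            chunks.append({"section": "abstract", "text": _text(abstract)})

        body = article.find("body")
        if body is None:
            continue  # some records carry no full-text body
        for sec in body.findall("sec"):
            title_el = sec.find("title")
            title = _text(title_el) if title_el is not None else "untitled"
            # iter("p") also picks up paragraphs in nested subsections.
            paragraphs = [_text(p) for p in sec.iter("p")]
            chunks.append({"section": title, "text": "\n".join(paragraphs)})
    return chunks
```

Keeping the section title alongside the text is one way to preserve context for the downstream extraction step.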

Advantages over PubMed

  • Full text — not just abstracts (methods, results, discussion)
  • More relationships — full papers mention more entity interactions
  • Tables and figures — section context preserved

Limitations

  • Slower (articles are fetched one at a time, with a 0.35 s delay between requests)
  • Only open-access articles available
  • Longer text means more chunks, more LLM calls, and higher cost
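
To give a feel for the pacing cost, the sketch below estimates the minimum time spent on rate-limit delays alone. It assumes the 0.35 s delay is applied once per article. The function name `min_delay_seconds` is hypothetical, and network and LLM time are not included.

```python
PMC_DELAY_SECONDS = 0.35  # per-article delay stated in the limitations above


def min_delay_seconds(n_articles, delay=PMC_DELAY_SECONDS):
    """Lower bound on time spent waiting between requests (assumes one delay per article)."""
    return n_articles * delay


# For the example command above (--max-results 20) this is about 7 seconds.
print(f"{min_delay_seconds(20):.1f} s")
```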