PubMed & PMC Ingestion
PubMed Abstracts
Fetches abstracts from NCBI PubMed using the Entrez API.
How it works
- Entrez.esearch: search PubMed for matching PMIDs
- Entrez.efetch: batch-fetch article XML (100 per batch)
- Parse title, abstract, publication year
- Each abstract becomes one chunk → LLM extraction
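The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the function names (`batched`, `parse_articles`, `fetch_abstracts`) and the `max_results` parameter are hypothetical, and the network part assumes Biopython's `Bio.Entrez` module is installed.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from typing import Iterator

BATCH_SIZE = 100  # batch size stated in the steps above


def batched(pmids: list[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` PMIDs."""
    for start in range(0, len(pmids), size):
        yield pmids[start:start + size]


def parse_articles(xml_data: str | bytes) -> list[dict]:
    """Extract PMID, title, abstract and year from PubMed efetch XML."""
    root = ET.fromstring(xml_data)
    records = []
    for article in root.iter("PubmedArticle"):
        # An abstract may be split into several labelled AbstractText parts.
        parts = ["".join(node.itertext()).strip()
                 for node in article.iter("AbstractText")]
        title = article.find(".//ArticleTitle")
        year = article.find(".//PubDate/Year")  # absent for some records
        records.append({
            "pmid": article.findtext(".//PMID"),
            "title": "".join(title.itertext()).strip() if title is not None else "",
            "abstract": " ".join(p for p in parts if p),
            "year": int(year.text) if year is not None and year.text else None,
        })
    return records


def fetch_abstracts(query: str, email: str, max_results: int = 1000) -> list[dict]:
    """Search PubMed, then fetch and parse the hits in batches (needs network)."""
    # Imported lazily so the pure helpers above work without Biopython.
    from Bio import Entrez

    Entrez.email = email
    handle = Entrez.esearch(db="pubmed", term=query, retmax=max_results)
    pmids = list(Entrez.read(handle)["IdList"])
    handle.close()

    records = []
    for batch in batched(pmids):
        handle = Entrez.efetch(db="pubmed", id=",".join(batch), retmode="xml")
        records.extend(parse_articles(handle.read()))
        handle.close()
    return records
```

Each dictionary returned by `parse_articles` corresponds to one abstract, so it maps directly onto the one-chunk-per-abstract rule.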
Configuration
ENTREZ_EMAIL=your@email.com # Required by NCBI
ENTREZ_API_KEY=your-key # Optional, raises the rate limit from 3 to 10 req/sec
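One way these variables might be read at startup is sketched below. The function name `load_entrez_settings` and the shape of the returned dictionary are assumptions made for illustration; only the two variable names and the 3 and 10 req/sec limits come from this page.

```python
import os
from typing import Mapping, Optional


def load_entrez_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read Entrez settings from the environment (or a mapping, for tests)."""
    env = os.environ if env is None else env

    email = env.get("ENTREZ_EMAIL", "").strip()
    if not email:
        # NCBI requires a contact address, so fail early.
        raise ValueError("ENTREZ_EMAIL must be set")

    # The key is optional; an empty value counts as "not set".
    api_key = env.get("ENTREZ_API_KEY", "").strip() or None
    return {
        "email": email,
        "api_key": api_key,
        "requests_per_second": 10 if api_key else 3,
    }
```

With Biopython, the resulting values would typically be assigned to `Entrez.email` and `Entrez.api_key` before the first request.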
Rate Limits
- Without API key: 3 requests/second
- With API key: 10 requests/second
- Automatic retry (3 attempts) on transient failures
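A minimal sketch of how the limits and the three-attempt retry could fit together is shown below. The helper names, the exponential backoff, and the choice of `OSError` as the "transient" exception type are assumptions; this page only states the request rates and the attempt count.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def min_interval(has_api_key: bool) -> float:
    """Smallest pause between requests that respects the stated limits."""
    return 1.0 / (10 if has_api_key else 3)


def call_with_retry(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
) -> T:
    """Call `func`, retrying on the given exceptions up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Assumed policy: wait base_delay, then twice that, and so on.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays; `urllib.error.URLError` is a subclass of `OSError`, so typical network failures fall under the default `retry_on`.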
PMC Full-Text
Fetches full-text open-access articles from PubMed Central.
How it works
- Entrez.esearch on the PMC database with the open access[filter] term
- Entrez.efetch: fetch full XML for each article
- Parse sections (abstract, body paragraphs)
- Each section becomes a separate chunk → LLM extraction
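The section-splitting step might look like the sketch below. It assumes the fetched XML follows the JATS layout PMC uses (`abstract`, `body`, `sec`, `title`, `p`) and that each response holds a single article; the function names and the chunk dictionary shape are hypothetical.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET


def pmc_search_term(query: str) -> str:
    """Restrict a PMC query to the open-access subset."""
    return f"({query}) AND open access[filter]"


def _text(node: ET.Element) -> str:
    """Flatten an element's text, collapsing runs of whitespace."""
    return " ".join("".join(node.itertext()).split())


def section_chunks(xml_data: str | bytes) -> list[dict]:
    """Split one PMC article into an abstract chunk plus one chunk per section."""
    root = ET.fromstring(xml_data)
    chunks = []

    abstract = root.find(".//abstract")
    if abstract is not None:
        chunks.append({"section": "Abstract", "text": _text(abstract)})

    body = root.find(".//body")
    if body is not None:
        # Only top-level sections become chunks; paragraphs from nested
        # subsections are folded into their parent section.
        for sec in body.findall("sec"):
            title = sec.find("title")
            heading = _text(title) if title is not None else "Untitled"
            paragraphs = [_text(p) for p in sec.iter("p")]
            text = " ".join(p for p in paragraphs if p)
            if text:
                chunks.append({"section": heading, "text": text})
    return chunks
```

Keeping the section heading alongside the text lets the extraction step know whether a statement came from, say, Methods or Results.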
Advantages over PubMed
- Full text — not just abstracts (methods, results, discussion)
- More relationships — full papers mention more entity interactions
- Tables and figures — section context preserved
Limitations
- Slower: articles are fetched one at a time, with a 0.35 s pause between requests
- Only open-access articles available
- Larger text = more LLM calls = higher cost
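To get a feel for these trade-offs, a back-of-the-envelope estimate can help. The helper below is purely illustrative: `avg_sections_per_article` is a number you would have to measure on your own corpus, and the result ignores download time and retries.

```python
def estimate_pmc_run(
    n_articles: int,
    avg_sections_per_article: float,
    delay_s: float = 0.35,  # per-article pause listed above
) -> dict:
    """Rough lower bound on fetch time and a count of LLM extraction calls."""
    return {
        # One request per article, each followed by the fixed pause.
        "min_fetch_seconds": n_articles * delay_s,
        # One LLM call per section chunk.
        "llm_calls": round(n_articles * avg_sections_per_article),
    }
```

For comparison, the same number of PubMed abstracts needs one LLM call per article and only one fetch request per 100 articles.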