Literature Sources
These sources feed the knowledge graph (KG) pipeline via bioingest ingest. They provide text for LLM-powered entity and relationship extraction.
PubMed
NCBI's index of biomedical literature abstracts (~36M records). It is the primary source for KG construction: abstracts are fetched by query, entities (Protein, Disease, Drug, Pathway) and relationships are extracted via LLM, and the results are written to Neo4j/Neptune.
API: NCBI Entrez E-utilities (esearch + efetch)
Rate limit: 3 req/s (10 req/s with ENTREZ_API_KEY)
What it provides: Title, abstract, authors, MeSH terms, PMID, DOI, publication date
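The esearch + efetch flow and the rate limit above can be sketched as follows. This is a minimal illustration, not bioingest's internal code: the helper names are hypothetical, and only the public E-utilities endpoints and their standard `db`, `term`, `retmax`, `retmode`, `id`, and `api_key` parameters are assumed.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def min_request_interval(api_key=None):
    """Seconds to wait between requests: 3 req/s without a key, 10 req/s with one."""
    return 1 / 10 if api_key else 1 / 3


def esearch_url(query, max_results=100, api_key=None):
    """Build an esearch URL that returns matching PMIDs as JSON."""
    params = {"db": "pubmed", "term": query, "retmax": max_results, "retmode": "json"}
    if api_key:
        params["api_key"] = api_key
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(pmids, api_key=None):
    """Build an efetch URL that returns the full records (abstracts included) as XML."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    if api_key:
        params["api_key"] = api_key
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"
```

A caller would sleep for `min_request_interval(...)` seconds between requests, first resolving a query to PMIDs with esearch and then passing those PMIDs to efetch in batches.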
# Ingest up to 100 abstracts matching the query
uv run bioingest ingest pubmed -q "EGFR biomarker" --max-results 100 --service bedrock
# Dry-run extraction
uv run bioingest ingest pubmed -q "IL-6 inflammation" --dry-run > entities.jsonl
Graph enrichment: Each document creates a Document node linked to the extracted Protein, Disease, and Drug entities, with provenance (PMID, chunk_id) recorded on the link.
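One way to picture the write step is a parameterized Cypher statement per extracted entity. This is a sketch under assumptions: the `MENTIONS` relationship type and the `pmid`, `name`, and `chunk_id` property names are hypothetical placeholders, since the actual schema is not given here.

```python
# Labels taken from the entity types listed for PubMed extraction.
ALLOWED_LABELS = {"Protein", "Disease", "Drug", "Pathway"}


def link_statement(label):
    """Return a Cypher statement linking a Document to one extracted entity.

    Labels cannot be passed as query parameters, so the label is checked
    against an allow-list before being interpolated into the statement.
    """
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unsupported entity label: {label}")
    return (
        "MERGE (d:Document {pmid: $pmid}) "
        f"MERGE (e:{label} {{name: $name}}) "
        "MERGE (d)-[r:MENTIONS]->(e) "  # hypothetical relationship type
        "SET r.chunk_id = $chunk_id"
    )


def link_params(pmid, chunk_id, name):
    """Parameters carrying the provenance (PMID, chunk_id) for one link."""
    return {"pmid": pmid, "chunk_id": chunk_id, "name": name}
```

Using `MERGE` rather than `CREATE` keeps repeated ingests of the same PMID from duplicating nodes; the statement and parameters would be passed together to whichever Neo4j or Neptune client the pipeline uses.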
bioRxiv
Cold Spring Harbor Laboratory's preprint server for biology. It provides early access to research before peer review. The API returns full metadata; PDFs are downloaded for text extraction.
API: bioRxiv API (https://api.biorxiv.org/details/)
What it provides: Title, abstract, DOI, authors, posted date, full PDF
Extraction: Abstract text + PDF (multi-strategy: pymupdf → pdftotext → vision LLM)
Graph enrichment: Preprint content often covers novel findings not yet in PubMed. Creates the same entity/relationship structure as PubMed, with the bioRxiv DOI as provenance.
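The multi-strategy extraction (pymupdf → pdftotext → vision LLM) amounts to a fallback chain: try the cheapest extractor first and move on when it fails or returns too little text. The sketch below shows only that control flow. The extractors are passed in as plain callables, and the `min_chars` threshold is a hypothetical cutoff, not a documented bioingest setting.

```python
from typing import Callable, Iterable, Optional, Tuple

# A strategy is a (name, extractor) pair; the extractor maps PDF bytes to text.
Strategy = Tuple[str, Callable[[bytes], str]]


def extract_with_fallback(
    pdf_bytes: bytes,
    strategies: Iterable[Strategy],
    min_chars: int = 200,  # hypothetical threshold for "usable" text
) -> Tuple[Optional[str], str]:
    """Return (strategy_name, text) from the first strategy that yields enough text."""
    for name, extract in strategies:
        try:
            text = extract(pdf_bytes)
        except Exception:
            # A failing extractor should not abort the chain; try the next one.
            continue
        if len(text.strip()) >= min_chars:
            return name, text
    # No strategy produced usable text.
    return None, ""
```

Returning the strategy name alongside the text makes it possible to record which extractor produced each document, which is useful when auditing extraction quality.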
PMC (PubMed Central)
Full-text open-access archive of biomedical articles. Provides complete article XML/text (not just abstracts), enabling extraction from methods, results, and discussion sections.
API: NCBI E-utilities + PMC OA bulk (FTP)
What it provides: Full article text, figures, supplementary data, PMC ID
Extraction: Full-text yields 5–10× more entities than abstract-only PubMed
Graph enrichment: Full-text extraction captures relationships from results tables and supplementary data that abstracts miss.
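To illustrate how full-text XML exposes methods, results, and discussion separately, here is a minimal sketch that splits a JATS-style article body into its top-level sections. It assumes un-namespaced `body`, `sec`, `title`, and `p` elements, as in typical PMC open-access XML; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def sections_from_jats(xml_text):
    """Map lower-cased top-level section titles to their paragraph text."""
    root = ET.fromstring(xml_text)
    sections = {}
    # ".//body/sec" matches top-level sections whether <article> is the root
    # or is wrapped in an outer element.
    for sec in root.findall(".//body/sec"):
        title = (sec.findtext("title") or "untitled").strip().lower()
        # itertext() flattens inline markup such as <italic> into plain text;
        # iter("p") also picks up paragraphs in nested subsections.
        paragraphs = ["".join(p.itertext()).strip() for p in sec.iter("p")]
        sections[title] = " ".join(t for t in paragraphs if t)
    return sections
```

Keeping the section title with each chunk lets later stages record whether a relationship came from, say, the results or the discussion.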
OpenAlex
Open catalog of scholarly works, authors, institutions, and concepts. Provides citation counts, h-index, and topic classification. Used to prioritize high-impact papers and enrich document nodes with bibliometric metadata.
API: OpenAlex REST API (free, no key required)
What it provides: works_count, cited_by_count, concepts, institutions, open_access status
# Enrich existing KG document nodes with citation data
uv run bioingest enrich openalex --database olink3
Graph enrichment: Adds cited_by_count and impact_factor properties to Document nodes, which enables filtering the KG by evidence quality.
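As a rough illustration of the enrichment step, the sketch below looks up a work by DOI and reduces the response to a few Document-node properties. The helper names and the `openalex_id` and `is_open_access` property names are hypothetical. How `impact_factor` is derived is not described here, so it is left out.

```python
from urllib.parse import quote


def work_url_for_doi(doi):
    """OpenAlex accepts a DOI URL in place of its own work ID."""
    return f"https://api.openalex.org/works/https://doi.org/{quote(doi, safe='/')}"


def document_properties(work):
    """Pick bibliometric fields from an OpenAlex work record (a parsed JSON dict)."""
    open_access = work.get("open_access") or {}
    return {
        "openalex_id": work.get("id"),
        # Missing or null counts are treated as zero citations.
        "cited_by_count": int(work.get("cited_by_count") or 0),
        "is_open_access": bool(open_access.get("is_oa", False)),
    }
```

With `cited_by_count` stored on each Document node, a query can restrict traversal to documents above a chosen citation threshold.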
QuickGO
EBI's browser for Gene Ontology annotations with evidence codes. Provides experimentally validated GO annotations (IDA, IPI, IMP) for proteins, complementing the computationally predicted InterPro2GO mappings.
API: QuickGO REST API (https://www.ebi.ac.uk/QuickGO/services/annotation)
What it provides: UniProt ID → GO term annotations with evidence code, qualifier, and source
Graph enrichment: Adds ANNOTATED_WITH relationships from Protein to GOTerm nodes with evidence code properties. Experimental evidence (e.g. IDA, ECO:0000314) is weighted higher than electronic annotation (IEA).
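The evidence weighting can be sketched as a small mapping from GO evidence code to edge weight. The numeric weights below are hypothetical, chosen only to show experimental codes ranking above IEA. The `geneProductId`, `goId`, `goEvidence`, and `evidenceCode` field names are assumed from the QuickGO annotation search response and should be checked against the live API.

```python
from urllib.parse import urlencode

QUICKGO_SEARCH = "https://www.ebi.ac.uk/QuickGO/services/annotation/search"

# Experimental evidence codes named in this section.
EXPERIMENTAL = {"IDA", "IPI", "IMP"}


def annotation_search_url(uniprot_id, limit=100):
    """Build an annotation search URL for one UniProt accession."""
    return f"{QUICKGO_SEARCH}?{urlencode({'geneProductId': uniprot_id, 'limit': limit})}"


def evidence_weight(go_evidence):
    """Hypothetical weights: experimental > other curated > electronic (IEA)."""
    if go_evidence in EXPERIMENTAL:
        return 1.0
    if go_evidence == "IEA":
        return 0.3
    return 0.6


def to_edges(results):
    """Convert annotation records into ANNOTATED_WITH edge dicts, strongest first."""
    edges = [
        {
            # "UniProtKB:P00533" -> "P00533"
            "protein": r["geneProductId"].split(":")[-1],
            "go_id": r["goId"],
            "evidence": r["goEvidence"],
            "eco": r.get("evidenceCode"),
            "weight": evidence_weight(r["goEvidence"]),
        }
        for r in results
    ]
    return sorted(edges, key=lambda e: e["weight"], reverse=True)
```

Storing the weight as an edge property, next to the raw evidence code, keeps the ranking adjustable without re-fetching annotations.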