
Literature Sources

These sources feed the knowledge graph (KG) pipeline via bioingest ingest. They provide text for LLM-powered entity and relationship extraction.

PubMed

NCBI's index of biomedical literature citations and abstracts (~36M records). This is the primary source for KG construction: bioingest fetches abstracts by query, extracts entities (Protein, Disease, Drug, Pathway) and relationships via LLM, then writes them to Neo4j/Neptune.

API: NCBI Entrez E-utilities (esearch + efetch)
Rate limit: 3 req/s (10 req/s with ENTREZ_API_KEY)
What it provides: Title, abstract, authors, MeSH terms, PMID, DOI, publication date
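The two-step esearch + efetch pattern and the key-dependent rate limit can be sketched roughly as follows. This is an illustration only: the helper names (min_interval, esearch_url, efetch_url) are hypothetical and not part of bioingest. Only the E-utilities endpoints and query parameters are NCBI's.

```python
import os
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def min_interval(api_key=None):
    """Seconds to wait between requests: 3 req/s without a key, 10 req/s with one."""
    return 1 / 10 if api_key else 1 / 3


def esearch_url(query, max_results=100, api_key=None):
    """Step 1: search PubMed and get back matching PMIDs as JSON."""
    params = {"db": "pubmed", "term": query, "retmax": max_results, "retmode": "json"}
    if api_key:
        params["api_key"] = api_key
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_url(pmids, api_key=None):
    """Step 2: fetch the full records (title, abstract, MeSH terms) as XML."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    if api_key:
        params["api_key"] = api_key
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


# Usage: only builds URLs, no network access.
key = os.environ.get("ENTREZ_API_KEY")  # variable name taken from the text above
print(esearch_url("EGFR biomarker", 100, key))
print(f"wait at least {min_interval(key):.2f}s between requests")
```

A real client would sleep for min_interval(...) seconds between calls and pass the PMIDs returned by esearch to efetch in batches.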

uv run bioingest ingest pubmed -q "EGFR biomarker" --max-results 100 --service bedrock
# Dry-run extraction
uv run bioingest ingest pubmed -q "IL-6 inflammation" --dry-run > entities.jsonl

Graph enrichment: Each document creates a Document node linked to the extracted Protein, Disease, and Drug entities, with provenance (PMID, chunk_id) recorded on the link.
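One way to picture the provenance link is as a parameterized Cypher MERGE. This is a sketch under stated assumptions, not the schema bioingest actually writes: the MENTIONS relationship type and the name property are invented for illustration, and only the node labels and the pmid / chunk_id provenance fields come from the description above.

```python
ALLOWED_LABELS = {"Protein", "Disease", "Drug"}


def mention_query(label):
    """Return a Cypher statement linking a Document to one extracted entity.

    Node labels cannot be passed as Cypher parameters, so the label is
    checked against a whitelist before being interpolated into the query.
    """
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected entity label: {label}")
    return (
        "MERGE (d:Document {pmid: $pmid}) "
        f"MERGE (e:{label} {{name: $name}}) "
        # MENTIONS is a hypothetical relationship type
        "MERGE (d)-[r:MENTIONS {chunk_id: $chunk_id}]->(e)"
    )


def mention_params(pmid, chunk_id, name):
    """Parameters to send alongside the statement from mention_query()."""
    return {"pmid": pmid, "chunk_id": chunk_id, "name": name}


# Usage: the PMID and chunk id below are placeholders, not real records.
print(mention_query("Protein"))
print(mention_params("00000000", "00000000-0", "EGFR"))
```

Using MERGE rather than CREATE keeps repeated ingests of the same document from duplicating nodes or links.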


bioRxiv

Cold Spring Harbor preprint server for biology. Provides early-access research before peer review. The API returns full metadata, and PDFs are downloaded for text extraction.

API: bioRxiv API (https://api.biorxiv.org/details/)
What it provides: Title, abstract, DOI, authors, posted date, full PDF
Extraction: Abstract text + PDF (multi-strategy: pymupdf → pdftotext → vision LLM)
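The multi-strategy PDF extraction can be thought of as an ordered fallback chain. The sketch below shows only the control flow with stand-in extractors. The function name extract_text, the min_chars threshold, and the stub strategies are assumptions for illustration, and a real pipeline would plug in pymupdf, pdftotext, and a vision LLM call in that order.

```python
def extract_text(pdf_bytes, strategies, min_chars=200):
    """Try each (name, extractor) pair in order and return the first usable result.

    A strategy is skipped if it raises or returns fewer than `min_chars`
    characters after trimming (e.g. a scanned PDF with no text layer).
    Returns (strategy_name, text), or None if every strategy fails.
    """
    for name, extractor in strategies:
        try:
            text = extractor(pdf_bytes)
        except Exception:
            continue  # fall through to the next strategy
        if text and len(text.strip()) >= min_chars:
            return name, text
    return None


# Stand-in extractors that simulate the three outcomes.
def broken_parser(_pdf):
    raise RuntimeError("cannot open document")


def empty_text_layer(_pdf):
    return "   "


def vision_fallback(_pdf):
    return "recovered text " * 50


chain = [
    ("pymupdf", broken_parser),
    ("pdftotext", empty_text_layer),
    ("vision_llm", vision_fallback),
]
result = extract_text(b"%PDF-stub", chain)
print(result[0] if result else "no strategy succeeded")
```

Ordering the strategies from cheapest to most expensive means the vision LLM is only invoked for PDFs that the text-layer parsers cannot handle.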

uv run bioingest ingest biorxiv -q "proximity extension assay" --max-results 50 --service bedrock

Graph enrichment: Preprint content often covers novel findings not yet in PubMed. Creates the same entity/relationship structure as PubMed, with the bioRxiv DOI as provenance.


PMC (PubMed Central)

Full-text open-access archive of biomedical articles. Provides complete article XML/text (not just abstracts), enabling extraction from methods, results, and discussion sections.

API: NCBI E-utilities + PMC OA bulk (FTP)
What it provides: Full article text, figures, supplementary data, PMC ID
Extraction: Full text yields 5–10× more entities than abstract-only PubMed
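To illustrate what section-level access to full text looks like, here is a minimal sketch that splits a JATS-style article body into named sections using only the standard library. The sample XML is a made-up stub, the helper name sections is hypothetical, and real PMC records are larger and may need more careful handling (nested sections, tables, missing titles).

```python
import xml.etree.ElementTree as ET

# Made-up stub in the JATS layout used for PMC full text.
SAMPLE = """<article><body>
<sec><title>Methods</title><p>First methods paragraph.</p></sec>
<sec><title>Results</title><p>First results paragraph.</p><p>Second results <italic>paragraph</italic>.</p></sec>
</body></article>"""


def sections(xml_text):
    """Map each top-level section title to its paragraph text."""
    root = ET.fromstring(xml_text)
    out = {}
    for sec in root.findall(".//body/sec"):
        title = (sec.findtext("title") or "untitled").strip()
        # iter("p") also picks up paragraphs inside nested subsections;
        # itertext() flattens inline markup such as <italic>.
        paras = ["".join(p.itertext()).strip() for p in sec.iter("p")]
        out[title] = " ".join(paras)
    return out


for title, text in sections(SAMPLE).items():
    print(f"{title}: {text}")
```

Keeping the section title alongside the text lets downstream extraction record whether an entity came from methods, results, or discussion.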

uv run bioingest ingest pmc -q "Olink cardiovascular proteomics" --max-results 50 --service bedrock

Graph enrichment: Full-text extraction captures relationships from results tables and supplementary data that abstracts miss.


OpenAlex

Open catalog of scholarly works, authors, institutions, and concepts. Provides citation counts, h-index, and topic classification. Used to prioritize high-impact papers and enrich document nodes with bibliometric metadata.

API: OpenAlex REST API (free, no key required)
What it provides: works_count, cited_by_count, concepts, institutions, open_access status
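As a rough sketch of how bibliometric metadata could drive prioritization, the snippet below looks up a work by DOI and ranks work records by citation count. The helper names and the min_citations cutoff are assumptions for illustration, and the sample records use placeholder identifiers rather than real papers. Only the field names (id, doi, cited_by_count, open_access.is_oa) follow the OpenAlex work object.

```python
def work_url(doi):
    """OpenAlex lookup URL for a single work identified by its DOI."""
    return f"https://api.openalex.org/works/doi:{doi}"


def summarize(work):
    """Pick out the fields of an OpenAlex work record that matter for ranking."""
    oa = work.get("open_access") or {}
    return {
        "openalex_id": work.get("id"),
        "doi": work.get("doi"),
        "cited_by_count": work.get("cited_by_count", 0),
        "is_oa": bool(oa.get("is_oa", False)),
    }


def prioritize(works, min_citations=0):
    """Drop works below the citation cutoff and sort the rest, most cited first."""
    rows = [summarize(w) for w in works]
    rows = [r for r in rows if r["cited_by_count"] >= min_citations]
    return sorted(rows, key=lambda r: r["cited_by_count"], reverse=True)


# Placeholder records shaped like OpenAlex works (not real papers).
works = [
    {"id": "W-placeholder-1", "doi": None, "cited_by_count": 4, "open_access": {"is_oa": True}},
    {"id": "W-placeholder-2", "doi": None, "cited_by_count": 120, "open_access": {"is_oa": False}},
    {"id": "W-placeholder-3", "doi": None, "cited_by_count": 37},
]
for row in prioritize(works, min_citations=10):
    print(row["openalex_id"], row["cited_by_count"])
```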

# Enrich existing KG document nodes with citation data
uv run bioingest enrich openalex --database olink3

Graph enrichment: Adds cited_by_count and impact_factor properties to Document nodes, which enables filtering the KG by evidence quality.


QuickGO

EBI's browser for Gene Ontology annotations with evidence codes. Provides experimentally validated GO annotations (IDA, IPI, IMP) for proteins, complementing the computationally predicted InterPro2GO mappings.

API: QuickGO REST API (https://www.ebi.ac.uk/QuickGO/services/annotation)
What it provides: UniProt ID → GO term annotations with evidence code, qualifier, and source

# Pull GO annotations for specific proteins
uv run bioingest pull quickgo --ids P00533,P04637

Graph enrichment: Adds ANNOTATED_WITH relationships from Protein to GOTerm nodes with evidence code properties. Experimental evidence (e.g. IDA, ECO:0000314) is weighted higher than electronic annotation (IEA, ECO:0000501).
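The evidence weighting can be sketched as a small mapping from GO evidence code to an edge weight. Everything numeric here is an assumption: the weights 1.0, 0.6, and 0.3 are invented for illustration, as are the helper names and the edge dictionary layout. The annotation record is a placeholder that mirrors the field names of a QuickGO annotation result (geneProductId, goId, goEvidence, evidenceCode).

```python
EXPERIMENTAL = {"IDA", "IPI", "IMP"}  # experimental codes named in the text above


def evidence_weight(go_evidence):
    """Hypothetical weights: experimental > other curated > electronic."""
    if go_evidence in EXPERIMENTAL:
        return 1.0
    if go_evidence == "IEA":
        return 0.3
    return 0.6


def to_edge(annotation):
    """Turn one annotation record into an ANNOTATED_WITH edge description."""
    # geneProductId looks like "UniProtKB:P00533"; keep only the accession.
    accession = annotation["geneProductId"].split(":", 1)[-1]
    return {
        "protein": accession,
        "go_term": annotation["goId"],
        "type": "ANNOTATED_WITH",
        "evidence_code": annotation["evidenceCode"],
        "weight": evidence_weight(annotation["goEvidence"]),
    }


# Placeholder record: the GO id is a dummy value, not a real annotation.
record = {
    "geneProductId": "UniProtKB:P00533",
    "goId": "GO:0000000",
    "goEvidence": "IDA",
    "evidenceCode": "ECO:0000314",
}
print(to_edge(record))
```

Storing the weight on the relationship lets graph queries filter or rank annotations without re-deriving it from the evidence code.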