Pipeline Process Diagram
How a single PDF flows through the bioingest pipeline: each function call, its inputs and outputs, and the resources it uses.
Full Pipeline Flow
flowchart TD
subgraph INPUT
PDF[PDF File]
end
subgraph "Stage 1: EXTRACT"
EX[extract_text]
EX1[pymupdf: page.get_text]
EX2[pdftotext CLI]
EX3[Vision LLM fallback]
SC[_score_extraction]
PDF --> EX
EX --> EX1
EX1 -->|score < 0.6| EX2
EX2 -->|score < 0.6| EX3
EX1 & EX2 & EX3 --> SC
end
subgraph "Stage 2: SECTION PARSE"
SP[remove_sections]
SC -->|text| SP
SP -->|"strip refs + acks"| CLEAN[cleaned text]
end
subgraph "Stage 3: CHUNK"
CH[TokenChunker.chunk_text]
CLEAN --> CH
CH -->|"512 tokens, 64 overlap"| CHUNKS["chunks[]"]
end
subgraph "Stage 4: KG EXTRACTION"
LLM[KGExtractor.extract_from_chunks]
PROMPT[KG_EXTRACTION_PROMPT]
BED[Bedrock / Ollama / SageMaker]
PARSE[parse_llm_json + validate_extraction]
CHUNKS --> LLM
LLM --> PROMPT
PROMPT --> BED
BED --> PARSE
PARSE --> NODES["nodes[]"]
PARSE --> RELS["relationships[]"]
end
subgraph "Stage 5: RESOLVE"
RES[EntityResolver.resolve]
SYN[(synonym_map)]
NODES --> RES
RELS --> RES
SYN -.-> RES
RES --> RESOLVED_N["resolved nodes[]"]
RES --> RESOLVED_R["resolved rels[]"]
end
subgraph "Stage 6: ENRICH"
ENR[EntityEnricher.enrich]
PMAP[(protein_map.parquet)]
DMAP[(disease_map.parquet)]
RESOLVED_N --> ENR
PMAP -.-> ENR
DMAP -.-> ENR
ENR --> ENRICHED["enriched nodes[]"]
end
subgraph "Stage 7: WRITE"
WR{target?}
NEO[Neo4jWriter.write_document]
NEP[NeptuneWriter.write_document]
NONE[print results]
ENRICHED --> WR
RESOLVED_R --> WR
WR -->|neo4j| NEO
WR -->|neptune| NEP
WR -->|none| NONE
end
subgraph "Stage 8: EMBED"
EMB[EmbeddingPipeline.embed_chunks]
TITAN[Bedrock Titan v2]
AUR[(Aurora pgvector)]
CHUNKS --> EMB
EMB --> TITAN
TITAN --> AUR
end
subgraph "Stage 9: VALIDATE"
VAL[validate_all]
STR[(STRING PPI)]
DIS[(DISEASES 2.0)]
RCT[(Reactome)]
RESOLVED_R --> VAL
STR -.-> VAL
DIS -.-> VAL
RCT -.-> VAL
VAL --> EVIDENCE["evidence_level: high/medium/novel"]
end
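The Stage 1 fallback chain (try pymupdf, fall through to pdftotext and then the Vision LLM whenever the score is below 0.6) can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: `extract_with_fallback` and `score_extraction` are hypothetical names, the scoring heuristic is a stand-in for `_score_extraction`, and the extractors are passed in as plain callables so the example runs without any PDF library.

```python
from typing import Callable, List, Tuple

SCORE_THRESHOLD = 0.6  # from the diagram: fall through when score < 0.6

# (name, function taking a path and returning extracted text)
Extractor = Tuple[str, Callable[[str], str]]


def score_extraction(text: str) -> float:
    """Hypothetical stand-in for _score_extraction.

    Returns the share of characters that are alphanumeric or whitespace.
    """
    if not text:
        return 0.0
    readable = sum(ch.isalnum() or ch.isspace() for ch in text)
    return readable / len(text)


def extract_with_fallback(
    path: str,
    extractors: List[Extractor],
    threshold: float = SCORE_THRESHOLD,
) -> Tuple[str, str, float]:
    """Try extractors in order and stop at the first one that scores well.

    Returns (extractor name, text, score). If no extractor reaches the
    threshold, the best-scoring attempt is returned instead.
    """
    best: Tuple[str, str, float] = ("", "", 0.0)
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception:
            # Assumption: a failing extractor is skipped, not fatal.
            continue
        score = score_extraction(text)
        if score > best[2]:
            best = (name, text, score)
        if score >= threshold:
            break  # good enough, later (more expensive) fallbacks are skipped
    return best
```

Ordering the extractors from cheapest to most expensive means the Vision LLM is only invoked when both text-based extractors produce low-quality output.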
Per-Chunk Detail
sequenceDiagram
participant C as Chunk
participant E as KGExtractor
participant L as LLM (Bedrock)
participant P as parse_llm_json
C->>E: chunk.text (512 tokens)
E->>E: build prompt (entity types + rel types + rules)
E->>L: invoke_model / converse
L-->>E: JSON response (may have markdown fences)
E->>P: strip fences, find braces, repair commas
P-->>E: {nodes: [...], relationships: [...]}
E->>E: validate_extraction (filter bad nodes)
E-->>C: chunk.nodes, chunk.relationships
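The three repair steps named in the sequence diagram (strip fences, find braces, repair commas) might look like the sketch below. It is an assumption about how `parse_llm_json` could behave, not its real implementation; the function name `parse_llm_json_sketch` and the empty-result fallback are illustrative choices.

```python
import json
import re

FENCE = "`" * 3  # markdown code fence, built this way to keep the example readable
EMPTY = {"nodes": [], "relationships": []}


def parse_llm_json_sketch(raw: str) -> dict:
    """Best-effort parse of an LLM reply into {nodes, relationships}."""
    # 1. Strip markdown fences, with or without a "json" language tag.
    text = re.sub(re.escape(FENCE) + r"(?:json)?", "", raw)

    # 2. Keep only the span from the first "{" to the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return dict(EMPTY)
    candidate = text[start : end + 1]

    # 3. Drop trailing commas before a closing brace or bracket.
    #    Limitation: this would also touch a ",}" inside a string value.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)

    try:
        data = json.loads(candidate)
    except json.JSONDecodeError:
        return dict(EMPTY)

    # Normalise missing or null keys to empty lists.
    return {
        "nodes": data.get("nodes") or [],
        "relationships": data.get("relationships") or [],
    }
```

Returning an empty result instead of raising lets a single malformed reply be skipped without aborting the other chunks; whether the real pipeline does this is not stated in the diagram.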
Resource Dependencies
graph LR
subgraph "Local Files"
PM[data/mapping/protein_map.parquet]
DM[data/mapping/disease_map.parquet]
DrM[data/mapping/drug_map.parquet]
MG[data/mapping/mapping_graph.jsonl]
BK[data/bulk/*/*.tsv]
end
subgraph "AWS Services"
BR[Bedrock Runtime<br/>us-east-1]
NP[Neptune<br/>eu-north-1]
AU[Aurora pgvector<br/>eu-north-1]
CW[CloudWatch<br/>eu-north-1]
end
subgraph "Pipeline Functions"
EXT[extract_text]
CHK[TokenChunker]
KGE[KGExtractor]
RES[EntityResolver]
ENR[EntityEnricher]
WRT[NeptuneWriter]
EMB[EmbeddingPipeline]
VAL[validate_all]
TEL[PipelineMetrics]
end
KGE --> BR
WRT --> NP
EMB --> AU
TEL --> CW
ENR --> PM
ENR --> DM
VAL --> BK
RES -->|synonym_map| PM
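As an illustration of how `EntityResolver` might use the synonym map derived from the protein map, the sketch below canonicalises entity names, deduplicates nodes, and rewrites relationship endpoints. The dict layout (`name`, `type`, `source`, `target`) and the lower-cased lookup keys are assumptions made for this example; the real data model may differ.

```python
from typing import Dict, List, Tuple

Node = Dict[str, str]
Rel = Dict[str, str]


def resolve(
    nodes: List[Node],
    rels: List[Rel],
    synonym_map: Dict[str, str],
) -> Tuple[List[Node], List[Rel]]:
    """Map synonyms to canonical names and drop duplicates."""

    def canonical(name: str) -> str:
        # Assumption: synonym_map keys are lower-cased surface forms.
        cleaned = name.strip()
        return synonym_map.get(cleaned.lower(), cleaned)

    # Deduplicate nodes on (canonical name, type); first occurrence wins.
    resolved_nodes: Dict[Tuple[str, str], Node] = {}
    for node in nodes:
        name = canonical(node["name"])
        resolved_nodes.setdefault((name, node["type"]), {**node, "name": name})

    # Rewrite endpoints, then deduplicate on (source, type, target).
    seen = set()
    resolved_rels: List[Rel] = []
    for rel in rels:
        fixed = {
            **rel,
            "source": canonical(rel["source"]),
            "target": canonical(rel["target"]),
        }
        key = (fixed["source"], fixed["type"], fixed["target"])
        if key not in seen:
            seen.add(key)
            resolved_rels.append(fixed)

    return list(resolved_nodes.values()), resolved_rels
```

With a map such as `{"p53": "TP53", "tp53": "TP53"}`, nodes named `p53` and `TP53` collapse into one, and relationships pointing at either spelling end up on the same canonical node.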
Function Call Summary
| Stage | Function | Input | Output | Resource |
|---|---|---|---|---|
| Extract | extract_text(path) | PDF path | text (str) | pymupdf / pdftotext |
| Section | remove_sections(text) | raw text | cleaned text | regex |
| Chunk | TokenChunker.chunk_text(text, id) | text | Chunk[] | tiktoken |
| KG | KGExtractor.extract_from_chunks(chunks) | Chunk[] | Chunk[] (with nodes/rels) | Bedrock LLM |
| Resolve | EntityResolver.resolve(nodes, rels) | nodes[], rels[] | deduped nodes[], rels[] | synonym_map |
| Enrich | EntityEnricher.enrich(nodes) | nodes[] | nodes[] + uniprot/mondo | protein_map, disease_map |
| Write | writer.write_document(doc) | ProcessedDocument | graph edges | Neptune / Neo4j |
| Embed | EmbeddingPipeline.embed_chunks(chunks) | Chunk[] | vectors | Bedrock Titan → Aurora |
| Validate | validate_all(nodes, rels, data_dir) | nodes[], rels[] | rels[] + evidence | STRING, DISEASES, Reactome |
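The Chunk stage's "512 tokens, 64 overlap" setting describes a sliding window that advances by 448 tokens each step. The sketch below shows that windowing on a pre-tokenised list; it is a simplified stand-in for `TokenChunker.chunk_text`, which works on tiktoken tokens. The `Chunk` fields and the `chunk_tokens` name are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    doc_id: str
    index: int
    tokens: List[str]


def chunk_tokens(
    tokens: List[str],
    doc_id: str,
    size: int = 512,
    overlap: int = 64,
) -> List[Chunk]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")

    step = size - overlap  # 448 with the defaults
    chunks: List[Chunk] = []
    for index, start in enumerate(range(0, len(tokens), step)):
        chunks.append(Chunk(doc_id, index, tokens[start : start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

For a 1,000-token document the defaults give three chunks starting at tokens 0, 448, and 896, and the last 64 tokens of each chunk reappear at the start of the next one, so an entity mention that straddles a boundary is still seen whole in at least one chunk.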