Pipeline Process Diagram

How a single PDF flows through the bioingest pipeline: each function call, its inputs and outputs, and the resources it uses.

Full Pipeline Flow

flowchart TD
    subgraph INPUT
        PDF[PDF File]
    end

    subgraph "Stage 1: EXTRACT"
        EX[extract_text]
        EX1[pymupdf: page.get_text]
        EX2[pdftotext CLI]
        EX3[Vision LLM fallback]
        SC[_score_extraction]
        PDF --> EX
        EX --> EX1
        EX1 -->|score < 0.6| EX2
        EX2 -->|score < 0.6| EX3
        EX1 & EX2 & EX3 --> SC
    end

    subgraph "Stage 2: SECTION PARSE"
        SP[remove_sections]
        SC -->|text| SP
        SP -->|"strip refs + acks"| CLEAN[cleaned text]
    end

    subgraph "Stage 3: CHUNK"
        CH[TokenChunker.chunk_text]
        CLEAN --> CH
        CH -->|"512 tokens, 64 overlap"| CHUNKS["chunks[]"]
    end

    subgraph "Stage 4: KG EXTRACTION"
        LLM[KGExtractor.extract_from_chunks]
        PROMPT[KG_EXTRACTION_PROMPT]
        BED[Bedrock / Ollama / SageMaker]
        PARSE[parse_llm_json + validate_extraction]
        CHUNKS --> LLM
        LLM --> PROMPT
        PROMPT --> BED
        BED --> PARSE
        PARSE --> NODES["nodes[]"]
        PARSE --> RELS["relationships[]"]
    end

    subgraph "Stage 5: RESOLVE"
        RES[EntityResolver.resolve]
        SYN[(synonym_map)]
        NODES --> RES
        RELS --> RES
        SYN -.-> RES
        RES --> RESOLVED_N["resolved nodes[]"]
        RES --> RESOLVED_R["resolved rels[]"]
    end

    subgraph "Stage 6: ENRICH"
        ENR[EntityEnricher.enrich]
        PMAP[(protein_map.parquet)]
        DMAP[(disease_map.parquet)]
        RESOLVED_N --> ENR
        PMAP -.-> ENR
        DMAP -.-> ENR
        ENR --> ENRICHED["enriched nodes[]"]
    end

    subgraph "Stage 7: WRITE"
        WR{target?}
        NEO[Neo4jWriter.write_document]
        NEP[NeptuneWriter.write_document]
        NONE[print results]
        ENRICHED --> WR
        RESOLVED_R --> WR
        WR -->|neo4j| NEO
        WR -->|neptune| NEP
        WR -->|none| NONE
    end

    subgraph "Stage 8: EMBED"
        EMB[EmbeddingPipeline.embed_chunks]
        TITAN[Bedrock Titan v2]
        AUR[(Aurora pgvector)]
        CHUNKS --> EMB
        EMB --> TITAN
        TITAN --> AUR
    end

    subgraph "Stage 9: VALIDATE"
        VAL[validate_all]
        STR[(STRING PPI)]
        DIS[(DISEASES 2.0)]
        RCT[(Reactome)]
        RESOLVED_R --> VAL
        STR -.-> VAL
        DIS -.-> VAL
        RCT -.-> VAL
        VAL --> EVIDENCE["evidence_level: high/medium/novel"]
    end
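
The Stage 1 cascade in the diagram (try an extractor, score the result, fall through when the score is below 0.6) can be sketched roughly as follows. This is an illustration only: `score_extraction` is a toy stand-in for `_score_extraction`, whose real heuristic is not shown on this page, and `extract_with_fallback`, `MIN_SCORE`, and the "return the best attempt if none passes" behaviour are assumptions.

```python
from typing import Callable, List, Tuple

# Threshold taken from the "score < 0.6" edges in the diagram; the name is hypothetical.
MIN_SCORE = 0.6

Extractor = Callable[[str], str]


def score_extraction(text: str) -> float:
    """Toy quality score: share of characters that are alphanumeric or whitespace.

    Stand-in for _score_extraction; the real scoring rule is not documented here.
    """
    if not text:
        return 0.0
    good = sum(ch.isalnum() or ch.isspace() for ch in text)
    return good / len(text)


def extract_with_fallback(
    path: str, extractors: List[Tuple[str, Extractor]]
) -> Tuple[str, str, float]:
    """Try extractors in order and stop at the first one scoring >= MIN_SCORE.

    Returns (extractor_name, text, score). If no extractor reaches the
    threshold, the best-scoring attempt is returned (an assumption).
    """
    best: Tuple[str, str, float] = ("", "", -1.0)
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception:
            continue  # a failing backend falls through to the next one
        score = score_extraction(text)
        if score > best[2]:
            best = (name, text, score)
        if score >= MIN_SCORE:
            break  # good enough, skip the more expensive fallbacks
    return best
```

In the real pipeline the list would presumably hold the pymupdf, pdftotext, and vision-LLM backends in that order, so the costly vision call only runs when both cheaper extractors score poorly.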

Per-Chunk Detail

sequenceDiagram
    participant C as Chunk
    participant E as KGExtractor
    participant L as LLM (Bedrock)
    participant P as parse_llm_json

    C->>E: chunk.text (512 tokens)
    E->>E: build prompt (entity types + rel types + rules)
    E->>L: invoke_model / converse
    L-->>E: JSON response (may have markdown fences)
    E->>P: strip fences, find braces, repair commas
    P-->>E: {nodes: [...], relationships: [...]}
    E->>E: validate_extraction (filter bad nodes)
    E-->>C: chunk.nodes, chunk.relationships
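
The three repair steps named in the sequence (strip fences, find braces, repair commas) might look roughly like the sketch below. It is not the actual `parse_llm_json`; the function name `parse_llm_json_sketch` and the choice to return empty lists on failure are assumptions made for illustration.

```python
import json
import re
from typing import Any, Dict

# Markdown fence marker, built at runtime so this example can sit inside a fenced block.
FENCE = "`" * 3


def parse_llm_json_sketch(raw: str) -> Dict[str, Any]:
    """Best-effort parse of an LLM reply expected to contain one JSON object."""
    empty: Dict[str, Any] = {"nodes": [], "relationships": []}

    # 1. Strip markdown fences, with or without a "json" language tag.
    text = re.sub(re.escape(FENCE) + r"(?:json)?", "", raw)

    # 2. Keep only the span from the first "{" to the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return empty
    text = text[start:end + 1]

    # 3. Repair trailing commas that precede a closing bracket or brace.
    #    (Naive: it would also touch such a pattern inside a string value.)
    text = re.sub(r",\s*([}\]])", r"\1", text)

    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return empty
    if not isinstance(data, dict):
        return empty
    return {
        "nodes": data.get("nodes") or [],
        "relationships": data.get("relationships") or [],
    }
```

Returning an empty result rather than raising keeps one malformed reply from aborting a whole document; whether the real parser does the same is not stated here.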

Resource Dependencies

graph LR
    subgraph "Local Files"
        PM[data/mapping/protein_map.parquet]
        DM[data/mapping/disease_map.parquet]
        DrM[data/mapping/drug_map.parquet]
        MG[data/mapping/mapping_graph.jsonl]
        BK[data/bulk/*/*.tsv]
    end

    subgraph "AWS Services"
        BR[Bedrock Runtime<br/>us-east-1]
        NP[Neptune<br/>eu-north-1]
        AU[Aurora pgvector<br/>eu-north-1]
        CW[CloudWatch<br/>eu-north-1]
    end

    subgraph "Pipeline Functions"
        EXT[extract_text]
        CHK[TokenChunker]
        KGE[KGExtractor]
        RES[EntityResolver]
        ENR[EntityEnricher]
        WRT[NeptuneWriter]
        EMB[EmbeddingPipeline]
        VAL[validate_all]
        TEL[PipelineMetrics]
    end

    KGE --> BR
    WRT --> NP
    EMB --> AU
    TEL --> CW
    ENR --> PM
    ENR --> DM
    VAL --> BK
    RES -->|synonym_map| PM
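
Because the pipeline depends on both local mapping files and AWS services in two regions, a small pre-flight check can catch missing inputs before any LLM call is made. The sketch below copies the paths and regions from the diagram; the dictionary layout and the helper names `missing_local_files` and `regions_in_use` are hypothetical and not part of bioingest.

```python
from pathlib import Path
from typing import Dict, List

# Paths copied from the "Local Files" subgraph; the dict itself is illustrative.
LOCAL_FILES: Dict[str, str] = {
    "protein_map": "data/mapping/protein_map.parquet",
    "disease_map": "data/mapping/disease_map.parquet",
    "drug_map": "data/mapping/drug_map.parquet",
    "mapping_graph": "data/mapping/mapping_graph.jsonl",
}

# Regions copied from the "AWS Services" subgraph; the keys are hypothetical labels.
AWS_REGIONS: Dict[str, str] = {
    "bedrock-runtime": "us-east-1",
    "neptune": "eu-north-1",
    "aurora": "eu-north-1",
    "cloudwatch": "eu-north-1",
}


def missing_local_files(root: Path) -> List[str]:
    """Return the names of mapping files that do not exist under root."""
    return [name for name, rel in LOCAL_FILES.items() if not (root / rel).is_file()]


def regions_in_use() -> List[str]:
    """Return the distinct AWS regions the pipeline talks to, sorted."""
    return sorted(set(AWS_REGIONS.values()))
```

Note that Bedrock Runtime sits in us-east-1 while the other services are in eu-north-1, so any client factory would need a per-service region rather than one global default.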

Function Call Summary

| Stage | Function | Input | Output | Resource |
|---|---|---|---|---|
| Extract | `extract_text(path)` | PDF path | text (str) | pymupdf / pdftotext |
| Section | `remove_sections(text)` | raw text | cleaned text | regex |
| Chunk | `TokenChunker.chunk_text(text, id)` | text | Chunk[] | tiktoken |
| KG | `KGExtractor.extract_from_chunks(chunks)` | Chunk[] | Chunk[] (with nodes/rels) | Bedrock LLM |
| Resolve | `EntityResolver.resolve(nodes, rels)` | nodes[], rels[] | deduped nodes[], rels[] | synonym_map |
| Enrich | `EntityEnricher.enrich(nodes)` | nodes[] | nodes[] + uniprot/mondo | protein_map, disease_map |
| Write | `writer.write_document(doc)` | ProcessedDocument | graph edges | Neptune / Neo4j |
| Embed | `EmbeddingPipeline.embed_chunks(chunks)` | Chunk[] | vectors | Bedrock Titan → Aurora |
| Validate | `validate_all(nodes, rels, data_dir)` | nodes[], rels[] | rels[] + evidence | STRING, DISEASES, Reactome |
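
To make the Chunk row concrete, here is a minimal sketch of overlapping windows using the 512-token size and 64-token overlap shown in the flow diagram. It works on a plain list of token ids so it runs without tiktoken; `chunk_tokens` is a hypothetical helper, not `TokenChunker.chunk_text` itself, and the real chunker may treat the final window differently.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[int], size: int = 512, overlap: int = 64
) -> List[List[int]]:
    """Split tokens into windows of at most `size` tokens.

    Each window after the first starts `size - overlap` tokens after the
    previous one, so consecutive windows share `overlap` tokens.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks
```

With these defaults the window advances 448 tokens at a time, so a 1,000-token document yields three chunks, the last one shorter than the others.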