---
name: vision-first-book-pdf-extraction
description: Extract structured text from book-like PDFs whose embedded text layer is broken but page images and bookmarks are usable. Use TOC-guided page segmentation, render PNGs, and send pages to a vision model for faithful transcription.
tags: [pdf, extraction, vision, ocr, books, toc, bookmarks, transcription, structured-data]
---

# Vision-First Book PDF Extraction

Use this when a PDF looks readable to humans but direct text extraction is garbage, especially for books or scanned/encoded documents with:

- usable bookmarks / table of contents
- clean page images
- repeated section structure
- a need for faithful transcription, not summarization

This worked well for an abridged Wilhelm I Ching PDF where:

- PyMuPDF `get_text()` returned font-encoded junk
- OCR was usable but noisy
- rendered PNGs plus vision transcription produced much cleaner structured output

## When to prefer this over OCR-first

Choose vision-first when:

1. The PDF text layer is broken or encoded incorrectly
2. The page images are sharp and readable
3. Section structure matters (headings, subsections, line entries)
4. The user wants high fidelity to source wording
5. The document has bookmarks/TOC you can trust for segmentation

Do NOT start by bulk OCRing the whole file if a quick sample shows:

- junk extracted text
- good rendered page readability
- strong TOC/bookmark structure

## Core pipeline

### 1. Test the document first

Use a tiny sample before committing to a method.

For local PDFs:

- inspect page count and TOC with PyMuPDF
- sample a few pages with `page.get_text()`
- render at least one representative page to PNG
- check readability with a vision tool

If direct extraction is garbage but vision reads the rendered PNG well, switch to vision-first.

### 2. Use bookmarks/TOC for segmentation

For structured books, bookmarks are the safest page-boundary source.

Preferred method:

- load `doc.get_toc()` in PyMuPDF
- identify level-2 entries matching your document sections (for example `^\d+\.` for numbered chapters/hexagrams)
- compute `startPage` from the current entry
- compute `endPage` from the next top-level or same-level section minus one

Important:

- use TOC/bookmarks only for boundaries and headings
- do NOT trust broken text extraction for the section content itself

### 3. Render section pages to PNG

Render each page at high DPI, usually 300 DPI.

Store under stable paths such as:

- `public///page-005.png`
- `public///page-006.png`

Why:

- vision models read clean PNGs much better than bad text layers
- these images become your visual source of truth
- the app can later show “view source page(s)” for fidelity verification

### 4. Transcribe with a vision model, not OCR only

Send the rendered PNGs to a multimodal model with a strict transcription prompt.

Provider is part of provenance. If the user specified a provider family (for example Alex's HD/MEGA ingestion requires OpenAI/ChatGPT-family models and explicitly forbids Venice), do not silently switch to another backend. Stop, report the limitation, or mark any wrong-provider outputs as superseded/audit-only and re-run with the requested provider.

Prompt rules that worked well:

- return JSON only
- list exact required keys
- do not paraphrase, modernize, simplify, or normalize wording
- preserve old-fashioned or unusual phrasing
- ignore page numbers, decorative bullets, and navigation marks
- preserve visible poetic / verse line breaks where meaningful
- join only obvious visual wraps
- return section fields separately (for example heading, judgement, image, lines)

If Above/Below or similar metadata lines exist, explicitly specify their desired format in the prompt.

Example requirement:

- `For above/below, return 'Romanized / English name, Element'`

### 5. Keep deterministic normalization small

After vision transcription, only do light cleanup such as:

- newline normalization
- repeated blank line cleanup
- exact formatting normalization for known metadata fields
- rebuilding a canonical `fullText` from structured fields

Do NOT do heavy rewriting or LLM repair after extraction if the user wants fidelity.

## Recommended JSON shape

```json
{
  "number": 1,
  "heading": "1. Ch'ien / The Creative",
  "pageStart": 5,
  "pageEnd": 6,
  "sourceImages": [
    "public/wilhelm/001/page-005.png",
    "public/wilhelm/001/page-006.png"
  ],
  "above": "Ch'ien / The Creative, Heaven",
  "below": "Ch'ien / The Creative, Heaven",
  "judgement": "The Creative works sublime success,\nFurthering through perseverance.",
  "image": "The movement of heaven is full of power.\nThus the superior man makes himself strong and untiring.",
  "lines": [
    {
      "label": "Nine at the beginning means:",
      "text": "Hidden dragon. Do not act."
    }
  ],
  "notes": [],
  "fullText": "...rebuilt canonical text..."
}
```

## Implementation pattern

### Python for discovery and rendering

Use Python / PyMuPDF for:

- opening the PDF
- reading the TOC
- computing page ranges
- rendering PNGs

### TypeScript or Python for model calls

Use whichever language matches the repo. For JS/TS apps, a TS script calling OpenRouter/OpenAI is fine.

Typical flow:

1. load section map JSON
2. ensure section page PNGs exist
3. base64 the PNGs
4. send them to the model as image inputs
5. parse JSON response
6. normalize small formatting details
7. write generated JSON + markdown report

## Golden-sample-first rule

Before bulk extraction, always run a golden sample of 3–6 sections that represent:

- simple early section
- one with uncommon punctuation/romanization
- one with longer line entries
- one near the end of the book
- one possible boundary edge case

Review for:

- heading fidelity
- subsection fidelity
- line-break quality
- contamination from neighboring pages
- page-range correctness

Only run the full document after the sample is clearly good.

## Important findings from experience

### Vision beat OCR on readable page images

In the Wilhelm PDF:

- Tesseract got the rough content, but left garbage like stray glyphs, broken heading fragments, and repeated section markers
- vision transcription from rendered PNGs produced cleaner wording and structure

So if rendered PNGs are human-readable, vision may outperform OCR for faithful section extraction.

### TOC + vision is stronger than either alone

Best combo:

- TOC/bookmarks for deterministic section boundaries
- vision for actual content extraction
- source PNGs retained for verification

### Preserve source images in the final product

If the user wants “exactly as it appears in the PDF,” text alone is not enough. Always retain and link the page PNGs so any ambiguous transcription can be checked visually.

### Watch for non-content pages after the final section

A final chapter/section may be followed by afterword/back matter. Do not assume the last section runs to the final page of the PDF. Use the next TOC entry of same-or-higher level to stop the range.

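The range rule above can be sketched as a small pure-Python helper. This is a minimal illustration, not the original implementation: it assumes the TOC has the shape PyMuPDF's `doc.get_toc()` returns (a list of `[level, title, page]` entries with 1-based page numbers), and the function name `section_ranges`, its parameters, and the sample TOC are all hypothetical.

```python
import re


def section_ranges(toc, page_count, section_level=2, pattern=r"^\d+\."):
    """Map matching TOC entries to inclusive, 1-based page ranges.

    `toc` is assumed to look like PyMuPDF's doc.get_toc() output:
    a list of [level, title, page] entries with 1-based pages.
    """
    matcher = re.compile(pattern)
    sections = []
    for i, entry in enumerate(toc):
        level, title, page = entry[:3]
        if level != section_level or not matcher.match(title):
            continue
        # Fall back to the last page only if no later entry closes the range.
        end_page = page_count
        for next_entry in toc[i + 1:]:
            next_level, _next_title, next_page = next_entry[:3]
            # Stop at the next entry of the same or a higher level
            # (a smaller level number means higher in the outline).
            if next_level <= level:
                # Guard against a next section starting on the same page.
                end_page = max(page, next_page - 1)
                break
        sections.append(
            {"heading": title, "pageStart": page, "pageEnd": end_page}
        )
    return sections


# Made-up TOC for illustration only; real values come from the PDF.
toc = [
    [1, "Introduction", 1],
    [1, "The Hexagrams", 5],
    [2, "1. Ch'ien / The Creative", 5],
    [2, "2. K'un / The Receptive", 7],
    [1, "Afterword", 10],
]
ranges = section_ranges(toc, page_count=12)
for section in ranges:
    print(section)
```

In this sample the second section stops at page 9 because the level-1 "Afterword" entry starts on page 10, even though the PDF has 12 pages. With a real document, the call would presumably look like `section_ranges(doc.get_toc(), doc.page_count)`; the level and pattern defaults should be adjusted to the bookmarks actually present.
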
## Suggested report outputs

During extraction, also write:

- `docs/reports/-sample.md`
- `docs/reports/-sample.json`
- optional section map JSON

Include:

- section heading
- page range
- source image paths
- extracted structured fields
- notes / warnings

## Verification checklist

- [ ] direct text extraction was tested and rejected for a concrete reason
- [ ] TOC/bookmarks were inspected
- [ ] a rendered PNG sample was checked visually
- [ ] vision sample output was reviewed before bulk run
- [ ] output preserves source phrasing
- [ ] section boundaries are correct
- [ ] source images are retained
- [ ] generated JSON is structured enough for app rendering

## Good fit examples

- sacred texts / translations
- book chapters with standard subheadings
- poetry/prose sections where layout matters
- older PDFs with broken embedded text but sharp page scans

## Bad fit examples

- dense tables requiring exact cell extraction
- handwritten notes
- low-resolution scans with poor readability
- documents with no stable section structure and no TOC