Use this when a PDF looks readable to humans but direct text extraction is garbage, especially for books or scanned/encoded documents with:
- usable bookmarks / table of contents
- clean page images
- repeated section structure
- a need for faithful transcription, not summarization
This worked well for an abridged Wilhelm I Ching PDF where:
- PyMuPDF get_text() returned font-encoded junk
- OCR was usable but noisy
- rendered PNGs plus vision transcription produced much cleaner structured output
Choose vision-first when:
1. The PDF text layer is broken or encoded incorrectly
2. The page images are sharp and readable
3. Section structure matters (headings, subsections, line entries)
4. The user wants high fidelity to source wording
5. The document has bookmarks/TOC you can trust for segmentation
Do NOT start by bulk OCRing the whole file if a quick sample shows:
- junk extracted text
- good rendered page readability
- strong TOC/bookmark structure
Use a tiny sample before committing to a method.
For local PDFs:
- inspect page count and TOC with PyMuPDF
- sample a few pages with page.get_text()
- render at least one representative page to PNG
- check readability with a vision tool
If direct extraction is garbage but vision reads the rendered PNG well, switch to vision-first.
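The sampling steps above could be scripted roughly like this. It is a minimal sketch: junk_ratio and its word-likeness rule are a hypothetical heuristic (not something from the Wilhelm run), sample_pdf is an illustrative helper name, and PyMuPDF is imported lazily so the heuristic can be used on its own.

```python
def junk_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens that do not look like words.

    Hypothetical heuristic: a token is word-like if, after trimming edge
    punctuation and dropping apostrophes/hyphens, it is purely alphabetic
    or purely numeric.
    """
    tokens = text.split()
    if not tokens:
        return 1.0  # an empty text layer is treated as unusable
    bad = 0
    for tok in tokens:
        core = tok.strip(".,;:!?\"()[]").replace("'", "").replace("-", "")
        if not (core.isalpha() or core.isdigit()):
            bad += 1
    return bad / len(tokens)


def sample_pdf(path: str, page_numbers: list[int], png_out: str) -> dict:
    """Collect page count, TOC size, and per-page junk ratios; render one page."""
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        report = {
            "page_count": doc.page_count,
            "toc_entries": len(doc.get_toc()),
            # page_numbers are 1-based; PyMuPDF page indexes are 0-based
            "junk": {n: junk_ratio(doc[n - 1].get_text()) for n in page_numbers},
        }
        # Render the first sampled page so it can be checked with a vision tool.
        doc[page_numbers[0] - 1].get_pixmap(dpi=300).save(png_out)
    return report
```

A high junk ratio on pages that look fine as PNGs is the signal to go vision-first; the threshold is a judgment call for each document.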
For structured books, bookmarks are the safest page-boundary source.
Preferred method:
- load doc.get_toc() in PyMuPDF
- identify level-2 entries matching your document sections (for example ^\d+\. for numbered chapters/hexagrams)
- compute startPage from the current entry
- compute endPage as the start page of the next entry at the same or a higher level, minus one
Important:
- use TOC/bookmarks only for boundaries and headings
- do NOT trust broken text extraction for the section content itself
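A minimal sketch of the boundary computation, assuming the TOC is the list of [level, title, page] entries that doc.get_toc() returns (pages are 1-based). The function name section_ranges and the default level and pattern are illustrative choices, not fixed parts of the method.

```python
import re


def section_ranges(toc, page_count, level=2, pattern=r"^\d+\."):
    """Return startPage/endPage for every TOC entry at `level` matching `pattern`."""
    title_re = re.compile(pattern)
    sections = []
    for i, entry in enumerate(toc):
        lvl, title, start = entry[0], entry[1], entry[2]
        if lvl != level or not title_re.match(title):
            continue
        # Default: run to the last page, unless a later entry closes the range.
        end = page_count
        for nxt in toc[i + 1:]:
            if nxt[0] <= level:  # same or higher level ends this section
                # max() guards against two sections starting on the same page
                end = max(start, nxt[2] - 1)
                break
        sections.append({"heading": title, "startPage": start, "endPage": end})
    return sections
```

Deeper entries (for example a level-3 subsection) are skipped when looking for the end, so they stay inside the section they belong to.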
Render each page at high DPI, usually 300 DPI.
Store under stable paths such as:
- public/<doc-name>/<section-id>/page-005.png
- public/<doc-name>/<section-id>/page-006.png
Why:
- vision models read clean PNGs much better than bad text layers
- these images become your visual source of truth
- the app can later show “view source page(s)” for fidelity verification
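One way to render a section's pages into the stable layout shown above. This is a sketch: page_png_path and render_section are hypothetical helper names, and the zero-padded three-digit page number simply mirrors the example paths.

```python
from pathlib import Path


def page_png_path(doc_name, section_id, page_number, root="public"):
    """Build the stable path for one rendered page, e.g. .../page-005.png."""
    return f"{root}/{doc_name}/{section_id}/page-{page_number:03d}.png"


def render_section(pdf_path, doc_name, section_id, start_page, end_page, dpi=300):
    """Render pages start_page..end_page (1-based, inclusive) to PNG files."""
    import fitz  # PyMuPDF, imported lazily

    written = []
    with fitz.open(pdf_path) as doc:
        for page_number in range(start_page, end_page + 1):
            out = Path(page_png_path(doc_name, section_id, page_number))
            out.parent.mkdir(parents=True, exist_ok=True)
            if not out.exists():  # skip pages already rendered
                doc[page_number - 1].get_pixmap(dpi=dpi).save(str(out))
            written.append(out.as_posix())
    return written
```

Skipping existing files keeps re-runs cheap; delete the PNGs if the DPI or source PDF changes.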
Send the rendered PNGs to a multimodal model with a strict transcription prompt.
Provider is part of provenance. If the user specified a provider family (for example Alex's HD/MEGA ingestion requires OpenAI/ChatGPT-family models and explicitly forbids Venice), do not silently switch to another backend. Stop, report the limitation, or mark any wrong-provider outputs as superseded/audit-only and re-run with the requested provider.
Prompt rules that worked well:
- return JSON only
- list exact required keys
- do not paraphrase, modernize, simplify, or normalize wording
- preserve old-fashioned or unusual phrasing
- ignore page numbers, decorative bullets, and navigation marks
- preserve visible poetic / verse line breaks where meaningful
- join only obvious visual wraps
- return section fields separately (for example heading, judgement, image, lines)
If Above/Below or similar metadata lines exist, explicitly specify their desired format in the prompt.
Example requirement:
- For above/below, return 'Romanized / English name, Element'
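The prompt rules above can be assembled into a single strict prompt. The sketch below is illustrative only: the exact wording, the REQUIRED_KEYS tuple, and the function name are assumptions, not the prompt that was actually used.

```python
# Illustrative key list; adjust to the fields your document actually has.
REQUIRED_KEYS = ("heading", "above", "below", "judgement", "image", "lines")


def build_transcription_prompt(
    keys=REQUIRED_KEYS,
    above_below_format="Romanized / English name, Element",
):
    """Assemble a strict, fidelity-first transcription prompt as plain text."""
    rules = [
        "Return JSON only, with exactly these keys: " + ", ".join(keys) + ".",
        "Do not paraphrase, modernize, simplify, or normalize wording.",
        "Preserve old-fashioned or unusual phrasing.",
        "Ignore page numbers, decorative bullets, and navigation marks.",
        "Preserve visible verse line breaks where meaningful; "
        "join only obvious visual wraps.",
        f"For above/below, return '{above_below_format}'.",
    ]
    header = "Transcribe the attached page images faithfully."
    return header + "\n" + "\n".join(f"- {rule}" for rule in rules)
```

Keeping the rules in a list makes it easy to diff prompt versions between a golden sample run and the full run.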
After vision transcription, only do light cleanup such as:
- newline normalization
- repeated blank line cleanup
- exact formatting normalization for known metadata fields
- rebuilding a canonical fullText from structured fields
Do NOT do heavy rewriting or LLM repair after extraction if the user wants fidelity.
Example output record for one section:
{
  "number": 1,
  "heading": "1. Ch'ien / The Creative",
  "pageStart": 5,
  "pageEnd": 6,
  "sourceImages": [
    "public/wilhelm/001/page-005.png",
    "public/wilhelm/001/page-006.png"
  ],
  "above": "Ch'ien / The Creative, Heaven",
  "below": "Ch'ien / The Creative, Heaven",
  "judgement": "The Creative works sublime success,\nFurthering through perseverance.",
  "image": "The movement of heaven is full of power.\nThus the superior man makes himself strong and untiring.",
  "lines": [
    {
      "label": "Nine at the beginning means:",
      "text": "Hidden dragon. Do not act."
    }
  ],
  "notes": [],
  "fullText": "...rebuilt canonical text..."
}
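The light cleanup described earlier might look like the following for a record shaped like the example above. The fullText layout (blocks joined by blank lines, label on its own line) is an assumption for illustration, since the canonical format is not spelled out here.

```python
import re


def clean_text(text: str) -> str:
    """Normalize newlines, trim trailing spaces, collapse runs of blank lines."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def rebuild_full_text(record: dict) -> str:
    """Rebuild a fullText string from structured fields (assumed layout)."""
    blocks = [record["heading"]]
    for key in ("above", "below"):
        if record.get(key):
            blocks.append(f"{key}: {record[key]}")
    for key in ("judgement", "image"):
        if record.get(key):
            blocks.append(clean_text(record[key]))
    for line in record.get("lines", []):
        blocks.append(f"{line['label']}\n{clean_text(line['text'])}")
    return "\n\n".join(blocks)
```

Nothing here touches wording: only whitespace changes, which keeps the cleanup compatible with a fidelity requirement.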
Use Python / PyMuPDF for:
- opening the PDF
- reading the TOC
- computing page ranges
- rendering PNGs
For the vision transcription step, use whichever language matches the repo. For JS/TS apps, a TS script calling OpenRouter/OpenAI is fine.
Typical flow:
1. load section map JSON
2. ensure section page PNGs exist
3. base64 the PNGs
4. send them to the model as image inputs
5. parse JSON response
6. normalize small formatting details
7. write generated JSON + markdown report
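Steps 3 to 5 of the flow could be sketched as below, shown in Python for consistency with the PDF-side tooling. The content-part layout follows the OpenAI-style chat format with image_url data URLs; whether your provider accepts exactly this shape is an assumption to verify, and the network call itself is deliberately left out. All function names are illustrative.

```python
import base64
import json

FENCE = "`" * 3  # markdown code fence some models wrap JSON in


def png_to_data_url(png_bytes: bytes) -> str:
    """Encode raw PNG bytes as a base64 data URL."""
    return "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")


def build_user_content(prompt: str, png_blobs: list[bytes]) -> list[dict]:
    """Build one user message body: the prompt followed by each page image."""
    content = [{"type": "text", "text": prompt}]
    for blob in png_blobs:
        content.append(
            {"type": "image_url", "image_url": {"url": png_to_data_url(blob)}}
        )
    return content


def parse_model_json(raw: str) -> dict:
    """Parse the model reply, tolerating a surrounding markdown fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    return json.loads(text)  # raises ValueError on anything that is not JSON
```

Letting json.loads raise on bad output is intentional: a failed parse should surface in the report rather than be repaired silently.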
Before bulk extraction, always run a golden sample of 3–6 sections that represent:
- simple early section
- one with uncommon punctuation/romanization
- one with longer line entries
- one near the end of the book
- one possible boundary edge case
Review for:
- heading fidelity
- subsection fidelity
- line-break quality
- contamination from neighboring pages
- page-range correctness
Only run the full document after the sample is clearly good.
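Part of the golden-sample review can be automated as a pre-check before a human reads the output. This is a hypothetical helper, assuming records use the field names from the example record; it flags mismatches for review and cannot detect contamination from neighboring pages or judge line-break quality.

```python
def review_record(record: dict, toc_heading: str, start_page: int, end_page: int):
    """Return a list of warnings for one extracted section (empty if none)."""
    warnings = []
    if record.get("heading") != toc_heading:
        warnings.append("heading differs from TOC title")
    if (record.get("pageStart"), record.get("pageEnd")) != (start_page, end_page):
        warnings.append("page range differs from section map")
    expected_images = end_page - start_page + 1
    if len(record.get("sourceImages", [])) != expected_images:
        warnings.append("source image count does not match page range")
    for key in ("judgement", "image"):
        if not record.get(key):
            warnings.append(f"missing {key}")
    if not record.get("lines"):
        warnings.append("no line entries")
    return warnings
```

A heading mismatch is not necessarily an error, since the TOC title and the printed heading can legitimately differ; it only marks the section for a closer look against the source PNGs.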
In the Wilhelm PDF:
- Tesseract got the rough content, but left garbage like stray glyphs, broken heading fragments, and repeated section markers
- vision transcription from rendered PNGs produced cleaner wording and structure
So if rendered PNGs are human-readable, vision may outperform OCR for faithful section extraction.
Best combo:
- TOC/bookmarks for deterministic section boundaries
- vision for actual content extraction
- source PNGs retained for verification
If the user wants “exactly as it appears in the PDF,” text alone is not enough. Always retain and link the page PNGs so any ambiguous transcription can be checked visually.
A final chapter/section may be followed by afterword/back matter. Do not assume the last section runs to the final page of the PDF. Use the next TOC entry of same-or-higher level to stop the range.
During extraction, also write:
- docs/reports/<name>-sample.md
- docs/reports/<name>-sample.json
- optional section map JSON
Include:
- section heading
- page range
- source image paths
- extracted structured fields
- notes / warnings
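A minimal sketch of writing the two report files, assuming records shaped like the example record. The markdown layout and the helper names are illustrative; here the JSON file carries the full structured fields, while the markdown is a short human-readable index.

```python
import json
from pathlib import Path


def render_report_md(records: list[dict]) -> str:
    """Render a short markdown summary: heading, pages, images, notes."""
    lines = []
    for rec in records:
        lines.append(f"## {rec['heading']}")
        lines.append(f"Pages: {rec['pageStart']}-{rec['pageEnd']}")
        lines.append("Source images:")
        lines.extend(f"- {path}" for path in rec.get("sourceImages", []))
        notes = rec.get("notes") or ["none"]
        lines.append("Notes: " + "; ".join(notes))
        lines.append("")  # blank line between sections
    return "\n".join(lines)


def write_reports(name: str, records: list[dict], root: str = "docs/reports"):
    """Write <name>-sample.json and <name>-sample.md under root."""
    out_dir = Path(root)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / f"{name}-sample.json").write_text(
        json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    (out_dir / f"{name}-sample.md").write_text(
        render_report_md(records), encoding="utf-8"
    )
```

ensure_ascii=False keeps romanization and punctuation readable in the JSON instead of escaping them.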