vision-first-book-pdf-extraction

/home/avalon/.hermes/skills/productivity/vision-first-book-pdf-extraction/SKILL.md · raw

Vision-First Book PDF Extraction

Use this when a PDF looks readable to humans but direct text extraction is garbage, especially for books or scanned/encoded documents with: - usable bookmarks / table of contents - clean page images - repeated section structure - a need for faithful transcription, not summarization

This worked well for an abridged Wilhelm I Ching PDF where: - PyMuPDF get_text() returned font-encoded junk - OCR was usable but noisy - rendered PNGs plus vision transcription produced much cleaner structured output

When to prefer this over OCR-first

Choose vision-first when: 1. The PDF text layer is broken or encoded incorrectly 2. The page images are sharp and readable 3. Section structure matters (headings, subsections, line entries) 4. The user wants high fidelity to source wording 5. The document has bookmarks/TOC you can trust for segmentation

Do NOT start by bulk OCRing the whole file if a quick sample shows: - junk extracted text - good rendered page readability - strong TOC/bookmark structure

Core pipeline

1. Test the document first

Use a tiny sample before committing to a method.

For local PDFs: - inspect page count and TOC with PyMuPDF - sample a few pages with page.get_text() - render at least one representative page to PNG - check readability with a vision tool

If direct extraction is garbage but vision reads the rendered PNG well, switch to vision-first.

2. Use bookmarks/TOC for segmentation

For structured books, bookmarks are the safest page-boundary source.

Preferred method: - load doc.get_toc() in PyMuPDF - identify level-2 entries matching your document sections (for example ^\d+\. for numbered chapters/hexagrams) - compute startPage from the current entry - compute endPage from the next top-level or same-level section minus one

Important: - use TOC/bookmarks only for boundaries and headings - do NOT trust broken text extraction for the section content itself

3. Render section pages to PNG

Render each page at high DPI, usually 300 DPI.

Store under stable paths such as: - public/<doc-name>/<section-id>/page-005.png - public/<doc-name>/<section-id>/page-006.png

Why: - vision models read clean PNGs much better than bad text layers - these images become your visual source of truth - the app can later show “view source page(s)” for fidelity verification

4. Transcribe with a vision model, not OCR only

Send the rendered PNGs to a multimodal model with a strict transcription prompt.

Provider is part of provenance. If the user specified a provider family (for example Alex's HD/MEGA ingestion requires OpenAI/ChatGPT-family models and explicitly forbids Venice), do not silently switch to another backend. Stop, report the limitation, or mark any wrong-provider outputs as superseded/audit-only and re-run with the requested provider.

Prompt rules that worked well: - return JSON only - list exact required keys - do not paraphrase, modernize, simplify, or normalize wording - preserve old-fashioned or unusual phrasing - ignore page numbers, decorative bullets, and navigation marks - preserve visible poetic / verse line breaks where meaningful - join only obvious visual wraps - return section fields separately (for example heading, judgement, image, lines)

If Above/Below or similar metadata lines exist, explicitly specify their desired format in the prompt.

Example requirement: - For above/below, return 'Romanized / English name, Element'

5. Keep deterministic normalization small

After vision transcription, only do light cleanup such as: - newline normalization - repeated blank line cleanup - exact formatting normalization for known metadata fields - rebuilding a canonical fullText from structured fields

Do NOT do heavy rewriting or LLM repair after extraction if the user wants fidelity.

Recommended JSON shape

{
  "number": 1,
  "heading": "1. Ch'ien / The Creative",
  "pageStart": 5,
  "pageEnd": 6,
  "sourceImages": [
    "public/wilhelm/001/page-005.png",
    "public/wilhelm/001/page-006.png"
  ],
  "above": "Ch'ien / The Creative, Heaven",
  "below": "Ch'ien / The Creative, Heaven",
  "judgement": "The Creative works sublime success,\nFurthering through perseverance.",
  "image": "The movement of heaven is full of power.\nThus the superior man makes himself strong and untiring.",
  "lines": [
    {
      "label": "Nine at the beginning means:",
      "text": "Hidden dragon. Do not act."
    }
  ],
  "notes": [],
  "fullText": "...rebuilt canonical text..."
}

Implementation pattern

Python for discovery and rendering

Use Python / PyMuPDF for: - opening the PDF - reading the TOC - computing page ranges - rendering PNGs

TypeScript or Python for model calls

Use whichever language matches the repo. For JS/TS apps, a TS script calling OpenRouter/OpenAI is fine.

Typical flow: 1. load section map JSON 2. ensure section page PNGs exist 3. base64 the PNGs 4. send them to the model as image inputs 5. parse JSON response 6. normalize small formatting details 7. write generated JSON + markdown report

Golden-sample-first rule

Before bulk extraction, always run a golden sample of 3–6 sections that represent: - simple early section - one with uncommon punctuation/romanization - one with longer line entries - one near the end of the book - one possible boundary edge case

Review for: - heading fidelity - subsection fidelity - line-break quality - contamination from neighboring pages - page-range correctness

Only run the full document after the sample is clearly good.

Important findings from experience

Vision beat OCR on readable page images

In the Wilhelm PDF: - Tesseract got the rough content, but left garbage like stray glyphs, broken heading fragments, and repeated section markers - vision transcription from rendered PNGs produced cleaner wording and structure

So if rendered PNGs are human-readable, vision may outperform OCR for faithful section extraction.

TOC + vision is stronger than either alone

Best combo: - TOC/bookmarks for deterministic section boundaries - vision for actual content extraction - source PNGs retained for verification

Preserve source images in the final product

If the user wants “exactly as it appears in the PDF,” text alone is not enough. Always retain and link the page PNGs so any ambiguous transcription can be checked visually.

Watch for non-content pages after the final section

A final chapter/section may be followed by afterword/back matter. Do not assume the last section runs to the final page of the PDF. Use the next TOC entry of same-or-higher level to stop the range.

Suggested report outputs

During extraction, also write: - docs/reports/<name>-sample.md - docs/reports/<name>-sample.json - optional section map JSON

Include: - section heading - page range - source image paths - extracted structured fields - notes / warnings

Verification checklist

[ ] direct text extraction was tested and rejected for a concrete reason
[ ] TOC/bookmarks were inspected
[ ] a rendered PNG sample was checked visually
[ ] vision sample output was reviewed before bulk run
[ ] output preserves source phrasing
[ ] section boundaries are correct
[ ] source images are retained
[ ] generated JSON is structured enough for app rendering

Good fit examples

sacred texts / translations
book chapters with standard subheadings
poetry/prose sections where layout matters
older PDFs with broken embedded text but sharp page scans

Bad fit examples

dense tables requiring exact cell extraction
handwritten notes
low-resolution scans with poor readability
documents with no stable section structure and no TOC