---
name: gpt-image-book-to-wiki
description: Use when extracting deep, queryable knowledge from PDFs, ebooks, or book-like documents using GPT Image 2 / vision-first page reading, with exact quotes, diagram placement, provenance, and LLM Wiki integration.
version: 1.0.0
author: Hermes Agent
license: MIT
metadata:
  hermes:
    tags: [pdf, ebook, gpt-image-2, vision, transcription, diagrams, provenance, llm-wiki, knowledge-base]
    related_skills: [vision-first-book-pdf-extraction, llm-wiki, ocr-and-documents]
---

# GPT Image Book to LLM Wiki

## Overview

This is an experimental, high-fidelity ingestion workflow for turning a PDF or ebook into a queryable LLM Wiki knowledge base. It uses a vision-first pipeline: render pages to images, have a multimodal model read the pages, preserve exact quotes and diagram context, then compile both raw and synthesized knowledge into the wiki.

Use this when the goal is not just “extract text,” but to understand a whole book: chapter structure, arguments, definitions, diagrams, captions, examples, quotes, and how ideas connect.

## When to Use

Use this skill when the user asks to:

- Read a whole PDF/book/ebook with high context fidelity
- Preserve exact quotes and source page references
- Capture diagrams, figures, tables, captions, or visual layouts
- Build a queryable book knowledge base
- Ingest a book into an LLM Wiki / Obsidian-style markdown vault
- Compare book claims against existing wiki knowledge

Do not use as the first choice for:

- Simple text-based PDFs where `web_extract` or PyMuPDF is enough
- One-off summaries where exact quotations and page provenance do not matter
- Low-quality scans that are not human-readable after rendering
- Mass ingestion without budget/runtime approval

## Model / Provider Assumption

Preferred provider family: OpenAI / ChatGPT.
For Alex's Human Design / MEGA corpus specifically, do **not** use Venice as the backend for extraction, transcription, synthesis, or alignment; use OpenAI/ChatGPT-family backends such as OpenAI Codex/ChatGPT, GPT-5.5, GPT Vision-capable chat models, and GPT Image 2 only where image generation/reconstruction is appropriate.

Preferred image/vision model:

- GPT-5.5 or another OpenAI vision-capable chat/reasoning model for page reading and structured extraction
- `gpt-image-2` for image generation, image-aware transforms, or diagram reconstruction; do not assume it is the right API for plain structured text extraction
- Hermes `openai-codex` may be useful for agentic coding/orchestration, but for BYO OpenAI API key extraction scripts, call OpenAI endpoints directly and record the actual model used

If a tool/backend cannot pass page images directly to the preferred OpenAI model, stop and report the limitation rather than silently switching providers. Always report which model/backend was actually used.

## Core Principle

Never treat generated summaries as the source of truth.

Source of truth hierarchy:

1. Original PDF/ebook file
2. Rendered page images
3. Raw page-level JSON extraction with quotes and bounding/context metadata
4. Chapter-level synthesis linked to raw pages
5. Wiki pages linked back to raw files and page images

## Pipeline

### 1. Intake, classification, and capability check

For every document, classify before extracting:

- Identify file type: PDF, EPUB, MOBI/AZW/KFX, DOCX, etc.
- For PDFs, do **not** treat `.pdf` as one homogeneous category. Probe and route by observed structure:
  - text-rich book/article PDF: high coherent text chars/page, paragraphs, usable headings → text extraction plus vision spot checks.
  - scanned or OCR-poor book: low/garbled text layer, visible page images → render all pages + OCR/vision page extraction.
  - slide deck PDF: PowerPoint/Keynote/Slides metadata, sparse text layer, short bullets, diagrams/callouts → render all slides + slide-aware vision extraction.
  - diagram-heavy manual/workbook: forms, bodygraphs, tables, charts, exercises → render pages + figure/table/layout extraction.
  - mixed/unknown: run a 5-10 page golden sample and defer bulk extraction until a route is chosen.
- Record a machine-readable routing decision with `detected_class`, `confidence`, `signals`, `chosen_pipeline`, `rejected_pipelines`, and `next_action` so future agents know why a tool was chosen.
- Check whether embedded text is usable.
- Check whether the document has TOC/bookmarks.
- Count pages/chapters.
- Render 3-6 representative pages.
- Run a golden sample before bulk processing.

For ebooks:

- Prefer structured extraction for EPUB HTML where available.
- Still render pages/screenshots when layout, images, or exact placement matters.
- For proprietary formats, convert to PDF/EPUB only if the user has rights and local tools support it.

### 2. Segment before reading

Create a deterministic section map:

- book metadata
- front matter
- chapters
- subsections if visible in TOC/bookmarks
- page ranges
- figure/table ranges

Use TOC/bookmarks when available. If absent, infer boundaries from rendered pages with a small vision pass, then ask for approval before full ingestion.

### 3. Render source images

Render each page as a high-quality image:

- 200-300 DPI for normal text
- 300+ DPI for dense diagrams/tables
- Preserve page number in filename

Suggested output:

```text
raw/books//source.pdf
raw/books//pages/page-0001.png
raw/books//pages/page-0002.png
raw/books//extraction/page-0001.json
raw/books//chapters/chapter-01.md
raw/assets/books//figures/figure-001.png
```

### 4. Page-level vision extraction

For every page image, request structured JSON.
Required fields:

```json
{
  "page": 1,
  "visible_page_label": "xiii or 42 if present",
  "section_heading": "exact visible heading if any",
  "body_text": "faithful transcription; preserve quotes and unusual wording",
  "quotes": [
    {
      "text": "exact quote",
      "speaker_or_attribution": "if visible/inferable from page only",
      "page": 1,
      "confidence": "high|medium|low"
    }
  ],
  "figures": [
    {
      "id": "fig-p001-01",
      "type": "diagram|photo|chart|table|equation|callout|other",
      "caption": "exact caption if present",
      "description": "precise visual description",
      "placement": "top/middle/bottom/left/right; before/after which paragraph",
      "related_text_before": "short exact text before figure",
      "related_text_after": "short exact text after figure",
      "claims_or_labels": ["exact labels or claims in the figure"],
      "needs_crop": true
    }
  ],
  "tables": [
    {
      "caption": "exact caption if present",
      "columns": [],
      "rows": [],
      "notes": "layout caveats"
    }
  ],
  "uncertain_regions": ["anything unreadable or ambiguous"]
}
```

Rules:

- Do not paraphrase in `body_text` or `quotes`.
- Preserve exact spelling, punctuation, and line breaks where meaningful.
- Join only obvious visual line wraps.
- Mark uncertainty instead of guessing.
- Keep diagrams separate from body text but record where they fit.

### 5. Cross-page reconciliation

After page extraction:

- Merge hyphenated text across page breaks only when obvious.
- Reconstruct chapters from page JSON.
- Deduplicate running headers/footers/page numbers.
- Verify chapter boundaries against TOC.
- Build a quote index with page references.
- Build a figure/table index with page references and cropped assets where helpful.

### 6. Chapter-level knowledge extraction

For each chapter, produce:

- faithful chapter markdown with page anchors
- exact quote list
- definitions
- named entities
- key claims
- arguments and evidence
- diagrams/figures and their role in the argument
- open questions / unclear passages

Every extracted claim should point to page-level provenance, e.g.
`source: raw/books/book/extraction/page-0042.json` or `p. 42`.

### 7. LLM Wiki integration

Before writing to an existing wiki, orient first:

1. Read `SCHEMA.md`
2. Read `index.md`
3. Read recent `log.md`
4. Search for existing entities/concepts from the book

Then write:

- raw immutable source files under `raw/books//`
- one book summary page under `sources/` or `concepts/` depending on schema
- concept/entity updates for central ideas
- comparison pages when the book conflicts with or complements existing wiki pages
- query pages only for substantial, reusable answers

Each wiki page must include:

- YAML frontmatter
- source list
- wikilinks to related pages
- page-level provenance markers for exact claims
- quote blocks with page numbers

### 8. Query interface pattern

For later querying, use both:

- compiled wiki pages for fast synthesis
- raw page/chapter JSON for exact quote lookup

Answer format for queries:

- direct answer
- supporting exact quotes with page numbers
- diagrams/figures involved, with page and placement
- confidence / caveats
- links to wiki pages and raw source files

## Golden Sample Protocol

Before full-book extraction, run 3-6 representative pages/sections:

- one table-of-contents/front-matter page
- one normal text page
- one page with a figure/diagram
- one dense quote/footnote page
- one chapter boundary page
- one difficult scan/layout page if present

Review:

- exact quote fidelity
- diagram placement accuracy
- table structure
- section boundary correctness
- token/cost estimate
- runtime estimate

Only proceed to full extraction after the golden sample is acceptable.
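
Part of the golden-sample review can be screened mechanically before a human reads the pages. The following is a minimal sketch, assuming page records follow the required-field layout from step 4. The helper name `check_page_record` is hypothetical, and the rule that every quote must appear verbatim in `body_text` (after collapsing whitespace) is an assumption made for illustration; a quote taken from a caption or figure label would be flagged and should simply be reviewed by hand.

```python
import re

# Top-level keys taken from the step 4 "Required fields" example.
REQUIRED_KEYS = (
    "page", "visible_page_label", "section_heading", "body_text",
    "quotes", "figures", "tables", "uncertain_regions",
)
CONFIDENCE_LEVELS = {"high", "medium", "low"}


def _squash(text):
    """Collapse whitespace runs so visual line wraps do not cause mismatches."""
    return re.sub(r"\s+", " ", text or "").strip()


def check_page_record(record):
    """Return a list of human-readable problems; an empty list means nothing was flagged."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in record]
    body = _squash(record.get("body_text"))
    for i, quote in enumerate(record.get("quotes") or []):
        text = _squash(quote.get("text"))
        if not text:
            problems.append(f"quote {i}: empty text")
        elif text not in body:
            # Assumption: quotes are drawn from body_text; flag for manual review.
            problems.append(f"quote {i}: not found verbatim in body_text")
        if quote.get("confidence") not in CONFIDENCE_LEVELS:
            problems.append(f"quote {i}: confidence must be high, medium, or low")
        if quote.get("page") != record.get("page"):
            problems.append(f"quote {i}: page does not match record page")
    return problems


# Invented sample record, not taken from any real book.
sample = {
    "page": 42,
    "visible_page_label": "42",
    "section_heading": "",
    "body_text": "The first line wraps\nonto a second line.",
    "quotes": [
        {"text": "line wraps onto a second", "page": 42, "confidence": "high"},
        {"text": "a paraphrased version", "page": 42, "confidence": "sure"},
    ],
    "figures": [],
    "tables": [],
    "uncertain_regions": [],
}
for problem in check_page_record(sample):
    print(problem)
```

A check like this only screens for structural problems and obvious paraphrase; it does not replace comparing the quotes against the rendered page image.
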
## Recommended Outputs

```text
raw/books//
  source.pdf              # local during active ingest; may become source.pdf.s3stub after archival
  metadata.json
  manifest.json           # required when large assets are S3-backed
  section-map.json
  pages/page-0001.png     # local during active ingest; may become page-0001.png.s3stub after archival
  extraction/page-0001.json
  chapters/chapter-01.md
  indexes/quotes.json
  indexes/figures.json
  indexes/entities.json
  reports/golden-sample.md
  reports/final-ingestion-report.md
```

For Alex's VPS wiki, use the hybrid raw-asset pattern after ingestion: keep compiled markdown, chapter text, extraction JSON, indexes, metadata, and manifests local for speed; upload large immutable PDFs/page images/figures/audio/video to private Hetzner S3 and leave local `.s3stub` files plus a `manifest.json` with `s3_uri`, `s3_key`, `bytes`, `sha256`, and content type. Fetch S3 assets into `.cache/s3/` only when exact visual/PDF reinspection is needed.

## Reference Cases

- `references/openai-hd-mega-provider-correction.md` documents the 2026-05-11 provider correction for Alex's Human Design / MEGA wiki: Venice/Qwen outputs are superseded; use OpenAI/ChatGPT-family models, with verified quirks for GPT-5.5 vision and OpenAI audio timestamps.
- `references/openai-hd-lyd-ingestion-runtime-notes.md` documents the RA LYD 7h OpenAI ingestion runtime pattern: verifying 94 GPT-5.5 slide JSON outputs, Whisper-1 timestamped audio outputs, buffered/no-log background process behavior, tight-disk cache handling, and post-run timestamp normalization/QA caveats.
- `references/openai-hd-slide-transcript-alignment.md` documents the OpenAI-only heuristic slide/transcript candidate alignment pattern for Human Design lecture+slide packages, including IDF lexical windowing, provisional-output warnings, and wiki QA/log updates.
- `references/36-faces-mega-golden-sample.md` documents a tested MEGA/PyMuPDF/vision workflow for Austin Coppock's *36 Faces*, including `megadl` download, no-bookmark PDF probing, 200 DPI page rendering, hybrid OCR+vision strategy, and a concrete `Mars in Aries I` exact-quote lookup.
- `references/36-faces-astrology-wiki-ingest.md` documents the follow-up full astrology-wiki ingest shape: raw source package, decan/chapter JSON, rendered opening-page images, placement indexes, concept/query pages, verification counts, and pitfalls like creating zodiac sign pages to avoid broken wikilinks.
- `references/decan-image-source-atlas.md` documents a reusable decan-image source workflow and known pictorial witness: *Astrolabium Planum* (1494) via Internet Archive/NLM, including verified zodiacal facies page numbers, Viewer/S3 layer metadata, and IA/PyMuPDF/OCR pitfalls.

Wiki outputs depend on schema, but commonly:

```text
concepts/.md
entities/.md
comparisons/.md
queries/.md
raw/books//...
```

## Common Pitfalls

1. **Using summaries as source text.** Keep page-level JSON and source images so exact quotes are recoverable.
2. **Skipping sample validation.** Full-book vision extraction can be expensive and compound errors.
3. **Losing diagram placement.** Always capture what text appears before/after figures.
4. **Flattening tables into prose.** Use explicit rows/columns when possible and flag ambiguity.
5. **Overwriting wiki claims without contradiction handling.** Existing wiki pages may conflict with the book; preserve both with dates/sources.
6. **Ignoring copyright.** Extract for user-provided documents and personal/research use; avoid distributing copyrighted full text unless the user has rights.
7. **No model disclosure.** Always state the actual model/backend used for extraction.
8. **Treating slide decks like dense text books.** PowerPoint/Keynote exports to PDF often have very sparse text layers (~200–400 chars/page) with heavy artifacting.
   A 94-page slide deck may contain less text than a 10-page Word document. Always probe the text layer first; if it's sparse or garbled, skip bulk text extraction and go straight to vision-first rendering of representative slides.
9. **Python environment assumptions on VPS.** The system Python may be PEP 668 protected (no `--break-system-packages`), and `python -m venv` may fail with "No space left on device" if the disk is >95% full. Check `df -h` early. Prefer reusing an existing venv (e.g., the Hermes agent venv at `~/.hermes/hermes-agent/venv/bin/python3`) or installing tools via `pip install --user` in a writable location.
10. **Silent provider substitution.** If the user specified OpenAI/ChatGPT-family ingestion, do not silently use Venice, OpenRouter, or a model hosted through another provider even if the model name looks equivalent. The backend provider is part of provenance. Mark wrong-provider outputs superseded/audit-only and re-run with the requested provider.
11. **Slide/transcript alignment is a candidate stage, not provenance.** For lecture+slide packages, after OpenAI slide vision extraction and timestamped ASR, build a provisional slide-to-transcript candidate map before synthesis. Use local lexical/IDF windows or another auditable method, mark candidates as heuristic, and require human/LLM spot-check before citing them in concept/query pages. See `references/openai-hd-slide-transcript-alignment.md`.
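
The text-layer probe from pitfall 8 and the routing record from the intake step can be combined in a small helper. This is a minimal sketch, assuming the per-page character counts have already been collected (for example from the length of PyMuPDF's `page.get_text()` output). The function name `route_by_text_layer`, the threshold values, the class labels, and the pipeline labels are all illustrative assumptions; only the record field names come from the intake step.

```python
from statistics import median

# Assumed thresholds in characters per page; tune them on a golden sample.
SPARSE_MAX = 400      # pitfall 8 cites roughly 200-400 chars/page for slide decks
TEXT_RICH_MIN = 1000  # hypothetical floor for a coherent, paragraph-level text layer

# Hypothetical pipeline labels; use whatever names your tooling expects.
PIPELINES = (
    "text-extraction+vision-spot-checks",
    "render-all+vision",
    "golden-sample-first",
)


def route_by_text_layer(chars_per_page):
    """Build a routing-decision record from per-page text-layer lengths."""
    if not chars_per_page:
        raise ValueError("need at least one page count")
    med = median(chars_per_page)
    empty_ratio = sum(1 for n in chars_per_page if n == 0) / len(chars_per_page)
    if med >= TEXT_RICH_MIN and empty_ratio < 0.1:
        detected, chosen, confidence = "text-rich", PIPELINES[0], "medium"
        next_action = "extract text, then spot-check sampled pages with vision"
    elif med <= SPARSE_MAX:
        # A sparse layer cannot tell a slide deck from a scan; vision decides.
        detected, chosen, confidence = "sparse-or-scanned", PIPELINES[1], "medium"
        next_action = "render pages and inspect them before bulk extraction"
    else:
        detected, chosen, confidence = "mixed-unknown", PIPELINES[2], "low"
        next_action = "run a 5-10 page golden sample, then choose a route"
    return {
        "detected_class": detected,
        "confidence": confidence,
        "signals": {"median_chars_per_page": med, "empty_page_ratio": empty_ratio},
        "chosen_pipeline": chosen,
        "rejected_pipelines": [p for p in PIPELINES if p != chosen],
        "next_action": next_action,
    }


# Invented counts resembling a sparse slide-deck export.
decision = route_by_text_layer([310, 0, 250, 280, 190])
print(decision["detected_class"], decision["chosen_pipeline"])
```

Character counts alone say nothing about whether the text is garbled, so a record like this is a starting point for the golden sample, not a substitute for it.
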
## Verification Checklist

- [ ] Document type and page/chapter count identified
- [ ] Text layer tested before vision-first pipeline chosen
- [ ] TOC/bookmarks inspected or section inference documented
- [ ] Representative pages rendered and sampled
- [ ] Golden sample reviewed before full run
- [ ] Page-level JSON includes exact quotes, figures, tables, uncertainty, and page references
- [ ] Rendered page images retained as source-of-truth assets
- [ ] Chapter markdown and indexes generated
- [ ] Wiki orientation completed before writes
- [ ] Wiki pages have provenance, wikilinks, frontmatter, index updates, and log entry
- [ ] Final report states model used, cost/runtime estimate, files written, caveats, and next query examples
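
When rendered page images are archived under the hybrid raw-asset pattern described in Recommended Outputs, each asset needs a manifest entry so it can still be verified later. The sketch below only computes such an entry locally and uploads nothing. The helper name `manifest_entry`, the `content_type` key spelling, the bucket name, and the key prefix are hypothetical; the other field names follow the manifest description in Recommended Outputs.

```python
import hashlib
import mimetypes
import tempfile
from pathlib import Path


def manifest_entry(path, bucket, key_prefix):
    """Describe one local asset for manifest.json; does not upload anything."""
    path = Path(path)
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Hash in 1 MiB chunks so large PDFs and page images stay out of memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    key = f"{key_prefix.strip('/')}/{path.name}"
    content_type, _ = mimetypes.guess_type(path.name)
    return {
        "s3_uri": f"s3://{bucket}/{key}",
        "s3_key": key,
        "bytes": path.stat().st_size,
        "sha256": digest.hexdigest(),
        "content_type": content_type or "application/octet-stream",
    }


# Demo with a throwaway file; bucket and prefix are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page-0001.png"
    page.write_bytes(b"not a real image, just demo bytes")
    entry = manifest_entry(page, "example-bucket", "raw/books/example-book/pages")
    print(entry)
```

Recording the hash before upload means a later fetch into `.cache/s3/` can be checked against the manifest before the asset is trusted for reinspection.
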