gpt-image-book-to-wiki


GPT Image Book to LLM Wiki

Overview

This is an experimental, high-fidelity ingestion workflow for turning a PDF or ebook into a queryable LLM Wiki knowledge base. It uses a vision-first pipeline: render pages to images, have a multimodal model read the pages, preserve exact quotes and diagram context, then compile both raw and synthesized knowledge into the wiki.

Use this when the goal is not just “extract text,” but to understand a whole book: chapter structure, arguments, definitions, diagrams, captions, examples, quotes, and how ideas connect.

When to Use

Use this skill when the user asks to:
- Read a whole PDF/book/ebook with high context fidelity
- Preserve exact quotes and source page references
- Capture diagrams, figures, tables, captions, or visual layouts
- Build a queryable book knowledge base
- Ingest a book into an LLM Wiki / Obsidian-style markdown vault
- Compare book claims against existing wiki knowledge

Do not use as the first choice for:
- Simple text-based PDFs where web_extract or PyMuPDF is enough
- One-off summaries where exact quotations and page provenance do not matter
- Low-quality scans that are not human-readable after rendering
- Mass ingestion without budget/runtime approval

Model / Provider Assumption

Preferred provider family: OpenAI / ChatGPT. For Alex's Human Design / MEGA corpus specifically, do not use Venice as the backend for extraction, transcription, synthesis, or alignment; use OpenAI/ChatGPT-family backends such as OpenAI Codex/ChatGPT, GPT-5.5, GPT Vision-capable chat models, and GPT Image 2 only where image generation/reconstruction is appropriate.

Preferred image/vision model:
- GPT-5.5 or another OpenAI vision-capable chat/reasoning model for page reading and structured extraction
- gpt-image-2 for image generation, image-aware transforms, or diagram reconstruction; do not assume it is the right API for plain structured text extraction
- Hermes openai-codex may be useful for agentic coding/orchestration, but extraction scripts that run on a bring-your-own OpenAI API key should call OpenAI endpoints directly and record the actual model used

If a tool/backend cannot pass page images directly to the preferred OpenAI model, stop and report the limitation rather than silently switching providers. Always report which model/backend was actually used.

Core Principle

Never treat generated summaries as the source of truth.

Source of truth hierarchy:
1. Original PDF/ebook file
2. Rendered page images
3. Raw page-level JSON extraction with quotes and bounding/context metadata
4. Chapter-level synthesis linked to raw pages
5. Wiki pages linked back to raw files and page images

Pipeline

1. Intake, classification, and capability check

For every document, classify before extracting:
- Identify file type: PDF, EPUB, MOBI/AZW/KFX, DOCX, etc.
- For PDFs, do not treat .pdf as one homogeneous category. Probe and route by observed structure:
  - text-rich book/article PDF: high coherent text chars/page, paragraphs, usable headings → text extraction plus vision spot checks.
  - scanned or OCR-poor book: low/garbled text layer, visible page images → render all pages + OCR/vision page extraction.
  - slide deck PDF: PowerPoint/Keynote/Slides metadata, sparse text layer, short bullets, diagrams/callouts → render all slides + slide-aware vision extraction.
  - diagram-heavy manual/workbook: forms, bodygraphs, tables, charts, exercises → render pages + figure/table/layout extraction.
  - mixed/unknown: run a 5-10 page golden sample and defer bulk extraction until a route is chosen.
- Record a machine-readable routing decision with detected_class, confidence, signals, chosen_pipeline, rejected_pipelines, and next_action so future agents know why a tool was chosen.
- Check whether embedded text is usable.
- Check whether the document has TOC/bookmarks.
- Count pages/chapters.
- Render 3-6 representative pages.
- Run a golden sample before bulk processing.
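The routing decision described above could be recorded along these lines. This is a minimal sketch, not the skill's actual implementation: the function name `route_pdf`, the character thresholds, and the pipeline labels are all assumptions to be tuned against a golden sample, and the diagram-heavy class is left out because it cannot be detected from text-layer counts alone.

```python
from statistics import median

# Hypothetical thresholds; tune them against a golden sample.
TEXT_RICH_MIN_CHARS = 1200
SPARSE_MAX_CHARS = 400

# Hypothetical pipeline labels mirroring the routes in the list above.
PIPELINES = {
    "text_rich": "text_extraction_plus_vision_spot_checks",
    "scanned_or_ocr_poor": "render_all_pages_vision",
    "slide_deck": "render_all_slides_vision",
    "mixed_unknown": "golden_sample_first",
}


def route_pdf(chars_per_page, has_slide_metadata=False):
    """Build a routing-decision record from per-page text-layer character counts."""
    if not chars_per_page:
        raise ValueError("probe at least one page before routing")
    typical = median(chars_per_page)
    if has_slide_metadata and typical <= SPARSE_MAX_CHARS:
        detected, confidence = "slide_deck", "high"
    elif typical >= TEXT_RICH_MIN_CHARS:
        detected, confidence = "text_rich", "medium"
    elif typical <= SPARSE_MAX_CHARS:
        detected, confidence = "scanned_or_ocr_poor", "medium"
    else:
        detected, confidence = "mixed_unknown", "low"
    chosen = PIPELINES[detected]
    return {
        "detected_class": detected,
        "confidence": confidence,
        "signals": {
            "median_chars_per_page": typical,
            "has_slide_metadata": has_slide_metadata,
        },
        "chosen_pipeline": chosen,
        "rejected_pipelines": [p for p in PIPELINES.values() if p != chosen],
        "next_action": "run golden sample before bulk extraction",
    }
```

The record can be written with `json.dump` next to the other book metadata so a later agent can see which signals drove the choice.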

For ebooks:
- Prefer structured extraction for EPUB HTML where available.
- Still render pages/screenshots when layout, images, or exact placement matters.
- For proprietary formats, convert to PDF/EPUB only if the user has rights and local tools support it.

2. Segment before reading

Create a deterministic section map:
- book metadata
- front matter
- chapters
- subsections if visible in TOC/bookmarks
- page ranges
- figure/table ranges

Use TOC/bookmarks when available. If absent, infer boundaries from rendered pages with a small vision pass, then ask for approval before full ingestion.
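As an illustration of the TOC-driven case, the sketch below turns bookmark entries into page ranges. It assumes entries shaped as `(level, title, start_page)` with 1-based pages, sorted by start page, which is the shape PyMuPDF's `Document.get_toc()` returns; the function name and the output keys are hypothetical.

```python
def build_section_map(toc, page_count):
    """Convert (level, title, start_page) TOC entries into sections with page ranges."""
    sections = []
    for position, (level, title, start_page) in enumerate(toc):
        # A section runs until the next entry at the same or a shallower level.
        end_page = page_count
        for next_level, _next_title, next_start in toc[position + 1:]:
            if next_level <= level:
                # Guard against two entries that start on the same page.
                end_page = max(start_page, next_start - 1)
                break
        sections.append(
            {
                "level": level,
                "title": title,
                "start_page": start_page,
                "end_page": end_page,
            }
        )
    return sections
```

Because the output depends only on the TOC and the page count, rerunning it on the same file yields the same map, which keeps the segmentation deterministic.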

3. Render source images

Render each page as a high-quality image:
- 200-300 DPI for normal text
- 300+ DPI for dense diagrams/tables
- Preserve page number in filename

Suggested output:

raw/books/<slug>/source.pdf
raw/books/<slug>/pages/page-0001.png
raw/books/<slug>/pages/page-0002.png
raw/books/<slug>/extraction/page-0001.json
raw/books/<slug>/chapters/chapter-01.md
raw/assets/books/<slug>/figures/figure-001.png
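One way to keep these filenames consistent is to derive every path from the slug and the page number. The helper names below are hypothetical. The `render_pages` function is a sketch that assumes PyMuPDF is installed (it is imported lazily so the path helpers work without it) and that `source.pdf` already sits in the book folder.

```python
from pathlib import Path

BOOKS_ROOT = Path("raw/books")


def page_image_path(slug, page_number):
    """Path of the rendered image for a 1-based page number."""
    return BOOKS_ROOT / slug / "pages" / f"page-{page_number:04d}.png"


def extraction_path(slug, page_number):
    """Path of the page-level extraction JSON for a 1-based page number."""
    return BOOKS_ROOT / slug / "extraction" / f"page-{page_number:04d}.json"


def render_pages(slug, dpi=300):
    """Render every page of source.pdf to PNG at the requested DPI."""
    import fitz  # PyMuPDF; assumed to be available in the active environment

    doc = fitz.open(str(BOOKS_ROOT / slug / "source.pdf"))
    try:
        for page_number, page in enumerate(doc, start=1):
            target = page_image_path(slug, page_number)
            target.parent.mkdir(parents=True, exist_ok=True)
            page.get_pixmap(dpi=dpi).save(str(target))
    finally:
        doc.close()
```

Zero-padding to four digits keeps directory listings in page order for books under 10,000 pages.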

4. Page-level vision extraction

For every page image, request structured JSON. Required fields:

{
  "page": 1,
  "visible_page_label": "xiii or 42 if present",
  "section_heading": "exact visible heading if any",
  "body_text": "faithful transcription; preserve quotes and unusual wording",
  "quotes": [
    {
      "text": "exact quote",
      "speaker_or_attribution": "if visible/inferable from page only",
      "page": 1,
      "confidence": "high|medium|low"
    }
  ],
  "figures": [
    {
      "id": "fig-p001-01",
      "type": "diagram|photo|chart|table|equation|callout|other",
      "caption": "exact caption if present",
      "description": "precise visual description",
      "placement": "top/middle/bottom/left/right; before/after which paragraph",
      "related_text_before": "short exact text before figure",
      "related_text_after": "short exact text after figure",
      "claims_or_labels": ["exact labels or claims in the figure"],
      "needs_crop": true
    }
  ],
  "tables": [
    {
      "caption": "exact caption if present",
      "columns": [],
      "rows": [],
      "notes": "layout caveats"
    }
  ],
  "uncertain_regions": ["anything unreadable or ambiguous"]
}

Rules:
- Do not paraphrase in body_text or quotes.
- Preserve exact spelling, punctuation, and line breaks where meaningful.
- Join only obvious visual line wraps.
- Mark uncertainty instead of guessing.
- Keep diagrams separate from body text but record where they fit.
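Model output does not always match the requested shape, so a structural check before saving each page record may be worthwhile. The following is a minimal sketch: the function name is hypothetical, and it checks only the top-level fields and quote confidence values from the schema above, not transcription fidelity.

```python
REQUIRED_FIELDS = {
    "page": int,
    "visible_page_label": str,
    "section_heading": str,
    "body_text": str,
    "quotes": list,
    "figures": list,
    "tables": list,
    "uncertain_regions": list,
}
CONFIDENCE_LEVELS = {"high", "medium", "low"}


def validate_page_record(record, expected_page):
    """Return a list of structural problems; an empty list means the record passes."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for {name}: expected {expected_type.__name__}")
    if record.get("page") != expected_page:
        problems.append(f"page mismatch: expected {expected_page}, got {record.get('page')}")
    quotes = record.get("quotes")
    for position, quote in enumerate(quotes if isinstance(quotes, list) else []):
        if not isinstance(quote, dict) or not quote.get("text"):
            problems.append(f"quote {position}: missing text")
        elif quote.get("confidence") not in CONFIDENCE_LEVELS:
            problems.append(f"quote {position}: confidence must be high, medium, or low")
    return problems
```

Records that fail can be re-requested or set aside for review rather than silently repaired.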

5. Cross-page reconciliation

After page extraction:
- Merge hyphenated text across page breaks only when obvious.
- Reconstruct chapters from page JSON.
- Deduplicate running headers/footers/page numbers.
- Verify chapter boundaries against TOC.
- Build a quote index with page references.
- Build a figure/table index with page references and cropped assets where helpful.
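Two of these steps, removing running headers/footers and joining a word hyphenated across a page break, can be sketched as plain heuristics. Both functions are hypothetical and deliberately conservative. The hyphen join can still merge a genuine compound such as "well-known", so merged spans are worth logging for review.

```python
from collections import Counter


def strip_running_lines(pages, min_share=0.6):
    """Drop first/last lines that repeat across pages, plus bare page numbers.

    pages is a list of pages, each a list of text lines.
    """
    counts = Counter()
    for lines in pages:
        edges = {line.strip() for line in lines[:1] + lines[-1:] if line.strip()}
        counts.update(edges)
    # Require at least two occurrences so a single page never strips itself.
    threshold = max(2, min_share * len(pages))
    running = {line for line, seen in counts.items() if seen >= threshold}

    cleaned = []
    for lines in pages:
        kept = []
        for position, line in enumerate(lines):
            is_edge = position in (0, len(lines) - 1)
            text = line.strip()
            if is_edge and (text in running or text.isdigit()):
                continue
            kept.append(line)
        cleaned.append(kept)
    return cleaned


def join_across_break(prev_text, next_text):
    """Join two page texts, merging a trailing hyphenated word only when obvious."""
    prev_text = prev_text.rstrip()
    next_text = next_text.lstrip()
    ends_with_split_word = (
        len(prev_text) >= 2 and prev_text[-1] == "-" and prev_text[-2].isalpha()
    )
    if ends_with_split_word and next_text[:1].islower():
        return prev_text[:-1] + next_text
    return prev_text + "\n" + next_text
```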

6. Chapter-level knowledge extraction

For each chapter, produce:
- faithful chapter markdown with page anchors
- exact quote list
- definitions
- named entities
- key claims
- arguments and evidence
- diagrams/figures and their role in the argument
- open questions / unclear passages

Every extracted claim should point to page-level provenance, e.g. source: raw/books/book/extraction/page-0042.json or p. 42.

7. LLM Wiki integration

Before writing to an existing wiki, orient first:
1. Read SCHEMA.md
2. Read index.md
3. Read recent log.md
4. Search for existing entities/concepts from the book

Then write:
- raw immutable source files under raw/books/<slug>/
- one book summary page under sources/ or concepts/ depending on schema
- concept/entity updates for central ideas
- comparison pages when the book conflicts with or complements existing wiki pages
- query pages only for substantial, reusable answers

Each wiki page must include:
- YAML frontmatter
- source list
- wikilinks to related pages
- page-level provenance markers for exact claims
- quote blocks with page numbers
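A page that satisfies these requirements might be assembled as in the sketch below. The frontmatter keys (`title`, `sources`), the layout, and the function name are assumptions for illustration; the wiki's own SCHEMA.md decides the real field names.

```python
import json


def render_wiki_page(title, sources, related, quotes):
    """Render a markdown page with frontmatter, wikilinks, and paged quote blocks.

    quotes is a list of dicts with "text" and "page" keys.
    """
    # json.dumps produces double-quoted strings, which are also valid YAML scalars.
    lines = ["---", f"title: {json.dumps(title, ensure_ascii=False)}", "sources:"]
    lines += [f"  - {json.dumps(source, ensure_ascii=False)}" for source in sources]
    lines += ["---", "", f"# {title}", ""]
    lines += ["Related: " + ", ".join(f"[[{name}]]" for name in related), ""]
    for quote in quotes:
        # Prefix every line of the quote so multi-line quotes stay in one block.
        lines += [f"> {text_line}" for text_line in quote["text"].splitlines()]
        lines += [f"> (p. {quote['page']})", ""]
    return "\n".join(lines)
```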

8. Query interface pattern

For later querying, use both:
- compiled wiki pages for fast synthesis
- raw page/chapter JSON for exact quote lookup

Answer format for queries:
- direct answer
- supporting exact quotes with page numbers
- diagrams/figures involved, with page and placement
- confidence / caveats
- links to wiki pages and raw source files
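Exact quote lookup against the raw page JSON can be as simple as a flat index plus a substring search. This sketch assumes page records shaped like the extraction schema above; the function names and the `source` field format are hypothetical.

```python
def build_quote_index(page_records):
    """Flatten page-level quotes into one list with page provenance."""
    index = []
    for record in page_records:
        for quote in record.get("quotes", []):
            index.append(
                {
                    "text": quote["text"],
                    "page": record["page"],
                    "confidence": quote.get("confidence", "low"),
                    "source": f"extraction/page-{record['page']:04d}.json",
                }
            )
    return index


def find_quotes(index, phrase):
    """Return index entries whose text contains the phrase, ignoring case."""
    needle = phrase.casefold()
    return [entry for entry in index if needle in entry["text"].casefold()]
```

Each hit carries the page number and the raw file it came from, which is what the answer format needs for its supporting quotes.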

Golden Sample Protocol

Before full-book extraction, run 3-6 representative pages/sections:
- one table-of-contents/front-matter page
- one normal text page
- one page with a figure/diagram
- one dense quote/footnote page
- one chapter boundary page
- one difficult scan/layout page if present

Review:
- exact quote fidelity
- diagram placement accuracy
- table structure
- section boundary correctness
- token/cost estimate
- runtime estimate

Only proceed to full extraction after the golden sample is acceptable.

raw/books/<book-slug>/
  source.pdf                 # local during active ingest; may become source.pdf.s3stub after archival
  metadata.json
  manifest.json              # required when large assets are S3-backed
  section-map.json
  pages/page-0001.png        # local during active ingest; may become page-0001.png.s3stub after archival
  extraction/page-0001.json
  chapters/chapter-01.md
  indexes/quotes.json
  indexes/figures.json
  indexes/entities.json
  reports/golden-sample.md
  reports/final-ingestion-report.md

For Alex's VPS wiki, use the hybrid raw-asset pattern after ingestion. Keep compiled markdown, chapter text, extraction JSON, indexes, metadata, and manifests local for speed. Upload large immutable PDFs, page images, figures, audio, and video to private Hetzner S3, and leave local .s3stub files plus a manifest.json that records s3_uri, s3_key, bytes, sha256, and content type for each asset. Fetch S3 assets into .cache/s3/ only when exact visual/PDF reinspection is needed.
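A manifest entry for one archived asset could be computed as in the following sketch. The function name, the `content_type` and `local_stub` key names, and the bucket and prefix arguments are assumptions; only the field set (s3_uri, s3_key, bytes, sha256, content type) comes from the description above. The upload itself is out of scope here.

```python
import hashlib
import mimetypes
from pathlib import Path


def manifest_entry(path, bucket, key_prefix):
    """Describe a local asset so it can be replaced by a .s3stub after upload."""
    path = Path(path)
    digest = hashlib.sha256()
    size = 0
    # Hash in chunks so large PDFs and page images do not need to fit in memory.
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
            size += len(chunk)
    key = f"{key_prefix.strip('/')}/{path.name}"
    content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    return {
        "local_stub": path.name + ".s3stub",
        "s3_uri": f"s3://{bucket}/{key}",
        "s3_key": key,
        "bytes": size,
        "sha256": digest.hexdigest(),
        "content_type": content_type,
    }
```

Recording the hash before upload lets a later fetch into .cache/s3/ be verified against the manifest.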

Reference Cases

Wiki outputs depend on schema, but commonly:

concepts/<central-concept>.md
entities/<author-or-organization>.md
comparisons/<book-vs-existing-topic>.md
queries/<book-question>.md
raw/books/<book-slug>/...

Common Pitfalls

  1. Using summaries as source text. Keep page-level JSON and source images so exact quotes are recoverable.
  2. Skipping sample validation. Full-book vision extraction can be expensive and compound errors.
  3. Losing diagram placement. Always capture what text appears before/after figures.
  4. Flattening tables into prose. Use explicit rows/columns when possible and flag ambiguity.
  5. Overwriting wiki claims without contradiction handling. Existing wiki pages may conflict with the book; preserve both with dates/sources.
  6. Ignoring copyright. Extract for user-provided documents and personal/research use; avoid distributing copyrighted full text unless the user has rights.
  7. No model disclosure. Always state the actual model/backend used for extraction.
  8. Treating slide decks like dense text books. PowerPoint/Keynote exports to PDF often have very sparse text layers (~200–400 chars/page) with heavy artifacting. A 94-page slide deck may contain less text than a 10-page Word document. Always probe the text layer first; if it's sparse or garbled, skip bulk text extraction and go straight to vision-first rendering of representative slides.
  9. Python environment assumptions on VPS. The system Python may be PEP 668 protected (treat --break-system-packages as off limits), and python -m venv may fail with "No space left on device" if the disk is more than 95% full. Check df -h early. Prefer reusing an existing venv (e.g., the Hermes agent venv at ~/.hermes/hermes-agent/venv/bin/python3). Fall back to pip install --user only where the interpreter is not marked externally managed, because PEP 668 systems typically refuse user installs as well.
  10. Silent provider substitution. If the user specified OpenAI/ChatGPT-family ingestion, do not silently use Venice, OpenRouter, or a model hosted through another provider even if the model name looks equivalent. The backend provider is part of provenance. Mark wrong-provider outputs superseded/audit-only and re-run with the requested provider.
  11. Slide/transcript alignment is a candidate stage, not provenance. For lecture+slide packages, after OpenAI slide vision extraction and timestamped ASR, build a provisional slide-to-transcript candidate map before synthesis. Use local lexical/IDF windows or another auditable method, mark candidates as heuristic, and require human/LLM spot-check before citing them in concept/query pages. See references/openai-hd-slide-transcript-alignment.md.
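For pitfall 11, a local lexical/IDF pass might look like the sketch below. It is not the method in the referenced document, which is not reproduced here. The function names, the input shapes, and the smoothed IDF formula are assumptions, and every result is stamped as a heuristic candidate so it cannot be mistaken for provenance.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def candidate_alignments(slides, windows, top_k=1):
    """Score transcript windows against slide text by summed IDF of shared tokens.

    slides: {slide_id: extracted slide text}
    windows: list of {"start": seconds, "end": seconds, "text": transcript text}
    """
    window_tokens = [set(tokenize(window["text"])) for window in windows]
    document_frequency = Counter(t for tokens in window_tokens for t in tokens)
    total = len(windows)
    # Smoothed IDF: rare tokens weigh more than tokens present in every window.
    idf = {
        token: math.log((total + 1) / (seen + 1)) + 1.0
        for token, seen in document_frequency.items()
    }
    candidates = []
    for slide_id, slide_text in slides.items():
        slide_tokens = set(tokenize(slide_text))
        scored = []
        for window, tokens in zip(windows, window_tokens):
            score = sum(idf[token] for token in slide_tokens & tokens)
            if score > 0:
                scored.append((score, window))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        for score, window in scored[:top_k]:
            candidates.append(
                {
                    "slide": slide_id,
                    "start": window["start"],
                    "end": window["end"],
                    "score": round(score, 3),
                    "status": "heuristic_candidate",
                }
            )
    return candidates
```

Slides with no lexical overlap produce no candidate at all, which leaves a visible gap for the spot-check instead of a forced guess.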

Verification Checklist