gpt-image-book-to-wiki


GPT Image Book to LLM Wiki

Overview

This is an experimental, high-fidelity ingestion workflow for turning a PDF or ebook into a queryable LLM Wiki knowledge base. It uses a vision-first pipeline: render pages to images, have a multimodal model read the pages, preserve exact quotes and diagram context, then compile both raw and synthesized knowledge into the wiki.

Use this when the goal is not just “extract text,” but to understand a whole book: chapter structure, arguments, definitions, diagrams, captions, examples, quotes, and how ideas connect.

When to Use

Use this skill when the user asks to:
- Read a whole PDF/book/ebook with high context fidelity
- Preserve exact quotes and source page references
- Capture diagrams, figures, tables, captions, or visual layouts
- Build a queryable book knowledge base
- Ingest a book into an LLM Wiki / Obsidian-style markdown vault
- Compare book claims against existing wiki knowledge
- Turn a pictorial manuscript/facsimile/codex/atlas/ritual-calendar source into an interactive viewer plus structured knowledge base

Do not use as the first choice for:
- Simple text-based PDFs where web_extract or PyMuPDF is enough
- One-off summaries where exact quotations and page provenance do not matter
- Low-quality scans that are not human-readable after rendering
- Mass ingestion without budget/runtime approval

Model / Provider Assumption

Preferred provider family: OpenAI / ChatGPT. For Alex's Human Design / MEGA corpus specifically, do not use Venice as the backend for extraction, transcription, synthesis, or alignment; use OpenAI/ChatGPT-family backends such as OpenAI Codex/ChatGPT, GPT-5.5, GPT Vision-capable chat models, and GPT Image 2 only where image generation/reconstruction is appropriate.

Preferred image/vision model:
- GPT-5.5 or another OpenAI vision-capable chat/reasoning model for page reading and structured extraction
- gpt-image-2 for image generation, image-aware transforms, or diagram reconstruction; do not assume it is the right API for plain structured text extraction
- Hermes openai-codex may be useful for agentic coding/orchestration, but for BYO OpenAI API key extraction scripts, call OpenAI endpoints directly and record the actual model used

If a tool/backend cannot pass page images directly to the preferred OpenAI model, stop and report the limitation rather than silently switching providers. Always report which model/backend was actually used.

Core Principle

Never treat generated summaries as the source of truth.

Source of truth hierarchy:
1. Original PDF/ebook file
2. Rendered page images
3. Raw page-level JSON extraction with quotes and bounding/context metadata
4. Chapter-level synthesis linked to raw pages
5. Wiki pages linked back to raw files and page images

Pipeline

1. Intake, classification, and capability check

For every document, classify before extracting:
- Identify file type: PDF, EPUB, MOBI/AZW/KFX, DOCX, etc.
- For PDFs, do not treat .pdf as one homogeneous category. Probe and route by observed structure:
  - text-rich book/article PDF: high coherent text chars/page, paragraphs, usable headings → text extraction plus vision spot checks.
  - scanned or OCR-poor book: low/garbled text layer, visible page images → render all pages + OCR/vision page extraction.
  - slide deck PDF: PowerPoint/Keynote/Slides metadata, sparse text layer, short bullets, diagrams/callouts → render all slides + slide-aware vision extraction.
  - diagram-heavy manual/workbook: forms, bodygraphs, tables, charts, exercises → render pages + figure/table/layout extraction.
  - mixed/unknown: run a 5-10 page golden sample and defer bulk extraction until a route is chosen.
- Record a machine-readable routing decision with detected_class, confidence, signals, chosen_pipeline, rejected_pipelines, and next_action so future agents know why a tool was chosen.
- Check whether embedded text is usable.
- Check whether the document has TOC/bookmarks.
- Count pages/chapters.
- Render 3-6 representative pages.
- Run a golden sample before bulk processing.
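The routing decision described above could be recorded along these lines. This is a minimal sketch, not the skill's actual router: the function name route_pdf, the pipeline labels, the thresholds (1200 and 400 chars/page), and the coarse confidence values are all assumptions for illustration. It only looks at text-layer size, so it cannot detect garbled text or diagram-heavy workbooks; those still need a visual check.

```python
from statistics import median

# Hypothetical pipeline labels; rename to match whatever the ingest scripts use.
PIPELINES = {
    "text_rich": "text_extraction_plus_vision_spot_checks",
    "scanned_or_ocr_poor": "render_all_pages_plus_vision",
    "slide_deck": "render_all_slides_plus_slide_vision",
    "mixed_unknown": "golden_sample_then_decide",
}


def route_pdf(chars_per_page, has_slide_metadata=False, rich_min=1200, sparse_max=400):
    """Build a routing decision from per-page text-layer character counts.

    The thresholds are illustrative assumptions, not calibrated values.
    """
    if not chars_per_page:
        raise ValueError("probe at least one page before routing")
    typical = median(chars_per_page)
    if has_slide_metadata and typical <= sparse_max:
        detected, confidence = "slide_deck", "medium"
    elif typical >= rich_min:
        detected, confidence = "text_rich", "medium"
    elif typical <= sparse_max:
        detected, confidence = "scanned_or_ocr_poor", "low"
    else:
        detected, confidence = "mixed_unknown", "low"
    chosen = PIPELINES[detected]
    return {
        "detected_class": detected,
        "confidence": confidence,
        "signals": {
            "median_chars_per_page": typical,
            "pages_probed": len(chars_per_page),
            "has_slide_metadata": has_slide_metadata,
        },
        "chosen_pipeline": chosen,
        "rejected_pipelines": [p for p in PIPELINES.values() if p != chosen],
        "next_action": "run_golden_sample_before_bulk",
    }
```

The per-page counts would typically come from a text-layer probe (for example, the length of each page's extracted text); the returned dict can be written as JSON next to the book's metadata.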

For ebooks:
- Prefer structured extraction for EPUB HTML where available.
- Still render pages/screenshots when layout, images, or exact placement matters.
- For proprietary formats, convert to PDF/EPUB only if the user has rights and local tools support it.

2. Segment before reading

Create a deterministic section map:
- book metadata
- front matter
- chapters
- subsections if visible in TOC/bookmarks
- page ranges
- figure/table ranges

Use TOC/bookmarks when available. If absent, infer boundaries from rendered pages with a small vision pass, then ask for approval before full ingestion.
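When a TOC is available, page ranges can be derived deterministically from it. The sketch below assumes TOC entries shaped like (level, title, start_page) with 1-based pages, which is the shape PyMuPDF's Document.get_toc() returns; the function name build_section_map and the output keys are hypothetical.

```python
def build_section_map(toc, page_count):
    """Turn (level, title, start_page) TOC entries into sections with page ranges.

    A section ends on the page before the next entry at the same or a
    shallower level, or on the last page of the book.
    """
    sections = []
    for i, (level, title, start) in enumerate(toc):
        end = page_count
        for next_level, _, next_start in toc[i + 1:]:
            if next_level <= level:
                # Two entries can start on the same page; never end before start.
                end = max(start, next_start - 1)
                break
        sections.append(
            {"level": level, "title": title, "start_page": start, "end_page": end}
        )
    return sections
```

Because the output depends only on the TOC and the page count, re-running it yields the same section-map.json, which keeps later chapter reconstruction reproducible.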

3. Render source images

Render each page as a high-quality image:
- 200-300 DPI for normal text
- 300+ DPI for dense diagrams/tables
- Preserve page number in filename

Suggested output:

raw/books/<slug>/source.pdf
raw/books/<slug>/pages/page-0001.png
raw/books/<slug>/pages/page-0002.png
raw/books/<slug>/extraction/page-0001.json
raw/books/<slug>/chapters/chapter-01.md
raw/assets/books/<slug>/figures/figure-001.png
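A rendering step that produces this layout might look like the following sketch. It assumes PyMuPDF is installed; the helper names page_image_path and render_pages are hypothetical, and the 200 DPI default simply follows the guidance above.

```python
from pathlib import Path


def page_image_path(slug, page_number, root="raw/books"):
    """Return the zero-padded page image path, e.g. pages/page-0001.png."""
    return Path(root) / slug / "pages" / f"page-{page_number:04d}.png"


def render_pages(pdf_path, slug, dpi=200, root="raw/books"):
    """Render every page of a PDF to PNG and return the written paths."""
    import fitz  # PyMuPDF, imported lazily so the path helper works without it

    written = []
    with fitz.open(pdf_path) as doc:
        for number, page in enumerate(doc, start=1):
            out = page_image_path(slug, number, root)
            out.parent.mkdir(parents=True, exist_ok=True)
            page.get_pixmap(dpi=dpi).save(str(out))
            written.append(out)
    return written
```

Raise dpi to 300 or more for pages with dense diagrams or tables; rendering those pages separately avoids inflating the whole book.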

4. Page-level vision extraction

For every page image, request structured JSON. Required fields:

{
  "page": 1,
  "visible_page_label": "xiii or 42 if present",
  "section_heading": "exact visible heading if any",
  "body_text": "faithful transcription; preserve quotes and unusual wording",
  "quotes": [
    {
      "text": "exact quote",
      "speaker_or_attribution": "if visible/inferable from page only",
      "page": 1,
      "confidence": "high|medium|low"
    }
  ],
  "figures": [
    {
      "id": "fig-p001-01",
      "type": "diagram|photo|chart|table|equation|callout|other",
      "caption": "exact caption if present",
      "description": "precise visual description",
      "placement": "top/middle/bottom/left/right; before/after which paragraph",
      "related_text_before": "short exact text before figure",
      "related_text_after": "short exact text after figure",
      "claims_or_labels": ["exact labels or claims in the figure"],
      "needs_crop": true
    }
  ],
  "tables": [
    {
      "caption": "exact caption if present",
      "columns": [],
      "rows": [],
      "notes": "layout caveats"
    }
  ],
  "uncertain_regions": ["anything unreadable or ambiguous"]
}

Rules:
- Do not paraphrase in body_text or quotes.
- Preserve exact spelling, punctuation, and line breaks where meaningful.
- Join only obvious visual line wraps.
- Mark uncertainty instead of guessing.
- Keep diagrams separate from body text but record where they fit.
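Model output does not always match the requested schema, so a structural check before saving each page file can catch problems early. This is a minimal sketch: validate_page is a hypothetical helper, and it checks only the top-level fields and quote confidence values listed above, not transcription fidelity.

```python
REQUIRED_FIELDS = {
    "page": int,
    "visible_page_label": str,
    "section_heading": str,
    "body_text": str,
    "quotes": list,
    "figures": list,
    "tables": list,
    "uncertain_regions": list,
}
CONFIDENCE_LEVELS = {"high", "medium", "low"}


def validate_page(record, expected_page):
    """Return a list of problems found in one page-level extraction record."""
    problems = []
    for name, kind in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], kind):
            problems.append(f"wrong type for {name}: expected {kind.__name__}")
    if record.get("page") != expected_page:
        problems.append(
            f"page mismatch: expected {expected_page}, got {record.get('page')}"
        )
    quotes = record.get("quotes")
    for index, quote in enumerate(quotes if isinstance(quotes, list) else []):
        if not isinstance(quote, dict) or not quote.get("text"):
            problems.append(f"quote {index}: missing text")
            continue
        if quote.get("confidence") not in CONFIDENCE_LEVELS:
            problems.append(f"quote {index}: confidence must be high, medium, or low")
    return problems
```

An empty list means the record is structurally usable; any reported problem is a reason to re-request the page rather than patch the JSON by hand.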

4A. Pictorial manuscript / facsimile viewer variant

For codices, illuminated manuscripts, atlases, ritual calendars, and other image-first sources, add an asset-and-KB layer alongside page extraction:
- Treat complete plate images as the source of truth; preserve originals and generate optimized WebP/thumb/upscaled derivatives.
- Create deterministic crops for panels, figures, glyphs, marginalia, and diagrams; label automated crops as provisional until visually reviewed.
- Use conservative background removal for manuscript cutouts so pigment and linework are not destroyed.
- Keep a manifest with dimensions, generated paths, source folio, crop role, and S3/CDN object keys if uploaded.
- Store plate/folio interpretations, symbol/deity identifications, cycle mechanics, and caveats in a separate machine-readable KB rather than burying them in UI code.
- For calendar manuscripts, encode correlation anchors explicitly and expose uncertainty in both KB and UI.

See references/pictorial-manuscript-calendar-lab.md for the Codex Borbonicus-derived pattern.

5. Cross-page reconciliation

After page extraction:
- Merge hyphenated text across page breaks only when obvious.
- Reconstruct chapters from page JSON.
- Deduplicate running headers/footers/page numbers.
- Verify chapter boundaries against TOC.
- Build a quote index with page references.
- Build a figure/table index with page references and cropped assets where helpful.
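Two of these reconciliation steps can be sketched as small helpers. Both are illustrative heuristics with hypothetical names: strip_running_lines drops a first or last line only when it repeats on most pages (per-page numbers differ and need a separate pattern), and merge_page_break treats "lowercase letter, hyphen, page break, lowercase letter" as the obvious case, which can still misjoin genuinely hyphenated compounds.

```python
import re
from collections import Counter


def strip_running_lines(pages, min_share=0.6):
    """Remove first/last lines that repeat on at least min_share of pages.

    pages is a list of pages, each a list of text lines.
    """
    def repeated(position):
        counts = Counter(p[position].strip() for p in pages if p)
        return {
            line for line, n in counts.items()
            if line and n / len(pages) >= min_share
        }

    heads, tails = repeated(0), repeated(-1)
    cleaned = []
    for lines in pages:
        lines = list(lines)
        if lines and lines[0].strip() in heads:
            lines = lines[1:]
        if lines and lines[-1].strip() in tails:
            lines = lines[:-1]
        cleaned.append(lines)
    return cleaned


def merge_page_break(prev_text, next_text):
    """Join two page texts, merging a word split by a hyphen at the break."""
    prev_text = prev_text.rstrip()
    next_text = next_text.lstrip()
    if re.search(r"[a-z]-$", prev_text) and re.match(r"[a-z]", next_text):
        return prev_text[:-1] + next_text
    return prev_text + "\n" + next_text
```

Keeping the untouched page JSON alongside the reconciled chapter text means any wrong merge can be traced back and undone.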

6. Chapter-level knowledge extraction

For each chapter, produce:
- faithful chapter markdown with page anchors
- exact quote list
- definitions
- named entities
- key claims
- arguments and evidence
- diagrams/figures and their role in the argument
- open questions / unclear passages

Every extracted claim should point to page-level provenance, e.g. source: raw/books/book/extraction/page-0042.json or p. 42.

7. LLM Wiki integration

Before writing to an existing wiki, orient first:
1. Read SCHEMA.md
2. Read index.md
3. Read recent log.md
4. Search for existing entities/concepts from the book

Then write:
- raw immutable source files under raw/books/<slug>/
- one book summary page under sources/ or concepts/ depending on schema
- concept/entity updates for central ideas
- comparison pages when the book conflicts with or complements existing wiki pages
- query pages only for substantial, reusable answers

Each wiki page must include:
- YAML frontmatter
- source list
- wikilinks to related pages
- page-level provenance markers for exact claims
- quote blocks with page numbers

8. Query interface pattern

For later querying, use both:
- compiled wiki pages for fast synthesis
- raw page/chapter JSON for exact quote lookup

Answer format for queries:
- direct answer
- supporting exact quotes with page numbers
- diagrams/figures involved, with page and placement
- confidence / caveats
- links to wiki pages and raw source files
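Exact quote lookup against the raw page JSON can be a plain file scan. The sketch below assumes the extraction/page-NNNN.json layout and the quotes field described earlier; find_quotes and the shape of its result are hypothetical.

```python
import json
from pathlib import Path


def find_quotes(extraction_dir, phrase):
    """Return quotes containing phrase, ignoring case and whitespace runs."""
    needle = " ".join(phrase.lower().split())
    hits = []
    for path in sorted(Path(extraction_dir).glob("page-*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        for quote in record.get("quotes", []):
            text = quote.get("text", "")
            if needle in " ".join(text.lower().split()):
                hits.append({
                    "page": record.get("page"),
                    "text": text,  # returned verbatim, never re-worded
                    "confidence": quote.get("confidence"),
                    "source": str(path),
                })
    return hits
```

Each hit carries the page number and the source file path, which covers the "exact quotes with page numbers" and "links to raw source files" parts of the answer format.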

Golden Sample Protocol

Before full-book extraction, run 3-6 representative pages/sections:
- one table-of-contents/front-matter page
- one normal text page
- one page with a figure/diagram
- one dense quote/footnote page
- one chapter boundary page
- one difficult scan/layout page if present

Review:
- exact quote fidelity
- diagram placement accuracy
- table structure
- section boundary correctness
- token/cost estimate
- runtime estimate

Only proceed to full extraction after the golden sample is acceptable.

Recommended Outputs

raw/books/<book-slug>/
  source.pdf                 # local during active ingest; may become source.pdf.s3stub after archival
  metadata.json
  manifest.json              # required when large assets are S3-backed
  section-map.json
  pages/page-0001.png        # local during active ingest; may become page-0001.png.s3stub after archival
  extraction/page-0001.json
  chapters/chapter-01.md
  indexes/quotes.json
  indexes/figures.json
  indexes/entities.json
  reports/golden-sample.md
  reports/final-ingestion-report.md

For Alex's VPS wiki, use the hybrid raw-asset pattern after ingestion: keep compiled markdown, chapter text, extraction JSON, indexes, metadata, and manifests local for speed; upload large immutable PDFs/page images/figures/audio/video to private Hetzner S3 and leave local .s3stub files plus a manifest.json with s3_uri, s3_key, bytes, sha256, and content type. Fetch S3 assets into .cache/s3/ only when exact visual/PDF reinspection is needed.
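The manifest fields named above can be computed locally before any upload. This is a sketch under assumptions: manifest_entry and write_stub are hypothetical helpers, the bucket and key prefix are placeholders, the stub is assumed to hold the same JSON as the manifest entry, and the upload itself is left out.

```python
import hashlib
import json
import mimetypes
from pathlib import Path


def manifest_entry(local_path, bucket, key_prefix):
    """Describe one large asset with the fields the manifest expects."""
    path = Path(local_path)
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Hash in 1 MiB chunks so large PDFs do not need to fit in memory.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    s3_key = f"{key_prefix.strip('/')}/{path.name}"
    content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    return {
        "s3_uri": f"s3://{bucket}/{s3_key}",
        "s3_key": s3_key,
        "bytes": path.stat().st_size,
        "sha256": digest.hexdigest(),
        "content_type": content_type,
    }


def write_stub(local_path, entry):
    """Write <name>.s3stub next to the asset; the stub format is an assumption."""
    stub = Path(str(local_path) + ".s3stub")
    stub.write_text(json.dumps(entry, indent=2) + "\n", encoding="utf-8")
    return stub
```

Only remove the local asset after the upload has been verified against the recorded sha256 and byte count; the stub then records where the original lives.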

Reference Cases

references/openai-hd-mega-provider-correction.md documents the 2026-05-11 provider correction for Alex's Human Design / MEGA wiki: Venice/Qwen outputs are superseded; use OpenAI/ChatGPT-family models, with verified quirks for GPT-5.5 vision and OpenAI audio timestamps.
references/openai-hd-lyd-ingestion-runtime-notes.md documents the RA LYD 7h OpenAI ingestion runtime pattern: verifying 94 GPT-5.5 slide JSON outputs, Whisper-1 timestamped audio outputs, buffered/no-log background process behavior, tight-disk cache handling, and post-run timestamp normalization/QA caveats.
references/openai-hd-slide-transcript-alignment.md documents the OpenAI-only heuristic slide/transcript candidate alignment pattern for Human Design lecture+slide packages, including IDF lexical windowing, provisional-output warnings, and wiki QA/log updates.
references/36-faces-mega-golden-sample.md documents a tested MEGA/PyMuPDF/vision workflow for Austin Coppock's 36 Faces, including megadl download, no-bookmark PDF probing, 200 DPI page rendering, hybrid OCR+vision strategy, and a concrete Mars in Aries I exact-quote lookup.
references/36-faces-astrology-wiki-ingest.md documents the follow-up full astrology-wiki ingest shape: raw source package, decan/chapter JSON, rendered opening-page images, placement indexes, concept/query pages, verification counts, and pitfalls like creating zodiac sign pages to avoid broken wikilinks.
references/36-faces-vs-inner-sky-forensic-audit.md documents the 2026-07-10 forensic comparison between the wiki-first 36 Faces production pipeline and Foundry's deterministic The Inner Sky semantic compiler. Use it when auditing provenance, identifying missing wiki-note stages, or explaining why a semantic graph is not the same artifact as a compiled book wiki.
references/planets-in-transit-astrology-wiki-ingest.md documents a stripped-PDF astrology wiki ingest of Robert Hand's Planets in Transit: PyMuPDF text-layer extraction, contents-derived section mapping, S3 stubbing for source PDF/page renders, and /tmp/hermes-verify-* ad-hoc verification when no canonical wiki test exists.
references/decan-image-source-atlas.md documents a reusable decan-image source workflow and known pictorial witness: Astrolabium Planum (1494) via Internet Archive/NLM, including verified zodiacal facies page numbers, Viewer/S3 layer metadata, and IA/PyMuPDF/OCR pitfalls.
references/pictorial-manuscript-calendar-lab.md documents the Codex Borbonicus-derived pattern for turning a complete pictorial manuscript/facsimile into a mobile viewer with originals, optimized/upscaled plates, crops/cutouts, S3 manifests, calendar/cycle mechanics, explicit correlation anchors, and caveated plate-by-plate KB data.

Wiki outputs depend on schema, but commonly:

concepts/<central-concept>.md
entities/<author-or-organization>.md
comparisons/<book-vs-existing-topic>.md
queries/<book-question>.md
raw/books/<book-slug>/...

Common Pitfalls

Using summaries as source text. Keep page-level JSON and source images so exact quotes are recoverable.
Skipping sample validation. Full-book vision extraction can be expensive and compound errors.
Losing diagram placement. Always capture what text appears before/after figures.
Flattening tables into prose. Use explicit rows/columns when possible and flag ambiguity.
Overwriting wiki claims without contradiction handling. Existing wiki pages may conflict with the book; preserve both with dates/sources.
Ignoring copyright. Extract only from user-provided documents, for personal/research use; avoid distributing copyrighted full text unless the user has rights.
No model disclosure. Always state the actual model/backend used for extraction.
Treating slide decks like dense text books. PowerPoint/Keynote exports to PDF often have very sparse text layers (~200–400 chars/page) with heavy artifacting. A 94-page slide deck may contain less text than a 10-page Word document. Always probe the text layer first; if it's sparse or garbled, skip bulk text extraction and go straight to vision-first rendering of representative slides.
Python environment assumptions on VPS. The system Python may be PEP 668 protected (do not use --break-system-packages), and python -m venv may fail with "No space left on device" if the disk is >95% full. Check df -h early. Prefer reusing an existing venv (e.g., the Hermes agent venv at ~/.hermes/hermes-agent/venv/bin/python3). pip install --user is usually refused on PEP 668 systems as well, so treat it as a fallback only where the interpreter is not externally managed.
Silent provider substitution. If the user specified OpenAI/ChatGPT-family ingestion, do not silently use Venice, OpenRouter, or a model hosted through another provider even if the model name looks equivalent. The backend provider is part of provenance. Mark wrong-provider outputs superseded/audit-only and re-run with the requested provider.
Slide/transcript alignment is a candidate stage, not provenance. For lecture+slide packages, after OpenAI slide vision extraction and timestamped ASR, build a provisional slide-to-transcript candidate map before synthesis. Use local lexical/IDF windows or another auditable method, mark candidates as heuristic, and require human/LLM spot-check before citing them in concept/query pages. See references/openai-hd-slide-transcript-alignment.md.
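A generic IDF-weighted lexical window match can be sketched as follows. This is an illustration of the idea only, not the method documented in the referenced file: align_slides, the tokenizer, the smoothed IDF formula, and the input shapes (slide text by id, transcript segments with start and text) are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def align_slides(slides, segments, window=3):
    """Propose one transcript window per slide by IDF-weighted token overlap.

    slides: {slide_id: text}; segments: [{"start": seconds, "text": str}, ...].
    Results are candidates only and are marked as heuristic.
    """
    if not segments:
        raise ValueError("need at least one transcript segment")
    count = max(1, len(segments) - window + 1)
    windows = [segments[i:i + window] for i in range(count)]
    window_tokens = [
        set(tokenize(" ".join(seg["text"] for seg in w))) for w in windows
    ]
    doc_freq = Counter(tok for toks in window_tokens for tok in toks)
    # Smoothed IDF: tokens that appear in fewer windows weigh more.
    idf = {t: math.log((1 + count) / (1 + df)) + 1 for t, df in doc_freq.items()}
    candidates = []
    for slide_id, text in slides.items():
        slide_tokens = set(tokenize(text))
        scores = [sum(idf[t] for t in slide_tokens & toks) for toks in window_tokens]
        best = max(range(count), key=scores.__getitem__)
        candidates.append({
            "slide": slide_id,
            "start": windows[best][0]["start"],
            "score": round(scores[best], 3),
            "status": "heuristic-candidate",
        })
    return candidates
```

A score of zero means no shared vocabulary, so such candidates should be dropped or flagged rather than cited; every remaining candidate still needs the spot-check described above.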

Verification Checklist

[ ] Document type and page/chapter count identified
[ ] Text layer tested before vision-first pipeline chosen
[ ] TOC/bookmarks inspected or section inference documented
[ ] Representative pages rendered and sampled
[ ] Golden sample reviewed before full run
[ ] Page-level JSON includes exact quotes, figures, tables, uncertainty, and page references
[ ] Rendered page images retained as source-of-truth assets
[ ] Chapter markdown and indexes generated
[ ] Wiki orientation completed before writes
[ ] Wiki pages have provenance, wikilinks, frontmatter, index updates, and log entry
[ ] Final report states model used, cost/runtime estimate, files written, caveats, and next query examples