ocr-and-documents

PDF & Document Extraction

For DOCX: use python-docx (parses actual document structure, far better than OCR). For PPTX: see the powerpoint skill (uses python-pptx with full slide/notes support). This skill covers PDFs and scanned documents.

Step 0: Untrusted or Bulk Document Source?

If the user wants to download or inspect PDFs, EPUBs, DOCX, or other ebook files from an untrusted site or source and is worried about malware, do not open or OCR them directly. Build and run a static quarantine triage pipeline before extraction:

separate discovery → authorized download → offline scan;
use polite rate limiting and no block/rate-limit bypass;
quarantine downloads by SHA-256 with restrictive permissions;
scan in a networkless, low-privilege container/VM where possible;
inspect PDFs for JavaScript/OpenAction/Launch/EmbeddedFile/XFA/object-stream/encryption markers;
inspect EPUB/DOCX/CBZ/ZIP-like files for scripts, suspicious embedded filenames, and zip-bomb-like ratios;
produce a final catalog joining source URL, resolved file URL, format, hash, scan verdict, and quarantine path.
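
The PDF marker and ZIP-ratio checks in the list above can be sketched with the standard library alone. This is a minimal illustration, not a malware scanner: the marker list, the function names, and whatever threshold you apply to the ratio are assumptions. A raw byte search can miss markers hidden inside compressed object streams and can flag harmless stream payloads.

```python
import zipfile

# Marker names follow the checklist above; the exact set is an assumption.
PDF_MARKERS = [b"/JavaScript", b"/JS", b"/OpenAction", b"/Launch",
               b"/EmbeddedFile", b"/XFA", b"/ObjStm", b"/Encrypt"]


def find_pdf_markers(data: bytes) -> list[str]:
    """Return the risky PDF name tokens that occur anywhere in the raw bytes."""
    return [m.decode("ascii") for m in PDF_MARKERS if m in data]


def zip_expansion_ratio(path) -> float:
    """Return uncompressed/compressed size for a ZIP-like file (EPUB, DOCX, CBZ)."""
    with zipfile.ZipFile(path) as zf:
        infos = zf.infolist()
        compressed = sum(i.compress_size for i in infos)
        uncompressed = sum(i.file_size for i in infos)
    # An empty or zero-size archive has no meaningful ratio.
    return uncompressed / compressed if compressed else 0.0
```

A very high ratio (the cutoff is a policy choice, not something this sketch decides) is a reason to skip extraction and flag the file for review.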

See references/quarantined-ebook-document-triage.md for the class pattern, pitfalls, and ad-hoc verification fixture.

Step 1: Remote URL Available?

If the document has a URL and it is not an untrusted/bulk malware-triage case, try web_extract first:

web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
web_extract(urls=["https://example.com/report.pdf"])

This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.

Only use local extraction when: the file is local, web_extract fails, or you need batch processing.

Step 2: Choose Local Extractor

Feature	pymupdf (~25MB)	marker-pdf (~3-5GB)
Text-based PDF	✅	✅
Scanned PDF (OCR)	❌	✅ (90+ languages)
Tables	✅ (basic)	✅ (high accuracy)
Equations / LaTeX	❌	✅
Code blocks	❌	✅
Forms	❌	✅
Headers/footers removal	❌	✅
Reading order detection	❌	✅
Images extraction	✅ (embedded)	✅ (with context)
Images → text (OCR)	❌	✅
EPUB	✅	✅
Markdown output	✅ (via pymupdf4llm)	✅ (native, higher quality)
Install size	~25MB	~3-5GB (PyTorch + models)
Speed	Instant	~1-14s/page (CPU), ~0.2s/page (GPU)

Decision: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.
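
The decision rule can be expressed as a small function. This is a sketch under assumptions: `choose_extractor`, `count_chars_per_page`, and the `min_chars=50` cutoff for treating a page as scanned are hypothetical and are not part of the helper scripts.

```python
def choose_extractor(chars_per_page, needs_equations=False, needs_forms=False,
                     needs_layout=False, min_chars=50):
    """Return 'pymupdf' or 'marker-pdf' following the decision rule above.

    chars_per_page: extracted character counts, one per page.
    min_chars: assumed cutoff below which a page is treated as scanned.
    """
    pages = list(chars_per_page)
    scanned = sum(1 for n in pages if n < min_chars)
    mostly_scanned = bool(pages) and scanned > len(pages) / 2
    if mostly_scanned or needs_equations or needs_forms or needs_layout:
        return "marker-pdf"
    return "pymupdf"


def count_chars_per_page(pdf_path):
    """Count extracted characters per page with PyMuPDF (imported lazily)."""
    import pymupdf
    with pymupdf.open(pdf_path) as doc:
        return [len(page.get_text().strip()) for page in doc]
```

For example, `choose_extractor(count_chars_per_page("document.pdf"))` picks marker-pdf only when most pages yield almost no text.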

If the user needs marker capabilities but the system lacks ~5GB free disk:

"This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf which works for text-based PDFs but not scanned documents or equations."

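A free-space check before offering marker-pdf might look like the sketch below. It uses only `shutil.disk_usage`. The function names are hypothetical, and this is not necessarily how `scripts/extract_marker.py --check` is implemented.

```python
import shutil

REQUIRED_GB = 5  # figure quoted above for PyTorch plus models


def free_gb(path="."):
    """Free disk space, in GB, on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 1e9


def marker_space_message(free, required=REQUIRED_GB):
    """Return None if there is enough space, else the message for the user."""
    if free >= required:
        return None
    return (
        "This document needs OCR/advanced extraction (marker-pdf), which "
        f"requires ~{required}GB for PyTorch and models. Your system has "
        f"{free:.1f}GB free. Options: free up space, provide a URL so I can "
        "use web_extract, or I can try pymupdf which works for text-based "
        "PDFs but not scanned documents or equations."
    )
```

Calling `marker_space_message(free_gb())` returns `None` when installation can proceed.
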
Special Case: PDF has a usable TOC/bookmarks but broken embedded text

Some PDFs are not truly scanned, yet direct text extraction is still unusable because the embedded font-to-character mapping is broken. Symptom: pymupdf returns a readable structure from doc.get_toc(), but page text comes out as encoded junk. In that case, do not treat the PDF as a normal text PDF.

Use this pipeline instead:

Use doc.get_toc() first to discover real section boundaries.
Segment the document by bookmark/page ranges instead of fuzzy heading detection.
Render each page in the section to PNG at ~300 DPI with PyMuPDF.
OCR the rendered PNGs with tesseract.
Save both the raw OCR text and the rendered source page images.
Apply only deterministic cleanup: remove page-number lines; remove repeated running headers/footers; normalize quotes/apostrophes; join line-end hyphenation; preserve source wording (no paraphrase, no LLM rewriting).
If the goal is “full fidelity”, keep the source page images and expose them alongside the extracted text so users can visually verify the original page.
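
The deterministic cleanup step in the pipeline above can be sketched with regular expressions only. The function name and the `header_min_repeats=3` threshold are assumptions. The rules are deliberately simple, so a standalone year such as "1999" on its own line would be dropped as a page number, and a compound word genuinely hyphenated at a line end would be joined.

```python
import re
from collections import Counter

# Curly quotes and apostrophes mapped to their ASCII equivalents.
QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def clean_pages(pages, header_min_repeats=3):
    """Apply deterministic cleanup to a list of per-page OCR strings."""
    # Lines that appear on several pages are treated as running headers/footers.
    counts = Counter()
    for text in pages:
        counts.update({ln.strip() for ln in text.splitlines() if ln.strip()})
    repeated = {ln for ln, n in counts.items() if n >= header_min_repeats}

    cleaned = []
    for text in pages:
        kept = []
        for ln in text.splitlines():
            s = ln.strip()
            if re.fullmatch(r"(page\s+)?\d{1,4}", s, flags=re.IGNORECASE):
                continue  # page-number line
            if s in repeated:
                continue  # running header/footer
            kept.append(ln.rstrip())
        body = "\n".join(kept).translate(QUOTES)
        # Join words split across a line end: "exam-\nple" becomes "example".
        body = re.sub(r"(\w)-\n(\w)", r"\1\2", body)
        cleaned.append(body)
    return cleaned
```

No step rewrites wording, so the output can always be diffed against the raw OCR text.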

This pattern is especially strong when the PDF has high-quality bookmarks for chapters/entries (for example, one bookmark per section or per record), because the bookmarks become the canonical segmentation layer while OCR becomes only the text-recovery layer.
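
A sketch of bookmark-driven segmentation followed by render-and-OCR is shown below. It assumes the `[level, title, page]` entries with 1-based page numbers that `doc.get_toc()` returns, and it assumes the `tesseract` CLI is on PATH. Both function names are hypothetical.

```python
def toc_to_ranges(toc, page_count, level=1):
    """Turn get_toc()-style entries into (title, first_page, last_page) tuples."""
    entries = [(title, page) for lvl, title, page, *_ in toc if lvl == level]
    ranges = []
    for idx, (title, start) in enumerate(entries):
        if idx + 1 < len(entries):
            # A section ends on the page before the next bookmark starts.
            end = max(start, entries[idx + 1][1] - 1)
        else:
            end = page_count
        ranges.append((title, start, end))
    return ranges


def render_and_ocr(pdf_path, first_page, last_page, out_dir, dpi=300):
    """Render pages to PNG with PyMuPDF, then OCR each PNG with tesseract."""
    import subprocess
    from pathlib import Path
    import pymupdf  # imported lazily so toc_to_ranges works without it

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    texts = []
    with pymupdf.open(pdf_path) as doc:
        for page_no in range(first_page, last_page + 1):
            png = out / f"page_{page_no:04d}.png"
            doc[page_no - 1].get_pixmap(dpi=dpi).save(str(png))
            # "tesseract <image> stdout" prints the recognized text to stdout.
            result = subprocess.run(["tesseract", str(png), "stdout"],
                                    capture_output=True, text=True, check=True)
            texts.append(result.stdout)
    return texts
```

The rendered PNGs stay in `out_dir`, so they can be shown next to the recovered text for visual verification.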

pymupdf (lightweight)

pip install pymupdf pymupdf4llm

Via helper script:

python scripts/extract_pymupdf.py document.pdf              # Plain text
python scripts/extract_pymupdf.py document.pdf --markdown    # Markdown
python scripts/extract_pymupdf.py document.pdf --tables      # Tables
python scripts/extract_pymupdf.py document.pdf --images out/ # Extract images
python scripts/extract_pymupdf.py document.pdf --metadata    # Title, author, pages
python scripts/extract_pymupdf.py document.pdf --pages 0-4   # Specific pages

Inline:

python3 -c "
import pymupdf
doc = pymupdf.open('document.pdf')
for page in doc:
    print(page.get_text())
"

marker-pdf (high-quality OCR)

# Check disk space first
python scripts/extract_marker.py --check

pip install marker-pdf

Via helper script:

python scripts/extract_marker.py document.pdf                # Markdown
python scripts/extract_marker.py document.pdf --json         # JSON with metadata
python scripts/extract_marker.py document.pdf --output_dir out/  # Save images
python scripts/extract_marker.py scanned.pdf                 # Scanned PDF (OCR)
python scripts/extract_marker.py document.pdf --use_llm      # LLM-boosted accuracy

CLI (installed with marker-pdf):

marker_single document.pdf --output_dir ./output
marker /path/to/folder --workers 4    # Batch

Arxiv Papers

# Abstract only (fast)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])

# Full paper
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])

# Search
web_search(query="arxiv GRPO reinforcement learning 2026")

Split, Merge & Search

pymupdf handles these natively — use execute_code or inline Python:

# Split: extract pages 1-5 to a new PDF
import pymupdf
doc = pymupdf.open("report.pdf")
new = pymupdf.open()
new.insert_pdf(doc, from_page=0, to_page=4)  # page indices are 0-based and inclusive
new.save("pages_1-5.pdf")

# Merge multiple PDFs
import pymupdf
result = pymupdf.open()
for path in ["a.pdf", "b.pdf", "c.pdf"]:
    result.insert_pdf(pymupdf.open(path))
result.save("merged.pdf")

# Search for text across all pages
import pymupdf
doc = pymupdf.open("report.pdf")
for i, page in enumerate(doc):
    results = page.search_for("revenue")
    if results:
        print(f"Page {i+1}: {len(results)} match(es)")
        print(page.get_text("text"))

No extra dependencies needed — pymupdf covers split, merge, search, and text extraction in one package.

Brand Guide / Design PDF Extraction

When extracting from brand standards PDFs, you need more than text — you need exact color values, images, and structured design tokens.

Full Pipeline

import pymupdf, os

doc = pymupdf.open('brand-guide.pdf')
out_dir = '/path/to/brand-assets'
os.makedirs(out_dir, exist_ok=True)

# 1. Extract ALL embedded images (logos, photos, swatches)
for page_num in range(len(doc)):
    for img_idx, img in enumerate(doc[page_num].get_images(full=True)):
        base_image = doc.extract_image(img[0])
        if base_image and len(base_image["image"]) > 500:  # skip tiny
            fname = f"page{page_num+1:02d}_img{img_idx:02d}.{base_image['ext']}"
            with open(f"{out_dir}/{fname}", "wb") as f:
                f.write(base_image["image"])

# 2. Render pages as PNGs for visual analysis
for i in range(len(doc)):
    doc[i].get_pixmap(dpi=150).save(f"{out_dir}/page_{i+1:02d}.png")

# 3. Pixel-sample EXACT colors from swatch pages (PDF color specs are often wrong!)
page = doc[10]  # color palette page (0-based index; adjust for your PDF)
pix = page.get_pixmap(dpi=300)  # high DPI for accuracy
w, h = pix.width, pix.height
# Sample multiple points in the swatch area, average them
samples = []
for x in range(int(w*0.1), int(w*0.35), int(w*0.05)):
    for y in range(int(h*0.3), int(h*0.7), int(h*0.1)):
        r, g, b = pix.pixel(x, y)[:3]
        if r < 100:  # filter for the color you're sampling
            samples.append((r, g, b))
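if not samples:
    raise ValueError("no pixels matched the filter; adjust the sample region or threshold")
avg = tuple(sum(c) // len(samples) for c in zip(*samples))
hex_color = f"#{avg[0]:02X}{avg[1]:02X}{avg[2]:02X}"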

Key Pitfall: PDF Color Specs Are Unreliable

In the Jungle Studio brand PDF, the printed CMYK/HEX values for "Charcoal" were wrong: the guide showed White Sand's values for both colors, a production error. Always pixel-sample the actual rendered swatches at 300 DPI to verify. The visual swatch is the ground truth, not the text overlay.
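
Checking a printed value against sampled pixels can be automated along these lines. The helper names and the per-channel `tolerance=8` are assumptions, meant to absorb anti-aliasing and rendering differences. The sample values in the usage note are made up for illustration and are not the real brand colors.

```python
def hex_to_rgb(hex_color):
    """Convert '#RRGGBB' to an (r, g, b) tuple of ints."""
    h = hex_color.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))


def average_rgb(samples):
    """Integer per-channel average of a non-empty list of (r, g, b) samples."""
    if not samples:
        raise ValueError("no samples to average")
    n = len(samples)
    return tuple(sum(channel) // n for channel in zip(*samples))


def matches_printed(printed_hex, samples, tolerance=8):
    """True if the printed HEX is within `tolerance` per channel of the swatch."""
    printed = hex_to_rgb(printed_hex)
    sampled = average_rgb(samples)
    return all(abs(p - s) <= tolerance for p, s in zip(printed, sampled))
```

With illustrative samples `[(51, 51, 51), (53, 50, 52)]`, `matches_printed("#333333", samples)` holds while `matches_printed("#F4EFE6", samples)` does not. A mismatch like that is the signal to trust the swatch over the printed value.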

Output Structure

Write a BRAND-STYLE-GUIDE.md with:
- Color palette table (Name | HEX | RGB | Usage)
- CSS design tokens (:root variables)
- Tailwind theme extension config
- Typography specs (font names, weights, tracking, usage rules)
- Logo variations and usage guidelines
- Asset index mapping extracted images to their content
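
As one way to produce the CSS design tokens, a verified palette can be rendered into a `:root` block. This is a sketch: the function name and the `--color-<name>` naming scheme are assumptions, and the hex values in the usage note are placeholders, not sampled values.

```python
def palette_to_css(palette):
    """Render a {name: '#RRGGBB'} mapping as a CSS :root block of variables."""
    lines = [":root {"]
    for name, hex_color in palette.items():
        # "White Sand" becomes "white-sand" for the variable name.
        token = name.strip().lower().replace(" ", "-")
        lines.append(f"  --color-{token}: {hex_color.upper()};")
    lines.append("}")
    return "\n".join(lines)
```

For example, `palette_to_css({"Charcoal": "#333333", "White Sand": "#f4efe6"})` yields a block defining `--color-charcoal` and `--color-white-sand`.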

Malware-safe bulk PDF/ebook acquisition and triage

When the user asks to download many PDFs/EPUBs/ebooks from a site and check them for malware, do not treat this as ordinary document extraction. Use a quarantine pipeline:

Separate discovery → authorized download → static scan → catalog.
Keep discovery polite: single-threaded, delay+jitter, robots-aware, Retry-After/429 backoff, no proxy/CAPTCHA/block evasion.
Require explicit authorization before downloading bulk files, especially from sites likely to host copyrighted books.
Store downloads by content hash under a quarantined work directory with restrictive permissions.
Scan statically in a constrained networkless container where possible: ClamAV, YARA marker rules, file, PDF qpdf/pdfinfo/mutool, byte-marker search, and ZIP/EPUB structure inspection.
Do not call incomplete scans clean; if required scanners are missing, use a verdict such as INCOMPLETE_STATIC_TRIAGE.
Produce a final catalog with source/detail URL, title, resolved file URL, format, SHA-256, bytes, path, download status, and verdict.
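
The storage and verdict steps in the list above might be sketched as follows. Everything except the `INCOMPLETE_STATIC_TRIAGE` label is an assumption: the function names, the `SUSPICIOUS` and `NO_FINDINGS` verdict strings, the permission bits, and the scanner executable names.

```python
import hashlib
import os
import shutil
from pathlib import Path

REQUIRED_SCANNERS = ["clamscan", "yara", "qpdf"]  # assumed executable names


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads are not loaded into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def quarantine(src, quarantine_dir):
    """Copy `src` to quarantine_dir/<sha256><suffix> with owner-only access."""
    src = Path(src)
    digest = sha256_of(src)
    qdir = Path(quarantine_dir)
    qdir.mkdir(parents=True, exist_ok=True)
    os.chmod(qdir, 0o700)
    dest = qdir / f"{digest}{src.suffix.lower()}"
    if not dest.exists():  # content-addressed, so duplicates are stored once
        shutil.copyfile(src, dest)
        os.chmod(dest, 0o400)  # read-only, never executable
    return digest, dest


def verdict(findings, scanners=REQUIRED_SCANNERS, which=shutil.which):
    """Never report a clean result when a required scanner was unavailable."""
    if findings:
        return "SUSPICIOUS"
    if any(which(name) is None for name in scanners):
        return "INCOMPLETE_STATIC_TRIAGE"
    return "NO_FINDINGS"
```

The digest, destination path, and verdict map directly onto the SHA-256, path, and verdict columns of the final catalog.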

See references/ebook-quarantine-download-and-static-triage.md for the session-derived implementation pattern, scan markers, catalog schema, and ad-hoc verifier fixture. Key pitfall from the session: after streaming a download into an open file, flush() + fsync() before validating magic bytes; otherwise a just-written valid file can be falsely rejected as not matching %PDF-/format magic.
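
The flush-then-validate pitfall can be illustrated with a short sketch. The function name and the chunk-iterable interface are assumptions. The point is the ordering: the header is read back only after `flush()` and `os.fsync()`.

```python
import os

PDF_MAGIC = b"%PDF-"


def save_and_validate(chunks, dest_path, magic=PDF_MAGIC):
    """Stream chunks to dest_path, then check the file starts with `magic`."""
    with open(dest_path, "wb") as fh:
        for chunk in chunks:
            fh.write(chunk)
        fh.flush()             # push Python's userspace buffer to the OS
        os.fsync(fh.fileno())  # ask the OS to commit the data to disk
        # Only now read the header back through a separate handle. Without
        # the flush above, a small file could still sit in the write buffer
        # and this read would see an empty or truncated file.
        with open(dest_path, "rb") as check:
            return check.read(len(magic)) == magic
```

Validating after the `with` block closes the file also works, since closing flushes the buffer. The explicit calls matter when the file must stay open during the check.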

For preparing quarantined PDFs for OpenNotebook/NotebookLM-style ingestion, see references/open-notebook-active-content-pdf-sanitization.md. It covers PDF-aware stripping of /JS, /OpenAction, /AA, form/XFA, and embedded-file hooks; syntax-token rescanning that ignores stream payload false positives; Ghostscript fallback; and OpenNotebook-specific ingestion/embedding verification.

Notes

web_extract is always first choice for URLs when the goal is text extraction from a known safe/authorized document.
pymupdf is the safe default: instant, no models, works everywhere.
marker-pdf is for OCR, scanned docs, equations, and complex layouts; install it only when needed.
marker-pdf downloads ~2.5GB of models to ~/.cache/huggingface/ on first use
For Word docs: pip install python-docx (better than OCR — parses actual document structure)
For PowerPoint: see the powerpoint skill (uses python-pptx)