ocr-and-documents

PDF & Document Extraction

For DOCX: use python-docx (parses actual document structure, far better than OCR). For PPTX: see the powerpoint skill (uses python-pptx with full slide/notes support). This skill covers PDFs and scanned documents.

Step 1: Remote URL Available?

If the document has a URL, always try web_extract first:

web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
web_extract(urls=["https://example.com/report.pdf"])

This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.

Only use local extraction when: the file is local, web_extract fails, or you need batch processing.
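
The routing rule above can be written as a small decision function. This is a minimal sketch: `extraction_route` and its parameters are hypothetical names, not part of any tool, and the function only decides which path to try first. It does not call web_extract itself.

```python
from urllib.parse import urlparse


def extraction_route(source, batch=False, web_extract_failed=False):
    """Return "web_extract" or "local" for a document source (hypothetical helper)."""
    # Only http(s) sources can go through web_extract.
    is_remote = urlparse(source).scheme in ("http", "https")
    if is_remote and not batch and not web_extract_failed:
        return "web_extract"
    # Local files, batch jobs, and failed web extractions fall back to local tools.
    return "local"
```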

Step 2: Choose Local Extractor

| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
|---|---|---|
| Text-based PDF | ✅ | ✅ |
| Scanned PDF (OCR) | ❌ | ✅ (90+ languages) |
| Tables | ✅ (basic) | ✅ (high accuracy) |
| Equations / LaTeX | ❌ | ✅ |
| Code blocks | ❌ | ✅ |
| Forms | ❌ | ✅ |
| Headers/footers removal | ❌ | ✅ |
| Reading order detection | ❌ | ✅ |
| Images extraction | ✅ (embedded) | ✅ (with context) |
| Images → text (OCR) | ❌ | ✅ |
| EPUB | ✅ | ✅ |
| Markdown output | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
| Install size | ~25MB | ~3-5GB (PyTorch + models) |
| Speed | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |

Decision: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.
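
One way to check whether OCR is needed is to look at how much text pymupdf actually returns per page. The sketch below is a heuristic under stated assumptions: `likely_needs_ocr` and both thresholds are hypothetical and untuned, and the function works on already-extracted page strings so it can be tried without a PDF.

```python
def likely_needs_ocr(page_texts, min_chars_per_page=50, max_sparse_ratio=0.5):
    """Guess whether a PDF is scanned, given one extracted string per page.

    A page counts as "sparse" when it yields fewer than min_chars_per_page
    characters; the document is flagged when more than max_sparse_ratio of
    its pages are sparse. Both thresholds are illustrative assumptions.
    """
    if not page_texts:
        return True
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    return sparse / len(page_texts) > max_sparse_ratio


# Typical use with pymupdf (requires the package and a real file):
#   import pymupdf
#   with pymupdf.open("document.pdf") as doc:
#       needs_ocr = likely_needs_ocr([page.get_text() for page in doc])
```

A PDF that passes this check can still need marker-pdf for equations, forms, or complex layouts; the check only covers the scanned-versus-text question.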

If the user needs marker capabilities but the system lacks ~5GB free disk:

"This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf which works for text-based PDFs but not scanned documents or equations."

Special Case: PDF has a usable TOC/bookmarks but broken embedded text

Some PDFs are not truly scanned, yet direct text extraction is still unusable because the font mapping is garbage. Symptom: pymupdf returns readable structure in doc.get_toc() but page text looks like encoded junk. In that case, do not treat the PDF as a normal text PDF.

Use this pipeline instead:

  1. Use doc.get_toc() first to discover real section boundaries.
  2. Segment the document by bookmark/page ranges instead of fuzzy heading detection.
  3. Render each page in the section to PNG at ~300 DPI with PyMuPDF.
  4. OCR the rendered PNGs with tesseract.
  5. Save both:
     - raw OCR text
     - rendered source page images
  6. Apply only deterministic cleanup:
     - remove page-number lines
     - remove repeated running headers/footers
     - normalize quotes/apostrophes
     - join line-end hyphenation
     - preserve source wording (no paraphrase / no LLM rewriting)
  7. If the goal is “full fidelity”, keep the source page images and expose them alongside the extracted text so users can visually verify the original page.
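
The deterministic cleanup in step 6 can be sketched with regular expressions alone. The names `clean_pages` and `repeat_ratio`, and the rule that a line repeated on most pages is a running header or footer, are assumptions made for illustration. The hyphenation join is deliberately conservative (lowercase letters only) and will still merge genuine compounds that happen to break at a line end.

```python
import re
from collections import Counter

# Curly quotes and apostrophes mapped to their ASCII equivalents.
QUOTE_MAP = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})
# A line holding only a number, optionally prefixed with "Page".
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d+\s*$", re.IGNORECASE)


def clean_pages(pages, repeat_ratio=0.6):
    """Clean raw OCR text given as a list of strings, one per page."""
    # Count on how many pages each distinct non-empty line appears.
    counts = Counter()
    for text in pages:
        counts.update({ln.strip() for ln in text.splitlines() if ln.strip()})
    threshold = max(2, int(len(pages) * repeat_ratio))
    repeated = {ln for ln, n in counts.items() if n >= threshold}

    cleaned = []
    for text in pages:
        kept = [
            ln.rstrip()
            for ln in text.splitlines()
            if ln.strip() not in repeated and not PAGE_NUMBER.match(ln)
        ]
        cleaned.append("\n".join(kept))

    joined = "\n".join(cleaned).translate(QUOTE_MAP)
    # Rejoin words split across a line break: "extrac-\ntion" -> "extraction".
    return re.sub(r"([a-z])-\n([a-z])", r"\1\2", joined)
```

No wording is changed beyond these mechanical rules, which keeps the output traceable to the page images.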

This pattern is especially strong when the PDF has high-quality bookmarks for chapters/entries (for example, one bookmark per section or per record), because the bookmarks become the canonical segmentation layer while OCR becomes only the text-recovery layer.
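
A minimal sketch of the segmentation and text-recovery layers follows. `toc_to_ranges` and `ocr_section` are hypothetical helper names; the sketch assumes the `tesseract` binary is on PATH and that bookmark entries have the `[level, title, page]` shape returned by `doc.get_toc()`, with 1-based page numbers.

```python
import os
import subprocess


def toc_to_ranges(toc, page_count, level=1):
    """Convert bookmark entries into (title, first_page, last_page) tuples.

    Returned page numbers are 0-based and inclusive. Only bookmarks at
    `level` are used; entries without a valid target page are skipped.
    """
    entries = [(title, page - 1) for lvl, title, page, *_ in toc
               if lvl == level and page >= 1]
    ranges = []
    for idx, (title, start) in enumerate(entries):
        # A section ends on the page before the next bookmark starts.
        end = entries[idx + 1][1] - 1 if idx + 1 < len(entries) else page_count - 1
        ranges.append((title, start, max(start, end)))
    return ranges


def ocr_section(pdf_path, first_page, last_page, out_dir, dpi=300, lang="eng"):
    """Render each page to PNG, OCR it with tesseract, and keep both outputs."""
    import pymupdf  # imported lazily so toc_to_ranges works without it

    os.makedirs(out_dir, exist_ok=True)
    texts = []
    with pymupdf.open(pdf_path) as doc:
        for pno in range(first_page, last_page + 1):
            png = os.path.join(out_dir, f"page_{pno + 1:04d}.png")
            doc[pno].get_pixmap(dpi=dpi).save(png)
            # "stdout" makes tesseract print the text instead of writing a file.
            result = subprocess.run(
                ["tesseract", png, "stdout", "-l", lang],
                capture_output=True, text=True, check=True,
            )
            with open(png[:-4] + ".txt", "w", encoding="utf-8") as f:
                f.write(result.stdout)
            texts.append(result.stdout)
    return texts
```

Typical use is `toc_to_ranges(doc.get_toc(), len(doc))` followed by one `ocr_section` call per range, so each bookmark maps to its own folder of page images and raw text.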


pymupdf (lightweight)

pip install pymupdf pymupdf4llm

Via helper script:

python scripts/extract_pymupdf.py document.pdf              # Plain text
python scripts/extract_pymupdf.py document.pdf --markdown    # Markdown
python scripts/extract_pymupdf.py document.pdf --tables      # Tables
python scripts/extract_pymupdf.py document.pdf --images out/ # Extract images
python scripts/extract_pymupdf.py document.pdf --metadata    # Title, author, pages
python scripts/extract_pymupdf.py document.pdf --pages 0-4   # Specific pages

Inline:

python3 -c "
import pymupdf
doc = pymupdf.open('document.pdf')
for page in doc:
    print(page.get_text())
"

marker-pdf (high-quality OCR)

# Check disk space first
python scripts/extract_marker.py --check

pip install marker-pdf

Via helper script:

python scripts/extract_marker.py document.pdf                # Markdown
python scripts/extract_marker.py document.pdf --json         # JSON with metadata
python scripts/extract_marker.py document.pdf --output_dir out/  # Save images
python scripts/extract_marker.py scanned.pdf                 # Scanned PDF (OCR)
python scripts/extract_marker.py document.pdf --use_llm      # LLM-boosted accuracy

CLI (installed with marker-pdf):

marker_single document.pdf --output_dir ./output
marker /path/to/folder --workers 4    # Batch

Arxiv Papers

# Abstract only (fast)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])

# Full paper
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])

# Search
web_search(query="arxiv GRPO reinforcement learning 2026")
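
The abstract and PDF URLs differ only in the path segment, so both can be derived from one identifier. This is a small sketch with a hypothetical helper, `arxiv_urls`; it only recognizes new-style IDs such as 2402.03300 (with an optional version suffix) and does not handle older category-prefixed identifiers.

```python
import re

# New-style arXiv identifier: YYMM.NNNN or YYMM.NNNNN, optional version suffix.
ARXIV_ID = re.compile(r"\d{4}\.\d{4,5}(?:v\d+)?")


def arxiv_urls(ref):
    """Build abs/pdf URLs from a bare arXiv ID or any arXiv URL containing one."""
    match = ARXIV_ID.search(ref)
    if not match:
        raise ValueError(f"No arXiv identifier found in {ref!r}")
    paper_id = match.group(0)
    return {
        "abs": f"https://arxiv.org/abs/{paper_id}",
        "pdf": f"https://arxiv.org/pdf/{paper_id}",
    }
```

The returned URLs can be passed straight to web_extract.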

pymupdf handles splitting, merging, and searching PDFs natively. Use execute_code or inline Python:

# Split: extract pages 1-5 to a new PDF
import pymupdf
doc = pymupdf.open("report.pdf")
new = pymupdf.open()
new.insert_pdf(doc, from_page=0, to_page=4)  # page indices are 0-based, inclusive
new.save("pages_1-5.pdf")

# Merge multiple PDFs
import pymupdf
result = pymupdf.open()
for path in ["a.pdf", "b.pdf", "c.pdf"]:
    with pymupdf.open(path) as src:  # close each source after copying
        result.insert_pdf(src)
result.save("merged.pdf")

# Search for text across all pages
import pymupdf
doc = pymupdf.open("report.pdf")
for i, page in enumerate(doc):
    results = page.search_for("revenue")  # list of match rectangles
    if results:
        print(f"Page {i+1}: {len(results)} match(es)")
        print(page.get_text("text"))

No extra dependencies needed — pymupdf covers split, merge, search, and text extraction in one package.


Brand Guide / Design PDF Extraction

When extracting from brand standards PDFs, you need more than text — you need exact color values, images, and structured design tokens.

Full Pipeline

import pymupdf, os

doc = pymupdf.open('brand-guide.pdf')
out_dir = '/path/to/brand-assets'
os.makedirs(out_dir, exist_ok=True)

# 1. Extract ALL embedded images (logos, photos, swatches)
for page_num in range(len(doc)):
    for img_idx, img in enumerate(doc[page_num].get_images(full=True)):
        base_image = doc.extract_image(img[0])
        if base_image and len(base_image["image"]) > 500:  # skip tiny
            fname = f"page{page_num+1:02d}_img{img_idx:02d}.{base_image['ext']}"
            with open(f"{out_dir}/{fname}", "wb") as f:
                f.write(base_image["image"])

# 2. Render pages as PNGs for visual analysis
for i in range(len(doc)):
    doc[i].get_pixmap(dpi=150).save(f"{out_dir}/page_{i+1:02d}.png")

# 3. Pixel-sample EXACT colors from swatch pages (PDF color specs are often wrong!)
page = doc[10]  # color palette page
pix = page.get_pixmap(dpi=300)  # high DPI for accuracy
w, h = pix.width, pix.height
# Sample multiple points in the swatch area, then average them
samples = []
for x in range(int(w*0.1), int(w*0.35), max(1, int(w*0.05))):
    for y in range(int(h*0.3), int(h*0.7), max(1, int(h*0.1))):
        r, g, b = pix.pixel(x, y)[:3]
        if r < 100:  # filter for the color you're sampling
            samples.append((r, g, b))
if not samples:
    raise ValueError("No pixels matched the filter; adjust the region or threshold")
avg = tuple(sum(c)//len(samples) for c in zip(*samples))
hex_color = f"#{avg[0]:02X}{avg[1]:02X}{avg[2]:02X}"

Key Pitfall: PDF Color Specs Are Unreliable

In the Jungle Studio brand PDF, the printed CMYK/HEX values for "Charcoal" were wrong: a production error had printed White Sand's values under both colors. Always pixel-sample the actual rendered swatches at 300 DPI to verify. The visual swatch is the ground truth, not the text overlay.
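
A printed value can be compared against the sampled swatch with a per-channel tolerance. This is a sketch only: `hex_to_rgb`, `color_mismatch`, and the tolerance of 12 are hypothetical, and the colors in the usage note are placeholders, not the real brand values.

```python
def hex_to_rgb(value):
    """Convert "#RRGGBB" (or "RRGGBB") to an (r, g, b) tuple of ints."""
    value = value.lstrip("#")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def color_mismatch(printed_hex, sampled_rgb, tolerance=12):
    """True when any channel differs from the sampled swatch by more than tolerance.

    A small tolerance absorbs anti-aliasing and rendering differences; the
    default here is an illustrative guess, not a calibrated value.
    """
    printed = hex_to_rgb(printed_hex)
    return any(abs(p - s) > tolerance for p, s in zip(printed, sampled_rgb))
```

For example, a dark swatch sampled as `(52, 50, 49)` next to a printed light value such as `#F5F0E6` is flagged, while a printed `#333333` is not.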

Output Structure

Write a BRAND-STYLE-GUIDE.md with:

- Color palette table (Name | HEX | RGB | Usage)
- CSS design tokens (:root variables)
- Tailwind theme extension config
- Typography specs (font names, weights, tracking, usage rules)
- Logo variations and usage guidelines
- Asset index mapping extracted images to their content
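
The CSS design tokens can be generated directly from the verified palette. This is a minimal sketch: `css_tokens` and the `--color-` naming scheme are assumptions, and any HEX values passed in should come from pixel sampling rather than the printed specs.

```python
import re


def css_tokens(palette):
    """Render a {name: "#RRGGBB"} mapping as a CSS :root block."""
    lines = [":root {"]
    for name, hex_value in palette.items():
        # "White Sand" -> "white-sand"
        slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
        lines.append(f"  --color-{slug}: {hex_value.upper()};")
    lines.append("}")
    return "\n".join(lines)
```

The same mapping can feed the palette table and the Tailwind config, so the three outputs stay consistent.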


Notes