---
name: ocr-and-documents
description: Extract text from PDFs and scanned documents. Use web_extract for remote URLs, pymupdf for local text-based PDFs, marker-pdf for OCR/scanned docs. For DOCX use python-docx, for PPTX see the powerpoint skill.
version: 2.3.0
author: Hermes Agent
license: MIT
metadata:
  hermes:
    tags: [PDF, Documents, Research, Arxiv, Text-Extraction, OCR]
    related_skills: [powerpoint]
---

# PDF & Document Extraction

For DOCX: use `python-docx` (parses actual document structure, far better than OCR).
For PPTX: see the `powerpoint` skill (uses `python-pptx` with full slide/notes support).
This skill covers **PDFs and scanned documents**.

## Step 0: Untrusted or Bulk Document Source?

If the user wants to download or inspect PDFs, EPUBs, DOCX, or ebook files from an untrusted site/source and is worried about malware, **do not open or OCR them directly**. First build and run a static quarantine triage pipeline:

- separate discovery → authorized download → offline scan;
- use polite rate limiting and no block/rate-limit bypass;
- quarantine downloads by SHA-256 with restrictive permissions;
- scan in a networkless, low-privilege container/VM where possible;
- inspect PDFs for JavaScript/OpenAction/Launch/EmbeddedFile/XFA/object-stream/encryption markers;
- inspect EPUB/DOCX/CBZ/ZIP-like files for scripts, suspicious embedded filenames, and zip-bomb-like ratios;
- produce a final catalog joining source URL, resolved file URL, format, hash, scan verdict, and quarantine path.

See `references/quarantined-ebook-document-triage.md` for the class pattern, pitfalls, and ad-hoc verification fixture.
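
As a rough illustration of the PDF marker inspection step, here is a minimal byte-level sketch using only the standard library. The function names and verdict strings are hypothetical, and the marker list simply mirrors the names listed above. A raw byte search is a first pass only: it can miss names that are hex-escaped or stored inside compressed object streams, and it can flag harmless stream payloads.

```python
# Name tokens associated with active or risky PDF content.
# This mirrors the markers named above; it is not an exhaustive list.
PDF_MARKERS = (b"/JavaScript", b"/JS", b"/OpenAction", b"/Launch",
               b"/EmbeddedFile", b"/XFA", b"/ObjStm", b"/Encrypt")

def scan_pdf_markers(data: bytes) -> dict:
    """Count raw occurrences of each marker in the file bytes."""
    return {m.decode(): data.count(m) for m in PDF_MARKERS}

def triage_verdict(data: bytes) -> str:
    """Return a placeholder verdict string based on magic bytes and markers."""
    if not data.startswith(b"%PDF-"):
        return "NOT_A_PDF"
    hits = {name: n for name, n in scan_pdf_markers(data).items() if n}
    return "NEEDS_REVIEW" if hits else "NO_MARKERS_FOUND"
```

Read the quarantined file with `pathlib.Path(path).read_bytes()` and pass the result to `triage_verdict`. A `NO_MARKERS_FOUND` result from this sketch alone should not be reported as clean.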

## Step 1: Remote URL Available?

If the document has a URL and it is not an untrusted/bulk malware-triage case, **try `web_extract` first**:

```
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
web_extract(urls=["https://example.com/report.pdf"])
```

This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.

Only use local extraction when: the file is local, web_extract fails, or you need batch processing.

## Step 2: Choose Local Extractor

| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
|---------|-----------------|---------------------|
| **Text-based PDF** | ✅ | ✅ |
| **Scanned PDF (OCR)** | ❌ | ✅ (90+ languages) |
| **Tables** | ✅ (basic) | ✅ (high accuracy) |
| **Equations / LaTeX** | ❌ | ✅ |
| **Code blocks** | ❌ | ✅ |
| **Forms** | ❌ | ✅ |
| **Headers/footers removal** | ❌ | ✅ |
| **Reading order detection** | ❌ | ✅ |
| **Images extraction** | ✅ (embedded) | ✅ (with context) |
| **Images → text (OCR)** | ❌ | ✅ |
| **EPUB** | ✅ | ✅ |
| **Markdown output** | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
| **Install size** | ~25MB | ~3-5GB (PyTorch + models) |
| **Speed** | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |

**Decision**: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.

If the user needs marker capabilities but the system lacks ~5GB free disk:
> "This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf which works for text-based PDFs but not scanned documents or equations."

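One way to fill in the `[X]GB free` figure is `shutil.disk_usage` from the standard library. This is a minimal sketch, not the logic of `scripts/extract_marker.py --check`; the function names and the 5GB default are assumptions taken from the rough figure above.

```python
import shutil
from typing import Optional

REQUIRED_GB = 5.0  # rough figure from this skill; actual usage varies

def free_disk_gb(path: str = ".") -> float:
    """Free space on the filesystem containing `path`, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9

def low_space_notice(free_gb: float, required_gb: float = REQUIRED_GB) -> Optional[str]:
    """Return a user-facing notice if there is not enough room, else None."""
    if free_gb >= required_gb:
        return None
    return (f"This document needs OCR/advanced extraction (marker-pdf), which "
            f"requires ~{required_gb:.0f}GB for PyTorch and models. "
            f"Your system has {free_gb:.1f}GB free.")
```

Call `low_space_notice(free_disk_gb())` before installing marker-pdf, and append the options from the message above when it returns a notice.
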
## Special Case: PDF has a usable TOC/bookmarks but broken embedded text

Some PDFs are not truly scanned, yet direct text extraction is still unusable because the embedded fonts map glyphs to the wrong characters. Symptom: `doc.get_toc()` returns readable structure, but `page.get_text()` returns encoded junk. In that case, do **not** treat the PDF as a normal text PDF.

Use this pipeline instead:

1. Use `doc.get_toc()` first to discover real section boundaries.
2. Segment the document by bookmark/page ranges instead of fuzzy heading detection.
3. Render each page in the section to PNG at ~300 DPI with PyMuPDF.
4. OCR the rendered PNGs with `tesseract`.
5. Save both:
   - raw OCR text
   - rendered source page images
6. Apply only deterministic cleanup:
   - remove page-number lines
   - remove repeated running headers/footers
   - normalize quotes/apostrophes
   - join line-end hyphenation
   - preserve source wording (no paraphrase / no LLM rewriting)
7. If the goal is “full fidelity”, keep the source page images and expose them alongside the extracted text so users can visually verify the original page.
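
The deterministic cleanup in step 6 could look roughly like the sketch below. It is standard library only; `clean_pages`, the repeat threshold, and the choice to normalize curly quotes to straight quotes are assumptions for illustration. It takes one OCR text string per page and never rewrites wording.

```python
import re
from collections import Counter

# Curly quotes and apostrophes mapped to straight ASCII equivalents.
QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})

def clean_pages(pages, header_min_repeats=3):
    """Apply rule-based cleanup to a list of per-page OCR strings."""
    # A line that appears on many pages is treated as a running header/footer.
    counts = Counter()
    for text in pages:
        counts.update({ln.strip() for ln in text.splitlines() if ln.strip()})
    repeated = {ln for ln, n in counts.items() if n >= header_min_repeats}

    kept = []
    for text in pages:
        for ln in text.splitlines():
            s = ln.strip()
            if re.fullmatch(r"\d{1,4}", s):  # bare page-number line
                continue
            if s in repeated:
                continue
            kept.append(s.translate(QUOTES))
    joined = "\n".join(kept)
    # Join words hyphenated across a line break: "exam-\nple" -> "example".
    return re.sub(r"(\w)-\n(\w)", r"\1\2", joined)
```

The hyphen rule also joins genuine compounds that happen to break at a line end (for example "well-" / "known"), so keep the raw OCR text alongside the cleaned output for comparison.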

This pattern is especially strong when the PDF has high-quality bookmarks for chapters/entries (for example, one bookmark per section or per record), because the bookmarks become the canonical segmentation layer while OCR becomes only the text-recovery layer.
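
A minimal sketch of bookmark-based segmentation followed by render and OCR, assuming PyMuPDF is installed and the `tesseract` binary is on `PATH`. The function names, file naming scheme, and the rule that a section ends on the page before the next bookmark are illustrative assumptions.

```python
import subprocess
from pathlib import Path

def toc_to_ranges(toc, page_count, level=1):
    """Map TOC entries at `level` to (title, first_page, last_page), 0-based inclusive.

    `toc` uses the shape returned by PyMuPDF's doc.get_toc():
    [level, title, page, ...] with 1-based page numbers.
    """
    # Skip entries whose target page is unknown (reported as -1).
    starts = [(title, page - 1) for lvl, title, page, *_ in toc
              if lvl == level and page >= 1]
    ranges = []
    for i, (title, first) in enumerate(starts):
        next_first = starts[i + 1][1] if i + 1 < len(starts) else page_count
        # A section ends on the page before the next bookmark. If two bookmarks
        # share a page, keep that page so the section is never empty.
        ranges.append((title, first, max(first, next_first - 1)))
    return ranges

def ocr_section(pdf_path, first, last, out_dir, dpi=300):
    """Render pages first..last to PNG and OCR each with the tesseract CLI."""
    import pymupdf  # imported here so toc_to_ranges works without PyMuPDF
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    texts = []
    with pymupdf.open(pdf_path) as doc:
        for pno in range(first, last + 1):
            png = out / f"page_{pno + 1:04d}.png"
            doc[pno].get_pixmap(dpi=dpi).save(str(png))
            # "stdout" as the output base makes tesseract print the text.
            run = subprocess.run(["tesseract", str(png), "stdout"],
                                 capture_output=True, encoding="utf-8", check=True)
            # Keep raw OCR text next to the rendered source image.
            png.with_suffix(".txt").write_text(run.stdout, encoding="utf-8")
            texts.append(run.stdout)
    return texts
```

Typical use is to call `toc_to_ranges(doc.get_toc(), doc.page_count)` and then `ocr_section` once per range. When a new section starts partway down a page, the tail of the previous section sits on that page too, so adjust the boundary rule if that matters for the document at hand.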

---

## pymupdf (lightweight)

```bash
pip install pymupdf pymupdf4llm
```

**Via helper script**:
```bash
python scripts/extract_pymupdf.py document.pdf              # Plain text
python scripts/extract_pymupdf.py document.pdf --markdown    # Markdown
python scripts/extract_pymupdf.py document.pdf --tables      # Tables
python scripts/extract_pymupdf.py document.pdf --images out/ # Extract images
python scripts/extract_pymupdf.py document.pdf --metadata    # Title, author, pages
python scripts/extract_pymupdf.py document.pdf --pages 0-4   # Specific pages
```

**Inline**:
```bash
python3 -c "
import pymupdf
doc = pymupdf.open('document.pdf')
for page in doc:
    print(page.get_text())
"
```

---

## marker-pdf (high-quality OCR)

```bash
# Check disk space first
python scripts/extract_marker.py --check

pip install marker-pdf
```

**Via helper script**:
```bash
python scripts/extract_marker.py document.pdf                # Markdown
python scripts/extract_marker.py document.pdf --json         # JSON with metadata
python scripts/extract_marker.py document.pdf --output_dir out/  # Save images
python scripts/extract_marker.py scanned.pdf                 # Scanned PDF (OCR)
python scripts/extract_marker.py document.pdf --use_llm      # LLM-boosted accuracy
```

**CLI** (installed with marker-pdf):
```bash
marker_single document.pdf --output_dir ./output
marker /path/to/folder --workers 4    # Batch
```

---

## Arxiv Papers

```
# Abstract only (fast)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])

# Full paper
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])

# Search
web_search(query="arxiv GRPO reinforcement learning 2026")
```

## Split, Merge & Search

pymupdf handles these natively — use `execute_code` or inline Python:

```python
# Split: extract pages 1-5 to a new PDF
import pymupdf
doc = pymupdf.open("report.pdf")
new = pymupdf.open()
# Page indices are 0-based and inclusive, so 0-4 covers pages 1-5
new.insert_pdf(doc, from_page=0, to_page=4)
new.save("pages_1-5.pdf")
```

```python
# Merge multiple PDFs
import pymupdf
result = pymupdf.open()
for path in ["a.pdf", "b.pdf", "c.pdf"]:
    with pymupdf.open(path) as src:  # close each source after copying
        result.insert_pdf(src)
result.save("merged.pdf")
```

```python
# Search for text across all pages
import pymupdf
doc = pymupdf.open("report.pdf")
for i, page in enumerate(doc):
    results = page.search_for("revenue")
    if results:
        print(f"Page {i+1}: {len(results)} match(es)")
        print(page.get_text("text"))
```

No extra dependencies needed — pymupdf covers split, merge, search, and text extraction in one package.

---

## Brand Guide / Design PDF Extraction

When extracting from brand standards PDFs, you need more than text — you need exact color values, images, and structured design tokens.

### Full Pipeline
```python
import pymupdf, os

doc = pymupdf.open('brand-guide.pdf')
out_dir = '/path/to/brand-assets'
os.makedirs(out_dir, exist_ok=True)

# 1. Extract ALL embedded images (logos, photos, swatches)
for page_num in range(len(doc)):
    for img_idx, img in enumerate(doc[page_num].get_images(full=True)):
        base_image = doc.extract_image(img[0])
        if base_image and len(base_image["image"]) > 500:  # skip tiny
            fname = f"page{page_num+1:02d}_img{img_idx:02d}.{base_image['ext']}"
            with open(f"{out_dir}/{fname}", "wb") as f:
                f.write(base_image["image"])

# 2. Render pages as PNGs for visual analysis
for i in range(len(doc)):
    doc[i].get_pixmap(dpi=150).save(f"{out_dir}/page_{i+1:02d}.png")

# 3. Pixel-sample EXACT colors from swatch pages (PDF color specs are often wrong!)
page = doc[10]  # color palette page (0-based index; adjust for your PDF)
pix = page.get_pixmap(dpi=300)  # high DPI for accuracy
w, h = pix.width, pix.height
# Sample a grid of points in the swatch area, then average them
samples = []
for x in range(int(w*0.1), int(w*0.35), max(1, int(w*0.05))):
    for y in range(int(h*0.3), int(h*0.7), max(1, int(h*0.1))):
        r, g, b = pix.pixel(x, y)[:3]
        if r < 100:  # filter for the color you're sampling
            samples.append((r, g, b))
if not samples:  # avoid an IndexError below when nothing passes the filter
    raise ValueError("no pixels matched the filter; adjust the region or threshold")
avg = tuple(sum(c)//len(samples) for c in zip(*samples))
hex_color = f"#{avg[0]:02X}{avg[1]:02X}{avg[2]:02X}"
```

### Key Pitfall: PDF Color Specs Are Unreliable
In the Jungle Studio brand PDF, the printed CMYK/HEX values for "Charcoal" were wrong: a production error left White Sand's values printed under both colors. **Always pixel-sample the actual rendered swatches at 300 DPI** to verify. The visual swatch is the ground truth, not the text overlay.

### Output Structure
Write a BRAND-STYLE-GUIDE.md with:
- Color palette table (Name | HEX | RGB | Usage)
- CSS design tokens (`:root` variables)
- Tailwind theme extension config
- Typography specs (font names, weights, tracking, usage rules)
- Logo variations and usage guidelines
- Asset index mapping extracted images to their content
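
The palette table and CSS tokens can be generated from the sampled colors rather than typed by hand. This is a minimal sketch; the helper names, the `--color-` prefix, and the palette entry are placeholders for illustration, not values from any real brand guide.

```python
import re

def hex_to_rgb(hex_color):
    """Convert '#RRGGBB' to an (r, g, b) tuple of ints."""
    h = hex_color.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))

def token_name(color_name):
    """Turn a display name into a CSS custom property name."""
    return "--color-" + re.sub(r"[^a-z0-9]+", "-", color_name.lower()).strip("-")

def palette_table(palette):
    """Render the Name | HEX | RGB | Usage table as Markdown."""
    rows = ["| Name | HEX | RGB | Usage |", "|------|-----|-----|-------|"]
    for c in palette:
        r, g, b = hex_to_rgb(c["hex"])
        rows.append(f"| {c['name']} | {c['hex']} | {r}, {g}, {b} | {c['usage']} |")
    return "\n".join(rows)

def css_tokens(palette):
    """Render the palette as `:root` custom properties."""
    lines = [f"  {token_name(c['name'])}: {c['hex']};" for c in palette]
    return ":root {\n" + "\n".join(lines) + "\n}"

# Placeholder entry for illustration only.
palette = [{"name": "Example Dark", "hex": "#1A2B3C", "usage": "Body text"}]
print(palette_table(palette))
print(css_tokens(palette))
```

Feed it the `hex_color` values obtained from pixel sampling so the guide reflects the rendered swatches.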

---

## Malware-safe bulk PDF/ebook acquisition and triage

When the user asks to download many PDFs/EPUBs/ebooks from a site and check them for malware, do **not** treat this as ordinary document extraction. Use a quarantine pipeline:

1. Separate **discovery → authorized download → static scan → catalog**.
2. Keep discovery polite: single-threaded, delay+jitter, robots-aware, `Retry-After`/429 backoff, no proxy/CAPTCHA/block evasion.
3. Require explicit authorization before downloading bulk files, especially from sites likely to host copyrighted books.
4. Store downloads by content hash under a quarantined work directory with restrictive permissions.
5. Scan statically in a constrained networkless container where possible: ClamAV, YARA marker rules, `file`, PDF `qpdf`/`pdfinfo`/`mutool`, byte-marker search, and ZIP/EPUB structure inspection.
6. Do not call incomplete scans clean; if required scanners are missing, use a verdict such as `INCOMPLETE_STATIC_TRIAGE`.
7. Produce a final catalog with source/detail URL, title, resolved file URL, format, SHA-256, bytes, path, download status, and verdict.
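
For the ZIP/EPUB structure inspection in step 5, a first pass can read only the archive's central directory without extracting anything. The sketch below uses the standard `zipfile` module; the function name, extension list, and thresholds are assumptions to tune per project. Declared sizes in ZIP headers can be forged, so treat this as one signal among several.

```python
import zipfile

# Member extensions worth flagging for review (illustrative, not exhaustive).
SCRIPT_EXTS = (".js", ".vbs", ".exe", ".dll", ".bat", ".cmd", ".ps1", ".scr", ".jar")

def inspect_zip_container(path, max_ratio=100.0, max_total=500 * 1024 * 1024):
    """Return a list of findings for an EPUB/DOCX/CBZ/ZIP file, without extracting it."""
    findings = []
    with zipfile.ZipFile(path) as zf:
        infos = zf.infolist()
    total = sum(i.file_size for i in infos)            # declared uncompressed bytes
    packed = sum(i.compress_size for i in infos) or 1  # avoid division by zero
    if total / packed > max_ratio or total > max_total:
        findings.append(f"suspicious expansion: {total} bytes from {packed} compressed")
    for i in infos:
        name = i.filename
        if name.startswith("/") or ".." in name.split("/"):
            findings.append(f"path traversal name: {name}")
        if name.lower().endswith(SCRIPT_EXTS):
            findings.append(f"script or executable member: {name}")
    return findings
```

An empty list means this check found nothing, which is not the same as a clean verdict. EPUB 3 files may legitimately contain scripts, so a flagged `.js` member is a prompt for review rather than proof of malware.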

See `references/ebook-quarantine-download-and-static-triage.md` for the session-derived implementation pattern, scan markers, catalog schema, and ad-hoc verifier fixture. Key pitfall from the session: after streaming a download into an open file, `flush()` + `fsync()` before validating magic bytes; otherwise a just-written valid file can be falsely rejected as not matching `%PDF-`/format magic.
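
The `flush()` + `fsync()` pitfall can be sketched as follows, combined with hash-named storage and restrictive permissions. This is a single-threaded illustration under stated assumptions: `store_quarantined`, the `incoming.part` temp name, and the `MAGIC` table are hypothetical, and `chunks` stands in for whatever iterator the download client yields.

```python
import hashlib
import os
from pathlib import Path

# Expected leading bytes per format (EPUB is a ZIP container).
MAGIC = {"pdf": b"%PDF-", "epub": b"PK\x03\x04"}

def store_quarantined(chunks, quarantine_dir, fmt="pdf"):
    """Stream chunks to disk, sync, check magic bytes, then rename by SHA-256."""
    qdir = Path(quarantine_dir)
    qdir.mkdir(parents=True, exist_ok=True)
    os.chmod(qdir, 0o700)  # owner-only access to the quarantine directory
    tmp = qdir / "incoming.part"
    sha = hashlib.sha256()
    with open(tmp, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            sha.update(chunk)
        # Push Python's buffer to the OS and ask the OS to commit it before any
        # other reader inspects the file. Without this, the check below can see
        # an empty or partial file and reject a valid download.
        f.flush()
        os.fsync(f.fileno())
        with open(tmp, "rb") as check:
            head = check.read(len(MAGIC[fmt]))
    if head != MAGIC[fmt]:
        tmp.unlink()
        raise ValueError(f"magic bytes do not match {fmt!r}: {head!r}")
    final = qdir / f"{sha.hexdigest()}.{fmt}"
    os.replace(tmp, final)
    os.chmod(final, 0o400)  # read-only for the owner
    return final
```

Naming by content hash also deduplicates repeated downloads of the same file, and the returned path can go straight into the catalog alongside the SHA-256.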

For preparing quarantined PDFs for OpenNotebook/NotebookLM-style ingestion, see `references/open-notebook-active-content-pdf-sanitization.md`. It covers PDF-aware stripping of `/JS`, `/OpenAction`, `/AA`, form/XFA, and embedded-file hooks; syntax-token rescanning that ignores stream payload false positives; Ghostscript fallback; and OpenNotebook-specific ingestion/embedding verification.

## Notes

- `web_extract` is always the first choice for URLs when the goal is text extraction from a known safe/authorized document.
- pymupdf is the safe default: instant, no models, works everywhere.
- marker-pdf is for OCR, scanned docs, equations, and complex layouts; install it only when needed.
- marker-pdf downloads ~2.5GB of models to `~/.cache/huggingface/` on first use.
- For Word docs: `pip install python-docx` (better than OCR because it parses actual document structure).
- For PowerPoint: see the `powerpoint` skill (uses python-pptx).