---
name: ocr-and-documents
description: Extract text from PDFs and scanned documents. Use web_extract for remote URLs, pymupdf for local text-based PDFs, marker-pdf for OCR/scanned docs. For DOCX use python-docx, for PPTX see the powerpoint skill.
version: 2.3.0
author: Hermes Agent
license: MIT
metadata:
  hermes:
    tags: [PDF, Documents, Research, Arxiv, Text-Extraction, OCR]
    related_skills: [powerpoint]
---

# PDF & Document Extraction

For DOCX: use `python-docx` (parses actual document structure, far better than OCR).
For PPTX: see the `powerpoint` skill (uses `python-pptx` with full slide/notes support).
This skill covers **PDFs and scanned documents**.

## Step 1: Remote URL Available?

If the document has a URL, **always try `web_extract` first**:

```
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
web_extract(urls=["https://example.com/report.pdf"])
```

This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.

Only use local extraction when: the file is local, web_extract fails, or you need batch processing.

## Step 2: Choose Local Extractor

| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
|---------|-----------------|---------------------|
| **Text-based PDF** | ✅ | ✅ |
| **Scanned PDF (OCR)** | ❌ | ✅ (90+ languages) |
| **Tables** | ✅ (basic) | ✅ (high accuracy) |
| **Equations / LaTeX** | ❌ | ✅ |
| **Code blocks** | ❌ | ✅ |
| **Forms** | ❌ | ✅ |
| **Headers/footers removal** | ❌ | ✅ |
| **Reading order detection** | ❌ | ✅ |
| **Images extraction** | ✅ (embedded) | ✅ (with context) |
| **Images → text (OCR)** | ❌ | ✅ |
| **EPUB** | ✅ | ✅ |
| **Markdown output** | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
| **Install size** | ~25MB | ~3-5GB (PyTorch + models) |
| **Speed** | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |

**Decision**: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.

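
The decision rule can be sketched in code. This is a minimal illustration, not part of the helper scripts: the function names, the 50-character threshold for "this page has no real text layer", and the 5 GB constant (taken from the upper install-size figure in the table above) are all assumptions made for the example.

```python
import shutil

MARKER_DISK_GB = 5  # assumption: upper bound of the marker-pdf install size


def looks_scanned(page_texts, min_chars=50):
    """Heuristic (assumption): if most pages yield almost no text from
    direct extraction, the PDF is probably scanned and needs OCR."""
    if not page_texts:
        return True
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars)
    return sparse / len(page_texts) > 0.5


def pick_extractor(page_texts, needs_layout=False, free_gb=None):
    """Return 'pymupdf', 'marker-pdf', or 'insufficient-disk'.

    needs_layout stands in for equations, forms, or complex layout analysis.
    """
    if free_gb is None:
        free_gb = shutil.disk_usage(".").free / 1e9
    if not (needs_layout or looks_scanned(page_texts)):
        return "pymupdf"
    return "marker-pdf" if free_gb >= MARKER_DISK_GB else "insufficient-disk"


def read_page_texts(pdf_path):
    """Collect per-page text with pymupdf (imported lazily)."""
    import pymupdf

    with pymupdf.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

A typical call would be `pick_extractor(read_page_texts("document.pdf"))`. The heuristic does not detect PDFs whose text layer exists but is garbled, so treat its answer as a first guess.
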
If the user needs marker capabilities but the system lacks ~5GB free disk:

> "This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf which works for text-based PDFs but not scanned documents or equations."

## Special Case: PDF has a usable TOC/bookmarks but broken embedded text

Some PDFs are not truly scanned, yet direct text extraction is still unusable because the font mapping is garbage. Symptom: `pymupdf` returns readable structure in `doc.get_toc()` but page text looks like encoded junk.

In that case, do **not** treat the PDF as a normal text PDF. Use this pipeline instead:

1. Use `doc.get_toc()` first to discover real section boundaries.
2. Segment the document by bookmark/page ranges instead of fuzzy heading detection.
3. Render each page in the section to PNG at ~300 DPI with PyMuPDF.
4. OCR the rendered PNGs with `tesseract`.
5. Save both:
   - raw OCR text
   - rendered source page images
6. Apply only deterministic cleanup:
   - remove page-number lines
   - remove repeated running headers/footers
   - normalize quotes/apostrophes
   - join line-end hyphenation
   - preserve source wording (no paraphrase / no LLM rewriting)
7. If the goal is “full fidelity”, keep the source page images and expose them alongside the extracted text so users can visually verify the original page.

This pattern is especially strong when the PDF has high-quality bookmarks for chapters/entries (for example, one bookmark per section or per record), because the bookmarks become the canonical segmentation layer while OCR becomes only the text-recovery layer.

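
The pipeline above can be sketched roughly as follows. This is an illustrative outline under several assumptions: the helper names are hypothetical, only level-1 bookmarks are used as section boundaries, the `tesseract` CLI is assumed to be on `PATH`, and removal of repeated running headers/footers is left out because it needs cross-page comparison.

```python
import re
import subprocess
from pathlib import Path

QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def toc_to_ranges(toc, page_count):
    """Convert get_toc() entries [level, title, page, ...] (1-based pages)
    into (title, first_page, last_page) with 0-based inclusive pages."""
    top = [(title, page - 1) for level, title, page, *_ in toc if level == 1]
    ranges = []
    for idx, (title, start) in enumerate(top):
        next_start = top[idx + 1][1] if idx + 1 < len(top) else page_count
        ranges.append((title, start, max(start, next_start - 1)))
    return ranges


def clean_ocr_text(text):
    """Deterministic cleanup only: no paraphrasing, no LLM rewriting."""
    text = text.translate(QUOTES)  # normalize curly quotes/apostrophes
    lines = [ln for ln in text.splitlines() if not re.fullmatch(r"\s*\d+\s*", ln)]
    text = "\n".join(lines)  # page-number-only lines are now gone
    # Join line-end hyphenation. Caveat: this also merges genuine hyphenated
    # compounds that happen to break at a line end.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def ocr_section(pdf_path, first_page, last_page, out_dir, dpi=300):
    """Render pages to PNG, OCR them, and keep both images and raw text."""
    import pymupdf  # imported lazily so the pure helpers work without it

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    raw_pages = []
    with pymupdf.open(pdf_path) as doc:
        for pno in range(first_page, last_page + 1):
            png = out / f"page_{pno + 1:04d}.png"
            doc[pno].get_pixmap(dpi=dpi).save(str(png))
            result = subprocess.run(
                ["tesseract", str(png), "stdout"],
                capture_output=True, text=True, check=True,
            )
            (out / f"page_{pno + 1:04d}.raw.txt").write_text(
                result.stdout, encoding="utf-8"
            )
            raw_pages.append(result.stdout)
    return clean_ocr_text("\n".join(raw_pages))
```

To process a whole document, call `toc_to_ranges(doc.get_toc(), doc.page_count)` and pass each range to `ocr_section`. Keeping the raw `.raw.txt` files next to the PNGs means the cleanup step can be re-run or audited without repeating OCR.
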

---

## pymupdf (lightweight)

```bash
pip install pymupdf pymupdf4llm
```

**Via helper script**:

```bash
python scripts/extract_pymupdf.py document.pdf              # Plain text
python scripts/extract_pymupdf.py document.pdf --markdown   # Markdown
python scripts/extract_pymupdf.py document.pdf --tables     # Tables
python scripts/extract_pymupdf.py document.pdf --images out/  # Extract images
python scripts/extract_pymupdf.py document.pdf --metadata   # Title, author, pages
python scripts/extract_pymupdf.py document.pdf --pages 0-4  # Specific pages
```

**Inline**:

```bash
python3 -c "
import pymupdf
doc = pymupdf.open('document.pdf')
for page in doc:
    print(page.get_text())
"
```

---

## marker-pdf (high-quality OCR)

```bash
# Check disk space first
python scripts/extract_marker.py --check

pip install marker-pdf
```

**Via helper script**:

```bash
python scripts/extract_marker.py document.pdf                   # Markdown
python scripts/extract_marker.py document.pdf --json            # JSON with metadata
python scripts/extract_marker.py document.pdf --output_dir out/ # Save images
python scripts/extract_marker.py scanned.pdf                    # Scanned PDF (OCR)
python scripts/extract_marker.py document.pdf --use_llm         # LLM-boosted accuracy
```

**CLI** (installed with marker-pdf):

```bash
marker_single document.pdf --output_dir ./output
marker /path/to/folder --workers 4    # Batch
```

---

## Arxiv Papers

```
# Abstract only (fast)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])

# Full paper
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])

# Search
web_search(query="arxiv GRPO reinforcement learning 2026")
```

## Split, Merge & Search

pymupdf handles these natively; use `execute_code` or inline Python:

```python
# Split: extract pages 1-5 to a new PDF
import pymupdf

doc = pymupdf.open("report.pdf")
new = pymupdf.open()
new.insert_pdf(doc, from_page=0, to_page=4)  # page indices are 0-based and inclusive
new.save("pages_1-5.pdf")
```

```python
# Merge multiple PDFs
import pymupdf

result = pymupdf.open()
for path in ["a.pdf", "b.pdf", "c.pdf"]:
    with pymupdf.open(path) as src:  # close each source after copying its pages
        result.insert_pdf(src)
result.save("merged.pdf")
```

```python
# Search for text across all pages
import pymupdf

doc = pymupdf.open("report.pdf")
for i, page in enumerate(doc):
    results = page.search_for("revenue")  # list of match rectangles
    if results:
        print(f"Page {i+1}: {len(results)} match(es)")
        print(page.get_text("text"))  # full text of the matching page
```

No extra dependencies needed: pymupdf covers split, merge, search, and text extraction in one package.

---

## Brand Guide / Design PDF Extraction

When extracting from brand standards PDFs, you need more than text: you need exact color values, images, and structured design tokens.

### Full Pipeline

```python
import os
import pymupdf

doc = pymupdf.open('brand-guide.pdf')
out_dir = '/path/to/brand-assets'
os.makedirs(out_dir, exist_ok=True)

# 1. Extract ALL embedded images (logos, photos, swatches)
for page_num in range(len(doc)):
    for img_idx, img in enumerate(doc[page_num].get_images(full=True)):
        base_image = doc.extract_image(img[0])  # img[0] is the image xref
        if base_image and len(base_image["image"]) > 500:  # skip tiny images
            fname = f"page{page_num+1:02d}_img{img_idx:02d}.{base_image['ext']}"
            with open(f"{out_dir}/{fname}", "wb") as f:
                f.write(base_image["image"])

# 2. Render pages as PNGs for visual analysis
for i in range(len(doc)):
    doc[i].get_pixmap(dpi=150).save(f"{out_dir}/page_{i+1:02d}.png")

# 3. Pixel-sample EXACT colors from swatch pages (PDF color specs are often wrong!)
page = doc[10]  # color palette page (0-based index; adjust for your PDF)
pix = page.get_pixmap(dpi=300)  # high DPI for accuracy
w, h = pix.width, pix.height

# Sample multiple points in the swatch area, average them
samples = []
for x in range(int(w*0.1), int(w*0.35), max(1, int(w*0.05))):
    for y in range(int(h*0.3), int(h*0.7), max(1, int(h*0.1))):
        r, g, b = pix.pixel(x, y)[:3]
        if r < 100:  # filter for the color you're sampling
            samples.append((r, g, b))
if not samples:
    raise ValueError("no pixels matched the filter; adjust the region or threshold")
avg = tuple(sum(c)//len(samples) for c in zip(*samples))
hex_color = f"#{avg[0]:02X}{avg[1]:02X}{avg[2]:02X}"
```

### Key Pitfall: PDF Color Specs Are Unreliable

In the Jungle Studio brand PDF, the printed CMYK/HEX values for "Charcoal" were WRONG (the PDF showed White Sand's values for both colors, a production error). **Always pixel-sample the actual rendered swatches at 300dpi** to verify. The visual swatch is the ground truth, not the text overlay.

### Output Structure

Write a BRAND-STYLE-GUIDE.md with:

- Color palette table (Name | HEX | RGB | Usage)
- CSS design tokens (`:root` variables)
- Tailwind theme extension config
- Typography specs (font names, weights, tracking, usage rules)
- Logo variations and usage guidelines
- Asset index mapping extracted images to their content

---

## Notes

- `web_extract` is always the first choice for URLs
- pymupdf is the safe default: instant, no models, works everywhere
- marker-pdf is for OCR, scanned docs, equations, complex layouts; install only when needed
- Both helper scripts accept `--help` for full usage
- marker-pdf downloads ~2.5GB of models to `~/.cache/huggingface/` on first use
- For Word docs: `pip install python-docx` (better than OCR because it parses the actual structure)
- For PowerPoint: see the `powerpoint` skill (uses python-pptx)
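
For the Output Structure step of the brand-guide workflow, the sampled colors can be turned into the palette table rows and `:root` tokens with a few lines of Python. This is only a sketch: the function names and the `--color-<slug>` naming scheme are assumptions, and the RGB values in the usage note are placeholders, not values from any real brand PDF.

```python
import re


def rgb_to_hex(rgb):
    """Format an (r, g, b) tuple of 0-255 ints as an uppercase hex color."""
    r, g, b = rgb
    return f"#{r:02X}{g:02X}{b:02X}"


def css_var_name(color_name):
    """Build a CSS custom property name (naming scheme is an assumption)."""
    slug = re.sub(r"[^a-z0-9]+", "-", color_name.lower()).strip("-")
    return f"--color-{slug}"


def palette_to_css(palette):
    """Render a {name: (r, g, b)} mapping as a :root block of design tokens."""
    lines = [":root {"]
    for name, rgb in palette.items():
        lines.append(f"  {css_var_name(name)}: {rgb_to_hex(rgb)};")
    lines.append("}")
    return "\n".join(lines)


def palette_to_table(palette):
    """Render the Name | HEX | RGB columns of the palette table as Markdown."""
    rows = ["| Name | HEX | RGB |", "|------|-----|-----|"]
    for name, rgb in palette.items():
        rows.append(f"| {name} | {rgb_to_hex(rgb)} | {rgb[0]}, {rgb[1]}, {rgb[2]} |")
    return "\n".join(rows)
```

With a placeholder palette such as `{"Charcoal": (51, 51, 51)}`, `palette_to_css` yields a `:root` block containing `--color-charcoal: #333333;`. The Usage column is left for manual entry, since it comes from reading the guide rather than from pixel sampling.
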