For DOCX: use python-docx (parses actual document structure, far better than OCR).
For PPTX: see the powerpoint skill (uses python-pptx with full slide/notes support).
This skill covers PDFs and scanned documents.
If the document has a URL, always try web_extract first:
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
web_extract(urls=["https://example.com/report.pdf"])
This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.
Only use local extraction when: the file is local, web_extract fails, or you need batch processing.
| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
|---|---|---|
| Text-based PDF | ✅ | ✅ |
| Scanned PDF (OCR) | ❌ | ✅ (90+ languages) |
| Tables | ✅ (basic) | ✅ (high accuracy) |
| Equations / LaTeX | ❌ | ✅ |
| Code blocks | ❌ | ✅ |
| Forms | ❌ | ✅ |
| Headers/footers removal | ❌ | ✅ |
| Reading order detection | ❌ | ✅ |
| Images extraction | ✅ (embedded) | ✅ (with context) |
| Images → text (OCR) | ❌ | ✅ |
| EPUB | ✅ | ✅ |
| Markdown output | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
| Install size | ~25MB | ~3-5GB (PyTorch + models) |
| Speed | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |
Decision: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.
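The decision rule above can be written down as a tiny lookup. This is only a sketch: the feature labels (`ocr`, `equations`, `forms`, `complex_layout`) and the `pick_backend` helper are hypothetical names chosen for illustration, not options understood by pymupdf or marker-pdf.

```python
# Hypothetical helper: the feature labels are illustrative, not real library flags.
MARKER_ONLY_FEATURES = {"ocr", "equations", "forms", "complex_layout"}


def pick_backend(required_features):
    """Return 'marker-pdf' if any requested feature needs it, else 'pymupdf'."""
    needed = {feature.lower() for feature in required_features}
    return "marker-pdf" if needed & MARKER_ONLY_FEATURES else "pymupdf"


if __name__ == "__main__":
    print(pick_backend(["tables", "markdown"]))
    print(pick_backend(["tables", "ocr"]))
```

Defaulting to pymupdf keeps the lightweight path as the fallback, so the multi-gigabyte install is only triggered by an explicit need.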
If the user needs marker capabilities but the system has less than ~5GB of free disk space, tell them:
"This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf, which works for text-based PDFs but not for scanned documents or equations."
Some PDFs are not truly scanned, yet direct text extraction is still unusable because the font mapping is garbage. Symptom: pymupdf returns readable structure in doc.get_toc() but page text looks like encoded junk. In that case, do not treat the PDF as a normal text PDF.
Use this pipeline instead:
- Call doc.get_toc() first to discover real section boundaries.
- Recover the text of each section with OCR (tesseract).

This pattern is especially strong when the PDF has high-quality bookmarks for chapters/entries (for example, one bookmark per section or per record), because the bookmarks become the canonical segmentation layer while OCR becomes only the text-recovery layer.
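A minimal sketch of the bookmark-then-OCR pipeline described above. Two assumptions are made here: `toc_to_page_ranges` and `ocr_sections` are hypothetical helper names, and the OCR step uses PyMuPDF's built-in Tesseract bridge (`Page.get_textpage_ocr`), which needs Tesseract and its language data installed. Any other Tesseract wrapper could be swapped in for that step.

```python
def toc_to_page_ranges(toc, page_count):
    """Convert get_toc()-style entries [level, title, page, ...] (1-based pages)
    into (title, first_page, last_page) tuples using 0-based inclusive pages."""
    entries = [
        (title, page - 1)
        for _level, title, page, *_rest in toc
        if 1 <= page <= page_count  # skip bookmarks that point nowhere
    ]
    ranges = []
    for idx, (title, start) in enumerate(entries):
        # A section runs until the page before the next bookmark starts.
        next_start = entries[idx + 1][1] if idx + 1 < len(entries) else page_count
        ranges.append((title, start, max(start, next_start - 1)))
    return ranges


def ocr_sections(pdf_path, language="eng", dpi=300):
    """Return [(title, ocr_text), ...], segmented by bookmarks.

    Assumes PyMuPDF with a working Tesseract installation."""
    import pymupdf  # imported lazily so toc_to_page_ranges works without it

    doc = pymupdf.open(pdf_path)
    sections = []
    for title, first, last in toc_to_page_ranges(doc.get_toc(), doc.page_count):
        texts = []
        for page_no in range(first, last + 1):
            page = doc[page_no]
            # full=True OCRs the rendered page and ignores the broken text layer.
            textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
            texts.append(page.get_text(textpage=textpage))
        sections.append((title, "\n".join(texts)))
    return sections
```

Keeping the segmentation in a pure function makes it easy to check the section boundaries before spending time on OCR.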
pip install pymupdf pymupdf4llm
Via helper script:
python scripts/extract_pymupdf.py document.pdf # Plain text
python scripts/extract_pymupdf.py document.pdf --markdown # Markdown
python scripts/extract_pymupdf.py document.pdf --tables # Tables
python scripts/extract_pymupdf.py document.pdf --images out/ # Extract images
python scripts/extract_pymupdf.py document.pdf --metadata # Title, author, pages
python scripts/extract_pymupdf.py document.pdf --pages 0-4 # Specific pages
Inline:
python3 -c "
import pymupdf
doc = pymupdf.open('document.pdf')
for page in doc:
    print(page.get_text())
"
# Check disk space first
python scripts/extract_marker.py --check
pip install marker-pdf
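For illustration, a free-space check before installing could look roughly like the sketch below. It is not the implementation of `extract_marker.py --check`; the helper names and the 5GB threshold constant are assumptions taken from the figures quoted in this document.

```python
import shutil

REQUIRED_GB = 5  # approximate marker-pdf footprint (PyTorch + models), per this document


def free_gb(path="."):
    """Free disk space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1024**3


def marker_space_message(free, required=REQUIRED_GB):
    """Return None when there is enough room, otherwise a message for the user."""
    if free >= required:
        return None
    return (
        "This document needs OCR/advanced extraction (marker-pdf), which requires "
        f"~{required}GB for PyTorch and models. Your system has {free:.1f}GB free. "
        "Options: free up space, provide a URL so I can use web_extract, or I can "
        "try pymupdf, which works for text-based PDFs but not for scanned documents "
        "or equations."
    )


if __name__ == "__main__":
    message = marker_space_message(free_gb())
    print(message or "Enough free space for marker-pdf.")
```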
Via helper script:
python scripts/extract_marker.py document.pdf # Markdown
python scripts/extract_marker.py document.pdf --json # JSON with metadata
python scripts/extract_marker.py document.pdf --output_dir out/ # Save images
python scripts/extract_marker.py scanned.pdf # Scanned PDF (OCR)
python scripts/extract_marker.py document.pdf --use_llm # LLM-boosted accuracy
CLI (installed with marker-pdf):
marker_single document.pdf --output_dir ./output
marker /path/to/folder --workers 4 # Batch
# Abstract only (fast)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])
# Full paper
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
# Search
web_search(query="arxiv GRPO reinforcement learning 2026")
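The abstract and full-paper URLs differ only in their path segment, so one can be derived from the other. The sketch below uses a hypothetical helper, `arxiv_urls`, and assumes new-style arXiv identifiers (for example `2402.03300`, optionally with a version suffix); old-style identifiers such as `hep-th/9901001` are not handled.

```python
import re

# New-style arXiv IDs: YYMM.NNNNN with an optional version suffix (v2, v3, ...).
ARXIV_ID = re.compile(r"\d{4}\.\d{4,5}(?:v\d+)?")


def arxiv_urls(reference):
    """Build the abstract and PDF URLs from an arXiv ID or any arXiv URL."""
    match = ARXIV_ID.search(reference)
    if match is None:
        raise ValueError(f"no new-style arXiv ID found in {reference!r}")
    paper_id = match.group(0)
    return {
        "abs": f"https://arxiv.org/abs/{paper_id}",  # abstract only (fast)
        "pdf": f"https://arxiv.org/pdf/{paper_id}",  # full paper
    }


if __name__ == "__main__":
    print(arxiv_urls("https://arxiv.org/abs/2402.03300"))
```

Either URL can then be passed to web_extract as shown above.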
pymupdf handles splitting, merging, and searching natively; use execute_code or inline Python:
# Split: extract pages 1-5 to a new PDF
import pymupdf
doc = pymupdf.open("report.pdf")
new = pymupdf.open()
for i in range(5):
    new.insert_pdf(doc, from_page=i, to_page=i)
new.save("pages_1-5.pdf")
# Merge multiple PDFs
import pymupdf
result = pymupdf.open()
for path in ["a.pdf", "b.pdf", "c.pdf"]:
    result.insert_pdf(pymupdf.open(path))
result.save("merged.pdf")
# Search for text across all pages
import pymupdf
doc = pymupdf.open("report.pdf")
for i, page in enumerate(doc):
    results = page.search_for("revenue")
    if results:
        print(f"Page {i+1}: {len(results)} match(es)")
        print(page.get_text("text"))
No extra dependencies needed — pymupdf covers split, merge, search, and text extraction in one package.
When extracting from brand standards PDFs, you need more than text — you need exact color values, images, and structured design tokens.
import pymupdf, os
doc = pymupdf.open('brand-guide.pdf')
out_dir = '/path/to/brand-assets'
os.makedirs(out_dir, exist_ok=True)
# 1. Extract ALL embedded images (logos, photos, swatches)
for page_num in range(len(doc)):
    for img_idx, img in enumerate(doc[page_num].get_images(full=True)):
        base_image = doc.extract_image(img[0])
        if base_image and len(base_image["image"]) > 500:  # skip tiny
            fname = f"page{page_num+1:02d}_img{img_idx:02d}.{base_image['ext']}"
            with open(f"{out_dir}/{fname}", "wb") as f:
                f.write(base_image["image"])
# 2. Render pages as PNGs for visual analysis
for i in range(len(doc)):
    doc[i].get_pixmap(dpi=150).save(f"{out_dir}/page_{i+1:02d}.png")
# 3. Pixel-sample EXACT colors from swatch pages (PDF color specs are often wrong!)
page = doc[10] # color palette page
pix = page.get_pixmap(dpi=300) # high DPI for accuracy
w, h = pix.width, pix.height
# Sample multiple points in the swatch area, average them
samples = []
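for x in range(int(w*0.1), int(w*0.35), int(w*0.05)):
    for y in range(int(h*0.3), int(h*0.7), int(h*0.1)):
        r, g, b = pix.pixel(x, y)[:3]
        if r < 100:  # filter for the color you're sampling
            samples.append((r, g, b))
if not samples:  # nothing passed the filter, so there is nothing to average
    raise ValueError("no pixels matched the filter; adjust the sampling area or threshold")
avg = tuple(sum(c)//len(samples) for c in zip(*samples))
hex_color = f"#{avg[0]:02X}{avg[1]:02X}{avg[2]:02X}"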
In the Jungle Studio brand PDF, the printed CMYK/HEX values for "Charcoal" were WRONG (showed White Sand's values for both colors — a production error). Always pixel-sample the actual rendered swatches at 300dpi to verify. The visual swatch is the ground truth, not the text overlay.
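To make the verification step concrete, the sketch below compares a printed HEX value against a pixel-sampled RGB triple and flags a mismatch. The helper names, the per-channel tolerance of 12, and the color values in the example are all made up for illustration; they are not the actual Jungle Studio values.

```python
def hex_to_rgb(hex_color):
    """'#36393D' -> (54, 57, 61)."""
    value = hex_color.lstrip("#")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def color_mismatch(printed_hex, sampled_rgb, tolerance=12):
    """True when any channel differs by more than `tolerance`.

    The tolerance is an assumed allowance for anti-aliasing and rendering
    differences; tune it for the document at hand."""
    printed = hex_to_rgb(printed_hex)
    return any(abs(p - s) > tolerance for p, s in zip(printed, sampled_rgb))


if __name__ == "__main__":
    # Hypothetical values: the printed label disagrees with the rendered swatch.
    print(color_mismatch("#F4EDE4", (54, 57, 61)))
    # Hypothetical values: the label and the swatch agree within tolerance.
    print(color_mismatch("#36393D", (55, 57, 60)))
```

When a mismatch is flagged, record the sampled value in the style guide and note the discrepancy with the printed one.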
Write a BRAND-STYLE-GUIDE.md with:
- Color palette table (Name | HEX | RGB | Usage)
- CSS design tokens (:root variables)
- Tailwind theme extension config
- Typography specs (font names, weights, tracking, usage rules)
- Logo variations and usage guidelines
- Asset index mapping extracted images to their content
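As a rough illustration of the first two items in the list above, the sketch below renders a palette table and matching CSS `:root` tokens from a small dictionary. The function names, the `--color-*` token naming scheme, and the HEX values and usage notes are placeholders invented for the example; substitute the pixel-sampled values from the actual guide.

```python
def palette_table(palette):
    """Render {name: (hex, usage)} as a Name | HEX | RGB | Usage markdown table."""
    lines = ["| Name | HEX | RGB | Usage |", "|---|---|---|---|"]
    for name, (hex_color, usage) in palette.items():
        value = hex_color.lstrip("#")
        r, g, b = (int(value[i:i + 2], 16) for i in (0, 2, 4))
        lines.append(f"| {name} | {hex_color.upper()} | {r}, {g}, {b} | {usage} |")
    return "\n".join(lines)


def css_tokens(palette):
    """Render the same palette as CSS custom properties under :root."""
    lines = [":root {"]
    for name, (hex_color, _usage) in palette.items():
        token = name.lower().replace(" ", "-")
        lines.append(f"  --color-{token}: {hex_color.upper()};")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    palette = {
        "White Sand": ("#f4ede4", "Backgrounds"),
        "Charcoal": ("#36393d", "Body text"),
    }
    print(palette_table(palette))
    print(css_tokens(palette))
```

Generating both outputs from one dictionary keeps the table and the tokens from drifting apart.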
- web_extract is always first choice for URLs
- Run the helper scripts with --help for full usage
- marker-pdf downloads its models to ~/.cache/huggingface/ on first use
- DOCX: pip install python-docx (better than OCR; parses actual structure)
- PPTX: see the powerpoint skill (uses python-pptx)