Use this when normal web tools fail but the target documentation is public and fetchable over HTTP.
web_search or web_extract fails due to quota/credits/scraper restrictionsrequests/curlTry normal web tools first. - If they fail, capture the failure reason in your notes.
Fetch docs directly with terminal using Python requests.
Example:
bash
python - <<'PY'
import requests
url = 'https://docs.example.com/page'
r = requests.get(url, timeout=30, headers={'User-Agent':'Mozilla/5.0'})
print('STATUS', r.status_code, 'LEN', len(r.text))
print(r.text[:500])
PY
Extract high-signal metadata first.
- HTML <title>
- meta description / og:description
- obvious keyword hits for the capability you care about
Use targeted keyword-snippet extraction instead of trying to fully parse the page.
Example:
bash
python - <<'PY'
import requests, re, html
txt = requests.get(URL, headers={'User-Agent':'Mozilla/5.0'}, timeout=30).text
for kw in ['webhooks', 'customer portal', 'Payment Element']:
m = re.search(r'(.{0,140}'+re.escape(kw)+r'.{0,260})', txt, re.I|re.S)
if m:
print('---', kw)
print(' '.join(html.unescape(m.group(1)).split())[:500])
PY
Cross-check with local code/schema in parallel. - Search repo for existing integrations, placeholders, env vars, or fake implementations. - Inspect DB/schema to compare current state vs vendor requirements.
Summarize only grounded conclusions. - Separate “what docs support” from “what this codebase currently has”. - Call out any assumptions explicitly.
If curl/requests returns a Cloudflare "Just a moment…" challenge, a tiny SPA shell, or login wall, do NOT give up and ask the user to paste contents. Walk this ladder in order:
web_extract — sometimes Firecrawl handles JS where curl doesn't. May fail on Cloudflare or quota.browser toolset — Playwright-based, free, runs locally. Check with hermes tools list | grep browser. If listed but not actually working, the package is missing. Install path that actually works on Alex's Ubuntu VPS (system Python 3.12 is PEP 668-restricted, so plain pip install playwright and the playwright binary on $PATH both fail):
bash
VENV=~/.hermes/hermes-agent/venv
$VENV/bin/pip3 install playwright
$VENV/bin/python -m playwright install chromium
# NOT: pip install playwright (PEP 668)
# NOT: playwright install chromium (binary not on PATH; `apt install node-playwright` is wrong)
Verify with $VENV/bin/python -c "from playwright.sync_api import sync_playwright; print('ok')". For SPA + Cloudflare pages, drive it directly via a one-off script — set headless=True, use a real User-Agent, wait_until='domcontentloaded', then poll page.title() for "Just a moment" up to ~40 s before reading document.body.innerText. This defeats developer.classpass.com-style Cloudflare challenges in headless mode.tools/browser_camofox.py) — stealth Firefox fork, defeats Cloudflare bot checks more reliably than vanilla Playwright when even the above fails. Same venv install pattern.claude -p — NOT a headless option. claude --chrome is the Claude-in-Chrome browser-extension integration; it does not spin up a headless CDP browser for unattended doc fetching. Do not list it as a fallback for "read these 3 URLs for me" tasks. Skip to (5) if rungs 1–3 are exhausted.claude mcp add for hosted anti-bot Chromium. Paid usage but no monthly minimum on some.Pick the earliest rung that works. For Cloudflare-protected JS SPAs (developer portals, gated docs), rung 2 (Hermes Playwright) is the right default once installed — it's free, local, and one-time setup. Firecrawl's paid tier ($19/mo) is an option but unnecessary when the Playwright venv works.
Before finalizing, make sure you have:
- proof the page was reachable (STATUS 200 or equivalent)
- extracted title/meta or keyword snippets
- codebase evidence for current implementation state
- DB/schema evidence if the task involves persisted data