terminal-docs-research-fallback

/home/avalon/.hermes/skills/research/terminal-docs-research-fallback/SKILL.md · raw

Terminal Docs Research Fallback

Use this when normal web tools fail but the target documentation is public and fetchable over HTTP.

When to Use

web_search or web_extract fails due to quota/credits/scraper restrictions
You still need grounded current documentation facts
Public docs are accessible directly with requests/curl
You need a concise research summary, not full-page browsing

Workflow

Try normal web tools first. - If they fail, capture the failure reason in your notes.
Fetch docs directly with terminal using Python requests. Example: bash python - <<'PY' import requests url = 'https://docs.example.com/page' r = requests.get(url, timeout=30, headers={'User-Agent':'Mozilla/5.0'}) print('STATUS', r.status_code, 'LEN', len(r.text)) print(r.text[:500]) PY
Extract high-signal metadata first. - HTML <title> - meta description / og:description - obvious keyword hits for the capability you care about
Use targeted keyword-snippet extraction instead of trying to fully parse the page. Example: bash python - <<'PY' import requests, re, html txt = requests.get(URL, headers={'User-Agent':'Mozilla/5.0'}, timeout=30).text for kw in ['webhooks', 'customer portal', 'Payment Element']: m = re.search(r'(.{0,140}'+re.escape(kw)+r'.{0,260})', txt, re.I|re.S) if m: print('---', kw) print(' '.join(html.unescape(m.group(1)).split())[:500]) PY
Cross-check with local code/schema in parallel. - Search repo for existing integrations, placeholders, env vars, or fake implementations. - Inspect DB/schema to compare current state vs vendor requirements.
Summarize only grounded conclusions. - Separate “what docs support” from “what this codebase currently has”. - Call out any assumptions explicitly.

Good Targets for This Method

Stripe docs
public SaaS API docs
SDK documentation pages
billing / webhook / tax docs

Pitfalls

Do not claim details you did not actually extract.
Large HTML pages are noisy; use keyword snippets, not full dumps.
Pages may be JS-heavy; if HTTP fetch fails to contain content, switch to browser tools (see escalation ladder below).
Quote doc claims conservatively unless you extracted exact wording.
Do NOT declare a question a "blocker for the vendor" until you have tried the doc-fetch ladder yourself. The user expects research before escalation.

Escalation Ladder When HTTP Fetch Fails

If curl/requests returns a Cloudflare "Just a moment…" challenge, a tiny SPA shell, or login wall, do NOT give up and ask the user to paste contents. Walk this ladder in order:

web_extract — sometimes Firecrawl handles JS where curl doesn't. May fail on Cloudflare or quota.
Hermes built-in browser toolset — Playwright-based, free, runs locally. Check with hermes tools list | grep browser. If listed but not actually working, the package is missing. Install path that actually works on Alex's Ubuntu VPS (system Python 3.12 is PEP 668-restricted, so plain pip install playwright and the playwright binary on $PATH both fail): bash VENV=~/.hermes/hermes-agent/venv $VENV/bin/pip3 install playwright $VENV/bin/python -m playwright install chromium # NOT: pip install playwright (PEP 668) # NOT: playwright install chromium (binary not on PATH; `apt install node-playwright` is wrong) Verify with $VENV/bin/python -c "from playwright.sync_api import sync_playwright; print('ok')". For SPA + Cloudflare pages, drive it directly via a one-off script — set headless=True, use a real User-Agent, wait_until='domcontentloaded', then poll page.title() for "Just a moment" up to ~40 s before reading document.body.innerText. This defeats developer.classpass.com-style Cloudflare challenges in headless mode.
Camoufox (tools/browser_camofox.py) — stealth Firefox fork, defeats Cloudflare bot checks more reliably than vanilla Playwright when even the above fails. Same venv install pattern.
claude -p — NOT a headless option. claude --chrome is the Claude-in-Chrome browser-extension integration; it does not spin up a headless CDP browser for unattended doc fetching. Do not list it as a fallback for "read these 3 URLs for me" tasks. Skip to (5) if rungs 1–3 are exhausted.
MCP browser server — Browserbase / Apify / Bright Data via claude mcp add for hosted anti-bot Chromium. Paid usage but no monthly minimum on some.
Only then ask the user to paste contents or screenshot.

Pick the earliest rung that works. For Cloudflare-protected JS SPAs (developer portals, gated docs), rung 2 (Hermes Playwright) is the right default once installed — it's free, local, and one-time setup. Firecrawl's paid tier ($19/mo) is an option but unnecessary when the Playwright venv works.

Verification

Before finalizing, make sure you have: - proof the page was reachable (STATUS 200 or equivalent) - extracted title/meta or keyword snippets - codebase evidence for current implementation state - DB/schema evidence if the task involves persisted data