terminal-docs-research-fallback

/home/avalon/.hermes/skills/research/terminal-docs-research-fallback/SKILL.md · raw

Terminal Docs Research Fallback

Use this when normal web tools fail but the target documentation is public and fetchable over HTTP.

When to Use

Workflow

  1. Try normal web tools first. - If they fail, capture the failure reason in your notes.

  2. Fetch docs directly with terminal using Python requests. Example: bash python - <<'PY' import requests url = 'https://docs.example.com/page' r = requests.get(url, timeout=30, headers={'User-Agent':'Mozilla/5.0'}) print('STATUS', r.status_code, 'LEN', len(r.text)) print(r.text[:500]) PY

  3. Extract high-signal metadata first. - HTML <title> - meta description / og:description - obvious keyword hits for the capability you care about

  4. Use targeted keyword-snippet extraction instead of trying to fully parse the page. Example: bash python - <<'PY' import requests, re, html txt = requests.get(URL, headers={'User-Agent':'Mozilla/5.0'}, timeout=30).text for kw in ['webhooks', 'customer portal', 'Payment Element']: m = re.search(r'(.{0,140}'+re.escape(kw)+r'.{0,260})', txt, re.I|re.S) if m: print('---', kw) print(' '.join(html.unescape(m.group(1)).split())[:500]) PY

  5. Cross-check with local code/schema in parallel. - Search repo for existing integrations, placeholders, env vars, or fake implementations. - Inspect DB/schema to compare current state vs vendor requirements.

  6. Summarize only grounded conclusions. - Separate “what docs support” from “what this codebase currently has”. - Call out any assumptions explicitly.

Good Targets for This Method

Pitfalls

Escalation Ladder When HTTP Fetch Fails

If curl/requests returns a Cloudflare "Just a moment…" challenge, a tiny SPA shell, or login wall, do NOT give up and ask the user to paste contents. Walk this ladder in order:

  1. web_extract — sometimes Firecrawl handles JS where curl doesn't. May fail on Cloudflare or quota.
  2. Hermes built-in browser toolset — Playwright-based, free, runs locally. Check with hermes tools list | grep browser. If listed but not actually working, the package is missing. Install path that actually works on Alex's Ubuntu VPS (system Python 3.12 is PEP 668-restricted, so plain pip install playwright and the playwright binary on $PATH both fail): bash VENV=~/.hermes/hermes-agent/venv $VENV/bin/pip3 install playwright $VENV/bin/python -m playwright install chromium # NOT: pip install playwright (PEP 668) # NOT: playwright install chromium (binary not on PATH; `apt install node-playwright` is wrong) Verify with $VENV/bin/python -c "from playwright.sync_api import sync_playwright; print('ok')". For SPA + Cloudflare pages, drive it directly via a one-off script — set headless=True, use a real User-Agent, wait_until='domcontentloaded', then poll page.title() for "Just a moment" up to ~40 s before reading document.body.innerText. This defeats developer.classpass.com-style Cloudflare challenges in headless mode.
  3. Camoufox (tools/browser_camofox.py) — stealth Firefox fork, defeats Cloudflare bot checks more reliably than vanilla Playwright when even the above fails. Same venv install pattern.
  4. claude -p — NOT a headless option. claude --chrome is the Claude-in-Chrome browser-extension integration; it does not spin up a headless CDP browser for unattended doc fetching. Do not list it as a fallback for "read these 3 URLs for me" tasks. Skip to (5) if rungs 1–3 are exhausted.
  5. MCP browser server — Browserbase / Apify / Bright Data via claude mcp add for hosted anti-bot Chromium. Paid usage but no monthly minimum on some.
  6. Only then ask the user to paste contents or screenshot.

Pick the earliest rung that works. For Cloudflare-protected JS SPAs (developer portals, gated docs), rung 2 (Hermes Playwright) is the right default once installed — it's free, local, and one-time setup. Firecrawl's paid tier ($19/mo) is an option but unnecessary when the Playwright venv works.

Verification

Before finalizing, make sure you have: - proof the page was reachable (STATUS 200 or equivalent) - extracted title/meta or keyword snippets - codebase evidence for current implementation state - DB/schema evidence if the task involves persisted data