--- name: terminal-docs-research-fallback description: Fallback workflow for researching vendor docs when web_search/web_extract fail (for example due to credits or scraper limits) by using terminal HTTP fetches plus targeted string extraction. --- # Terminal Docs Research Fallback Use this when normal web tools fail but the target documentation is public and fetchable over HTTP. ## When to Use - `web_search` or `web_extract` fails due to quota/credits/scraper restrictions - You still need grounded current documentation facts - Public docs are accessible directly with `requests`/`curl` - You need a concise research summary, not full-page browsing ## Workflow 1. Try normal web tools first. - If they fail, capture the failure reason in your notes. 2. Fetch docs directly with `terminal` using Python `requests`. Example: ```bash python - <<'PY' import requests url = 'https://docs.example.com/page' r = requests.get(url, timeout=30, headers={'User-Agent':'Mozilla/5.0'}) print('STATUS', r.status_code, 'LEN', len(r.text)) print(r.text[:500]) PY ``` 3. Extract high-signal metadata first. - HTML `` - meta description / `og:description` - obvious keyword hits for the capability you care about 4. Use targeted keyword-snippet extraction instead of trying to fully parse the page. Example: ```bash python - <<'PY' import requests, re, html txt = requests.get(URL, headers={'User-Agent':'Mozilla/5.0'}, timeout=30).text for kw in ['webhooks', 'customer portal', 'Payment Element']: m = re.search(r'(.{0,140}'+re.escape(kw)+r'.{0,260})', txt, re.I|re.S) if m: print('---', kw) print(' '.join(html.unescape(m.group(1)).split())[:500]) PY ``` 5. Cross-check with local code/schema in parallel. - Search repo for existing integrations, placeholders, env vars, or fake implementations. - Inspect DB/schema to compare current state vs vendor requirements. 6. Summarize only grounded conclusions. - Separate “what docs support” from “what this codebase currently has”. - Call out any assumptions explicitly. ## Good Targets for This Method - Stripe docs - public SaaS API docs - SDK documentation pages - billing / webhook / tax docs ## Pitfalls - Do not claim details you did not actually extract. - Large HTML pages are noisy; use keyword snippets, not full dumps. - Pages may be JS-heavy; if HTTP fetch fails to contain content, switch to browser tools (see escalation ladder below). - Quote doc claims conservatively unless you extracted exact wording. - Do NOT declare a question a "blocker for the vendor" until you have tried the doc-fetch ladder yourself. The user expects research before escalation. ## Escalation Ladder When HTTP Fetch Fails If `curl`/`requests` returns a Cloudflare "Just a moment…" challenge, a tiny SPA shell, or login wall, do NOT give up and ask the user to paste contents. Walk this ladder in order: 1. **`web_extract`** — sometimes Firecrawl handles JS where curl doesn't. May fail on Cloudflare or quota. 2. **Hermes built-in `browser` toolset** — Playwright-based, free, runs locally. Check with `hermes tools list | grep browser`. If listed but not actually working, the package is missing. Install path that actually works on Alex's Ubuntu VPS (system Python 3.12 is PEP 668-restricted, so plain `pip install playwright` and the `playwright` binary on `$PATH` both fail): ```bash VENV=~/.hermes/hermes-agent/venv $VENV/bin/pip3 install playwright $VENV/bin/python -m playwright install chromium # NOT: pip install playwright (PEP 668) # NOT: playwright install chromium (binary not on PATH; `apt install node-playwright` is wrong) ``` Verify with `$VENV/bin/python -c "from playwright.sync_api import sync_playwright; print('ok')"`. For SPA + Cloudflare pages, drive it directly via a one-off script — set `headless=True`, use a real User-Agent, `wait_until='domcontentloaded'`, then poll `page.title()` for "Just a moment" up to ~40 s before reading `document.body.innerText`. This defeats `developer.classpass.com`-style Cloudflare challenges in headless mode. 3. **Camoufox** (`tools/browser_camofox.py`) — stealth Firefox fork, defeats Cloudflare bot checks more reliably than vanilla Playwright when even the above fails. Same venv install pattern. 4. **`claude -p` — NOT a headless option.** `claude --chrome` is the Claude-in-Chrome browser-*extension* integration; it does not spin up a headless CDP browser for unattended doc fetching. Do not list it as a fallback for "read these 3 URLs for me" tasks. Skip to (5) if rungs 1–3 are exhausted. 5. **MCP browser server** — Browserbase / Apify / Bright Data via `claude mcp add` for hosted anti-bot Chromium. Paid usage but no monthly minimum on some. 6. **Only then** ask the user to paste contents or screenshot. Pick the earliest rung that works. For Cloudflare-protected JS SPAs (developer portals, gated docs), rung 2 (Hermes Playwright) is the right default once installed — it's free, local, and one-time setup. Firecrawl's paid tier ($19/mo) is an option but unnecessary when the Playwright venv works. ## Verification Before finalizing, make sure you have: - proof the page was reachable (`STATUS 200` or equivalent) - extracted title/meta or keyword snippets - codebase evidence for current implementation state - DB/schema evidence if the task involves persisted data