---
name: terminal-docs-research-fallback
description: Fallback workflow for researching vendor docs when web_search/web_extract fail (for example due to credits or scraper limits) by using terminal HTTP fetches plus targeted string extraction.
---
# Terminal Docs Research Fallback
Use this when normal web tools fail but the target documentation is public and fetchable over HTTP.
## When to Use
- `web_search` or `web_extract` fails due to quota/credits/scraper restrictions
- You still need grounded current documentation facts
- Public docs are accessible directly with `requests`/`curl`
- You need a concise research summary, not full-page browsing
## Workflow
1. Try normal web tools first.
- If they fail, capture the failure reason in your notes.
2. Fetch docs directly with `terminal` using Python `requests`.
Example:
```bash
python - <<'PY'
import requests
url = 'https://docs.example.com/page'
r = requests.get(url, timeout=30, headers={'User-Agent':'Mozilla/5.0'})
print('STATUS', r.status_code, 'LEN', len(r.text))
print(r.text[:500])
PY
```
3. Extract high-signal metadata first.
- HTML `
`
- meta description / `og:description`
- obvious keyword hits for the capability you care about
4. Use targeted keyword-snippet extraction instead of trying to fully parse the page.
Example:
```bash
python - <<'PY'
import requests, re, html
txt = requests.get(URL, headers={'User-Agent':'Mozilla/5.0'}, timeout=30).text
for kw in ['webhooks', 'customer portal', 'Payment Element']:
m = re.search(r'(.{0,140}'+re.escape(kw)+r'.{0,260})', txt, re.I|re.S)
if m:
print('---', kw)
print(' '.join(html.unescape(m.group(1)).split())[:500])
PY
```
5. Cross-check with local code/schema in parallel.
- Search repo for existing integrations, placeholders, env vars, or fake implementations.
- Inspect DB/schema to compare current state vs vendor requirements.
6. Summarize only grounded conclusions.
- Separate “what docs support” from “what this codebase currently has”.
- Call out any assumptions explicitly.
## Good Targets for This Method
- Stripe docs
- public SaaS API docs
- SDK documentation pages
- billing / webhook / tax docs
## Pitfalls
- Do not claim details you did not actually extract.
- Large HTML pages are noisy; use keyword snippets, not full dumps.
- Pages may be JS-heavy; if HTTP fetch fails to contain content, switch to browser tools (see escalation ladder below).
- Quote doc claims conservatively unless you extracted exact wording.
- Do NOT declare a question a "blocker for the vendor" until you have tried the doc-fetch ladder yourself. The user expects research before escalation.
## Escalation Ladder When HTTP Fetch Fails
If `curl`/`requests` returns a Cloudflare "Just a moment…" challenge, a tiny SPA shell, or login wall, do NOT give up and ask the user to paste contents. Walk this ladder in order:
1. **`web_extract`** — sometimes Firecrawl handles JS where curl doesn't. May fail on Cloudflare or quota.
2. **Hermes built-in `browser` toolset** — Playwright-based, free, runs locally. Check with `hermes tools list | grep browser`. If listed but not actually working, the package is missing. Install path that actually works on Alex's Ubuntu VPS (system Python 3.12 is PEP 668-restricted, so plain `pip install playwright` and the `playwright` binary on `$PATH` both fail):
```bash
VENV=~/.hermes/hermes-agent/venv
$VENV/bin/pip3 install playwright
$VENV/bin/python -m playwright install chromium
# NOT: pip install playwright (PEP 668)
# NOT: playwright install chromium (binary not on PATH; `apt install node-playwright` is wrong)
```
Verify with `$VENV/bin/python -c "from playwright.sync_api import sync_playwright; print('ok')"`. For SPA + Cloudflare pages, drive it directly via a one-off script — set `headless=True`, use a real User-Agent, `wait_until='domcontentloaded'`, then poll `page.title()` for "Just a moment" up to ~40 s before reading `document.body.innerText`. This defeats `developer.classpass.com`-style Cloudflare challenges in headless mode.
3. **Camoufox** (`tools/browser_camofox.py`) — stealth Firefox fork, defeats Cloudflare bot checks more reliably than vanilla Playwright when even the above fails. Same venv install pattern.
4. **`claude -p` — NOT a headless option.** `claude --chrome` is the Claude-in-Chrome browser-*extension* integration; it does not spin up a headless CDP browser for unattended doc fetching. Do not list it as a fallback for "read these 3 URLs for me" tasks. Skip to (5) if rungs 1–3 are exhausted.
5. **MCP browser server** — Browserbase / Apify / Bright Data via `claude mcp add` for hosted anti-bot Chromium. Paid usage but no monthly minimum on some.
6. **Only then** ask the user to paste contents or screenshot.
Pick the earliest rung that works. For Cloudflare-protected JS SPAs (developer portals, gated docs), rung 2 (Hermes Playwright) is the right default once installed — it's free, local, and one-time setup. Firecrawl's paid tier ($19/mo) is an option but unnecessary when the Playwright venv works.
## Verification
Before finalizing, make sure you have:
- proof the page was reachable (`STATUS 200` or equivalent)
- extracted title/meta or keyword snippets
- codebase evidence for current implementation state
- DB/schema evidence if the task involves persisted data