---
name: tiered-web-extract
description: "Tiered web extract/search plugin (Jina → Crawl4AI → Firecrawl) and its Crawl4AI service."
version: 1.0.0
author: Alex + Hermes
license: MIT
metadata:
  hermes:
    tags: [web, scraping, crawl4ai, jina, firecrawl, hermes-plugin]
---

# Tiered Web Extract (Jina → Crawl4AI → Firecrawl)

Alex's cost-first web extract/search backend. Replaces direct Firecrawl usage. Saves the Firecrawl free tier for hardest-to-render pages only.

## Architecture

**Extract chain (per URL, falls through on miss):**

1. `r.jina.ai/`: free, no key, static pages, ~85% of cases
2. Self-hosted **Crawl4AI** on `http://5.78.214.66:11235/md`: JS-heavy pages
3. **Firecrawl** SDK: last-resort paid fallback (uses existing FIRECRAWL_API_KEY)

**Search chain:**

1. **ddgs** (DuckDuckGo): free
2. **Firecrawl** search: fallback

## Files

- Plugin: `~/.hermes/plugins/web/tiered/` (`provider.py`, `__init__.py`, `plugin.yaml`)
- Plugin enabled in `~/.hermes/config.yaml` under `plugins.enabled: [..., web/tiered]`
- Backend selected: `web.backend: tiered`
- Crawl4AI server: Hetzner CPX11 `hermes-crawl4ai` @ 5.78.214.66 (€4.35/mo, hil region)
- Crawl4AI runs as Docker container `unclecode/crawl4ai:latest` on port 11235 via `/root/crawl4ai/docker-compose.yml`

## Verify it's working

```bash
# Live Crawl4AI health
curl -s --max-time 5 http://5.78.214.66:11235/health
# Expect: {"status":"ok","timestamp":...,"version":"0.8.6"}

# Live crawl test
curl -s -X POST http://5.78.214.66:11235/md \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com"}'

# Plugin loads & registers
cd /home/avalon/.hermes/hermes-agent
/home/avalon/.hermes/hermes-agent/venv/bin/python -c "
from hermes_cli.plugins import PluginManager
pm = PluginManager(); pm.discover_and_load()
from agent.web_search_registry import get_active_search_provider, get_active_extract_provider
print('search:', get_active_search_provider().name)   # tiered
print('extract:', get_active_extract_provider().name) # tiered
"
```

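The "falls through on miss" behaviour of the extract chain under Architecture can be sketched roughly as below. This is a minimal illustration, not the code in `provider.py`: the function `extract_with_fallback`, the result dictionary shape, and the three `fake_*` tiers are all hypothetical stand-ins, and no network calls are made.

```python
from typing import Callable, Optional

# Assumption: a tier takes a URL and returns markdown text, or None on a miss.
Tier = Callable[[str], Optional[str]]


def extract_with_fallback(url: str, tiers: list[tuple[str, Tier]]) -> dict:
    """Try each tier in order and return the first non-empty result."""
    errors = []
    for name, fetch in tiers:
        try:
            text = fetch(url)
        except Exception as exc:  # a failing tier must not abort the chain
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return {"ok": True, "tier": name, "content": text}
        errors.append(f"{name}: empty result")
    # Every tier missed: surface the collected reasons to the caller.
    return {"ok": False, "tier": None, "errors": errors}


# Hypothetical stand-in tiers, for illustration only.
def fake_jina(url: str) -> Optional[str]:
    return None  # simulate a miss on a JS-heavy page


def fake_crawl4ai(url: str) -> Optional[str]:
    return f"# Rendered {url}"


def fake_firecrawl(url: str) -> Optional[str]:
    raise RuntimeError("Payment Required")


result = extract_with_fallback(
    "https://example.com",
    [("jina", fake_jina), ("crawl4ai", fake_crawl4ai), ("firecrawl", fake_firecrawl)],
)
print(result["tier"])  # crawl4ai
```

Keeping the tiers as an ordered list of `(name, callable)` pairs means that dropping the Firecrawl tier is a one-element change and leaves the loop untouched.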
## Pitfalls discovered during setup

- **User plugins load under synthetic namespace** `hermes_plugins.` not `plugins.`. So `__init__.py` MUST use **relative imports** (`from .provider import …`) not absolute (`from plugins.web.tiered.provider …`).
  - However, the tiered provider's lazy imports of bundled providers (`plugins.web.firecrawl.provider`, `plugins.web.ddgs.provider`) DO use absolute paths; those resolve correctly because bundled providers live at the real `plugins.web.*` path under hermes-agent.
- **ddgs python package** is an optional dep. Install with `~/.hermes/hermes-agent/venv/bin/pip3 install ddgs` (not system pip; see PEP 668).
- **hcloud CLI needs `HOME` set explicitly** when called from agent contexts where HOME isn't propagated: `export HOME=/home/avalon`.
- Hermes uses `~/.hermes/hermes-agent/venv` (named `venv`, not `.venv`). Activate or call binaries directly.

## Pitfall: gateway/MCP process predates the plugin

Symptom (the exact shape this bug takes): the agent's `web_search` MCP tool keeps returning `Firecrawl search failed: Payment Required: Insufficient credits` even though `web.backend: tiered` is in `~/.hermes/config.yaml` and the plugin files exist on disk. Direct Python tests inside `hermes-agent/venv` resolve `tiered` correctly (`get_active_search_provider().name == 'tiered'`); the discrepancy is the giveaway.

Cause: the long-running gateway process (the one serving Telegram / MCP for the active session) was started BEFORE the tiered plugin was installed. Plugins are loaded once at gateway startup. Editing `config.yaml` and dropping new plugin files into `~/.hermes/plugins/` does NOT hot-reload them into the running process.

Diagnosis recipe:

```bash
# What's the gateway process and when did it start?
ps -ef | grep "hermes_cli.main gateway run" | grep -v grep
ps -o lstart= -p <PID>   # <PID> = gateway PID taken from the output above

# Compare to plugin install / config edit time
stat -c '%y %n' ~/.hermes/plugins/web/tiered/*.py ~/.hermes/config.yaml
```

If the gateway `lstart` is earlier than the plugin/config mtime → stale gateway, restart needed.

Fix: restart the gateway. Whatever launches it (pm2 / systemd / hand-launched), kill the PID and let it respawn, or run `pm2 restart hermes-gateway`. A hand-launched gateway has no supervisor, so start it again manually after the kill. Brief Telegram drop (a few seconds), then the new chain is wired in.

**Generalize:** this same pattern bites any time `config.yaml` / `.env` / a plugin directory is edited while the gateway is running. Whenever Alex says "but we already configured X, why is it still doing the old thing", check gateway start time vs config/plugin mtime BEFORE re-debugging the config.

## Maintaining Crawl4AI

SSH in via the main VPS (key already authorized):

```bash
ssh root@5.78.214.66
cd /root/crawl4ai
docker compose logs -f --tail 100
docker compose pull && docker compose up -d   # update
docker stats crawl4ai                         # mem/cpu
```

## When to extend

- **Add Jina API key** if hitting Jina free-tier rate limits: set `JINA_API_KEY` in `~/.hermes/.env`, plugin reads it automatically.
- **Crawl4AI server URL override:** set `CRAWL4AI_URL` env var (default `http://5.78.214.66:11235`).
- **Replace Firecrawl tier** if free credits run out and you don't want to pay: edit `_try_firecrawl_extract` to return an error result, leaving 2-tier (Jina + Crawl4AI). Pages that fail both tiers will return an error to the agent, which is usually fine.

## Cost

- Hetzner CPX11: ~$4.90/mo (saves €228+/yr vs Firecrawl Standard)
- Jina Reader free tier: covers most static pages
- Firecrawl free tier: trickle fallback only
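
Returning to the stale-gateway pitfall: the "gateway start time vs config/plugin mtime" comparison can also be expressed as a small helper. This is a sketch under assumptions: `stale_paths` is a hypothetical name, the process start time is passed in as epoch seconds (getting it from `ps` is left out), and the demo uses a temporary file in place of the real `config.yaml`.

```python
import os
import tempfile
from pathlib import Path


def stale_paths(process_start: float, paths: list[Path]) -> list[Path]:
    """Return the existing paths modified after the process started.

    process_start is epoch seconds; a non-empty result suggests the running
    process predates those edits and needs a restart to pick them up.
    """
    return [p for p in paths if p.exists() and p.stat().st_mtime > process_start]


# Demo with a temporary file standing in for config.yaml.
with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "config.yaml"
    config.write_text("web:\n  backend: tiered\n")
    os.utime(config, (2_000, 2_000))  # pretend it was last edited at t=2000

    print(stale_paths(1_000, [config]) == [config])  # True: process older than the edit
    print(stale_paths(3_000, [config]))              # []: process started after the edit
```

A non-empty return value corresponds to the "stale gateway, restart needed" branch of the diagnosis recipe.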