video-watch

Video Watch for Hermes

Overview

This is a Hermes-native adaptation of the Claude /watch pattern from bradautomates/claude-video. Version 2 includes an executable runtime port of its MIT-licensed 0.2.0 scene/keyframe, deduplication, cue-frame, focused-range, and chunked-Whisper features. See references/upstream-attribution.md.

Use it when the user needs an agent to watch a video rather than just summarize metadata. This is the primary Hermes-native visual-analysis path, and it intentionally stays close to the Brad Bonanno /watch pattern rather than relying on URL-level Gemini analysis. The core pattern is:

download or locate the video
extract a bounded set of timestamped frames with ffmpeg
obtain captions/transcript if available
analyze the frames with Hermes vision tools
answer using both the visual evidence and the transcript

Unlike Claude Code's Read-image workflow, Hermes should use vision_analyze on extracted frames or contact sheets.

When to Use

Use this skill when the user asks to:

- analyze a YouTube, Loom, Vimeo, TikTok, X, Instagram, or other public video URL
- inspect a local .mp4, .mov, .mkv, or .webm
- identify what happens at a specific timestamp
- debug from a screen recording
- break down hooks, pacing, visual structure, or on-screen text
- summarize a video where the visuals matter, not just the spoken words

Do not use this skill when:

- the task is only YouTube metadata/transcript extraction for archiving or indexing → prefer youtube-content
- the user only needs title/description/channel info
- a static screenshot would answer the question more cheaply than processing the whole video

IMPORTANT: Check Skills First

When the user asks about any YouTube video — whether to download audio, extract metadata, transcribe, or analyze — do not start experimenting with random approaches. Load the relevant skills first:

youtube-content — metadata + transcript extraction via Supadata API (no IP blocks, preferred for YouTube)
video-watch — visual frame analysis (this skill)
whisper — audio transcription fallback

The user has these pipelines for a reason. Starting with them avoids wasted cycles on approaches the user already has established workflows for.

Prerequisites

Check the live environment first:

command -v ffmpeg
command -v ffprobe
command -v yt-dlp
python3 --version

Optional transcription fallbacks:

- native captions via yt-dlp are preferred and usually free
- if no captions exist, use Groq/OpenAI Whisper only if credentials are configured
- Supadata API (youtube-content skill) for transcript when yt-dlp is blocked

YouTube Bot Detection — Critical Note

For the full cloud-VPS retrieval architecture—anonymous proxy-first testing, dedicated ISP/static-residential egress selection, account/cookie safety, verification ladder, and conservative runtime defaults—read references/youtube-vps-egress-and-account-safety.md before configuring a paid proxy or attaching an account.

A cloud/VPS egress IP may be challenged by YouTube. If yt-dlp fails with:

ERROR: [youtube] VIDEO_ID: Sign in to confirm you're not a bot.

Treat this first as an egress/session trust problem, not automatically as a broken yt-dlp installation. Re-check the live environment and current verbose output rather than hard-coding assumptions about every alternate client or provider.
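The triage rule above can be sketched as a small classifier over captured yt-dlp stderr. This is a minimal illustration, not part of the bundled runtime; the function name and the two labels are hypothetical.

```python
import re

# Matches the bot-check message quoted above; "." tolerates straight or curly apostrophes.
BOT_CHALLENGE = re.compile(r"Sign in to confirm you.re not a bot", re.IGNORECASE)


def classify_ytdlp_failure(stderr: str) -> str:
    """Return 'egress-challenge' for the bot-check message, otherwise 'other'.

    Hypothetical helper: the labels are illustrative, not runtime output.
    """
    if BOT_CHALLENGE.search(stderr):
        return "egress-challenge"
    return "other"
```

An "egress-challenge" result points to the Supadata or verified-proxy paths rather than reinstalling yt-dlp; anything else still needs the current verbose output read directly.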

Paths that remain useful without raw media:

- Supadata API (youtube-content skill) for metadata and transcripts
- Direct watch-page HTML parsing for title, channel, description, chapters, and transcript-panel evidence

Bounded escalation for actual media:

1. Use Supadata when the task only requires transcript/metadata.
2. For public media, test one fixed dedicated ISP/static-residential proxy anonymously first. Verify egress, run a simulation, download one small file, probe it, and run video-watch before persisting the provider.
3. If the anonymous proxied lane works, keep public downloads account-free.
4. Attach cookies from one fully secured secondary account only when the content legitimately requires authentication. Never paste passwords, TOTP seeds, backup codes, or cookies into chat or a link-shareable sheet.
5. Add a PO-token provider/client only when current yt-dlp output shows it is required.
6. Do not cycle random proxy strings, clients, IPs, or accounts. One bounded proof of concept is useful; repeated evasion attempts are not.

Configured VPS proxy behavior

On a VPS with a verified YouTube egress lane, scripts/download.py automatically applies the proxy to YouTube-family URLs only (youtube.com, youtu.be, and youtube-nocookie.com). It reads YOUTUBE_PROXY_URL from the process environment first, then from ${YOUTUBE_PROXY_SECRET_FILE:-~/.hermes/secrets/youtube-proxy.env}. Use this file shape and mode:

# Preferred on a VPS: provider-side IP whitelist, no credentials stored
YOUTUBE_PROXY_URL='http://HOST:PORT'

# Fallback only when IP whitelisting is unavailable
# YOUTUBE_PROXY_URL='http://USER:PASS@HOST:PORT'

chmod 600 ~/.hermes/secrets/youtube-proxy.env
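The lookup order described above (process environment first, then the secret file, and only for YouTube-family hosts) might look roughly like the sketch below. It is an illustration under assumptions, not the actual code of scripts/download.py; every function name here is hypothetical, and the simple KEY='value' parser only covers the file shape shown above.

```python
import os
from pathlib import Path
from typing import Mapping, Optional
from urllib.parse import urlparse

YOUTUBE_HOSTS = ("youtube.com", "youtu.be", "youtube-nocookie.com")
DEFAULT_SECRET_FILE = "~/.hermes/secrets/youtube-proxy.env"


def is_youtube_family(url: str) -> bool:
    """True when the URL host is one of the YouTube-family domains or a subdomain."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in YOUTUBE_HOSTS)


def read_env_file(path: Path) -> dict:
    """Parse KEY='value' lines, skipping blanks and # comments (hypothetical parser)."""
    values = {}
    if not path.is_file():
        return values
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_proxy(url: str, environ: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the proxy URL for YouTube-family URLs, or None for everything else."""
    environ = os.environ if environ is None else environ
    if not is_youtube_family(url):
        return None  # non-YouTube URLs stay direct
    if environ.get("YOUTUBE_PROXY_URL"):
        return environ["YOUTUBE_PROXY_URL"]
    # Mirrors ${YOUTUBE_PROXY_SECRET_FILE:-default}: fall back when unset or empty.
    secret = Path(environ.get("YOUTUBE_PROXY_SECRET_FILE") or DEFAULT_SECRET_FILE)
    return read_env_file(secret.expanduser()).get("YOUTUBE_PROXY_URL")
```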

The proxy is passed through the yt-dlp subprocess environment rather than command arguments, so it is absent from ordinary process argv and command logs. Proxied YouTube downloads automatically use one concurrent fragment, 2-second request sleeps, 5–10-second inter-download sleeps, and a 5 MB/s limit. Non-YouTube URLs remain direct and retain their existing behavior. On this profile, the optional ~/.local/bin/hermes-youtube-fetch wrapper provides the same protected lane for explicit download-first workflows; pass its local output to watch.py to avoid downloading the same source twice.
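As a rough sketch of the behavior described above, the proxy can travel in the subprocess environment while the conservative throttling goes on the command line. The yt-dlp flags used here exist in current releases; the use of HTTP_PROXY/HTTPS_PROXY as the carrier variables and the function name are assumptions, since the bundled script may pass the proxy differently.

```python
import os


def build_ytdlp_invocation(url, out_template, proxy_url=None, base_env=None):
    """Return (argv, env) for a yt-dlp run; hypothetical helper, not the bundled script."""
    env = dict(os.environ if base_env is None else base_env)
    args = ["yt-dlp", "-o", out_template]
    if proxy_url:
        # Assumption: the proxy is carried by standard proxy variables,
        # so credentials never appear in argv or command logs.
        env["HTTP_PROXY"] = proxy_url
        env["HTTPS_PROXY"] = proxy_url
        args += [
            "--concurrent-fragments", "1",  # one concurrent fragment
            "--sleep-requests", "2",        # 2-second request sleeps
            "--min-sleep-interval", "5",    # 5-10 second sleeps between downloads
            "--max-sleep-interval", "10",
            "--limit-rate", "5M",           # 5 MB/s ceiling
        ]
    args.append(url)
    return args, env
```

The returned pair can be handed to `subprocess.run(args, env=env)`; a direct (non-proxied) call gets neither the throttling flags nor the proxy variables.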

Run the bundled redaction-safe verification probe after adding or rotating a proxy. It checks mode 0600, compares direct/proxied egress, validates a harmless YouTube endpoint, and can run an anonymous simulation without putting credentials in argv:

python3 "$SKILL_DIR/scripts/verify_youtube_proxy.py" \
  --simulate-url "https://www.youtube.com/watch?v=PUBLIC_VIDEO_ID"

Shot-Boundary and Scene-Manifest Workflows

When the user asks for scene/cut detection, clip splitting, scene-aware search, or generated-video cut QA, read references/shot-boundary-scene-manifests.md. Preserve detected boundaries independently from representative-frame sampling, default explicit scene-detection requests to a hybrid boundary-plus-long-shot-coverage mode, and keep semantic/LLM grouping optional.
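One way to picture the hybrid mode is the sketch below: detected boundaries are returned untouched, and representative samples are computed separately, with extra samples inside long shots. The function name and the `max_gap` threshold are hypothetical; the reference file remains the authority on the real behavior.

```python
def hybrid_sample_times(boundaries, duration, max_gap=10.0):
    """Return (shot_starts, sample_times) for a video of `duration` seconds.

    shot_starts preserves the detected boundaries; sample_times adds evenly
    spaced extras inside any shot longer than max_gap (hypothetical threshold).
    """
    starts = sorted({0.0, *[b for b in boundaries if 0.0 < b < duration]})
    edges = starts + [duration]
    samples = []
    for start, end in zip(edges, edges[1:]):
        length = end - start
        extra = int(length // max_gap) if length > max_gap else 0
        step = length / (extra + 1)
        samples.extend(round(start + i * step, 3) for i in range(extra + 1))
    return starts, samples
```

For boundaries at 5 s and 8 s in a 40 s clip, the two short shots get one sample each and the 32 s final shot gets four.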

Core Workflow

The bundled runtime is the default path. Resolve SKILL_DIR to this skill directory, then run:

python3 "$SKILL_DIR/scripts/watch.py" "$SOURCE" --detail balanced \
  --out-dir "$HOME/.hermes/video-watch/$(date +%Y%m%d-%H%M%S)"

SOURCE may be a public URL or local video. The command reports metadata, transcript source, extraction engine, dedup count, timestamped frame paths, and its working directory.

Detail modes

| Mode | Behavior | Default cap |
| --- | --- | --- |
| `transcript` | Captions only; skips video download when possible | 0 frames |
| `efficient` | Fast encoded-keyframe scan; uniform fallback when too sparse | 50 |
| `balanced` | Scene-change extraction across the full range; uniform fallback for static video | 100 |
| `token-burner` | Scene-aware and uncapped; use only when fidelity justifies the cost | unlimited |

The default is balanced. Override per run with --detail, or set WATCH_DETAIL in the environment or ~/.config/watch/.env.
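The mode selection can be sketched as a simple precedence chain. The caps come from the table above; the precedence of the process environment over the .env file is an assumption, and the function is hypothetical rather than the runtime's own.

```python
# Default caps per detail mode; None stands for "unlimited".
DETAIL_CAPS = {"transcript": 0, "efficient": 50, "balanced": 100, "token-burner": None}


def resolve_detail(cli_value=None, environ=None, dotenv=None):
    """Pick the detail mode: --detail, then WATCH_DETAIL (env, then .env), then balanced.

    `environ` and `dotenv` are plain dicts; pass os.environ and the parsed
    ~/.config/watch/.env in real use. Env-over-file ordering is assumed.
    """
    environ = environ or {}
    dotenv = dotenv or {}
    choice = (
        cli_value
        or environ.get("WATCH_DETAIL")
        or dotenv.get("WATCH_DETAIL")
        or "balanced"
    )
    if choice not in DETAIL_CAPS:
        raise ValueError(f"unknown detail mode: {choice!r}")
    return choice, DETAIL_CAPS[choice]
```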

Frame behavior

The runtime now provides all of these automatically:

duration-aware whole-video budgets, capped at 2 fps
denser focused-window budgets with --start and --end
scene-aware selection for balanced and token-burner
fast I-frame/keyframe selection for efficient
conservative near-duplicate removal before the cap (--no-dedup disables it)
even sampling across the full timeline when candidates exceed the cap
exact transcript-cue frames with --timestamps T1,T2,...
--resolution 1024 for text-heavy UI/slides; otherwise keep the 512 px default
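Two of the behaviors listed above, the 2 fps budget ceiling and even sampling when candidates exceed the cap, can be sketched as follows. Both functions are hypothetical illustrations; the runtime's exact rounding and budget formula may differ.

```python
import math


def frame_budget(duration_s, mode_cap, max_fps=2.0):
    """Frame budget for a clip: the mode cap, but never denser than max_fps."""
    ceiling = max(1, math.floor(duration_s * max_fps))
    return min(mode_cap, ceiling)


def even_sample(candidates, cap):
    """Keep at most `cap` items, spread evenly across the ordered candidates."""
    n = len(candidates)
    if cap <= 0:
        return []
    if n <= cap:
        return list(candidates)
    if cap == 1:
        return [candidates[n // 2]]
    # First and last candidates are always kept; indices never collide because n > cap.
    indices = [round(i * (n - 1) / (cap - 1)) for i in range(cap)]
    return [candidates[i] for i in indices]
```

A 10-second clip in balanced mode therefore gets at most 20 frames rather than 100, while a 10-minute clip is limited by the mode cap instead.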

Examples:

# Cheap first pass
python3 "$SKILL_DIR/scripts/watch.py" "$SOURCE" --detail efficient

# Dense inspection of a named range
python3 "$SKILL_DIR/scripts/watch.py" "$SOURCE" \
  --detail balanced --start 2:15 --end 2:45 --resolution 1024

# Pin frames where the transcript says “look here” or “notice this”
python3 "$SKILL_DIR/scripts/watch.py" "$LOCAL_VIDEO" \
  --detail transcript --timestamps 4:32,7:10,9:55

Cue frames are reserved against the cap and never evicted by ordinary sampling. When rerunning a URL for cue frames, reuse the downloaded local video from the first work directory to avoid another network download.
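The reservation rule can be sketched like this: cue timestamps are parsed, counted against the cap, and always kept, and only the remaining slots are filled by even sampling. The helper names are hypothetical, and keeping every cue even when cues alone exceed the cap is an assumption.

```python
def parse_timestamp(text):
    """Convert 'SS', 'M:SS', or 'H:MM:SS' to seconds."""
    seconds = 0.0
    for part in text.strip().split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def select_with_cues(candidates, cues, cap):
    """Return sorted timestamps: every cue, plus evenly sampled candidates up to cap."""
    cue_times = sorted(set(cues))
    remaining = max(0, cap - len(cue_times))  # cues consume part of the cap
    pool = [t for t in candidates if t not in cue_times]
    if len(pool) > remaining:
        if remaining == 0:
            pool = []
        elif remaining == 1:
            pool = [pool[len(pool) // 2]]
        else:
            last = len(pool) - 1
            pool = [pool[round(i * last / (remaining - 1))] for i in range(remaining)]
    return sorted(cue_times + pool)
```

For example, `[parse_timestamp(t) for t in "4:32,7:10,9:55".split(",")]` yields the cue list in seconds, ready to pass as `cues`.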

Transcript Strategy

Use transcript sources in this order:

YouTube / transcript-first task: run youtube-content first. Supadata bypasses the VPS YouTube bot wall and persistently archives metadata, chapters, links, and timestamped segments.
Visual URL analysis: the bundled runtime tries native captions before downloading media.
No captions or local media: the runtime extracts mono 16 kHz/64 kbps audio and uses Groq Whisper first, OpenAI second when keys are configured.
Long audio: audio over the 24 MB safety threshold is split automatically; timestamps are shifted back into source time. A failed chunk is skipped, and the transcript is accepted if at least one chunk succeeds.
No transcript path: proceed frames-only and state the limitation.
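The long-audio rule above (shift each chunk's timestamps back into source time, skip failed chunks, accept the result if at least one chunk succeeded) can be sketched as below. The segment shape and the function name are assumptions made for illustration.

```python
def merge_chunk_transcripts(chunks):
    """Merge per-chunk transcripts into source time.

    `chunks` is a list of (offset_seconds, segments) pairs, where segments is a
    list of {"start", "end", "text"} dicts relative to the chunk, or None when
    that chunk failed. Returns None only when every chunk failed.
    """
    merged = []
    succeeded = 0
    for offset, segments in chunks:
        if segments is None:
            continue  # a failed chunk is skipped, not fatal
        succeeded += 1
        for seg in segments:
            merged.append({
                "start": seg["start"] + offset,
                "end": seg["end"] + offset,
                "text": seg["text"],
            })
    if succeeded == 0:
        return None
    merged.sort(key=lambda seg: seg["start"])
    return merged
```

When the result comes from a partial merge, the answer should say which time ranges have no transcript coverage.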

Supported runtime flags include --whisper groq|openai, --no-whisper, --max-frames N, --fps F, and --out-dir DIR.

Instagram browser-resource fallback

If Instagram yt-dlp extraction fails with an empty media response but the browser can play the post, do not stop immediately or ask for cookies as the only path:

Open the post and confirm playback/duration via document.querySelector('video').
Inspect performance.getEntriesByType('resource') for .mp4 CDN URLs.
Identify separate video/audio resources when present.
Remove transient byte-range query parameters, download the full asset, and verify with ffprobe.
Feed the local asset to scripts/watch.py.

See references/instagram-browser-resource-audio.md for the concrete workaround.
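The "remove transient byte-range query parameters" step can be sketched as a pure string operation that leaves every other parameter byte-for-byte intact, which matters for signed CDN URLs. The parameter names `bytestart` and `byteend` are an assumption about what the browser resource URLs contain; confirm them against the actual entries before relying on this.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed names of the transient range parameters (verify against real resource URLs).
RANGE_PARAMS = {"bytestart", "byteend"}


def strip_byte_range(url):
    """Drop byte-range query parameters without re-encoding the remaining ones."""
    parts = urlsplit(url)
    kept = [
        pair
        for pair in parts.query.split("&")
        if pair and pair.split("=", 1)[0].lower() not in RANGE_PARAMS
    ]
    return urlunsplit(parts._replace(query="&".join(kept)))
```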

Hermes Vision Pattern

Hermes should not try to reason from filenames alone. Use vision_analyze.

Best practice: contact sheets first

For many frames, create contact sheets in batches:

# Tile the first 16 frames (4x4) into a single sheet
ffmpeg -y -pattern_type glob -i "$WORKDIR/frames/*.jpg" \
  -vf "scale=320:-1,tile=4x4" -frames:v 1 "$WORKDIR/contact-1.jpg"

If needed, make multiple contact sheets from subsets of frames.

Then use vision_analyze to answer questions like:

- what changes across these frames?
- when does the UI break?
- what text is visible on screen?
- what visual hook opens the video?

Escalate to individual frames

If a contact sheet reveals an important region, inspect the most relevant individual frames with vision_analyze for precise details.

QA a generated Video Story / AI video output

Treat this as a product QA pass, not just a summary.
Start by probing exact duration and comparing it to the requested/declared target.
Extract dense enough frames for the short output (<=30s → ~2 fps is usually fine) and make timestamp-labeled contact sheets.
Inspect key individual frames where the contact sheet shows drift or artifacts.
If the video came from Video Story, compare visuals against the DB/project plan when available: script segments, scene descriptions, shot prompts, reference images, lip-sync status, and export logs.
Report timestamped defects in terms Alex can use to improve the pipeline: subject/reference drift, unrequested new characters, story continuity breaks, pacing dead zones, unreadable text, lip-sync/talking-head believability, artifact-prone actions (hands/water/hair/cloth/energy), style drift, and whether the final shot resolves the requested story.
For user-supplied reference subjects, explicitly check beginning/middle/end identity preservation and whether the ending stays on the intended subject.
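The first QA step, probing the exact duration and comparing it with the target, can be sketched as follows. The ffprobe invocation uses standard options; the 0.5-second tolerance and both function names are hypothetical choices, not values from the Video Story pipeline.

```python
import subprocess


def probe_duration(path):
    """Return the container duration in seconds as reported by ffprobe."""
    result = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-show_entries", "format=duration",
            "-of", "default=noprint_wrappers=1:nokey=1",
            path,
        ],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())


def duration_verdict(actual_s, target_s, tolerance_s=0.5):
    """Classify the measured duration against the requested or declared target."""
    delta = actual_s - target_s
    if abs(delta) <= tolerance_s:
        return "ok"
    return "too long" if delta > 0 else "too short"
```

Typical use is `duration_verdict(probe_duration(video_path), target_seconds)`, with the verdict and the measured value both included in the QA report.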

Common Recipes

Break down a YouTube hook

extract the first 10-20 seconds densely
inspect opening frames with vision
align with opening transcript lines
report: first visual, first spoken line, pacing shift, pattern interrupt

Debug a screen recording

focus on the suspicious time range
use higher resolution frames
look for state changes, disabled controls, error banners, modal transitions, or missing renders
if needed, compare pre-failure and failure frames side by side

Summarize a long video cheaply

start with transcript/captions
do sparse whole-video frames
if visual ambiguity remains, rerun only on the relevant chapter or timestamp window

Capability-Parity Audits

When comparing this pipeline with another video tool, separate overlapping outcomes from implementation parity. Do not say “we already have it” merely because both systems can produce a transcript or sample frames.

Audit the concrete runtime capabilities first:

one-command executable path versus a prose/manual workflow
scene-aware selection versus fixed-interval sampling
keyframe-only fast mode
near-duplicate removal
focused-range transcript filtering and denser frame budgets
forced frames at transcript-cue timestamps
long-audio chunking and partial-failure recovery
tests, setup automation, persistence, and provider/network fallbacks

State the result precisely: already equivalent, partially overlapping, or missing and worth porting. Also distinguish what actually ran in the current request. A transcript-only extraction is not a visual watch; say explicitly when no frames were downloaded or inspected.

When a useful external implementation is MIT-compatible and fills real gaps, prefer porting its tested runtime into this class-level Hermes skill while retaining Hermes-specific strengths, rather than installing a parallel duplicate. Preserve attribution, pin the upstream revision, add regression tests before production code, and verify with a synthetic video smoke.

Common Pitfalls

Scanning a long video end-to-end when the user asked about one moment. Use focused extraction.
Using too many high-resolution frames. Token/cost grows fast; bump resolution only for text-heavy scenes.
Assuming transcript-only is enough. For demos, bugs, hooks, slides, or charts, visuals are often the main signal. Never describe a transcript-only run as having watched the video.
Claiming feature parity from category-level overlap. Compare the concrete runtime checklist above before telling the user an external tool adds nothing.
Trusting a sparse scan too much. For videos over 10 minutes, call it a sparse scan and offer a targeted rerun.
Forgetting local files. This pattern works for local recordings too, not just web URLs.
Wasting time on unbounded yt-dlp variations when egress is challenged. Confirm the error with current verbose output, use Supadata for transcript/metadata, and follow the bounded proxy verification ladder in references/youtube-vps-egress-and-account-safety.md. Test one static ISP egress anonymously before exposing account cookies; do not rotate accounts/IPs after challenges.
Not checking skills first. Before experimenting with any approach (curl, browser, Python API, Tor), load the relevant skills. The user has established pipelines — youtube-content for metadata/transcripts, video-watch for frames, whisper for audio — that encode known-working approaches.

Verification Checklist

[ ] Confirmed ffmpeg, ffprobe, and yt-dlp exist
[ ] Ran the bundled scripts/watch.py rather than rebuilding the extraction loop ad hoc
[ ] Selected the cheapest sufficient detail mode and used focused ranges when possible
[ ] Confirmed the report names the expected engine (keyframe, scene, or uniform) and frame count
[ ] Preferred Supadata/native captions before Whisper fallback
[ ] Used vision_analyze on contact sheets or frames
[ ] Answer grounded in visible evidence and timestamps
[ ] Stated clearly if transcript/audio evidence was unavailable
[ ] For runtime changes, ran pytest -q plus a real ffmpeg-synthesized smoke