---
name: video-watch
description: Use when the user needs grounded analysis of a video URL or local file by extracting frames, aligning them with transcript/captions, and answering based on what is visibly on screen rather than title/description alone.
version: 1.0.0
author: Hermes Agent
license: MIT
metadata:
  hermes:
    tags: [video, vision, yt-dlp, ffmpeg, transcript, analysis]
    related_skills: [youtube-content, whisper, systematic-debugging]
---

# Video Watch for Hermes

## Overview

This is a Hermes-native adaptation of the Claude `/watch` pattern from `bradautomates/claude-video`. Use it when the user needs an agent to *watch* a video rather than just summarize metadata.

This is the primary Hermes-native visual-analysis path, and it intentionally stays close to the Brad Bonanno `/watch` pattern rather than relying on URL-level Gemini analysis.

The core pattern is:

1. download or locate the video
2. extract a bounded set of timestamped frames with `ffmpeg`
3. obtain captions/transcript if available
4. analyze the frames with Hermes vision tools
5. answer using both the visual evidence and the transcript

Unlike Claude Code's `Read`-image workflow, Hermes should use `vision_analyze` on extracted frames or contact sheets.

## When to Use

Use this skill when the user asks to:

- analyze a YouTube, Loom, Vimeo, TikTok, X, Instagram, or other public video URL
- inspect a local `.mp4`, `.mov`, `.mkv`, or `.webm`
- identify what happens at a specific timestamp
- debug from a screen recording
- break down hooks, pacing, visual structure, or on-screen text
- summarize a video where the visuals matter, not just the spoken words

Do **not** use this skill when:

- the task is only YouTube metadata/transcript extraction for archiving or indexing → prefer `youtube-content`
- the user only needs title/description/channel info
- a static screenshot would answer the question more cheaply than processing the whole video

## Prerequisites

Check the live environment first:

```bash
command -v ffmpeg
command -v yt-dlp
python3 --version
```

Optional transcription fallbacks:

- native captions via `yt-dlp` are preferred and usually free
- if no captions exist, use Groq/OpenAI Whisper only if credentials are configured

## Core Workflow

### 1) Create a working directory

```bash
TS=$(date +%Y%m%d-%H%M%S)
WORKDIR="$HOME/.hermes/video-watch/$TS"
mkdir -p "$WORKDIR/frames"
```

### 2) Resolve the source

If the input is a URL, download it with `yt-dlp`:

```bash
yt-dlp -f 'bestvideo[height<=720]+bestaudio/best[height<=720]/best' \
  --merge-output-format mp4 \
  -o "$WORKDIR/video.%(ext)s" \
  "$URL"
```

If the input is a local file, use it in place and copy only if a stable workdir is useful.

### 3) Probe duration

```bash
ffprobe -v error -show_entries format=duration \
  -of default=noprint_wrappers=1:nokey=1 "$VIDEO"
```

Use duration to set frame density.

## Frame Budget

Follow the `claude-video` spirit: enough visual coverage to ground the answer without exploding cost.


Default whole-video targets:

- `<=30s` → about 30 frames
- `30-60s` → about 40 frames
- `1-3m` → about 60 frames
- `3-10m` → about 80 frames
- `>10m` → cap around 100 frames and warn that this is a sparse scan

Focused-window targets:

- `<=5s` → up to 2 fps
- `5-15s` → up to 2 fps
- `15-30s` → dense focused extraction
- `30-180s` → enough frames to inspect the named segment without scanning the entire video

If the user names a moment or range, prefer `--start` / `--end` style focused extraction over full-video scanning.

## Extract Frames

For a full video, compute fps from budget ÷ duration. Keep it simple and bounded.

Example sparse extraction:

```bash
ffmpeg -y -i "$VIDEO" -vf "fps=$FPS,scale=768:-1" "$WORKDIR/frames/frame_%04d.jpg"
```

Focused extraction:

```bash
ffmpeg -y -ss "$START" -to "$END" -i "$VIDEO" \
  -vf "fps=$FPS,scale=1024:-1" \
  "$WORKDIR/frames/frame_%04d.jpg"
```

Use `1024` width only when the user needs readable UI/code/slide text. Otherwise prefer `768` or smaller.

## Get Transcript / Captions

### Preferred: native captions from yt-dlp

For public web videos, first attempt captions:

```bash
yt-dlp --skip-download \
  --write-subs --write-auto-subs \
  --sub-langs "en.*,en,-live_chat" \
  -o "$WORKDIR/source.%(ext)s" \
  "$URL"
```

Then inspect generated `.vtt`/`.srt` files in the workdir. Use native captions when available.

### Fallback: Whisper

If no captions exist and the user still needs spoken content:

```bash
ffmpeg -y -i "$VIDEO" -vn -ac 1 -ar 16000 "$WORKDIR/audio.wav"
```

Then transcribe using the configured provider/tooling available in the environment. Prefer Groq for speed/cost when configured.

If no transcription path exists, proceed visually and state clearly that the answer is based on frames only.

## Hermes Vision Pattern

Hermes should not try to reason from filenames alone. Use `vision_analyze`.

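
A filename such as `frame_0007.jpg` carries only an index, so it can help to convert that index back to an approximate video time before citing a frame as evidence. The following is a minimal sketch, not part of the skill itself. It assumes the frames came from the `fps=$FPS` extraction commands above, that `frame_0001.jpg` corresponds to the start of the extraction window, and that frames are spaced `1/FPS` seconds apart. The helper name `frame_ts` is hypothetical.

```bash
# Hypothetical helper: approximate timestamp (in seconds) of the N-th extracted frame.
# Assumption: frame 1 sits at the start of the window and frames are 1/FPS apart.
# ffmpeg's fps filter may pick a source frame slightly off the nominal time,
# so treat the result as an estimate rather than an exact timestamp.
frame_ts() {
  local index="$1" fps="$2" start="${3:-0}"
  awk -v n="$index" -v f="$fps" -v s="$start" \
    'BEGIN { printf "%.2f\n", s + (n - 1) / f }'
}

frame_ts 7 0.5     # 7th frame of a whole-video scan at 0.5 fps
frame_ts 3 2 42    # 3rd frame of a focused window starting at 42s, sampled at 2 fps
```

The optional third argument is the window start in seconds, so the same helper could cover both sparse scans and focused extractions.
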
### Best practice: contact sheets first

For many frames, create contact sheets in batches:

```bash
# Each sheet tiles 16 frames; the %02d pattern lets ffmpeg write one file per batch
ffmpeg -y -pattern_type glob -i "$WORKDIR/frames/*.jpg" \
  -vf "scale=320:-1,tile=4x4" "$WORKDIR/contact-%02d.jpg"
```

With more than 16 frames, this writes several sheets (`contact-01.jpg`, `contact-02.jpg`, and so on). If needed, make additional contact sheets from specific subsets of frames.

Then use `vision_analyze` to answer questions like:

- what changes across these frames?
- when does the UI break?
- what text is visible on screen?
- what visual hook opens the video?

### Escalate to individual frames

If a contact sheet reveals an important region, inspect the most relevant individual frames with `vision_analyze` for precise details.

## Recommended Answer Pattern

Combine:

- visual findings from frames/contact sheets
- transcript/caption evidence
- exact timestamps whenever possible

Structure answers as:

1. direct answer
2. timestamped evidence
3. notable uncertainty or gaps
4. optional suggestion to rerun on a narrower window if needed

## Common Recipes

### Break down a YouTube hook

- extract the first 10-20 seconds densely
- inspect opening frames with vision
- align with opening transcript lines
- report: first visual, first spoken line, pacing shift, pattern interrupt

### Debug a screen recording

- focus on the suspicious time range
- use higher resolution frames
- look for state changes, disabled controls, error banners, modal transitions, or missing renders
- if needed, compare pre-failure and failure frames side by side

### Summarize a long video cheaply

- start with transcript/captions
- do sparse whole-video frames
- if visual ambiguity remains, rerun only on the relevant chapter or timestamp window

## Common Pitfalls

1. **Scanning a long video end-to-end when the user asked about one moment.** Use focused extraction.
2. **Using too many high-resolution frames.** Token/cost grows fast; bump resolution only for text-heavy scenes.
3. **Assuming transcript-only is enough.** For demos, bugs, hooks, slides, or charts, visuals are often the main signal.
4. **Trusting a sparse scan too much.** For videos over 10 minutes, call it a sparse scan and offer a targeted rerun.
5. **Forgetting local files.** This pattern works for local recordings too, not just web URLs.

## Verification Checklist

- [ ] Confirmed `ffmpeg` and `yt-dlp` exist
- [ ] Used a bounded frame budget
- [ ] Preferred captions before Whisper fallback
- [ ] Used `vision_analyze` on contact sheets or frames
- [ ] Answer grounded in visible evidence and timestamps
- [ ] Stated clearly if transcript/audio evidence was unavailable
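
To help keep the frame budget bounded in practice, the whole-video targets under Frame Budget and the budget ÷ duration rule under Extract Frames can be sketched as two small shell helpers. This is a minimal sketch rather than part of the skill itself. The function names `frame_budget` and `frames_per_second` are hypothetical, and the handling of exact boundaries (30s, 60s, 180s, 600s) is an assumption, since the listed ranges overlap at their edges.

```bash
# Hypothetical helper: map a duration in seconds to the whole-video frame target.
# Assumption: each boundary value falls into the lower bucket.
frame_budget() {
  local duration="$1"   # seconds, may be fractional (as printed by ffprobe)
  awk -v d="$duration" 'BEGIN {
    if      (d <= 30)  print 30
    else if (d <= 60)  print 40
    else if (d <= 180) print 60
    else if (d <= 600) print 80
    else               print 100
  }'
}

# fps = budget / duration; exits non-zero for a non-positive duration.
frames_per_second() {
  local duration="$1" budget
  budget=$(frame_budget "$duration")
  awk -v b="$budget" -v d="$duration" \
    'BEGIN { if (d <= 0) exit 1; printf "%.4f\n", b / d }'
}

frames_per_second 240    # 4-minute video: 80 frames over 240 seconds
```

The result could be assigned with `FPS=$(frames_per_second "$DURATION")` before running the sparse extraction command, where `DURATION` is a hypothetical variable holding the `ffprobe` output.
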