video-watch

Video Watch for Hermes

Overview

This is a Hermes-native adaptation of the Claude /watch pattern from bradautomates/claude-video.

Use it when the user needs an agent to watch a video rather than just summarize metadata. This is the primary Hermes-native visual-analysis path, and it intentionally stays close to the Brad Bonanno /watch pattern rather than relying on URL-level Gemini analysis. The core pattern is:

  1. download or locate the video
  2. extract a bounded set of timestamped frames with ffmpeg
  3. obtain captions/transcript if available
  4. analyze the frames with Hermes vision tools
  5. answer using both the visual evidence and the transcript

Unlike Claude Code's Read-image workflow, Hermes should use vision_analyze on extracted frames or contact sheets.

When to Use

Use this skill when the user asks to:

  - analyze a YouTube, Loom, Vimeo, TikTok, X, Instagram, or other public video URL
  - inspect a local .mp4, .mov, .mkv, or .webm
  - identify what happens at a specific timestamp
  - debug from a screen recording
  - break down hooks, pacing, visual structure, or on-screen text
  - summarize a video where the visuals matter, not just the spoken words

Do not use this skill when:

  - the task is only YouTube metadata/transcript extraction for archiving or indexing → prefer youtube-content
  - the user only needs title/description/channel info
  - a static screenshot would answer the question more cheaply than processing the whole video

Prerequisites

Check the live environment first:

command -v ffmpeg
command -v yt-dlp
python3 --version

Optional transcription fallbacks:

  - native captions via yt-dlp are preferred and usually free
  - if no captions exist, use Groq/OpenAI Whisper only if credentials are configured
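The environment checks above can be wrapped in one helper so the workflow stops early with a clear message. This is a minimal sketch: check_tools is a hypothetical name, not part of Hermes, and ffprobe is included in the usage line only because the duration probe later in this skill calls it.

```shell
# Hypothetical helper: report which required tools are missing from PATH.
check_tools() {
  missing=""
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="$missing $tool"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "ok"
}

# Usage: check_tools ffmpeg ffprobe yt-dlp python3 || exit 1
```

Collecting every missing tool before failing gives the user one complete list rather than one error per run.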

Core Workflow

1) Create a working directory

TS=$(date +%Y%m%d-%H%M%S)
WORKDIR="$HOME/.hermes/video-watch/$TS"
mkdir -p "$WORKDIR/frames"

2) Resolve the source

If the input is a URL, download it with yt-dlp:

yt-dlp -f 'bestvideo[height<=720]+bestaudio/best[height<=720]/best' \
  --merge-output-format mp4 \
  -o "$WORKDIR/video.%(ext)s" \
  "$URL"

If the input is a local file, use it in place and copy only if a stable workdir is useful.
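The later commands read the video path from $VIDEO, so it helps to set that variable in one place. The sketch below only classifies the input. source_kind and INPUT are hypothetical names introduced for illustration, and the commented usage assumes the yt-dlp output template shown above.

```shell
# Hypothetical helper: classify an input as a URL, a local file, or unknown.
source_kind() {
  case "$1" in
    http://*|https://*) echo url ;;
    *) if [ -f "$1" ]; then echo file; else echo unknown; fi ;;
  esac
}

# Usage sketch (INPUT is an assumed variable holding the user's argument):
#   case "$(source_kind "$INPUT")" in
#     url)  URL="$INPUT"   # run the yt-dlp download, then pick up the result:
#           VIDEO=$(ls "$WORKDIR"/video.* | head -n 1) ;;
#     file) VIDEO="$INPUT" ;;
#     *)    echo "cannot resolve: $INPUT" >&2; exit 1 ;;
#   esac
```

The glob on video.* is used because the downloaded extension is not guaranteed to be mp4 when yt-dlp picks a single pre-merged format.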

3) Probe duration

ffprobe -v error -show_entries format=duration \
  -of default=noprint_wrappers=1:nokey=1 "$VIDEO"

Use duration to set frame density.

Frame Budget

Follow the claude-video spirit: enough visual coverage to ground the answer without exploding cost.

Default whole-video targets:

  - <=30s → about 30 frames
  - 30-60s → about 40 frames
  - 1-3m → about 60 frames
  - 3-10m → about 80 frames
  - >10m → cap around 100 frames and warn that this is a sparse scan

Focused-window targets:

  - <=5s → up to 2 fps
  - 5-15s → up to 2 fps
  - 15-30s → dense focused extraction
  - 30-180s → enough frames to inspect the named segment without scanning the entire video
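The whole-video targets can be expressed as a small lookup. This is a sketch only: frame_budget is a hypothetical helper name, and the boundaries treat each listed range as inclusive at its upper end, which is an assumption the table does not state.

```shell
# Hypothetical helper: map a duration in seconds to the default frame budget.
frame_budget() {
  awk -v d="$1" 'BEGIN {
    if (d <= 30) print 30          # <=30s
    else if (d <= 60) print 40     # 30-60s
    else if (d <= 180) print 60    # 1-3m
    else if (d <= 600) print 80    # 3-10m
    else print 100                 # >10m: sparse scan, warn the user
  }'
}

# Usage: BUDGET=$(frame_budget "$DURATION")
```

awk is used so that fractional durations from ffprobe compare correctly.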

If the user names a moment or range, prefer focused extraction of that window (the --start / --end style, done here with ffmpeg -ss / -to) over full-video scanning.

Extract Frames

For a full video, compute fps from budget ÷ duration. Keep it simple and bounded.
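The budget ÷ duration step needs floating-point division, which plain shell arithmetic does not provide. A minimal sketch using awk follows. compute_fps is a hypothetical helper name, and the four-decimal output precision is an arbitrary choice.

```shell
# Hypothetical helper: fps = frame budget / duration in seconds.
compute_fps() {
  # $1 = frame budget, $2 = duration in seconds
  awk -v b="$1" -v d="$2" 'BEGIN {
    if (d <= 0) { print "duration must be positive" > "/dev/stderr"; exit 1 }
    printf "%.4f\n", b / d
  }'
}

# Usage: FPS=$(compute_fps 60 "$DURATION")
```

For example, a 60-frame budget over a 120-second video gives 0.5 fps, or one frame every two seconds.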

Example sparse extraction:

ffmpeg -y -i "$VIDEO" -vf "fps=$FPS,scale=768:-1" "$WORKDIR/frames/frame_%04d.jpg"

Focused extraction:

ffmpeg -y -ss "$START" -to "$END" -i "$VIDEO" \
  -vf "fps=$FPS,scale=1024:-1" \
  "$WORKDIR/frames/frame_%04d.jpg"

Use 1024 width only when the user needs readable UI/code/slide text. Otherwise prefer 768 or smaller.

Get Transcript / Captions

Preferred: native captions from yt-dlp

For public web videos, first attempt captions:

yt-dlp --skip-download \
  --write-subs --write-auto-subs \
  --sub-langs "en.*,en,-live_chat" \
  -o "$WORKDIR/source.%(ext)s" \
  "$URL"

Then inspect generated .vtt/.srt files in the workdir. Use native captions when available.
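Checking for caption files can be scripted so the Whisper fallback runs only when nothing was downloaded. This is a sketch: find_captions is a hypothetical helper name, and it only looks at the top level of the directory it is given.

```shell
# Hypothetical helper: list caption files (.vtt or .srt) in a directory.
find_captions() {
  find "$1" -maxdepth 1 -type f \( -name '*.vtt' -o -name '*.srt' \) | sort
}

# Usage:
#   if [ -z "$(find_captions "$WORKDIR")" ]; then
#     echo "no native captions found" >&2
#   fi
```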

Fallback: Whisper

If no captions exist and the user still needs spoken content:

ffmpeg -y -i "$VIDEO" -vn -ac 1 -ar 16000 "$WORKDIR/audio.wav"

Then transcribe using the configured provider/tooling available in the environment. Prefer Groq for speed/cost when configured.

If no transcription path exists, proceed visually and state clearly that the answer is based on frames only.

Hermes Vision Pattern

Hermes should not try to reason from filenames alone. Use vision_analyze.

Best practice: contact sheets first

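For many frames, create contact sheets in batches. With tile=4x4 each sheet holds 16 frames, so the output name needs a sequence pattern to receive one file per batch:

ffmpeg -y -pattern_type glob -i "$WORKDIR/frames/*.jpg" \
  -vf "scale=320:-1,tile=4x4" "$WORKDIR/contact-%02d.jpg"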

If needed, make multiple contact sheets from subsets of frames.

Then use vision_analyze to answer questions like:

  - what changes across these frames?
  - when does the UI break?
  - what text is visible on screen?
  - what visual hook opens the video?

Escalate to individual frames

If a contact sheet reveals an important region, inspect the most relevant individual frames with vision_analyze for precise details.

Combine:

  - visual findings from frames/contact sheets
  - transcript/caption evidence
  - exact timestamps whenever possible

Structure answers as:

  1. direct answer
  2. timestamped evidence
  3. notable uncertainty or gaps
  4. optional suggestion to rerun on a narrower window if needed
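To cite timestamps, a frame number from frame_%04d.jpg can be mapped back to an approximate position in the video. This is a sketch under two assumptions: frames were extracted at a constant rate with the fps filter, and the window start is given in seconds. frame_timestamp is a hypothetical helper name, and the result is approximate because the fps filter selects the nearest source frame.

```shell
# Hypothetical helper: approximate timestamp of an extracted frame.
frame_timestamp() {
  # $1 = 1-based frame number, $2 = extraction fps,
  # $3 = window start in seconds (defaults to 0)
  awk -v n="$1" -v fps="$2" -v start="${3:-0}" 'BEGIN {
    t = start + (n - 1) / fps
    t = int(t * 100 + 0.5) / 100   # round to hundredths of a second
    printf "%02d:%02d:%05.2f\n", int(t / 3600), int((t % 3600) / 60), t % 60
  }'
}

# Usage: frame_timestamp 0005 "$FPS" 90
```

On a 4x4 contact sheet built from frames in order, the tile at row r and column c (both starting at 1) on sheet s corresponds to frame number (s - 1) * 16 + (r - 1) * 4 + c.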

Common Recipes

Break down a YouTube hook

Debug a screen recording

Summarize a long video cheaply

Common Pitfalls

  1. Scanning a long video end-to-end when the user asked about one moment. Use focused extraction.
  2. Using too many high-resolution frames. Token cost grows fast; bump resolution only for text-heavy scenes.
  3. Assuming transcript-only is enough. For demos, bugs, hooks, slides, or charts, visuals are often the main signal.
  4. Trusting a sparse scan too much. For videos over 10 minutes, call it a sparse scan and offer a targeted rerun.
  5. Forgetting local files. This pattern works for local recordings too, not just web URLs.

Verification Checklist