This is a Hermes-native adaptation of the Claude /watch pattern from bradautomates/claude-video.
Use it when the user needs an agent to watch a video rather than just summarize metadata. This is the primary Hermes-native visual-analysis path, and it intentionally stays close to the Brad Bonanno /watch pattern rather than relying on URL-level Gemini analysis. The core pattern is:
ffmpegUnlike Claude Code's Read-image workflow, Hermes should use vision_analyze on extracted frames or contact sheets.
Use this skill when the user asks to:
- analyze a YouTube, Loom, Vimeo, TikTok, X, Instagram, or other public video URL
- inspect a local .mp4, .mov, .mkv, or .webm
- identify what happens at a specific timestamp
- debug from a screen recording
- break down hooks, pacing, visual structure, or on-screen text
- summarize a video where the visuals matter, not just the spoken words
Do not use this skill when:
- the task is only YouTube metadata/transcript extraction for archiving or indexing → prefer youtube-content
- the user only needs title/description/channel info
- a static screenshot would answer the question more cheaply than processing the whole video
Check the live environment first:
command -v ffmpeg
command -v yt-dlp
python3 --version
Optional transcription fallbacks:
- native captions via yt-dlp are preferred and usually free
- if no captions exist, use Groq/OpenAI Whisper only if credentials are configured
TS=$(date +%Y%m%d-%H%M%S)
WORKDIR="$HOME/.hermes/video-watch/$TS"
mkdir -p "$WORKDIR/frames"
If the input is a URL, download it with yt-dlp:
yt-dlp -f 'bestvideo[height<=720]+bestaudio/best[height<=720]/best' \
--merge-output-format mp4 \
-o "$WORKDIR/video.%(ext)s" \
"$URL"
If the input is a local file, use it in place and copy only if a stable workdir is useful.
ffprobe -v error -show_entries format=duration \
-of default=noprint_wrappers=1:nokey=1 "$VIDEO"
Use duration to set frame density.
Follow the claude-video spirit: enough visual coverage to ground the answer without exploding cost.
Default whole-video targets:
- <=30s → about 30 frames
- 30-60s → about 40 frames
- 1-3m → about 60 frames
- 3-10m → about 80 frames
- >10m → cap around 100 frames and warn that this is a sparse scan
Focused-window targets:
- <=5s → up to 2 fps
- 5-15s → up to 2 fps
- 15-30s → dense focused extraction
- 30-180s → enough frames to inspect the named segment without scanning the entire video
If the user names a moment or range, prefer --start / --end style focused extraction over full-video scanning.
For a full video, compute fps from budget ÷ duration. Keep it simple and bounded.
Example sparse extraction:
ffmpeg -y -i "$VIDEO" -vf "fps=$FPS,scale=768:-1" "$WORKDIR/frames/frame_%04d.jpg"
Focused extraction:
ffmpeg -y -ss "$START" -to "$END" -i "$VIDEO" \
-vf "fps=$FPS,scale=1024:-1" \
"$WORKDIR/frames/frame_%04d.jpg"
Use 1024 width only when the user needs readable UI/code/slide text. Otherwise prefer 768 or smaller.
For public web videos, first attempt captions:
yt-dlp --skip-download \
--write-subs --write-auto-subs \
--sub-langs "en.*,en,-live_chat" \
-o "$WORKDIR/source.%(ext)s" \
"$URL"
Then inspect generated .vtt/.srt files in the workdir. Use native captions when available.
If no captions exist and the user still needs spoken content:
ffmpeg -y -i "$VIDEO" -vn -ac 1 -ar 16000 "$WORKDIR/audio.wav"
Then transcribe using the configured provider/tooling available in the environment. Prefer Groq for speed/cost when configured.
If no transcription path exists, proceed visually and state clearly that the answer is based on frames only.
Hermes should not try to reason from filenames alone. Use vision_analyze.
For many frames, create contact sheets in batches:
ffmpeg -y -pattern_type glob -i "$WORKDIR/frames/*.jpg" \
-vf "scale=320:-1,tile=4x4" "$WORKDIR/contact-1.jpg"
If needed, make multiple contact sheets from subsets of frames.
Then use vision_analyze to answer questions like:
- what changes across these frames?
- when does the UI break?
- what text is visible on screen?
- what visual hook opens the video?
If a contact sheet reveals an important region, inspect the most relevant individual frames with vision_analyze for precise details.
Combine: - visual findings from frames/contact sheets - transcript/caption evidence - exact timestamps whenever possible
Structure answers as: 1. direct answer 2. timestamped evidence 3. notable uncertainty or gaps 4. optional suggestion to rerun on a narrower window if needed
ffmpeg and yt-dlp existvision_analyze on contact sheets or frames