---
name: video-watch
description: Use when the user needs grounded analysis of a video URL or local file by extracting frames, aligning them with transcript/captions, and answering based on what is visibly on screen rather than title/description alone.
version: 2.0.0
author: Hermes Agent
license: MIT
metadata:
  hermes:
    tags: [video, vision, yt-dlp, ffmpeg, transcript, analysis]
    related_skills: [youtube-content, whisper, systematic-debugging]
---

# Video Watch for Hermes

## Overview

This is a Hermes-native adaptation of the Claude `/watch` pattern from `bradautomates/claude-video`. Version 2 includes an executable runtime port of its MIT-licensed 0.2.0 scene/keyframe, deduplication, cue-frame, focused-range, and chunked-Whisper features. See `references/upstream-attribution.md`.

Use it when the user needs an agent to *watch* a video rather than just summarize metadata. This is the primary Hermes-native visual-analysis path, and it intentionally stays close to the Brad Bonanno `/watch` pattern rather than relying on URL-level Gemini analysis. The core pattern is:

1. download or locate the video
2. extract a bounded set of timestamped frames with `ffmpeg`
3. obtain captions/transcript if available
4. analyze the frames with Hermes vision tools
5. answer using both the visual evidence and the transcript

Unlike Claude Code's `Read`-image workflow, Hermes should use `vision_analyze` on extracted frames or contact sheets.

## When to Use

Use this skill when the user asks to:
- analyze a YouTube, Loom, Vimeo, TikTok, X, Instagram, or other public video URL
- inspect a local `.mp4`, `.mov`, `.mkv`, or `.webm`
- identify what happens at a specific timestamp
- debug from a screen recording
- break down hooks, pacing, visual structure, or on-screen text
- summarize a video where the visuals matter, not just the spoken words

Do **not** use this skill when:
- the task is only YouTube metadata/transcript extraction for archiving or indexing → prefer `youtube-content`
- the user only needs title/description/channel info
- a static screenshot would answer the question more cheaply than processing the whole video

## IMPORTANT: Check Skills First

When the user asks about any YouTube video — whether to download audio, extract metadata, transcribe, or analyze — **do not start experimenting with random approaches**. Load the relevant skills first:

- `youtube-content` — metadata + transcript extraction via Supadata API (no IP blocks, preferred for YouTube)
- `video-watch` — visual frame analysis (this skill)
- `whisper` — audio transcription fallback

The user has these pipelines for a reason. Starting with them avoids wasting cycles on problems the user's established workflows already solve.

## Prerequisites

Check the live environment first:

```bash
command -v ffmpeg
command -v ffprobe
command -v yt-dlp
python3 --version
```

Optional transcription fallbacks:
- native captions via `yt-dlp` are preferred and usually free
- if no captions exist, use Groq/OpenAI Whisper only if credentials are configured
- Supadata API (`youtube-content` skill) for transcript when yt-dlp is blocked

## YouTube Bot Detection — Critical Note

For the full cloud-VPS retrieval architecture—anonymous proxy-first testing, dedicated ISP/static-residential egress selection, account/cookie safety, verification ladder, and conservative runtime defaults—read `references/youtube-vps-egress-and-account-safety.md` before configuring a paid proxy or attaching an account.

**A cloud/VPS egress IP may be challenged by YouTube.** If `yt-dlp` fails with:

```
ERROR: [youtube] VIDEO_ID: Sign in to confirm you're not a bot.
```

Treat this first as an egress/session trust problem, not automatically as a broken `yt-dlp` installation. Re-check the live environment and current verbose output rather than hard-coding assumptions about every alternate client or provider.
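
As a minimal sketch, a wrapper could classify this failure from captured `yt-dlp` stderr before deciding whether to escalate. The function name `is_bot_challenge` and the matched phrase are illustrative assumptions based on the error text above; they are not part of the bundled scripts, and the wording may differ between `yt-dlp` releases.

```python
import re

# Assumption: phrase copied from the error shown above; yt-dlp may reword it.
# The optional character after "you" tolerates straight or curly apostrophes.
_BOT_CHALLENGE = re.compile(r"sign in to confirm you.?re not a bot", re.IGNORECASE)


def is_bot_challenge(stderr: str) -> bool:
    """Return True when yt-dlp stderr looks like a YouTube bot challenge."""
    return bool(_BOT_CHALLENGE.search(stderr))
```

A positive match is a signal to follow the bounded escalation below, not to retry with different clients or flags.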

**Paths that remain useful without raw media:**
- **Supadata API** (`youtube-content` skill) for metadata and transcripts
- Direct watch-page HTML parsing for title, channel, description, chapters, and transcript-panel evidence

**Bounded escalation for actual media:**
1. Use Supadata when the task only requires transcript/metadata.
2. For public media, test one fixed dedicated ISP/static-residential proxy **anonymously first**. Verify egress, run a simulation, download one small file, probe it, and run `video-watch` before persisting the provider.
3. If the anonymous proxied lane works, keep public downloads account-free.
4. Attach cookies from one fully secured secondary account only when the content legitimately requires authentication. Never paste passwords, TOTP seeds, backup codes, or cookies into chat or a link-shareable sheet.
5. Add a PO-token provider/client only when current `yt-dlp` output shows it is required.
6. Do not cycle random proxy strings, clients, IPs, or accounts. One bounded proof of concept is useful; repeated evasion attempts are not.

### Configured VPS proxy behavior

On a VPS with a verified YouTube egress lane, `scripts/download.py` automatically applies the proxy to YouTube-family URLs only (`youtube.com`, `youtu.be`, and `youtube-nocookie.com`). It reads `YOUTUBE_PROXY_URL` from the process environment first, then from `${YOUTUBE_PROXY_SECRET_FILE:-~/.hermes/secrets/youtube-proxy.env}`. Use this file shape and mode:

```bash
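# Contents of ~/.hermes/secrets/youtube-proxy.env
# Preferred on a VPS: provider-side IP whitelist, no credentials stored
YOUTUBE_PROXY_URL='http://HOST:PORT'

# Fallback only when IP whitelisting is unavailable
# YOUTUBE_PROXY_URL='http://USER:PASS@HOST:PORT'

# Run once in a shell (not part of the file): restrict the file to owner read/write
chmod 600 ~/.hermes/secrets/youtube-proxy.env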
```

The proxy is passed through the `yt-dlp` subprocess environment rather than command arguments, so it is absent from ordinary process argv and command logs. Proxied YouTube downloads automatically use one concurrent fragment, 2-second request sleeps, 5–10-second inter-download sleeps, and a 5 MB/s limit. Non-YouTube URLs remain direct and retain their existing behavior. On this profile, the optional `~/.local/bin/hermes-youtube-fetch` wrapper provides the same protected lane for explicit download-first workflows; pass its local output to `watch.py` to avoid downloading the same source twice.
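
The host matching and lookup order described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `scripts/download.py`: the helper names (`is_youtube_url`, `read_proxy_from_file`, `child_env_for`) are hypothetical, and handing the proxy to the child process through the conventional `HTTP_PROXY`/`HTTPS_PROXY` variables is an assumption about how an environment-based handoff could work.

```python
from typing import Dict, Mapping, Optional
from urllib.parse import urlsplit

# Hosts named in the text above; subdomains such as www. and m. also match.
YOUTUBE_HOSTS = ("youtube.com", "youtu.be", "youtube-nocookie.com")


def is_youtube_url(url: str) -> bool:
    """Return True only for YouTube-family hosts, not look-alike domains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in YOUTUBE_HOSTS)


def read_proxy_from_file(path: str) -> Optional[str]:
    """Read YOUTUBE_PROXY_URL from a KEY='value' file; None if absent or unreadable."""
    try:
        with open(path, encoding="utf-8") as fh:
            for raw in fh:
                line = raw.strip()
                if line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                if key.strip() == "YOUTUBE_PROXY_URL":
                    return value.strip().strip("'\"") or None
    except OSError:
        return None
    return None


def child_env_for(url: str, environ: Mapping[str, str], secret_file: str) -> Dict[str, str]:
    """Build the subprocess environment; only YouTube-family URLs get the proxy."""
    env = dict(environ)
    if not is_youtube_url(url):
        return env  # non-YouTube URLs stay direct
    proxy = environ.get("YOUTUBE_PROXY_URL") or read_proxy_from_file(secret_file)
    if proxy:
        # Assumption: the child process honours the conventional proxy variables.
        env["HTTP_PROXY"] = env["HTTPS_PROXY"] = proxy
    return env
```

Passing the result as `env=` to `subprocess.run` keeps the proxy URL out of the argument list, which is the property the paragraph above relies on.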

Run the bundled redaction-safe verification probe after adding or rotating a proxy. It checks mode `0600`, compares direct/proxied egress, validates a harmless YouTube endpoint, and can run an anonymous simulation without putting credentials in argv:

```bash
python3 "$SKILL_DIR/scripts/verify_youtube_proxy.py" \
  --simulate-url "https://www.youtube.com/watch?v=PUBLIC_VIDEO_ID"
```
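
Two of the checks named above, the `0600` mode test and keeping credentials out of anything printed, can be sketched in a few lines. The helper names `has_private_mode` and `redact_proxy_url` are hypothetical; this is not the implementation of `verify_youtube_proxy.py`.

```python
import os
import stat
from urllib.parse import urlsplit, urlunsplit


def has_private_mode(path: str) -> bool:
    """Return True when the file's permission bits are exactly 0600."""
    return stat.S_IMODE(os.stat(path).st_mode) == 0o600


def redact_proxy_url(url: str) -> str:
    """Replace any user:password part of a proxy URL before logging it."""
    parts = urlsplit(url)
    _userinfo, at, hostport = parts.netloc.rpartition("@")
    if not at:
        return url  # no credentials embedded
    return urlunsplit(parts._replace(netloc="***@" + hostport))
```

Logging only `redact_proxy_url(...)` output keeps the fallback `USER:PASS` form out of reports and shell history.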

## Shot-Boundary and Scene-Manifest Workflows

When the user asks for scene/cut detection, clip splitting, scene-aware search, or generated-video cut QA, read `references/shot-boundary-scene-manifests.md`. Preserve detected boundaries independently from representative-frame sampling, default explicit scene-detection requests to a hybrid boundary-plus-long-shot-coverage mode, and keep semantic/LLM grouping optional.

## Core Workflow

The bundled runtime is the default path. Resolve `SKILL_DIR` to this skill directory, then run:

```bash
python3 "$SKILL_DIR/scripts/watch.py" "$SOURCE" --detail balanced \
  --out-dir "$HOME/.hermes/video-watch/$(date +%Y%m%d-%H%M%S)"
```

`SOURCE` may be a public URL or local video. The command reports metadata, transcript source, extraction engine, dedup count, timestamped frame paths, and its working directory.

### Detail modes

| Mode | Behavior | Default cap |
|---|---|---:|
| `transcript` | Captions only; skips video download when possible | 0 frames |
| `efficient` | Fast encoded-keyframe scan; uniform fallback when too sparse | 50 |
| `balanced` | Scene-change extraction across the full range; uniform fallback for static video | 100 |
| `token-burner` | Scene-aware and uncapped; use only when fidelity justifies the cost | unlimited |

The default is `balanced`. Override per run with `--detail`, or set `WATCH_DETAIL` in the environment or `~/.config/watch/.env`.
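
A minimal sketch of how the mode could be resolved, using the caps from the table above. The precedence shown (explicit flag, then `WATCH_DETAIL`, then the default) is an assumption, and `resolve_detail` is a hypothetical helper, not a function exported by `watch.py`. Loading `~/.config/watch/.env` into the environment is left out.

```python
from typing import Dict, Mapping, Optional

# Default caps from the table above; None means uncapped.
DETAIL_CAPS: Dict[str, Optional[int]] = {
    "transcript": 0,
    "efficient": 50,
    "balanced": 100,
    "token-burner": None,
}
DEFAULT_DETAIL = "balanced"


def resolve_detail(cli_value: Optional[str], environ: Mapping[str, str]) -> str:
    """Pick the detail mode: explicit flag, then WATCH_DETAIL, then the default."""
    choice = cli_value or environ.get("WATCH_DETAIL") or DEFAULT_DETAIL
    if choice not in DETAIL_CAPS:
        raise ValueError(f"unknown detail mode: {choice!r}")
    return choice
```

Rejecting unknown values early avoids silently falling back to an expensive mode after a typo.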

### Frame behavior

The runtime provides all of these automatically:

- duration-aware whole-video budgets, capped at 2 fps
- denser focused-window budgets with `--start` and `--end`
- scene-aware selection for `balanced` and `token-burner`
- fast I-frame/keyframe selection for `efficient`
- conservative near-duplicate removal before the cap (`--no-dedup` disables it)
- even sampling across the full timeline when candidates exceed the cap
- exact transcript-cue frames with `--timestamps T1,T2,...`
- `--resolution 1024` for text-heavy UI/slides; otherwise keep the 512 px default
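
To make two of these behaviors concrete, here is a sketch of a 2 fps budget ceiling and of near-duplicate removal over precomputed frame hashes. Both helpers are hypothetical, and the exact budget formula, hash type, and distance threshold used by the runtime are not stated here, so treat the numbers as placeholders.

```python
from typing import List, Optional, Tuple

MAX_FPS = 2.0  # ceiling stated in the list above


def frame_budget(duration_s: float, mode_cap: Optional[int]) -> int:
    """Never request more than 2 frames per second of video or the mode cap."""
    ceiling = max(1, int(duration_s * MAX_FPS))
    return ceiling if mode_cap is None else min(mode_cap, ceiling)


def drop_near_duplicates(
    frames: List[Tuple[float, int]], max_distance: int = 4
) -> List[Tuple[float, int]]:
    """frames holds (timestamp, 64-bit perceptual hash) pairs in time order.

    A frame is kept only when its hash differs from the last kept frame by
    more than max_distance bits (Hamming distance). The threshold of 4 is an
    assumed value chosen for illustration.
    """
    kept: List[Tuple[float, int]] = []
    for ts, frame_hash in frames:
        if not kept or bin(kept[-1][1] ^ frame_hash).count("1") > max_distance:
            kept.append((ts, frame_hash))
    return kept
```

Comparing against the last kept frame rather than the immediately preceding one prevents a slow drift of tiny changes from slipping through as a run of near-identical frames.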

Examples:

```bash
# Cheap first pass
python3 "$SKILL_DIR/scripts/watch.py" "$SOURCE" --detail efficient

# Dense inspection of a named range
python3 "$SKILL_DIR/scripts/watch.py" "$SOURCE" \
  --detail balanced --start 2:15 --end 2:45 --resolution 1024

# Pin frames where the transcript says “look here” or “notice this”
python3 "$SKILL_DIR/scripts/watch.py" "$LOCAL_VIDEO" \
  --detail transcript --timestamps 4:32,7:10,9:55
```

Cue frames are reserved against the cap and never evicted by ordinary sampling. When rerunning a URL for cue frames, reuse the downloaded local video from the first work directory to avoid another network download.
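
The reservation rule can be sketched as: parse the cue timestamps, keep every cue, then fill whatever budget remains by even sampling. The helper names are hypothetical and this is an illustration of the idea, not the selection code in `watch.py`.

```python
from typing import List


def parse_timestamp(text: str) -> float:
    """Convert 'SS', 'MM:SS', or 'HH:MM:SS' into seconds."""
    seconds = 0.0
    for part in text.strip().split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def even_sample(items: List[float], n: int) -> List[float]:
    """Pick n items spread evenly across the list, endpoints included."""
    if n <= 0:
        return []
    if n >= len(items):
        return list(items)
    if n == 1:
        return [items[len(items) // 2]]
    step = (len(items) - 1) / (n - 1)
    return [items[round(i * step)] for i in range(n)]


def select_frames(candidates: List[float], cues: List[float], cap: int) -> List[float]:
    """Keep every cue, then spend the remaining budget on evenly sampled candidates."""
    remaining = max(0, cap - len(cues))
    pool = [t for t in candidates if t not in cues]
    return sorted(set(cues) | set(even_sample(pool, remaining)))


cues = [parse_timestamp(t) for t in "4:32,7:10,9:55".split(",")]
# cues is now [272.0, 430.0, 595.0]
```

Because cues are subtracted from the cap before sampling, adding more cue timestamps thins the ordinary coverage rather than exceeding the budget.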

## Transcript Strategy

Use transcript sources in this order:

1. **YouTube / transcript-first task:** run `youtube-content` first. Supadata bypasses the VPS YouTube bot wall and persistently archives metadata, chapters, links, and timestamped segments.
2. **Visual URL analysis:** the bundled runtime tries native captions before downloading media.
3. **No captions or local media:** the runtime extracts mono 16 kHz/64 kbps audio and tries Groq Whisper first and OpenAI second, using whichever keys are configured.
4. **Long audio:** audio over the 24 MB safety threshold is split automatically; timestamps are shifted back into source time. A failed chunk is skipped, and the transcript is accepted if at least one chunk succeeds.
5. **No transcript path:** proceed frames-only and state the limitation.
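
The long-audio rule in step 4 can be sketched as a merge over per-chunk results: shift each segment by its chunk's start offset, skip failed chunks, and give up only when nothing succeeded. The data shapes and the name `merge_chunks` are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float, str]  # (start_s, end_s, text), relative to its chunk


def merge_chunks(chunks: List[Tuple[float, Optional[List[Segment]]]]) -> List[Segment]:
    """Merge per-chunk transcripts back into source time.

    Each entry is (chunk_start_s, segments), with segments set to None when
    transcription of that chunk failed.
    """
    merged: List[Segment] = []
    succeeded = 0
    for offset, segments in chunks:
        if segments is None:
            continue  # failed chunk: skip it and leave a gap in the transcript
        succeeded += 1
        merged.extend((start + offset, end + offset, text) for start, end, text in segments)
    if succeeded == 0:
        raise RuntimeError("every chunk failed; no transcript available")
    return merged
```

When a chunk is skipped, the answer should mention the uncovered time range so the gap is not mistaken for silence.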

Supported runtime flags include `--whisper groq|openai`, `--no-whisper`, `--max-frames N`, `--fps F`, and `--out-dir DIR`.

### Instagram browser-resource fallback

If Instagram `yt-dlp` extraction fails with an empty media response but the browser can play the post, do **not** stop immediately or treat cookies as the only path:

1. Open the post and confirm playback/duration via `document.querySelector('video')`.
2. Inspect `performance.getEntriesByType('resource')` for `.mp4` CDN URLs.
3. Identify separate video/audio resources when present.
4. Remove transient byte-range query parameters, download the full asset, and verify with `ffprobe`.
5. Feed the local asset to `scripts/watch.py`.

See `references/instagram-browser-resource-audio.md` for the concrete workaround.
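
Step 4 can be sketched as a query-string filter. The parameter names `bytestart` and `byteend` are an assumption about how the CDN marks byte ranges; confirm them against the URLs actually observed in the browser and the reference above. The helper name `strip_byte_range` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: names of the transient byte-range parameters; verify on real URLs.
BYTE_RANGE_PARAMS = {"bytestart", "byteend"}


def strip_byte_range(url: str) -> str:
    """Drop byte-range parameters and leave every other parameter byte-for-byte intact."""
    parts = urlsplit(url)
    kept = [
        pair
        for pair in parts.query.split("&")
        if pair and pair.split("=", 1)[0].lower() not in BYTE_RANGE_PARAMS
    ]
    return urlunsplit(parts._replace(query="&".join(kept)))
```

The filter works on the raw query string rather than decoding and re-encoding it, so signed parameters keep their original encoding.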

## Hermes Vision Pattern

Hermes should not try to reason from filenames alone. Use `vision_analyze`.

### Best practice: contact sheets first

For many frames, create contact sheets in batches:

```bash
# -frames:v 1 writes a single tiled image (the first 16 frames in glob order)
ffmpeg -y -pattern_type glob -i "$WORKDIR/frames/*.jpg" \
  -vf "scale=320:-1,tile=4x4" -frames:v 1 "$WORKDIR/contact-1.jpg"
```

If needed, make multiple contact sheets from subsets of frames.
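
One way to plan those subsets is to group the sorted frame paths into batches of 16, matching the `4x4` tile above. The helper name and the `contact-N.jpg` naming are illustrative assumptions; rendering each batch (for example by symlinking it into its own directory and running the `ffmpeg` command above) is left to the caller.

```python
from pathlib import Path
from typing import Iterator, List, Tuple

TILE_COLS, TILE_ROWS = 4, 4  # matches tile=4x4 in the ffmpeg command above


def contact_sheet_batches(frames: List[str], out_dir: str) -> Iterator[Tuple[str, List[str]]]:
    """Yield (sheet_path, frame_paths) pairs, one per contact sheet, in frame order."""
    per_sheet = TILE_COLS * TILE_ROWS
    ordered = sorted(frames)
    for index in range(0, len(ordered), per_sheet):
        sheet = Path(out_dir) / f"contact-{index // per_sheet + 1}.jpg"
        yield str(sheet), ordered[index:index + per_sheet]
```

Sorting relies on zero-padded or timestamp-based frame names so that lexical order matches playback order.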

Then use `vision_analyze` to answer questions like:
- what changes across these frames?
- when does the UI break?
- what text is visible on screen?
- what visual hook opens the video?

### Escalate to individual frames

If a contact sheet reveals an important region, inspect the most relevant individual frames with `vision_analyze` for precise details.

### QA a generated Video Story / AI video output
- Treat this as a product QA pass, not just a summary.
- Start by probing exact duration and comparing it to the requested/declared target.
- Extract dense enough frames for the short output (`<=30s` → ~2 fps is usually fine) and make timestamp-labeled contact sheets.
- Inspect key individual frames where the contact sheet shows drift or artifacts.
- If the video came from Video Story, compare visuals against the DB/project plan when available: script segments, scene descriptions, shot prompts, reference images, lip-sync status, and export logs.
- Report timestamped defects in terms Alex can use to improve the pipeline: subject/reference drift, unrequested new characters, story continuity breaks, pacing dead zones, unreadable text, lip-sync/talking-head believability, artifact-prone actions (hands/water/hair/cloth/energy), style drift, and whether the final shot resolves the requested story.
- For user-supplied reference subjects, explicitly check beginning/middle/end identity preservation and whether the ending stays on the intended subject.

## Recommended Answer Pattern

Combine:
- visual findings from frames/contact sheets
- transcript/caption evidence
- exact timestamps whenever possible

Structure answers as:
1. direct answer
2. timestamped evidence
3. notable uncertainty or gaps
4. optional suggestion to rerun on a narrower window if needed

## Common Recipes

### Break down a YouTube hook
- extract the first 10-20 seconds densely
- inspect opening frames with vision
- align with opening transcript lines
- report: first visual, first spoken line, pacing shift, pattern interrupt

### Debug a screen recording
- focus on the suspicious time range
- use higher resolution frames
- look for state changes, disabled controls, error banners, modal transitions, or missing renders
- if needed, compare pre-failure and failure frames side by side

### Summarize a long video cheaply
- start with transcript/captions
- do sparse whole-video frames
- if visual ambiguity remains, rerun only on the relevant chapter or timestamp window

## Capability-Parity Audits

When comparing this pipeline with another video tool, separate **overlapping outcomes** from **implementation parity**. Do not say “we already have it” merely because both systems can produce a transcript or sample frames.

Audit the concrete runtime capabilities first:

- one-command executable path versus a prose/manual workflow
- scene-aware selection versus fixed-interval sampling
- keyframe-only fast mode
- near-duplicate removal
- focused-range transcript filtering and denser frame budgets
- forced frames at transcript-cue timestamps
- long-audio chunking and partial-failure recovery
- tests, setup automation, persistence, and provider/network fallbacks

State the result precisely: **already equivalent**, **partially overlapping**, or **missing and worth porting**. Also distinguish what actually ran in the current request. A transcript-only extraction is not a visual watch; say explicitly when no frames were downloaded or inspected.

When a useful external implementation is MIT-compatible and fills real gaps, prefer porting its tested runtime into this class-level Hermes skill while retaining Hermes-specific strengths, rather than installing a parallel duplicate. Preserve attribution, pin the upstream revision, add regression tests before production code, and verify with a smoke test on a synthetic video.

## Common Pitfalls

1. **Scanning a long video end-to-end when the user asked about one moment.** Use focused extraction.
2. **Using too many high-resolution frames.** Token/cost grows fast; bump resolution only for text-heavy scenes.
3. **Assuming transcript-only is enough.** For demos, bugs, hooks, slides, or charts, visuals are often the main signal. Never describe a transcript-only run as having watched the video.
4. **Claiming feature parity from category-level overlap.** Compare the concrete runtime checklist above before telling the user an external tool adds nothing.
5. **Trusting a sparse scan too much.** For videos over 10 minutes, call it a sparse scan and offer a targeted rerun.
6. **Forgetting local files.** This pattern works for local recordings too, not just web URLs.
7. **Wasting time on unbounded yt-dlp variations when egress is challenged.** Confirm the error with current verbose output, use Supadata for transcript/metadata, and follow the bounded proxy verification ladder in `references/youtube-vps-egress-and-account-safety.md`. Test one static ISP egress anonymously before exposing account cookies; do not rotate accounts/IPs after challenges.
8. **Not checking skills first.** Before experimenting with any approach (curl, browser, Python API, Tor), load the relevant skills. The user has established pipelines (`youtube-content` for metadata/transcripts, `video-watch` for frames, `whisper` for audio) that encode known-working approaches.

## Verification Checklist

- [ ] Confirmed `ffmpeg`, `ffprobe`, and `yt-dlp` exist
- [ ] Ran the bundled `scripts/watch.py` rather than rebuilding the extraction loop ad hoc
- [ ] Selected the cheapest sufficient detail mode and used focused ranges when possible
- [ ] Confirmed the report names the expected engine (`keyframe`, `scene`, or `uniform`) and frame count
- [ ] Preferred Supadata/native captions before Whisper fallback
- [ ] Used `vision_analyze` on contact sheets or frames
- [ ] Answer grounded in visible evidence and timestamps
- [ ] Stated clearly if transcript/audio evidence was unavailable
- [ ] For runtime changes, ran `pytest -q` plus a smoke test on a real ffmpeg-synthesized video