Embedded Video Extraction

When to use

The user asks you to download videos from a website where:

- The <video> element's src is a blob: URL (not a real CDN URL)
- The site uses Squarespace, Wistia, JWPlayer, Vimeo, or a custom HLS player
- Right-click → save doesn't work
- The user has legitimate access (logged-in, paid, owns the content)

Do not use for: YouTube (use yt-dlp directly with the page URL), DRM-protected streams (Widevine/PlayReady — yt-dlp cannot decrypt these), or content the user doesn't have rights to.

Core technique

A blob: URL on <video> means the player feeds the element through Media Source Extensions and assembles the stream client-side, usually from an HLS or DASH master playlist. The real source is exposed somewhere on the page, usually in a data-* attribute, a <script> JSON config, or a network request the page already made. Extract that URL, then hand it to yt-dlp.

Step 1 — Identify the player

Load the page in the browser tool and inspect what's actually in the DOM:

// browser_console expression
(() => {
  const out = {videos: [], iframes: [], sources: [], dataAttrs: [], urls: []};
  document.querySelectorAll('video').forEach(v => out.videos.push({src: v.src, currentSrc: v.currentSrc, poster: v.poster}));
  document.querySelectorAll('iframe').forEach(f => out.iframes.push({src: f.src}));
  document.querySelectorAll('source').forEach(s => out.sources.push({src: s.src, type: s.type}));
  const html = document.body.innerHTML;
  const dataMatches = html.match(/data-config[^=]*="[^"]+"|data-video[^=]*="[^"]+"|videoUrl[^,}]+/gi);
  out.dataAttrs = dataMatches ? dataMatches.slice(0,10) : [];
  const urlMatches = html.match(/https?:[^"'\s]+\.(?:mp4|m3u8|mpd|webm)[^"'\s]*/gi);
  out.urls = urlMatches ? [...new Set(urlMatches)].slice(0,20) : [];
  return out;
})()

Step 2 — Check the page's resource entries

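If the player has already fetched its playlist, the page's performance entries reveal the real URL. Players that load lazily only request the playlist once playback starts, so if the list comes back empty, press play and run the expression again: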

// browser_console expression
performance.getEntriesByType('resource')
  .filter(e => /\.m3u8|\.mpd|\.mp4|cdn|video/.test(e.name))
  .map(e => e.name)

This is usually the fastest path — the m3u8 master playlist URL is sitting right there.
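When the resource list is long, it can help to sort the candidates before picking one. The following is a minimal sketch, not part of the skill's tooling: `rank_stream_urls` and `PREFERRED_SUFFIXES` are hypothetical names, and the ordering (manifests before progressive files) is an assumption based on the advice above. It matches on the URL path so that signed query strings do not hide the extension.

```python
from urllib.parse import urlparse

# Assumed preference order: streaming manifests first, then progressive files.
PREFERRED_SUFFIXES = (".m3u8", ".mpd", ".mp4", ".webm")


def rank_stream_urls(urls):
    """Return de-duplicated media URLs, manifests first.

    Matching uses only the URL path, so query strings such as
    ?Signature=... do not interfere with the extension check.
    """
    ranked = []
    seen = set()
    for url in urls:
        path = urlparse(url).path.lower()
        for priority, suffix in enumerate(PREFERRED_SUFFIXES):
            if path.endswith(suffix) and url not in seen:
                seen.add(url)
                ranked.append((priority, url))
                break
    # sorted() is stable, so URLs with equal priority keep their page order.
    return [url for _, url in sorted(ranked, key=lambda item: item[0])]
```

Paste the array returned by the console expression into `rank_stream_urls` and try the first result; segment files (.ts) and unrelated assets are dropped.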

Step 3 — Map the player to its extraction shape

See references/players.md for the per-player extraction recipes (Squarespace, Wistia, JWPlayer, Vimeo). Add a new entry there whenever you encounter a new player.

Step 4 — Download with yt-dlp

For HLS:

yt-dlp \
  --referer 'https://site.com/' \
  -o 'output-name.%(ext)s' \
  --merge-output-format mp4 \
  'https://cdn.example.com/path/playlist.m3u8'

yt-dlp handles AES-128 encrypted segments natively — no extra flags needed. For login-gated streams where the playlist URL contains a Signature query param, the URL itself is the auth — copy it fresh from the browser, do not store it long-term.
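Because the signed URL is the credential, it is worth checking how long it has left before starting a long download. A minimal sketch, assuming the URL carries an `Expires=<unix epoch>` query parameter as CloudFront canned-policy signed URLs do; URLs signed with a custom policy carry a `Policy` parameter instead, and for those this hypothetical helper simply returns None.

```python
import time
from urllib.parse import parse_qs, urlparse


def seconds_until_expiry(signed_url, now=None):
    """Return seconds left before an Expires=<epoch> signed URL lapses.

    Returns None when the URL has no numeric Expires parameter
    (for example a custom-policy signature), so callers can fall back
    to "download immediately".
    """
    query = parse_qs(urlparse(signed_url).query)
    values = query.get("Expires")
    if not values or not values[0].isdigit():
        return None
    current = time.time() if now is None else now
    return int(values[0]) - int(current)
```

A negative result means the URL has already expired and needs to be copied fresh from the browser.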

Pitfalls

- Signed URLs expire. Squarespace/CloudFront signatures often expire in 1 to 24 hours. Extract and download in one pass; do not collect URLs today to download tomorrow.
- Referer matters. Many CDNs reject requests without a matching Referer. Always pass --referer 'https://origin-site.com/' to yt-dlp.
- Master playlist vs. variant playlist. Always grab the master (playlist.m3u8 without mpegts- prefix). yt-dlp will pick the best variant. If you grab a variant directly you lock yourself to that resolution.
- A blob: URL is never downloadable. Don't waste time on it. It's a MediaSource Extensions handle that only exists inside the page's JS context.
- Login-gated sites: browser-side cookies don't transfer to yt-dlp automatically. If a CDN URL needs auth beyond Referer, either export the cookies from the browser tool to a Netscape-format file and pass --cookies <file>, let yt-dlp read a local browser profile with --cookies-from-browser <browser>, or send individual headers with --add-headers 'Name:Value'.
- Don't try sed/awk on HTML. Use the browser tool's browser_console JS evaluation; DOM parsing in shell is fragile.
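For the master-versus-variant pitfall, the playlist body itself tells you which kind you fetched: a master playlist lists renditions with `#EXT-X-STREAM-INF`, while a variant (media) playlist lists segments with `#EXTINF`. A minimal sketch with a hypothetical helper name, useful as a quick check on text you have already downloaded:

```python
def classify_playlist(text):
    """Classify HLS playlist text as 'master', 'variant', 'unknown' or 'not-hls'."""
    # Drop a possible byte-order mark and blank lines before inspecting tags.
    lines = [line.strip() for line in text.lstrip("\ufeff").splitlines() if line.strip()]
    if not lines or lines[0] != "#EXTM3U":
        # HTML error pages and JSON bodies land here.
        return "not-hls"
    if any(line.startswith("#EXT-X-STREAM-INF") for line in lines):
        return "master"
    if any(line.startswith("#EXTINF") for line in lines):
        return "variant"
    return "unknown"
```

A 'not-hls' result on a URL ending in .m3u8 usually means the CDN returned an error page, which points back to the Referer and expiry pitfalls.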

Verification

After downloading, always:

ffprobe -v error -show_entries format=duration,bit_rate:stream=codec_name,width,height file.mp4

to confirm you got a real video, not a 4KB error page renamed to .mp4.
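In a scripted pipeline the same check can be automated. The sketch below assumes ffprobe is on PATH and uses its JSON output (`-print_format json -show_format -show_streams`); the function names are hypothetical. The parsing step is kept separate from the subprocess call so it can be exercised without a real file.

```python
import json
import subprocess


def summarize_probe(probe):
    """Reduce parsed ffprobe JSON to the fields worth checking.

    Raises ValueError when there is no video stream or no positive duration.
    """
    video = [s for s in probe.get("streams", []) if s.get("codec_type") == "video"]
    duration = float(probe.get("format", {}).get("duration", 0) or 0)
    if not video or duration <= 0:
        raise ValueError("no playable video stream found")
    first = video[0]
    return {
        "codec": first.get("codec_name"),
        "width": first.get("width"),
        "height": first.get("height"),
        "duration": duration,
    }


def probe_file(path):
    """Run ffprobe on path and summarize it; raises if ffprobe rejects the file."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-print_format", "json",
         "-show_format", "-show_streams", path],
        capture_output=True, text=True, check=True,
    )
    return summarize_probe(json.loads(result.stdout))
```

An error page saved as .mp4 makes ffprobe exit non-zero, so `probe_file` raises `subprocess.CalledProcessError` rather than returning a summary.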

Hetzner S3 archival

For uploading downloaded videos to Hetzner Object Storage, see the hetzner-s3-storage skill for bucket creation, access controls, and presigned URLs.

rclone is not always installed on the VPS. The reliable path is python3 + boto3 with creds sourced from a known-good app's .env (/home/avalon/apps/video-story/.env). See templates/catalog-to-s3-pipeline.sh for the full bash+python pipeline that:

- creates the bucket with a public-read policy if missing,
- iterates a TSV manifest of (filename, asset_id) rows,
- dispatches per row to the right yt-dlp invocation (Squarespace native vs Vimeo iframe; extend for other players),
- ffprobes each output for sanity,
- uploads each MP4, then builds and uploads a .zip archive.
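The manifest-driven part of such a pipeline can be sketched as follows. This is an illustration only, not the contents of templates/catalog-to-s3-pipeline.sh: `read_manifest` and `build_ytdlp_command` are hypothetical names, the per-player step that turns an asset_id into a stream URL is left out because it depends on the player, and the yt-dlp flags are the ones from Step 4.

```python
import csv


def read_manifest(lines):
    """Yield (filename, asset_id) pairs from tab-separated manifest lines.

    `lines` is any iterable of strings, such as an open file. Blank rows
    are skipped; rows with fewer than two columns raise ValueError.
    """
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue
        if len(row) < 2:
            raise ValueError(f"manifest row needs filename and asset_id: {row!r}")
        yield row[0].strip(), row[1].strip()


def build_ytdlp_command(url, output_stem, referer):
    """Build the argv list for one HLS download, mirroring the Step 4 flags."""
    return [
        "yt-dlp",
        "--referer", referer,
        "-o", f"{output_stem}.%(ext)s",
        "--merge-output-format", "mp4",
        url,
    ]
```

Passing the command as a list to `subprocess.run` avoids shell quoting problems with signed URLs, which contain `&` and `=`.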

Run the pipeline as a background job

A multi-video catalog download will take minutes to hours. Foreground tool calls in this environment get interrupted whenever the user sends a new message, and an interrupted download leaves partial files. Always launch catalog downloads as a terminal(background=true, notify_on_complete=true) job so:

- the pipeline survives any further interactions with the user,
- you can immediately report the first finished URL while the rest continue,
- you get a single completion notification at the end instead of polling.

Write the manifest TSV to a stable path (e.g. ~/.hermes/jobs/<name>-manifest.tsv) so the background script can read it, and write a .done marker file per video once it has been downloaded and uploaded, so that a restarted job skips finished videos and the re-run is idempotent.
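The marker-file idea can be sketched in a few lines. The helper names and the `<filename>.done` naming are assumptions for illustration; the point is that a marker is written only after a video is fully handled, and a restart processes only the names without one.

```python
from pathlib import Path


def pending_videos(filenames, out_dir):
    """Return the filenames that have no .done marker in out_dir yet."""
    out = Path(out_dir)
    return [name for name in filenames if not (out / f"{name}.done").exists()]


def mark_done(filename, out_dir):
    """Create the .done marker for filename; call only after upload succeeds."""
    marker = Path(out_dir) / f"{filename}.done"
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()
    return marker
```

Writing the marker last matters: if the job is interrupted mid-download, no marker exists and the video is retried on the next run.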

Post-task credential hygiene

When the user provides login credentials to access gated content, after the task is complete strip them from any logs you wrote. Common locations to scrub:

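- The pipeline log (e.g. ~/.hermes/jobs/<name>.log): usually clean since yt-dlp doesn't log cookies, but check, and remember that a signed playlist URL in the log is itself a credential until it expires.
- The browser tool's session/cache directories.
- Any debug dumps or set -x traces that may have captured env vars.
- Shell history (in bash, history -d <offset> removes an individual entry from the in-memory list and history -w writes the change to the history file; not a permanent rewrite but covers the common case).
- The conversation's own scratch files under /tmp or staging dirs.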

Do NOT save the credentials to memory or skills — they're per-task secrets, not durable user facts.
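Scrubbing a log can be done mechanically once you know the literal secrets. A minimal sketch with a hypothetical helper: it replaces each known secret string and, as an extra assumption, also blanks the value of any `Signature=` query parameter, since signed URLs act as credentials.

```python
import re

# Matches the value of a Signature query parameter up to the next delimiter.
_SIGNATURE_RE = re.compile(r"(Signature=)[^&\s\"']+")


def redact_secrets(text, secrets, placeholder="[REDACTED]"):
    """Return text with known secrets and Signature= values replaced."""
    # Longest first, so a secret containing a shorter one is fully covered.
    for secret in sorted((s for s in secrets if s), key=len, reverse=True):
        text = text.replace(secret, placeholder)
    return _SIGNATURE_RE.sub(lambda m: m.group(1) + placeholder, text)
```

Read the log, pass it through `redact_secrets` with the username and password the user supplied, and write the result back over the original file.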