Extract visual context from video at the moments that matter. The tool analyzes the transcript to identify when visual context is needed, then either extracts frames at those timestamps or flags visual gaps for you to fill manually. It skips talking heads, B-roll, and audio-sufficient content. Four modes:
- Query mode (default): describe what you need visual context for. The tool searches the transcript semantically and extracts frames at matching moments.
- Auto mode: autonomously detects visual moments from the transcript using pattern matching and semantic scoring.
- Manual mode: extract frames at specific timestamps you provide.
- Assist mode: analyzes the transcript for visual gaps and returns time ranges where you should provide your own screenshots. Ideal for talking-head videos or podcasts where the speaker describes a workflow but the video doesn't show it. Fill the gaps with your own images as ![[filename.png]] wikilink embeds. Assist mode returns structured gap data only; no frames are extracted.
Example: Query mode
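A hedged sketch of a query-mode request. The payload is shown as a plain Python dict; the exact wire format (MCP tool call, CLI, etc.) depends on your client, and the URL is a placeholder. Parameter names follow the table below.

```python
# Hypothetical query-mode request; all keys come from the parameter table.
request = {
    "url": "https://example.com/gmail-tutorial",  # or "video_path" for a local file
    "query": "Gmail configuration steps",         # matched semantically against the transcript
    "top_k": 10,                                  # transcript matches to consider
    "max_frames": 30,                             # cap on extracted frames
}
```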
Get screenshots of the Gmail configuration steps from a tutorial video.

Example: One-shot with take_notes
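take_notes is referenced but not documented on this page, so the shape of a combined call below is an assumption for illustration; the path is a placeholder.

```python
# Hypothetical one-shot call: take_notes plus visual context in a single request.
request = {
    "video_path": "/videos/tutorial.mp4",
    "query": "configuration steps",  # frames land in the note as ![[frame.png]] embeds
}
```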
Get notes and visual context in a single call. ![[frame.png]] embeds inline at the relevant sections. Open in Obsidian and every screenshot renders where it belongs.
Example: Auto mode
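In auto mode the request carries no query and no timestamps; a minimal sketch (plain dict, placeholder path):

```python
# Auto mode: the tool scores transcript segments itself.
request = {
    "video_path": "/videos/demo.mp4",  # placeholder path
    "auto": True,
}
```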
Let the tool decide which moments need visual context.

Example: Manual mode
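A manual-mode sketch: timestamps are given in seconds, per the parameter table, and no transcript search is involved (placeholder path).

```python
# Manual mode: extract frames at exact timestamps (seconds).
request = {
    "video_path": "/videos/demo.mp4",  # placeholder path
    "timestamps": [92, 185, 240.5],    # seconds into the video
}
```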
Extract frames at specific timestamps you already know.

Example: Assist mode
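An assist-mode sketch. The request uses only documented parameters; the shape of the returned gap data is an assumption, since the page says only that it is structured time ranges.

```python
# Assist mode: returns gap ranges instead of frames.
request = {
    "url": "https://example.com/podcast-episode",  # placeholder URL
    "assist": True,
}

# Hypothetical shape of the structured gap data (an assumption):
gaps = [
    {"start": 312.0, "end": 347.5, "hint": "speaker walks through the webhook setup"},
]
```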
The speaker describes their entire automation workflow on a podcast, but the video is just their face. Assist mode flags the moments where screenshots would complete the picture.

Example: Clear and redo
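Per the parameter table, clear combines with query (or auto), so wipe-and-re-extract fits in one call; a sketch with a placeholder path:

```python
# clear=True removes previous frames for this video; the query then
# drives a fresh extraction in the same call.
request = {
    "video_path": "/videos/demo.mp4",   # placeholder path
    "clear": True,
    "query": "final dashboard layout",  # new query for the redo
}
```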
Remove all previous frames for a video and extract fresh ones with a new query.

Parameters
| Parameter | Required | Default | Description |
|---|---|---|---|
| video_path | No | — | Path to a local video file. Provide this or url. |
| url | No | — | Video URL. Downloads the video automatically. Provide this or video_path. |
| query | No | — | What you need visual context for. Searches the transcript semantically. |
| auto | No | false | Autonomously detect visual moments from the transcript. |
| assist | No | false | Analyze the transcript for visual gaps and return time ranges for manual screenshots. No frames extracted. |
| timestamps | No | — | List of timestamps in seconds to extract frames at. |
| clear | No | false | Remove all previous frames for this video. Can combine with query or auto. |
| model_size | No | tiny | Whisper model size for transcription. |
| max_frames | No | 30 | Maximum number of frames to extract. |
| top_k | No | 10 | Number of transcript matches in query mode. |
| context_words | No | 40 | Words of context around each match in query mode. |
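Reading the defaults off the table, a minimal call needs only a source; everything else falls back to its default. A sketch of how that composes (build_request is a hypothetical helper, not part of the tool):

```python
# Defaults taken from the parameter table above.
DEFAULTS = {
    "auto": False, "assist": False, "clear": False,
    "model_size": "tiny", "max_frames": 30, "top_k": 10, "context_words": 40,
}

def build_request(**overrides):
    # One of video_path or url is required; the rest falls back to DEFAULTS.
    if not ({"video_path", "url"} & overrides.keys()):
        raise ValueError("provide video_path or url")
    return {**DEFAULTS, **overrides}
```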
How frame selection works
The tool doesn't blindly grab frames at fixed intervals.

- Transcript analysis: each segment is scored for visual necessity using pattern matching and semantic embeddings.
- Smart timing: three candidate frames are extracted per timestamp (t, t+1s, t+2s). The one with the highest edge density wins, capturing the sharpest UI or screen content.
- Deduplication: near-identical frames are detected using perceptual hashing and dropped automatically.
- Intro suppression: the first 30 seconds of a video require much higher confidence to qualify, filtering out B-roll and title cards.
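
The timing and deduplication steps above can be sketched in a few lines. edge_density() and the 64-bit perceptual hashes are stand-ins for real implementations (e.g. Sobel edge counting and pHash); only the candidate offsets and the Hamming-distance dedup logic come from the description above.

```python
def candidate_times(t: float) -> list[float]:
    """Three candidate frames per timestamp: t, t+1s, t+2s."""
    return [t, t + 1.0, t + 2.0]

def pick_sharpest(frames: list[tuple[float, float]]) -> float:
    """frames: (timestamp, edge_density) pairs; highest edge density wins,
    capturing the sharpest UI or screen content."""
    return max(frames, key=lambda f: f[1])[0]

def dedupe(hashes: list[int], threshold: int = 5) -> list[int]:
    """Drop frames whose perceptual hash is within `threshold` bits
    (Hamming distance) of an already-kept frame."""
    kept: list[int] = []
    for h in hashes:
        if all(bin(h ^ k).count("1") > threshold for k in kept):
            kept.append(h)
    return kept
```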

