I Ran 56 Experiments to Find the Best Way to Make AI Watch Videos

I wanted a simple thing — feed a video to an AI running on my Mac and get back useful descriptions of what's happening in each frame. Not a cloud API. Not a $200/month subscription. Just a local pipeline that actually works.

Three days and 56 experiments later, the biggest finding was counterintuitive: telling the model what the speaker is saying matters more than any vision trick, OCR injection, or bigger model.

The Problem With Video Understanding

Most "AI video tools" are wrappers around OpenAI's API. You upload your video, pay per minute, and get back generic summaries. That's fine for some use cases, but I wanted something that runs locally, processes any video, and extracts specific details — option names, numbers, UI labels, before/after states.

Think screen recordings of software, tutorials, product demos. The kind of video where "a person is showing a WordPress admin panel" is useless. I need "the Database Optimizer tool shows 856 KB of transients, with options for Post Revisions, WP Cron tasks, and Orphaned Data."

That's a much harder problem.

The Setup: Ollama + Whisper + Python

The pipeline is straightforward:

  1. Extract frames from the video (1 frame every 10 seconds)
  2. Transcribe audio with Whisper (one-time pass, ~69 seconds for a 14-minute video)
  3. Analyze each frame with a vision model via Ollama
  4. Merge visual descriptions with speech by timestamp
```bash
python3 caption.py video.mp4 --serve
```

That's it. One command. It downloads the video (works with YouTube URLs too), extracts everything, runs the analysis, and serves a web viewer.
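
Step 4, the timestamp merge, is simple bookkeeping. Here's a minimal sketch, assuming one frame every 10 seconds and Whisper-style segments with `start`/`end` times (function and field names are mine, not the actual `caption.py` internals):

```python
def merge_by_timestamp(frame_descriptions, transcript_segments, interval=10):
    """Pair each frame's description with any speech overlapping its time window."""
    merged = []
    for i, desc in enumerate(frame_descriptions):
        start, end = i * interval, (i + 1) * interval
        speech = " ".join(
            seg["text"].strip()
            for seg in transcript_segments
            if seg["start"] < end and seg["end"] > start  # any overlap counts
        )
        merged.append({"timestamp": start, "description": desc, "speech": speech})
    return merged
```

A segment that straddles a frame boundary simply gets attached to both frames, which is what you want: the speaker's sentence is usually relevant to the screen on either side of the cut.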

The hard part isn't building the pipeline. It's figuring out which model, which prompt, which context, and which settings actually produce good results.

The Scoring System

I borrowed the core loop from Karpathy's autoresearch: define a scalar metric, run experiments, keep improvements. He used it for training optimization — I used it for prompt engineering and model selection.
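
The loop itself is tiny. A sketch of the keep-if-better pattern, assuming some `score(config)` callable that returns the combined metric for one configuration (this is the shape of the idea, not the actual autoresearch scripts):

```python
def autoresearch(configs, score):
    """Karpathy-style loop: evaluate each config, keep the best score seen so far."""
    best_cfg, best_score = None, float("-inf")
    history = []
    for cfg in configs:
        s = score(cfg)
        history.append((cfg, s))
        if s > best_score:
            best_cfg, best_score = cfg, s
    return best_cfg, best_score, history
```

The whole trick is the scalar metric: once every experiment reduces to one number, "did this prompt tweak help?" stops being a matter of opinion.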

My scoring system checks each frame's description against ground truth keywords.
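
The exact scorer isn't shown here, but a minimal keyword-coverage sketch consistent with the behavior described below would look like this (the real metric may weight keywords differently):

```python
def score_frame(description, keywords):
    """Percentage of ground-truth keywords found in the description (0-100)."""
    if not keywords:
        return 0.0
    text = description.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return 100.0 * hits / len(keywords)
```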

This heavily punishes vague descriptions and rewards models that actually read what's on screen. A "WordPress admin page" earns almost nothing. "Database Optimizer showing Transients at 856 KB with a Clear Expired button" earns close to 100.

Round 1: Finding the Right Model (15 Experiments)

Started with what I had — qwen2.5-vl:7b. It was slow (28s per frame), used 14.6 GB of VRAM, and hallucinated constantly. The model would describe people or objects that simply weren't in the frame.

Switched to qwen3-vl and immediately saw improvements. But which size?

| Model | Speed | VRAM | Accuracy | Hallucinations |
|---|---|---|---|---|
| qwen2.5-vl:7b | 28s/frame | 14.6 GB | Low | 100% of frames |
| qwen3-vl:8b | 14s/frame | 6.1 GB | Good | 0% |
| qwen3-vl:4b | 7.5s/frame | 3.3 GB | Same as 8b | 0% |

The 4b model matched the 8b model's accuracy while being 2x faster and using half the memory. Zero hallucinations across every single inference.

First insight: smaller isn't always worse. The 4b model is just as accurate as the 8b for screenshot analysis. You're paying 2x the latency and 2x the VRAM for nothing.

Round 2: Audio Context Changes Everything (6 Experiments)

Here's where it gets interesting. I used mlx-whisper to transcribe the video's audio track, then injected the speech into each frame's vision prompt:

```
The speaker is saying: "here you can see the transients
that accumulated over time, we can clear those"

What is shown in this image?
```

The result? +26% on the combined score. The biggest single improvement across all 56 experiments.

[Chart: +26% improvement when adding audio context to vision prompts]
| Context | Combined Score | Change |
|---|---|---|
| Vision only (baseline) | 60.5 | |
| Vision + Audio | 76.0 | +26% |

Why does this work so well? The speech gives the model semantic context. When the speaker says "here you can see the transients," the model knows to focus on transients in the screenshot. Without audio, it describes the page generically. With audio, it knows exactly what matters.

The audio transcription takes ~69 seconds for a 14-minute video and gets cached. So it's basically free after the first run.
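
Wiring the speech into the per-frame inference is a few lines. A sketch assuming Ollama's default `/api/generate` endpoint on `localhost:11434` (the endpoint and payload fields are standard Ollama, but the function names are mine):

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def build_prompt(speech, question="What is shown in this image?"):
    """Prepend the overlapping speech to the vision question (the +26% trick)."""
    if not speech:
        return question
    return f'The speaker is saying: "{speech}"\n\n{question}'

def describe_frame(image_path, speech, model="qwen3-vl:4b"):
    """One blocking, non-streaming Ollama vision call for a single frame."""
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode("ascii")
    payload = json.dumps({
        "model": model,
        "prompt": build_prompt(speech),
        "images": [img_b64],
        "stream": False,
    }).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```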

Round 3: Does Chaining Frame Descriptions Help? (10 Experiments)

I tested feeding each frame's description as context for the next frame. The idea was that knowing "the previous frame showed the database optimizer menu" would help the model understand "now we're looking at the cleanup results."

Results were mixed.

The model starts summarizing what the previous frame said instead of looking carefully at the current frame. Chaining helps narrative but hurts the thing I care about most — extracting specific details.

Second insight: context can be a distraction. More information isn't always better. The model has a limited attention budget, and chain context eats into it.
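
For reference, the chained variant amounts to prepending the previous frame's description. A minimal sketch (the exact wording of the injected line is my assumption):

```python
def chained_prompt(prev_description, question="What is shown in this image?"):
    """Round 3 variant: feed the previous frame's description forward as context."""
    if not prev_description:
        return question
    return f"The previous frame showed: {prev_description}\n\n{question}"
```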

Round 4: The OCR Experiment That Failed (8 Experiments)

This seemed like a slam dunk. macOS has a built-in Vision framework that does OCR in ~0.6 seconds per frame. I compiled a Swift binary that extracts all visible text:

```
Database Optimizer | Transients 856 KB | Post Revisions |
avatar_default | blog_charset | CLEAR EXPIRED
```

Perfect. Exactly the keywords the vision model was missing. I injected this OCR text into the prompt:

```
Text visible in image: "Database Optimizer | Transients 856 KB..."

What is shown in this image?
```

It made things worse. Every OCR variant scored lower than audio-only.

| Context | Combined Score |
|---|---|
| Audio only | 76.0 |
| Audio + OCR + chain | 72.8 |
| Audio + OCR | 66.5 |
| OCR only (no audio) | 61.7 |

What happened? The OCR text pollutes the context window. The model starts summarizing the OCR output instead of analyzing the image. It becomes a text summarization task, not a vision task.

The OCR provides mostly the same information the model already extracts visually — but in a format that confuses rather than complements. Audio says "here you can see the transients that accumulated over time" — that's semantic, it tells the model what to look at and why. OCR says "Transients 856 KB CLEAR EXPIRED" — that's syntactic, raw data without meaning.

Third insight: semantic context beats syntactic context. Speech tells the model what matters. OCR just dumps text.

The OCR binary is still useful though — just not in the vision prompt. I save it as a separate field for search and indexing.
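
Keeping it as a sibling field is trivial. A sketch of the per-frame record and a grep-style search over it (the field names are my own, not the pipeline's actual schema):

```python
def frame_record(timestamp, description, ocr_text):
    """Keep OCR as a separate field: searchable, but never in the vision prompt."""
    return {"timestamp": timestamp, "description": description, "ocr_text": ocr_text}

def search(records, query):
    """Return frames whose OCR text or description contains the query."""
    q = query.lower()
    return [r for r in records
            if q in r["ocr_text"].lower() or q in r["description"].lower()]
```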

Round 5: Testing 5 Vision Models (7 Experiments)

Maybe I was overfit to Qwen. Time to test the competition. I pulled every small vision model available in Ollama:

[Chart: 5 small vision models compared: qwen3-vl, minicpm-v, gemma3, granite, llava-phi3]
| Model | Size | Speed | Frame Accuracy | Combined |
|---|---|---|---|---|
| qwen3-vl:4b | 3.3 GB | 7.8s | 59.7 | 72.1 |
| minicpm-v | 5.5 GB | 8.8s | 50.0 | 68.7 |
| gemma3:4b | 3.3 GB | 9.5s | 53.3 | 68.6* |
| granite3.2-vision:2b | 2.4 GB | 16.3s | 5.5 | -8.4 |
| llava-phi3 | 2.9 GB | 2.9s | 3.3 | 16.5 |

*gemma3 score is WITHOUT audio — it actually performs worse with audio context

Qwen wins decisively. But the really interesting finding is about gemma3 and audio context.

When I inject speech context into gemma3's prompt, its accuracy drops by 28%. The same audio injection that gives Qwen a +26% boost destroys gemma3. The speech injection format was optimized for Qwen's architecture and clearly doesn't transfer.

minicpm-v deserves a mention — it scored the best narrative understanding of any model (96.7) but was weaker on specific details. If you need "what's the story across these frames" rather than "what exact text is on screen," minicpm-v is worth testing.

granite3.2-vision:2b and llava-phi3 are completely useless for screenshot analysis. granite scored 0 on nearly every frame despite being the slowest model. llava-phi3 is fast (2.9s) but produces nothing relevant.

Fourth insight: audio context is model-dependent. Don't assume a technique that works with one model transfers to another. Test it.

Round 6: Prompt Engineering for Maximum Detail (10 Experiments)

The last round focused on squeezing more detail out of the winning model. I tested 10 different prompts:

[Chart: prompt engineering strategies and their effect on accuracy]
| Prompt Strategy | Frame Accuracy | Combined |
|---|---|---|
| "Describe this screenshot. For each section visible, list: the heading, all options/items shown, and any numbers or values displayed." | 63.4 | 73.9 |
| "What is shown in this image?" (baseline) | 58.7 | 72.8 |
| "List everything as bullet points" | 52.7 | 66.9 |
| "Read all visible text, labels, numbers" | 51.0 | 65.7 |
| "You are an OCR system. Extract every text..." | 43.4 | 61.1 |
| Baseline + OCR injection | 39.1 | 54.9 |

The structured prompt wins by +8% accuracy. It tells the model to walk through each section of the UI systematically — headings, options, values. This matches how screenshots are actually organized.

Asking the model to "read all text" or "be an OCR system" actually hurts. The model loses its visual understanding and becomes a bad text copier. It needs to understand the screen to describe it well.

More tokens don't help either. Giving the model 400 tokens instead of 200 produced worse results — it just rambles more without seeing more.

Fifth insight: structure your prompt to match your data. UI screenshots have sections with headings and options. Tell the model to look for that structure and it delivers.

The Final Configuration

After 56 experiments and 790+ inferences:

```bash
python3 caption.py video.mp4 --serve \
  --vision-model qwen3-vl:4b \
  --prompt "Describe this screenshot. For each section visible, \
    list: the heading, all options/items shown, and any numbers \
    or values displayed."
```

[Chart: score evolution across all 56 experiments]

What This Doesn't Solve (Yet)

The pipeline works great for screen recordings and software demos. But there are gaps I haven't tested.

Different video types. All 56 experiments used WordPress admin screenshots. Coding tutorials, presentations, product demos with live UI — those might respond differently to audio context or prompt structure. The Karpathy loop makes it easy to test, but I haven't done it yet.

Real-time processing. At 10.5s per frame, this is a batch tool. You can't use it for live captioning. Smaller models like llava-phi3 are fast enough (2.9s) but the quality is unusable. There might be a middle ground with SmolVLM at 2.2B parameters.

Languages other than English. Whisper handles multilingual audio fine, but Qwen3-VL's text recognition in non-Latin scripts? No idea. Worth testing.

The 63% accuracy ceiling. Even the best config only extracts ~63% of specific UI text. The model sees "Database Optimizer" clearly but misses "avatar_default" in a dense options table. That's a model capability limit. Maybe Qwen3-VL at 8B or 32B cracks it — but the speed penalty might not be worth it. Or maybe a future 4B model just gets better at reading small text.

The full pipeline code is at video-caption on GitHub, and the autoresearch scripts are included. If you run your own experiments and find something that beats 73.9, I want to hear about it.