I am working on a harness for a video editor. A lot of the work is just poking at weird edge cases and learning what breaks.
This one started when the video-editing agent began describing puppies in videos from my work. I did not recall recording any puppies, so I had to dig deeper.
The Setup
We had some uploaded videos that were around one second long. They were iPhone .mov files, and in practice they behaved like one-frame-ish videos.
When we sent them to Gemini as videos, the summaries were nonsense:
- a puppy running on grass
- a car brake pad tutorial
- a laptop repair tutorial
- cooking videos
But when I extracted a frame from the same videos and sent it as an image, Gemini described it correctly.
So the problem was not that the scene was hard. The problem was the representation.
The Interesting Signal
Gemini returns usage metadata. In the bad cases, the video requests had no VIDEO tokens.
That is the whole thing.
If Gemini does not ingest video tokens, then it is not seeing the video. It still returns text, but now the text is just floating in the void.
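Here is roughly the check we ended up with. A minimal sketch using the google-genai Python SDK; the field names (usage_metadata, prompt_tokens_details, modality) match the SDK version I tested, and the model name and file path are just placeholders, so verify against your setup:

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

with open("tiny_clip.mov", "rb") as f:  # hypothetical file
    video = types.Part.from_bytes(data=f.read(), mime_type="video/quicktime")

response = client.models.generate_content(
    model="gemini-2.0-flash",  # any video-capable model
    contents=[video, "Summarize this video."],
)

# prompt_tokens_details breaks the ingested prompt down by modality
# (TEXT, AUDIO, VIDEO, ...). No VIDEO entry means no frames were seen.
details = response.usage_metadata.prompt_tokens_details or []
video_tokens = sum(
    d.token_count or 0
    for d in details
    if d.modality == types.MediaModality.VIDEO
)
if video_tokens == 0:
    raise RuntimeError("no VIDEO tokens ingested; do not trust the summary")
```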
Results
Here are some tests:
| Input | Duration | Frames / FPS | Gemini visual tokens | Result |
|---|---|---|---|---|
| raw iPhone .mov | ~0.93s | effectively 1 frame | no VIDEO tokens | hallucinated |
| raw iPhone .mov | ~0.99s | effectively 1 frame | no VIDEO tokens | hallucinated |
| MP4 | 0.20s | 1 frame @ 5fps | no VIDEO tokens | hallucinated |
| MP4 | 0.33s | 1 frame @ 3fps | no VIDEO tokens | hallucinated |
| MP4 | 0.40s | 2 frames @ 5fps | no VIDEO tokens | hallucinated |
| MP4 | 0.50s | 1 frame @ 2fps | VIDEO tokens | correct |
| MP4 | 0.53s | 8 frames @ 15fps | VIDEO tokens | correct |
| MP4 | 0.60s | 3 frames @ 5fps | VIDEO tokens | correct |
| MP4 | 0.90s | 27 frames @ 30fps | VIDEO tokens | correct |
| MP4 | 1.00s | 1 frame @ 1fps | VIDEO tokens | correct |
| MP4 | 2.00s | 2 frames @ 1fps | VIDEO tokens | correct |
| JPEG frame | single image | extracted from a bad video | IMAGE tokens | correct |
The exact boundary is not simply "less than one second". A 0.6s MP4 with enough frames worked, while a 0.4s MP4 often failed. A one-frame video encoded at 1fps comes out as a 1.0s video, and that worked too.
The failure seems closer to:
if the video is too short or too frame-poor for Gemini's video ingestion path, Gemini may not ingest visual video content at all.
And if that happens, it may still produce a very fluent summary. Just not of your video. Hallucination is never truly solved.
Kinda reminds me of the Nyquist theorem.
Why This Might Happen
Google's docs make this failure mode plausible. Gemini supports videoMetadata.fps, which defaults to 1.0.
For Gemini 3, Vertex docs say video tokenization uses a variable sequence length, and the default video frame cost is around 70 tokens/frame. Older docs describe low-resolution video frames as around 66 tokens/frame.
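If you do send a short clip through the video path, you can at least ask for a sampling rate above the 1.0 default. A hedged sketch with the same SDK; the VideoMetadata spelling is from the version I used, and the file is hypothetical:

```python
from google.genai import types

with open("tiny_clip.mp4", "rb") as f:  # hypothetical file
    video_bytes = f.read()

# Request 5 frames/second instead of the default 1.0, so a sub-second
# clip has a better chance of contributing sampled frames.
part = types.Part(
    inline_data=types.Blob(data=video_bytes, mime_type="video/mp4"),
    video_metadata=types.VideoMetadata(fps=5.0),
)
```

Whether or not you tune this, the usage-metadata check stays; it is the only direct evidence that frames were actually ingested.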
So, roughly, the pipeline probably looks like this:
1. decode video
2. sample frames
3. tokenize sampled frames
4. process audio separately
5. feed text + video/image/audio tokens to the model
For normal videos, this is fine. For tiny videos, it can get weird.
A plausible bug shape is:
- duration < 1s
- default sampling ~= 1fps
- sample timestamps land on whole-second boundaries
- usable sampled frame count = 0
- no video frames get tokenized
- Gemini only sees text/audio
- ACTUAL PROBLEM --> Gemini hallucinates a plausible video
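To make the zero-frame step concrete, here is a toy model of that hypothetical sampler. Pure speculation about the internals, not the real algorithm:

```python
def sampled_timestamps(duration_s: float, fps: float = 1.0) -> list[float]:
    """Grab frames on a whole-interval grid: t = 1/fps, 2/fps, ..."""
    step = 1.0 / fps
    stamps = []
    t = step
    while t <= duration_s:
        stamps.append(t)
        t += step
    return stamps

print(sampled_timestamps(0.93))  # [] -> zero frames reach the tokenizer
print(sampled_timestamps(2.00))  # [1.0, 2.0] -> two frames, ~66 tokens each
```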
That bug shape matches what we saw:
- bad raw .mov calls had TEXT or AUDIO tokens, but no VIDEO tokens
- short normalized clips below roughly 0.5s often had no VIDEO tokens
- once Gemini reported VIDEO:66 or VIDEO:132, the summaries became correct
- the same visual content sent as a JPEG had IMAGE tokens and produced correct descriptions
The production iPhone .mov files were especially interesting. They were close to one second long, but ffprobe showed them as effectively one visual frame. Gemini treated them like audio/text inputs, not video inputs.
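For reference, this is the sort of probe that exposed it: a thin Python wrapper around the real ffprobe CLI. The JSON fields are standard ffprobe output; the filename is made up:

```python
import json
import subprocess

def probe_video(path: str) -> tuple[float, int]:
    """Return (duration_s, decoded_frame_count) for the first video stream."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-count_frames",
         "-show_entries", "stream=duration,nb_read_frames",
         "-of", "json", path],
        check=True, capture_output=True, text=True,
    ).stdout
    stream = json.loads(out)["streams"][0]
    return float(stream["duration"]), int(stream["nb_read_frames"])

duration_s, frames = probe_video("IMG_0001.mov")  # hypothetical filename
print(f"{duration_s:.2f}s, {frames} decoded frame(s)")
```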
Google should add errors or guardrails around this. For now, the best defense is to verify that Gemini actually reports VIDEO tokens in its usage metadata.
Images Are Better Here
For tiny videos, images are a better representation. Images skip the video preprocessing path: there is no frame sampling decision; the model just receives image tokens. If the "video" is really one or two meaningful frames, sending those frames as images is more faithful than sending a tiny video and hoping Gemini samples it correctly.
What To Do Instead
For very short videos, I would not send the raw video to Gemini.
Better options:
- Extract representative frames and send them as images (sketched after this list).
- Or normalize/pad the video into a boring H.264 MP4 with enough frames. Slow.
- Check Gemini usage metadata. If a video request has no VIDEO tokens, do not trust the output.
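A sketch of the first option, shelling out to the real ffmpeg CLI and reusing the SDK setup from earlier. Filenames and the choice of the first frame are mine:

```python
import subprocess
from google import genai
from google.genai import types

def first_frame_jpeg(video_path: str) -> bytes:
    """Decode the first frame of the clip to JPEG bytes via ffmpeg."""
    return subprocess.run(
        ["ffmpeg", "-v", "error", "-i", video_path,
         "-frames:v", "1", "-f", "image2pipe", "-c:v", "mjpeg", "pipe:1"],
        check=True, capture_output=True,
    ).stdout

client = genai.Client()
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        types.Part.from_bytes(data=first_frame_jpeg("tiny_clip.mov"),
                              mime_type="image/jpeg"),
        "Describe this frame.",
    ],
)
print(response.text)
```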
For our pipeline, the simplest rule is probably:
- normal videos: send as video
- tiny / few-frame videos: send as image(s)
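As one function, with deliberately conservative thresholds read off our table rather than any documented limit:

```python
MIN_DURATION_S = 1.0  # our guesses, not API-documented limits
MIN_FRAMES = 3

def send_as(duration_s: float, frame_count: int) -> str:
    """Decide the representation for a clip: 'video' or 'image'."""
    if duration_s < MIN_DURATION_S or frame_count < MIN_FRAMES:
        return "image"  # extract frame(s) and send IMAGE parts
    return "video"      # normal path; still verify VIDEO tokens come back
```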
Sometimes the smartest thing you can do with a tiny video is pretend it is an image. I mean, it is an image? Duhh