I am working on a harness for a video editor. A lot of the work is just poking at weird edge cases and learning what breaks.
This one started when the video-editing agent began describing puppies in videos from my work. I did not recall recording any puppies, so I had to dig deeper.
The Setup
We had some uploaded videos that were around one second long. They were iPhone .mov files, and in practice they behaved like one-frame-ish videos.
When we sent them to Gemini as videos, the summaries were nonsense:
- a puppy running on grass
- a car brake pad tutorial
- a laptop repair tutorial
- cooking videos
But when I extracted a frame from the same videos and sent it as an image, Gemini described it correctly.
So the problem was not that the scene was hard. The problem was the representation.
The Interesting Signal
Gemini returns usage metadata. In the bad cases, the video requests had no VIDEO tokens.
That is the whole thing.
If Gemini does not ingest video tokens, then it is not seeing the video. It still returns text, but now the text is just floating in the void.
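Here is roughly the check we ended up with. A minimal sketch using the google-genai Python SDK; the field names (usage_metadata, prompt_tokens_details, modality) match the SDK version I tested, and the model name and file path are just placeholders, so verify against your setup:

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

with open("tiny_clip.mov", "rb") as f:  # hypothetical file
    video = types.Part.from_bytes(data=f.read(), mime_type="video/quicktime")

response = client.models.generate_content(
    model="gemini-2.0-flash",  # any video-capable model
    contents=[video, "Summarize this video."],
)

# prompt_tokens_details breaks the ingested prompt down by modality
# (TEXT, AUDIO, VIDEO, ...). No VIDEO entry means no frames were seen.
details = response.usage_metadata.prompt_tokens_details or []
video_tokens = sum(
    d.token_count or 0
    for d in details
    if d.modality == types.MediaModality.VIDEO
)
if video_tokens == 0:
    raise RuntimeError("no VIDEO tokens ingested; do not trust the summary")
```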
Results
Here are some tests:
| Input | Duration | Frames / FPS | Gemini visual tokens | Result |
|---|---|---|---|---|
| raw iPhone .mov | ~0.93s | effectively 1 frame | no VIDEO tokens | hallucinated |
| raw iPhone .mov | ~0.99s | effectively 1 frame | no VIDEO tokens | hallucinated |
| MP4 | 0.20s | 1 frame @ 5fps | no VIDEO tokens | hallucinated |
| MP4 | 0.33s | 1 frame @ 3fps | no VIDEO tokens | hallucinated |
| MP4 | 0.40s | 2 frames @ 5fps | no VIDEO tokens | hallucinated |
| MP4 | 0.50s | 1 frame @ 2fps | VIDEO tokens | correct |
| MP4 | 0.53s | 8 frames @ 15fps | VIDEO tokens | correct |
| MP4 | 0.60s | 3 frames @ 5fps | VIDEO tokens | correct |
| MP4 | 0.90s | 27 frames @ 30fps | VIDEO tokens | correct |
| MP4 | 1.00s | 1 frame @ 1fps | VIDEO tokens | correct |
| MP4 | 2.00s | 2 frames @ 1fps | VIDEO tokens | correct |
| JPEG frame | single image | extracted from a bad video | IMAGE tokens | correct |
The exact boundary is not simply "less than one second". A 0.6s MP4 with enough frames worked, while a 0.4s MP4 often failed. A one-frame video encoded at 1fps comes out as a 1.0s video, and that worked too.
The failure seems closer to:
if the video is too short or too frame-poor for Gemini's video ingestion path, Gemini may not ingest visual video content at all.
And if that happens, it may still produce a very fluent summary. Just not of your video. Hallucination is never truly solved.
Kinda reminds me of the Nyquist theorem.
Why This Might Happen
Google's docs make this failure mode plausible. Gemini supports videoMetadata.fps, which defaults to 1.0.
For Gemini 3, Vertex docs say video tokenization uses a variable sequence length, and the default video frame cost is around 70 tokens/frame. Older docs describe low-resolution video frames as around 66 tokens/frame.
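If you do send a short clip through the video path, you can at least ask for a sampling rate above the 1.0 default. A hedged sketch with the same SDK; the VideoMetadata spelling is from the version I used, and the file is hypothetical:

```python
from google.genai import types

with open("tiny_clip.mp4", "rb") as f:  # hypothetical file
    video_bytes = f.read()

# Request 5 frames/second instead of the default 1.0, so a sub-second
# clip has a better chance of contributing sampled frames.
part = types.Part(
    inline_data=types.Blob(data=video_bytes, mime_type="video/mp4"),
    video_metadata=types.VideoMetadata(fps=5.0),
)
```

Whether or not you tune this, the usage-metadata check stays; it is the only direct evidence that frames were actually ingested.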
So, roughly, the pipeline probably looks like this:
1. decode video
2. sample frames
3. tokenize sampled frames
4. process audio separately
5. feed text + video/image/audio tokens to the model
For normal videos, this is fine. For tiny videos, it can get weird.
A plausible bug shape is:
- duration < 1s
- default sampling ~= 1fps
- sample timestamps land on whole-second boundaries
- usable sampled frame count = 0
- no video frames get tokenized
- Gemini only sees text/audio
- ACTUAL PROBLEM --> Gemini hallucinates a plausible video
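To make the zero-frame step concrete, here is a toy model of that hypothetical sampler. Pure speculation about the internals, not the real algorithm:

```python
def sampled_timestamps(duration_s: float, fps: float = 1.0) -> list[float]:
    """Grab frames on a whole-interval grid: t = 1/fps, 2/fps, ..."""
    step = 1.0 / fps
    stamps = []
    t = step
    while t <= duration_s:
        stamps.append(t)
        t += step
    return stamps

print(sampled_timestamps(0.93))  # [] -> zero frames reach the tokenizer
print(sampled_timestamps(2.00))  # [1.0, 2.0] -> two frames, ~66 tokens each
```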
That bug shape matches what we saw:
- bad raw .mov calls had TEXT or AUDIO tokens, but no VIDEO tokens
- short normalized clips below roughly 0.5s often had no VIDEO tokens
- once Gemini reported VIDEO:66 or VIDEO:132, the summaries became correct
- the same visual content sent as a JPEG had IMAGE tokens and produced correct descriptions
The production iPhone .mov files were especially interesting. They were close to one second long, but ffprobe showed them as effectively one visual frame. Gemini treated them like audio/text inputs, not video inputs.
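For reference, this is the sort of probe that exposed it: a thin Python wrapper around the real ffprobe CLI. The JSON fields are standard ffprobe output; the filename is made up:

```python
import json
import subprocess

def probe_video(path: str) -> tuple[float, int]:
    """Return (duration_s, decoded_frame_count) for the first video stream."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-count_frames",
         "-show_entries", "stream=duration,nb_read_frames",
         "-of", "json", path],
        check=True, capture_output=True, text=True,
    ).stdout
    stream = json.loads(out)["streams"][0]
    return float(stream["duration"]), int(stream["nb_read_frames"])

duration_s, frames = probe_video("IMG_0001.mov")  # hypothetical filename
print(f"{duration_s:.2f}s, {frames} decoded frame(s)")
```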
Google should add errors or guardrails around this. For now, the best defense is to verify that Gemini actually reports VIDEO tokens in its usage metadata.
Images Are Better Here
For tiny videos, images are a better representation. Images skip the video preprocessing path: there is no frame sampling decision; the model just receives image tokens. If the "video" is really one or two meaningful frames, sending those frames as images is more faithful than sending a tiny video and hoping Gemini samples it correctly.
What To Do Instead
For very short videos, I would not send the raw video to Gemini.
Better options:
- Extract representative frames and send them as images (sketched after this list).
- Or normalize/pad the video into a boring H.264 MP4 with enough frames. Slow.
- Check Gemini usage metadata. If a video request has no VIDEO tokens, do not trust the output.
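A sketch of the first option, shelling out to the real ffmpeg CLI and reusing the SDK setup from earlier. Filenames and the choice of the first frame are mine:

```python
import subprocess
from google import genai
from google.genai import types

def first_frame_jpeg(video_path: str) -> bytes:
    """Decode the first frame of the clip to JPEG bytes via ffmpeg."""
    return subprocess.run(
        ["ffmpeg", "-v", "error", "-i", video_path,
         "-frames:v", "1", "-f", "image2pipe", "-c:v", "mjpeg", "pipe:1"],
        check=True, capture_output=True,
    ).stdout

client = genai.Client()
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        types.Part.from_bytes(data=first_frame_jpeg("tiny_clip.mov"),
                              mime_type="image/jpeg"),
        "Describe this frame.",
    ],
)
print(response.text)
```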
For our pipeline, the simplest rule is probably:
- normal videos: send as video
- tiny / few-frame videos: send as image(s)
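As one function, with deliberately conservative thresholds read off our table rather than any documented limit:

```python
MIN_DURATION_S = 1.0  # our guesses, not API-documented limits
MIN_FRAMES = 3

def send_as(duration_s: float, frame_count: int) -> str:
    """Decide the representation for a clip: 'video' or 'image'."""
    if duration_s < MIN_DURATION_S or frame_count < MIN_FRAMES:
        return "image"  # extract frame(s) and send IMAGE parts
    return "video"      # normal path; still verify VIDEO tokens come back
```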
Sometimes the smartest thing you can do with a tiny video is pretend it is an image. I mean, it is an image? Duhh