I am working on a harness for a video editor. A lot of the work is just poking at weird edge cases and learning what breaks.

This one started when the video editor agent began describing puppies in videos from my work. I did not recall recording any puppies, so I had to dig deeper.

The Setup

We had some uploaded videos that were around one second long. They were iPhone .mov files, and in practice they behaved like one-frame-ish videos.

When we sent them to Gemini as videos, the summaries were nonsense.

But when I extracted a frame from the same videos and sent it as an image, Gemini described it correctly.

So the problem was not that the scene was hard. The problem was the representation.

The Interesting Signal

Gemini returns usage metadata. In the bad cases, the video requests had no VIDEO tokens.

That is the whole thing.

If Gemini does not ingest video tokens, then it is not seeing the video. It still returns text, but now the text is just floating in the void.

Results

Here are some tests:

| Input | Duration | Frames / FPS | Gemini visual tokens | Result |
|---|---|---|---|---|
| raw iPhone .mov | ~0.93s | effectively 1 frame | no VIDEO tokens | hallucinated |
| raw iPhone .mov | ~0.99s | effectively 1 frame | no VIDEO tokens | hallucinated |
| MP4 | 0.20s | 1 frame @ 5fps | no VIDEO tokens | hallucinated |
| MP4 | 0.33s | 1 frame @ 3fps | no VIDEO tokens | hallucinated |
| MP4 | 0.40s | 2 frames @ 5fps | no VIDEO tokens | hallucinated |
| MP4 | 0.50s | 1 frame @ 2fps | VIDEO tokens | correct |
| MP4 | 0.53s | 8 frames @ 15fps | VIDEO tokens | correct |
| MP4 | 0.60s | 3 frames @ 5fps | VIDEO tokens | correct |
| MP4 | 0.90s | 27 frames @ 30fps | VIDEO tokens | correct |
| MP4 | 1.00s | 1 frame @ 1fps | VIDEO tokens | correct |
| MP4 | 2.00s | 2 frames @ 1fps | VIDEO tokens | correct |
| JPEG frame | n/a | single image extracted from a bad video | IMAGE tokens | correct |

The exact boundary is not simply "less than one second": a 0.6s MP4 with enough frames worked, while a 0.4s MP4 often failed. A video with one valid frame at 1fps has a duration of one second, and that worked too.

The failure seems closer to:

if the video is too short or too frame-poor for Gemini's video ingestion path, Gemini may not ingest visual video content at all.

And if that happens, it may still produce a very fluent summary. Just not of your video. Hallucination is never truly solved.

Kinda reminds me of the Nyquist theorem: sample too sparsely and you do not just degrade the signal, you lose it entirely.

Why This Might Happen

Google's docs make this failure mode plausible. Gemini supports videoMetadata.fps, which defaults to 1.0.

For Gemini 3, Vertex docs say video tokenization uses a variable sequence length, and the default video frame cost is around 70 tokens/frame. Older docs describe low-resolution video frames as around 66 tokens/frame.
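As a sanity check on those numbers, here is a back-of-envelope token estimate. The ~70 tokens/frame figure and the 1.0 fps default come from the docs cited above; the floor-based sampling model is my assumption, not documented behavior:

```python
import math

def estimated_video_tokens(duration_s: float, fps: float = 1.0,
                           tokens_per_frame: int = 70) -> int:
    # Back-of-envelope: number of sampled frames times an approximate
    # per-frame cost. The floor() sampling model is an assumption.
    frames = math.floor(duration_s * fps)
    return frames * tokens_per_frame

print(estimated_video_tokens(10.0))  # 700 tokens for a 10s clip at 1fps
print(estimated_video_tokens(0.93))  # 0: a sub-second clip may sample no frames at all
```

Note how the estimate bottoms out at zero for sub-second clips at the default fps, which is exactly the failure mode this post is about.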

So, roughly, the pipeline probably looks like this:

1. decode video
2. sample frames
3. tokenize sampled frames
4. process audio separately
5. feed text + video/image/audio tokens to the model

For normal videos, this is fine. For tiny videos, it can get weird.

A plausible bug shape is:

- duration < 1s
- default sampling ~= 1fps
- sample timestamps land on whole-second boundaries
- usable sampled frame count = 0
- no video frames get tokenized
- Gemini only sees text/audio
- ACTUAL PROBLEM --> Gemini hallucinates a plausible video
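The bug shape above can be sketched as a toy sampler. This only models the default ~1fps case with samples on whole-second boundaries; it is speculation, and it deliberately does not try to explain every row in the results table:

```python
def sampled_timestamps(duration_s: float, fps: float = 1.0,
                       first_sample_s: float = 1.0) -> list[float]:
    # Speculative sampler: one frame every 1/fps seconds, with the first
    # sample landing on a whole-second boundary (the assumption above).
    times = []
    t = first_sample_s
    while t <= duration_s:
        times.append(t)
        t += 1.0 / fps
    return times

print(sampled_timestamps(0.93))  # [] -> zero frames, nothing to tokenize
print(sampled_timestamps(2.00))  # [1.0, 2.0] -> two frames, video tokens appear
```

If the sampler comes back empty, step 3 of the pipeline has nothing to tokenize, and the model never sees the video.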

That matches what we saw:

The production iPhone .mov files were especially interesting. They were close to one second long, but ffprobe showed them as effectively one visual frame. Gemini treated them like audio/text inputs, not video inputs.
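The ffprobe check is easy to script. A small parser for `ffprobe -v quiet -show_streams -of json <file>` output, with a sample payload inline (the sample values are illustrative, not from the actual production files):

```python
import json

def visual_frame_count(ffprobe_output: str) -> int:
    # ffprobe's JSON writer reports nb_frames as a string, and it can be
    # missing entirely for some containers, so default it to "0".
    info = json.loads(ffprobe_output)
    for stream in info.get("streams", []):
        if stream.get("codec_type") == "video":
            return int(stream.get("nb_frames", "0"))
    return 0

sample = '{"streams": [{"codec_type": "video", "nb_frames": "1", "duration": "0.933333"}]}'
print(visual_frame_count(sample))  # 1 -> effectively a single visual frame
```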

Google should add errors or guardrails around this.

For now, the best defense is to verify that Gemini's response actually contains VIDEO tokens in its usage metadata.
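A minimal sketch of that check, assuming the REST-style `usageMetadata.promptTokensDetails` shape (a list of per-modality token counts; adapt the field names to whatever your SDK returns):

```python
def has_video_tokens(usage_metadata: dict) -> bool:
    # promptTokensDetails is a list of per-modality token counts in the
    # Gemini REST response; the exact casing here is an assumption.
    for entry in usage_metadata.get("promptTokensDetails", []):
        if entry.get("modality") == "VIDEO" and entry.get("tokenCount", 0) > 0:
            return True
    return False

good = {"promptTokensDetails": [{"modality": "VIDEO", "tokenCount": 1843}]}
bad  = {"promptTokensDetails": [{"modality": "TEXT", "tokenCount": 42}]}
print(has_video_tokens(good))  # True
print(has_video_tokens(bad))   # False -> do not trust the summary
```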

Images Are Better Here

For tiny videos, images are a better representation. Images skip the video preprocessing path entirely: there is no frame sampling decision, and the model just receives image tokens. If the "video" is really one or two meaningful frames, sending those frames as images is more faithful than sending a tiny video and hoping Gemini samples it correctly.

What To Do Instead

For very short videos, I would not send the raw video to Gemini.

Better options:

  1. Extract representative frames and send them as images.
  2. Normalize/pad the video into a boring H.264 MP4 with enough frames. Slow.
  3. Check Gemini usage metadata. If a video request has no VIDEO tokens, do not trust the output.

For our pipeline, the simplest rule is probably:

Sometimes the smartest thing you can do with a tiny video is pretend it is an image. I mean, it is an image? Duhh
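That rule can be written down as one hypothetical heuristic. The thresholds below are my guesses based on the results table, tuned conservatively: "image" is the safe fallback whenever the video path might silently drop frames:

```python
def representation_for(duration_s: float, frame_count: int) -> str:
    # Hypothetical thresholds: under ~1s, or with only a frame or two,
    # the video ingestion path proved unreliable in our tests, so fall
    # back to extracting frames and sending them as images.
    if duration_s < 1.0 or frame_count <= 2:
        return "image"
    return "video"

print(representation_for(0.93, 1))   # image
print(representation_for(2.00, 60))  # video
```

Even with this in place, keep the usage-metadata check as a backstop: if a video request comes back with no VIDEO tokens, discard the summary.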