Veo 3.1 Prompt Structure: The Cinematic Template (2026)

A no-fluff Veo 3.1 prompt template covering subject, camera, lighting, and synchronized audio. Seven cinematic recipes and the failure modes to avoid.

By ShortsFast Team

Veo 3.1 is the model to reach for when a shot needs to feel cinematic — directable camera, natural depth, and synchronized audio baked into a single generation. It’s also the model that punishes lazy prompts the hardest. Paste “cinematic coffee shop scene” and you get a warped hand pouring an unreal espresso with a weird hum underneath.

This post is the prompt template we use at ShortsFast to get clean, film-grammar Veo 3.1 outputs on the first or second try. Veo 3.1 is one of the four video models we bundle (alongside Sora 2, Kling 2.5 Turbo, and Seedance 2.0), and it’s the default for any shot where audio matters.

Model fact sheet: Veo 3.1 specs, modes, and recipes.

What Veo 3.1 actually is (April 2026)

  • Core model: Google DeepMind’s Veo 3.1, the current production release behind Flow, Gemini, and Vertex AI.
  • Clip length: 4, 6, or 8 seconds per generation. 8 seconds is the ceiling for a single call.
  • Resolution: 720p or 1080p output at 24fps.
  • Aspect ratios: 16:9 or 9:16 native. 9:16 is what you want for TikTok, Reels, and Shorts.
  • Audio: Synchronized dialogue, ambient sound, and music generated inside the same pass. Multi-speaker conversations are on the table.
  • Control handles: Reference images for subject, first/last-frame conditioning, and “Extend” to chain an 8-second clip into a longer sequence.

Sources: Ultimate Prompting Guide for Veo 3.1 — Google Cloud, Veo 3.1 API & Prompting Guide — PiAPI.
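
The limits in the fact sheet are easy to check before a render is queued. Here is a minimal Python sketch, assuming the values listed above. The `VeoSettings` class and its field names are hypothetical helpers for illustration, not part of any Google SDK.

```python
from dataclasses import dataclass

# Values copied from the fact sheet above; adjust if the limits change.
ALLOWED_DURATIONS = {4, 6, 8}            # seconds per generation
ALLOWED_RESOLUTIONS = {"720p", "1080p"}
ALLOWED_ASPECTS = {"16:9", "9:16"}


@dataclass(frozen=True)
class VeoSettings:
    """Hypothetical per-clip settings container (not an SDK type)."""
    duration_s: int
    resolution: str
    aspect_ratio: str


def validate(settings: VeoSettings) -> list[str]:
    """Return a list of problems; an empty list means the settings fit the fact sheet."""
    problems = []
    if settings.duration_s not in ALLOWED_DURATIONS:
        problems.append(f"duration {settings.duration_s}s is not one of {sorted(ALLOWED_DURATIONS)}")
    if settings.resolution not in ALLOWED_RESOLUTIONS:
        problems.append(f"resolution {settings.resolution!r} is not one of {sorted(ALLOWED_RESOLUTIONS)}")
    if settings.aspect_ratio not in ALLOWED_ASPECTS:
        problems.append(f"aspect ratio {settings.aspect_ratio!r} is not one of {sorted(ALLOWED_ASPECTS)}")
    return problems


if __name__ == "__main__":
    # A vertical 8-second clip for Shorts passes; a 10-second request does not.
    print(validate(VeoSettings(8, "1080p", "9:16")))
    print(validate(VeoSettings(10, "1080p", "9:16")))
```

Returning a list of problems rather than raising on the first one lets you report every bad setting in a single pass.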

The Veo 3.1 prompt skeleton

Veo 3.1 is trained on film grammar. Feed it film grammar and it cooperates. Every prompt we ship follows this seven-part structure in roughly this order. Skip a part and Veo fills it in, rarely the way you want.

  1. Subject — a specific noun phrase with distinctive detail. Not “a woman” but “a woman in her late 50s with short silver hair and a charcoal linen blazer.”
  2. Action — precise verb chain with a motion endpoint. “Walks to the window, pauses, then exhales” beats “walking around thinking.”
  3. Environment — 3-4 concrete elements, never an adjective dump. “Industrial kitchen, stainless prep counter, rain on the skylight.”
  4. Camera — one shot and one movement. “Locked 50mm medium,” “handheld slow push-in on a 35mm,” “tracking from the left on a 28mm dolly.” Never chain two moves.
  5. Lighting / mood — direction + quality + emotional word. “Low-key side-light from a window, hard shadows, quiet tension.”
  6. Audio — this is Veo 3.1’s signature lever. Specify dialogue with exact words in quotes, ambient bed, and any sound effect. Skip this and you get random audio that rarely cuts with your other clips.
  7. Style — one or two film references instead of generic adjectives. “Shot on Kodak Vision3 250D” lands better than “cinematic.”

Keep the whole thing between 100 and 150 words. Shorter loses control; longer introduces contradictions the model will surface as visual glitches.
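
The seven-part skeleton and the word budget can be expressed as a small helper. This is a sketch, not a tool Veo or ShortsFast provides: the `VeoPrompt` class, `word_count`, and `in_target_range` are hypothetical names, and the 100 to 150 word range is the guideline stated above.

```python
from dataclasses import dataclass, fields


@dataclass
class VeoPrompt:
    """Hypothetical container for the seven parts, in the order listed above."""
    subject: str
    action: str
    environment: str
    camera: str
    lighting: str
    audio: str
    style: str

    def render(self) -> str:
        """Join the seven parts, in declaration order, into one prompt string."""
        sentences = []
        for f in fields(self):
            text = getattr(self, f.name).strip()
            if not text:
                # A skipped part is exactly what the skeleton is meant to prevent.
                raise ValueError(f"prompt part '{f.name}' is empty")
            if text[-1] not in '.!?"”':
                text += "."
            sentences.append(text)
        return " ".join(sentences)


def word_count(prompt: str) -> int:
    """Count whitespace-separated words."""
    return len(prompt.split())


def in_target_range(prompt: str, low: int = 100, high: int = 150) -> bool:
    """True when the prompt sits inside the suggested word budget."""
    return low <= word_count(prompt) <= high


if __name__ == "__main__":
    prompt = VeoPrompt(
        subject="A woman in her late 50s with short silver hair and a charcoal linen blazer",
        action="Walks to the window, pauses, then exhales",
        environment="Industrial kitchen, stainless prep counter, rain on the skylight",
        camera="Locked 50mm medium",
        lighting="Low-key side-light from a window, hard shadows, quiet tension",
        audio="Audio: rain on glass, soft room tone, no music",
        style="Shot on Kodak Vision3 250D",
    ).render()
    print(prompt)
    print(word_count(prompt), in_target_range(prompt))
```

Raising on an empty part makes the "skip a part and Veo fills it in" failure visible before a generation is spent on it.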

Source: Ultimate Prompting Guide for Veo 3.1 — Google Cloud.

Three rules creators break with Veo 3.1

  • Don’t write dialogue as a description. Write it as dialogue. Veo 3.1 respects quoted speech. A line like “The barista says, "Milk or oat?"” gives a crisp line with lip-sync. A line like “The barista asks the customer about milk” gives mumbled, off-sync audio.
  • Pick one camera move. Veo 3.1 can pan, tilt, dolly, or push — but asking for two in one 8-second clip is the fastest path to a warped transition halfway through. Use Extend to chain two simple shots instead.
  • Don’t overspecify the face on a reference-image prompt. If you upload a subject reference, describe what the subject does — not their eyes, hair, and jawline again. Redescription fights the reference and produces morph.
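
The three rules above lend themselves to a quick pre-flight check. The following is a rough heuristic sketch in Python. The keyword lists and the `lint_prompt` function are assumptions made for illustration, and simple pattern matching will produce false positives (a "frying pan" counts as a pan), so treat its warnings as hints.

```python
import re

# Assumed vocabulary: each key is one camera move, each pattern its common spellings.
# Tracking and dolly are grouped so "tracking ... on a dolly" counts as one move.
MOVE_PATTERNS = {
    "pan": r"\bpan(?:s|ning)?\b",
    "tilt": r"\btilt(?:s|ing)?\b",
    "track": r"\b(?:track(?:s|ing)?|dolly|dollies)\b",
    "push": r"\bpush(?:es|ing)?[- ]in\b",
    "zoom": r"\bzoom(?:s|ing)?\b",
}
# Verbs that suggest someone is speaking (assumed list).
SPEECH_VERBS = re.compile(r"\b(?:says|asks|answers|replies|tells)\b", re.IGNORECASE)
# Straight or curly double quotes around at least one character.
QUOTED_SPEECH = re.compile(r'["“][^"”]+["”]')
# Face details that would fight an uploaded reference image (assumed list).
FACE_WORDS = re.compile(r"\b(?:eyes?|hair|jawline)\b", re.IGNORECASE)


def lint_prompt(prompt: str, uses_reference_image: bool = False) -> list[str]:
    """Return warnings for the three rules; an empty list means nothing was flagged."""
    warnings = []
    if SPEECH_VERBS.search(prompt) and not QUOTED_SPEECH.search(prompt):
        warnings.append("dialogue looks described; put the exact words in quotes")
    moves = sorted(
        name for name, pattern in MOVE_PATTERNS.items()
        if re.search(pattern, prompt, re.IGNORECASE)
    )
    if len(moves) > 1:
        warnings.append(f"multiple camera moves found: {', '.join(moves)}")
    if uses_reference_image and FACE_WORDS.search(prompt):
        warnings.append("face details repeat the reference image; describe the action only")
    return warnings


if __name__ == "__main__":
    print(lint_prompt("The barista asks the customer about milk. Slow pan, then tilt up."))
```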

Seven cinematic Veo 3.1 recipes

Each recipe is a full prompt. Paste, adjust the nouns, ship. (Want all ten recipes plus a JSON download you can pipe into your own scripts? Grab the free Veo 3.1 Cinematic Prompt Pack.)

1 — Talking-head confession (dialogue + sync audio)

A man in his early 40s in a grey hoodie sits alone in a cramped home studio lit only by a monitor. He looks directly into the camera and says, “I shipped three products this year. Two of them failed, and I am so grateful for it.” Locked 50mm medium shot, shallow depth of field. Cool monitor glow from the front, warm key light from a small lamp on the right. Audio: only the subject’s voice and a soft room tone, no music. Shot like a video diary, 8 seconds.

2 — Product reveal with diegetic sound

A matte black coffee grinder sits on a stainless steel counter in an industrial kitchen. Two gloved hands enter frame, pour whole beans into the hopper, and press the button. The grinder whirs for two seconds, then stops. Handheld slow push-in, 35mm feel. Overcast daylight from a large window on the left, soft shadows. Audio: the pour of beans, the grinder whir, distant rain on the skylight, no music. 8 seconds.

3 — Two-person conversation in a bar (multi-speaker)

Two women in their late 30s lean on a dark wooden bar, both holding amber cocktails. The first says, “So what’s your actual plan?” The second laughs softly, then answers, “I don’t have one yet.” Locked medium two-shot, 35mm, shallow focus. Warm tungsten edge light from behind the bar, cool backlight from a street window. Audio: both voices clear, quiet jazz bed, glassware clink. 6 seconds.

4 — Kinetic POV street shot (faceless narrator)

First-person POV. A leather boot steps off a curb into a wet crosswalk in Seoul. The camera rises and holds on a red-and-yellow taxi passing left to right, then tilts up to a neon storefront. Handheld POV, 28mm wide. Overcast cool blue hour, pink neon fill. Audio: wet footsteps, taxi hiss, distant traffic, one far-off horn. No music. 8 seconds.

5 — Intimate close-up with emotional beat

A locked extreme close-up on a woman’s hand holding a folded letter. Her thumb traces the crease twice, then her hand slowly lowers out of frame. Static 85mm macro, shallow depth. Late-afternoon window light from the right, soft golden fall-off. Audio: rustle of paper, slow exhale, faint vinyl crackle, no music. 6 seconds.

6 — Food top-down beauty shot

Overhead locked shot. Two hands lower a round of fresh dough onto a flour-dusted marble counter, press it flat, then sprinkle torn basil across the top. 50mm overhead, faint motion from the hand movement. Warm key from a pendant lamp above, no harsh shadows. Audio: press of dough, rustle of basil, soft kitchen ambience, light piano bed. 8 seconds.

7 — Reference-image continuation (extend workflow)

Using the reference image as the subject: the man walks three steps forward, stops, and glances over his right shoulder, then exits frame right. Tracking handheld 35mm, following from behind. Lighting consistent with the reference (low sun from the left). Audio: gravel crunch, wind in dry grass, one distant bird call. No music. 6 seconds.

Failure modes and their fixes

| Failure | Likely cause | Fix |
| --- | --- | --- |
| Warped hands / morph mid-clip | Two camera moves in one prompt | One move per 8-second generation |
| Mumbled or wrong dialogue | Dialogue described, not quoted | Use direct quotes: says, "..." |
| Mismatched music vibes | No audio direction given | Always specify audio, even “no music, only room tone” |
| Subject drifts off reference | Reference + face redescription | Describe action only when using a reference image |
| Stiff, static feel | No lighting direction | Always specify light direction, quality, and mood |
| Weird aspect crop | Aspect implied, not set | Pick 9:16 for vertical, 16:9 for horizontal explicitly |
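
If you log failed renders, the table above works as a lookup. Below is a small Python sketch that stores the rows as data and matches a free-text observation against the failure names. The `diagnose` function and its word-overlap matching are illustrative assumptions, not a feature of any tool.

```python
import re

# Rows copied from the table above: (failure, likely cause, fix).
FAILURE_MODES = [
    ("Warped hands / morph mid-clip", "Two camera moves in one prompt",
     "One move per 8-second generation"),
    ("Mumbled or wrong dialogue", "Dialogue described, not quoted",
     'Use direct quotes: says, "..."'),
    ("Mismatched music vibes", "No audio direction given",
     "Always specify audio, even 'no music, only room tone'"),
    ("Subject drifts off reference", "Reference + face redescription",
     "Describe action only when using a reference image"),
    ("Stiff, static feel", "No lighting direction",
     "Always specify light direction, quality, and mood"),
    ("Weird aspect crop", "Aspect implied, not set",
     "Pick 9:16 for vertical, 16:9 for horizontal explicitly"),
]

# Short filler words that should not trigger a match.
STOPWORDS = {"a", "an", "and", "in", "is", "mid", "of", "off", "or", "the"}


def _keywords(text: str) -> set[str]:
    """Lowercase alphabetic words in the text, minus filler words."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def diagnose(observation: str) -> list[tuple[str, str, str]]:
    """Return every table row whose failure name shares a keyword with the observation."""
    seen = _keywords(observation)
    return [row for row in FAILURE_MODES if seen & _keywords(row[0])]


if __name__ == "__main__":
    for failure, cause, fix in diagnose("the hands look warped"):
        print(f"{failure}: {cause} -> {fix}")
```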

The Veo 3.1 workflow that actually works

An 8-second limit sounds tight until you internalize the workflow.

  1. Storyboard in beats of eight seconds or less. Write the short as six to ten shots of exactly 4, 6, or 8 seconds each. Do this before you open the generator.
  2. Assign the right model per shot. Veo 3.1 for any shot with dialogue, ambient audio, or camera direction. Kling 2.5 Turbo for kinetic action and start-to-end-frame transitions. Sora 2 for longer talking-head sequences beyond 8 seconds.
  3. Extend when a beat needs more than 8 seconds. Veo 3.1’s Extend generates a new clip that picks up from the end of the previous one, so motion and look carry across the join. Use it for sustained dialogue or slow reveals.
  4. Edit in your cutter of choice. CapCut, Resolve, Premiere — ShortsFast exports clean files you can drop in without fighting a built-in editor.
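
Steps 1 and 2 can be sketched as a storyboard structure with a per-shot model picker. This is a Python illustration under stated assumptions: the `Shot` fields and the order of the rules in `pick_model` are our own simplification, since the workflow above names the criteria but not the tie-breaks.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    """Hypothetical storyboard entry; field names are illustrative."""
    description: str
    seconds: int
    has_dialogue: bool = False
    kinetic_action: bool = False


def pick_model(shot: Shot) -> str:
    """Assign a model per shot, following the routing described in step 2."""
    if shot.seconds > 8 and shot.has_dialogue:
        return "Sora 2"             # longer talking-head sequences beyond 8 seconds
    if shot.kinetic_action and not shot.has_dialogue:
        return "Kling 2.5 Turbo"    # kinetic action and start-to-end-frame transitions
    # Everything else falls through to Veo 3.1 (dialogue, ambient audio, camera direction).
    # A non-dialogue beat over 8 seconds lands here too and would need Extend.
    return "Veo 3.1"


def total_runtime(storyboard: list[Shot]) -> int:
    """Sum of shot lengths in seconds."""
    return sum(shot.seconds for shot in storyboard)


if __name__ == "__main__":
    storyboard = [
        Shot("Talking-head confession", 8, has_dialogue=True),
        Shot("Skateboard jump over stairs", 4, kinetic_action=True),
        Shot("Long monologue to camera", 12, has_dialogue=True),
        Shot("Product reveal with grinder whir", 8),
    ]
    for shot in storyboard:
        print(f"{shot.seconds:>2}s  {pick_model(shot):<16} {shot.description}")
    print("total:", total_runtime(storyboard), "seconds")
```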

FAQ

Can Veo 3.1 generate more than 8 seconds in one call?

No. 8 seconds is the hard ceiling per generation. For longer beats, chain generations with Extend, or cut two generations together in post. Don’t try to fit a 15-second beat into one prompt.
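
As a worked example of the arithmetic, here is a Python sketch that splits a long beat into clip lengths Veo 3.1 accepts. It assumes each segment is its own 4, 6, or 8 second generation to be chained or cut together afterwards, and it does not model how Extend itself sizes its continuation. The `split_beat` function is hypothetical.

```python
import math


def split_beat(total_seconds: float) -> list[int]:
    """Split a beat into 4/6/8-second clips that cover it with the fewest generations.

    Odd or fractional lengths are rounded up to the next even number of seconds,
    so the last clip may need a trim in the edit.
    """
    if total_seconds <= 0:
        raise ValueError("beat length must be positive")
    # Every even total of 4 or more can be written as a sum of 4s, 6s, and 8s.
    remaining = max(4, math.ceil(total_seconds))
    if remaining % 2:
        remaining += 1
    clips = []
    while remaining > 8:
        # Taking an 8 must leave at least 4 seconds for the next clip; otherwise take a 6.
        step = 8 if remaining - 8 >= 4 else 6
        clips.append(step)
        remaining -= step
    clips.append(remaining)
    return clips


if __name__ == "__main__":
    for beat in (15, 10, 8, 3):
        print(beat, "->", split_beat(beat))
```

For the 15-second beat mentioned above, this yields two 8-second generations with one second to trim in post.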

Does Veo 3.1 really generate audio that’s in sync?

Yes, when you direct it. Quoted dialogue gets lip-synced in most outputs; ambient beds sit naturally under the visual. The common failure is not specifying audio at all — Veo will invent something, and it rarely matches your other clips.

Is Veo 3.1 better than Sora 2 for short-form?

For any clip of 8 seconds or less that needs synchronized sound or directable camera, Veo 3.1 is usually the pick. Sora 2 wins when you need a longer continuous shot (up to ~25 seconds) or a specific visual grammar Veo doesn’t handle as well. Pick per shot.

What are the commercial rights on Veo 3.1 outputs?

Outputs from Veo 3.1 via Google’s paid surfaces (Flow, Gemini paid, Vertex AI) carry commercial use rights. If you access Veo 3.1 via ShortsFast on a paid plan, the $20/mo subscription covers commercial use across every bundled model.

Try every Veo 3.1 recipe now

ShortsFast bundles Veo 3.1 with Sora 2, Kling 2.5 Turbo, Seedance 2.0, Nano Banana Pro and Flux Pro Ultra under a single flat $20 monthly plan. Paste any prompt in this post into the generator, pick Veo 3.1 from the model list, and render. If a result doesn’t land in the first two tries, one of the failure modes above almost always explains why.
