Lesson 11 · Grok Mastery · Pro+ · ~10 min read · Video with native audio

Imagine video: clips with sound, stitched into stories.

The 2026 jump: Grok Imagine generates 10-second, 720p video with audio built in — ambience, effects, even short dialogue, no separate sound step. And Agent Mode puts it all on an infinite canvas that batch-edits and stitches clips into longer pieces. Here's how to direct it, and where the hard lines are.

01 Prompting moving pictures

The five slots, plus two: motion and sound

Everything from Lesson 5 carries over; video adds the two dimensions stills don't have. Prompt them explicitly or the model improvises both:

Video prompt pattern:

[SUBJECT + SETTING + STYLE + MOOD, as in Lesson 5]
[MOTION: slow push-in toward the storefront; steam rising from the coffee cup; one car passes left to right]
[SOUND: quiet morning ambience, distant birdsong, a single shop bell as the door opens. No music, no narration.]
[CONSTRAINTS: 10 seconds, no people on screen, no text]

Two craft notes that fix most bad clips: one camera move per clip (push, pan, OR static — combinations turn to soup at 10 seconds), and name what the sound should NOT include — unrequested music is the most common surprise.
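If you write many of these prompts, it can help to assemble them from named slots so the two craft notes are enforced rather than remembered. The sketch below is plain string-building in Python and does not call any Grok API; the `VideoPrompt` class, its field names, and the sample scene text are all hypothetical, chosen only to mirror the pattern above.

```python
from dataclasses import dataclass, field

# The three camera options the lesson names; pick exactly one per clip.
CAMERA_MOVES = ("push", "pan", "static")


@dataclass
class VideoPrompt:
    """Hypothetical helper: one field per slot of the prompt pattern."""
    base: str                    # subject + setting + style + mood, as in Lesson 5
    camera_move: str             # a check only, not emitted as prompt text
    motion: list[str]            # camera phrasing first, then subject motion
    sound: list[str]             # what should be heard
    sound_exclusions: list[str]  # what must NOT be heard
    constraints: list[str] = field(default_factory=list)

    def render(self) -> str:
        # Craft note 1: one camera move per clip.
        if self.camera_move not in CAMERA_MOVES:
            raise ValueError(f"pick one camera move from {CAMERA_MOVES}")
        # Craft note 2: always name what the sound should not include.
        if not self.sound_exclusions:
            raise ValueError("name at least one sound exclusion, e.g. 'music'")
        excluded = " ".join(f"No {item}." for item in self.sound_exclusions)
        parts = [
            f"[{self.base}]",
            f"[MOTION: {'; '.join(self.motion)}]",
            f"[SOUND: {', '.join(self.sound)}. {excluded}]",
        ]
        if self.constraints:
            parts.append(f"[CONSTRAINTS: {', '.join(self.constraints)}]")
        return " ".join(parts)


# Illustrative scene text; swap in your own Lesson 5 slots.
prompt = VideoPrompt(
    base="Corner coffee shop storefront, quiet street, 35mm film look, calm",
    camera_move="push",
    motion=["slow push-in toward the storefront", "steam rising from the coffee cup"],
    sound=["quiet morning ambience", "distant birdsong"],
    sound_exclusions=["music", "narration"],
    constraints=["10 seconds", "no people on screen", "no text"],
)
print(prompt.render())
```

The design choice worth copying is the failure mode: a prompt with a combined camera move or no sound exclusions raises an error before it costs you a generation.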

Image-to-video: animate what you already approved

The most reliable path to a usable clip: generate (or upload) a still you like, then animate it — "this image, slow dolly forward, dust motes in the light, room ambience." You've locked composition and style in a cheap medium before spending your scarcer video generations on motion. This still→motion ladder is the workflow Imagine quietly rewards.

02 Agent Mode: the infinite canvas

From clips to sequences

Agent Mode (beta) is a creative agent on an open canvas: it generates batches of stills, applies edits across all of them at once ("make these six frames dusk instead of noon"), animates selected frames into clips, and stitches clips into longer pieces. The workflow that works:

Board it like a film: describe the sequence as numbered shots, one line each — the canvas turns your shot list into draft frames.
Batch-fix style drift: consistency across shots is the hard problem of AI video; the canvas's apply-to-all edits are your main weapon. Re-state the style anchor ("35mm, cold dawn light") in every batch instruction.
Animate only approved frames. Same ladder as before: stills are cheap, clips aren't (10–30 clips per day, depending on tier).
Stitch, then re-watch for continuity — lighting jumps and object teleportation between shots are the tells; regenerate the offending clip, not the sequence.
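The four steps above can be kept honest with a small planning script you run before you open the canvas. This is a sketch under assumptions: it only formats text and counts clips, it does not drive Agent Mode, and `Shot`, `board`, `batch_instruction`, and `clips_to_animate` are hypothetical names. The shot descriptions and the sample edit are invented for illustration; the style anchor, the 10-second clip length, and the low-end daily budget of 10 come from the lesson.

```python
from dataclasses import dataclass

STYLE_ANCHOR = "35mm, cold dawn light"  # the lesson's example anchor
CLIP_SECONDS = 10                        # clip length stated in the lesson


@dataclass
class Shot:
    number: int
    description: str
    approved: bool = False  # flip to True once the still looks right


def board(shots):
    """Step 1: a numbered shot list, one line per shot."""
    return "\n".join(f"{s.number}. {s.description}" for s in shots)


def batch_instruction(edit, anchor=STYLE_ANCHOR):
    """Step 2: restate the style anchor in every apply-to-all edit."""
    return f"{edit}. Keep the style: {anchor}."


def clips_to_animate(shots, daily_clip_budget):
    """Step 3: only approved frames get animated, within the daily budget."""
    approved = [s for s in shots if s.approved]
    if len(approved) > daily_clip_budget:
        raise ValueError(
            f"{len(approved)} approved shots exceed today's budget of {daily_clip_budget}"
        )
    return approved


shots = [
    Shot(1, "Wide: empty street at dawn, shop lights off", approved=True),
    Shot(2, "Medium: shop door, lights flick on", approved=True),
    Shot(3, "Close: steaming cup on the counter"),  # still needs work
]

print(board(shots))
print(batch_instruction("Remove the parked cars from all frames"))
ready = clips_to_animate(shots, daily_clip_budget=10)
print(len(ready) * CLIP_SECONDS, "seconds of footage once stitched")
```

Step 4, the continuity pass, stays manual: lighting jumps and teleporting objects are things you catch by watching, not by counting.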

Agent Mode is beta software from the fastest-shipping company in AI — the canvas you open next month may be rearranged. The shot-list discipline is the part that transfers no matter what the UI does.

03 What 10-second clips are actually for

Honest sizing: this is not a video-production replacement. It is excellent for social clips and ads (where 6–10 seconds is the native length), product motion shots, concept previsualization ("show the client the vibe before we hire a crew"), and B-roll texture. The moment you need 60 seconds of coherent narrative with consistent characters, you're in real-editor territory — Imagine feeds the editor, it doesn't replace it.

04 The lines

Never put a real, identifiable person in generated video: not a colleague as a joke, not a public figure for a post, not a customer for an ad. Moving pictures with audio read as evidence to the human brain; fabricating them with real faces is deepfake territory regardless of intent. Same rule we teach seniors defending against scams, from the other side: don't manufacture what fraudsters manufacture.

Softer but real: label generated video as AI-made wherever a reasonable viewer might assume it's footage (renders of work you haven't done, "site photos," product demos). Platforms increasingly require the disclosure anyway; your reputation requires it first.

Ship one sequence

Board a three-shot, 30-second piece for something real — a product tease, a service explainer opener, a social post. Stills first, animate the approved ones, stitch on the canvas, continuity pass. One evening, one real deliverable, and you'll know exactly what this tool is worth to you.


What you can do now

Prompt video with the seven slots — including explicit motion and sound (and sound exclusions)
Climb the still→motion ladder so scarce video generations land on approved compositions
Run Agent Mode like a director: shot list, batch style anchors, animate approved frames, stitch, continuity pass
Size the tool honestly: social-length clips and previz, not production replacement
Hold the lines: no real faces, label renders, disclose where viewers could mistake it for footage
