Happy Horse 1.0

by Alibaba · released 2026-04

Alibaba's #1-ranked video model — joint audio-video generation, native multilingual lip-sync, and the largest Elo lead in Artificial Analysis Video Arena history.

When should you use Happy Horse 1.0?

Use Happy Horse 1.0 when joint audio-video and multilingual lip-sync matter: it syncs natively across Mandarin, Cantonese, English, Japanese, Korean, German, and French in one 3-15s pass at 720p or 1080p. It's #1 on the Artificial Analysis Video Arena with a 107-Elo lead. Pick Veo 3.1 for stricter adherence to camera and lens directions; pick Seedance 2.0 for 12-asset references or 2K output.

TL;DR — Happy Horse 1.0 wins when joint audio-video and native lip-sync matter — it's #1 on the Artificial Analysis arena with a 107-Elo lead, the biggest gap that leaderboard has ever seen.

Specs

Clip length	3s to 15s
Resolution	1080p or 720p
Aspect ratios	16:9, 9:16, 1:1, 4:3, 3:4
Native audio	Yes — joint audio-video, synchronized in one pass
Modes	Text-to-video, image-to-video, reference-to-video, video editing
Reference inputs	Up to 5 reference images (video-editing mode)
Pricing on fal	$0.14/sec at 720p, $0.28/sec at 1080p
Access	fal.ai (official API partner from launch), aggregators (ShortsFast)
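
The limits in the table above can be checked before a request is sent. The following is a minimal sketch, not fal's API: the `ClipRequest` class and its field names are hypothetical, and only the numeric limits come from the spec table.

```python
from dataclasses import dataclass

# Limits copied from the spec table; the class and field names are hypothetical.
ALLOWED_RESOLUTIONS = {"720p", "1080p"}
ALLOWED_ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}
MIN_SECONDS, MAX_SECONDS = 3, 15
MAX_REFERENCE_IMAGES = 5


@dataclass
class ClipRequest:
    duration_s: int
    resolution: str
    aspect_ratio: str
    reference_images: int = 0


def validate(req: ClipRequest) -> list[str]:
    """Return a list of problems; an empty list means the request fits the specs."""
    problems = []
    if not MIN_SECONDS <= req.duration_s <= MAX_SECONDS:
        problems.append(f"duration {req.duration_s}s is outside {MIN_SECONDS}-{MAX_SECONDS}s")
    if req.resolution not in ALLOWED_RESOLUTIONS:
        problems.append(f"resolution {req.resolution!r} is not 720p or 1080p")
    if req.aspect_ratio not in ALLOWED_ASPECT_RATIOS:
        problems.append(f"aspect ratio {req.aspect_ratio!r} is not supported")
    if req.reference_images > MAX_REFERENCE_IMAGES:
        problems.append(f"{req.reference_images} reference images exceeds the cap of {MAX_REFERENCE_IMAGES}")
    return problems


print(validate(ClipRequest(duration_s=8, resolution="1080p", aspect_ratio="9:16")))
```

Returning a list of problems instead of raising on the first one lets you report every out-of-range setting in a single pass.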

Best for

• Multilingual lip-sync — native sync across Mandarin, Cantonese, English, Japanese, Korean, German, and French in a single generation
• Joint audio-video shots where dialogue, ambient sound, and Foley all need to land in one pass
• Mid-length narrative clips up to 15 seconds where motion coherence and physics realism matter

Weak at

• Hard camera-grammar adherence — Veo 3.1 still follows specific lens / lighting directions more literally
• Reference-heavy compositions — capped at 5 reference images vs Seedance 2.0's 12-asset window
• Frame-rate control — fal does not currently expose a frame-rate flag; default output only

Prompt structure

Subject — clear noun phrase, one defining attribute
Action — single verb-driven beat with a motion endpoint
Environment — 3 concrete elements, no adjective dump
Camera — one shot size + one move
Lighting — direction + quality + color temperature
Audio — dialogue in quotes (lip-sync language matters), ambient bed described separately
Style — film stock, era, or director reference
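
If you generate prompts programmatically, the seven-part structure above can be enforced with a small helper. This is a sketch under assumptions: `build_prompt` and its keyword names are hypothetical and are not part of any Happy Horse or fal SDK. The example values are taken from the multilingual UGC recipe.

```python
# Field order follows the seven-part structure above.
FIELD_ORDER = ("subject", "action", "environment", "camera", "lighting", "audio", "style")


def build_prompt(**fields: str) -> str:
    """Join the seven prompt parts in a fixed order, refusing missing or unknown parts."""
    missing = [name for name in FIELD_ORDER if not fields.get(name, "").strip()]
    unknown = [name for name in fields if name not in FIELD_ORDER]
    if missing or unknown:
        raise ValueError(f"missing: {missing}, unknown: {unknown}")
    sentences = []
    for name in FIELD_ORDER:
        text = fields[name].strip()
        # Add a period unless the part already ends a sentence or a quoted line.
        if not text.endswith((".", "!", "?", '"')):
            text += "."
        sentences.append(text)
    return " ".join(sentences)


prompt = build_prompt(
    subject="A 28-year-old woman holds a matte-black coffee mug",
    action='She tilts the mug toward camera and says in Japanese, "10秒でできた."',
    environment="A sunlit Tokyo cafe",
    camera="Medium close-up, 35mm lens, slow push-in",
    lighting="Soft window light from camera-left, warm 4500K",
    audio="Audio: faint espresso machine, distant street traffic",
    style="Style: 2020s Apple ad, shallow depth of field, 9:16 vertical, 1080p",
)
print(prompt)
```

Failing loudly on a missing part is deliberate: a prompt without an audio line still generates, but it leaves the dialogue language and the ambient bed to chance.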

Paste-ready recipes

Multilingual UGC ad (8s, 1080p)

                A 28-year-old woman in a sunlit Tokyo cafe holds a matte-black coffee mug. She tilts the mug toward camera and says in Japanese, "10秒でできた." Medium close-up, 35mm lens, slow push-in. Soft window light from camera-left, warm 4500K. Audio: faint espresso machine, distant street traffic. Style: 2020s Apple ad, shallow depth of field, 9:16 vertical, 1080p.

Note: Happy Horse lip-syncs Japanese natively — Veo 3.1 / Sora 2 still struggle here. Quote dialogue verbatim in the target language.

Cinematic dialogue scene (12s, 1080p)

                Two friends in their late 20s sit on a Berlin rooftop at golden hour. Beat 1 (0-6s): the first laughs and says in German, "Du hast es wirklich gelauncht." Beat 2 (6-12s): the second smirks and replies, "Tag sechs." Locked medium two-shot, 50mm lens. Warm 3500K key from camera-right, cool fill from sky. Audio: both voices clear, faint city traffic below, no music. Style: indie short film, 16:9, 1080p.

Reference-to-video product shot (6s, 1080p)

                Reference image_1: shoe_packshot_front.png. Animate: the shoe rotates slowly on a glossy black turntable, a soft spotlight pans across it. Locked macro, 100mm lens. Hard rim light from behind, soft fill from front, deep black background. Audio: subtle whoosh on the spotlight pan. Style: sneaker drop ad, 1:1, 1080p.

Multilingual narration with B-roll (15s, 1080p)

                Beat 1 (0-5s): a chef plates a bowl of ramen, steam rising in a low-key-lit ramen bar. Beat 2 (5-10s): close-up on the noodles being lifted with chopsticks. Beat 3 (10-15s): the chef looks up and says in Mandarin, "下次再来." Camera: handheld documentary, 35mm. Lighting: low warm tungsten with neon kicker. Audio: ambient slurping, kitchen clatter, then the chef's line over the top. Style: Chef's Table, 16:9, 1080p.
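
The multi-beat recipes above split the clip into equal time windows ("Beat 1 (0-6s)", "Beat 2 (6-12s)"). A small helper can generate those labels and keep the total inside the 3-15s range. This is an illustrative sketch: the function names are hypothetical, and the equal-length split is an assumption taken from the recipes, not a documented model requirement.

```python
def beat_windows(total_s: int, beats: int) -> list[tuple[int, int]]:
    """Split a clip into equal whole-second windows, one per beat."""
    if not 3 <= total_s <= 15:
        raise ValueError("clip length must be 3-15 seconds")
    if beats < 1 or total_s % beats != 0:
        raise ValueError("beats must divide the clip length evenly")
    step = total_s // beats
    return [(i * step, (i + 1) * step) for i in range(beats)]


def label_beats(total_s: int, lines: list[str]) -> str:
    """Prefix each beat description with its 'Beat n (start-ends):' label."""
    windows = beat_windows(total_s, len(lines))
    return " ".join(
        f"Beat {n} ({start}-{end}s): {line}"
        for n, ((start, end), line) in enumerate(zip(windows, lines), start=1)
    )


print(label_beats(12, [
    'the first laughs and says in German, "Du hast es wirklich gelauncht."',
    'the second smirks and replies, "Tag sechs."',
]))
```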

FAQ

What makes Happy Horse 1.0 the #1 model on Artificial Analysis?

Joint audio-video generation in a single pass, multilingual lip-sync across seven languages, and physics realism that human raters prefer ~65% of the time in head-to-head blind matchups. The published Elo ratings are 1381 without audio and 1238 with audio — both #1 on the leaderboard, with a 107-point gap to second place that's the largest in the leaderboard's history.
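
The 107-point gap and the roughly 65% preference rate are consistent with each other under the standard Elo expected-score formula. The sketch below assumes the conventional logistic curve with a 400-point scale; this page's sources do not state how the leaderboard computes its ratings, so treat it as a sanity check rather than the arena's method.

```python
def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected win rate of the higher-rated side under the standard Elo formula.

    Assumes the conventional base-10 logistic curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


win_rate = elo_expected_score(107)
print(f"{win_rate:.1%}")  # prints 64.9%
```

A 107-point lead therefore implies the leader wins about 65 of every 100 blind matchups against the runner-up, which lines up with the preference rate quoted above.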

How does Happy Horse 1.0 pricing on fal compare to Seedance 2.0 and Veo 3.1?

On fal, Happy Horse 1.0 is $0.14/sec at 720p and $0.28/sec at 1080p, which puts it in the mid-to-upper tier — roughly between Seedance 2.0 Fast and Veo 3.1 Fast. A 10-second 1080p clip with audio is $2.80 in raw API cost.
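
The per-clip cost is simply the per-second rate multiplied by the clip length. A minimal sketch using the two fal rates quoted above follows; the function name is hypothetical, and the rates may change, so check fal's current pricing before budgeting.

```python
from decimal import Decimal

# Per-second rates quoted above for fal; Decimal avoids float rounding on money.
PRICE_PER_SECOND = {"720p": Decimal("0.14"), "1080p": Decimal("0.28")}


def clip_cost(duration_s: int, resolution: str) -> Decimal:
    """Raw API cost in USD for one clip of the given length and resolution."""
    if not 3 <= duration_s <= 15:
        raise ValueError("clip length must be 3-15 seconds")
    if resolution not in PRICE_PER_SECOND:
        raise ValueError("resolution must be '720p' or '1080p'")
    return PRICE_PER_SECOND[resolution] * duration_s


print(clip_cost(10, "1080p"))  # prints 2.80
print(clip_cost(15, "720p"))   # prints 2.10
```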

Which languages does Happy Horse 1.0 lip-sync natively?

Mandarin, Cantonese, English, Japanese, Korean, German, and French. Quote the dialogue in the target language inside the prompt and the model lip-syncs it without a separate sync pass — this is a hard advantage over Veo 3.1, Sora 2, and Seedance 2.0 outside English.

Happy Horse 1.0 vs Seedance 2.0 — which one?

Pick Happy Horse when joint audio, multilingual lip-sync, and arena-leading physics matter. Pick Seedance 2.0 when you need 12-asset multi-reference compositions or 2K output. Both are audio-native; Happy Horse is the leaderboard #1 today, and Seedance is #2 with the wider reference window.

Use Happy Horse 1.0 without the per-model subscription

ShortsFast bundles Happy Horse 1.0 with every other frontier model under one flat $20/mo plan.

Last updated 2026-04-28. ShortsFast has no affiliation with Alibaba. Specs are compiled from the vendor's public documentation and verified against primary sources on the date above.