Happy Horse 1.0
by Alibaba · released 2026-04
Alibaba's #1-ranked video model — joint audio-video generation, native multilingual lip-sync, and the largest Elo lead in Artificial Analysis Video Arena history.
When should you use Happy Horse 1.0?
Use Happy Horse 1.0 when joint audio-video and multilingual lip-sync matter: it syncs natively across Mandarin, Cantonese, English, Japanese, Korean, German, and French in a single 3-15s pass at 720p or 1080p. It's #1 on the Artificial Analysis Video Arena with a 107-Elo lead. Pick Veo 3.1 when you need stricter adherence to camera directions, or Seedance 2.0 for 12-asset references or 2K output.
TL;DR — Happy Horse 1.0 wins when joint audio-video and native lip-sync matter — it's #1 on the Artificial Analysis arena with a 107-Elo lead, the biggest gap that leaderboard has ever seen.
Specs
| Spec | Value |
| --- | --- |
| Clip length | 3s to 15s |
| Resolution | 720p or 1080p |
| Aspect ratios | 16:9, 9:16, 1:1, 4:3, 3:4 |
| Native audio | Yes (joint audio-video, synchronized in one pass) |
| Modes | Text-to-video, image-to-video, reference-to-video, video editing |
| Reference inputs | Up to 5 reference images (video-editing mode) |
| Pricing on fal | $0.14/sec at 720p, $0.28/sec at 1080p |
| Access | fal.ai (official API partner from launch), aggregators (ShortsFast) |
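The limits in the spec table can be checked before a request is sent. The sketch below is illustrative only: `ClipRequest` and `validate` are hypothetical names, not part of any fal or Alibaba SDK, and the constants are copied from the table above.

```python
from dataclasses import dataclass

# Limits copied from the spec table; all names here are hypothetical.
MIN_SECONDS, MAX_SECONDS = 3, 15
RESOLUTIONS = {"720p", "1080p"}
ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}
MAX_REFERENCE_IMAGES = 5


@dataclass
class ClipRequest:
    duration_s: int
    resolution: str
    aspect_ratio: str
    reference_images: int = 0


def validate(req: ClipRequest) -> list[str]:
    """Return a list of problems; an empty list means the request fits the published limits."""
    problems = []
    if not MIN_SECONDS <= req.duration_s <= MAX_SECONDS:
        problems.append(f"duration {req.duration_s}s is outside {MIN_SECONDS}-{MAX_SECONDS}s")
    if req.resolution not in RESOLUTIONS:
        problems.append(f"resolution {req.resolution!r} is not one of {sorted(RESOLUTIONS)}")
    if req.aspect_ratio not in ASPECT_RATIOS:
        problems.append(f"aspect ratio {req.aspect_ratio!r} is not one of {sorted(ASPECT_RATIOS)}")
    if not 0 <= req.reference_images <= MAX_REFERENCE_IMAGES:
        problems.append(f"{req.reference_images} reference images exceeds the cap of {MAX_REFERENCE_IMAGES}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller show every issue at once, for example when a request written for Seedance 2.0 (12 references, 2K) is pointed at Happy Horse.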
Best for
- Multilingual lip-sync: native sync across Mandarin, Cantonese, English, Japanese, Korean, German, and French in a single generation
- Joint audio-video shots where dialogue, ambient sound, and Foley all need to land in one pass
- Mid-length narrative clips up to 15 seconds where motion coherence and physics realism matter
Weak at
- Hard camera-grammar adherence: Veo 3.1 still follows specific lens / lighting directions more literally
- Reference-heavy compositions: capped at 5 reference images vs Seedance 2.0's 12-asset window
- Frame-rate control: fal does not currently expose a frame-rate flag; default output only
Prompt structure
- Subject — clear noun phrase, one defining attribute
- Action — single verb-driven beat with a motion endpoint
- Environment — 3 concrete elements, no adjective dump
- Camera — one shot size + one move
- Lighting — direction + quality + color temperature
- Audio — dialogue in quotes (lip-sync language matters), ambient bed described separately
- Style — film stock, era, or director reference
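The seven-part structure above can be assembled mechanically, which helps keep a batch of prompts consistent. This is a minimal sketch, not an official template: `build_prompt` and its field names are hypothetical, and the "Audio:" and "Style:" labels simply mirror how the recipes on this page are written.

```python
# Field order follows the prompt-structure list; all names are hypothetical.
FIELD_ORDER = ("subject", "action", "environment", "camera", "lighting", "audio", "style")
LABELLED = {"audio": "Audio", "style": "Style"}


def _sentence(text: str) -> str:
    """Close a fragment with a period unless it already ends a sentence or a quote."""
    text = text.strip()
    return text if text.endswith((".", "!", "?", '"')) else text + "."


def build_prompt(**fields: str) -> str:
    """Join the seven prompt fields in order, labelling the audio and style parts."""
    missing = [name for name in FIELD_ORDER if not fields.get(name, "").strip()]
    if missing:
        raise ValueError(f"missing prompt fields: {', '.join(missing)}")
    parts = []
    for name in FIELD_ORDER:
        sentence = _sentence(fields[name])
        label = LABELLED.get(name)
        parts.append(f"{label}: {sentence}" if label else sentence)
    return " ".join(parts)
```

Dialogue stays inside the action field, quoted verbatim in the target language, so the lip-sync line is never paraphrased by the helper.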
Paste-ready recipes
Multilingual UGC ad (8s, 1080p)
A 28-year-old woman in a sunlit Tokyo cafe holds a matte-black coffee mug. She tilts the mug toward camera and says in Japanese, "10秒でできた." Medium close-up, 35mm lens, slow push-in. Soft window light from camera-left, warm 4500K. Audio: faint espresso machine, distant street traffic. Style: 2020s Apple ad, shallow depth of field, 9:16 vertical, 1080p.
Note: Happy Horse lip-syncs Japanese natively — Veo 3.1 / Sora 2 still struggle here. Quote dialogue verbatim in the target language.
Cinematic dialogue scene (12s, 1080p)
Two friends in their late 20s sit on a Berlin rooftop at golden hour. Beat 1 (0-6s): the first laughs and says in German, "Du hast es wirklich gelauncht." Beat 2 (6-12s): the second smirks and replies, "Tag sechs." Locked medium two-shot, 50mm lens. Warm 5500K key from camera-right, cool fill from sky. Audio: both voices clear, faint city traffic below, no music. Style: indie short film, 16:9, 1080p.
Reference-to-video product shot (6s, 1080p)
Reference image_1: shoe_packshot_front.png. Animate: the shoe rotates slowly on a glossy black turntable, a soft spotlight pans across it. Locked macro, 100mm lens. Hard rim light from behind, soft fill from front, deep black background. Audio: subtle whoosh on the spotlight pan. Style: sneaker drop ad, 1:1, 1080p.
Multilingual narration with B-roll (15s, 1080p)
Beat 1 (0-5s): a chef plates a bowl of ramen, steam rising in a low key-lit ramen bar. Beat 2 (5-10s): close-up on the noodles being lifted with chopsticks. Beat 3 (10-15s): the chef looks up and says in Mandarin, "下次再来." Camera: handheld documentary, 35mm. Lighting: low warm tungsten with neon kicker. Audio: ambient slurping, kitchen clatter, then the chef's line over the top. Style: Chef's Table, 16:9, 1080p.
FAQ
What makes Happy Horse 1.0 the #1 model on Artificial Analysis?
Joint audio-video generation in a single pass, multilingual lip-sync across seven languages, and physics realism that human raters prefer ~65% of the time in head-to-head blind matchups. The published Elo ratings are 1381 without audio and 1238 with audio — both #1 on the leaderboard, with a 107-point gap to second place that's the largest in the leaderboard's history.
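The 107-point gap and the ~65% preference figure are consistent with each other if the arena uses the standard logistic Elo formula on a 400-point scale. That is an assumption, since the exact rating method is not stated here, and the ~65% figure is not explicitly tied to the second-place model. Under that assumption, a quick check looks like this:

```python
def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected win rate of the higher-rated model under the standard Elo logistic curve.

    Assumes the conventional 400-point scale; the arena's actual method may differ.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


win_rate = elo_expected_score(107)
print(f"{win_rate:.3f}")  # prints 0.649
```

So a 107-point lead corresponds to winning roughly 65 of every 100 blind matchups against the runner-up, which is why the gap reads as large.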
How does Happy Horse 1.0 pricing on fal compare to Seedance 2.0 and Veo 3.1?
On fal, Happy Horse 1.0 is $0.14/sec at 720p and $0.28/sec at 1080p, which puts it in the mid-to-upper tier — roughly between Seedance 2.0 Fast and Veo 3.1 Fast. A 10-second 1080p clip with audio is $2.80 in raw API cost.
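The arithmetic behind that figure is just the per-second rate times the clip length. The sketch below assumes billing is linear in output seconds with no minimum charge or rounding, which the rates above imply but do not state; `clip_cost` is a hypothetical helper, not a fal API.

```python
from decimal import Decimal

# Per-second rates quoted above for fal; Decimal avoids float rounding on currency.
RATE_PER_SECOND = {"720p": Decimal("0.14"), "1080p": Decimal("0.28")}


def clip_cost(duration_s: int, resolution: str) -> Decimal:
    """Raw API cost in USD for one clip, assuming linear per-second billing."""
    if not 3 <= duration_s <= 15:
        raise ValueError("clip length must be between 3 and 15 seconds")
    if resolution not in RATE_PER_SECOND:
        raise ValueError(f"unknown resolution: {resolution!r}")
    return RATE_PER_SECOND[resolution] * duration_s


print(clip_cost(10, "1080p"))  # prints 2.80
print(clip_cost(15, "720p"))   # prints 2.10
```

On the same assumption, the 8-second 1080p UGC recipe costs $2.24 per generation, so iterating at 720p halves the cost of each draft.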
Which languages does Happy Horse 1.0 lip-sync natively?
Mandarin, Cantonese, English, Japanese, Korean, German, and French. Quote the dialogue in the target language inside the prompt and the model lip-syncs it without a separate sync pass — this is a hard advantage over Veo 3.1, Sora 2, and Seedance 2.0 outside English.
Happy Horse 1.0 vs Seedance 2.0 — which one?
Pick Happy Horse when joint audio, multilingual lip-sync, and arena-leading physics matter. Pick Seedance 2.0 when you need 12-asset multi-reference compositions or 2K output. Both are audio-native; Happy Horse is the leaderboard #1 today, and Seedance is #2 with the wider reference window.
Use Happy Horse 1.0 without the per-model subscription
ShortsFast bundles Happy Horse 1.0 with every other frontier model under one flat $20/mo plan.
Last updated 2026-04-28. ShortsFast has no affiliation with Alibaba. Specs are compiled from the vendor's public documentation and verified against primary sources on the date above.