HappyHorse 1.0 Just Took #1 on the Video Arena. Here's What's Different.

Alibaba's HappyHorse 1.0 launched on fal April 26 and it's already #1 on Artificial Analysis Video Arena with the largest Elo gap the leaderboard has ever recorded. Specs, pricing, prompt recipes, and how it compares to Veo 3.1, Sora 2, and Seedance 2.0.

By ShortsFast Team

Alibaba’s HappyHorse 1.0 went live on fal as official API partner on April 26, 2026 at 9 PM PST (source). It is currently ranked #1 on the Artificial Analysis Video Arena with 1381 Elo without audio and 1238 Elo with audio — and the gap to second place is 107 Elo points, the largest in that leaderboard’s history. In blind head-to-head matchups, raters prefer HappyHorse’s output roughly 65% of the time.
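As a sanity check on those two numbers, here is a minimal sketch of the textbook Elo expected-score formula. It assumes the arena uses the standard 400-point logistic scale, which this article does not confirm, and the function name is ours. Under that assumption a 107-point lead works out to a win rate close to the roughly 65% preference figure quoted above.

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected win rate of the higher-rated model.

    Assumes the standard Elo logistic curve with a 400-point scale;
    the arena's exact rating method is not stated in the article.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


gap_to_second_place = 107  # Elo lead reported for HappyHorse 1.0
print(f"Expected win rate: {elo_expected_score(gap_to_second_place):.1%}")
```

The reverse also holds: a preference rate near 65% is about what a gap of this size predicts, so the two figures are consistent with each other.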

Three things matter here: it ships joint audio-video in a single pass, it lip-syncs natively across seven languages, and on fal it costs $0.14/second at 720p and $0.28/second at 1080p (source). A 10-second 1080p clip with audio is $2.80 in raw API cost.

Full fact sheet with prompt recipes: HappyHorse 1.0 model page.

What HappyHorse 1.0 actually is

  • Vendor: Alibaba (revealed April 10, 2026 after climbing the leaderboard anonymously as “happyhorse”) (CNBC)
  • Architecture: 15-billion-parameter unified transformer. Text-to-video, image-to-video, reference-to-video, and a video-editing mode (up to 5 reference images) all share one model.
  • Audio: Joint audio-video generation. Dialogue, ambient sound, and Foley all render in one pass, as in Veo 3.1, Sora 2, and Seedance 2.0, and unlike Kling 2.5 Turbo, which generates no audio.
  • Lip-sync languages: Mandarin, Cantonese, English, Japanese, Korean, German, French. Quote dialogue in the target language and the model lip-syncs without a separate sync step.
  • Clip length: 3 to 15 seconds.
  • Resolution: 1080p or 720p.
  • Aspect ratios: 16:9, 9:16, 1:1, 4:3, 3:4.
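The limits in this list are easy to check before you spend money on a generation. The sketch below is a hypothetical pre-flight validator: `ClipRequest` and `validate` are our own names, not part of any fal or Alibaba SDK, and the only facts it encodes are the ones listed above.

```python
from dataclasses import dataclass

# Limits taken from the spec list above.
MIN_SECONDS, MAX_SECONDS = 3, 15
RESOLUTIONS = {"720p", "1080p"}
ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}
MAX_REFERENCE_IMAGES = 5


@dataclass(frozen=True)
class ClipRequest:
    """Hypothetical request shape, for illustration only."""
    duration_seconds: int
    resolution: str
    aspect_ratio: str
    reference_images: int = 0


def validate(req: ClipRequest) -> list[str]:
    """Return a list of human-readable problems; empty means the request fits."""
    problems = []
    if not MIN_SECONDS <= req.duration_seconds <= MAX_SECONDS:
        problems.append(f"duration must be {MIN_SECONDS}-{MAX_SECONDS}s")
    if req.resolution not in RESOLUTIONS:
        problems.append(f"resolution must be one of {sorted(RESOLUTIONS)}")
    if req.aspect_ratio not in ASPECT_RATIOS:
        problems.append(f"aspect ratio must be one of {sorted(ASPECT_RATIOS)}")
    if not 0 <= req.reference_images <= MAX_REFERENCE_IMAGES:
        problems.append(f"at most {MAX_REFERENCE_IMAGES} reference images")
    return problems


print(validate(ClipRequest(10, "1080p", "9:16")))      # fits the listed limits
print(validate(ClipRequest(20, "2K", "21:9", 6)))      # violates all four
```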

Pricing on fal vs the rest of the lineup

| Model | Price (1080p, with audio) | 10s 1080p clip cost |
| --- | --- | --- |
| HappyHorse 1.0 | $0.28/sec | $2.80 |
| Veo 3.1 | $0.40/sec | $4.00 |
| Veo 3.1 Lite | $0.05/sec | $0.50 |
| Seedance 2.0 | $0.30/sec | $3.00 |
| Sora 2 (API) | $0.50/sec | $5.00 |
| Kling 2.5 Turbo | $0.18/sec (no audio) | $1.80 |

HappyHorse lands in the mid tier on price: cheaper than Veo 3.1 and Sora 2, slightly below Seedance 2.0, and well above Veo 3.1 Lite and the audio-free Kling 2.5 Turbo. It is not the budget choice. It is the leaderboard #1 you reach for when you need joint audio and multilingual lip-sync at production quality.

When to pick HappyHorse vs the alternatives

Pick HappyHorse 1.0 when:

  • Dialogue is in Japanese, Mandarin, Korean, or German — Veo 3.1 and Sora 2 lip-sync those weakly, HappyHorse syncs them natively.
  • Joint audio-video matters and you only have one pass to get it right.
  • You want today’s leaderboard #1 quality and the 1080p tier is the ceiling you need.

Pick Veo 3.1 when: the prompt names a specific lens, lighting setup, and camera move that has to land literally. Veo’s directability is still the tightest in the field.

Pick Seedance 2.0 when: you need multi-reference compositions of up to 12 assets (drawn from up to 9 images, 3 videos, and 3 audio clips) or 2K output. HappyHorse caps at 5 reference images and 1080p.

Pick Sora 2 when: the shot length pushes past 15 seconds and you need 25-second narrative beats in one generation.

Pick Kling 2.5 Turbo when: there is no dialogue, you want the cheapest fast iteration, and audio doesn’t matter.
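If you route requests automatically, the guidance above can be sketched as a small decision function. The rule order is our assumption (hard limits first, then cost, then quality preferences), and `pick_model` and its parameters are hypothetical names rather than anything from a real SDK.

```python
from typing import Optional

# Languages the article says Veo 3.1 and Sora 2 lip-sync weakly.
WEAK_ELSEWHERE = {"japanese", "mandarin", "korean", "german"}


def pick_model(
    *,
    clip_seconds: int = 10,
    dialogue_language: Optional[str] = None,
    needs_audio: bool = True,
    reference_assets: int = 0,
    needs_2k: bool = False,
    literal_camera_control: bool = False,
) -> str:
    """Map shot requirements to a model, following the article's guidance."""
    # Hard limits first: HappyHorse stops at 15 s, 5 references, and 1080p.
    if clip_seconds > 15:
        return "Sora 2"
    if needs_2k or reference_assets > 5:
        return "Seedance 2.0"
    # No dialogue and no audio: take the cheap, fast option.
    if dialogue_language is None and not needs_audio:
        return "Kling 2.5 Turbo"
    # Native lip-sync outweighs directability for these languages (assumed priority).
    if dialogue_language in WEAK_ELSEWHERE:
        return "HappyHorse 1.0"
    if literal_camera_control:
        return "Veo 3.1"
    return "HappyHorse 1.0"


print(pick_model(dialogue_language="japanese"))
print(pick_model(clip_seconds=25))
```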

Prompt recipe: a UGC ad in Japanese

A 28-year-old woman in a sunlit Tokyo cafe holds a matte-black coffee
mug. She tilts the mug toward camera and says in Japanese,
"10秒でできた." Medium close-up, 35mm lens, slow push-in.
Soft window light from camera-left, warm 4500K. Audio: faint espresso
machine, distant street traffic. Style: 2020s Apple ad, shallow depth
of field, 9:16 vertical, 1080p.

The line is "10秒でできた", meaning “it took 10 seconds.” HappyHorse lip-syncs the Japanese text directly. Veo 3.1 will deliver the line, but the mouth shapes drift; Sora 2 will often default to English mouth shapes on Japanese audio.

Prompt recipe: cinematic dialogue in German

Two friends in their late 20s sit on a Berlin rooftop at golden hour.
Beat 1 (0-6s): the first laughs and says in German, "Du hast es
wirklich gelauncht." Beat 2 (6-12s): the second smirks and replies,
"Tag sechs." Locked medium two-shot, 50mm lens. Warm 5500K key from
camera-right, cool fill from sky. Audio: both voices clear, faint
city traffic below, no music. Style: indie short film, 16:9, 1080p.

Two beats, two speakers, German handled natively, dialogue in quotes, ambient described separately. This is the structure HappyHorse was tuned on.
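If you generate prompts programmatically, the same structure can be assembled from parts. The sketch below is one way to do it: `Beat` and `build_prompt` are hypothetical helpers of ours, and the 15-second check reflects the clip-length limit listed earlier. It only builds the prompt string and does not call any API.

```python
from dataclasses import dataclass

MAX_SECONDS = 15  # clip-length ceiling from the spec list


@dataclass(frozen=True)
class Beat:
    start: int   # seconds
    end: int     # seconds
    action: str  # who does what, and in which language they speak
    line: str    # spoken dialogue in the target language, no trailing period


def build_prompt(scene, beats, camera, lighting, audio, style) -> str:
    """Join scene, timed beats, camera, lighting, audio, and style into one prompt."""
    if beats and beats[-1].end > MAX_SECONDS:
        raise ValueError(f"beats run past the {MAX_SECONDS}s clip limit")
    sentences = [scene]
    for number, beat in enumerate(beats, start=1):
        # Dialogue goes in double quotes, as in the recipes above.
        sentences.append(
            f'Beat {number} ({beat.start}-{beat.end}s): {beat.action}, "{beat.line}"'
        )
    sentences += [camera, lighting, f"Audio: {audio}", f"Style: {style}"]
    # Normalize so every part ends with exactly one period.
    return " ".join(s.rstrip(".") + "." for s in sentences)


prompt = build_prompt(
    scene="Two friends in their late 20s sit on a Berlin rooftop at golden hour",
    beats=[
        Beat(0, 6, "the first laughs and says in German", "Du hast es wirklich gelauncht"),
        Beat(6, 12, "the second smirks and replies", "Tag sechs"),
    ],
    camera="Locked medium two-shot, 50mm lens",
    lighting="Warm 5500K key from camera-right, cool fill from sky",
    audio="both voices clear, faint city traffic below, no music",
    style="indie short film, 16:9, 1080p",
)
print(prompt)
```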

More recipes (including reference-to-video and a multilingual narration example) are on the model page.

What this means for the overall lineup

For the first time since Veo 3 dropped, the model at the top of the public arena is not from Google or OpenAI. The 107-Elo gap is the biggest the leaderboard has ever recorded: bigger than any gap Veo opened over Runway, and bigger than the Sora 2 launch lead. That is partly the natural lift of a fresh model with a long anonymous tail of A/B data, and partly genuine product differentiation: nobody else ships arena-leading physics, joint audio-video, and seven-language lip-sync in one model.

Practical read: if you’re building short-form content and your audience is multilingual, HappyHorse 1.0 just became a default tool in the selector. We added the model on launch day — try it inside ShortsFast on the /text-to-video or /image-to-video flow and it shows up next to Veo 3.1, Sora 2, Seedance 2.0, and the rest.
