AnimationGPT (English)

By CaelLee | 5 min read

Generated: 2026-06-21 15:44:39

---

Last week, an indie game developer buddy of mine threw a question at me that left me completely stumped:

"Can you find an AI where I say 'sword slash from left to right' and it just gives me an animation on the spot?"

The request sounds simple enough, but I dug through all the tools I had on hand, and none of them were particularly hassle-free.

Cascadeur's AutoPosing was half-decent, but expecting it to create something from nothing? Don't even think about it.

The motions from DeepMotion floated around like someone doing drunk boxing—completely unusable.

And Plask? You have to record a video first before it even deigns to work.

Just as I was tearing my hair out, I stumbled across an article on Zhihu by that AnimationGPT crew.

Well, I'll be damned—they'd open-sourced a model that generates combat animations from text, and even included a 14.8k motion dataset with it!

I started downloading the code that very night and spent the next three days running it, all-nighters included. Today I'm laying out the honest truth: no holding back, credit where it's due, and no pulled punches when something sucks.

---

1. What the hell is this thing, and how is it different from other AI animation tools?

Put simply and bluntly: you type text, and it spits out 3D combat animations.

For example, you input "right hook followed by a spinning back kick," and it generates an animation on an SMPL skeleton, which you then convert to BVH via a Maya script and drop straight into Unity or Unreal.

What really blew me away was the logic behind it.

They used the MotionGPT framework, treating motion as a language to be learned.

Specifically, they use VQ-VAE to compress a long motion sequence into a few dozen discrete "words," then train a language model to learn the relationship between these words and natural language text.

During training, the model sees both the sentence "thrust three times in quick succession" and the corresponding "motion word sequence," and eventually it learns to combine those "motion words" based on the text.
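To make the "motion words" idea concrete, here is a toy sketch of the quantization step only: each frame's feature vector is swapped for the index of its nearest codebook entry. This is my own illustration, not the project's code. The 2-D features, the three-entry codebook, and the `quantize` helper are all made up for the example, and a real VQ-VAE learns its codebook and also compresses along the time axis.

```python
import math

def quantize(frames, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    tokens = []
    for frame in frames:
        distances = [math.dist(frame, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens

# Hypothetical toy data: 2-D "pose features" and a 3-entry codebook.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.1, 0.0), (0.9, 0.1), (0.1, 0.8), (0.0, 0.1)]

tokens = quantize(frames, codebook)          # the "motion word" sequence
decoded = [codebook[t] for t in tokens]      # what a decoder would start from

print(tokens)
print(decoded)
```

Notice that the decoded vectors are the codebook entries, not the original frames. Whatever fell between entries is gone, which is the price of turning motion into a small vocabulary a language model can handle.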

After countless trials, I realized the key isn't how fancy the model is—the real heavyweight is the dataset.

Their open-source CombatMotion Dataset, 14.8k combat actions, is specifically annotated across eight major categories—attack type, body part, force, speed, direction, number of combo hits, stance, and ending state.

You know what that means? HumanML3D, the academic dataset everyone uses, is about the same size at 14.6k motions, but it's all walking, sitting, and waving.

You want it to swing a sword? Dream on. I tried running text generation on that before, and all I got was everyday trivialities. Input "charged heavy chop," and it gave me a "slowly raise arm and lower it again" that made me want to smash my mouse.

Another thing: to annotate these 14.8k actions, the project team built several vocabularies to unify semantics.

For example, when describing "speed," they didn't use vague terms like "pretty fast" or "kind of fast"; they strictly defined five levels: "extremely fast," "fast," "medium," "slow," "extremely slow."

Sounds simple? Try annotating ten actions yourself. I was overwhelmed after five, and maintaining consistency manually is the most soul-crushing hurdle—and they actually pulled it off.
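If you want to feel why a fixed vocabulary matters, here is a minimal sketch of the kind of check that keeps labels consistent. The five speed levels are the ones quoted above. The `validate_speed` helper is hypothetical and mine, not something from the project's tooling, and I haven't reproduced the vocabularies for the other seven categories.

```python
# The five speed levels quoted in the post; everything else here is illustrative.
SPEED_LEVELS = ("extremely fast", "fast", "medium", "slow", "extremely slow")

def validate_speed(label):
    """Return the normalized label, or raise if it is outside the fixed vocabulary."""
    normalized = label.strip().lower()
    if normalized not in SPEED_LEVELS:
        raise ValueError(f"unknown speed label: {label!r}")
    return normalized

print(validate_speed(" Fast "))
try:
    validate_speed("pretty fast")
except ValueError as err:
    print(err)
```

A check like this won't make annotation less tedious, but it does stop "pretty fast" and "kind of fast" from sneaking into the data as two different labels.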

---

2. Is the generated animation any good? Are sliding feet and jitter still a problem?

Let me give you the verdict: It's watchable and usable, but it still needs a long, teeth-gritting round of cleanup before it's production-ready.

I fed it "continuous spinning back slash," and the overall coherence of the result made me say "wow" out loud: the limbs didn't noticeably clip through each other, and the center of gravity stayed fairly stable during the rotation.

But when I looked closer, the same two old problems were both present.

First, the feet occasionally float off the ground or skate across it. Yep, the classic foot-sliding issue.

Second, the ending pose of the motion looks like someone suddenly hit the pause button; it's jarring.

I asked the project team, and they said it's because the VQ-VAE discretization loses some detail, and the post-processing doesn't include pose alignment.
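If you'd rather put a number on the foot problem than eyeball it, a rough sketch like the one below can help. It adds up how far a foot travels horizontally while it is supposed to be planted. Everything here is an assumption of mine: the `foot_slide` helper, the Y-up axis convention, the 2 cm contact tolerance, and the toy position lists. It is not part of the project's post-processing.

```python
import math

def foot_slide(positions, ground_height=0.0, contact_tol=0.02):
    """Sum horizontal foot travel over consecutive frames where the foot is near the ground.

    positions: per-frame (x, y, z) foot positions, with y assumed to be the up axis.
    """
    total = 0.0
    for prev, curr in zip(positions, positions[1:]):
        in_contact = (prev[1] - ground_height <= contact_tol
                      and curr[1] - ground_height <= contact_tol)
        if in_contact:
            total += math.hypot(curr[0] - prev[0], curr[2] - prev[2])
    return total

# Hypothetical toy clips: a planted foot, and one that skates before lifting off.
planted = [(0.0, 0.0, 0.0)] * 4
sliding = [(0.00, 0.0, 0.0), (0.03, 0.0, 0.0), (0.06, 0.01, 0.0), (0.06, 0.30, 0.0)]

print(foot_slide(planted))
print(foot_slide(sliding))
```

A planted foot should score zero. Anything noticeably above that on a clip where the character is standing its ground tells you which frames to fix first.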

Okay, I get it, but understanding doesn't get the job done.

My practical advice is: Don't ever expect to nail it in one shot.

First, generate a rough version with text to lock down the rhythm and composition, then go back into Maya to polish the keyframes.

My current workflow: write the description → generate (about 15 seconds on an A100) → export BVH → import into MotionBuilder to check the timeline → manually tweak the fingers and toes.

For a full combo that used to take 4 hours to keyframe by hand, I can now get it done in 1.5 hours. That's more than double the efficiency (roughly 2.7x). Pretty satisfying, right?

But I also fell into a big trap: Your input text can't be too poetic. Don't be a poet.

The first time I wrote something like "continuous straight punches like a storm," the model probably interpreted the vague description as "standing still and punching fast"—zero movement forward.

I changed it to: "Step forward with the right foot, deliver three straight punches in a row, retract the hand to guard the face after each punch." The result immediately felt right.

Remember this trick: The closer your description is to "direction + body part + force + duration," the more stable the output. It's like adding salt when cooking—be precise, and the flavor comes through.
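One way to force yourself to stay literal is to fill in slots instead of writing free text. The sketch below is just a habit-building helper of my own. The `MotionStep` class and its sentence template are hypothetical, not a format the model requires, so treat the wording as a starting point.

```python
from dataclasses import dataclass

@dataclass
class MotionStep:
    """One motion beat described by explicit slots (hypothetical helper)."""
    action: str
    body_part: str
    direction: str
    force: str
    duration: str

    def to_text(self):
        # Refuse to build a prompt while any slot is still blank.
        missing = [name for name, value in vars(self).items() if not value.strip()]
        if missing:
            raise ValueError(f"fill in every slot, missing: {missing}")
        return (f"{self.force} {self.action} with the {self.body_part}, "
                f"{self.direction}, {self.duration}")

step = MotionStep(
    action="straight punch",
    body_part="right fist",
    direction="stepping forward",
    force="heavy",
    duration="held for one second",
)
print(step.to_text())
```

The point isn't the template itself. It's that an empty slot raises an error, so you can't get away with "like a storm."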

---

3. How hard is it to get started? What do I need to prepare?

Not too hard, as long as your hardware can handle it.

Their GitHub provides the full pipeline: download the CombatMotion (CM) dataset → set up the Conda environment (Python 3.9 + PyTorch 1.11) → run the inference scripts.
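Before running anything, it's worth confirming the environment matches those versions. Here is a small sanity check I'd suggest. The `check_environment` helper is my own and not part of the repo. It only compares the interpreter and PyTorch versions against the ones mentioned above, and it doesn't touch the project's scripts.

```python
import sys

def check_environment(expected_python=(3, 9), expected_torch="1.11"):
    """Report whether Python and PyTorch match the versions mentioned in the post."""
    report = {"python_ok": sys.version_info[:2] == expected_python}
    try:
        import torch  # optional: the check still runs if PyTorch is absent
    except ImportError:
        report["torch_version"] = None
        report["torch_ok"] = False
    else:
        report["torch_version"] = torch.__version__
        report["torch_ok"] = torch.__version__.startswith(expected_torch + ".")
    return report

report = check_environment()
print(report)
```

If either flag comes back false, fix the Conda environment first. Version mismatches are a much cheaper problem to solve before you've downloaded the dataset than after.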

I run it on an RTX 3090 (24 GB VRAM), and a single generation takes about 20 seconds.

If you just want to play around with a demo, the project page also has a simplified version on Hugging Face Spaces, where you input text and see the result directly.

But the real hassle comes at the retargeting step.

Academics love the SMPL skeleton, but the industry works with standard BVH rigs (typically a hierarchy rooted at the hips, with chains for the spine, arms, and legs).

They wrote a Maya Python script to convert SMPL to BVH. During my testing, I found that the default skeleton mapping in the script didn't match certain character proportions.

For example, if my character's arms were slightly longer, the generated motion would have the fingers penetrating the chest—a truly horrifying sight.
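Why do slightly different proportions cause this? Copying joint rotations onto bones of a different length moves the hand somewhere else. The toy forward-kinematics sketch below shows it with a flat, top-down two-bone arm. All the numbers, the `hand_position` and `chest_clearance` helpers, and the chest modeled as a plane are assumptions of mine for illustration. None of it comes from the project's Maya script.

```python
import math

def hand_position(upper_len, fore_len, shoulder_deg, elbow_deg):
    """Planar two-bone forward kinematics from a shoulder at the origin.

    Top-down view: x is sideways, y is forward. Returns the hand's (x, y).
    """
    a1 = math.radians(shoulder_deg)
    a2 = a1 + math.radians(elbow_deg)
    elbow = (upper_len * math.cos(a1), upper_len * math.sin(a1))
    return (elbow[0] + fore_len * math.cos(a2),
            elbow[1] + fore_len * math.sin(a2))

def chest_clearance(hand, chest_y=0.05):
    """Distance from the hand to a chest surface modeled as the plane y = chest_y."""
    return hand[1] - chest_y

# Same guard-pose angles on both characters; only the forearm length differs.
source_hand = hand_position(0.30, 0.25, shoulder_deg=90, elbow_deg=150)
target_hand = hand_position(0.30, 0.30, shoulder_deg=90, elbow_deg=150)

print(round(chest_clearance(source_hand), 3))  # positive: hand in front of the chest
print(round(chest_clearance(target_hand), 3))  # negative: hand inside the chest
```

With identical angles, the 5 cm longer forearm folds the hand back past the chest plane. That is the same failure I saw on my character, just in two dimensions.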

The solution isn't all that complicated

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
