
4 AI Image Models Tested: Hand Accuracy Peaks at Just 72%, with a Low of 65%

By Cael Lee | 6 min read


Generated: 2026-06-22 18:57:09

---


I Tested 4 AI Models, Burned Through $200 in Electricity, and Found a Harsh Truth

I've been writing this column for ten years, and I've never been as flustered as I have been in the last two.

It's not laziness. It's that AI image generation iterates faster than I change my phone. I originally planned to do a "year-end review," but I realized I couldn't even wait until the end of the year—something I wrote in March was already outdated by May. In March, I'd think, "Wow, that hand looks so real," but by May, that hand looked like a chicken claw.

So, I decided to start a long-term project instead. Consider this the first edition, documenting all the models and key technologies I've personally tested from 2022 to now. Whenever something new comes out, I'll come back and update it. No limits for myself, and no outdated advice for you.

---

Why Did I Do This? The Reason Is Simple

Last year, a reader left a comment that almost made me laugh: "You're always hyping up AI drawing, but which one is actually good? I tried Midjourney, and the hands it generated all had six fingers."

I replied, "Bro, that's the wrong version. The hand logic between V4 and V5 is a whole era apart."

Then he pressed further: "So how does V5 compare to DALL·E 3? What the heck is Stable Diffusion XL? Which one should I learn?"

I was momentarily speechless.

To be honest, I often switch between three models myself—because each has its own annoyances and highlights. Midjourney produces beautiful images, but the hands can mess up; DALL·E 3 has strong comprehension but leans toward a cartoonish style; SDXL offers high freedom, but tweaking the parameters can make you question your life choices.

So, I decided to spend the time running through all the mainstream models, from the underlying technology to actual image output. For you, and for myself.

---

The Testing Process: I Spent 3 Days and Burned Through $200 in Electricity

Test rig: RTX 4090 (24GB VRAM), PyTorch 2.1.0, CUDA 12.1. Only SDXL actually ran on it; the other three are cloud services. I tested four models: Midjourney V5.2, DALL·E 3, Stable Diffusion XL 1.0 (SDXL for short), and Adobe Firefly 2.0.

The test content was divided into three parts:

  1. Basic Generation: Same prompt for every model: "A girl in a red dress dancing in the rain, with a neon-lit city in the background, cinematic feel."
  2. Hand Details: Deliberately wrote "hands crossed over chest, fingers clearly visible." You know the drill: AI's failure rate for drawing hands is higher than humans' success rate.
  3. Style Transfer: Asked the model to convert a photo into an oil painting. Let's see whose aesthetic sense is on point.

Each model generated 10 images. I manually counted the duds—deformed fingers, messed-up lighting, broken composition—if any one of those was present, it counted as a failure.
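For anyone who wants to repeat the tally, here is a minimal sketch of the scoring rule in Python. The `ImageReview` class, its field names, and the sample batch are hypothetical, made up for illustration; they are not my actual review data.

```python
from dataclasses import dataclass


@dataclass
class ImageReview:
    """One manually reviewed image (hypothetical structure)."""
    deformed_fingers: bool = False
    bad_lighting: bool = False
    broken_composition: bool = False

    @property
    def is_dud(self) -> bool:
        # Any single defect is enough to count the image as a failure.
        return self.deformed_fingers or self.bad_lighting or self.broken_composition


def success_rate(reviews: list[ImageReview]) -> float:
    """Fraction of reviewed images that have no defect at all."""
    if not reviews:
        raise ValueError("need at least one reviewed image")
    passed = sum(not review.is_dud for review in reviews)
    return passed / len(reviews)


# Made-up batch of 10 images: 7 clean, 3 with at least one defect.
batch = [ImageReview() for _ in range(7)] + [
    ImageReview(deformed_fingers=True),
    ImageReview(bad_lighting=True),
    ImageReview(deformed_fingers=True, broken_composition=True),
]
print(f"{success_rate(batch):.0%}")  # prints 70%
```

Note that an image with two defects still counts as one dud, which is how the "any one of those" rule works.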

---

Comparison Table (Tested March 2024)

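| Model | Version | Basic Generation Success Rate | Hand Accuracy | Style Transfer Quality | Generation Speed (per image) | VRAM Usage |
|---|---|---|---|---|---|---|
| Midjourney | V5.2 | 85% | 72% | High | ~30s (cloud) | No local usage |
| DALL·E 3 | Latest | 88% | 68% | Medium | ~15s (cloud) | No local usage |
| SDXL | 1.0 | 80% | 65% | High (needs tuning) | ~8s (local) | 8.2GB |
| Firefly | 2.0 | 78% | 70% | Medium-Low | ~20s (cloud) | No local usage |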

Note: Success rate is my subjective judgment. If even one finger is wrong, the lighting is off, or the composition is skewed, it's a dud. Midjourney's 72% hand accuracy is already a huge improvement over V4's 30%, but it still fails: for example, a thumb growing next to the index finger, or a ring finger disappearing.

---

Key Technology Iterations: I'll Highlight Three

1. From "Diffusion Models" to "Consistency Models": Speed Increased 100x

Back in 2022, the mainstream models were diffusion models in the DDPM (Denoising Diffusion Probabilistic Models) family. Generating a single image required dozens of denoising steps, as slow as a snail crawling. You could finish a cup of coffee waiting for one image.

In 2023, step-reduction techniques such as faster samplers and "progressive distillation" went mainstream. With SDXL I cut the step count from 50 to 20, roughly doubling the speed. The feeling of "finally fast" was like switching from a green train to a high-speed rail.

Then there are "consistency models," which OpenAI researchers published in 2023 and which can, in theory, generate an image in a single step. I tried the approach, and it was ridiculously fast, 0.5 seconds per image. But the details were lacking; faces looked like mosaics. It's still at the lab stage, so don't rush to use it.

From 50 steps to 1 step—can you believe that change?
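To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes generation time grows linearly with the number of denoising steps, plus a fixed overhead for things like text encoding and decoding the final image. The per-step and overhead timings in the example are made-up placeholders, not measurements from my machine.

```python
def step_speedup(steps_before: int, steps_after: int) -> float:
    """Best-case speedup from cutting steps, ignoring any fixed overhead."""
    if steps_before <= 0 or steps_after <= 0:
        raise ValueError("step counts must be positive")
    return steps_before / steps_after


def estimated_seconds(steps: int, seconds_per_step: float, overhead_seconds: float = 0.0) -> float:
    """Linear cost model: fixed overhead plus a constant cost per step."""
    return overhead_seconds + steps * seconds_per_step


# Step counts from the text: 50 -> 20 -> 1.
print(step_speedup(50, 20))  # 2.5
print(step_speedup(50, 1))   # 50.0

# Hypothetical timings (assumptions): 0.1 s per step, 1.0 s fixed overhead.
for steps in (50, 20, 1):
    print(steps, "steps:", round(estimated_seconds(steps, 0.1, 1.0), 2), "s")
```

Under this model the fixed overhead is why going from 50 to 20 steps feels closer to "double" than to the ideal 2.5x, and why a one-step model never gets the full 50x in practice.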

2. Evolution of Text Encoders: The Tragedy of "Red Dress" Becoming "Red Background"

Early models used CLIP, which often misinterpreted "red dress" as "red background." You'd write "a girl in a red dress," and it would generate a girl with a red background. It made you want to curse.

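Newer models have made a huge leap in text comprehension, though not all by the same route. SDXL still uses CLIP-family encoders, but two of them combined (OpenCLIP ViT-bigG and CLIP ViT-L); T5-XXL is the encoder behind models such as Google's Imagen; and DALL·E 3 owes much of its gain to training on far more descriptive captions. I tried writing "a cat wearing a suit and sunglasses, standing on Wall Street holding a coffee," and DALL·E 3 accurately drew the suit, tie, and sunglasses, but the cat's paws still tended to have five fingers.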

What does this show? Text comprehension is largely solved, but visual details aren't fully nailed yet. It's like someone who understands what you say but still lacks the manual dexterity.

3. The Rise of Control Networks: Finally Able to "Point and Shoot"

ControlNet is the most exciting technology of 2023 for me, bar none.

It allows you to use a sketch, a pose, or even a depth map to control the generated result. For example, you take a photo, extract the human skeleton, and then let SDXL generate a new character based on that skeleton.

I tried using OpenPose control with SDXL, and the hand success rate jumped from 65% straight to 90%. As long as the skeleton was drawn correctly, the model didn't dare to grow extra fingers. From "relying on luck" to "relying on technology," it's just one ControlNet away.
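If you want to try the skeleton workflow in code rather than through a UI, a rough sketch with the Hugging Face `diffusers` and `controlnet_aux` libraries might look like the following. Treat it as a starting point under assumptions, not my exact setup: the ControlNet checkpoint ID, the annotator repo, the `include_hand` flag, and every default value are assumptions you should check against the versions you have installed. It needs a CUDA GPU plus the `torch`, `diffusers`, `accelerate`, and `controlnet_aux` packages.

```python
# Model IDs below are assumptions; swap in whichever checkpoints you actually use.
BASE_MODEL_ID = "stabilityai/stable-diffusion-xl-base-1.0"
CONTROLNET_ID = "thibaud/controlnet-openpose-sdxl-1.0"  # assumed community OpenPose ControlNet
ANNOTATOR_ID = "lllyasviel/Annotators"                  # assumed pose-detector weights


def generate_from_pose(
    photo_path: str,
    prompt: str,
    negative_prompt: str = "extra fingers, deformed hands",
    steps: int = 30,
    control_scale: float = 0.8,
):
    """Generate an SDXL image that follows the pose extracted from a photo."""
    # Heavy dependencies are imported lazily so this file loads without a GPU.
    import torch
    from controlnet_aux import OpenposeDetector
    from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
    from diffusers.utils import load_image

    # 1. Extract the skeleton from the reference photo.
    #    include_hand=True asks for hand keypoints too (assumed to be supported).
    detector = OpenposeDetector.from_pretrained(ANNOTATOR_ID)
    pose_image = detector(load_image(photo_path), include_hand=True)

    # 2. Load SDXL with the OpenPose ControlNet attached.
    controlnet = ControlNetModel.from_pretrained(CONTROLNET_ID, torch_dtype=torch.float16)
    pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
        BASE_MODEL_ID, controlnet=controlnet, torch_dtype=torch.float16
    )
    pipe.enable_model_cpu_offload()  # trades speed for lower VRAM use

    # 3. Generate, letting the skeleton constrain the pose.
    result = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        image=pose_image,
        num_inference_steps=steps,
        controlnet_conditioning_scale=control_scale,
    )
    return result.images[0]
```

A call would look like `generate_from_pose("photo.jpg", "a dancer in a red dress, cinematic").save("out.png")`, where `photo.jpg` is a placeholder file name. The `control_scale` value is the weight I end up adjusting most: lower it and the model drifts from the skeleton, raise it and the pose is followed more rigidly.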

---

My Usage Advice: Don't Be Blind, Don't Follow the Crowd

If you just want quick images without hassle: Use Midjourney V5.2 or V6 (if it's out). It's a bit pricey ($30/month), but it's worry-free. Hand issues? Just generate a few more times, and you'll eventually get a good one. Like buying lottery tickets—buy more, and you'll hit a small prize eventually.
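The lottery comparison can be made concrete with a little arithmetic. If each image independently has probability p of good hands, the chance that at least one of n attempts works is 1 - (1 - p)^n. A small sketch using the hand accuracy figures from my table; independence between attempts is an assumption, and in practice a tricky prompt tends to fail repeatedly:

```python
import math


def chance_of_a_keeper(p_success: float, attempts: int) -> float:
    """Probability that at least one of `attempts` images is good."""
    if not 0 < p_success < 1 or attempts < 1:
        raise ValueError("need 0 < p_success < 1 and attempts >= 1")
    return 1 - (1 - p_success) ** attempts


def attempts_needed(p_success: float, target: float = 0.99) -> int:
    """Smallest number of attempts that reaches the target probability."""
    if not 0 < p_success < 1 or not 0 < target < 1:
        raise ValueError("probabilities must be strictly between 0 and 1")
    return math.ceil(math.log(1 - target) / math.log(1 - p_success))


# Hand accuracy per image, taken from the comparison table.
hand_accuracy = {
    "Midjourney V5.2": 0.72,
    "DALL·E 3": 0.68,
    "SDXL 1.0": 0.65,
    "Firefly 2.0": 0.70,
}
for model, p in hand_accuracy.items():
    print(
        f"{model}: {chance_of_a_keeper(p, 4):.1%} chance in 4 tries, "
        f"{attempts_needed(p)} tries for 99%"
    )
```

With these numbers, four or five attempts are enough to be about 99% sure of at least one usable hand on any of the four models.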

If you need commercial-grade quality and are willing to tweak parameters: Use SDXL + ControlNet. Run it locally, low cost (just electricity), but you'll need to learn how to install plugins, adjust weights, and write negative prompts. It took me two weeks to get it running smoothly, but once you get the hang of it, the freedom crushes cloud-based models. The joy of being able to "point and shoot"—whether it's worth those two weeks is for you to decide.

If you need precise text comprehension: Use DALL·E 3. It best reproduces complex scenes like "a cat standing on Wall Street," but its style leans cartoonish, not suitable for realism. Like a friend with great understanding but questionable taste.

Firefly 2.0: Not recommended for now. Adobe's ecosystem integration is an advantage, but its basic generation success rate was the lowest of the four (78%), hands failed 30% of the time, and style transfer feels like just applying a filter. It's not as good as SDXL with ControlNet, which at least can produce an oil painting feel.

---

Let Me Be Honest

The iteration of AI image generation technology is essentially a process of going from "barely watchable" to "usable."

In 2022, you'd pray that a generated beauty didn't grow a third eye. In 2024, you're already agonizing over the number of


Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
