I Found the Actually-Working OpenAI-Compatible APIs in China (And the Ones That Lied to Me)
I Found the Actually-Working OpenAI-Compatible APIs in China (And the Ones That Lied to Me)
I just spent three weeks migrating our startup's backend from a Frankenstein mix of domestic LLM SDKs to a unified proxy setup, and I'm still questioning my life choices. Here's what I learned: half the "OpenAI-compatible" endpoints out there are lying through their teeth about compatibility.
Let me save you the pain I went through.
If you're building anything that needs to talk to LLMs from inside China, you know the drill. OpenAI is... complicated. Anthropic's blocked entirely, unless you enjoy playing VPN roulette with production traffic. Gemini technically works but the latency—god, the latency. 2.3 seconds p95 for a simple completion. I measured. My keyboard barely survived the frustration.
The official Chinese models (Baichuan, ChatGLM, Qwen) all have their own SDKs that look like someone skimmed the OpenAI docs once at 2 AM and decided to get "creative." Different parameter names. Weird nesting. "Stream" means something completely different in each one. I found a Baichuan wrapper last month where temperature accepted values from 0 to 1... but internally divided by 10. No documentation. Just vibes.
But here's what nobody tells you: a bunch of providers now expose OpenAI-protocol endpoints. Like, actual drop-in replacements where you change base_url and it just... works.
Mostly.
The Holy Grail List (Tested and Actually Working as of June 2024)
1. Moonshot (Kimi)
https://api.moonshot.cn/v1
- Model:
moonshot-v1-8k,moonshot-v1-32k,moonshot-v1-128k - Context: Up to 128K tokens (yes, you read that right)
- Streaming: Works perfectly, including function calling
- Pricing: ¥0.012/1K tokens (roughly $0.0017 for input, $0.002 for output)
This one is my daily driver now. I literally changed OPENAIBASEURL and my entire LangChain pipeline worked without touching a line of code. I almost cried.
Almost.
The catch: Their 128K model hallucinates more than a freshman on deadline. I threw a 90-page legal contract at it and asked for three key clauses. It gave me four—one of which appeared nowhere in the document. Use the 32K for production. Actually, wait—I should clarify that the 8K model is fine too, just... don't trust the 128K version with anything where accuracy matters. I think it's a RoPE scaling thing? Not sure.
2. DeepSeek
https://api.deepseek.com/v1
- Model:
deepseek-chat - Context: 32K
- Pricing: ¥0.001/1K tokens input, ¥0.002/1K output
No. That's not a typo.
Their V2 model benchmarks close to GPT-4 on MMLU and HumanEval. Is it actually that good? I mean... for code, yeah? For creative writing, probably not. But for structured extraction, classification, RAG pipelines—the stuff that pays the bills—it's insane value.
I ran 50,000 API calls through DeepSeek last month for a document processing pipeline. Total bill? ¥4.73. I spent ¥38 on coffee while setting it up. The math doesn't math.
The bad: Rate limits are aggressive on the free tier. Like, 10 RPM aggressive. Their support email auto-replies in Chinese and then ghosts you for three days. I eventually got a reply on day four. Progress?
3. Zhipu AI (ChatGLM)
https://open.bigmodel.cn/api/paas/v4
- Model:
glm-4,glm-4v(vision),glm-3-turbo - Context: 128K
- Pricing: ¥0.01-0.10/1K tokens depending on model
They use /api/paas/v4 instead of /v1, but the endpoint structure is identical. Function calling works. Vision API works exactly like GPT-4V. They even support the responseformat: { type: "jsonobject" } parameter—properly, not just documented and broken like some others I could mention.
The authentication headache: They want Bearer {apikey}.{apisecret} in the header. Not Bearer {api_key}. Their docs mention this in one sentence buried in a FAQ from November 2023. I stared at 401 errors for 45 minutes. The log just said "authentication failed." Thanks. Very helpful.
You also need to generate separate apikey/apisecret pairs for each model family. That's... annoying. Not a dealbreaker, just annoying.
4. Alibaba Qwen (Tongyi Qianwen)
https://dashscope.aliyuncs.com/compatible-mode/v1
The path tells you everything. /compatible-mode. It's "compatible" in the same way vegan cheese is cheese.
I feel like I need to be fair here though. Most things actually do work. Chat completions? Fine. Embeddings? Fine. But streaming... oh boy.
- Model:
qwen-turbo,qwen-plus,qwen-max - Context: 8K-32K depending on model
- Pricing: ¥0.008-0.12/1K tokens
Streaming responses sometimes drop the finishreason field. Not always. Maybe 2% of requests? I couldn't reproduce it reliably, so I wrote a wrapper that just defaults to finishreason: "stop" if it's missing. Ugly but functional. The chunks also occasionally arrive out of order—the choices[0].delta.content will have text from three chunks ago mixed in. I think it's a race condition in their SSE implementation? Their GitHub issues thread from March 2024 has other people seeing it too.
On the plus side, qwen-max is genuinely good at Chinese-language tasks. Better than GPT-4 for classical Chinese translation. I threw some Tang dynasty poetry at it and the translations were... actually beautiful? My Chinese colleague said they were better than her high school textbook. No joke.
5. Baidu ERNIE (via Qianfan)
https://aip.baidubce.com/rpc/2.0/ai_custom/v1/wenxinworkshop/chat/completions
Don't.
I mean—if you're already deep in the Baidu ecosystem, fine. Maybe. But "OpenAI-protocol" here means "it accepts JSON and returns JSON." The response format is different enough that you'll need to transform it. Token counting doesn't match. At all. I sent the same 100-token input and got back usage showing 37 tokens. Streaming uses line-delimited JSON with \n\n instead of data: prefix. Why? Why would you do that?
I spent six hours on Baidu ERNIE integration. Got it working. Then the next day their API version updated and broke my wrapper. I gave up.
Actually—wait, I should be fair. Their ERNIE 4.0 model benchmarks well. And if you use their native Qianfan SDK instead of trying to force OpenAI compatibility, it's... fine. But this post is about drop-in replacements, and ERNIE ain't it.
The Dark Horses (Worth Watching)
- 01.AI (Yi-34B): They're promising OpenAI compatibility "soon." Their current API is REST but uses different schemas. The model itself is excellent for code generation—I tested it on some internal benchmarks and it beat CodeLlama-34B handily. If they ship that OpenAI endpoint, it's an instant switch for our code assistant feature.
- ByteDance (Skylark/Doubao): No public API yet. But if you read their technical report from January 2024 (the one that was briefly on arXiv before getting pulled), the benchmarks are suspiciously good. Like, "we fine-tuned on the eval set" good. Grain of salt. A friend at a Beijing startup claims they have beta access and the results are impressive, but I'll believe it when I see public benchmarks.
- MiniMax (Hailuo AI): Invite-only API. Someone on r/MachineLearning got access and posted a gist showing it's fully OpenAI-compatible, including the
/v1/embeddingsendpoint. I applied for access six weeks ago. No response. If anyone from MiniMax is reading this... please. I just want to try it.
War Story Time
Last month I pushed a "minor refactor" to production that swapped our model from GPT-4 to Moonshot. CI/CD pipeline was green. Unit tests passed. Integration tests passed. Deployed at 11 PM like an idiot.
Woke up at 3:07 AM to PagerDuty screaming.
The error: TypeError: 'str' object is not subscriptable somewhere deep in our function calling handler. Moonshot's API was returning tool_calls as a string instead of an object. Not always—0.1% of requests. Took four hours of staring at CloudWatch logs to find the pattern. It happened when the tool name contained underscores.
The fix was a one-line try/except json.loads plus an extra isinstance check. Four hours. For one line.
The moral? "OpenAI-compatible" is a spectrum. Always test your edge cases. Always.
My Current Setup (YMMV)
I'm using LiteLLM as a proxy with fallback chains:
router:
primary: deepseek-chat
fallbacks:
- moonshot-v1-32k # for longer contexts
- qwen-max # for Chinese-heavy content
emergency: [redacted openai proxy]
Total cost is down 73% from GPT-4. Latency is better since servers are physically closer—p95 dropped from 1.8s to 400ms. The only downside is I now maintain four API key rotation scripts because some providers expire tokens every 30 days for "security reasons." DeepSeek rotates every 90 days, Moonshot every 30, Zhipu every 60. Kill me.
Key Takeaways
- Moonshot and DeepSeek are the most plug-and-play OpenAI replacements in China right now
- Zhipu and Qwen work but have quirks you'll need to code around
- Baidu ERNIE is OpenAI-compatible in name only (fight me in the comments)
- Use LiteLLM or a similar proxy to handle fallbacks and retries
- Test your streaming endpoints thoroughly, especially tool calling
- Domestic models are getting scary good for Chinese-language tasks
Question for the crowd: Anyone tried the new DeepSeek Coder V2 yet? The HumanEval scores look too good to be true. I'm worried about benchmark contamination but honestly at ¥0.001 per 1K tokens I might not even care. For the stuff we do internally, even if it's 20% worse than GPT-4, the cost difference makes it a no-brainer.
Also, if anyone from the 01.AI team is lurking: please release that OpenAI-compatible endpoint. My wrist hurts from writing custom API wrappers. I've written four this year. I'm tired.
Edit: Thanks for the silver! Several people DMed me asking about the proxy for OpenAI. I can't share details (throwaway account, plus it's against ToS), but search "openai api reverse proxy github" and you'll find options. Cloudflare Workers is a popular approach apparently? Use at your own risk. I am not your lawyer.
Edit 2: DeepSeek support actually replied to my email from three weeks ago. They fixed the rate limiting documentation. Progress! The docs now show actual numbers instead of "reasonable limits apply." Whatever that meant.
Edit 3: Some folks are asking about security. Yes, you're sending your data through servers in China. If you're handling sensitive data, self-host one of the open-source models. Qwen-14B and Yi-34B have Apache 2.0 licensed versions that run fine on a single A100. No excuse for sending PII through a third-party API. Seriously. I've seen what gets logged.
Edit 4: Clarification on Moonshot pricing—I listed per-1K-token rates but they actually bill per-character for Chinese text. Works out roughly the same but the invoice line items are confusing. Took me a billing cycle to figure out why costs didn't match my calculations.
china #llm #openai #api #devops #machinelearning
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.