I Tested GPT-5.6 on 47 Dynamic Programming Problems — Here’s Where It Shines and Where It Falls Flat
I Tested GPT-5.6 on 47 Dynamic Programming Problems — Here’s Where It Shines and Where It Falls Flat
Last week, I threw 47 DP problems at GPT-5.6’s API. Twelve of them crashed and burned.
The worst offender? A knapsack variant where the model confidently handed me an O(n³) solution. I stared at my screen for a solid ten seconds. Like… you don’t see a problem with cubic complexity on a problem that should be quadratic? Come on.
But here’s the twist — out of the remaining 35, eight solutions were genuinely cleaner than what I’d write myself. Better variable names, thoughtful comments, solid edge case handling. That’s when I got curious. Is this thing actually reliable, or am I just seeing what I want to see? Time for a proper evaluation.
The Test Setup
Let me walk you through the configuration.
I used GPT-5.6’s standard API endpoint with temperature set to 0.3. Too low and the code gets rigid and mechanical — too high and it starts hallucinating creative but wrong solutions. After a few trial runs, 0.3 felt like the sweet spot. Max tokens at 4096, which is more than enough for even the most verbose DP implementations.
The problem set: 47 DP problems pulled from LeetCode (rating 1800+) plus a handful from ACM regional contests. I covered linear DP, interval DP, tree DP, bitmask DP, and digit DP. Wanted to include probability DP too, but couldn’t find enough problems to make it statistically meaningful. Maybe next time.
Now, the prompt design — this is where I messed up initially.
My first approach was lazy: dump the problem description and ask for code. The model kept missing edge cases. Empty arrays, single-element inputs, that kind of thing. So I switched to a three-step process: first, analyze the problem and identify the DP category; second, derive the state definition and transition equations; third, generate the actual code. The one-shot accuracy jumped from 62% to 74%. That’s a bigger improvement than I expected. Prompt engineering really does matter, I guess.
Wait — I should clarify something. That 74% is the first-attempt pass rate, meaning the code gets accepted on LeetCode on the very first submission. If we count solutions that succeed within two rounds of fixes, the overall rate climbs to about 85%. I’ll break this down later.
The Case Studies
Case 1: Edit Distance — Nailed It
This classic problem was handled beautifully. The model immediately recognized it as 2D DP, defined dp[i][j] as the minimum operations to convert the first i characters of word1 to the first j characters of word2, and wrote a crystal-clear transition equation. Empty string edge cases? Handled automatically. All test cases passed on the first run, and the complexity analysis was spot-on.
What caught my eye was a comment in the code: "Consider using a rolling array to optimize space to O(n)." That surprised me. Previous versions rarely volunteered optimization suggestions unprompted. Back in the GPT-4 days, you had to pry that stuff out of it like pulling teeth.
Case 2: Regular Expression Matching — Faceplant
This hard-level problem (matching with '.' and '*') tripped it up badly.
The initial state definition was correct, but the transition logic for handling '' matching zero or more characters missed a crucial branch. When p[j-1] is '' and p[j-2] matches s[i-1], you need to consider the dp[i-1][j] state — but the model only wrote dp[i][j-2]. Classic mistake.
I had it self-debug, and it took three rounds of corrections to get it right. This revealed something interesting: GPT-5.6 still struggles with "choice-heavy" DP reasoning. When each state has multiple branching decisions, it tends to miss branches. It’s like the attention mechanism gets diluted when the reasoning chain gets too long.
Hmm… this is tricky to articulate. I think the core issue is that large language models still have a weakness with "OR" logic. They gravitate toward the most intuitive path and other possibilities just… slip away.
Case 3: Traveling Salesman Problem — Pleasant Surprise
Honestly? I thought this one would be a disaster.
This bitmask DP problem requires tracking visited cities as a bitmask and computing shortest paths. The model gave me a remarkably clean solution: dp[mask][i] representing the shortest path visiting the cities in mask and ending at city i. The enumeration order was correct too — iterate over masks first, then current node, then target node. Get that order wrong and you’ll get subtly incorrect results. The kind of bug where the code runs fine but produces wrong answers. It didn’t fall for that trap.
Minor nitpick: the initialization set dp[1][0] = 0 and everything else to infinity, but didn’t mention how to handle unreachable states. Still, the code passed all test cases. That exceeded my expectations. If I’m being honest, the first time I solved TSP I forgot about unreachable states too and got a WA (Wrong Answer) before figuring it out.
Let the Numbers Talk
Here’s the breakdown across all 47 problems:
- Linear DP (15 problems): 80% first-attempt pass rate, 93% within two fixes
- Interval DP (8 problems): 50% first-attempt, 75% within two fixes
- Tree DP (7 problems): 57% first-attempt, 71% within two fixes
- Bitmask DP (9 problems): 44% first-attempt, 67% within two fixes
- Digit DP (5 problems): 40% first-attempt, 60% within two fixes
- Knapsack variants (3 problems): 100% first-attempt
The pattern is pretty clear. Structured DP with well-defined patterns (linear, knapsack) performs great. Problems with complex state spaces or multi-dimensional decision-making (bitmask, digit DP) are where things get shaky.
Another observation: the model handles natural language problem descriptions much better than math-heavy ones. When a problem relies heavily on mathematical notation and requires reverse-engineering intent from formulas, accuracy drops noticeably. I suspect this reflects the training data distribution — LeetCode solution discussions are mostly text-based explanations, not pure formula derivations.
The Gotchas
Here are some real issues I ran into. If you’re planning to use GPT-5.6 for algorithm problems or code generation, keep these in mind.
Gotcha 1: Boundary Initialization Gets Forgotten
Multiple times, the model nailed the recurrence logic but botched the base case initialization. Take the Longest Palindromic Subsequence problem — it initialized length-1 cases correctly but skipped length-2, causing all even-length palindromes to fail. I added "Please verify all boundary condition initializations" to my prompt, which helped, but occasional misses still happened. This problem existed in GPT-4 and, from what I’ve seen, 5.6 improved it but didn’t eliminate it.
Gotcha 2: Space Optimizations Introduce Sneaky Bugs
I once asked it to optimize a solution from O(n²) space to O(n) using a rolling array. It implemented the rolling array but didn’t adjust the loop order, causing state overwrites. This kind of bug is incredibly subtle — you won’t catch it without running tests. My advice: if you’re using it for space optimization, demand an explanation of why the loop order changed. Don’t just read the code. Read the reasoning.
Gotcha 3: Constraint-Blindness
One problem had n up to 10^5, and the model casually proposed an O(n²) solution. Obviously not going to fly. I followed up with "n can be up to 100,000 — will this complexity pass?" and it immediately pivoted to an O(n log n) approach. Lesson learned: always include the data range in your prompt. Don’t expect it to notice constraints on its own.
Oh, and one more thing that’s not exactly a bug but definitely annoying: GPT-5.6 occasionally uses cryptic variable names like t, cur, tmp. Makes the code a pain to read. You’ll either need to rename everything manually or explicitly request "use meaningful English words for variable names" in your prompt.
Quick Comparison: GPT-4 vs Claude 3.5 vs GPT-5.6
I ran the same 20 problems through GPT-4 and Claude 3.5 Sonnet (October 2024 version). GPT-4 hit 55% first-attempt accuracy, Claude managed 60%, and GPT-5.6 reached 74%. The gap mainly shows up in complex DP problems — on simple ones, all three are roughly comparable.
Claude has one clear advantage though: its code comments are more detailed and its variable naming is more semantic. GPT-5.6’s code is more concise, sometimes reading like something a competitive programmer would write — short but requiring mental backfilling. GPT-4 sits somewhere in the middle, balanced across the board.
Speed-wise, GPT-5.6 is about 40% faster than GPT-4. Running all 47 problems plus fixes took me under 20 minutes. The same workload on GPT-4 would’ve been 30+ minutes. That matters when you’re batch-processing problems — nobody wants to stare at a progress bar.
Where This Is Actually Useful
After all this testing, my take is that GPT-5.6 can handle DP problems covering most interview and day-to-day development scenarios. Linear DP and knapsack problems? Pretty trustworthy. Complex DP? Needs human review and fixes, but it’s a solid starting point that saves real time.
My current problem-solving workflow looks like this: think independently for 15 minutes, and if I’m stuck, throw it at GPT-5.6 to see its analysis process. Then I close the solution and write it myself. This approach is way more efficient than my old habit of banging my head against a problem for an hour before checking the solution. The model’s reasoning unfolds step by step, which helps me understand why a particular state definition makes sense — and that’s exactly the hardest part of DP.
A word of caution though: don’t take AI-generated code directly into an interview or competition. First, it might be wrong. Second, if you can’t explain the logic clearly, the interviewer will spot it immediately. A friend of mine prepped for a ByteDance interview earlier this year using ChatGPT for DP problems. When the interviewer asked about the motivation behind his state definition, he froze. It was… awkward. Very awkward.
Use it as a learning tool and an assistant. Not a replacement.
What about you? Have you used GPT or Claude for algorithm problems? Which categories does AI handle best, and where does it completely fall apart? Drop a comment — I’m planning to test graph algorithms next. I’ve been revisiting Tarjan’s algorithm lately, and I have a strong suspicion AI is going to struggle with that one. Let me know if there are specific problem types you want me to include.
GPT5.6 #DynamicProgramming #AlgorithmBenchmark #APITesting #LeetCode #AICoding
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.