Why GPT-4 Still Can't Do Maths Properly (And What I Built to Fix It)

Last year, a friend doing quant trading in London pinged me at 2 AM with a probability problem. He'd run it through GPT-4 three times. Got three different answers. All wrong.

I remember thinking—hang on, isn't this thing supposed to be a maths whiz? It was only later that I figured out the real issue. The model wasn't stupid. It was just trying to do everything in its head.

Bit like asking a history professor to multiply 17×24 without pen and paper. They'll get there eventually, but it'll be painful to watch. Most of us would've already opened the calculator app. That's exactly the problem with AI—it doesn't lack reasoning ability, it just never learnt to "pick up the calculator".

Where Chain of Thought Actually Fails

When Chain of Thought first appeared—late 2022, if I remember correctly, with that Google paper everyone was sharing—it genuinely felt like a breakthrough. The idea's straightforward enough: prompt the model to show its working, like a teacher asking "walk me through your thinking."

But here's the catch.

It's talking through the problem, not actually computing anything.

Last March, I was working on a financial calculation project. Simple on paper: given a set of cash flows, calculate the IRR. If you've done this before, you know IRR involves solving higher-order equations—no analytical solution exists. You need Newton's method or bisection, iterative approximation. Being lazy, I tried using Chain of Thought with GPT-4.

It spat out pages of algebraic derivations. Looked gorgeous. Every step perfectly formatted. The result was off by over 15%. And the confidence—my god, the confidence. You'd think it had just derived the theory of everything.

I later stumbled across a GitHub issue on LangChain that put it perfectly: LLMs are pattern matchers, not calculators. Symbolic reasoning? Decent enough. Precise numerical computation—especially iteration, matrix operations, Monte Carlo stuff—pure token generation is basically rolling dice.

Actually, let me correct that. It's worse than rolling dice.

In December 2023, I ran a test: 100 mixed maths problems using Chain of Thought. Any question involving floating-point calculations had a 34% error rate. But the same problems, if I just had the model generate the solution approach and code, then ran it through Python—error rate dropped to 6%. That's not a small gap.

Google DeepMind confirmed this in early 2024. On GSM8K, pure Chain of Thought hit about 78% accuracy. Offload the computation to external tools? Jumped past 92%. The model didn't get smarter—it just delegated to the right team member.

How to Embed Code Execution in Reasoning Chains

The idea's almost embarrassingly simple: let the model write code → execute it → feed results back into context.

This mirrors how we actually think. Can't solve an equation analytically? Run a programme. Unsure about a probability distribution? Simulate it ten thousand times.

My current approach has three stages.

Step 1: Deciding When to Write Code

Not every step needs code execution. Simple arithmetic, logical deduction—the model's native abilities handle those fine. But hit any of these, and you've got to switch to code mode:

Precise floating-point calculations
Iterative solving, optimisation problems
Matrix operations, statistical analysis
Anything needing numpy, scipy, or similar libraries
Random simulations

I added a decision rule to my prompt. Looks roughly like this:

If the current step involves the following calculations, output a CODE marker:

- Precise numerical computation

- Equation solving, optimisation

- Matrix/vector operations

- Iterations exceeding 3 steps

- Random number generation

Tested this rule on 50 problems—96% accuracy. Missed twice, both times the model thought an expression looked simple enough to brute-force. It got the right answer eventually, but took the scenic route.

I reckon this decision logic could be better. Right now it's hardcoded rules. Might train a small model specifically for this judgement call later. But that's another project.

Step 2: Code Generation and Sandbox Execution

This is where I collected the most scars.

I initially ran model-generated code locally using exec(). Massive mistake. Once, the model produced os.system('rm -rf /')—thank god I was running in Docker, otherwise I'd be updating my CV right now. Learned my lesson fast: sandbox everything.

Current setup uses Docker containers with resource limits:

Image: python:3.11-slim (upgraded from 3.10 last year)
Memory cap: 256MB
CPU quota: 50% (cpuquota=50000, cpuperiod=100000)
Network: completely disabled
Timeout: 10 seconds, then kill it dead

On the tooling side—Docker SDK for Python, honestly, could be better. The API design feels deliberately obtuse sometimes. I've been eyeing gVisor and Firecracker for lighter-weight sandboxing, but haven't tested them properly yet.

Over six months, roughly 2,300 code executions. Zero security incidents. One instance where the model wrote an infinite loop—while abs(error) > 1e-6 but forgot to update error inside the loop—hit the 10-second timeout and got terminated.

The error log looked like this:


Error: Execution timeout after 10s
Code snippet:
 while abs(error) > 1e-6:
 x_new = x - f(x)/df(x)
 # Missing: error = abs(x_new - x)

Fed that log back to GPT-4. Fixed it instantly.

Step 3: Feeding Results Back

Once the code runs, the output goes back into the conversation context. Crucial detail: send both the code and the execution output. This lets the model see what it wrote and what actually happened. If things went sideways, it can self-correct.

My current prompt template:


You are a maths problem-solving assistant. 
When you encounter complex calculations, output Python code within <CODE> tags.
The code will be executed in a sandbox, and results returned to you.

Result from previous code execution:
<RESULT>
{execution_output}
</RESULT>

Continue reasoning based on these results, or output the final answer.

Test data: 30 MMLU maths questions. With code execution, accuracy jumped 18 percentage points (72% → 90%). Probability and statistics questions improved the most, numerical computation close behind.

A Complete Walkthrough

Let's use that problem my friend was losing sleep over:

A trading strategy's daily returns follow a normal distribution: mean 0.05%, standard deviation 1.2%. After 252 consecutive trading days, what's the probability that cumulative return exceeds 15%?

Pure Chain of Thought? Forget it. This involves sums of normal distributions, high-dimensional integration—you need numerical methods or Monte Carlo.

Here's the code-execution flow:

The model analyses the problem, decides on Monte Carlo. Generates this:


import numpy as np

np.random.seed(42)
n_simulations = 100_000
n_days = 252
daily_mean = 0.0005
daily_std = 0.012

daily_returns = np.random.normal(
 daily_mean, daily_std, 
 (n_simulations, n_days)
)
cumulative = np.sum(daily_returns, axis=1)

prob = np.mean(cumulative > 0.15)
std_error = np.std(cumulative > 0.15) / np.sqrt(n_simulations)

print(f"Probability: {prob:.4f}")
print(f"Standard error: {std_error:.6f}")

Sandbox output:


Probability: 0.0823
Standard error: 0.000869

The model then explains:

Based on 100,000 simulations, the probability of exceeding 15% cumulative return over 252 trading days is approximately 8.23% (standard error ~0.087%). This makes intuitive sense: daily mean 0.05% annualises to roughly 12.6%, but with substantial volatility. Exceeding 15% isn't a high-probability event.

The model handles understanding the problem, designing the approach, and interpreting results. The computation? That's numpy's job. Everyone plays to their strengths.

What I've Tripped Over

Been at this for the better part of a year. Some recurring headaches:

Buggy generated code. This is the most common issue. Beyond that infinite loop I mentioned, there was one time the model wrote a Newton's method root-finder but inverted the iteration formula. Results diverged spectacularly. Now I force it to output intermediate variable values for the first few iterations—makes debugging much easier.

Floating-point precision. Models keep writing 0.1 + 0.2 == 0.3. I've added a spec to the prompt: floating-point comparisons must use np.isclose() or tolerance checks like abs(a-b) < 1e-9.

Missing dependencies. Sometimes models import libraries I haven't installed in the sandbox. My Docker image now comes pre-loaded with numpy 1.24.3, scipy 1.10.1, sympy 1.11, pandas 2.0.1, statsmodels 0.14.0. Covers most mathematical needs. For anything else, the ImportError gets returned, and the model usually finds an alternative.

Current limitations:

Container startup overhead per execution, roughly 200-400ms. Not great for latency-sensitive applications
Complex problems need multiple "write code → execute → feedback" cycles. Token consumption adds up fast. Last month, an optimisation problem took 6 rounds, burnt through 15K tokens
Models sometimes get overconfident and brute-force calculations they should code. Requires constant prompt tuning

What I'm Playing With Next

Two directions.

Automatic tool selection. Not just Python—let the model choose between Wolfram Alpha API, sympy for symbolic computation, even SQL for certain problem types. "Which tool to use" becomes part of the reasoning process. Still experimental, currently hovering around 70% accuracy. Not good enough yet.

Code caching. Many maths problems share computation patterns—Monte Carlo simulations, Newton's method, matrix decomposition. Cache code templates, match similar problems, just tweak parameters. Saves tokens, sure, but mostly I want to stop the model regenerating the same boilerplate every single time.

Both are early-stage. I'll write more when there's something solid to share.

Key Takeaways

Chain of Thought is brilliant for reasoning, rubbish at computation
Offload actual number-crunching to proper execution environments
Sandbox everything—seriously, don't learn this the hard way
Feed both code and output back to the model for self-correction
The 80/20 win: just getting the model to recognise when it should code solves most problems

From Chain of Thought to code execution, this isn't some theoretical breakthrough. It's engineering pragmatism: language models do understanding, planning, and explanation. Computation tools do precise calculation. From what I've seen, plenty of AI agent teams are heading the same direction.

If you're building something similar—or have better sandboxing approaches than Docker—drop a comment. I'd especially love to hear about isolation solutions lighter than 300ms startup times. That delay genuinely annoys me.

Oh, and one last thing. If you're reproducing this setup, tighten those Docker socket permissions. Don't mount /var/run/docker.sock carelessly into containers. Don't ask how I know.

What's your experience with LLMs and mathematical reasoning? Have you tried code execution workflows? Let's chat in the comments.

ai #machinelearning #python #programming #dataengineering

Why GPT-4 Still Can't Do Maths Properly (And What I Built to Fix It)

Why GPT-4 Still Can't Do Maths Properly (And What I Built to Fix It)

Where Chain of Thought Actually Fails

How to Embed Code Execution in Reasoning Chains

Step 1: Deciding When to Write Code

Step 2: Code Generation and Sandbox Execution

Step 3: Feeding Results Back

A Complete Walkthrough

What I've Tripped Over

What I'm Playing With Next

Key Takeaways

ai #machinelearning #python #programming #dataengineering

Cael Lee

Ready to get started?