Can Large Language Models replace mathematicians?
Exploring the mathematical thinking of humans and LLMs
Around the 16th century, aristocrats had a strange way of determining who the king of the hill was: solving algebraic equations. One particular problem was the following: find two numbers whose sum and product are equal to two; that is, find numbers x and y such that

x + y = 2 and xy = 2

hold.¹
If we express y from x + y = 2 and substitute it back into xy = 2, then we immediately obtain the quadratic equation

x² - 2x + 2 = 0.
Quadratic equations are easy to solve; in fact, a general formula has been known for centuries: the solutions of ax² + bx + c = 0 are

x = (-b ± √(b² - 4ac)) / (2a).
However, upon applying the formula to x² - 2x + 2 = 0, we hit a snag. As b² - 4ac = -4, the solutions should be

x = (2 ± √(-4)) / 2 = 1 ± √(-1).
But what is the square root of -1? Taking the square root of a negative number was as blasphemous as talking about irrational numbers in ancient Greece. You, the reader, might know how to work with this, but 16th-century aristocrats and mathematicians did not.
At least, not until someone pretended that √(-1) exists and kept crunching the computations. By denoting i = √(-1) and using i² = -1 with the famous algebraic identity (a - b)(a + b) = a² - b², we obtain that indeed

(1 + i) + (1 - i) = 2 and (1 + i)(1 - i) = 1 - i² = 2;
that is, our solutions satisfy the original equations.
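If you want a quick sanity check of these solutions, here is one in Python (my arbitrary choice of language for such asides; the built-in complex type writes i as 1j):

```python
# Verify that x = 1 + i and y = 1 - i solve the original problem.
x = 1 + 1j
y = 1 - 1j

print(x + y)  # (2+0j): the sum is 2
print(x * y)  # (2+0j): the product is 2
```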
So, what is i?
Taking the conceptual leap from “you can’t take the square root of a negative number” to “watch me do it anyway” resulted in the discovery of complex numbers. (Or the invention, depending on your philosophical alignment.)
Complex numbers are everywhere in modern mathematics. Without them, we can’t properly factorize matrices, analyze differential equations, or train deep neural networks.
Still, why are we talking about complex numbers?
Because with every new foundational model release, the discussion of whether or not AI has replaced mathematicians flares up. Undoubtedly, the problem-solving and reasoning capabilities are seriously improving with each release, but are we using the right metrics? Is solving close-ended math problems the right way of gauging the performance of a model?

Are LLMs capable of such brilliant ingenuity? If not, will they ever be? In general, is genuine invention in the scope of artificial intelligence?
I don’t know. I’m just a mathematician with a slightly above-average understanding of AI. To avoid epistemic trespassing—that is, thinking that I’m an expert in AI just because I’m an expert in an adjacent field—I won’t tell you what AI is and is not capable of.
Instead, I’ll show you what mathematics is and how it is done, perform a couple of LLM experiments, and let you draw your own conclusion. (If you decide to share it in a comment, even better.)
Think of this post as an exploration instead of the ultimate answer to the question posed by the title.
Let’s go!

The flow of mathematics
We started with complex numbers, and we’ll continue with them.
The curious number i we’ve discovered is defined by the expression i² = -1. As no real number exists with such a property, i is called the imaginary unit. The term “imaginary” was coined by Descartes, intended to be an insult. Little did he know that it was the ultimate compliment, a testament to the power of human ingenuity.
When you encounter i for the first time, it can feel quite mind-bending.
Recall that the solutions for our original problem were 1 + i and 1 - i. In general, expressions of the form a + bi are called complex numbers, and we can work with them just as we work with two-term algebraic expressions:

(a + bi) + (c + di) = (a + c) + (b + d)i,
(a + bi)(c + di) = (ac - bd) + (ad + bc)i.
Even though the algebraic rules fully describe complex numbers, we still don't have a way of imagining them.
So, let's take another conceptual leap and map the expression a + bi to the point (a, b) on the Euclidean plane. In other words, translate algebra to geometry.
Once again, we gain a bundle of new tools by taking the conceptual leap from algebra to geometry: by treating z = a + bi as a planar vector, we can measure its angle (called argument, denoted by ϕ) and its magnitude (called modulus, denoted by |z|). Visual thinking is an integral part of mathematics.
But wait, there's more!
From this geometric viewpoint, we can easily obtain a new representation of complex numbers. To see why, recall how sin and cos are defined in terms of the unit circle.
Based on this, we obtain the so-called polar form:

z = a + bi = |z| (cos ϕ + i sin ϕ).
In mathematics, finding novel representations is often the key to solving problems and gaining a deeper understanding. For instance, writing two complex numbers z₁ and z₂ in polar form and using the trigonometric addition formulas, we obtain that

z₁z₂ = |z₁||z₂| (cos(ϕ₁ + ϕ₂) + i sin(ϕ₁ + ϕ₂)).
In other words, for complex numbers, multiplication is a rotation plus a scaling. You can verify it in the example of i(1 + i) = -1 + i.
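Here is a small numerical check of that claim, again in Python (an optional aside, not part of the original argument): the cmath module reports the modulus and argument, and multiplying by i adds π/2 to the argument while leaving the modulus untouched.

```python
import cmath

z = 1 + 1j   # modulus sqrt(2), argument pi/4
w = 1j * z   # the product i(1 + i)

print(w)               # (-1+1j)
print(cmath.polar(z))  # (1.414..., 0.785...): |z| = sqrt(2), arg = pi/4
print(cmath.polar(w))  # (1.414..., 2.356...): same modulus, argument shifted by pi/2
```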
This example highlights two key ideas of mathematical thinking:
sometimes, it’s easier to draw a picture,
and finding alternative representations can be the key to discovery.
Let’s step back and sum up what we’ve done so far. Our starting point was a mathematical problem: finding two numbers whose sum and product equal 2.
Because there are no solutions amongst real numbers, we built an entirely new toolset: complex numbers. Although complex numbers were introduced through their algebraic properties, we built our mental models by going from a + bi to (a, b); that is, from algebra to geometry.
Although making the jump is not needed in a deductive sense, this step provides an intellectual shortcut to results like the polar representation.
The example of complex numbers sheds light on the fact that mathematics is done by
proposing a problem,
building tools,
then solving the problem with said tools.
According to the legendary Timothy Gowers, mathematicians are classified as either problem-solvers, or theory-builders. Neither of these steps is trivial, and mathematical progress is not merely the result of churning out formal proofs.
To continue our exploration into mathematical thinking, let’s follow the problem-solving and theory-building trail. Translating Gowers’ classification into cognitive processes,
problem-solving is reasoning,
and theory-building is abstraction.
We’ll put both of them under the magnifying glass.
Reasoning
Let’s start with an example. Consider the following problem: given a right triangle with a hypotenuse of length 4 and one leg of length 2, what is the length of the altitude to the hypotenuse?
Since it is a geometric problem, we visualize the triangle and assign names to the vertices.
The problem is simple, but it’s instructive to go through it and formalize our thinking step by step. So, here’s the solution.
As we have a right triangle, the Pythagorean theorem gives that the length of BC is

BC = √(AB² - AC²) = √(4² - 2²) = √12 = 2√3.
Now, as the triangles △ABC and △AMC have the same angles, they are similar. Thus, their corresponding sides are proportional: CM/AC = BC/AB, from which CM = √3 follows.
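A quick numeric check, once more in Python and purely for illustration, confirms the value:

```python
import math

AB, AC = 4, 2                  # hypotenuse and the known leg
BC = math.sqrt(AB**2 - AC**2)  # Pythagorean theorem: 2*sqrt(3)
CM = AC * BC / AB              # similar triangles: CM/AC = BC/AB

print(BC, CM, math.sqrt(3))    # 3.464..., 1.732..., 1.732...
```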
Now, it’s time to formalize, and we shall do this with mathematical logic. Here’s the gist: propositions like P: “the △ABC triangle is a right triangle“ are either true or false, and theorems are P→Q type propositions, which are true if Q is true or both P and Q are false.
(If you are unfamiliar with formal logic, check out the first post of The Palindrome, where I explain it in detail.)
The single most important rule of deduction is modus ponens, which is the formalization of the following process:
If P, then Q.
P is true.
Therefore, Q is true.
In mathematical symbols, modus ponens is written as P→Q, P ⊢ Q.
So! Formally, we have the propositions

P: “△ABC is a right triangle with legs AC, BC and hypotenuse AB,”
Q: “AC² + BC² = AB².”
The proposition P→Q is known as the Pythagorean theorem, and the modus ponens gives that since P and P→Q are true, Q is true as well.
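If you like seeing such rules verified mechanically, here is a brute-force truth-table check in Python (an illustrative aside of mine, not part of the original text) showing that modus ponens never leads from true premises to a false conclusion:

```python
from itertools import product

def implies(p, q):
    # P -> Q is false only when P is true and Q is false
    return (not p) or q

# Modus ponens is valid: whenever P -> Q and P are both true, so is Q.
assert all(q for p, q in product([True, False], repeat=2) if implies(p, q) and p)
print("modus ponens holds on every truth assignment")
```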
You get the gist. The next step establishes the similarity of the ABC and AMC triangles, and then the final step leverages the similarity to find the length of CM via the ratio of corresponding sides.
Once an idea is formed, formal reasoning is straightforward. Finding the right path is the real problem!
Abstraction
Let’s revisit our tale of complex numbers and focus on the single step from the algebraic expression a + bi to the tuple (a, b).
This is called abstraction: “a process where general rules and concepts are derived from the use and classifying of specific examples,” according to Wikipedia. We take away the concrete — that is, the + symbol and i — and retain only the structure (a, b).
Abstraction is key to forming mathematical concepts. For instance, we can also think about a planar vector as the tuple (a, b), where the coordinates describe its direction and length. We’ve seen this when discussing complex numbers.
Another example: polynomials with real coefficients, that is, functions of the form

p(x) = a₀ + a₁x + a₂x² + … + aₙxⁿ.
Although polynomials are functions, we can abstract the specifics and encode each polynomial in a tuple by

p ↦ (a₀, a₁, a₂, …, aₙ).
What’s the common point of complex numbers, planar vectors, and polynomials? The first is an algebraic object, the second one is geometric, while the third is analytic. We can represent all three with tuples, sure. But is there a pattern?
Yes. The “general rules and concepts” part, as Wikipedia’s definition of abstraction states, is that you can add and scale them. For complex numbers, it’s

(a + bi) + (c + di) = (a + c) + (b + d)i and λ(a + bi) = λa + λbi;
for planar vectors, it’s the parallelogram rule

(a, b) + (c, d) = (a + c, b + d) and λ(a, b) = (λa, λb);
and for polynomials, it’s

(a₀, a₁, …, aₙ) + (b₀, b₁, …, bₙ) = (a₀ + b₀, a₁ + b₁, …, aₙ + bₙ) and λ(a₀, a₁, …, aₙ) = (λa₀, λa₁, …, λaₙ).
Thus, the concept of vector spaces is born! When taking away all the specifics and leaving only the general, we obtain that a vector space is
a set V (whose elements are called vectors),
an operation + that “behaves nicely” (called addition),
and a scaling operation · that also “behaves nicely” (called scalar multiplication),
compacted into the mathematical structure (V, +, ·).
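To make the shared structure tangible, here is a small Python sketch of my own (the tuple encodings are assumptions for illustration): complex numbers, planar vectors, and polynomials are all stored as tuples, and a single pair of functions adds and scales them.

```python
# One addition rule and one scaling rule serve complex numbers,
# planar vectors, and polynomials alike, once each is encoded as a tuple.

def add(u, v):
    return tuple(a + b for a, b in zip(u, v))

def scale(lam, u):
    return tuple(lam * a for a in u)

z = (1, 1)     # the complex number 1 + i
w = (1, -1)    # the complex number 1 - i
p = (2, 0, 1)  # the polynomial 2 + x^2
q = (0, 3, 0)  # the polynomial 3x

print(add(z, w))    # (2, 0): (1 + i) + (1 - i) = 2
print(scale(2, z))  # (2, 2): 2(1 + i) = 2 + 2i
print(add(p, q))    # (2, 3, 1): 2 + 3x + x^2
```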
Abstraction is all over mathematics: concepts are formed by distilling models into a general set of rules, called structures.
Another example is the notion of functions, which, considering that mathematics is thousands of years old, is a relatively young concept. Generalizing from expressions like
“f(x) = 2ˣ,”
“the sine of the angle θ equals the ratio of the opposite side and the hypotenuse,”
or “a rotation of a point by angle θ around the origin is the new point at the same distance from the origin, turned counterclockwise by angle θ,”
functions turn out to be subsets of the Cartesian product, that is, collections of tuples

(x, f(x)) ∈ X × Y,
represented by dots and arrows between them:
Dots and arrows show the skeleton of functions. Given the concrete class of functions you are working with, they have more meat on their bones, whether you are talking about continuous functions or isomorphisms or whatever else, but the skeleton is always the same: dots and arrows.
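In code, the same skeleton is simply a lookup table; here is a tiny Python illustration (a hypothetical example of mine, not from the post) of a function as a plain set of input-output pairs:

```python
# A function from {1, 2, 3} to {"a", "b"}, given as a set of (input, output) pairs.
f = {(1, "a"), (2, "b"), (3, "a")}

# It is a valid function because every input appears exactly once.
inputs = [x for x, _ in f]
assert len(inputs) == len(set(inputs))

def apply(f, x):
    # Follow the arrow starting at x.
    return next(y for (x0, y) in f if x0 == x)

print(apply(f, 2))  # "b"
```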
So, can LLMs replace mathematicians? Now that we understand key elements of mathematical thinking, it's time to turn our attention towards them.
Testing the LLMs
If you think that language models can replace mathematicians, I have some bad news for you: due to how LLMs work, it’s extremely unlikely that they can discover truly new knowledge.
LLMs are essentially “statistical bullshit machines”, conjuring their response by predicting the most likely tokens that follow the prompt. If you want to go into the details, I recommend my friend Alejandro’s excellent post.
On the other hand, LLMs are crazy good at digging up information from their training dataset. Even though you (probably) won’t be able to push the boundaries of knowledge only via prompting, you can certainly find a couple of pieces to fit the puzzle on your way toward discovery.
If you are coming from the computer science side, think of solving a mathematical problem as creating an app or a framework. Even though your goal is novel, you’ll write a bunch of functions that perform common tasks, like parsing input, cleaning data, logging a user in, etc.
To see how LLMs perform, we’ll do a couple of tests. Mirroring our scenic trip in mathematical thinking, we’ll test how OpenAI’s GPT-4o fares in
solving open-ended problems,
reasoning,
visual reasoning,
and abstraction.
(I did not intend to evaluate the mathematical capabilities of an LLM with academic rigor; instead, I just designed a couple of prompts to probe for potential flaws in language models. Keep this in mind as you read, and do your own research before reaching a conclusion.)
Let’s go!
Open-ended problems
Mathematical problems are open-ended and posed as questions instead of claims. As the legendary mathematician Bernhard Riemann once said,
“If only I had the theorems! Then I should find the proofs easily enough.”
LLMs are not particularly good at open-ended problems. For instance, here’s a question: can the four-dimensional unit sphere

{(x₁, x₂, x₃, x₄) ∈ ℝ⁴ : x₁² + x₂² + x₃² + x₄² = 1}

be written as the disjoint union of “two-dimensional unit spheres”, that is, circles

{(x₁, x₂) ∈ ℝ² : x₁² + x₂² = 1}

embedded in four dimensions?
The answer is either yes or no, and when formulated as a proposition, ChatGPT-4o is happy to prove both versions. Here’s the one where I ask “Show that the unit sphere in four-dimensions is the disjoint union of two-dimensional circles.”
(Yes, I know that there is a typo in “four-dimensions”.)

Unfortunately, ChatGPT also shows the opposite when asked to “Show that the unit sphere in four-dimensions cannot be the disjoint union of two-dimensional circles.”

When formulated as a question, ChatGPT-4o argues that the answer is negative; that is, the model argues that the unit sphere in four dimensions cannot be the union of circles.

I’ve read all three answers, and two of them are complete bullshit, the kind that students usually give to hide their cluelessness.
The true answer is affirmative, and this is known as the Hopf fibration. Check it out.

(If you are interested, check out the video of Richard Behiel, or this interactive visualization of Nicolas Garcia Belmonte.)
Reasoning
Because LLMs generate replies token by token on a probabilistic basis, testing reasoning capabilities is hard. First, language models don’t perform reasoning in the formal sense. Second, answers to our questions might have been used to train the model.
Even if a model can correctly give a proof of, say, the Pythagorean theorem, it’s not a display of reasoning ability; it is the memorization of training data.
In lieu of a mathematical problem that is 1) not publicly solved, and to which 2) I know the correct solution, I have prompted my friend to come up with some ideas. He suggested using context-free grammars and asking a model if a given string is a valid sentence in a particular grammar.

Let’s unravel the idea. A context-free grammar is a set of rules to produce strings like “abbaab“. For instance, the rules
S → Saa,
S → Sbb,
S → ε,
where the third rule “S → ε” means that we terminate string generation, produce strings of a-s and b-s in which every run of identical letters has even length.
You can visualize a context-free grammar as a graph with labeled edges, and each valid string as a walk from the starting symbol S to the symbol ε.
Here, the string “aabbbbbbaa“ is valid, while “abbbbaa” is invalid in this grammar. Checking the validity can be thought of as a form of reasoning, finding the chain of operations used to construct the string.
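For this particular grammar, validity is easy to check mechanically. Here is a minimal Python sketch (my own helper, not something from the post) exploiting the fact that every valid string is a concatenation of “aa” and “bb” blocks:

```python
def is_valid(s: str) -> bool:
    # Membership test for the grammar S -> Saa | Sbb | epsilon:
    # s must be a concatenation of "aa" and "bb" blocks.
    if len(s) % 2 != 0:
        return False
    return all(s[i] in "ab" and s[i] == s[i + 1] for i in range(0, len(s), 2))

print(is_valid("aabbbbbbaa"))  # True
print(is_valid("abbbbaa"))     # False
print(is_valid("abaaaabb"))    # False
```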
So, can ChatGPT-4o tell? No. Even though it discovered that each valid string
must have an even number of symbols,
must end with either “aa” or “bb”,
it concludes that “abbbbaa”, “abaaaabb”, and “aabbbbbbaa“ are invalid. (The first two are indeed invalid, but the final one is valid.)

It’s instructive to check out the full conversation here.
Visual reasoning
ChatGPT did not score perfectly on the reasoning test, but what about visual reasoning?
Do you recall the simple geometry problem that we discussed earlier? We were given a right triangle and tasked to find the altitude to the hypotenuse.
Can ChatGPT-4o solve the problem? Sort of. Correct answer, wrong reasoning.

Even though it did manage to find the correct value, it was obtained through guessing. In the very first step, it assumed that M is the midpoint of AB, which is false, leading to contradiction after contradiction. Ultimately, it concluded that AM must equal 1 (despite M being assumed to be the midpoint of AB with AB = 4); thus, CM = √3.
(Watching this answer unfold in real time was hilarious, as the model stumbled from contradiction to contradiction.)
Abstraction
Abstraction, the formation of concepts, is essential in math. While problem-solvers lead the charge to push the boundaries of knowledge, theory-builders make it possible by clearing up the mess left by the trailblazers. Think of it as the refactoring of math.
So, how good are LLMs at abstraction? I have spent quite some time conjuring up prompts to test this out. I can’t just drill the model on the abstract concept of vector spaces or functions because it would just echo the math textbooks. To truly test abstraction capabilities, we have to come up with something it couldn’t have seen.
That’s tough, and I did not succeed completely. Let’s see what I came up with!
First, I have considered one of the most basic steps of abstraction that we learn in elementary school: replacing numbers with symbols. (In fact, representing quantities with numbers is a step of abstraction, too. But let’s not go back that far.)
Once more, I have asked ChatGPT-4o a tricky question: “Let the symbol 1 denote the quantity 2, and the symbol 2 denote the quantity 1. What’s 2 + 2?“
The correct answer is 1. (Not the regular 1, but the one that denotes 2. You know.) ChatGPT-4o saw right through the trick and gave a correct answer.

Let’s dial up the difficulty a notch and use rare symbols with huge numbers.
“Let the symbol ⁂ denote the quantity 75685831245852, and the symbol ‽ denote the quantity 52389589236789532523523. What’s ⁂ + ‽?”
The answer is 52389589312475363769375, if you are curious. But what does ChatGPT-4o think about it? Its reasoning is correct, but it fails at performing the addition.
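For the record, exact arbitrary-precision arithmetic makes the check a one-liner in Python (my own aside, not part of the experiment):

```python
# Python integers have arbitrary precision, so the addition is exact.
print(75685831245852 + 52389589236789532523523)
# 52389589312475363769375
```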

My third attempt involved the Klingon language. For this test, I have translated the following “if A, then B” propositions to Klingon:
If it’s raining, then the sidewalk is wet. -> SISchugh vaj SaqmeH He naghDaq bIQ tu’lu’.
If the sidewalk is wet, it is slippery. -> SaqmeH He bIQqu’chugh vaj ‘oy’be’Ha’.
If an integer is divisible by four, it is divisible by two. -> mI' loS boqHa'chugh, vaj cha' boqHa'bej.
(To be extra sure, I have done the translations in temporary mode, and I don’t have any references to Klingon in my entire chat history.)
Using these sentences, I have asked ChatGPT-4o to find the general pattern:
“What’s the pattern in the following sentences? 1. SISchugh vaj SaqmeH He naghDaq bIQ tu’lu’. 2. SaqmeH He bIQqu’chugh vaj’ oy’be’Ha’. 3. loS boqHa'chugh mI' vaj cha' boqHa'.”
The answer was correct!

Of course, we can argue that the three tests we have performed so far are just pattern matching, not really abstraction. Sure, Klingon is rare, but the algorithm could have seen this written down somewhere.
So, I went further in the next step and used Anthropic’s claude-3.7-sonnet to come up with a completely new language and translate the above three sentences, then asked ChatGPT-4o the same question. Thus, the arcane language of Laithesh was born.
Thus, my next question was, “What’s the pattern in the following sentences? 1. Vos meita palu, ko shedar pelu noth. 2. Vos shedar pelu noth, ko pelu shavri. 3. Vos shumar enta kefar, ko enta devar.”
Did ChatGPT figure it out? Sort of. It did catch the “vos A, ko B” structure, and in a temporary chat instance, it even suggested that it means “if A, then B.” However, I could not reproduce it, so feel free to try this one out.
In the chat I’ll share, it caught the pattern but got stuck on the fact that the second sentence starts as the first one ends.

It concludes the following:
Thus, the core pattern is:
• Each sentence is built such that the "ko" phrase of one becomes the "vos" phrase of the next sentence, possibly with slight edits or reductions.
Example in English (analogous structure):
1. You bring water, and carry the basket.
2. You carry the basket, and lift the stones.
3. You lift the stones, and pile them high.
This is definitely not the pattern, as it already fails between the second and third sentences.
Summary
Even though I’m not a huge fan of the Star Wars franchise, there’s a quote that I often refer to:
“Only a Sith deals in absolutes.” — Obi-Wan Kenobi
There’s no ultimate answer to the question posed by the title of this post: Can Large Language Models replace mathematicians?
Claiming this with complete certainty is shallow hype, but denying the possibilities is just technophobia.
Yes, LLMs are statistical bullshit machines, inherently weak at
open-ended problems,
reasoning,
and abstraction.
On the other hand, they are insanely good at digging up information from the infinite source that is the internet, and that is invaluable in research and problem-solving.
I recall that when writing the magnum opus of my PhD research, I got stuck on a single step for months. Eventually, I found an arcane inequality in a decades-old publication from an entirely different topic that gave me the final piece of the puzzle, completing about a year of work in a single step. ChatGPT would probably have given me that in a couple of prompts.
So, my answer: no, LLMs won’t replace mathematicians, but they’ll be one of their most powerful tools, accelerating research in the years to come.
¹ The original problem was slightly different; I have changed the numbers for simplicity.