2025-12-12

The Social Proof: How Mathematical Practice is Evolving

TL;DR I interviewed 30 mathematicians about how AI is entering their research practice. The near-term story is not “AI proves new theorems”, but delegation of friction.

What’s real: literature navigation, drafting LaTeX/code, and occasional help on technical lemmas.
Main friction: hallucinated citations + chat-style iteration makes for a poor UX.
Big picture: the near future looks like infrastructure for search and understanding, not an “ultimate prover”.

Here is a familiar scene from the modern “AI for Math” narrative. An AI evangelists (usually from OpenAI it turns out) announces, with maximal volume on social media, that AGI is upon us. The latest AI model has just proved a result beyond the reach of any human mathematician!

A claim of AI Math supremacy (original tweet)

After this year IMO, where AI dominated the field¹, it was Erdös’ problems that became the next target. Paul Erdös, a legendary figure in mathematics with symbolic weight comparable to the IMO itself for the general audience, left behind a vast collection of questions that naturally attract AI systems positioned as aspiring mathematicians. Thomas Bloom, a mathematician at the University of Manchester, has compiled over a thousand such problems from across Erdös’ career on his website www.erdosproblems.com.

An open Erdös problem (note the disclaimer).

A few tweets later, the breakthrough evaporates when mathematicians point out the shortcut — usually that the result was already known, proved in the literature, or simply not that hard. This is exactly what happened in the episode now referred to as the ErdösGate. The AI had not proved anything new, but surfaced an existing solution in the literature.

This was then reframed as “if AI is not superhuman at proving, at least it is superhuman at literature search” (as expressed by Sebastian Bubeck here). While the discourse celebrates AI theorem-proving as if it were a superhuman leap, what is genuinely valuable to working mathematicians is often the much more modest ability to accurately retrieve known results².

In an even more striking twist, some of these resurfaced “solutions” predated the very formulation of the question (this is the case of the Forbidden Sidon subsets problem).

In other cases, the model had solved only a simplified variant, a problem at a level closer to competition mathematics than genuine research. And as if further clarification were needed, Thomas Bloom reminded everyone that an “Erdös problem” is not inherently difficult simply because the name Erdös appears in front of it.

A clarification on the mathematical significance of Erdös problems.

After watching enough of these cycles, it is hard not to grow weary of the relentless dramatization of AI’s influence on Mathematics³. Yet, as a mathematician at heart, I could not help wondering how actual, professional, working mathematicians perceive all of this. Setting aside the noise, what is the real impact of AI on their work? And how do they imagine the future of mathematics and science in an AI-mediated world?

Here is another scene, recounted by a mathematician, this time at a conference. Laptops are open; some attendees are replying to emails, others are taking notes - the usual for the last decades of scientific conference. But there is something new as well. Many members of the audience are quietly conversing with their preferred AI tools, most often Gemini or ChatGPT. While the next speaker fine-tunes his slides, a young postdoc uses the time to ask for a high-level introduction to the topic just presented.

Much of the public narrative remains fixated on AI as an “ultimate prover”. But proving is only a small part of what mathematicians actually do. Most of their time is spent searching, translating, testing ideas, talking to others, navigating the literature, and deciding what is even worth pursuing. What already matters, and what is clearly happening, is that AI is beginning to reshape these ordinary activities. This quieter transformation is far less spectacular than the headlines suggest, but it is precisely where the real impact lies.

Mapping the Everyday Life of Mathematical Research

One can always speculate, whether as an expert or an enthusiastic outsider, drawing on whatever theoretical framework seems convenient whether sociology of science, epistemology, or general intuition about how mathematicians think. But there is also a simpler and more reliable approach which is to ask the people who actually do the work⁴.

With that goal in mind, I spent the past year speaking with thirty mathematicians, spanning all levels of seniority and a wide variety of fields. From PhD students to emeritus professors, from probability theory to algebraic geometry. I wanted to see what mathematicians actually do when they are not driking coffee and sit down to work. How do they choose problems? How do they make progress? What tools do they rely on? And what role, if any, is AI beginning to play in this workflow?

This was not a pre-registered study, but I did design a questionnaire with standard user-discovery principles in mind, aiming to avoid leading questions and minimize bias⁵.

The Interview Questionnaire

1. Current Workflow & Tools
• Walk me through the last time you started working on a new problem. Where did you start, and what did you do next?
• What tools or platforms did you use during that process? (e.g., MathOverflow, Overleaf, your own scripts, libraries, provers)
• How do you keep track of partial results, abandoned ideas, or “almost proofs”?
• When was the last time you looked for similar theorems or related literature? What made it slow or tedious? Can you recall the last lemma you had to search for? How did you try to find it?
• What’s your process for checking whether a statement is already known or provable with current tools?

2. Pain Points in Collaboration & Discovery
• Tell me about the last time you collaborated with other mathematicians online. What worked well? What didn’t?
• Have you ever tried working on an open problem collaboratively (like a Polymath project)? What was frustrating about it?

3. Persistence & Knowledge Management
• How do you keep track of your mathematical notes, drafts, and results over time?
• What’s your process for revisiting old ideas or half-completed proofs?
• Can you think of a time when you lost track of a promising line of thought because it wasn’t documented in a good way?

4. AI & Computational Assistance
• Have you used AI or automated tools in your mathematical work before? Which ones? How did that go?
• Tell me about the last time you thought, “This is too tedious — a machine should be able to help here.” What were you working on?

The responses were then carefully analyzed and distilled, revealing patterns and recurring themes that go beyond individual anecdotes.

Mathematical discovery rarely begins with systematic searches or literature reviews. Instead, it often emerges from small sparks: a remark in a seminar, an unusual notation in a paper, or a casual hallway conversation. Many breakthroughs arise in informal, face-to-face settings, where ideas can be tested, rephrased, or challenged immediately. Julien Sabin, a junior professor at IRMAR, Université de Rennes, described this dynamic: “Two people at a blackboard can go much further than one alone,” highlighting the value of real-time discussion and iterative reasoning.

Ideas also require time to ripen. Olivier Dudas, at Institut de Mathématiques de Marseille, explained that insights often “marinate in your head for weeks or even months,” only revealing their significance much later. These experiences point to something essential about early-stage research: it depends as much on human reasoning than on the search for resonance, a connection or intuition sparked through conversation.

Navigating the Ocean of Literature

Even once a promising idea appears, the literature can be a treacherous landscape. Conceptual searches are often futile: equivalent ideas may be expressed in very different vocabulary, and minor notational differences can obscure connections. Lemmas and intermediate results are often buried deep in long papers with opaque references, creating a persistent fear of rediscovery.

A mathematician from ENS Paris working in control theory noted that literature search is rarely a matter of exact queries. “Try asking: find all the papers that deal with Cauchy–Lipschitz without continuity in time,” he explained. “What you really mean might also be Picard–Delécluse, or something phrased entirely differently.” In practice, the task is one of fuzzy concept matching, i.e. recognizing equivalence across shifting names, traditions, and notations.

Sabin provided a vivid example: a few years ago, a Japanese mathematician published a result close to his own. The references included a proof from the 1980s by Russian mathematicians, but the abstraction was so high that recognizing the connection was far from obvious.

Samuel Le Fourn, a number theorist at the Institut Fourier, described the emotional cost attached with literature search:

“There’s a constant fear of duplicating work that someone, somewhere, proved decades ago.”

Dudas has faced similar obstacles. He joined a lab focused on automorphic forms, far from his original area of expertise, representation theory. Working with a postdoc, they tried to extend a result specific to GLₙ, but it didn’t work. The postdoc suggested replacing it with the “theta correspondence,” yet keyword searches turned up nothing. Notational differences made the papers difficult to access: the π used for p-adic representations instead of the ρ familiar from classical representation theory. “You can’t just skim the document,” Dudas explained. “You have to translate it into your own language and reconstruct the argument yourself.” One of the essential references they eventually relied on was a paper they never would have discovered without already knowing exactly what to look for.

Ocean Literature — A researcher navigating the ocean of literature. (@studio_magga)

Even with better tools, the terrain remains irregular. Thomas Bitoun, professor at the Shanghai Institute for Mathematics and Interdisciplinary Sciences, noted that large models sometimes do succeed where traditional methods fail. Using Gemini’s Deep Research, he once located a correct reference for a conjectural statement attributed to Jean-Pierre Serre, which was something he had been unable to track down even after asking experts. But in the same response, the model also hallucinated another reference…

From Skepticism to Practical AI: today’s frustrations, but tomorrow’s research partner?

A noticeable shift occurred this year in the way mathematicians talk about AI. The initial reaction “leave me alone with this AI slop and nonsense” has softened. Not into enthusiasm, but into a pragmatic acknowledgment: some tasks genuinely benefit from it. Proving small technical lemmas (though in some domains of math more than in others), converting handwritten notes into LaTeX, producing small diagrams or simulation scripts, are now routinely delegated to AI.

Yet even here, the current AI tools remain far from frictionless. Thomas Leblé, CNRS researcher at Paris Cité, notes that the interface is often a hindrance: “The chat is just not meant to do math; each small modification or correction leads to a new generation of text that you have to carefully check…again”.

Technical Lemmas and the Grain of Mathematical Work

During the course of the last year, many mathematicians have noticed that AI systems are becoming surprisingly effective at a very specific task: proving technical lemmas. A technical lemma is a statement that sits inside a larger theory but carries little conceptual overhead. A graduate student from a neighboring field should be able to understand the question.

The experience, however, is often ambivalent. Models can suddenly unlock a key step, only to stumble moments later on something embarrassingly elementary, like dropping a term from one line to the next. This contrast between local brilliance and global fragility is striking, and it reveals both the promise and the limits of current systems.

This observation has motivated me to start a benchmark called Stella, for Set of Technical Lemmas.

Stella Benchmark — A benchmark of technical lemmas.

Most existing benchmarks are poorly aligned with actual research practice. They either target toy problems or aim unrealistically high. Technical lemmas occupy a much more meaningful middle ground as they are often accessible to LLMs, yet genuinely useful to mathematicians.

Formalization, Learning, and Understanding

This frustration has led some mathematicians to separate two very different uses of computation: conversational tools that assist with drafting, coding, or exploration, and formal proof assistants. Samuel Le Fourn, for instance, expressed genuine curiosity about proof assistants such as Lean. He even floated a radical idea. Temporarily pausing the production of new results in order to formally encode all existing knowledge.

Bitoun’s experience adds a complementary perspective. He distinguishes sharply between mathematical training and mathematical practice: “Having a PhD in math is not the same as knowing how to do math.” Some AI for Math teams include researchers with advanced mathematical backgrounds, he said, but they still lack the lived sense of how mathematicians actually work. He believes the most transformative tools will be those capable of writing and adapting SAGE or MAGMA code, allowing mathematicians to test hypotheses more easily and then offload the tedious lemmas no one wants to prove by hand to formal provers.

Yet even in that scenario, formal verification does not carry epistemic authority on its own. As Bitoun put it:

“If someone tells me something is true, I don’t believe it just because it’s a computer saying so. It still has to make sense within my own understanding of things.”

For him, formal provers are valuable insofar as they absorb mechanical burden, not because they replace mathematical judgment.

This distinction becomes particularly clear in teaching. In courses such as group theory, he has found that learning Lean can help students better grasp the structure of proofs, by forcing every logical step to be made explicit⁶. The tool does not confer understanding automatically, but it can train it.

Paul Bourgade, professor at the Courant Institute and editor at the Annals of Probability, sees another dimension: the steady rise of AI-generated submissions. Thankfully, the pattern is unmistakable currently as papers are either composed almost entirely of trivialities or, at the other extreme, wildly speculative claims backed by vague proofs. None are publishable, but the trend signals an impending difficulty. “One day, that won’t be the case anymore.” At some point, distinguishing machine-generated work from genuine submissions may become impossible.

This perspective shapes his vision for scientific publishing. In an ideal world, journals would not serve as reputational gatekeepers. Everything would live on arXiv, with human commentary and AI systems verifying proofs, ensuring consistency, and clarifying exposition; journals would certify correctness and nothing more.

Leblé framed the issue more broadly as a crisis of reading rather than writing:

“If we give people the means to write more papers, but not the means to read more, it makes no sense”.

The problem is not only whether a result is correct or incorrect—beyond formal verification—but whether it is usable, relevant, or genuinely interesting. What mathematicians need, he suggested, are tools that help answer a different question: what is worth reading, and what can I actually use?

Hugo Duminil-Copin, at IHES and Geneva, extended this line of thought to the structure of proofs themselves as mathematicians write in sharply different styles. Some provides exhaustive details while others presents an almost telegraphic chain of ideas. Reading across these stylistic gaps can be surprisingly hard. The prospect of AI tools that can adapt a proof into a different style, denser or more explicit, lighter or more conceptual,would in his view, be genuinely valuable.

Across these conversations, the emerging picture is of course not one of replacement, but of delegation. AI should be entering mathematics first through the pressure points: translation between notations, routine verification, exhaustive searches, and mechanical transformations of proofs. It is not approaching the core act of discovery—the human, social, disciplinary fabric described earlier—but rather the tasks that slow, distract, and exhaust the people engaged in it.

This shift, from replacement to empowerment, is exactly what Terence Tao envisions for the future of the discipline.

Terence Tao on the future of AI in math — Terence Tao on the changing nature of mathematical research.

One final note

For those interested in the broader debate surrounding these changes, two recent articles that capture the current moment well:

On the tension between hype and reality: Emily Riehl investigates why mathematicians question AI performance at the IMO.
On the nature of proof: Quanta Magazine explores the changing landscape of mathematical beauty, truth, and proof in the age of AI.

OpenAI and DeepMind first claimed a gold medal at the 2025 IMO, sparking controversy as OpenAI announced its score before the official release. Shortly after followed byformal models such as ByteDance’s SeedProver and Harmonic’s Aristotle.) ↩
As already discussed in my previous post. ↩
See this reddit thread AI misinformation and Erdos problems. ↩
A brief clarification first. I am not interested here in the philosophical or neuroscientific accounts of mathematical cognition, the interplay between intuition and formal reasoning, or the broader reflections on “how the mind does mathematics” (cf. David Bessis work). The purpose is far more concrete, it is to understand the ordinary, everyday practice of professional mathematicians. ↩
Following the best practice of “The Mom Test”. ↩
Similar to a very nice undergrad course from Patrick Massot where Lean code is displayed next to the informal mathematics based on lean-verbose. ↩