Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: How have you guys found GPT 5?

Alok Singh (Aug 14 2025 at 06:20):

Just wanted to gather impressions at its Lean ability

David Renshaw (Aug 14 2025 at 11:43):

It wrote this full proof for me: https://github.com/dwrensha/compfiles/commit/b4a6b036c6160ea1316c415e150482f7bc3fa272
(I needed to correct some errors, but none were terribly difficult.)

(deleted) (Aug 14 2025 at 11:56):

every time we get our hands on a new model we are always very excited. and we test it on our current problem and even when the solution provided by the model is bad we help it, we help fix the errors so we feel the model is up to some good.

this is coming from your resident pessimist

Matthew Ballard (Aug 14 2025 at 17:07):

We ran an experiment at the last community meeting https://www.youtube.com/watch?v=dUWiG12ZQ4g

The improvement (for pro) at natural language research level math is noticable.

Andy Jiang (Aug 15 2025 at 02:54):

Can I tag onto this thread and ask if people here have tested gemini deep think on research level math? Was thinking of getting a subscription but curious to hear what people's thoughts are before

(deleted) (Aug 15 2025 at 02:57):

The main limitation is you only get 5 queries a day.

Andy Jiang (Aug 15 2025 at 03:05):

but what is the level like? is it definitely above gpt5-pro for example? what sort of problems can it solve? I think it can be useful if the level is high enough

(deleted) (Aug 18 2025 at 17:00):

Hey everyone, I'm interested in an answer to the above question.

(deleted) (Aug 18 2025 at 17:01):

I'm on the fence. And I'm not sure whether I should get a subscription or not.

(deleted) (Aug 28 2025 at 16:49):

@Andy Jiang I just spent the money. Here's what I got from Gemini:

I ran into an issue and wasn't able to finish thinking through this request. Please try again, and don't worry, it didn't count against your Deep Think limit.

Hmm... Not a good impression. Let me try again. I don't want to be pessimistic, I spent a good amount of money expecting it to be able to solve IMO problems...

(deleted) (Aug 28 2025 at 16:51):

I have a lot to gain if I get my hands on an AI that can solve such problems. So let me try again. If there's good news I'll tell you. This Deep Think thing could turn out to be a money printer honestly.

(deleted) (Aug 28 2025 at 17:10):

Second try: the thinking process stopped midway. Let me try the third time. And when I wake up, I'll tell you whether Gemini successfully solves the problem or not.

(deleted) (Aug 28 2025 at 17:28):

Third try: The result is quite good. The AI still makes logical errors as it does not completely adhere to the specification, but in general the deduction steps make sense and are not hand wavy.

(deleted) (Aug 28 2025 at 17:36):

While not magic, Gemini Deep Think will be an indispensable part of my day to day work.

(deleted) (Aug 28 2025 at 17:44):

My review: Gemini Deep Think demonstrates a high ability to follow the specification, to critique its own mistakes and to write a machine-checkable proof. With appropriate prompt engineering, it is possible to make Gemini Deep Think generate Lean 4 code with appropriate preprocessing and postprocessing steps.

(deleted) (Aug 28 2025 at 17:46):

I will figure out how to best use the tool. It has very high potential, despite the initial bugs.

Alfredo Moreira-Rosa (Aug 28 2025 at 18:07):

I Use GPT-5 with my Github Copilot subscription. So i have more than 5 requests per day allowed :grinning_face_with_smiling_eyes:

I'd say it's the best model at the moment for Lean with the new Deepseek as a contender. The plus with GPT-5 is that i have it integrated with VsCode, while i use the DeepSeek from their website. Deepseek 3.1 is really impressive. But it thinks really hard all the time (like Grok), so i don't use it that much.

I use it mainly for proofs on Computer Science side. So my feedback may not scale under other math fields.

(deleted) (Aug 29 2025 at 02:38):

Thank you, @ecyrbe, for the feedback. Could you record a video of you using Copilot with GPT-5 to prove theorems? I watched Terence Tao's video but he had to fight Copilot a lot.

Alfredo Moreira-Rosa (Aug 29 2025 at 06:50):

It's normal, terrence was using Copilot mostly in completion mode. completion mode don't use thinking models, it uses a fast but dumb model (GTP-4o at the moment) .
I use prompt/chat/agent mode in Copilot that allows to select the model between a refined list.
Here the list we get when using Copilot (for me it's the best of all worlds, because you get access to different LLM providers with just one subscription) :
image.png

Arjo (Aug 30 2025 at 15:28):

@ecyrbe Btw do you think there is scope to have a good niche editors completely focussed on writing math and doing research per se? Something that could run lean based verifications natively, have concepts across papers (external and also your own notes) indexed uniquely etc.?

Alfredo Moreira-Rosa (Aug 30 2025 at 15:37):

I think building plugins for existing editors is a better approach. Building niche editors is maintainance costly and without a big community budget i dont think it's worth the effort.

Arjo (Aug 30 2025 at 15:40):

Any specific kind of plugins that you would find very useful (doesn't have to simple could be unreasonably hard to implement as well)?

West (Sep 04 2025 at 03:53):

#lean4 > Vibe Proofing

Cross posting my experience.

Yan Yablonovskiy 🇺🇦 (Sep 04 2025 at 04:21):

For those interested, from conversation i have had in my faculty (Monash University, Australia) , professors are finding GPT 5 a lot more useful.

They do not know Lean and such, their use cases involve asking for proofs where the reference is tedious or obscure. They say prior versions were worse at similar tasks -- having more hallucinations.

Knowing Lean however, i can imagine a world where you can formulate and ask for a proof , in a highly specific context for a sub-lemma, which may be tedious and straightforward in approach but lacks a direct current reference/ lean code( not with something as general purpose / natural language as GPT5 though).

Arnav Rawat (Sep 04 2025 at 16:48):

Is GPT5 acting as a search engine in finding those proofs, or is it coming up with those proofs with novel derivations?

West (Nov 03 2025 at 06:10):

Arnav Rawat said:

Is GPT5 acting as a search engine in finding those proofs, or is it coming up with those proofs with novel derivations?

I know this thread is old, but here's how I used it. I had instructions on how to use the various dev tools, so it would those tools to find relevant theorems, then generate a proof and use the tools to test whether it was valid. It would do this in a loop until it finished the proof.

Kelly Davis (Nov 03 2025 at 07:06):

West said:

I know this thread is old, but here's how I used it. I had instructions on how to use the various dev tools, so it would those tools to find relevant theorems, then generate a proof and use the tools to test whether it was valid. It would do this in a loop until it finished the proof.

Did you formalize this process in terms of say LangGraph code or was it more loosely formalized, e.g. a set of local scripts?

This general idea is something I was thinking about creating a LangGraph application to do, e.g. using say kimina-lean-server, lean-explore, ...

Oliver Nash (Nov 06 2025 at 10:08):

Last week I wanted to advertise the impressiveness of consumer-accessible large reasoning models to somebody so I cooked up an example.

I took a lemma from a paper I wrote years ago which I had proved with a somewhat laborious calculation and which I thought probably had a slick proof, but which I failed to find. Here is the lemma:
image.png
(in this $K(F)$ is complex topological K-theory, and coefficients are complex numbers)

I aksed GPT 5 Pro to prove this last Thursday evening. It thought for ~16 minutes and produced a correct short proof: about half a page compared to my 4--5 page laborious calculation.

I think it would be fair to classify this as early grad school -level mathematics. The fact that a lemma, which I worked hard to prove over several days, 12 years ago, can now be proved with a better proof by a computer in 16 minutes made a fairly big impression on me.

Jason Rute (Nov 06 2025 at 10:10):

Informal proof?

Oliver Nash (Nov 06 2025 at 10:10):

Sorry yes: informal proof.

Yan Yablonovskiy 🇺🇦 (Nov 07 2025 at 02:37):

Since many people had good things to say , I would like to highlight a failure.

#Is there code for X? > Derivative of Holomorphic Functions

I used Codex with GPT-5 on this many times, and it always gave me:

A couple blockers to fix it cleanly:

The types differ: LHS is ℂ, RHS is ℝ. We should coerce the RHS: (iteratedDeriv n g t : ℂ).

As stated, the lemma is generally false unless you assume enough regularity on g and that it indeed is the restriction of the holomorphic cg. At minimum you’ll need differentiability assumptions on g up to order n at t (or a global ContDiff ℝ ⊤ g). Then you can prove by induction using iteratedDeriv_succ and HasDerivAt.comp_ofReal, together with the pointwise equality ht to switch to (fun x : ℝ ↦ cg (x:ℂ)) and then to g.

Whereas you can see that aristotle solved it as is, as is shown in that thread. I wouldn't trust GPT-5 too much.

full_output

Let me search the codebase for the lemma or any references to "cgiteratedDeriv_eq_giteratedDeriv" and related iterated derivative lemmas to identify where to fix it.

[11 tools called]

I found your stub in GibbsMeasure/Mathlib/Probability/Kernel/test.lean:

lemma cgiteratedDeriv_eq_giteratedDeriv (t : ℝ) (n : ℕ) (hcg: Differentiable ℂ cg)
  (ht : ∀ x : ℝ, cg (x : ℂ) = g x) : iteratedDeriv n cg (t : ℂ) = iteratedDeriv n g t := by

A couple blockers to fix it cleanly:
- The types differ: LHS is ℂ, RHS is ℝ. We should coerce the RHS: (iteratedDeriv n g t : ℂ).
- As stated, the lemma is generally false unless you assume enough regularity on g and that it indeed is the restriction of the holomorphic cg. At minimum you’ll need differentiability assumptions on g up to order n at t (or a global ContDiff ℝ ⊤ g). Then you can prove by induction using iteratedDeriv_succ and HasDerivAt.comp_ofReal, together with the pointwise equality ht to switch to (fun x : ℝ ↦ cg (x:ℂ)) and then to g.

Confirm:
- Should I change the RHS to (iteratedDeriv n g t : ℂ)?
- What regularity hypotheses on g (e.g. ContDiffAt ℝ ⊤ g t, or for all orders up to n) do you want?
- Is it acceptable to require AnalyticAt ℂ cg (t : ℂ) (which is implied if cg is holomorphic on a neighborhood), to avoid hand-carrying differentiability of all iterates?

Once you confirm the intended assumptions, I’ll implement the lemma (by induction on n) in test.lean.

Justin Asher (Nov 07 2025 at 05:37):

@Yan Yablonovskiy 🇺🇦 Are you using GPT-5 and not Codex the model? Both can be called via the Codex API, but the latter might work better than the former for functional programming tasks. Would be happy to hear more about your experiences.

https://openai.com/index/introducing-upgrades-to-codex/

David Michael Roberts (Nov 07 2025 at 06:10):

Oliver Nash said:

I took a lemma from a paper I wrote years ago which I had proved with a somewhat laborious calculation and which I thought probably had a slick proof, but which I failed to find. Here is the lemma:
...
I aksed GPT 5 Pro to prove this last Thursday evening. It thought for ~16 minutes and produced a correct short proof: about half a page compared to my 4--5 page laborious calculation.

I think it would be fair to classify this as early grad school -level mathematics. The fact that a lemma, which I worked hard to prove over several days, 12 years ago, can now be proved with a better proof by a computer in 16 minutes made a fairly big impression on me.

That GPT has probably ingested your paper as well as any improvements or stronger results in the literature should not go unsaid. Proving a theorem for the first time is of course the messiest and hardest.

Anh Nguyễn (Nov 07 2025 at 10:07):

David Michael Roberts said:

Oliver Nash said:

I took a lemma from a paper I wrote years ago which I had proved with a somewhat laborious calculation and which I thought probably had a slick proof, but which I failed to find. Here is the lemma:
...
I aksed GPT 5 Pro to prove this last Thursday evening. It thought for ~16 minutes and produced a correct short proof: about half a page compared to my 4--5 page laborious calculation.

I think it would be fair to classify this as early grad school -level mathematics. The fact that a lemma, which I worked hard to prove over several days, 12 years ago, can now be proved with a better proof by a computer in 16 minutes made a fairly big impression on me.

That GPT has probably ingested your paper as well as any improvements or stronger results in the literature should not go unsaid. Proving a theorem for the first time is of course the messiest and hardest.

GPT 5 was trained on almost all the Internet, right?

Martin Dvořák (Nov 20 2025 at 16:23):

I find ChatGPT pretty much useless for writing in Lean because it cannot run lake build or any other command that would attempt to compile the code it comes up with. And its chances for getting anything nontrivial correct on a whim are very slim.

Alok Singh (Nov 20 2025 at 17:50):

Martin Dvořák said:

I find ChatGPT pretty much useless for writing in Lean because it cannot run lake build or any other command that would attempt to compile the code it comes up with. And its chances for getting anything nontrivial correct on a whim are very slim.

At least for that, I’d use cursor or codex or some model wrapped in the terminal so lake build would work

Kim Morrison (Nov 20 2025 at 21:35):

@Martin Dvořák, have you tried Codex (or Cursor, or Claude Code, etc)? Obviously the website interfaces are irrelevant.

Martin Dvořák (Nov 20 2025 at 21:39):

I haven't. Are there any testimonials?

Kim Morrison (Nov 20 2025 at 21:58):

Yes, they work well and can prove things. :-)

They work even better with some specialized prompting. There is the "lean4 skills" plugin for Claude, which I haven't used. You could also try adopting a subset of my ~/.claude/CLAUDE.md file.

Timothy Chow (Nov 21 2025 at 12:57):

This was mentioned by Anh Nguyễn in a different conversation but seems worth cross-posting here as well: Early science acceleration experiments with GPT-5.

Deleted User 968128 (Nov 22 2025 at 15:57):

Someone posted this, which was interesting:
https://www.reddit.com/r/singularity/comments/1p3sx47/some_experiments_in_mix_model_g351_lean_math/

If true, this could mean that there is large gap of capability not yet discovered by blending many models.

typh (Nov 23 2025 at 10:41):

I cross-pollinate Gemini, Claude, and ChatGPT as a regular part of my workflow. They all have strengths, blindspots. You can get them into alignment and they're more than the some of their parts. Good way to filter out some hallucination too.

I've been using Lean as a thought tool to understand perspectives (and some math and physics along with it). I use Gemini to architect and sketch a first draft. It's logical, less prone to pandering, and it'll correct my naivety. Chatgpt I find unreliable, but I'll sometimes make it argue with Gemini when Gemini's offbase. I use Claude for philosophizing, it's warmer in personality, and "curious". Claude Code for the actual work, using lean-lsp-mcp and lean4 skills. That way it can check the proof directly, look up tactics, run many attempts in parallel. It's really magic when it works.

Deleted User 968128 (Nov 23 2025 at 13:56):

Surprising you'd have Claude in the mix as Claude Opus 4 scores 0.3%, with tasks costs of $351.75 on critp https://critpt.com/, versus gpt 5 of 10% with task costs of $55. On frontier math, it's 4.2% versus Gemini 3 at 18.8%. https://epoch.ai/frontiermath

Timothy Chow (Nov 23 2025 at 15:35):

Some time ago, I attended a public lecture by Jon Edwards, who at the time had recently won the 32nd World Correspondence Chess Championship. In this competition, you're allowed to use chess engines as much as you want. To get an edge, you therefore have to (1) mine databases better than your opponents do and (2) combine multiple chess engines better than your opponents do. I remember that he said that there was one engine (Houdini, I think, but I'm not sure) which was definitely not the strongest, but which would sometimes come up with important new lines if you ran it long enough.

I've had similar experiences with SAT solvers, where a solver that doesn't do so well overall on a suite of benchmark problems will unexpectedly crush all the others on specific problems.

Kelly Davis (Nov 23 2025 at 16:18):

The general setup is relatively common in machine learning and goes by the name Mixture of Experts (MoE). There are many variations. Some of the larger LLMs even use MoE internally.

Deleted User 968128 (Nov 23 2025 at 16:21):

Timothy Chow said:

I've had similar experiences with SAT solvers, where a solver that doesn't do so well overall on a suite of benchmark problems will unexpectedly crush all the others on specific problems.

https://www.quantamagazine.org/to-have-machines-make-math-proofs-turn-them-into-a-puzzle-20251110/

Deleted User 968128 (Nov 23 2025 at 16:25):

Personally, I think a more globally collaborative attitude towards AGI would be healthy, especially if it achieves superior results.

typh (Nov 23 2025 at 21:13):

Tim Shephard said:

Surprising you'd have Claude in the mix as Claude Opus 4 scores 0.3%, with tasks costs of $351.75 on critp https://critpt.com/, versus gpt 5 of 10% with task costs of $55. On frontier math, it's 4.2% versus Gemini 3 at 18%. https://epoch.ai/frontiermath

Claude does often pick the worst initial solution sketch of the three, which is why I use Gemini for that. I find Claude Code the best of the terminal agents for actually getting a task done though, in terms in my time.

Deleted User 968128 (Nov 25 2025 at 14:53):

typh said:

Claude does often pick the worst initial solution sketch of the three, which is why I use Gemini for that. I find Claude Code the best of the terminal agents for actually getting a task done though, in terms in my time.

I'd like to understand this better. Can you give a brief example of how Claude can perform better on a lean proof than Codex (now max) or Gemini?

Across every math benchmark I've seen (and tests that I've tried), it seems that Claude Opus 4.5 performs very poorly relative to OpenAI and Google.

Deleted User 968128 (Nov 25 2025 at 15:10):

One possibility that I see is that Claude Code is the best at leveraging MCPs properly. That would explain things, but speaks more to the vscode fork than the model itself.

typh (Nov 25 2025 at 17:51):

Tim Shephard said:

I'd like to understand this better. Can you give a brief example of how Claude can perform better on a lean proof than Codex (now max) or Gemini?

What Claude Code is better at, in my experience, is actually editing files and iterating on a task. I sense Anthropic is more invested in developer ergonomics in general. Even without MCP (in fact, it chronically forgets its MCP tools). My experiences with both codex and gemini-cli have been more frustrating than productive, even though yes, they might be better at math. When Claude can't solve a problem, I just get it to summarize what it's learned and switch back to [web] Gemini/ChatGPT.

And note- I am a Lean beginner, with little math experience. My frictions are likely different from somebody deeper in the domain. I mostly do webdev.

Kim Morrison (Nov 25 2025 at 23:34):

That more or less aligns with my experience. Gemini is way better at informal maths than Claude is, but Claude's "personality" in Claude Code is more effective. It explores the codebase more, tries things out, and is more persistent (sometime too persistent, when I'd prefer it to come back to me and say "I'm confused") than Codex. (However I haven't used Gemini in agent mode much at all.)

Last updated: Feb 28 2026 at 14:05 UTC