Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Role reversal


(deleted) (Sep 23 2025 at 16:07):

Previously, I was in charge of the proof strategy, and the AI only solved simple lemmas.

Now, the AI is in charge, and I end up fixing errors made by AI.

Before diving deeper, I'd like to highlight a few caveats:

  • AI agents are slow. And they make errors quite often.
  • My AI setup isn't optimal: the AI often has to work from incomplete error lists and has no visibility into the goal state.

Here is my AI setup.

  1. The problem statement is processed by Gemini Deep Think, and after around 10 to 20 minutes, I get a proof in natural language. I have yet to see an incorrect proof coming from Gemini Deep Think.
  2. I put the proof generated by Gemini in the Lean file as a comment (see the sketch after this list).
  3. I use Cursor with the GPT-5 model. No additional MCP servers are installed, .gitignore is deleted, and the model operates in agent mode. I tell the model to use the read_lints tool to check for errors, to look into the .lake folder to find helpful Mathlib lemmas, and not to declare new axioms.
  4. The model works hard, then stops when it thinks there are no errors. I then tell it to run read_lints again and fix the remaining errors. I babysit the model: if it deletes the whole proof, I stop it and revert the change.
  5. If the model refuses to do any work, I start a new conversation and try again.
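For concreteness, here is a minimal sketch of what the Lean file looks like after step 2, before the agent starts working. The theorem and the quoted proof text are hypothetical, invented for illustration:

```lean
import Mathlib

/- Step 2: Gemini Deep Think's natural-language proof, pasted as a comment.
   (Hypothetical:) "Of the consecutive integers n and n + 1, one is even,
   hence their product is even." -/
theorem two_dvd_mul_succ (n : ℕ) : 2 ∣ n * (n + 1) := by
  -- Steps 3-4: the Cursor/GPT-5 agent fills in this `sorry`,
  -- checking its work with read_lints after each attempt.
  sorry
```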

In the end, I'm left with a proof that still has some errors, which I just fix manually. In some cases the iterative error fixing means the final proof is perfect without further manual intervention. And I am no longer in charge of the high-level proof strategy. The AI is in charge, and I become a janitor. How humiliating. But it shows that we are clearly in a new age.

(deleted) (Sep 23 2025 at 16:18):

Soon, as AI becomes faster and more reliable, we can become architects. Our work will not just be to formalize individual theorems; rather, we can build a whole theory, or a whole framework, and let the AI do the grunt work. I don't think this development makes humans useless. Rather, it elevates us. And it finally makes formal verification practical for use in industry.

(deleted) (Sep 23 2025 at 16:23):

Screenshot_2025-09-23-23-22-53-850_com.android.chrome.jpg

There are still rough edges, however.

Deleted User 968128 (Sep 23 2025 at 16:59):

Huỳnh Trần Khanh said:

Here is my AI setup.

Have you experimented with other setups? LeanDojo, Lean Copilot, and other models like DeepSeek Prover? Agentic search (o3-search, gpt5-search) can help a lot here for making suggestions.

Deleted User 968128 (Sep 23 2025 at 17:02):

Huỳnh Trần Khanh said:

Soon, as the AI becomes faster and more reliable, we can become architects. Our work is not just to formalize individual theorems, but rather we can build a whole theory. Or a whole framework. And we can let AI do the grunt work. I don't think this new development makes humans useless, but rather it elevates us. And this finally makes formal verification practical for use in the industry.

Yeah, the trick of the moment is to find the right level at which to place the sorry/abstractions. If you go too low, you end up doing all the work the AI could do for you. If you go too high, it just breaks. AI is moving so quickly that this sorry level rises just as soon as you find the right balance. It's also different, of course, depending on the specific domain.
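A toy Lean sketch of the three levels (all names and statements hypothetical):

```lean
import Mathlib

-- Too high: one monolithic `sorry` for the whole result, which the
-- model may not be able to fill in one shot.
-- Too low: the human has already done all the interesting work.
-- A workable middle: the human states intermediate lemmas and the AI
-- fills each `sorry` independently.

lemma step₁ (n : ℕ) : n ≤ n + 1 := by sorry      -- delegated to the AI
lemma step₂ (n : ℕ) : n + 1 ≤ n + 2 := by sorry  -- delegated to the AI

theorem main_result (n : ℕ) : n ≤ n + 2 :=
  le_trans (step₁ n) (step₂ n)  -- the human keeps the high-level glue
```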

One thing that I believe will become a more important field of study is "Proof Axiology", where the problem is less about proving and more about deciding what should be proved first.

(deleted) (Sep 23 2025 at 17:19):

I have used Kimina Prover, DeepSeek Prover v2, and Goedel Prover v2. They can only prove easy theorems. GPT-5 is the first model able to prove harder theorems.

(deleted) (Sep 23 2025 at 17:19):

I've reviewed these models on this Zulip instance. DeepSeek Prover v2 was the strongest model.

(deleted) (Sep 23 2025 at 17:19):

Goedel Prover v2 is very weak, near useless.

(deleted) (Sep 23 2025 at 17:20):

Kimina Prover is moderately strong.

(deleted) (Sep 23 2025 at 17:20):

I have yet to try LeanDojo, Lean Copilot, or agentic search.

Deleted User 968128 (Sep 23 2025 at 17:24):

Is that pro reasoning? https://chatgpt.com/?openaicom_referred=true#pricing
image.png

I had it for a while, but at the time it wasn't SOTA for the cost. I am considering trying again with GPT-5.

(deleted) (Sep 23 2025 at 17:25):

I don't have a paid ChatGPT plan. I use GPT-5 through Cursor.

Deleted User 968128 (Sep 23 2025 at 17:34):

https://github.com/aziksh-ospanov/APOLLO looks interesting as well. (removed offtopic stuff)

Bas Spitters (Sep 24 2025 at 09:04):

It's a pity that Apollo does not evaluate on modern benchmarks...

Bas Spitters (Sep 24 2025 at 09:06):

@Huỳnh Trần Khanh What kind of problems are you giving to the LLMs? Basic maths? CS? ...?

Jason Rute (Sep 24 2025 at 09:08):

@Bas Spitters What do you consider a modern benchmark? It evaluates on MiniF2F. Or do you mean benchmarks with less possibility of data leakage?

(deleted) (Sep 24 2025 at 09:20):

Bas Spitters said:

Huỳnh Trần Khanh What kind of problems are you giving to the LLMs? Basic maths? CS? ...?

Here's a sample:

PastedText.txt

Download to view the file without mojibake

(deleted) (Sep 24 2025 at 09:22):

This file will end up in the Project Numina data dump next year. You're given a preview of next year's data dump :))

(deleted) (Sep 24 2025 at 09:24):

Here's another sample. It's a monstrous proof.
PastedText.txt

Bas Spitters (Sep 24 2025 at 09:30):

@Jason Rute I hear that https://huggingface.co/datasets/PAug/ProofNetSharp has less leakage, but I'm happy to be corrected. What are your favorite benchmarks?

(deleted) (Sep 24 2025 at 09:39):

CombiBench is my favorite

(deleted) (Sep 24 2025 at 09:39):

And MiniF2F is my least favorite

(deleted) (Sep 24 2025 at 09:39):

Goedel Prover v2 is a model that performs well on MiniF2F while being useless in real-world use.

(deleted) (Sep 24 2025 at 09:44):

https://github.com/MoonshotAI/CombiBench/

Alfredo Moreira-Rosa (Sep 24 2025 at 09:45):

Interesting idea. I use GPT-5 exclusively inside VS Code in Ask/Edit or Agent mode. But your idea of getting it to first write a human-readable proof is interesting. I'll try this setup in VS Code to see if it makes GPT-5 even better.

Bas Spitters (Sep 24 2025 at 10:34):

Thanks @Huỳnh Trần Khanh, so basically IMO-style problems. That's a good benchmark.

Bas Spitters (Sep 24 2025 at 10:35):

Both Lean-STaR and NLIR (for Rocq) use natural language as an intermediate language, since this is what the models are trained on.

Deleted User 968128 (Sep 24 2025 at 21:45):

Bas Spitters said:

It's a pity that Apollo does not evaluate on modern benchmarks...

It is my intuition that benchmarks in general have much less value compared to transparent leaderboards (LBs). A great deal of the effort in evals is just performing the evals, especially as new models and agentic flows are released so frequently. Also, third-party evals are always more credible than self-evals, if they're done in a transparent fashion.

If someone in the local community were to create an LB (they could use a weighted blend of benchmarks already available), it would become a hub of innovation for sure. A useful template for this is https://swe-rebench.com/ (the date slider is quite nice), though they are not that transparent and could focus more on agentic flows rather than just models.

Benchmarks can be added, removed, and re-weighted over time, as long as historical values are kept (as swe-rebench does) and there is credible evidence that there is no untoward bias and it's done in a thoughtful, transparent, and evolutionary fashion.
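To make "weighted blend" concrete (notation mine, purely illustrative, not from swe-rebench): the LB would publish $$\text{score}(m) = \sum_i w_i \, b_i(m), \qquad \sum_i w_i = 1,$$ where $b_i(m)$ is the normalized score of model or agentic flow $m$ on benchmark $i$, and the weights $w_i$ get revised as benchmarks are added or retired, with historical scores kept frozen.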

(deleted) (Sep 25 2025 at 19:12):

For the sake of honesty: today, Gemini Deep Think generated a very wrong proof. I'm sad, but not surprised: it's AI, after all.

(deleted) (Sep 25 2025 at 19:13):

I told it to prove a theorem, but it only considered the simplest case.

Deleted User 968128 (Sep 25 2025 at 19:46):

Huỳnh Trần Khanh said:

For the sake of honesty: today, Gemini Deep Think generated a very wrong proof. I'm sad, but not surprised: it's AI, after all.

This would be a really great topic to start: AI failures. I'd like to try this proof out on some different systems.


Last updated: Dec 20 2025 at 21:32 UTC