Zulip Chat Archive
Stream: general
Topic: Kimina Prover Preview: discussion thread
Mantas Baksys (Apr 14 2025 at 13:11):
Opening this thread for discussion about the Kimina-Prover Preview release!
Kim Morrison (Apr 14 2025 at 13:24):
What would need to happen before Mathlib users could write by kimina (and get a "Try this" suggestion)?
Jason Rute (Apr 14 2025 at 13:26):
Interesting! Do you have any plans for the other Lean benchmarks as well like PutnamBench? Also which MiniF2F did you use exactly? There are a lot of wrong problems in MiniF2F. (Someone should also run a prover on the negation of all the MiniF2F problems to find the errors. Or we should switch to the Harmonic version but undo their split.)
Junqi Liu (Apr 14 2025 at 13:35):
(deleted)
Junqi Liu (Apr 14 2025 at 13:35):
Jason Rute said:
Interesting! Do you have any plans for the other Lean benchmarks as well like PutnamBench? Also which MiniF2F did you use exactly? There are a lot of wrong problems in MiniF2F. (Someone should also run a prover on the negation of all the MiniF2F problems to find the errors. Or we should switch to the Harmonic version but undo their split.)
We have fixed 5 wrong problems in the miniF2F version which came from DeepSeek-Prover V1.5.
Eric Wieser (Apr 14 2025 at 13:41):
Which version of Lean/mathlib is your version of MiniF2F using?
Junqi Liu (Apr 14 2025 at 14:55):
v4.15.0
Auguste Poiroux (Apr 14 2025 at 15:11):
If you plan to evaluate on other benchmarks, consider checking ProofNet#. We fixed a substantial number of entries (>30%). While I am working on autoformalization, I did some tests with different provers, and it seems that results and rankings change slightly compared to the original ProofNet ;) It is compatible with Lean v4.15.0 by the way.
Sabbir Rahman (Apr 14 2025 at 17:44):
I have only glanced at the GitHub repo, so I might be wrong about this. But I noticed that the model generates proofs without any prover feedback? If I understand correctly, that's amazing. But as mathlib updates, theorems get deprecated. What is the plan for updating the model with respect to such updates? Also simp generally gets stronger, so I don't think the model can foresee how simp will work in future versions.
Mantas Baksys (Apr 14 2025 at 18:06):
Auguste Poiroux said:
If you plan to evaluate on other benchmarks, consider checking ProofNet#. We fixed a substantial number of entries (>30%). While I am working on autoformalization, I did some tests with different provers, and it seems that results and rankings change slightly compared to the original ProofNet ;) It is compatible with Lean v4.15.0 by the way.
Thanks for this, we will likely explore other benchmarks in a future release and this is now on our radar.
Mantas Baksys (Apr 14 2025 at 18:09):
Sabbir Rahman said:
I have only glanced at the GitHub repo, so I might be wrong about this. But I noticed that the model generates proofs without any prover feedback? If I understand correctly, that's amazing. But as mathlib updates, theorems get deprecated. What is the plan for updating the model with respect to such updates? Also simp generally gets stronger, so I don't think the model can foresee how simp will work in future versions.
The current version of the model is trained with Lean/mathlib version v4.15.0. We have internally seen little performance degradation when upgrading to newer mathlib versions, so you should be able to use our models on any fairly new release. However, our further work will likely address this in general, so stay tuned!
Eric Wieser (Apr 14 2025 at 18:29):
Are you willing to share the full list of problems that you solved in minif2f, as you did for putnambench?
Joseph Myers (Apr 14 2025 at 21:02):
Do you expect to be able to make a serious attempt at IMO 2025 problems at the time of IMO 2025 (whether the statements are written following my suggested conventions or any other conventions)?
Jia Li (Apr 14 2025 at 22:33):
Releasing them today!
Jason Rute (Apr 15 2025 at 02:04):
Here is what I found in the paper:
We evaluate our models on the miniF2F benchmark (K. Zheng et al. 2022), using the Lean 4 version from DeepSeek-ProverV1.5. To prevent training data leakage (from Numina Math 1.5), we perform 13-gram decontamination and explicitly removed AMC12, AIME, and IMO problems by source.
This brings up a lot of questions. First, is the 80.7% number without AMC12, AIME, or IMO problems? Because my understanding of MiniF2F (see paper) is that these are some of the most difficult problems in the test set (especially IMO and AIME), and they make up almost a third of the test set (80/244). Removing them makes the test set a lot easier, no?
Jason Rute (Apr 15 2025 at 02:21):
Although I now see that Table 2 has numbers for MiniF2F/IMO and MiniF2F/AIME broken out. I don't know what to make of all this. (MiniF2F is becoming more and more of a problematic benchmark.)
llllvvuu (Apr 15 2025 at 03:18):
Doesn’t decontamination mean removal from training data, not test data?
Kim Morrison (Apr 15 2025 at 03:28):
Yes, I'd certainly be happier to see announcements that said "we didn't test on MiniF2F, because it's a terrible benchmark, and instead you should pay attention because..."!
Huỳnh Trần Khanh (Apr 15 2025 at 04:07):
Has anyone outside the team tried to run this?
Huỳnh Trần Khanh (Apr 15 2025 at 04:07):
The inference code looks simple, lemme get ahold of a GPU and run it.
Jia Li (Apr 15 2025 at 06:13):
Jason Rute said:
Here is what I found in the paper:
We evaluate our models on the miniF2F benchmark (K. Zheng et al. 2022), using the Lean 4 version from DeepSeek-ProverV1.5. To prevent training data leakage (from Numina Math 1.5), we perform 13-gram decontamination and explicitly removed AMC12, AIME, and IMO problems by source.
This brings up a lot of questions. First, is the 80.7% number without AMC12, AIME, or IMO problems? Because my understanding of MiniF2F (see paper) is that these are some of the most difficult problems in the test set (especially IMO and AIME), and they make up almost a third of the test set (80/244). Removing them makes the test set a lot easier, no?
Sorry for the confusion. My bad (I personally wrote this sentence).
To be short, we test on all 244 samples of miniF2F.
What I meant in the paper is that we performed decontamination to remove samples from the training set (prompt set). We did it twice (because, for some reason, AMC problems have a different formulation from the dump in the Numina dataset). But the goal is to remove them from the training set (or the RL prompt set, if you call it that). When I say "explicitly removed by source", this means I take an AMC12A problem from the miniF2F test set, find the same problem in Numina Math 1.5 using the metadata (amc12a, year, and the id of the problem), and remove it. The point of saying this is: pay attention to the Numina 1.5 dataset, it is not decontaminated against miniF2F (it is decontaminated against the MATH test set and AIME 2024/2025).
Jia Li (Apr 15 2025 at 06:18):
Joseph Myers said:
Do you expect to be able to make a serious attempt at IMO 2025 problems at the time of IMO 2025 (whether the statements are written following my suggested conventions or any other conventions)?
It is still very challenging, we will try our best.
Sabbir Rahman (Apr 15 2025 at 06:35):
On Hugging Face, clicking on AI-MO/mini2f-test gives me an error. Not sure if this happens for everyone.
image.png
Christian K (Apr 15 2025 at 07:57):
AI-MO/mini2f-test works for me, but https://huggingface.co/AI-MO/Kimina-Prover-Preview-Distill-1.5B gives the same Error Code 500.
Jia Li (Apr 15 2025 at 08:19):
Christian K said:
AI-MO/mini2f-test works for me, but https://huggingface.co/AI-MO/Kimina-Prover-Preview-Distill-1.5B gives the same Error Code 500.
I think the issue comes from Hugging Face; they have been experiencing some bugs since yesterday.
Mert Ünsal (Apr 16 2025 at 08:39):
We also spent a little bit of our compute budget on PutnamBench and got SOTA. Note that the results are obtained with our open-source model, so you can reproduce them!
Of course, we are still far from solving actually hard Putnam problems. We are trying our best!
Kevin Buzzard (Apr 16 2025 at 09:34):
Can you clarify what pass@192 means in this context? Sorry for being dense.
Jannis Limperg (Apr 16 2025 at 09:42):
You let the LLM generate 192 potential proofs for each theorem and if Lean accepts one of these proofs, you count the theorem as a success.
Jason Rute (Apr 16 2025 at 09:47):
@Kevin Buzzard (not an author here) but pass@192 here probably means they ran their model 192 times per theorem and for 10 theorems their model solved the theorem in one of those 192 times. (It is a statistic so it is possible they actually ran it longer, like 284 times, and then used that to get a better estimate of the statistic but this is rarely done in theorem proving since one wants the best score.) The fact that it is such a strange number might mean they picked N just high enough to get SoTA. You can see from their paper that there is a log-linear scaling to their results (meaning for every doubling of N, pass@N is +X% higher).
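(Spelled out, the log-linear scaling Jason describes is roughly
\[ \text{pass@}N \approx \alpha + \beta \log_2 N \]
over the measured range, so each doubling of the sample budget $N$ buys about the same fixed increase $\beta$ in solve rate; the constants here are illustrative, not numbers from the paper.)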
Mario Carneiro (Apr 16 2025 at 09:48):
It is a statistic so it is possible they actually ran it longer, like 284 times, and then used that to get a better estimate of the statistic
isn't that p-hacking?
Jason Rute (Apr 16 2025 at 09:50):
Why would that be p-hacking?
Mario Carneiro (Apr 16 2025 at 09:51):
you are biasing your results relative to an actually random run of 192 attempts if you run it for more and then report the number that passed in the actual test
Jason Rute (Apr 16 2025 at 09:54):
Oh, what I'm suggesting is running it 284 times and then using statistics to get the best estimate of pass@192. (I'm sure there is a formula, or you can do it computationally: randomly draw 192 samples from that dataset of 284, measure the pass rate, do it many more times, and take an average.)
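For reference, the closed-form version of that computation (the unbiased pass@$k$ estimator from the Codex paper, assuming the samples for each problem are independent) is
\[ \widehat{\text{pass@}k} = \frac{1}{|P|} \sum_{p \in P} \left( 1 - \frac{\binom{n_p - c_p}{k}}{\binom{n_p}{k}} \right), \]
where $n_p$ is the number of samples drawn for problem $p$ (e.g. 284) and $c_p$ is the number of those that Lean accepts; averaging over random 192-sample subsets, as described above, converges to the same value.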
Kevin Buzzard (Apr 16 2025 at 09:54):
My personal take on it would be that I don't care at all how many goes it took; all I care about is whether the system proved the theorem. I could imagine a system where "trying to prove it until you had a proof which compiled" counted as one "go" rather than [the number of tries] goes.
Mario Carneiro (Apr 16 2025 at 09:55):
the problem with that approach is that then you get people solving the theorem by just brute force and billions of dollars
Mario Carneiro (Apr 16 2025 at 09:56):
I can solve lean theorems using a random number generator if we use that criterion
Kevin Buzzard (Apr 16 2025 at 09:56):
Conversely, am I to conclude from this that these 192 goes are completely independent and no use is made of Lean's feedback on the presumably many failures? It's just "oh that didn't work, let's discard the error message and try again from the beginning"?
Jason Rute (Apr 16 2025 at 09:56):
Yes, they don’t use Lean feedback. :)
Kevin Buzzard (Apr 16 2025 at 09:57):
Is anyone developing a system where they do use Lean feedback or is this not the fashion right now?
Jason Rute (Apr 16 2025 at 09:57):
Basically the model has a conversation with itself and then gives a Lean proof. No interaction.
Kevin Buzzard (Apr 16 2025 at 09:57):
Do we know if GDM used Lean feedback in AlphaProof?
Jason Rute (Apr 16 2025 at 09:57):
I am sure many people are.
Jason Rute (Apr 16 2025 at 09:57):
Yes, AlphaProof uses feedback.
Jason Rute (Apr 16 2025 at 09:58):
DeepSeek-Prover v1 didn’t use feedback, but DeepSeek Prover v1.5 did.
Jason Rute (Apr 16 2025 at 09:58):
So I find this encouraging. Lots of room to improve.
Jason Rute (Apr 16 2025 at 09:59):
Basically their (Kimina Prover's) approach is the same as o1, o3, DeepSeek-r1, etc but trained specially to solve Lean proofs.
Auguste Poiroux (Apr 16 2025 at 10:04):
Kevin Buzzard said:
Is anyone developing a system where they do use Lean feedback or is this not the fashion right now?
Easy scalable methods tend to win I would say. Whole-proof generation without interaction is easier to implement and probably more efficient to run on current benchmarks. But I have found that current open-source models mostly find short proofs. I am not sure whole-proof generation can generate proofs as long as AlphaProof on the IMO 2024. (that said, I will probably be proven wrong later ^^)
Jason Rute (Apr 16 2025 at 10:09):
@Mario Carneiro @Kevin Buzzard As for the k in Pass@k, it is clear, everyone makes k as high as they can to get SoTA. Roughly k is a measure of compute, but it is really imperfect. A tree search might report pass@1, but actually make many more model calls for every branch of the tree. A big model might have a lower k, but use more compute since it has more parameters in the model. A thinking model like o1 or Kimina-Prover also uses more compute since it "thinks" for a while before giving an answer.
Also @Kevin Buzzard:
My personal take on it would be that I don't care at all how many goes it took; all I care about is whether the system proved the theorem. I could imagine a system where "trying to prove it until you had a proof which compiled" counted as one "go" rather than [the number of tries] goes.
There is no easy upper bound on performance, so it is hard to estimate pass@infty. The log linear scaling suggests that pass@infty would be much higher, almost 100%. Since pass@k is k independent samples and the output is literally a random sample from a distribution over possible Lean proofs, then pass@infty would be roughly the number of theorems in the benchmark whose Lean proof is short enough to fit in the answer length of the model.
Jason Rute (Apr 16 2025 at 10:14):
It is a bit tricky to figure out who in this thread is a contributor on the project (since the paper has no authors listed). Is it just @Mert Ünsal, @Jia Li, and @Mantas Baksys?
Mario Carneiro (Apr 16 2025 at 10:15):
the bibtex says Wang, Haiming and Unsal, Mert and Lin, Xiaohan and Baksys, Mantas and Liu, Junqi and Santos, Marco Dos and Sung, Flood and Vinyes, Marina and Ying, Zhenzhe and Zhu, Zekai and Lu, Jianqiao and Saxcé, Hugues de and Bailey, Bolton and Song, Chendong and Xiao, Chenjun and Zhang, Dehao and Zhang, Ebony and Pu, Frederick and Zhu, Han and Liu, Jiawei and Bayer, Jonas and Michel, Julien and Yu, Longhui and Dreyfus-Schmidt, Léo and Tunstall, Lewis and Pagani, Luigi and Machado, Moreira and Bourigault, Pauline and Wang, Ran and Polu, Stanislas and Barroyer, Thibaut and Li, Wen-Ding and Niu, Yazhe and Fleureau, Yann and Hu, Yangyang and Yu, Zhouliang and Wang, Zihan and Yang, Zhilin and Liu, Zhengying and Li, Jia
Mert Ünsal (Apr 16 2025 at 10:22):
Author here - pass@192 here means that we sample 192 iid samples from the model with temperature 1.0 and for 10 theorems at least one sample generated a correct proof.
We have no Lean interaction in this release, but this is definitely in the works: we are experimenting with multi-turn RL to achieve this, as we found that a long-CoT model + Lean interaction on the go is a bit tricky to make work.
When it comes to the number 192, I personally ran these evals so I can explain. We had very limited compute left, so I just ran pass@64 first to get a feel for it and it got 9. I thought I would try pass@128 next to beat SoTA by a better margin, and it got 10 (within those 128 samples). Since I had generated 128 + 64 = 192 samples in total, I decided to report the result as pass@192.
@Kevin Buzzard The reason results are usually reported in the pass@N format is that you can always throw more computation at a problem and get slightly better results. I think GDM ran their test-time RL for the IMO for many many days and it eventually solved the problems.
Mario Carneiro (Apr 16 2025 at 10:24):
(not many many days, it was 3 days IIRC but on the other hand they also certainly have a giant compute cluster)
Mert Ünsal (Apr 16 2025 at 10:28):
http://arxiv.org/abs/2504.11354 We put our paper with some typos fixed on Arxiv as well!
Kevin Buzzard (Apr 16 2025 at 10:35):
What is the largest (in characters, say) correct proof which was discovered by the model?
Mantas Baksys (Apr 16 2025 at 10:35):
To second @Mert Ünsal, we put out this Preview release to share our models and experiments with the community, but there is still lots to work on :working_on_it: . The obvious low-hanging fruit includes iterative interaction via Lean error-feedback and use of external library search tools, which we detail in our Conclusions & Discussion section.
Eric Wieser (Apr 16 2025 at 10:36):
Jason Rute said:
A tree search might report pass@1, but actually make many more model calls for every branch of the tree
In general I guess the "pass@k" metric becomes confusing if your model is obtaining intermediate feedback from a verifier
Mantas Baksys (Apr 16 2025 at 10:37):
Eric Wieser said:
Jason Rute said:
A tree search might report pass@1, but actually make many more model calls for every branch of the tree
In general I guess the "pass@k" metric becomes confusing if your model is obtaining intermediate feedback from a verifier
Pass@k, as mentioned before, is already even more confusing for methods such as MCTS where, say, the search width and depth are also chosen parameters :smile:
Eric Wieser (Apr 16 2025 at 10:37):
We have no Lean interaction in this release
So just to be clear: not only do you not get intermediate feedback from Lean, but the only thing you observe in training is whether the final proof was accepted, and any error information (message or line number) if it was rejected is discarded?
Mantas Baksys (Apr 16 2025 at 10:38):
Yes, that is correct for this release.
Jason Rute (Apr 16 2025 at 10:39):
I like the DeepSeek-Prover approach, which other papers have adopted, of breaking out the model calls into a product. Still not perfect since some models are a lot more expensive to call.
Jason Rute (Apr 16 2025 at 10:41):
(c.f. table 1 in this paper)
Kevin Buzzard (Apr 16 2025 at 10:42):
People are pointing out that pass@k gets confusing with feedback and that monkeys with typewriters can prove theorems if you let them try for long enough, but for me these issues are just noise and the real question is how well things scale. In some sense I don't care how many millions or billions were spent, or how many goes it took, or whether it took seconds or weeks; what I care about is how long are the proofs which are currently discoverable. This is a metric which people tend not to report on but it's highly relevant to me because in practice we need systems to discover nontrivial proofs, and there's a very strong correlation between nontriviality and length.
Mantas Baksys (Apr 16 2025 at 10:43):
Eric Wieser said:
Are you willing to share the full list of problems that you solved in minif2f, as you did for putnambench?
We have updated our GitHub repository to include all of the miniF2F-test proofs we found (so far). Check it out!
They can be found here in a zip file format to avoid contamination.
Mert Ünsal (Apr 16 2025 at 10:43):
Hi @Kevin Buzzard, you can check the Appendix of the paper for some examples. There are some quite long proofs, although not every long proof is necessarily hard. We realized during RL that sometimes the model learned to enumerate, e.g. just show that a property is not satisfied for integers up to 68 and is satisfied for 69, so the minimum integer that satisfies the property is 69.
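As a toy illustration of that enumeration pattern (a made-up example, not one of the released proofs): the minimality half of such a proof just grinds through every smaller candidate instead of arguing mathematically.
import Mathlib

-- toy claim: 69 is the least positive number divisible by both 3 and 23
example : 69 % 3 = 0 ∧ 69 % 23 = 0 := by decide

-- the "enumerate" half: rule out 1, 2, ..., 68 one case at a time
example (n : ℕ) (h1 : 0 < n) (h2 : n < 69) : ¬ (n % 3 = 0 ∧ n % 23 = 0) := by
  interval_cases n <;> decide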
Mario Carneiro (Apr 16 2025 at 10:43):
the real question is how well things scale
In some sense I don't care how many millions or billions were spent, or how many goes it took
But the amount of billions it took and the amount of time it takes to get the results is literally the answer to how it is scaling
Kevin Buzzard (Apr 16 2025 at 10:44):
I know we're a long way from nontrivial proofs right now (FrontierMath is nontrivial numbers, and all the Lean databases which exist so far are what I would call trivial proofs) but hopefully some nontrivial proof databases will appear in the future.
Jason Rute (Apr 16 2025 at 10:45):
You mean as benchmarks?
Kevin Buzzard (Apr 16 2025 at 10:45):
But the amount of billions it took and the amount of time it takes to get the results is literally the answer to how it is scaling
No, monkeys scale exponentially badly with proof length. Scaling is about the word "exponential", not the total amount it took to do X.
Mario Carneiro (Apr 16 2025 at 10:46):
It's exponential in every proposed method, the thing that is changing is the base and constant of the exponent
Mario Carneiro (Apr 16 2025 at 10:47):
the changes in the base are the really interesting "algorithmic" changes we would like to detect, but changes in the constant are confounding factors when someone decides to randomly pour a huge compute budget into a given project
Kevin Buzzard (Apr 16 2025 at 10:47):
Jason Rute said:
You mean as benchmarks?
Yes, for example
looks like an interesting proposal. When it comes to proofs, right now no system can do anything which I couldn't do when I was a schoolkid, which is a problem.
Mario Carneiro (Apr 16 2025 at 10:49):
I wish this was reported in something more neutral and harder to hack, e.g. number of FLOPS
Mario Carneiro (Apr 16 2025 at 10:50):
because ultimately I want to know what I can do with consumer grade hardware in a bounded period of time and energy
Jason Rute (Apr 16 2025 at 10:52):
As I said in my Last Mile talk, the best test is just to try it in real life. If they open source this model (or release it through an API) , you can see if it actually helps with real world theorem proving.
Mert Ünsal (Apr 16 2025 at 10:52):
One thing we want to experiment with is doing test-time-RL: GDM folks seem to generate variants for a problem using Gemini/general purpose LLM and we are brainstorming within the team how to do something similar. I believe one question that would be valuable to answer is:
"When trying to solve a problem/prove a theorem, what questions need to be asked to generate problems that are relevant to solving the main problem?"
Off the top of my head, one can propose some intermediate lemmas, try special cases, relax the problem, etc. It would be great if we could have a list for this.
Would be really happy to have your thoughts on this.
Zhengying Liu (Apr 16 2025 at 11:09):
miniF2F-test proofs found by Kimina-Prover Preview released here:
https://github.com/MoonshotAI/Kimina-Prover-Preview/blob/master/minif2f_test_solved.zip
Qian Hong (Apr 16 2025 at 11:19):
Zhengying Liu said:
miniF2F-test proofs found by Kimina-Prover Preview released here:
https://github.com/MoonshotAI/Kimina-Prover-Preview/blob/master/minif2f_test_solved.zip
@Kevin Buzzard I'm not an author, but I think they published the proofs found by Kimina-Prover Preview, which could answer your question about proof length. Out of curiosity, I made a histogram of the number of characters from the 196 proofs found, excluding one outlier (aime1984p7.lean). This outlier is 23,636 characters long but seems to be constructed in an unreadable way with a weird, likely redundant pattern.
number of characters in generated Lean proofs.png
Auguste Poiroux (Apr 16 2025 at 11:44):
Are there any comments in the generated proofs? This can bias the character-based approach when computing proof length
Zhengying Liu (Apr 16 2025 at 11:47):
There is a .jsonl file in the zip containing all generated chain-of-thought.
(btw, I'm an author too)
Auguste Poiroux said:
Are there any comments in the generated proofs? This can bias the character-based approach when computing proof length
Qian Hong (Apr 16 2025 at 11:48):
Auguste Poiroux said:
Are there any comments in the generated proofs? This can bias the character-based approach when computing proof length
Based on a quick estimate using grep commands, I believe about 2% of the lines in the proofs are comments.
Zhengying Liu (Apr 16 2025 at 11:49):
Auguste Poiroux said:
Are there any comments in the generated proofs? This can bias the character-based approach when computing proof length
Sorry I misunderstood. Yes there could be some comments in the generated proof. It's the model's choice. For example:
Auguste Poiroux (Apr 16 2025 at 12:02):
I went manually through the released proofs, and there are indeed some long proofs. Impressive!
Just like in the picture you shared, the proofs seem to rely a lot on have. Did you bias/prompt the model in that direction? Or did that happen spontaneously during the training?
Zhengying Liu (Apr 16 2025 at 12:22):
Auguste Poiroux said:
I went manually through the released proofs, and there are indeed some long proofs. Impressive!
Just like in the picture you shared, the proofs seem to rely a lot on have. Did you bias/prompt the model in that direction? Or did that happen spontaneously during the training?
It happens spontaneously during training :)
But the training did rely on human-annotated data for SFT, so I think the model learns the human style this way.
Mantas Baksys (Apr 16 2025 at 12:22):
(deleted)
Simon Sorg (Apr 16 2025 at 12:53):
Is the Numina REPL environment handler used for Kimina Prover available somewhere?
Thomas Zhu (Apr 16 2025 at 13:00):
Auguste Poiroux said:
I went manually through the released proofs, and there are indeed some long proofs. Impressive!
Just like in the picture you shared, the proofs seem to rely a lot on have. Did you bias/prompt the model in that direction? Or did that happen spontaneously during the training?
(I am not an author) My guess is using have allows the model to use terminal tactics like linarith, omega, and ring. Because of these tactics, perhaps the "ideal proof distribution" for high school algebraic problems is skewed toward this structure. Another potential reason is that because there is no interaction with Lean, the model tries to avoid non-terminal tactics like simp, because it is hard to predict the proof state after a simp call.
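A minimal sketch of the style Thomas describes (a toy example written for this discussion, not taken from the released proofs): each intermediate fact is stated with a have and closed on the spot by a terminal tactic, so the model never has to predict the goal state a tactic like simp would leave behind.
import Mathlib

-- toy example of the `have`-heavy, terminal-tactic style
example (a b : ℤ) (ha : a = 3) (hb : b = 2) : a + b = 5 := by
  have h1 : a + b = 3 + 2 := by rw [ha, hb]
  omega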
Mantas Baksys (Apr 16 2025 at 13:18):
Simon Sorg said:
Is the Numina REPL environment handler used for Kimina Prover available somewhere?
We are hoping to release this very soon, stay tuned!
Kevin Buzzard (Apr 16 2025 at 13:24):
Auguste Poiroux said:
I went manually through the released proofs, and there are indeed some long proofs. Impressive!
I just did the same and this was also my reaction! I was expecting 5-liners. Just to make absolutely sure I've got this right: you're saying that in the code below, the system was given the first 11 lines (statement of question) and it wrote the next 100+ itself without any further human input specific to the question? Is there a possibility of data leakage, e.g. is it possible that the training data for the model included a Lean proof of this?
import Mathlib
import Aesop
set_option maxHeartbeats 0
open BigOperators Real Nat Topology Rat
theorem aime_1983_p1 (x y z w : ℕ) (ht : 1 < x ∧ 1 < y ∧ 1 < z) (hw : 0 ≤ w)
    (h0 : Real.log w / Real.log x = 24) (h1 : Real.log w / Real.log y = 40)
    (h2 : Real.log w / Real.log (x * y * z) = 12) : Real.log w / Real.log z = 60 := by
  have hx_pos : (x : ℝ) > (1 : ℝ) := by exact_mod_cast ht.left
  have hy_pos : (y : ℝ) > (1 : ℝ) := by exact_mod_cast ht.right.left
  have hz_pos : (z : ℝ) > (1 : ℝ) := by exact_mod_cast ht.right.right
  have h31 : Real.log x ≠ 0 := by
    have h_log_x_pos : Real.log x > 0 := by
      apply Real.log_pos
      norm_num
      linarith
    linarith
  have h41 : Real.log y ≠ 0 := by
    have h_log_y_pos : Real.log y > 0 := by
      apply Real.log_pos
      norm_num
      linarith
    linarith
  have h51 : Real.log (x * y * z) ≠ 0 := by
    have h_log_x_y_z_pos : Real.log (x * y * z) > 0 := by
      have h1 : Real.log (x * y * z) = Real.log (x * y) + Real.log z := by
        rw [Real.log_mul (by positivity) (by positivity)]
      have h2 : Real.log (x * y) = Real.log x + Real.log y := by
        rw [Real.log_mul (by positivity) (by positivity)]
      have h3 : Real.log x > 0 := by
        apply Real.log_pos
        norm_num
        linarith
      have h4 : Real.log y > 0 := by
        apply Real.log_pos
        norm_num
        linarith
      have h5 : Real.log z > 0 := by
        apply Real.log_pos
        norm_num
        linarith
      have h6 : Real.log (x * y) > 0 := by linarith [h3, h4]
      have h7 : Real.log (x * y * z) > 0 := by linarith [h1, h6, h5]
      linarith
    linarith
  have h3 : Real.log w = 24 * Real.log x := by
    have eq1 : Real.log w = 24 * Real.log x := by
      field_simp [h31] at h0
      linarith
    exact eq1
  have h4 : Real.log w = 40 * Real.log y := by
    have eq1 : Real.log w = 40 * Real.log y := by
      field_simp [h41] at h1
      linarith
    exact eq1
  have h5 : Real.log w = 12 * Real.log (x * y * z) := by
    have eq1 : Real.log w = 12 * Real.log (x * y * z) := by
      field_simp [h51] at h2
      linarith
    exact eq1
  have h6 : Real.log (x * y * z) = Real.log x + Real.log y + Real.log z := by
    rw [Real.log_mul (by positivity) (by positivity), Real.log_mul (by positivity) (by positivity)]
    all_goals linarith
  rw [h6] at h5
  have h7 : Real.log z ≠ 0 := by
    have hz1 : (z : ℝ) > (1 : ℝ) := by exact_mod_cast ht.right.right
    have h_pos : Real.log z > 0 := by
      apply Real.log_pos
      norm_num
      linarith
    linarith
  have h8 : Real.log z = (2 / 3) * Real.log y := by
    have eq1 : Real.log x = (5 / 3) * Real.log y := by
      have h10 : Real.log w = 24 * Real.log x := by linarith [h3]
      have h11 : Real.log w = 40 * Real.log y := by linarith [h4]
      have h12 : 24 * Real.log x = 40 * Real.log y := by linarith [h10, h11]
      linarith
    have eq2 : Real.log x = Real.log y + Real.log z := by
      have h13 : Real.log w = 24 * Real.log x := by linarith [h3]
      have h14 : Real.log w = 12 * (Real.log x + Real.log y + Real.log z) := by linarith [h5]
      have h15 : 24 * Real.log x = 12 * (Real.log x + Real.log y + Real.log z) := by linarith [h13, h14]
      have h16 : Real.log x = Real.log y + Real.log z := by
        nlinarith
      exact h16
    have eq3 : Real.log z = (2 / 3) * Real.log y := by
      linarith [eq1, eq2]
    linarith
  have h9 : Real.log w / Real.log z = 60 := by
    have h21 : Real.log z ≠ 0 := by exact h7
    have h23 : Real.log w = 40 * Real.log y := by linarith [h4]
    have h24 : Real.log z = (2 / 3) * Real.log y := by linarith [h8]
    have h25 : Real.log w / Real.log z = 60 := by
      have h26 : Real.log z ≠ 0 := by exact h7
      have h27 : Real.log w = 60 * Real.log z := by
        linarith [h23, h24]
      have h28 : Real.log w / Real.log z = 60 := by
        field_simp [h26]
        linarith [h27]
      exact h28
    exact h25
  exact h9
Kevin Buzzard (Apr 16 2025 at 13:29):
(off-topic: what's with the (w : ℕ) (hw : 0 ≤ w) in the statement?!)
Jason Rute (Apr 16 2025 at 13:31):
Kevin Buzzard said:
Is there a possibility of data leakage, e.g. is it possible that the training data for the model included a Lean proof of this?
If not this time, maybe next time since you just posted a Lean proof of aime_1983_p1 on the internet. :smile:
Kevin Buzzard (Apr 16 2025 at 13:31):
line 74 is an all_goals linarith which gets flagged by a linter because there are no goals left at that point. An LLM could just as easily have written linarith, which would have created an instance of the best error in the game ("no goals to be solved" aka "quit already! You're done!"). At the very least a machine should surely learn to recognise that one and delete the remaining lines of code in that block and try again.
Zhengying Liu (Apr 16 2025 at 13:32):
Kevin Buzzard said:
Just to make absolutely sure I've got this right: you're saying that in the code below, the system was given the first 11 lines (statement of question) and it wrote the next 100+ itself without any further human input specific to the question?
Yes :)
Zhengying Liu (Apr 16 2025 at 13:34):
Although before generating the Lean 4 code, there is a (often long) chain-of-thought to generate beforehand
Mantas Baksys (Apr 16 2025 at 13:34):
Kevin Buzzard said:
Auguste Poiroux said:
I went manually through the released proofs, and there are indeed some long proofs. Impressive!
I just did the same and this was also my reaction! I was expecting 5-liners. Just to make absolutely sure I've got this right: you're saying that in the code below, the system was given the first 11 lines (statement of question) and it wrote the next 100+ itself without any further human input specific to the question? Is there a possibility of data leakage, e.g. is it possible that the training data for the model included a Lean proof of this?
[full Lean proof of aime_1983_p1 as posted above]
I would encourage you to look at the proofs in the appendix of our report, I'm sure you will find them even more impressive!
Jia Li (Apr 16 2025 at 13:43):
Simon Sorg said:
Is the Numina REPL environment handler used for Kimina Prover available somewhere?
Will be released very soon (in 1-2 days). Will keep you posted
Jia Li (Apr 16 2025 at 13:57):
Kevin Buzzard said:
Auguste Poiroux said:
I went manually through the released proofs, and there are indeed some long proofs. Impressive!
I just did the same and this was also my reaction! I was expecting 5-liners. Just to make absolutely sure I've got this right: you're saying that in the code below, the system was given the first 11 lines (statement of question) and it wrote the next 100+ itself without any further human input specific to the question? Is there a possibility of data leakage, e.g. is it possible that the training data for the model included a Lean proof of this?
[full Lean proof of aime_1983_p1 as posted above]
I think this model is actually capable of writing code without compiler feedback. It is not overfitting:
- we have a lot of problems (outside of minif2f) where we only formalised the statement, and the model manages to find very long proofs (we will release the data when ready)
- a lot of the problems in minif2f solved by the model have never been formalized by anyone on the internet
We are preparing a demo so that you can test the model yourself! With some new statements!
Kevin Buzzard (Apr 16 2025 at 14:05):
Will it cost me $1,000,000 per question?
Jia Li (Apr 16 2025 at 14:21):
Kevin Buzzard said:
Will it cost me $1,000,000 per question?
It is going to be free, haha. No worries, Numina is still non-profit. The first version of the demo might be a bit simple; please provide feedback.
Johan Commelin (Apr 16 2025 at 14:38):
Free is awesome. But I think Kevin was also asking about the raw compute costs. Hardware, and the electricity bill.
Aaron Liu (Apr 16 2025 at 14:41):
Will it cost you $1,000,000 per question?
Michael Stoll (Apr 16 2025 at 14:42):
What I find remarkable (and annoying when reading the proofs) is that many steps are redundant, e.g.
have h3 : Real.log w = 24 * Real.log x := by
  have eq1 : Real.log w = 24 * Real.log x := by
    field_simp [h31] at h0
    linarith
  exact eq1
and a lot more in the proof Kevin shared.
Is there a reasonable way to avoid this? Maybe train another model to improve proofs? (I assume it may be a bit tricky to generate a good reward signal. There was a discussion some time ago on the "optimal" way to define real numbers (and prove the basic properties), but IIRC there was no consensus reached on what the measure should be [I can't find this thread now].)
Michael Stoll (Apr 16 2025 at 14:44):
Here is the proof in a form closer to what I would hope to see eventually from these models:
import Mathlib
theorem aime_1983_p1 (x y z w : ℕ) (ht : 1 < x ∧ 1 < y ∧ 1 < z) (hw : 0 ≤ w)
    (h0 : Real.log w / Real.log x = 24) (h1 : Real.log w / Real.log y = 40)
    (h2 : Real.log w / Real.log (x * y * z) = 12) : Real.log w / Real.log z = 60 := by
  have hx_pos : (x : ℝ) > (1 : ℝ) := by exact_mod_cast ht.left
  have hy_pos : (y : ℝ) > (1 : ℝ) := by exact_mod_cast ht.right.left
  have hz_pos : (z : ℝ) > (1 : ℝ) := by exact_mod_cast ht.right.right
  have hx_1 : 0 < Real.log x := Real.log_pos hx_pos
  have hy_1 : 0 < Real.log y := Real.log_pos hy_pos
  have hz_1 : 0 < Real.log z := Real.log_pos hz_pos
  have h6 : Real.log (x * y * z) = Real.log x + Real.log y + Real.log z := by
    rw [Real.log_mul (by positivity) (by positivity),
      Real.log_mul (by positivity) (by positivity)]
  have hxyz_1 : 0 < Real.log (x * y * z) := by
    rw [h6]
    linarith
  have h3 : Real.log w = 24 * Real.log x := by
    field_simp [hx_1.ne'] at h0
    exact h0
  have h4 : Real.log w = 40 * Real.log y := by
    field_simp [hy_1.ne'] at h1
    exact h1
  have h5 : Real.log w = 12 * Real.log (x * y * z) := by
    field_simp [hxyz_1] at h2
    exact h2
  rw [h6] at h5
  field_simp [hz_1.ne']
  linarith
Yaël Dillies (Apr 16 2025 at 14:44):
The good thing about Lean is that syntax can be traversed. One could "simply" look for common redundant combinations (like by exact, or have := by have := exact) and remove them.
Yaël Dillies (Apr 16 2025 at 14:44):
This does not have to be AI!
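As a concrete illustration (hand-written here, not the output of any existing tool), this is the rewrite such a syntactic pass would perform on the step Michael highlighted above:
-- before: the step is wrapped in a redundant `have eq1 ... exact eq1`
have h3 : Real.log w = 24 * Real.log x := by
  have eq1 : Real.log w = 24 * Real.log x := by
    field_simp [h31] at h0
    linarith
  exact eq1

-- after: the wrapper is dropped; the tactics that do the work are unchanged
have h3 : Real.log w = 24 * Real.log x := by
  field_simp [h31] at h0
  linarith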
Michael Stoll (Apr 16 2025 at 14:47):
It gets trickier when statements are repeated in different sub-proofs, for example. Or when several similar sub-proof can be extracted as a more general separate lemma etc.
Jia Li (Apr 16 2025 at 15:02):
Michael Stoll said:
What I find remarkable (and annoying when reading the proofs) is that many steps are redundant, e.g.
have h3 : Real.log w = 24 * Real.log x := by
  have eq1 : Real.log w = 24 * Real.log x := by
    field_simp [h31] at h0
    linarith
  exact eq1
and a lot more in the proof Kevin shared.
Is there a reasonable way to avoid this? Maybe train another model to improve proofs? (I assume it may be a bit tricky to generate a good reward signal. There was a discussion some time ago on the "optimal" way to define real numbers (and prove the basic properties), but IIRC there was no consensus reached on what the measure should be [I can't find this thread now].)
This is a known issue. We have some ideas for how to fix it.
The reason this happened is that the model is only rewarded for finding correct proofs, and having redundant tactics increases the chance of solving the problem (the model does not have access to Lean during generation). But we have some great ideas for how to improve this; hopefully they can be realized in the next release.
Thomas Zhu (Apr 16 2025 at 15:04):
Michael Stoll said:
Here is the proof in a form closer to what I would hope to see eventually from these models:
[Michael Stoll's cleaned-up Lean proof as posted above]
This can be further golfed: the lines starting from "have h3" can be simply replaced with field_simp [*] at *; linarith.
I think the procedure for minimizing a proof like this should be in scope for ImProver.
Bolton Bailey (Apr 16 2025 at 19:17):
Given the strange proofs that LLM-based provers often give, I feel there might be value in developing a golfing program that attempts to take a working lean proof and tidy it up to something easily understandable.
Mario Carneiro (Apr 16 2025 at 19:18):
I would like to run that on some human proofs too ;)
Auguste Poiroux (Apr 16 2025 at 19:42):
When reading the CoT of Kimina-Prover, it sometimes comes up with clever ideas. But are these ideas really generated out of the blue by the model? Given that most of the problems in miniF2F have informal solutions online, it is likely that these informal proofs were memorized.
In the end, evaluating on miniF2F is probably measuring proof autoformalization capabilities, and not theorem proving. What do you think?
Even if this is the case, it is still impressive in my opinion. It would be interesting to test this hypothesis by running the model on new problems of the same difficulty.
Kevin Buzzard (Apr 16 2025 at 20:31):
I think that autoformalization is a hard problem and even if the machine has read a human proof it's certainly not a given that it can translate it into Lean -- look at all the AI slop which has been posted recently on this Zulip to see that! If we can solve autoformalization then we can start feeding in ArXiv and then we're one step closer to game over.
llllvvuu (Apr 16 2025 at 20:38):
re: using Lean feedback, it’s worth noting that tree search doesn’t imply “using Lean feedback” in the obvious (to a human) way: that is, prompting with feedback. Rather, it implies backtracking on errors (and optionally, updating tree statistics and/or penalization during training).
Auguste Poiroux (Apr 16 2025 at 20:46):
I do agree that autoformalization is a hard problem (this is my current research topic ^^), and solving it would be amazing (happy to chat about this topic to anyone who is interested by the way).
My point is that proof autoformalization is likely easier to achieve than automated theorem proving because the high-level thinking (planning) is already done. Let's say that in the next iteration, the Kimina model is capable of solving all the IMO problems that have been released so far. Does it mean that the model will be capable of solving new IMO problems? If the model does indeed perform proof autoformalization using memorized informal proofs, then probably not.
This is probably a subtle distinction, but I believe it is important to have it in mind. If the Kimina team plans to participate in the IMO, I believe they should probably measure this to avoid a (potential) bad surprise.
But again, I do agree that having good autoformalizers is super interesting, and I am all for it :)
Auguste Poiroux (Apr 16 2025 at 20:58):
For example, page 12, section D.1 of the report:
This proof shows an excellent ability of the model to plan its proof sketch. Such emergent behavior in the output shows a clear improvement in reasoning ability compared to previous models.
If the model memorized the informal proof, then the model didn't really plan.
I don't want to make enemies though ^^ To the Kimina team: your model is amazing, I truly think this is an excellent research project :heart: I am just interested in well-motivated claims, and in helping to do rigorous research.
Jason Rute (Apr 16 2025 at 21:58):
As far as memorization of the informal proof, that is certainly possible. I have been warning of this for a while. But at the same time models trained exactly like this, e.g. DeepSeek-r1, do pretty well on AIME problems (using majority voting). A big question is if that “reasoning” can be made sound or they always short-cut to find a solution. I think we are close to finding out. If one is worried about overfitting to public problems (as one should be), partner with https://matharena.ai and start testing on formal versions of all the math competitions right after they are released. (Or even have the model come up with the formalization via auto-formalization and a proof, and you only need to check the formalization, not the proof.)
Zhengying Liu (Apr 17 2025 at 01:01):
On data contamination, I think there are two points that reduced the probability/level of data leakage:
- As recorded in Section 3.1 of the paper, we specifically made a decontamination effort to avoid data leakage for miniF2F.
- We are based on Qwen2.5-72B (the pre-training base), not Qwen2.5-Math-72B, so the base model does not favor memorization of the math corpus.
Screenshot_20250417_085030.jpg
David Renshaw (Apr 17 2025 at 19:45):
Minor nitpick: is there a reason that the 1.5B param model calls itself AI-MO/Kimina-Prover-1.5B-Distill, but the 7B param model calls itself AI-MO/Qwen2.5-Math-7B-Instruct-formal-prover-ft-v7.1-020425-CSFT?
Mert Ünsal (Apr 17 2025 at 20:05):
David Renshaw said:
Minor nitpick: is there a reason that the 1.5B param model calls itself AI-MO/Kimina-Prover-1.5B-Distill, but the 7B param model calls itself AI-MO/Qwen2.5-Math-7B-Instruct-formal-prover-ft-v7.1-020425-CSFT?
It is wrong naming by me - just fixed it!
Jannis Limperg (Apr 18 2025 at 16:18):
Bolton Bailey said:
Given the strange proofs that LLM-based provers often give, I feel there might be value in developing a golfing program that attempts to take a working lean proof and tidy it up to something easily understandable.
I use something like this for the tactic proofs generated by aesop? (paper). The implemented optimisations are somewhat specific to the ways in which Aesop proofs are non-idiomatic, but the system could maybe serve as a starting point for something more general. I probably won't have time to seriously work on this any time soon, but if anyone would like to jump in, please let me know!
Jia Li (Apr 23 2025 at 15:58):
Hi! We have created a demo page where you can try our 72B model!
Let us know your feedback :)
#Machine Learning for Theorem Proving > Kimina Prover Preview Demo
Aaron Liu (Apr 23 2025 at 16:07):
I tried this with the default input
Given that the real numbers $a$ and $b$ satisfy:
\[ a^3 - 3ab^2 = 39, \quad b^3 - 3a^2b = 26 \],
prove that $ a^2 + b^2 = 13 $.
It generated the statement
import Mathlib
theorem my_favorite_theorem {a b : ℝ} (h₀ : a^3-3*a*b^2=39) (h₁ : b^3-3*a^2*b=26) :
    a^2+b^2=13 := by sorry
and then got caught in a loop when generating the proof.
image.png
Is this normal?
Aaron Liu (Apr 23 2025 at 16:12):
It seems to have generated a correct proof on the second pass (after I restarted it manually)
Mert Ünsal (Apr 23 2025 at 16:16):
The model can sometimes get into recursive loops or change languages :sweat_smile: If this happens, please try again or try running pass@16.
This effect seems to be common in RL, as the QwQ-Preview-32B model had the same issue. We are aware of this and are working on it for the full release.
Kevin Buzzard (Apr 23 2025 at 16:20):
I also saw this looping phenomenon when asking the system to prove that the sum of the last three digits of 5^100 was 13.
Edison Xie (Apr 23 2025 at 16:54):
it fails IMO 2002 Problem 1 :laughing:
Kevin Buzzard (Apr 23 2025 at 17:08):
Oh its answers can change! So you mean "it failed once".
Jia Li (Apr 23 2025 at 17:28):
Edison Xie said:
it fails IMO 2002 Problem 1 :laughing:
This model is still very far from solving any IMO-level problems (though it could be a good starting point).
Here I have put up a list of problems for you to get an idea of what the model can solve with a small number of passes (a low compute budget).
https://docs.google.com/document/d/1I3Vr-Np_4Fk20rJhpcItD9OY6UKOi1UyeNjivgx54U8/edit?usp=sharing
These problems' statements are not even formalized in our database.
When using this model, I feel its formal capacity, with a low compute budget, is a bit like GPT-4o's informal math capacity 6 months ago. Hopefully we can make some more progress soon.
Jia Li (Apr 23 2025 at 17:34):
Kevin Buzzard said:
the sum of the last three digits of 5^100 was 13.
I suppose you did not use the autoformalized statement :) Otherwise it worked.
It is a headache; the model relies too much on computational brute force instead of "solving it mathematically".
image.png
Kevin Buzzard (Apr 23 2025 at 17:35):
Someone was posting homework problems on another thread and I gave one of them (the 35 question) to the model, and it got it very quickly :-)
Kevin Buzzard (Apr 23 2025 at 17:35):
Jia Li said:
Kevin Buzzard said:
the sum of the last three digits of 5^100 was 13.
I suppose you did not use the autoformalized statement :) Otherwise it worked.
It is a headache; the model relies too much on computational brute force instead of "solving it mathematically".
image.png
Now that is surprising because 5^100 % 1000 is 625.
Jia Li (Apr 23 2025 at 17:38):
Kevin Buzzard said:
Jia Li said:
Kevin Buzzard said:
the sum of the last three digits of 5^100 was 13.
I suppose you did not use the autoformalized statement :) Otherwise it worked.
It is a headache; the model relies too much on computational brute force instead of "solving it mathematically".
image.png
Now that is surprising because 5^100 % 1000 is 625.
I had not even noticed! We just discovered a bug. It is probably due to our Lean verification. Let's try to fix it.
Kevin Buzzard (Apr 23 2025 at 18:00):
I should say that for the 35 question I independently checked that the machine's answer compiled, and it did!
Jia Li (Apr 23 2025 at 18:24):
Kevin Buzzard said:
I should say that for the 35 question I independently checked that the machine's answer compiled, and it did!
There is a button that brings you to https://live.lean-lang.org/ with the code, if you have not already noticed.
Jason Rute (Apr 23 2025 at 22:29):
Given the bug you just mentioned, did you use just that lean verifier, or also standard lean (or better yet Lean with an external checker) to verify the proofs used to calculate your benchmarks?
Bolton Bailey (Apr 24 2025 at 00:44):
Jason Rute said:
Given the bug you just mentioned, did you use just that lean verifier, or also standard lean (or better yet Lean with an external checker) to verify the proofs used to calculate your benchmarks?
While I didn't work on the benchmarking in the paper, I just checked myself that the files in the zip archive provided above compile fine.
It is pretty weird that the demo UI would say the proof was valid there, especially since I was trying it before and getting some proofs resulting in failure - clearly it must be checking something. Perhaps the issue is that it's not a tactic failure but rather unsolved goals?
Jia Li (Apr 24 2025 at 01:44):
Jason Rute said:
Given the bug you just mentioned, did you use just that lean verifier, or also standard lean (or better yet Lean with an external checker) to verify the proofs used to calculate your benchmarks?
It is not related to the Lean verification. For some reason, the backend of this demo shows the proof as valid when there is a timeout (too many queries). This should not impact the benchmark numbers. It should be very easy to fix.
To answer your question, we use the Lean REPL (4.15) to check whether a proof is valid. We have wrapped it in a FastAPI server to make it scalable. We are going to release this Lean server soon.
Jia Li (Apr 24 2025 at 01:50):
Kevin Buzzard said:
I also saw this looping phenomenon when asking the system to prove that the sum of the last three digits of 5^100 was 13.
I finally get what you meant here. The model tends to self-verify if the result is uncertain. However, we have only trained the model to solve the problem, so it will try to solve it anyway (a better behavior would be to point out the error and fix it)
Jia Li (Apr 24 2025 at 01:51):
Jason Rute said:
Given the bug you just mentioned, did you use just that lean verifier, or also standard lean (or better yet Lean with an external checker) to verify the proofs used to calculate your benchmarks?
image.png
Now the server seems to work fine
And sometimes the model even manages to correct itself
image.png
Kim Morrison (Apr 28 2025 at 07:46):
Jia Li said:
To answer your question, we use the Lean REPL (4.15) to check whether a proof is valid. We have wrapped it in a FastAPI server to make it scalable. We are going to release this Lean server soon
@Jia Li is this something it would make sense to integrate directly into the REPL?
Jia Li (Apr 28 2025 at 09:27):
Kim Morrison said:
Jia Li said:
To answer your question, we use the Lean REPL (4.15) to check whether a proof is valid. We have wrapped it in a FastAPI server to make it scalable. We are going to release this Lean server soon
Jia Li is this something it would make sense to integrate directly into the REPL?
https://github.com/project-numina/kimina-lean-server
We have released our Lean server. We would love to discuss how to make the Lean REPL better. We first tried to use LeanDojo and encountered some issues when querying at scale. Do you know someone we can talk to? We need to make improvements for large-scale RL
Bashar Hamade (Apr 28 2025 at 13:48):
Kim Morrison said:
What would need to happen before Mathlib users could write
by kimina
(and get a "Try this" suggestion)?
Kim Morrison (Apr 28 2025 at 17:33):
I had a similar question for @Robert Joseph regarding LeanCopilot.
Kim Morrison (Apr 28 2025 at 17:36):
My vision here has always been that we create a mostly-model-agnostic repository, that provides a tactic for AI proof generation/search.
The first time a user runs the tactic, it would error with "no backend selected", and give an instruction for an in-VSCode command to select a preferred backend (e.g. Model X via a server, Model Y via local inference), and to do the required setup --- and the bar for inclusion would be that this setup could be done without using the command line.
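(Purely as an illustration of the "no backend selected" behaviour, here is a minimal Lean sketch; the tactic name ai_prove, the option aiProver.backend, and the wording are invented for this example and are not an existing package:)
import Lean
open Lean Elab Tactic

-- hypothetical option naming the chosen backend; empty means none configured yet
register_option aiProver.backend : String := {
  defValue := ""
  descr := "name of the AI prover backend to query"
}

-- stub tactic: refuse to run until a backend has been selected
elab "ai_prove" : tactic => do
  let backend := aiProver.backend.get (← getOptions)
  if backend.isEmpty then
    throwError "no backend selected: choose one via the editor command (or set aiProver.backend) and retry"
  else
    -- a real implementation would query the backend here and emit a "Try this:" suggestion
    throwError "backend '{backend}' is configured, but this sketch does not perform inference"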
Kim Morrison (Apr 28 2025 at 17:38):
This repository could then be considered for adding to Mathlib's lakefile --- I would help make the case for this.
Kim Morrison (Apr 28 2025 at 17:39):
When new models come along, the authors would make the PR to the upstream repository to include themselves in the list of available options, and add appropriate installation scripts.
Kim Morrison (Apr 28 2025 at 17:40):
(adding to above: further criteria for inclusion would be cross platform support, not putting anything outside certain approved directories, not assuming even python was available, and presumably some minimal standards for actually proving things)
Kim Morrison (Apr 28 2025 at 17:41):
Making all this happen will definitely require input and contributions from the teams releasing open models. But the Mathlib and Lean communities can help too.
Kim Morrison (Apr 28 2025 at 17:41):
I think without doing something like this we'll stay stuck in the present situation where essentially zero working mathematicians even try out the new models.
Kim Morrison (Apr 28 2025 at 17:43):
In fact, if we can get this system working smoothly, I think it would be reasonably likely "we" (no specific promises) could line up funding for inference servers for all Mathlib (Lean?) users.
Bolton Bailey (Apr 28 2025 at 20:06):
Kim Morrison said:
My vision here has always been that we create a mostly-model-agnostic repository, that provides a tactic for AI proof generation/search.
The first time a user runs the tactic, it would error with "no backend selected", and give an instruction for an in-VSCode command to select a preferred backend (e.g. Model X via a server, Model Y via local inference), and to do the required setup --- and the bar for inclusion would be that this setup could be done without using the command line.
Two examples of this or something that could become this are LLMLean and Lean Copilot. I have been pretty optimistic about the former because it uses Ollama, which seems like a convenient way to handle downloading and serving models in a cross-platform way - I've been working on getting our open models running with it. As I recall, Ollama doesn't make you use the command line to install it, but perhaps even installing an application external to lean is too much friction?
What are your/others' thoughts about this? Has anyone else on this Zulip tried using Ollama or had trouble with it? Do people think they would be willing to do this, or is it too much?
Bolton Bailey (Apr 28 2025 at 20:31):
Here, for example, is a ~4min video explaining how to install Ollama on the three major platforms. Installing something in under a minute makes it look easy, but I am cognizant of the fact that it's easy to make installation look easier than it really is, and that the activation energy of installing anything at all might be a key factor here.
Robert Joseph (Apr 28 2025 at 21:03):
Kim Morrison said:
I had a similar question for Robert Joseph regarding LeanCopilot.
Happy to discuss here with the Lean and Mathlib communities how to do this together and what it would take. For LeanCopilot we have an API that allows you to bring your own model, so in principle one can
run any model using Lean Copilot, be it LLMLean, DeepSeek, Claude, or Kimina! What it would take at the next stage to make it something like #by_suggest_model in Mathlib, so that it can automatically download some model (or whichever one the user likes, as Kim suggested), is a different story, but I'm happy to hear thoughts on what you all think would be good to implement.
Eric Wieser (Apr 29 2025 at 00:12):
Kim Morrison said:
What would need to happen before Mathlib users could write
by kimina
(and get a "Try this" suggestion)?
I think VSCode / the lean server needs support for a replacement tactic that keeps running even if the file is edited. I want to be able to set an agent on the last theorems in a file while I work on some of the earlier theorems, but today that would result in the request being cancelled on the lower ones
Eric Wieser (Apr 29 2025 at 00:13):
(incremental compilation could also solve this problem; but even then, we don't want to discard an agent-found proof just because the user added a comment after the tactic it was run on)
Robin Carlier (Apr 29 2025 at 00:17):
Kim Morrison said:
This repository could then be considered for adding to Mathlib's lakefile --- I would help make the case for this.
I hope if this ever happens, there will be ways (and instructions) for users to remove this entry from their lakefile.
I may be old-school for this, but I do not feel comfortable with lines (that might get executed just by opening someone else's files, since lean's lsp just executes everything anyway, or just added by hitting tab completion wrong) in my text editor potentially querying servers I don't know anything about, or downloading and installing things on my computer.
The features you describe, such as swapping models easily, would be nice to have, but they should be opt-in, not opt-out.
I'd much, much rather have a command-line-only installation process, which I feel is a far more controlled environment.
Kim Morrison (Apr 29 2025 at 16:32):
I think it's easy to set this up so that it is both zero-configuration and opt-in, e.g. a challenge-response style interaction before any command installs anything.
Kim Morrison (Apr 29 2025 at 16:33):
Eric Wieser said:
Kim Morrison said:
What would need to happen before Mathlib users could write
by kimina
(and get a "Try this" suggestion)?I think VSCode / the lean server needs support for a replacement tactic that keeps running even if the file is edited. I want to be able to set an agent on the last theorems in a file while I work on some of the earlier theorems, but today that would result in the request being cancelled on the lower ones
This would be nice, but it certainly not necessary for useful interaction.
Robin Carlier (Apr 29 2025 at 17:09):
If you make users (and every project downstream of mathlib, who might only care about the math part of the library, but don’t necessarily want the "development environment niceties" like this in the bundle) download that repo by adding it to the lakefile, I don’t really call that opt-in. Even if it’s only just the "installer" part and requires more interactions to actually install anything.
Bolton Bailey (Apr 29 2025 at 17:31):
Robin Carlier said:
I hope if this ever happens, there will be ways (and instructions) for users to remove this entry from their lakefile.
I may be old-school for this, but I do not feel comfortable with lines (that might get executed just by opening someone else's files, since lean's lsp just executes everything anyway, or just added by hitting tab completion wrong) in my text editor potentially querying servers I don't know anything about
You might want to be aware, if you aren't already, of polyrith, which is an already-existing Lean tactic that makes queries to a server over the internet. If we are going to build out the infrastructure for opting in to internet-enabled tactics, perhaps opting in to polyrith could be added to that infrastructure.
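(For context, a small example in the style of the ones in the polyrith documentation, assuming import Mathlib.Tactic.Polyrith; the tactic ships the goal and hypotheses off to a remote Sage instance and, when it succeeds, suggests a linear_combination call that then closes the goal without any network access:)
example (x y : ℚ) (h1 : x * y + 2 * x = 1) (h2 : x = y) : x * y = -2 * y + 1 := by
  polyrith
-- expected suggestion: something like linear_combination h1 - 2 * h2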
Niels Voss (Apr 29 2025 at 17:32):
Does this have to be a tactic or could it maybe be something like a VSCode extension? That would basically be enough to make it opt-in
Niels Voss (Apr 29 2025 at 17:35):
Also, would this require downloading kimina-specific code and running it locally, or would it just query a website? If it's the former, we should probably take more precautions
Bolton Bailey (Apr 29 2025 at 17:37):
Niels Voss said:
Does this have to be a tactic or could it maybe be something like a VSCode extension? That would basically be enough to make it opt-in
Possibly! One interesting alternative for VSCode users is perhaps some kind of customization to GitHub Copilot, which I believe was recently made free. Copilot is based on its own extension, so hopefully people installing it have a clear idea of what they are agreeing to.
Robin Carlier (Apr 29 2025 at 17:37):
Bolton Bailey said:
Robin Carlier said:
I hope if this ever happens, there will be ways (and instructions) for users to remove this entry from their lakefile.
I may be old-school for this, but I do not feel comfortable with lines (that might get executed just by opening someone else's files, since lean's lsp just executes everything anyway, or just added by hitting tab completion wrong) in my text editor potentially querying servers I don't know anything about
You might want to be aware, if you aren't already, of polyrith, which is an already-existing Lean tactic that makes queries to a server over the internet. If we are going to build out the infrastructure for opting in to internet-enabled tactics, perhaps opting in to polyrith could be added to that infrastructure.
I am aware, and I don’t like it either for that matter.
Same goes for basically everything in the current lakefile that provides features that help development for mathlib (tactic search, diagnostics, etc.) but are not strictly required for actually building mathlib (i.e., you can remove these entries and the few files in the Tests and Tactic folders that mention them, and mathlib would still build perfectly fine).
But I believe this is straying off-topic, and I don’t intend to derail this any further.
Niels Voss (Apr 29 2025 at 17:39):
Bolton Bailey said:
Niels Voss said:
Does this have to be a tactic or could it maybe be something like a VSCode extension? That would basically be enough to make it opt-in
Possibly! One interesting alternative for VSCode users is perhaps some kind of customization to GitHub Copilot, which I believe was recently made free. Copilot is based on its own extension, so hopefully people installing it have a clear idea of what they are agreeing to.
This is a possibility, but that would also prevent people from running it on VSCodium and would mean that a NeoVim and Emacs version could not be easily made
Niels Voss (Apr 29 2025 at 17:52):
I like Kim Morrison's generic LLM interface idea but like Robin Carlier I'm a bit concerned that this makes it a bit too easy to install proprietary local code, and might even give the impression that the leanprover community or the Lean FRO endorses such code.
Niels Voss (Apr 29 2025 at 17:54):
Maybe to install local LLMs, users should be required to install them like they would a regular desktop program? Windows users probably think more about the risks involved when downloading and running an exe file than when they click a button in a VSCode extension
Kim Morrison (Apr 30 2025 at 11:17):
Personally I think it's extremely unlikely that we will end up with users wanting to install local LLMs, because their performance will be so much worse than what is available via hosted models running on real hardware --- but my suggestions above for a generic interface included that possibility because some people have expressed preference for using local models.
I think for a first implementation of this we should just aim at providing a generic interface only for remote models.
suhr (Apr 30 2025 at 12:11):
I only use local models, no matter what cloud LLMs have to offer. Fortunately, models one can run on an RTX 4090 are not too bad.
Robin Carlier (Apr 30 2025 at 13:28):
A backend of a "generic" interface querying an Oollama server would probably be not that different from a remote backend in terms of actual implementation, and would let people run it locally.