Stream: Machine Learning for Theorem Proving

Topic: LeanAgent

Gridiron Player (Oct 11 2024 at 21:51):

Check out these stats guys
IMG_1498.png

https://x.com/animaanandkumar/status/1844756761510859034

I guess somebody is already deep into this.

Not sure if its real.

Vincent Beffara (Oct 11 2024 at 22:18):

The examples in the paper are not really impressive. E.g., Advanced Abstract Algebra: LeanAgent shows significant advancement in proving a key result in abstract algebra, wedderburn (Wedderburn’s Little Theorem), which is a profound result in abstract algebra, stating that every finite division ring is a field. Sounds good but the proof is

theorem wedderburn (h: Fintype R): IsField R :=
  apply Field.toIsField

which is likely found by exact?. The core of the proof is in the instance docs#littleWedderburn in Mathlib, the proof in the example is empty.

And the next paragraph in the paper says LeanAgent’s proof of the wedderburn theorem impressively represents a deep understanding of algebraic structures. By using the Field.toIsField premise, LeanAgent shows that it has grasped what makes a ring a field and can apply this knowledge efficiently. This requires a comprehensive understanding of ring theory and field properties. Impressively, the human proof of wedderburn written after LeanAgent’s proof contains 175 lines, while LeanAgent’s proof is only one line. This suggests that LeanAgent has developed a deep, intuitive grasp of abstract algebra.

This feels like it was written by a LLM...

Notification Bot (Oct 11 2024 at 23:05):

2 messages were moved here from #Machine Learning for Theorem Proving > Proper LLM Lean4 Integration with recursive checks for error by Jason Rute.

Jason Rute (Oct 11 2024 at 23:06):

Paper: https://arxiv.org/abs/2410.06209

Jason Rute (Oct 11 2024 at 23:13):

It is a weird paper for sure, but they also claim they solve 225 theorems in MiniF2F. That seems like a lot for either test or train, since there are just 244 theorems in both. Maybe it is out of all problems, or maybe it is just the ones marked sorry? 46% isn't bad, especially if they are using a model similar to ReProver, but I feel like I'd like to read all the details since the paper is very different from others.

Jason Rute (Oct 11 2024 at 23:17):

Maybe the authors want to respond: @Adarsh Kumarappan, Mo Tiwari, @Peiyang Song, @Robert Joseph, Chaowei Xiao, Anima Anandkumar

fzyzcjy (Oct 11 2024 at 23:23):

In appendix, seems they pick all theorems with sorry, and ReProver does 85 while the new method does 99.

~~Not very sure where "225 vs 96" in table 2 comes from though...~~ Looks like "225" is the "TPPS" score, which is defined in 4.2 as "#reprover proved theorems + 10 x #newly proved theorems + 1"

image.png

Mario Carneiro (Oct 11 2024 at 23:34):

The TPPS scoring would imply that the difference between the ReProver baseline and the LeanAgent score should always be a multiple of 10, but 225 - 86 is not a multiple of 10. The other examples are, so this seems like an error. (This also seems like a very convoluted and misleading way to report theorems proved.)

Jason Rute (Oct 11 2024 at 23:35):

Where do these 162 “previously unproved by humans” theorems come from? Are they including MiniF2F, MIL as “previously unproved”?

Mario Carneiro (Oct 11 2024 at 23:36):

it seems to be defined as "sorry theorems"

Mario Carneiro (Oct 11 2024 at 23:36):

so yes

Mario Carneiro (Oct 11 2024 at 23:39):

1+1+1+2+15+3+86+25+2+2 = 138 is the reprover baseline, and 1+1+1+1+7+1+14+3= 29 new theorems are proved (based on TPPS differences, give or take one error), which sums to 167 which is within striking distance of the metric

Mario Carneiro (Oct 11 2024 at 23:40):

in other words it's the total number of theorems proved across all repositories tested

Mario Carneiro (Oct 11 2024 at 23:43):

Interestingly, LeanAgent has a strong enough understanding of logical manipulation within the PFR repository to prove 5 sorry theorems with a 0 = 1 placeholder theorem statement (not included in TPPS) (details in Appendix A.5).

In other words, it proved false. I guess there is something funny going on in the PFR repo?

Mario Carneiro (Oct 11 2024 at 23:44):

Oh, no I think it's simpler than that, it used one False to prove another one

Terence Tao (Oct 12 2024 at 01:10):

Yes, this was an unanticipated weakness of using 0=1 as a placeholder for a statement that we had not gotten around to formalizing yet. In retrospect proof_wanted <desired_theorem_name> : 0 = 1 would have been slightly better than theorem <desired_theorem_name> : 0 = 1 := by sorry to avoid this sort of unintended side effect (or can proof_wanted statements also be used in other proofs?)

Jason Rute (Oct 12 2024 at 01:27):

Something like this snuck through our eval of Graph2Tac. Our agent figured out it could use a blatantly false axiom to prove many of the theorems in the TLC Coq package we used for testing. We had to remove TLC from our benchmark because of it. :cry: (Appendix Q in https://arxiv.org/pdf/2401.02949.)

Adarsh Kumarappan (Oct 12 2024 at 05:11):

Mario Carneiro said:

The TPPS scoring would imply that the difference between the ReProver baseline and the LeanAgent score should always be a multiple of 10, but 225 - 86 is not a multiple of 10. The other examples are, so this seems like an error. (This also seems like a very convoluted and misleading way to report theorems proved.)

Thank you very much for the valuable feedback. You are correct, the number should be 226. We are updating the preprint now.

Mukesh Tiwari (Oct 12 2024 at 06:23):

Jason Rute said:

Maybe the authors want to respond: Adarsh Kumarappan, Mukesh Tiwari, Peiyang Song, Robert Joseph, Chaowei Xiao, Anima Anandkumar

Now I see why I was receiving the messages from this thread. Unfortunately, I am not one of the authors so I can’t comment :)

Mario Carneiro (Oct 12 2024 at 09:35):

@Adarsh Kumarappan could you comment on how the number 162 was derived? With your correction it should be 167 if I have correctly guessed the calculation method, but it's possible you are counting something else?

Jason Rute (Oct 12 2024 at 20:09):

@Adarsh Kumarappan Do the sorry theorems you try to prove only include theorems, or also sorry definitions?

Charles Walker (Oct 14 2024 at 14:47):

Section 4.2:

So, we propose a Theorem Proving Performance Score (TPPS) that emphasizes newly proven sorry theorems. Specifically, LeanAgent TPPS = (# ReProver Theorems Proved)+(# New Theorems Proved∗X)+1, where X represents the importance of proving a new theorem, and ReProver TPPS = (# ReProver Theorems Proved)+1. Then, Improvement Factor = (LeanAgent TPPS)/(ReProver TPPS). We choose X = 10, which is relatively modest considering the large difficulty gap between basic arithmetic and abstract algebra.

If I'm understanding correctly, the authors designed a new evaluation metric, which essentially allows them to magnify their advantage over the baseline by an arbitrary factor X? They choose X = 10 and consider it "modest"? @Adarsh Kumarappan Please feel free to correct me if that's not accurate.

Charles Walker (Oct 14 2024 at 15:36):

Below is the paper's main table and presumably is where the 11x improvement in the tweet comes from.
image.png

The results were significantly distorted by X = 10, and below is the actual number of theorems they proved (derived by simple calculations from their table):

Repository	LeanAgent	ReProver
PFR	1	0
Hairy Ball Theorem	1	0
Coxeter	1	0
PFR (ALTERNATE COMMIT)	2	1
Mathematics in Lean Source	20	14
Formal Book	3	2
MiniF2F	99	85
SciLean	27	24
Carleson	1	1
Lean4 PDL	1	1

Mario Carneiro (Oct 14 2024 at 15:50):

@Terence Tao said:

In retrospect proof_wanted <desired_theorem_name> : 0 = 1 would have been slightly better than theorem <desired_theorem_name> : 0 = 1 := by sorry to avoid this sort of unintended side effect (or can proof_wanted statements also be used in other proofs?)

No, proof_wanted statements have no effect on the environment so they cannot be used in other proofs.

Terence Tao (Oct 14 2024 at 15:54):

OK, then I will adopt the proof_wanted <desired_theorem_name> : 0 = 1 template in the future for statements that have not yet been formalized (let alone proved) in Lean, as there do not appear to be major unintended side-effects of this template (e.g., for the purposes of testing automated tools); of course it reduces the chance of the proof_wanted being established to 0%, but given that the statement is not even formalized, I think this is a reasonable consequence. In the specific case of PFR, I think all statements have now been formalized anyway in the weeks since that test was conducted, but good to know for future projects.

Eric Wieser (Oct 14 2024 at 16:04):

You can also use theorem desired_theorem_name : (sorry : Prop) := sorry for things without even a statement yet

Eric Wieser (Oct 14 2024 at 16:04):

(It's a little annoying that the : Prop is necessary, without it Lean complains that you should have use a def)

Terence Tao (Oct 14 2024 at 16:05):

I like that better, it more closely matches the semantic intent of the placeholder.

Floris van Doorn (Oct 14 2024 at 16:33):

I'm awaiting the next paper that claims they can prove desired_theorem2 using this new way of (not) writing statements, using their deep understanding of desired_theorem1.

theorem desired_theorem1 : (sorry : Prop) := sorry

/- ... -/

theorem desired_theorem2 : (sorry : Prop) := desired_theorem1

Terence Tao (Oct 14 2024 at 19:02):

Oof, that is undesirable. So would proof_wanted desired_theorem1 : (sorry : Prop) be the most exploit-proof solution?

Alex Meiburg (Oct 14 2024 at 20:58):

I would be curious to see the longest proof in their proved statements. Many of their examples seem to be just be simp or constructor <;> simp or rfl, or a single exact or apply. Those kinds of things that I think aesop "could" reasonably prove already, for instance (if it tried to discharge with exact?). Similarly a single line of omega does not "showed a more advanced understanding of number theory" in my opinion.

The most interesting I see in their paper is induction_12dvd4expnp1p20 from MiniF2F, which reads

theorem induction_12dvd4expnp1p20 (n : ℕ) : 12 ∣ 4^(n+1) + 20 := by
  norm_num
  induction' n with n hn
  simp
  omega

... which does show awareness that it should use induction. And I'd like to see more multi-step proofs like that! :) (But a simple induction n <;> omega works too.)

Charles Walker (Oct 14 2024 at 21:02):

Floris van Doorn said:

I'm awaiting the next paper that claims they can prove desired_theorem2 using this new way of (not) writing statements, using their deep understanding of desired_theorem1.
theorem desired_theorem1 : (sorry : Prop) := sorry

/- ... -/

theorem desired_theorem2 : (sorry : Prop) := desired_theorem1

It seems likely that these are just unintended mistakes made by machine learning researchers who may not be experts in Lean. However, if the evaluation metric was indeed set to amplify the improvement by 10x, that would raise concerns about academic dishonesty. I am surprised this is coming from a Caltech team. Would the senior members of the team like to respond? @Peiyang Song, @Kaiyu Yang, @Aidan Swope, @Alex Gu, @Rahul Chalamala, @Saad Godil, @Ryan Prenger.

Aidan Swope (Oct 14 2024 at 21:28):

Charles Walker said:

Would the senior members of the team like to respond?

Hi @Charles Walker, I do not know much about this work (in fact, I haven't even read the paper yet!). Many of us on the LeanDojo paper (@Kaiyu Yang, @Saad Godil, @Ryan Prenger) are not currently working in or collaborating with that group.

Jason Rute (Oct 14 2024 at 21:40):

I’m afraid we might be a bit too negative here. I think I read that the main author is an undergraduate, which might explain the overly enthusiast language. I think when you read past the odd metrics and hyperbole, there is a result where they make measurable improvements over ReProver. I need to read closer to fully understand the details. I hope I can get to that soon and share my understanding of what is going on. In the end I hope this doesn’t turn the main author off. I look forward to a revised version of this paper.

Kaiyu Yang (Oct 15 2024 at 00:05):

Charles Walker said:

Floris van Doorn said:
I'm awaiting the next paper that claims they can prove desired_theorem2 using this new way of (not) writing statements, using their deep understanding of desired_theorem1.
theorem desired_theorem1 : (sorry : Prop) := sorry

/- ... -/

theorem desired_theorem2 : (sorry : Prop) := desired_theorem1
It seems likely that these are just unintended mistakes made by machine learning researchers who may not be experts in Lean. However, if the evaluation metric was indeed set to amplify the improvement by 10x, that would raise concerns about academic dishonesty. I am surprised this is coming from a Caltech team. Would the senior members of the team like to respond? Peiyang Song, Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chalamala, Saad Godil, Ryan Prenger.

I cannot comment on the LeanAgent paper as I have zero involvement in it. And as Aidan pointed out, I'm not currently collaborating with that group.

Rahul Chalamala (Oct 15 2024 at 00:21):

Charles Walker said:

Floris van Doorn said:
I'm awaiting the next paper that claims they can prove desired_theorem2 using this new way of (not) writing statements, using their deep understanding of desired_theorem1.
theorem desired_theorem1 : (sorry : Prop) := sorry

/- ... -/

theorem desired_theorem2 : (sorry : Prop) := desired_theorem1
It seems likely that these are just unintended mistakes made by machine learning researchers who may not be experts in Lean. However, if the evaluation metric was indeed set to amplify the improvement by 10x, that would raise concerns about academic dishonesty. I am surprised this is coming from a Caltech team. Would the senior members of the team like to respond? Peiyang Song, Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chalamala, Saad Godil, Ryan Prenger.

Hey @Charles Walker, I cannot comment on the work either, as I am not currently collaborating with the group.

Alex Gu (Oct 15 2024 at 03:44):

I am also not collaborating with the LeanAgent team and can't comment on the work either.

Tim (Oct 15 2024 at 12:43):

Your ciritique about LeanAgent could be very helpful here https://openreview.net/forum?id=Uo4EHT4ZZ8

Anima Anandkumar (Oct 15 2024 at 20:44):

Thank you for looking into our paper in detail. We really appreciate all the comments. We apologize if we missed some of the nuances in math since our background is in computer science. A lot of this work is driven by undergraduate researchers. I hope you do not judge this too harshly since this is still preliminary in many ways. There are only a few Lean repositories online and not enough data to train LLMs, otherwise, this would have been a lot easier. Our LeanAgent work is focused on lifelong learning under this harsh constraint of limited data. A lot of our focus is on ML methods and we focus less on the details of the theorems proven. We wrote this paper with an ML audience in mind to describe how LeanAgent outperforms existing (static) ML methods in theorem proving with the help of lifelong learning. Some of these proofs may be trivial for mathematicians to formalize. However, by comparing with the previous ML model, ReProver, we emphasize that these proofs are non-trivial for ML. We welcome your comments and will continue to respond to them here.More detailed comments are below

Anima Anandkumar (Oct 15 2024 at 20:46):

We define sorry theorems as “previously unproved by humans.” We use this definition to describe this Lean-specific construct in a general fashion. Please note that this paper was written for an ML audience, so we used this definition. LeanAgent can assist with formalizations by formalizing the sorry theorems in the existing repositories. The sorry theorems that we prove are theorems and lemmas. Going forward, we would like to extract data from defs after updating LeanDojo accordingly.

Anima Anandkumar (Oct 15 2024 at 20:46):

Here is the breakdown of LeanAgent’s 162 proven theorems: 27 (SciLean) + 8 (PFR 2 commits) + 21 (MIL) + 3 (Formal Book) + 1 (Carleson) + 1 (Hairy Ball Theorem) + 1 (Coxeter) + 1 (PDL) + 99 (MiniF2F) = 162 Theorems

Anima Anandkumar (Oct 15 2024 at 20:47):

The proofs brought up in this channel can only be generated by LeanAgent. ReProver cannot generate these proofs. Despite proofs, such as that for wedderburn, which may seem short and simple, they are relatively challenging to find. We did not compare with Aesop because it is not an ML technique. The purpose of LeanAgent was to improve upon ML research for theorem proving, such as ReProver. Moreover, Aesop is not a framework, but ReProver was included as the starting point of the LeanAgent framework, which is why we compare LeanAgent to ReProver. We will include this justification in the revised version.

Anima Anandkumar (Oct 15 2024 at 20:47):

Thank you very much for the comments regarding the TPPS.

There is no standardized measure of proof difficulty and proving performance, especially in a life-long learning setting. TPPS was an attempt towards this: LeanAgent TPPS = (# ReProver Theorems Proved) + (# New Theorems Proved∗X) + 1. X is a hyperparameter that can be freely adjusted and in the paper we set it as 10. The core idea is to have a higher reward for newly proven theorems, which aligns with the objective of lifelong learning. In future works, we also plan to investigate some more sophisticated measures that reward not only newly proved theorems but also difficult ones, and we welcome your suggestions on such metrics.
The TPPS on MiniF2F is 226. The preprint has been updated now.
In addition to TPPS, we have more straightforward metrics such as a direct comparison with ReProver in terms of total number of proven theorems. However, this does not incorporate the dynamic nature of lifelong learning well, and hence, we also added TPPS. As mentioned in the main paper, there is no standardized metric for proof complexity, so we decided on e^{number of proof steps} for the curriculum (justification in the paper). However, for comparing with ReProver, we decided this would not be a fair comparison because LeanAgent tends to generate shorter proofs, even if the theorems are on the difficult side. For example, the human proof of the wedderburn theorem is 175 lines, but ours is only one. Thus, using a metric with proof length for proving performance would not accurately reflect this difficulty.
We also considered another metric that weighted the proving performance based on the domain of math that the theorem was from. However, none of the authors of this paper are mathematicians. We are all computer scientists. After speaking to mathematicians, we saw no consensus on the difficulty among math domains. As such, we felt it would be incorrect to use this metric.
We recognize that understanding which math theorems are more difficult and meaningful is an important and challenging task for designing objectives and evaluations. We welcome help and contributions on labeling or classifying those theorems according to their difficulty level.

Charles Walker (Oct 15 2024 at 21:50):

@Anima Anandkumar Thank you for your response. As someone from an ML background, I find it concerning that the term "ML audience" is being used to justify the lower quality of the work. Many recent papers on AI/ML/LLMs and theorem proving have been authored by ML researchers and presented at ML conferences, yet they do not exhibit the same overclaims or problematic aspects seen here.

I want to clarify that my critique isn't directed at the undergraduate students involved in this project. They are in the process of learning how to conduct research, and that's to be expected. However, as their faculty advisor, I would assume you were overseeing the work to ensure it met basic scientific standards. With your name listed on the paper, it implies that you've read and endorsed its content. By posting it on arXiv and promoting it on Twitter, I assume it was deemed ready to be shared with the broader community, rather than being in a "preliminary" state.

I won't go further into the individual points you've raised. Many contributors in this space are well-versed in both Lean and machine learning, and are more than capable of evaluating these issues on their own.

Eric Wieser (Oct 15 2024 at 22:37):

One very minor comment on the paper: I think there is no such project as "Matrix Workshop", and this is a typo for lean-math-workshop

Jason Rute (Oct 15 2024 at 22:41):

@Anima Anandkumar Welcome to this Zulip channel! I'm glad to see you here. I just want to say right now that I regret saying on Twitter that "there are better papers out there". What I meant to say was that I didn't want the broader Twitter ML community to think this was a new phenomenon, and to point out that for this specific metric of sorry theorems, there are already papers that demonstratively prove many more theorems from subsets of that same theorem list. (For example, DeepSeek-Prover v1.5 proves about twice as many MiniF2F theorems as ReProver does from miniF2F, so it should be able to prove about ~160 sorry theorems from MiniF2F alone. Of course, it uses more compute, so it isn't an apples-to-apples comparison.) But at the same time, I don't want to make this about being SoTA or hill climbing.

Jason Rute (Oct 15 2024 at 22:42):

While I don't necessarily agree with the way this work was framed and advertised, I hope to look past that and focus on the scientific results themselves. I hope to write up my understanding here of this work and how it fits into the broader research field (as I have done for many other papers in this field).

Jason Rute (Oct 15 2024 at 22:45):

Anima Anandkumar said:

The sorry theorems that we prove are theorems and lemmas. Going forward, we would like to extract data from defs after updating LeanDojo accordingly.

If this is in reference to my question above about sorrys in definitions, please don't take that as a suggestion to add definitions to the sorrys in your benchmark. On the contrary, I was worried you might be doing that, and that would be an issue since most sorrys definitions could be filled in trivially in such a way that Lean accepts them.

Jason Rute (Oct 16 2024 at 01:14):

Some questions I have for @Anima Anandkumar, @Adarsh Kumarappan or others who understand the paper better than me:

Jason Rute (Oct 16 2024 at 01:14):

I don't understand "single repository" vs "merge all". My guess would be that "single repository" just fine-tunes on the current repo, while "merge all" fine tunes on all repositories. But if so, what does it mean to order the repositories by either curriculum order or popularity during "single repository" if there is only one repo? (I clearly am mistaken here.)

Jason Rute (Oct 16 2024 at 01:14):

Similarly, if "merge all" means to train on all other repos, are you training on all repos which come before in curriculum order or all other repos (including those which come later in curriculum order)?

Jason Rute (Oct 16 2024 at 01:14):

I'm still not exactly sure how fine-tuning vs proving happens. Say you have a repo which consists of ten theorems A B C D E F G H I J across different Lean files. Do you fine-tune using A, B, C, in that order until you get to a sorry theorem and then try to prove the sorry theorem? If so, what is the order? Topological sort? File dependencies (so not linear, but a tree)? Easy, med, hard, sorry theorems? Other? If you are doing "merge all", do you fine-tune on each repo before moving to the next repo to fine-tune, or do you mix theorems in the training order?

Jason Rute (Oct 16 2024 at 01:15):

Clearly, you are updating the premise selection database with the most recent premises (as your examples show where you reference a theorem in that same repo)? Do you also do that for the ReProver baseline? (In other words, would it be possible for ReProver to use multidist_eq_zero to prove MultiTau_min_exists in PFR, or is that just not in it's retrieval database.)

Jason Rute (Oct 16 2024 at 01:15):

I'm unclear how mathlib is used here. Your model seems to have a decent understanding of the content of mathlib. Have you fine-tuned on mathlib, recently? Is this one of the repos used in curriculum training? If not, how old is the base ReProver model you are using? Mathlib changes a lot, and I'm curious if ReProver is also up-to-date with the version of Mathlib you used? (Of course, that would somewhat be mitigated as long as ReProver had access to the most recent theorems.)

Jason Rute (Oct 16 2024 at 01:15):

I'm trying to understand what you fine-tuned? You say you fine-tuned the "retriever". My understanding is that ReProver has two neural network parts: (1) the tactic suggestion model (which takes in the current goal as well as a list of selected premises) and outputs a tactic, and (2) the premise selection model which finds premises to put into the context. The premise selection model additionally has a key and a query embedding I think (but maybe I'm mistaken and they are the same embedding model). Do you just update the premise selection model? Is that what you mean by retriever? If so, do you update both the key and query embedding parts?

Jason Rute (Oct 16 2024 at 01:15):

Do you know if there were also theorems that ReProver solved but LeanAgent didn't? How many were there? (This is very common in theorem proving and I would be suspicious if this wasn't the case.)

Jason Rute (Oct 16 2024 at 01:15):

From the examples, it seems that most of LeanAgent's power comes from a better premise selector. Have you looked any more into this (for example, looking into what premises are returned by both models)? Do you have a hypothesis why one is better?

Jason Rute (Oct 16 2024 at 01:31):

What you mean when you write "by the end of life-long learning"? You seem to suggest the model changes over life-long learning, and you are doing many evaluations on the same sorry theorem during this time? Is that correct? If so, what is the order of training and evaluation? (See question 3.)

Alex Meiburg (Oct 16 2024 at 02:23):

To my ears, saying that you considered a metric for difficulty but (it sounds like) rejected it because the system tended to generate solutions that were poor by that metric ... does not sound very good. This is exactly why people do preregistration of studies. Of course that's not a practice that's caught on very well in ML - but I've seen it a couple times! - but the principle is still solid.

Personally I would be very interested to see scores based on e^(proof length), and that aligns quite well with my earlier message wondering what the longest proof found was :)

Alex Meiburg (Oct 16 2024 at 02:32):

If an AI found a proof that was really much simpler (by some metric of, how large the whole theorem dependency tree is) than any previous approach, that would be very interesting.

But using wedderburn as a particular example is a bit strained. As Vincent pointed out, the apply Field.toIsField actually uses this theorem (as an inferred instance, an implicit argument):

https://github.com/leanprover-community/mathlib4/blob/2ae1eb8bb18eb00a62dae319aaef09bc2ae5ed8a/Mathlib/RingTheory/LittleWedderburn.lean#L62

That proof is ~100 lines, perfectly on par with 175 lines, and is the direct result of a human putting in the effort to prove wedderburn. :) So the fact that this theorem (in Mathlib) could be used, in just one line, is exactly very good evidence that proving your wedderburn (when Mathlib is accessible) is indeed pretty easy. A e^(length) metric seems pretty reasonable in that sense.

If I had to pick a metric myself, I would vote for: train a Huffman decoder on Mathlib, and then weight as c^(bits of compressed proof) for some constant c. That way commands like "simp only" get compressed well, while particular theorem names are given more weight, etc etc. A very old school language modeling approach, I guess.

Moritz Firsching (Oct 16 2024 at 06:20):

Anima Anandkumar said:

Here is the breakdown of LeanAgent’s 162 proven theorems: 27 (SciLean) + 8 (PFR 2 commits) + 21 (MIL) + 3 (Formal Book) + 1 (Carleson) + 1 (Hairy Ball Theorem) + 1 (Coxeter) + 1 (PDL) + 99 (MiniF2F) = 162 Theorems

Just since I just received a pull request (https://github.com/mo271/FormalBook/pull/67) for the three proofs for Formal Book:
Here's the breakdown:

proving Wedderburn using apply Field.toIsField -- has been discussed
proving quadratic_reciprocity_2 with exact book.quadratic_reciprocity.quadratic_reciprocity_1 p q hp hq, where quadratic_reciprocity_1 is the exact same statement as quadratic_reciprocity_2
one proof that is constructor <;> linarith

This is summarized in the paper as
"In the Formal Book repository, LeanAgent progresses from proving basic real analysis and number theory theorems to mastering advanced abstract algebra, exemplified by its proof of Wedderburn’s Little Theorem."
I think it might be better summarized
"In the Formal Book, only trivial/tautological proofs are found by the LeanAgent"

Malvin Gattinger (Oct 16 2024 at 07:22):

Today I got a LeanAgent PR from @Anima Anandkumar and was surprised that my PDL project was even used for this and counts +1 to the stats. The proof found is actually good / found an existing Lemma in Mathlib (which I guess apply? would also have found). But the PR is not useful because it is based on 4 months old code. In the meantime that particular sorry is no longer there in our repository. See https://github.com/m4lvin/lean4-pdl/pull/6

Suggestion: Before making these PRs please check that the sorry you are fixing still exists in the latest version of the project!

Eric Wieser (Oct 16 2024 at 08:00):

I think it's quite valuable to have these PRs for transparency, even if they end up being out of date and unmergeable.

Malvin Gattinger (Oct 16 2024 at 08:01):

Oh, true, that is a good point. I don't mind getting it then.

Eric Wieser (Oct 16 2024 at 08:06):

Looking at the full set of PRs, the most genuinely helpful one looks to be the Scilean one; while none of the results look that hard, they do look like the area where an agent writing the proofs saves on human time.

Fabian Glöckle (Oct 16 2024 at 09:57):

I think two different meanings of "metric" are being conflated here:

As model developers, we are free to decide on our "loss" or "reward" used for model training. We are free to do reward shaping by putting exponentials, favoring previously unsolved problems etc as we like.
As members of the scientific community, on the other hand, we should evaluate our models on standard benchmarks with standard metrics for comparability. Researchers know that getting from 40% to 50% accuracy on ImageNet is way easier than getting from 80% to 90%, still they report the numbers on the benchmark as-is and leave it to the reader's judgement to evaluate the quality of the model based on said numbers. (We do not report ImageNet scores as (90-accuracy)^(-alpha).) Artificially inflating benchmark numbers creates a false sense of where we are: all published models score < 70% on MiniF2F for instance, and that is something we as a community should acknowledge and take as a challenge, instead of presenting our methods as game-changing when in fact they still aren't.

It is true that if a new method is developed for a task that is not captured by existing benchmarks, we can (and maybe should) create our own measures for progress on this task. In this case, however, I don't see the advantage of TPPS over proof rates for each of the considered repositories. The ML theorem proving community is already on the transition to cumulative proof rates (cf. HTPS; DeepSeekProver) which is a perfect fit for a "life-long learning" method. [The TPPS metric is also flawed since it introduces an asymmetry. Given that theorem proof rates are stochastic, even an independent second run of ReProver could reach an "up to 11x improvement".]

Terence Tao (Oct 16 2024 at 15:36):

The LeanAgent PFR proofs have just been PR'ed into the PFR main repository. Regarding the eight filled sorries, the outcomes were as follows.

One of the sorries filled (condRho_of_translate) was valid and useful, and has now been merged. I thank the LeanAgent team for contributing this lemma; the proof is ultimately one line (simp only [condRho, rho_of_translate]), and "obvious" in retrospect, but we would probably not have found this short proof initially (though perhaps it would emerge after some rounds of golf).
Five of the sorries were artefacts of having 0=1 placeholder theorems in the repository, and were not PR'ed. It was valuable to know that there was an exploitable weakness in my choice of placeholders in this repository, which was designed for human provers rather than automated provers; I think the discovery that such placeholders need to be designed in future to prevent unintended "exploit"s by too-helpful automated theorem provers is a useful point that could be stressed in a revision of the LeanAgent paper.
The remaining two sorries were PR'ed but were artefacts of a placeholder definition noncomputable def multiDist {m:ℕ} {Ω: Fin m → Type*} (hΩ: (i:Fin m) → MeasureSpace (Ω i)) (X : (i : Fin m) → (Ω i) → G) : ℝ := sorry that made it possible for results regarding multiDist to be artificially provable by tactics such as rfl. Also, by the time of the PR, this definition had been filled, as had the sorries for the relevant theorems involving that definition. So this PR was not merged. Again, this points out an exploitable weakness with placeholder definitions (not just placeholder theorems) that would be an important consideration for future formalization projects to keep in mind, given that we will soon be entering the era of AI-assisted formalization, and worth pointing out in a revision of the paper. In particular, there may be a need for a definition_wanted keyword in Lean with similar functionality to proof_wanted in order to prevent this sort of exploit. (EDIT: opened up a discussion on this here.)

While I am not an AI/ML researcher myself, my general impressions of the work are as follows.

The LeanAgent tool does indeed appear to be a significant improvement over the state of the art in automated theorem proving; its ability to automatically contribute to formalization repositories in the "wild" is still somewhat modest, but non-trivial.
The dearth of standard benchmarks to assess automated theorem prover performance is a significant problem that this area as a whole will need to somehow address. The paper's experiment to use repositories in the wild as a substitute for such benchmarks is an interesting and creative one, and (when analyzed properly) is certainly better than having no evaluation metric whatsoever, but it is an untested approach, with several previously unnoticed weaknesses. On the other hand, the identification of these weaknesses is a genuinely useful contribution of this paper, even if that was perhaps not the authors' original intention when using these repositories to test their tool. Furthermore, the PR'ing of the valid proofs to ongoing projects was appreciated by myself at least.
I believe that most of the issues pointed out about this paper (which mostly center around interpretation of the results obtained using the metrics selected) can be resolved in a subsequent revision of the paper. I myself have had to perform minor or major revisions on several preprints of my own (and even on some occasions issue formal errata for published papers) due to omissions or technical issues that were pointed out by others. To some extent, this is a natural consequence of working at the frontier of a subject, where previous guidance is sparse.
I find no evidence to suggest any intentional attempt to misrepresent the results here; most likely the authors of the paper were excited by the high evaluation scores that their tool was obtaining on their chosen metrics, and did not spend enough time on critical review of the methodology as a consequence. But I expect that this will not occur again after the feedback received from this current paper.

Adarsh Kumarappan (Oct 16 2024 at 23:05):

Thank you again for your detailed comments! A new preprint should be out shortly with the following changes:

Additional discussions on the limitations of our proposed score.
Additional details on the general lack of standardized benchmarks and scores for our lifelong learning setting.

We would like to address some of the further questions raised in this channel by @Jason Rute :

In the “single repository” setting, every new dataset is created from just a new repository. In the “merge all” setting, every new dataset is created from all previous repositories, including the new one. So, the last dataset in this setting will contain all previous repositories. The repository order is determined before creating these datasets. With the “single repository” setting, the datasets could be created from repositories A, B, C, … with the popularity order, but C, B, A, … with the curriculum order.
In the “merge all” setting, every new dataset is created from all previous repositories, including the new one.
LeanAgent progressively trains on the entire dataset before any proving happens. During lifelong learning, we only prove the sorry theorems from the newly discovered repository, even if we used the “merge all” setting for progressive training. To clarify, the “merge all” setting mixes the theorems in the training order and accounts for duplicates.
Yes, we update the premise corpus for both LeanAgent and ReProver.
ReProver was fine-tuned on LeanDojo Benchmark 4, which was constructed from Mathlib4. Since LeanAgent uses ReProver as its initial retriever, it understands the content of Mathlib4. We used the version of ReProver that is openly available on HuggingFace.
ReProver consists of (1) a premise retriever (or premise selection model) that takes as input (a) the current state and returns as output (b) the most relevant premises. It also contains (2) a tactic generator that takes as input (a) both the current state and retrieved premises and returns as output (b) the most promising proof steps. Progressive training updates the retriever, not the tactic generator. Specifically, the premise retriever uses contrastive learning and DPR (details in the LeanDojo paper). This updates the embeddings for (a) the query (state) and (b) the premises.
There are some theorems that ReProver could solve that LeanAgent could not. As we mention in the paper, this is largely because ReProver is “more tailored to simpler computation problems.” These theorems include the 7 MiniF2F theorems from Appendix A.5 that ReProver could prove but LeanAgent could not:

Theorem: mathd_algebra_141
File path: MiniF2F/Test.lean
Theorem statement: theorem mathd_algebra_141
  (a b : ℝ)
  (h₁ : (a * b)=180)
  (h₂ : 2 * (a + b)=54) :
  (a^2 + b^2) = 369 :=
Proof:
  nlinarith

Theorem: mathd_algebra_329
File path: MiniF2F/Test.lean
Theorem statement: theorem mathd_algebra_329
  (x y : ℝ)
  (h₀ : 3 * y = x)
  (h₁ : 2 * x + 5 * y = 11) :
  x + y = 4 :=
Proof:
  linarith

Theorem: mathd_algebra_547
File path: MiniF2F/Valid.lean
Theorem statement: theorem mathd_algebra_547 (x y : ℝ) (h₀ : x = 5) (h₁ : y = 2) : Real.sqrt (x ^ 3 - 2 ^ y) = 11 :=
Proof:
  simp [h₀, h₁, sq]
  rw [Real.sqrt_eq_iff_sq_eq] <;> norm_num

Theorem: mathd_algebra_484
File path: MiniF2F/Test.lean
Theorem statement: theorem mathd_algebra_484 :
  Real.log 27 / Real.log 3 = 3 :=
Proof:
  field_simp
  rw [← Real.log_rpow]
  all_goals norm_num

Theorem: mathd_numbertheory_254
File path: MiniF2F/Test.lean
Theorem statement: theorem mathd_numbertheory_254 :
  (239 + 174 + 83) % 10 = 6 :=
Proof:
  norm_num

Theorem: mathd_algebra_304
File path: MiniF2F/Test.lean
Theorem statement: theorem mathd_algebra_304 :
  91^2 = 8281 :=
Proof:
  norm_num

Theorem: mathd_numbertheory_342
File path: MiniF2F/Test.lean
Theorem statement: theorem mathd_numbertheory_342 :
  54 % 6 = 0 :=
Proof:
  norm_num

Although both LeanAgent and ReProver have access to the premise corpus, one reason LeanAgent is superior is that progressive training increases the space of possible proof states. This means that LeanAgent understands proof paths that ReProver does not know.
LeanAgent does change over lifelong learning due to progressive training over the datasets. We distinguish between during and after lifelong learning: (a) During lifelong learning, we try proving sorry theorems for each new repository. (b) At the end of lifelong learning, meaning after seeing all of the repositories, we try to prove the sorry theorems in all repositories that we could not prove during lifelong learning. We notice the progression mentioned in the paper, where LeanAgent starts by proving simpler theorems and then progresses to relatively more challenging theorems at the end of lifelong learning.

We would also like to respond to some of the other comments:

We want to clarify that we did not reject a metric to inflate our results. We added a detailed analysis of the limitations of our proposed score in Section 4.2 and Appendix A.7. As mentioned previously, finding a metric that would reasonably represent our lifelong learning setup was quite challenging. Again, we recognize that understanding which math theorems are more complex and meaningful is an important and challenging task for designing objectives and evaluations. We welcome additional discussions on new metrics.
None of the authors are mathematicians. We are all computer scientists. We did not recognize that apply Field.toIsField uses a relatively long proof behind the scenes. We will take this into account in our revised preprint.
We would like to emphasize further that our work is focused on lifelong learning under the harsh constraints of limited data (more details in Appendix A.7). As such, as applicable to repositories like PFR and Formal Book, the metric can be susceptible to artifacts that artificially inflate performance. This underscores the need for more robust and well-tested benchmarks. Again, we focus on ML methods, and we focus less on the details of the theorems that have been proven. Some of these proofs may be trivial for mathematicians, but we aim to emphasize that these proofs are non-trivial for ML. In addition, CPS still represents a linear percentage improvement. This is why we refrain from using it in our analysis.
Some of the PRs were from branches on older commits, as these were commits used in the experiments in the paper. However, we hope that making these proofs public helps increase transparency.
We will mention how LeanAgent points out weaknesses with current repositories that would be important to consider for future formalization projects.

Kaiyu Yang (Oct 17 2024 at 04:46):

This morning, my Lean4 MiniF2F repo received a PR from LeanAgent (https://github.com/yangky11/miniF2F-lean4/pull/6). My impression was that most of the sorry theorems proved in the PR should be within the reach of existing off-the-shelf ML tools, such as LeanCopilot. Therefore, I tried running LeanCopilot out of the box using my laptop, and here is what I found:

Experimental Setup

Follow LeanCopilot's instructions to set up LeanCopilot v1.6.0 for the latest commit of miniF2F-lean4.
Import LeanCopilot and replace all sorry in Valid.lean and Test.lean with search_proof.
Wait and babysit (restart the file if it crashes).

Results

To make the comparison clear, I also made a PR for LeanCopilot (https://github.com/yangky11/miniF2F-lean4/pull/7) similar to LeanAgent's PR. LeanAgent proved 99 theorems (85 from Test.lean and 14 from Valid.lean), which is the same as the number reported in the paper. However, my run of LeanCopilot proved 100 theorems (85 from Test.lean and 15 from Valid.lean).

Caveats

LeanCopilot can be understood as a user-facing version of ReProver. They use the same tactic generation model, though there are a few differences:

ReProver is typically evaluated on GPU servers, whereas LeanCopilot is typically run on consumer CPUs, e.g., on the user's laptop.
ReProver uses a wall time limit of 10 minutes, whereas LeanCopilot uses the number of expansions during proof search as the limit. In practice, LeanCopilot is often run for less than 10 minutes, though that's not guaranteed.
ReProver performs premise retrieval before tactic generation. LeanCopilot generates tactics directly, without retrieval.
ReProver's proof search is implemented in Python. LeanCopilot's proof search is based on the same algorithm (best-first search) but implemented by Aesop in Lean.

Given these differences, I expect ReProver to be at least as strong as LeanCopilot in terms of the number of theorems it can prove (and likely stronger due to GPUs, retrieval, and more search time). However, a more thorough investigation may be necessary.

Jason Rute (Oct 17 2024 at 11:48):

@Kaiyu Yang Great analysis! But a few follow-up questions:

How does the retriever compare between ReProver (HuggingFace version) and LeanCopilot? It looks like the ReProver retriever is out of date in the LeanAgent paper (even if it is given recent Mathlib theorems in its database, it can't find them). Is the LeanCopilot retriever trained more recently?
Do you think that Aesope's symbolic capabilities (splitting on goals, always trying simple tactics like rfl/simp, and normalizing the goal) may actually help LeanCopilot be stronger than ReProver in some situations? (I was a bit shocked in the LeanAgent paper when they gave examples were ReProver couldn't find a rfl or simp theorem in 10 minutes.)
Did you ever in the past compare ReProver to LeanCopilot (on say MiniF2F, or another test)?
Now that you know what theorems in MiniF2F this paper tests on, do you still have the data for ReProver for MiniF2F in Lean4 (or even Lean3)? If so, in your data, what did a previous run of ReProver solve out of this theorem list. Is it comparable?

Jason Rute (Oct 17 2024 at 12:09):

Hi @Adarsh Kumarappan, thank you for the detailed reply. I understand the paper much better and could probably summarize it now. I do however have some additional follow-up questions:

Jason Rute (Oct 17 2024 at 12:09):

So if I understand, there are two times you test theorems in a given package C: (a) Right after you trained on packages A, B, C (during life-long learning), and (b) after training on all packages A, B, C, D, ... (after life long learning). In tables 2 and 9 and in your 162 theorem statistic, are you adding both attempts (a) during and (b) after? (I also understand you don't have statistics for just (b) after since you only run the after model for theorems not solved with the during model, correct?)

Jason Rute (Oct 17 2024 at 12:09):

What exactly is the motivation for the life-long learning setting? I can’t quite see how it is relevant to theorem proving. For example, if I wanted to create an agent for the PFR repo, wouldn't it be best to take a generalist agent pre-trained on all available Lean packages, and then fine-tune it on PFR? While, it dooes makes sense that having an agent fine-tuned most recently on PFR would be better than one tuned on all repositories, I don’t see the need for ordering the curriculums or training on them linearly. To make my point clear, here are some possible future experiments:
a. Compare just the life-long learning models: Show numbers the settings in table 9, but only after lifelong learning (training on all repos).
b. Compare to the usual approach: Finetune a model on all repositories together—including yours, as well as lean, batteries, and mathlib—directly. Then compare it to the lifelong learning models.
c. Take the model (b) above, and fine-tune it just on a given repo like PFR. (I suspect this would do as well, if not better, than the LeanAgent model during lifelong learning.)
d. Take ReProver and only fine-tune it on the given repo like PFR. Does this do worse than the model also fine-tuned on other repositories first?

Jason Rute (Oct 17 2024 at 12:09):

Do you have any worries (if I understand correctly), that the numbers in tables 2 and 9 and the 162 theorem count) have twenty minutes to solve a theorem using two different solvers (10 mins for during, and 10 mins for after) vs the baseline which has only 10 mins using one model? It is known that both time and diversity help a lot in proving theorems. (Using the number of theorems proved is logirithmic in time. Also, it is usually the case that running two different models for 5 mins is better than either model run for 10 minutes. Diversity really is the easiest way to improve a theorem prover.)

Jason Rute (Oct 17 2024 at 12:09):

Do you have any worriess that for MIL, the solutions for each exercise is in the code and you are fine-tuning on those? (But, I think this doesn’t show very strongly in your results since you are only updating the retriever and not the tactic-predictor.)

Jason Rute (Oct 17 2024 at 12:09):

Why do you only updating the retriever and not the tactic predictor? I would hope (although it clearly isn’t the case here) that the retriever would be robust to the list of available lemmas, while the tactic predictor could learn a lot from the code of the given repo.

Jason Rute (Oct 17 2024 at 12:09):

What do you mean when you write “progressive training increases the space of possible proof states”?

Kaiyu Yang (Oct 17 2024 at 14:47):

Jason Rute said:

Kaiyu Yang Great analysis! But a few follow-up questions:

Hi @Jason Rute,

LeanCopilot does not have a retriever.
It could, though I haven't done a systematic analysis. However, I also observed cases when rfl/simp/etc. tactics generated by aesop take a long time to run or even block the proof search. So they're not always helping.
No, I'm not aware of such a comparison. I've been no longer actively engaged in LeanCopilot's development since I joined Meta. So, it's possible that I missed something.
I haven't recently evaluated ReProver on the Lean 4 version of MiniF2F (The results in the LeanDojo paper were Lean 3 and outdated). That's on my TODO list, though the priority is not high. I'll follow up if I end up having the new results. Meanwhile, this experiment has been done independently by other members of the community (see https://github.com/yangky11/miniF2F-lean4/issues/5). They reported 35.7% on the test set of Lean 4 MiniF2F. Note that all proofs in Test.lean are sorry, so 35.7% is also out of all sorry theorems. That translates to 87 theorems proved in Test.lean.

Jason Rute (Oct 17 2024 at 15:29):

Kaiyu Yang said:

Hi Jason Rute,

LeanCopilot does not have a retriever.

But LeanCopilot does have premise selection, correct? So you are saying that, unlike ReProver, premise selection is not used in LeanCopilot for either tactic suggestion or proof search?

Kaiyu Yang (Oct 17 2024 at 15:35):

Jason Rute said:

Kaiyu Yang said:

Hi Jason Rute,

LeanCopilot does not have a retriever.

But LeanCopilot does have premise selection, correct? So you are saying that, unlike ReProver, premise selection is not used in LeanCopilot for either tactic suggestion or proof search?

Exactly. I was talking about the search_proof tactic in LeanCopilot. It does not use premise selection. LeanCopilot's suggest_tactics also doesn't use premise selection. The select_premises tactic does use a retriever (the same retriever as in ReProver), but the retrieved premises are not used in any automated proof search (LeanCopilot only displays them to the user as suggestions).

Jason Rute (Oct 17 2024 at 16:18):

Also it seems that njuyxw’s evaluation of ReProver that you linked also isn’t using the retriever. Maybe the takeaway is that ReProver’s retriever is detrimental to its performance (possibly because it is out of date), and the main value of LeanAgent is that it retrains the retriever to be helpful again.

Jason Rute (Oct 17 2024 at 16:18):

This is speculation of course.

Kaiyu Yang (Oct 17 2024 at 16:24):

That's not impossible, though I'd be surprised if it is the case.

Jason Rute (Oct 17 2024 at 16:39):

I guess one simple experiment is to use ReProver with the retriever turned off and see how many sorry theorems in the above list it can prove. The other is just to fine-tune the retriever on Lean, Mathlib, and Batteries and then redo the baseline.

Jason Rute (Oct 17 2024 at 16:43):

In Graph2Tac, we noticed that our argument selection model (similar to a premise selection model) could get too attached to specific names of definitions/lemmas. This suggests it is possible that ReProver (and MagnusHammer and other transformer/InfoNCE-based lemma selectors like the ones people in Lean right now are working on) could be overfit to small details of the lemmas and not generalize to new definitions/lemmas or to natural changes over time in the code base like renaming of definitions/lemmas.

Jason Rute (Oct 17 2024 at 16:55):

@Adarsh Kumarappan Another question: When you turn a repo into a "dataset", what happens with the dependencies, like the mathlib dependencies? Are those proofs added into the dataset to train the query model of the retriever? Are those premises added into the dataset to train the key model of the retriever? (My thought is that if this is the case, then you might be doing a lot of training on mathlib as well, which again if ReProver's retriever is out of date with respect to mathlib, would certainly give LeanAgent a boost.)

Kaiyu Yang (Oct 17 2024 at 16:55):

Another subtlety I'd like to mention is that whether retrieval helps can depend on the repo you use for evaluation. Intuitively, the set of premises useful for MiniF2F is relatively small and fixed. So retrieval might not be that helpful.

Jason Rute (Oct 17 2024 at 17:01):

@Kaiyu Yang Did you already do that experiment with ReProver and MiniF2F? Run it both with and without retrieval? I'm having trouble finding it in your paper. (Also, that would be with a freshly trained model. I'm also wondering if it is even worse with an out-of-date retrieval model.)

Kaiyu Yang (Oct 17 2024 at 17:08):

@Jason Rute Not for Lean 4. The Lean 4 version of MiniF2F only came into existence after (or maybe concurrent to) the LeanDojo paper. Experiments in the paper (Appendix C.4) are very old and use Lean 3.

Jason Rute (Oct 17 2024 at 17:08):

Did you do that experiment in Lean 3?

Kaiyu Yang (Oct 17 2024 at 17:14):

Jason Rute said:

Did you do that experiment in Lean 3?

Head-to-head comparison w/ and w/o retrieval on Lean 3 MiniF2F? No. The closest thing in the paper was the last few sentences in Appendix C.4, but that's on ProofNet, not MiniF2F. We also did mathlib, whose theorem distribution is quite different from ProofNet and MiniF2F.

Jason Rute (Oct 17 2024 at 17:21):

Ok, that is another good experiment to add to the list. :) (Although I've basically already mentioned it above with the larger sorry dataset.)

Aidan Swope (Oct 17 2024 at 17:22):

@Jason Rute Thank you (and others) for being welcoming to machine learning researchers. I think there is an unfortunate tendency for ML research to turn scientific fields into toy problems, and claim victory over them too soon (cf. Hinton's quip on radiology). I suspect we are at the beginning of a boom in AI for formalization, and the finer points may take some time to get understood by the broader research community.

Xujie Si (Oct 17 2024 at 20:14):

@Kaiyu Yang Thanks for sharing the new analysis. It is somewhat surprising that LeanAgent proves fewer theorems compared to its baseline LeanCopilot (if I understand the setup correctly). Perhaps its improvement is tackling a different (or harder) set of theorems, instead of a strict subset of what LeanCopilot can already prove. Otherwise, I don't see any point of creating a less capable wrapper with a new name.

Kaiyu Yang (Oct 17 2024 at 20:30):

Hi @Xujie Si, LeanAgent used ReProver as the baseline rather than LeanCopilot (and there are subtle differences between the two, as described in my earlier message). I wouldn’t say my analysis necessarily shows LeanAgent is worse, since it’s only on MiniF2F, though it could indicate the need for more analysis.

Adarsh Kumarappan (Oct 18 2024 at 23:15):

Thank you for the insightful comments and comparisons with LeanCopilot @Kaiyu Yang . We'd like to clarify a few points about LeanAgent and its goals:

Scope beyond MiniF2F: While the MiniF2F results are interesting, LeanAgent's primary goal is to go beyond this benchmark. Our focus is on proving theorems across a diverse range of repositories, especially those containing more advanced mathematics.
Lifelong Learning: LeanAgent's key innovation is its lifelong learning approach. Unlike LeanCopilot, which operates on individual repositories, LeanAgent is designed to learn and improve its performance continually across multiple repositories. This approach is crucial for scaling to more complex mathematical domains and for improving generalization.
Experimental Differences: There are significant differences in how LeanAgent and LeanCopilot are evaluated. LeanAgent uses a strict 10-minute wall clock time limit per theorem, while LeanCopilot doesn't have such a limit. The "babysitting" process described for LeanCopilot, which involves restarting the file multiple times when it crashes, makes it difficult to have a fair apple-to-apple comparison. This process could potentially allow LeanCopilot to run for much longer than 10 minutes per theorem, which is not comparable to our evaluation setup. In the future, we plan to also allow LeanAgent more time and the ability to employ more expensive test-time methods.
Retrieval and Integration: While LeanCopilot's performance without retrieval is noteworthy, we believe effective premise retrieval will be crucial for more advanced theorems. We aim to refine and improve the retrieval mechanism, potentially integrating it more tightly with proof search, similar to how LeanCopilot's select_premises tool could be enhanced. Also, note that our experiments without retrieval on the initial 14 repositories performed worse than ReProver with retrieval. This suggests that retrieval remains essential for more advanced theorem proving, even if it may not benefit significantly on some benchmarks like MiniF2F.
Focus on Static Baselines: Our primary goal is to compare LeanAgent against static LLM baselines, not interactive tools like LeanCopilot. LeanCopilot is designed as a user-facing tool, while LeanAgent is primarily a research prototype focused on advancing ML techniques for theorem proving.
Future Integration: Ultimately, we envision integrating the strengths of both LeanAgent and tools like LeanCopilot. The lifelong learning capabilities of LeanAgent could be combined with the efficient proof search and interface of LeanCopilot to create a more powerful and versatile theorem-proving assistant.
Focus on ML Advancements: As mentioned in our previous response, our primary focus is on advancing ML techniques for theorem proving, making them continuously evolve with new formal math data and knowledge. While some proofs may seem simple to mathematicians, they represent significant progress in ML-based theorem proving.

We welcome further discussion and collaboration to advance the field of automated theorem proving. Our goal is to create tools that can assist mathematicians across a wide range of domains, and we believe the lifelong learning approach of LeanAgent is a step in that direction. Given these factors, we believe that a direct comparison between LeanAgent and LeanCopilot would not provide a meaningful evaluation of our lifelong learning approach. Moreover, we plan to integrate LeanAgent, a lifelong learning framework, and LeanCopilot, a user-facing tool, in the future.

Jason Rute (Oct 19 2024 at 11:55):

@Adarsh Kumarappan I think in many ways @kaiyu and I are still just trying to wrap our heads around the results in the paper, especially since they seem a bit confusing. More data and experiments help clarify the picture even if they aren’t apples-to-apples.

Jason Rute (Oct 19 2024 at 11:55):

I for one would love to see also the statistics of ReProver without retrieval (broken up by repo) since you seem to have them. I would also love to see the breakdown of what theorems are proved during life long learning and what theorems are proved after life long learning. (I know you don’t currently have the after number for all theorems, but would you consider running it.)

Jason Rute (Oct 19 2024 at 11:56):

Although, @Kaiyu Yang, it also probably shouldn’t be surprising that LeanAgent does better than ReProver on most repos since LeanAgent is ReProver but with a retriever fine-tuned on the test repo.

Kaiyu Yang (Oct 19 2024 at 15:29):

Jason Rute said:

Although, Kaiyu Yang, it also probably shouldn’t be surprising that LeanAgent does better than ReProver on most repos since LeanAgent is ReProver but with a retriever fine-tuned on the test repo.

Yes, I wouldn't be surprised if finetuning ReProver on the test repo helps. I'm still trying to make sense of Table 2 in LeanAgent's paper. For MiniF2F, from the previous calculation, ReProver proved 85 theorems from Test.lean and Valid.lean combined. @Adarsh Kumarappan I'm wondering how many of them are from Test.lean. From my understanding, LeanAgent adopts the same evaluation protocol as LeanDojo (10 minutes wall time limit, 64 tactic candidates per step, GPUs). Is that correct?

Kaiyu Yang (Oct 19 2024 at 15:36):

And I agree with @Jason Rute that the result of ReProver w/o retrieval would be helpful. @Adarsh Kumarappan By any chance you have it for Test.lean in MiniF2F?

Jason Rute (Oct 19 2024 at 17:15):

@Kaiyu Yang what are your questions about table 2? Here are mine for @Adarsh Kumarappan:

I think there was a bug in the numbers (given the discussion above). What are the correct numbers? After the correction, does it match with table 9, and if so why not?
These numbers in the LeanAgent column only include LeanAgent runs right, not ReProver or the ablations (setups 1 to 7), correct?
There are two runs in the LeanAgent column, right? A LeanAgent during life-long learning (LLL) run and an after LLL run, correct? Both for 10 minutes?
For each repo, for the during LLL run, what repos were trained on before attempting to prove theorems in that repo for 10 minutes. (I think the answer is that for repos found in Table 7, you first train on the repos before that repo in the table as well as that repo itself. For repos in Table 8, do you train on all the repos in Table 7 and all the repos in Table 8 which come before? For MiniF2F it isn’t clear to me, maybe it is after training on the repos in Table 7 and “progressively” training on MiniF2F, but I’m confused what “progressively” means here.)
For each repo, for the after LLL run, what runs were trained on? (Was it after training on all repos in tables 7 and 8 and on MiniF2F?)

Jason Rute (Oct 19 2024 at 17:29):

Oh, I also see that LeanAgent might train on MathComp before attempting MiniF2F. I think there is some overlap (as has been mentioned with InternLM-Step-Prover and the GPT-based provers). That might also explain the better score of LeanAgent on MiniF2F over ReProver.

Kaiyu Yang (Oct 19 2024 at 17:30):

Jason Rute said:

Kaiyu Yang what are your questions about table 2?

My questions are about MiniF2F in Table 2 (I'm too unfamiliar with other repos to ask sensible questions :smile: ). LeanAgent reports results on MiniF2F Test+Valid, whereas all prior works reported results on Test. So, to formulate my questions precisely, I would need help from @Adarsh Kumarappan to get a better sense of the numbers of LeanAgent and ReProver (w/ or w/o retrieval) on MiniF2F Test. I think most of these numbers should be readily available in his current experiments.

Adarsh Kumarappan (Oct 24 2024 at 03:44):

Thank you everyone for your further comments and questions!

We would like to address some of the further questions raised in this channel by @Jason Rute :

Yes, for LeanAgent, we run during and after (meaning the theorems are not solved during lifelong learning).
We would like to clarify the motivation behind lifelong learning in theorem proving. In the introduction, we note how mathematicians often formalize across multiple domains and projects simultaneously or cyclically. We use this as motivation for connecting mathematical knowledge between domains. Moreover, since data scarcity is a major problem in theorem proving, having enough high-quality data for effective pre-training on all repositories may not be feasible. Moreover, the results show that our lifelong learning setup leads to effective backward transfer, allowing learning on task A and then task B to improve performance on task A. This is a strong advantage that pre-training does not provide and is also why LeanAgent demonstrates progressive learning, starting with basic concepts and advancing to more complex ones. Also, although not a direct comparison to the pre-training approach, the "Merge All" strategy indicates decreased performance over time. Furthermore, Appendix A.6 explains how curriculum learning guides the learner towards better local minima in the highly non-convex optimization landscape of theorem proving. This is something that pretraining does not necessarily enjoy.
We agree that modifying the number of iterations and the time spent on each iteration would be helpful to explore in future research.
As you mention, we only update the retriever, not the tactic generator.
Great question. Most repositories do not introduce new tactics, so we decided to update the retriever.
When LeanAgent progressively trains on new repositories, it adds new premises to its premise embeddings from each repository. These new premises create possible intermediate states that can be used in proofs. This expands the "space" of possible states the model can reach during proof search.
Yes, the data from dependencies like mathlib are added into the dataset. As you mention, dependencies like mathlib changes very quickly, so doing this helps LeanAgent have the right “version” of information to prove sorry theorems in a repository.
We compared ReProver with and without retrieval for the original curriculum of 14 repositories. They could only prove sorry theorems from SciLean (24 w/ retrieval, 22 w/o) and MIL (14 w/ retrieval, 12 w/o).
The numbers in the table are now correct.
Correct, the LeanAgent column does not include the ablations or ReProver.
Yes, there are two runs for LeanAgent (during and after).
During the lifelong learning run, we train on the repos in the curriculum order in the “single repository” setting. After each repo, we prove the sorry theorems from that new repo. As mentioned in the paper, we ran the initial curriculum of 14 repos, miniF2F, and then the sub-curriculum of 8 repos, following the use cases of LeanAgent mentioned in the paper.
We prove after lifelong learning only after progressively training over all the repos during lifelong learning.

Also, @Kaiyu Yang here is the evaluation protocol from the paper: “The prover uses a best-first search strategy with no limit on the maximum number of expansions of the search tree. It generates 64 tactic candidates and retrieves 100 premises for each proof state. LeanAgent uses ReProver’s tactic generator for the experiments. We generate tactics with a beam search of size 5. We used 4 CPU workers, 1 per GPU. Due to the wide variety of repositories and experimental setups that we tested, the time for each experiment varied from 4 to 9 days to complete.”

Jason Rute (Oct 24 2024 at 04:06):

Thanks for the clarifications. Just to be clear (and I'll label my questions since it is easier to respond).
Q1. All columns in Table 9, including the the Setup 0-7 columns, include a combination of two runs right, a before and after run?

Jason Rute (Oct 24 2024 at 04:06):

Q2. Further for all columns in Table 9, including the the Setup 0-7 columns, the "after" runs are all trained on the same repos (possibly in a different order) including MiniF2F and the 8 sub-curriculum repos, correct?

Jason Rute (Oct 24 2024 at 04:07):

Q3. I think it would be helpful to separate the before and after scores. I understand to do so it would require additional runs since you only have the after scores for theorems not solved in the before round.

Jason Rute (Oct 24 2024 at 04:15):

Q4. I'm still not sure I understand the point of lifelong learning point, but let me rephrase my question about it. Let's assume you turn this into a practical tool which is continuously run on Lean repositories to fill in sorrys. What have you learned from this paper about how you would implement that in practice? Would you have a single model that is continuously trained on all repos whenever one gets updated? Would you have repo-specific models? Are additional experiments needed? (Also, if you didn't see, I discussed this general problem more in #Machine Learning for Theorem Proving > Keep AI theorem provers up-to-date, including LeanAgent's approach.)

Jason Rute (Oct 24 2024 at 04:19):

Q5. When you say it takes 4 to 9 days to do an experiment, does that mean the time to fine-tune on a repository and run the model on the sorrys? So if I say gave you a new Lean repository I have been working on with a bunch of sorrys in it, would it take about 4 to 9 days to train and fill in sorrys?

Jason Rute (Oct 24 2024 at 04:22):

@Adarsh Kumarappan I forgot to @ you with my new questions above.

Jason Rute (Oct 24 2024 at 04:58):

@Kaiyu Yang I know you are really confused about how they do much better on MiniF2F. I have three simple hypotheses which I think probably all mostly hold. (1) They train on more relevant data, including MathComp and MiniF2F theorems which have proofs. (2) Their fine-tuning of the ReProver's retriever also includes fine-tuning the retriever on Mathlib (at least the Mathlib dependencies used in your MiniF2F repo) and may fix any out-of-date issues with the ReProver retriever. (3) They run two models instead of just one. Indeed, it is possible that option (3) is doing most of the work. I wouldn't be surprised if out of the three models being tested---ReProver, LeanAgent (during life-long training), and LeanAgent (after life-long training)---the sum of theorems proved by any two models is greater than that of the remaining model.

Jason Rute (Oct 24 2024 at 05:08):

LeanAgent Summary

As promised, here is my summary of this paper.

Jason Rute (Oct 24 2024 at 05:09):

There are two aspects to this paper. The first is a tool for filling in sorrys in a repo. This has been repeatedly asked for (#general > Tactic worth running my PC overnight?), and I think it is a great proof-of-concept, if not a usable tool. (I think it isn’t yet released, but they did send PRs to the repos with the proofs found.) Also, it seems to operate along a different paradigm than other tools. It takes a whole repo, fine-tunes a model (in this case ReProver) on that specific repository, and then tries to fill in the sorry theorems. This has the advantage that it could be hosted on a server somewhere with GPUs, and not have to be run locally on, say, a laptop.

Jason Rute (Oct 24 2024 at 05:09):

The model behind LeanAgent is ReProver from the LeanDojo paper (which is also the basis for the LeanCopilot tool). What makes ReProver special (compared to many other LLM-based theorem provers) is that it has premise selection which they call a retriever. LeanAgent fine-tunes the retriever on the repository in question. (I’m not sure why only the retriever is fine-tuned and not the rest of the model. The author says it is because the tactics haven't changed, but I think fine-tuning the rest of the model would actually see bigger improvements, but may also take longer to fine-tune, so it is a trade-off.) It is unclear why fine-tuning the retriever is so helpful, as a premise selector should ideally be able to adapt to new projects, and I discuss this more in #Machine Learning for Theorem Proving > Keep AI theorem provers up-to-date. My two current hypotheses are (1) that the base model ReProver is out-of-date and that is why fine-tuning helps, and/or (2) that ReProver isn’t actually worse, but that different models fine-tuned on different data can prove different theorems (something that is well known and established in many papers). One support for option (2) is that they are running two model checkpoints, each for 10 minutes. One model checkpoint was trained up to and including this repository, and the other was after training on all the repositories. I wouldn't be surprised if it is just a numbers game. Any two checkpoints are better than one.

Jason Rute (Oct 24 2024 at 05:09):

As for some of the criticisms above, such that the theorems it found are trivial and easy to find with exact?, many are probably misunderstandings of this system (and they were also criticisms of ReProver when it came out). LeanAgent doesn’t seem to use exact? (even if that would be faster/better). Instead, it uses premise selection. This may seem silly in the cases where it finds a single lemma to apply, but as other examples above have shown, once it is finding proofs with two or more lemmas, or even finding proofs using one lemma with a bit of extra proof around it, then it goes beyond exact?. Not all proofs it finds will be non-trivial (or beyond existing tools like aesop?), but if it can find some non-trivial proofs fully automatically, then this could be useful tool (assuming it is cheap enough to run in practice).

Jason Rute (Oct 24 2024 at 05:09):

I think the best practical results would come from not using the LeanAgent model alone, but from a diverse mix of models like aesop?, ReProver, GPT-4-based models, and ABEL (even if each is only run for a shorter time period). (Also, I think in Coq, there are a number of models like Graph2Tac and Tactician that would be very well suited for this exact sort of task and maybe even more so since I think they are better at adapting to new repositories than anything in Lean.) Further, if the retriever in ReProver could be more robust to retrieval, maybe there wouldn't be as much need to fine-tune the retriever at all, and that would make a system like this much more practical.

Jason Rute (Oct 24 2024 at 05:09):

Finally, the other aspect of this paper was how exactly they went about fine-tuning the retriever. They felt that they needed to train on one repository before moving the the next. I’m not exactly convinced this is the best approach practically (although I could be mistaken and there is still ongoing discussion above about this), but it was a major focus of the paper. Sequential learning and acting in a changing environment is called “life-long learning” and they used ideas from that field here. They experiment with a few approaches, deciding what order they would train on repositories and whether to continue to train on previous repository data. Their approach does best on their life-long learning benchmarks, but I personally am more interested in how it proves theorems, and that is less clear from their metrics (and the fact that the numbers seem to be a combination of multiple models). (Also, the MIL benchmark is probably particularly flawed since they literally train on the solutions to each problem before attempting the problem.) It may be that they have hit on something really good here, but I'm still not sure.

Jason Rute (Oct 24 2024 at 05:09):

My largest takeaway is that maybe it is time to build such a practical sorry-filling system. If it isn’t too expensive (their system seems to take days to fine-tune so that is a consideration), it could be a helpful assistant even if it only occasionally finds a new theorem. People would write their projects differently, leaving a lot more sorrys in them, hoping that those will be filled in overnight. Some filled-in sorrys will be trivial or exploit logical bugs, but that could be still useful or at least easy to ignore.

Jason Rute (Oct 24 2024 at 05:09):

Such a system (if open source) would also provide a good testing ground for lots of AI approaches, be more aligned to Lean users' AI needs, and address practical concerns such as how to adapt to new and changing projects (#Machine Learning for Theorem Proving > Keep AI theorem provers up-to-date). I think adapting to a changing environment is really the main motivation for this work.

Jason Rute (Oct 24 2024 at 05:09):

While I still hope for more analysis, if not, I hope someone builds this into a practical tool and really puts in the effort to understand how best to engineer it to meet the needs of Lean users.

Kaiyu Yang (Oct 24 2024 at 13:07):

Adarsh Kumarappan said:

Also, Kaiyu Yang here is the evaluation protocol from the paper: “The prover uses a best-first search strategy with no limit on the maximum number of expansions of the search tree. It generates 64 tactic candidates and retrieves 100 premises for each proof state. LeanAgent uses ReProver’s tactic generator for the experiments. We generate tactics with a beam search of size 5. We used 4 CPU workers, 1 per GPU. Due to the wide variety of repositories and experimental setups that we tested, the time for each experiment varied from 4 to 9 days to complete.”

The evaluation protocol makes sense and is mostly the same as in LeanDojo. Just a minor question: How do you generate 64 tactic candidates with a beam search size of 5? I thought with a beam search size of X you could only generate <= X tactic candidates.

Kaiyu Yang (Oct 24 2024 at 13:27):

@Adarsh Kumarappan Thank you for clarifying on the evaluation protocol! I'm still wondering about my main question: In Table 2, among the 85 theorems ReProver proved from Test+Valid, how many are from Test?

Kaiyu Yang (Oct 24 2024 at 13:38):

Jason Rute said:

Kaiyu Yang I know you are really confused about how they do much better on MiniF2F.

Nope, my confusion is not about how but about whether. According to LeanAgent's Table 2, ReProver proved 85 sorry in Lean 4 MiniF2F Test+Valid. However, multiple independent pieces of information suggest that ReProver is likely to perform better than reported:

In my previous analysis, LeanCopilot (based on ReProver w/o retriever, with caveats discussed above) proved 85 sorry in Test and 15 in Valid. That's 85/244=34.8% in Test.
In https://github.com/yangky11/miniF2F-lean4/issues/5, njuyxw reported that ReProver w/o retriever can prove 35.7% sorry in Test.
In this paper accepted to the MATH-AI workshop, ReProver w/o retriever can prove 34.43% sorry in Test. Disclaimer: I'm one of the authors, though I didn't run the experiments.

The median (34.8%) translates to 34.8% x 244 = 85 theorems proved in Test alone. There appears to be a gap with what LeanAgent reported (85 proved in Valid+Test). @Adarsh Kumarappan I think it's worth double-checking if the ReProver baseline experiments were run correctly, and I'm happy to help if needed.

To clarify, quite a few papers (e.g., ABEL and InternLM2.5-StepProver) reported "26.5% on MiniF2F Test" for ReProver. The number 26.5% is taken directly from the LeanDojo paper. It is on the Lean 3 version of MiniF2F and therefore is not comparable with Lean 4 MiniF2F.

Jason Rute (Oct 24 2024 at 13:43):

@Kaiyu Yang Since LeanAgent is just an updated ReProver wouldn’t you think that any issue with how they ran the ReProver baseline would also effect how they run LeanAgent, and their LeanAgent would still be better in the end for the three reasons I give above? (Although if there were an issue with the baseline, it could explain why where were so many instances of obvious things ReProver couldn’t prove.)

Kaiyu Yang (Oct 24 2024 at 13:44):

@Jason Rute It could be. In that case, it's worth re-running not only ReProver but also LeanAgent.

Kaiyu Yang (Oct 24 2024 at 13:49):

@Jason Rute In any case, I think we probably shouldn't overload Adarsh with too many questions. :rolling_on_the_floor_laughing: Many of them are just details and should be obvious when the code becomes available (if that's the plan).

Jason Rute (Oct 24 2024 at 13:52):

Maybe I am asking too many questions, but I also don’t think my recent five questions will be answered from reading the code. :smile:

Oliver Dressler (Jan 12 2025 at 06:27):

(deleted)

Last updated: May 02 2025 at 03:31 UTC