Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Aristotle
Sid (Jun 11 2024 at 04:58):
https://x.com/tachim/status/1800181481161244941 - 83% on miniF2F after they tack on a CAS (computer algebra system)
Notification Bot (Jun 11 2024 at 12:04):
A message was moved here from #Machine Learning for Theorem Proving > DeepSeek-Prover by Jason Rute.
Jason Rute (Jun 11 2024 at 12:18):
Here is the announcement, which right now seems to be the closest thing to a paper: https://harmonic.fun/news.html If this works, that is pretty cool, but also here are some points that need to be clarified (and frankly look suspicious):
- There is no paper, and that website (which doesn't even have a permalink) is sparse on details.
- They split the validation and test sets up weirdly:
  "To obtain a training set, we re-split the 488 MiniF2F problems (originally evenly divided into validation and test sets) randomly into 392 training, 48 validation, and 48 test problems, where the latter two splits are unbiased random subsets of the corresponding original validation and test sets."
- Also, they train directly on the miniF2F problems more than I'm used to others doing. (I guess I have to go back and check out the papers again and see which results train directly on the validation problems. I'm mostly only aware of people using the validation problems for fine-tuning and reinforcement learning, not of anyone hand-writing proofs for them all and training on those, which is what I think they are doing with their training set.)
- They also heavily modified miniF2F (not sure if this is their new test set or the new training set):
  "We found that many statements' formalizations in MiniF2F were much easier than the original problem, e.g. only containing the easier direction of a biconditional statement. We worked with mathematicians and olympiad competitors to ensure our version of MiniF2F represents each problem fully."
- They don't really say much about the CAS or how they use it. They get 63% without the CAS.
- There are very few details overall.
Jason Rute (Jun 11 2024 at 12:18):
cc @Tudor achim
Jason Rute (Jun 11 2024 at 12:27):
I think to fully evaluate this work, we need the following:
- A paper describing the details.
- A release of the new version of MiniF2F that they created. It may be that their 48 test problems are just easier overall. (We are getting into the range of small sample sizes here.) Or if they are not willing to do that, at the very least, the list of test problems.
- How many attempts do they use to solve the problems? For example, as we saw in the Deepseek-Prover paper, there is something like a logarithmic increase in theorems solved as you increase compute (usually by increasing the number of attempts, but if it is a tree search then the amount of time spent searching is a good measure too).
Johan Commelin (Jun 11 2024 at 12:34):
The change to MiniF2F (if carried out correctly) is something that we've all been wishing for, right? So that's good news, if this new version of MiniF2F will be shared publicly.
Jason Rute (Jun 11 2024 at 12:44):
I'm really worried that MiniF2F is outliving its usefulness:
- Goodhart's law is taking over. Many works are specifically modifying their training data to target MiniF2F, but it is unclear if these approaches would generalize to other use cases of automated theorem proving, like finishing real Lean proofs or even fresh competition problems (like all this year's math competitions).
- MiniF2F has been out there for years and I think it has somewhat leaked, and even if it hasn't, the original problems on which it is based have probably leaked, as well as very similar formalized libraries like https://dwrensha.github.io/compfiles/. (I should really just train a model on compfiles and test on miniF2F and see how that does. :smile:).
- MiniF2F has many versions, some with errors making the problems too easy or too hard. Some which fix these errors. Some are auto-translated from Lean 3 to Lean 4 and are missing problems. And this paper looks like it is making major changes to the problems. It's difficult to know if it should be a stable benchmark, warts and all, or an improving organic thing. I like the idea of a stable benchmark, but I think I'm in the minority here. (Although to be fair, there is also an issue of inconsistencies between ITP translations which has nothing to do with stability. Some problems might be correct in one ITP language and incorrect in another.)
Jason Rute (Jun 11 2024 at 13:09):
But to be clear, as much as I have suspicions about Aristotle, I think incorporating a CAS seems like a great idea, and I could believe that it works well. (For example, I get the impression that the leading scores on the AIMO Progress Prize competition all use sympy.) Of course, in ITP speak, I'm not sure what they mean when they say their system is "one where additional external computer algebra systems are permitted to solve simple subproblems in a trusted way". What does trusted mean here? (But as long as there are no soundness bugs, it probably doesn't matter too much. Even if it is not fully translating from the CAS into Lean right now, that just seems like an engineering problem.)
Joseph Myers (Jun 11 2024 at 21:05):
Jason Rute said:
- They also heavily modified miniF2F (not sure if this is their new test set or the new training set):
We found that many statements' formalizations in MiniF2F were much easier than the original problem, e.g. only containing the easier direction of a biconditional statement. We worked with mathematicians and olympiad competitors to ensure our version of MiniF2F represents each problem fully.
This is unclear about how they handle "determine" problems. Does their fixed MiniF2F require the AI prover to find the answer for itself in such cases, with a human then checking for an acceptable answer, if that was how the original problem was phrased for humans, or are the problems still the easy-mode variant in such cases (where the answer is provided in the formal statement to be proved)?
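To illustrate the two modes with a made-up toy problem (a hypothetical Lean 4 / Mathlib sketch, not an actual MiniF2F statement): in "easy mode" the answer is baked into the formal statement, while in "hard mode" the prover must also exhibit the answer, which a human then checks for acceptability.

```lean
import Mathlib

-- Hypothetical toy problem: "Find the remainder when 123 + 98 is divided by 21."

-- "Easy mode": the answer 11 is already given in the statement; only a proof is required.
example : (123 + 98) % 21 = 11 := by norm_num

-- "Hard mode" (sketch): the prover must also produce the witness; a human (or a
-- separate check) then inspects whether the exhibited value is an acceptable answer.
example : ∃ a : ℕ, (123 + 98) % 21 = a := ⟨11, by norm_num⟩
```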
Joseph Myers (Jun 11 2024 at 21:12):
Jason Rute said:
- MiniF2F has been out there for years and I think it has somewhat leaked, and even if it hasn't, the original problems on which it is based, have probably leaked, as well as very similar formalized libraries like https://dwrensha.github.io/compfiles/. (I should really just train a model on compfiles and test on miniF2F and see how that does. :smile:).
If you're training entirely on synthetic data (like AlphaGeometry) then the leakage doesn't matter. If you're training on public online data at large, or building on an LLM that was so trained, or indeed if you're building on more selective mathematical data extracted from e.g. AoPS, then leakage is a risk that can most effectively be addressed by testing on new problems (from the many olympiads that use new problems every year, for example, as long as you make sure all non-synthetic data for training was obtained before the competition in question was held) rather than expecting a fixed benchmark to work.
Jason Rute (Jun 11 2024 at 22:48):
@Joseph Myers As for handling "determine" problems, I would just assume they leave it the same as MiniF2F, i.e. the easy mode. Of course, if we start solving easy-mode problems too easily, we can switch to hard mode (or better yet, real-world problems!), but the difficulty with hard mode is that it is a very different sort of task requiring different tools and evaluation frameworks. For an ATP benchmark, the easy mode is fine. (Of course, one shouldn't say they solved an IMO problem if it is in easy mode, but that is a different matter brought on by overzealous researchers or overzealous PR departments.)
Jason Rute (Jun 11 2024 at 22:50):
@Joseph Myers To be clear, every paper in AI for ITP that uses Transformers (with the exception of the Transformer baseline in our recent Graph2Tac paper) uses pre-trained models that are capable of leakage. (The AlphaGeometry training-from-scratch paradigm is mostly unheard of.)
Jason Rute (Jun 11 2024 at 22:54):
Joseph Myers said:
If you're training on public online data at large, or building on an LLM that was so trained, or indeed if you're building on more selective mathematical data extracted from e.g. AoPS, then leakage is a risk that can most effectively be addressed by testing on new problems (from the many olympiads that use new problems every year, for example, as long as you make sure all non-synthetic data for training was obtained before the competition in question was held) rather than expecting a fixed benchmark to work.
I would very much be in favor of testing on new problems. If we could organize a committee that was interested in translating the various yearly competitions into Lean, Isabelle, and Coq and convincing researchers to evaluate their models (including older models) on those benchmarks, this would be amazing. I do think it is likely going to be a lot of thankless work, which is why we haven't seen it so far. But it would go a long way to alleviating these concerns. (Note, this same problem exists in AI for generating computer code which is a much bigger field with a lot more resources and corporate backing behind it.)
George Tsoukalas (Jun 12 2024 at 16:08):
Jason Rute said:
Joseph Myers said:
If you're training on public online data at large, or building on an LLM that was so trained, or indeed if you're building on more selective mathematical data extracted from e.g. AoPS, then leakage is a risk that can most effectively be addressed by testing on new problems (from the many olympiads that use new problems every year, for example, as long as you make sure all non-synthetic data for training was obtained before the competition in question was held) rather than expecting a fixed benchmark to work.
I would very much be in favor of testing on new problems. If we could organize a committee that was interested in translating the various yearly competitions into Lean, Isabelle, and Coq and convincing researchers to evaluate their models (including older models) on those benchmarks, this would be amazing. I do think it is likely going to be a lot of thankless work, which is why we haven't seen it so far. But it would go a long way to alleviating these concerns. (Note, this same problem exists in AI for generating computer code which is a much bigger field with a lot more resources and corporate backing behind it.)
We have been working on a new benchmark towards this end, which hopefully can alleviate some of these concerns, at least until proofs of the theorems in the benchmark start popping up online. The nice thing is as long as you don’t include any proofs in the repository and you pick the right problems, the formal proofs have never been written before, so direct contamination is not possible (informal data can cause indirect contamination, though, if you don’t use new problems). We’re aiming to put out an initial release within several weeks. I'm curious which methods/models people think would be useful to benchmark; perhaps we can include some more in our evaluations.
Eric Wieser (Jun 12 2024 at 16:22):
The nice thing is as long as you don’t include any proofs in the repository and you pick the right problems, the formal proofs have never been written before, so direct contamination is not possible
A danger in this approach is that problems for which a formal proof has never been written are far more likely to be incorrectly stated in Lean
George Tsoukalas (Jun 13 2024 at 11:42):
I agree. We do approximate verification by having another person parse the formalization independently to get around some of these cases. We have also found so far that running some symbolic baselines can (infrequently) find formal statements which are trivial to prove due to a missing assumption. In general, there's a tension between really ensuring the statements are correct by writing a proof mimicking the informal one and not compromising the benchmark. Formalizing tons of statements of hard problems is already pretty laborious, and writing the proofs much more so; without committing the proofs for public use, it can seem like misused effort.
Sid (Jul 10 2024 at 06:08):
Harmonic is starting to release their fixed version of miniF2F - https://github.com/harmonic-ai/datasets/tree/main/minif2f - currently they get 90% on their validation set split
Notification Bot (Jul 10 2024 at 06:11):
A message was moved here from #Machine Learning for Theorem Proving > yes by Kevin Buzzard.
Notification Bot (Jul 10 2024 at 11:11):
2 messages were moved here from #Machine Learning for Theorem Proving > Harmonic by Jason Rute.
Jason Rute (Jul 10 2024 at 11:13):
They also released a new blog post on 07.09.2024 (again with no permalink :rolling_eyes: ) https://harmonic.fun/news.html
Jason Rute (Jul 10 2024 at 11:30):
I have so much (mostly negative) to say about this, and it is sad because, under all the shady and bad stuff, they probably are doing good and interesting work. I don't have a lot of time this morning so here are the highlights of my complaints:
- They call their lobotomized version of MiniF2F, wait for it, "MiniF2F", which is going to create so much confusion.
- I don't know if they plan to release their new MiniF2F training data (which is a subset of the old MiniF2F test and validation data) along with proofs, but that would certainly mess up LLM-based provers on the old MiniF2F, since LLM pre-training would pick it up. Maybe a canary string would help.
- They directly compare results on their new validation set with results from the original validation set. Why? Those are different sets of problems, and small samples at that.
- They seem to want to be the leader in this space, keeping track of the previous state-of-the-art results on MiniF2F. I just have no confidence they will give this the consideration and attention to nuance it needs.
- MiniF2F is multi-lingual but their new version isn't. One would think you could just find which validation problems are in the "new" MiniF2F and select those from the old MiniF2F, but they changed the problem names as well. So it is much harder to align results on the new and old versions.
- And they don't even have a paper describing the details.
Jason Rute (Jul 10 2024 at 11:31):
cc @Tudor achim
Jason Rute (Jul 10 2024 at 11:34):
I feel like I'm becoming an Internet curmudgeon arguing on internet forums. If so, and I'm not being reasonable in my complaints, feel free to tell me. (What I really should be doing is working on new benchmarks.)
Aidan Swope (Jul 10 2024 at 19:45):
Hi Jason, I work at Harmonic and I wanted to clarify a couple of things. Thanks for taking interest in our work!
On the MiniF2F benchmark: we intend to release the training and test data in the next few days. Thanks for the feedback regarding the problem names; we will tag the problems with their original MiniF2F name to help align results across different formal systems. Furthermore, the problems we’re releasing are open source, as before, so folks are welcome to remix them into their original splits. Lastly, we are not currently planning to release the proofs, so we think the canary string is unnecessary, but that’s a great idea and we will include it in future releases as appropriate.
On evaluation: we wanted to evaluate our models on unseen problems rather than training on the validation set as previous work had done; hence, the decision to re-split the data. This was a split we created early on, before we generated synthetic training data unrelated to MiniF2F. Our validation and test sets are uniform random subsamples of the original test set (see our blog post), and the original MiniF2F splits were themselves random (see page 3 of the paper), so we consider our evaluations to be unbiased with respect to evaluations on the original splits. However, as you've pointed out, reserving many problems for the training set yields fairly small sample sizes and there will be naturally higher variance in the results. For this reason, and considering the trajectory of results that we are seeing internally, we plan to release other, more difficult, benchmark problems going forward for the community to evaluate their models on.
On publication: we certainly want to release a paper! It's just that the research is moving along fast and we've been focused on that :smile:
Aidan Swope (Jul 10 2024 at 20:22):
OK, I have added a `name` column to the dataset associating statements with their original name.
Sid (Jul 10 2024 at 20:22):
Thanks for the comment @Aidan Swope . One concern that's been raised here in the past is to what extent LLM-based models may be benefiting from having likely seen the informal solutions to the miniF2F (and other benchmarks like FIMO) problems in their pretraining data. Is that something you have a good sense of and would be able to comment on?
Aidan Swope (Jul 10 2024 at 20:31):
Hi @Sid . The truth is it's hard to know, given the broad scope of pretraining and open questions around knowledge transfer within LLMs. However:
- Qualitatively, our model has performed about as well on new problems we've written for evaluation that don't exist anywhere on the internet (we have not yet measured this effect quantitatively)
- Going forward, we plan to evaluate our model on newly-released competition problems that aren't part of the pretraining dataset
- Formally proving a statement in Lean is quite different than generating its informal proof, and I would personally be surprised if the small number of competition math solutions in the pretraining data affected the model much
Jason Rute (Jul 10 2024 at 21:08):
@Aidan Swope Thanks for the reply. Nice to hear from you.
Furthermore, the problems we’re releasing are open source, as before, so folks are welcome to remix them into their original splits.
Maybe that is worth doing, but we already have a cleaned-up MiniF2F Lean4 translation at https://github.com/yangky11/miniF2F-lean4 by @Kaiyu Yang. I think most Lean 4 papers are using the yangky11 version (but I'm not certain). Do you know how your clean-up compares to that version?
(MiniF2F is a mess in Lean 4. With your dataset, there are now three versions of MiniF2F in Lean 4 and no governing body or central repository to organize and maintain the integrity of the benchmark. I do agree that the Hoskinson Center translation of MiniF2F is poor quality. It seems like it was a quick auto-translation with no cleanup. In hindsight, it probably shouldn't have been released.)
Jason Rute (Jul 10 2024 at 21:09):
Also, I still think your repo should be named something like "MiniF2F Resampled" or something indicating it isn't the same splits as the original.
Jason Rute (Jul 10 2024 at 21:09):
On evaluation: we wanted to evaluate our models on unseen problems rather than training on the validation set as previous work had done; hence, the decision to re-split the data.
I'm confused by this. First, I don't think any other paper out there does "training on the validation set" in the traditional sense. (If I am mistaken, I'd love to know a reference.) Many problems in the validation set don't even have ground-truth proofs to train on! Instead, the validation set is used for two purposes specified in the original MiniF2F paper: validation (like hyperparameter validation) and reinforcement learning (e.g. expert iteration, where one tries to solve a problem in the validation set and, if there is a proof, uses it to fine-tune the model). Of course, there isn't a huge difference between RL and supervised training, except that one doesn't use any ground-truth proofs when doing RL, but I do think it is still a significant distinction.
And as for "unseen problems" isn't that what the test split is for?
Jason Rute (Jul 10 2024 at 21:09):
It seems like your main reason for the resplit is that you wanted an in-distribution training set, and you were willing to spend extensive human effort to get that training set by writing proofs for most of the MiniF2F dataset. I still don't know why you didn't just write proofs for the many existing competition problems not in MiniF2F, like going through CompFiles and writing proofs for all the non-MiniF2F problems there.
Jason Rute (Jul 10 2024 at 21:09):
Also, do you think many other current approaches including the ones referenced in the blog post would do a lot better if they trained directly on your new training proofs and tested on your test problems? I assume they wouldn't get 90%, but I think a significant part of your value-add is just better training data.
Jason Rute (Jul 10 2024 at 21:10):
For this reason, and considering the trajectory of results that we are seeing internally, we plan to release other, more difficult, benchmark problems going forward for the community to evaluate their models on.
I am all for this. Maybe with enough benchmarks of new problems, we can get a better sense of the relative quality of various models. Of course, the real challenge will be evaluating some of the older models, which are probably still pretty good, and comparing benchmarks across theorem provers. (I also hope you are willing to run on other people's benchmarks as they come out and not just on your own. :smile:)
Jason Rute (Jul 10 2024 at 21:10):
On publication: we certainly want to release a paper!
I look forward to reading it! (And probably commenting on it here. :smile:)
Sid (Jul 10 2024 at 21:42):
(deleted)
Aidan Swope (Jul 10 2024 at 22:24):
Do you know how your clean-up compares to that version?
We started on this dataset before `miniF2F-lean4` was released and haven't explicitly compared to it. In addition to correctness, we've focused on having accurate natural-language statements and a separate training set. (Side note: I was @Kaiyu Yang's coauthor on LeanDojo not too long ago and I'm happy to see he's still contributing to the Lean world!)
And as for "unseen problems" isn't that what the test split is for?
We would prefer not to do any form of training involving the validation set, whether that's reinforcement learning or supervised learning.
I think a significant part of your value-add is just better training data.
A major component of our approach is reinforcement learning and synthetic data. Manually-written proofs are by now a vanishing fraction of our data. I don't doubt that if others trained LLMs on this data, their models would improve, but that is part of our research methodology.
Sid (Jul 10 2024 at 22:58):
Thanks for the replies and for engaging @Aidan Swope ! One more question: for your results, are they 0-shot, or do they involve some level of test-time search (whether via multiple samples, MCTS, or something else)?
Aidan Swope (Jul 10 2024 at 23:24):
There are parts of the approach I can't talk about quite yet, but they're definitely fancier than "generate a proof and see if it compiles" :smile:
Jason Rute (Jul 10 2024 at 23:28):
Thanks again for the responses @Aidan Swope. I’ve downgraded my feelings from suspicion to mere annoyance. :smile: I look forward to the paper and I’d also love it if you announce updates here as well (papers, blog posts, or major updates to the repo).
Aidan Swope (Jul 10 2024 at 23:30):
Of course, I'll be excited to share news here as it happens!
Kaiyu Yang (Jul 12 2024 at 16:47):
A side note on https://github.com/yangky11/miniF2F-lean4: It's basically Hoskinson Center's auto-translated version plus some manual fixes. Actually, I didn't mean to "release" it, and it's never been used in any of my papers (The LeanDojo paper used Lean 3 MiniF2F). I made it public because in an ongoing project I wanted to trace MiniF2F using LeanDojo, which does not support private repos. Quite surprised to see it has been used by other people....
Aidan Swope (Jul 12 2024 at 21:03):
We've released the training and test sets just now. Please note that there are currently three statements missing from the training set (we have not yet fixed up their formalizations):
mathd_algebra_31
mathd_numbertheory_24
amc12a_2020_p22
Yury G. Kudryashov (Jul 13 2024 at 22:35):
The `algebra_ineq_nto1onlt2m1on` formalization in `yangky11` is wrong, because Lean parses `n^(1/n)` as `(n : Nat)^(1 / n : Nat) = n^0 = 1` for `n > 1`. This makes the problem trivial.
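A minimal Lean 4 sketch of the truncation (assuming current Mathlib; the `example` below is an illustration, not the benchmark statement itself):

```lean
import Mathlib

#eval 1 / 3  -- 0, since `/` on ℕ is truncated division

-- Any exponent of the form `1 / n` over ℕ collapses to 0 once `n > 1`:
example (n : ℕ) (h : 1 < n) : n ^ (1 / n) = 1 := by
  rw [Nat.div_eq_of_lt h, pow_zero]
```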
Yury G. Kudryashov (Jul 13 2024 at 22:37):
Similarly, `algebra_cubrtrp1oncubrtreq3_rcubp1onrcubeq5778` doesn't assume `r > 0`. With current Mathlib, `(-2 : Real) ^ (1 / 3 : Real) = 1`.
Yury G. Kudryashov (Jul 13 2024 at 22:37):
Both errors (and some more) are fixed in our version.
Yury G. Kudryashov (Jul 13 2024 at 22:39):
`mathd_algebra_114` uses `^(1 / 3)` without `: Real`, which is interpreted as `^0`.
Jason Rute (Jul 13 2024 at 23:56):
Which one is “your version” @Yury Yarovikov ?
Jason Rute (Jul 13 2024 at 23:59):
Are you referring to Harmonic’s version? Are you with Harmonic?
Eric Rodriguez (Jul 14 2024 at 00:23):
(you tagged the wrong Yury, btw, Jason) - but indeed, Yury works for Harmonic. [As do I]
Yury G. Kudryashov (Jul 14 2024 at 03:45):
Yes, I work for Harmonic.
Fabian Glöckle (Sep 02 2024 at 15:43):
Reviving this thread to coordinate on a "definitive" version of miniF2F in Lean 4: shall we consider @Kaiyu Yang 's repo as the official version and upstream Harmonic's fixes to it? Shall we host the original valid/test split on Harmonic's repo?
Shall I create a new repo that I promise to maintain for at least 1.5 years?
We should also be clear about what is in scope for "fixes":
For me, a fix keeps the statements saying what they appear to say to humans in https://github.com/facebookresearch/miniF2F (e.g. NOT making an implication an `iff` just because that's what the original competition problem asked for), while adding all the type annotations, assumptions, etc. that prevent the problem from becoming trivial or from saying something other than what (non-Lean-expert) human readers think at first glance.
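As a hypothetical illustration of that scope (assuming Lean 4 / Mathlib; neither statement is taken from any of the repos): the fix adds the missing type ascription and the positivity assumption the informal problem takes for granted, rather than changing what the statement claims.

```lean
import Mathlib

-- "Before": with no ascription the exponent elaborates over ℕ, so `1 / 3 = 0` and the
-- left-hand side is just `x ^ 0`; the statement is trivially true and says nothing
-- about cube roots.
example (x : ℝ) : x ^ (1 / 3) = 1 := by norm_num

-- "After": the ascription makes the exponent real (so `^` is `Real.rpow`), and the
-- positivity hypothesis is added, so the statement is about the intended cube root.
example (x : ℝ) (hx : 0 < x) :
    x ^ ((1 : ℝ) / 3) * x ^ ((1 : ℝ) / 3) * x ^ ((1 : ℝ) / 3) = x := by
  rw [← Real.rpow_add hx, ← Real.rpow_add hx]
  norm_num [Real.rpow_one]
```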
Related: is there any translation of minif2f-curriculum (https://github.com/openai/miniF2F/tree/statement_curriculum_learning/lean/src/statement_curriculum_learning) yet?
cc @Jason Rute @Kaiyu Yang @Aidan Swope @Yury G. Kudryashov @Johan Commelin
Jason Rute (Sep 02 2024 at 17:29):
My opinion is that if MiniF2F is still going to be a thing, it should be put into a common repo. While I trust that you will attempt to maintain the Facebook repo for 1.5 years, history has proved that it hasn't worked out so well with both the OpenAI and Facebook repos. Also, 1.5 years isn't very long. Could a non-profit take MiniF2F over? Numina, Lean FRO, or something else? cc @Albert Jiang
Jason Rute (Sep 02 2024 at 17:30):
Has anyone compared the two versions (Kaiyu's and Harmonic's)? I'm curious how different the results of some model (say, Lean Copilot) would be on the two versions. (I've been meaning to do this with a toy model I have, but never got to it.)
Jason Rute (Sep 02 2024 at 17:32):
Also, it isn't clear what to do with all the papers that came out this year using @Kaiyu Yang's current version.
Fabian Glöckle (Sep 02 2024 at 19:49):
If there are lots of papers using @Kaiyu Yang's current version, the alternative is to freeze that one and not fix any further mistakes.
Kaiyu Yang (Sep 03 2024 at 03:56):
Creating a new repo (under either facebook or another entity) would be great. https://github.com/yangky11/miniF2F-lean4 may lack the proper maintenance needed for a public benchmark. I'm not sure if Kunhao would be interested in helping maintain the new repo.
fzyzcjy (Sep 03 2024 at 09:02):
I am also using miniF2F, and it looks like the workload of maintaining it is only a fraction of that of my current open-source library (e.g. https://github.com/fzyzcjy/flutter_rust_bridge). Thus, I am happy to maintain it; feel free to ping me if there is anything I can help with!
By the way, I have been maintaining the bridge for years, and will have to maintain it for a long time since it is used by so many people.
I am also happy to do some engineering work, such as explicitly versioning it, creating a changelog for every fix, publishing Hugging Face datasets that synchronize with GitHub, etc.
Tudor achim (Sep 03 2024 at 23:41):
Hi all, I'm Tudor, CEO of Harmonic. Re: the discussion above: we work pretty hard on benchmarks internally and plan to release most of them publicly over time, as well as maintain them long-term. https://github.com/harmonic-ai/datasets/tree/main/minif2f is just the start.
Last updated: May 02 2025 at 03:31 UTC