Zulip Chat Archive

Stream: general

Topic: Discussion of NuminaMath-LEAN


Bolton Bailey (Jul 31 2025 at 14:44):

Here's a discussion thread for the announcement of the NuminaMath-LEAN dataset (#announce > NuminaMath-LEAN @ 💬 )

(deleted) (Jul 31 2025 at 16:59):

I contributed to this dataset. I'm very proud! You should be too.

Kevin Buzzard (Jul 31 2025 at 18:43):

Are the 35K formalized proofs all sorry-free?

Bolton Bailey (Jul 31 2025 at 19:09):

Kevin Buzzard said:

Are the 35K formalized proofs all sorry-free?

There are around 2000 human-written proofs that have the with_sorry flag in the ground_truth_type field; these indicate theorems that were partially proven by humans, but with a sorry or sorries.

Doing a bit of napkin math, it looks like this number is actually included in the infographic on the announcement page - I hadn't realized this wrinkle when drafting the Zulip announcement. For more transparency, I'd rather the announcement here only count complete formalized proofs as formalized, so I'll change the announcement text to just mention the number of complete human formalized proofs.

Apologies if this was misleading, and thanks @Kevin Buzzard for prompting me to catch this.

Joseph Myers (Aug 01 2025 at 00:43):

Is there any convenient way of viewing the formalizations (both statements and proofs) if one's interested not in using the dataset as training data but in looking more closely at formalizations of particular problems? That is, any sort of view that lists the competitions covered, and, for each competition, for each year of that competition, which problems have been formalized and their status (proved or not, statement human-reviewed or not) and shows the formal statements and proofs? I'm thinking of something like the Compfiles dashboard, but on a larger scale - to allow looking at all the IMO formalizations or all the USAMO formalizations in this dataset, for example. I'm guessing there might be things in this dataset that would make sense to add to Compfiles, for competitions in scope for Compfiles (where there's at least a human-reviewed formal statement in this dataset) - and I'd be interested in seeing different formalization choices for IMO problems where I've formalized a statement, just as I've compared my choices with those in Compfiles - but it's not clear how to find such things in this dataset, or how to convert this dataset on Hugging Face into a Lean project in git that's structured more conventionally for use with Lean tools. (I certainly hope Lean can handle a project with 100000 files each with one problem and some of those with proofs.)

Joseph Myers (Aug 01 2025 at 00:46):

Incidentally, what's the rationale for using mathlib v4.15.0? I think that for AI formalization tools to be of more practical use, it's important for them to keep up with current mathlib; is there some basis to think that it's fine for training data to use old mathlib even when it's going to be training an AI that needs to work with newer mathlib?

Jason Rute (Aug 01 2025 at 01:10):

I think @Joseph Myers that while it is a commendable goal to keep up to date, you are forgetting how much of a hassle it is to keep Lean projects up-to-date. Not surprisingly, v4.15.0 was the current version in January when Numina got the XTX grant to work on this project, and even then there was a discussion about how Mathlib's ever-changing library was going to cause a lot of problems for them.

Jason Rute (Aug 01 2025 at 01:10):

Jason Rute said:

Jia Li said:

We pray everyday for mathlib to be backward compatible for the next release and try to get things up to date as we can. But it is a huge challenge :)

Jason Rute (Aug 01 2025 at 01:13):

Figuring out how to keep 100k statements and 35k proofs automatically up-to-date would be a major accomplishment on its own and one worthy of accolades.

Jason Rute (Aug 01 2025 at 01:18):

@Joseph Myers If you know Python, it is fairly easy to download a Hugging Face dataset (you can even ask an AI for a program template), and then it also isn't that hard to turn this into a directory with one file per problem.

Bolton Bailey (Aug 01 2025 at 01:20):

Joseph Myers said:

Is there any convenient way of viewing the formalizations (both statements and proofs) if one's interested not in using the dataset as training data but in looking more closely at formalizations of particular problems? That is, any sort of view that lists the competitions covered, and, for each competition, for each year of that competition, which problems have been formalized and their status (proved or not, statement human-reviewed or not) and shows the formal statements and proofs?

I think the only suggestion I have along these lines right now is: In the "Data Studio" tab, there's an option to have an AI help you write a SQL query to find particular data you might be interested in. Unfortunately, I don't think we have any kind of specialized viewer of our own; I have just been using Python as Jason suggests (in fact, I have just now been looking into this exact thing of identifying the IMO problems in our dataset missing from Compfiles - I'm hopeful I can PR at least some of them; David has told me he's amenable).

Jason Rute (Aug 01 2025 at 01:23):

Jason Rute said:

Figuring out how to keep 100k statements and 35k proofs automatically up-to-date would be a major accomplishment on its own and one worthy of accolades.

I'm regretting saying this a bit. I don't actually know how hard this is. This is what Mathlib does all the time and they have more proofs.

Bolton Bailey (Aug 01 2025 at 01:27):

Joseph Myers said:

Incidentally, what's the rationale for using mathlib v4.15.0?

Indeed, I think the rationale is simply that that was the version when the data gathering started / picked up steam, and a dataset-wide update seemed challenging.

WRT training new AIs on old data, I think part of the hope is that things will be similar enough that an agent trained on this data could still be somewhat successful in a newer version environment, and that RL could help to smooth over the cracks, but I agree it's not ideal. It would be interesting to check to what degree the old proofs succeed on the new version without needing changes, or whether the update process could be automated.

Joseph Myers (Aug 01 2025 at 01:38):

Keeping things up to date is generally easier if you do it frequently (at least for every major Lean release) rather than leaving it several months (this is actually generally true for just about any kind of dependency on fast-moving software, not limited to Lean, though the timescales involved may vary). After six months, deprecations get removed from mathlib and you lose that source of hints about how to update something, for example.

Justin Asher (Aug 01 2025 at 01:38):

Bolton Bailey said:

I think part of the hope is that things will be similar enough that an agent trained on this data could still be somewhat successful in a newer version environment

Might help to have more backwards compatibility in Lean. Is there a general reason why Lean needs to change so much as a language? I know that the folks at the Lean FRO work really hard on optimizing both the speed and user experience of Lean, but the frequent changes can make it difficult to maintain things.

Joseph Myers (Aug 01 2025 at 01:47):

It looks like a lot of the problems in this dataset with exam = IMO are actually shortlist problems not selected for the final exams. Now, I think having formal statements and solutions of shortlist problems is of value, but I'm not sure if there is clear metadata somewhere in this dataset that identifies problems as "IMO <year> P<number>" or "IMO Shortlist <year> [ACGN]<number>" (with both of those identifiers where appropriate for a shortlist problem that ended up on the IMO), which is what would really be wanted for adding things to Compfiles.

Joseph Myers (Aug 01 2025 at 01:49):

(Occasionally a problem gets put on the paper in a sufficiently different form from the shortlist version that formal statements / proofs of both would be useful, but that's unusual.)

Joseph Myers (Aug 01 2025 at 01:51):

(I'm just browsing the data viewer with exam=asc sorting here to see what the identified competitions are that problems come from; it's not a very friendly way of viewing Lean code, but does help give an idea of problem sources.)

Kim Morrison (Aug 01 2025 at 01:52):

Justin Asher said:

Bolton Bailey said:

I think part of the hope is that things will be similar enough that an agent trained on this data could still be somewhat successful in a newer version environment

Might help to have more backwards compatibility in Lean. Is there a general reason why Lean needs to change so much as a language? I know that the folks at the Lean FRO work really hard on optimizing both the speed and user experience of Lean, but the frequent changes can make it difficult to maintain things.

Don't worry, we think hard about this, and are making the best tradeoffs we know how to. :-) There are a lot of things we would like to have Lean do better, and making this happen involves changing things!

(deleted) (Aug 01 2025 at 01:53):

Honestly mathlib changes break more things than Lean changes

Bolton Bailey (Aug 01 2025 at 01:54):

Yes, I think the main issues that would arise in an upgrade probably do have more to do with mathlib changes than Lean core changes; at least that's been my experience during the times when I've dumped our files into live.lean-lang.org when something wasn't working on our own platform. Certainly I wouldn't want Lean development to slow down on our account in any case.

In fairness to us, mathlib has the advantage that when someone deprecates or changes something, that person is then responsible for fixing it in the rest of mathlib before CI will accept it (and they are well-suited to make those changes, given that they understand them enough to know the reason for them).

Joseph Myers (Aug 01 2025 at 01:56):

Note that when viewing with exam(asc) sorting, exam="unknown" takes up pages 36 to 1042 of the dataset, i.e. almost all problems lack that metadata, though sometimes the problem text seems to have an indication of the source in an unstructured form that an AI could probably do a reasonable job of extracting heuristically.
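A crude version of that heuristic extraction might look like the following - the regexes and identifier formats are illustrative guesses about how sources appear in the informal text, not something taken from the dataset itself:

```python
import re

# Hypothetical patterns for competition identifiers embedded in problem text.
SHORTLIST_RE = re.compile(r"\bIMO\s+Shortlist\s+(\d{4})\s+([ACGN]\d+)", re.IGNORECASE)
IMO_RE = re.compile(r"\bIMO\s+(\d{4})\s+(?:Problem\s*|P)(\d+)", re.IGNORECASE)

def guess_source(problem_text: str):
    """Return a normalized source id like 'IMO 2019 P2', or None if no match.

    Check the shortlist pattern first, since 'IMO Shortlist 2021 A5'
    should not be mistaken for a final-exam problem.
    """
    m = SHORTLIST_RE.search(problem_text)
    if m:
        return f"IMO Shortlist {m.group(1)} {m.group(2).upper()}"
    m = IMO_RE.search(problem_text)
    if m:
        return f"IMO {m.group(1)} P{m.group(2)}"
    return None
```

A real pass would need many more patterns (USAMO, Putnam, national olympiads, non-English phrasings), which is where an AI-assisted extraction could earn its keep.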

Bolton Bailey (Aug 01 2025 at 02:00):

Joseph Myers said:

It looks like a lot of the problems in this dataset with exam = IMO are actually shortlist problems not selected for the final exams. Now, I think having formal statements and solutions of shortlist problems is of value, but I'm not sure if there is clear metadata somewhere in this dataset that identifies problems as "IMO <year> P<number>" or "IMO Shortlist <year> [ACGN]<number>" (with both of those identifiers where appropriate for a shortlist problem that ended up on the IMO), which is what would really be wanted for adding things to Compfiles.

Yes, this would be good metadata. I know that information was recorded informally for at least some of the IMO problems I looked at; my guess is that there's no field for it because it's too inconsistent across exams, but I can check whether there's anything we can do to add that information or make it easier to access.

Mario Carneiro (Aug 01 2025 at 02:16):

Jason Rute said:

Jason Rute said:

Figuring out how to keep 100k statements and 35k proofs automatically up-to-date would be a major accomplishment on its own and one worthy of accolades.

I'm regretting saying this a bit. I don't actually know how hard this is. This is what Mathlib does all the time and they have more proofs.

I think @Kim Morrison and the rest of the nightly-testing team definitely deserve accolades

Jason Rute (Aug 01 2025 at 02:34):

Serious question since it's been brought up. How hard would it be to update this data to Lean 4.21?

(deleted) (Aug 01 2025 at 02:38):

Proofs that use Complex.abs are the hardest to update.

(deleted) (Aug 01 2025 at 02:38):

The rest aren't as bad.

(deleted) (Aug 01 2025 at 02:42):

There are data contributors who work directly on latest Lean and mathlib and port their proofs back to ancient Lean and mathlib by copying and pasting missing lemmas into the proofs.

Bolton Bailey (Aug 01 2025 at 02:43):

Jason Rute said:

Serious question since it's been brought up. How hard would it be to update this data to Lean 4.21?

If I select random proofs from the dataset, I find most of them compile just fine in live.lean-lang.org, so I suppose one approach would be "just recompile everything and filter out the ones that fail".
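A sketch of that "recompile everything and filter" idea - the `lake env lean` invocation assumes the files have been dropped into a Lean project pinned to the target toolchain, which is an assumption about the setup rather than anything the dataset provides:

```python
import subprocess
from pathlib import Path

def proof_compiles(lean_file: Path, project_dir: Path) -> bool:
    """Compile one proof file inside a Lean project on the target toolchain.

    Assumes `project_dir` is a lake project whose lean-toolchain and
    Mathlib pin match the version you are testing against (e.g. v4.21).
    """
    result = subprocess.run(
        ["lake", "env", "lean", str(lean_file)],
        cwd=project_dir, capture_output=True, text=True,
    )
    return result.returncode == 0

def filter_compiling(files, check):
    """Split files into those that pass `check` and those that fail."""
    ok, failed = [], []
    for f in files:
        (ok if check(f) else failed).append(f)
    return ok, failed
```

Running that over 35k proofs would be slow but embarrassingly parallel, and the failing subset would give a concrete measure of how much churn an update actually involves.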

Kevin Buzzard (Aug 01 2025 at 07:02):

I bump mathlib two or three times a month in FLT and I would not say that it is difficult; most changes just involve renaming (and we have a deprecation system now) or complying with new linters (and the repo probably doesn't have to worry about that). I can't remember the last time I failed to get a bump over the line in one sitting and needed to ask for help.

I had suspected that the reason the tech companies were not bumping as frequently was simply that it's presumably much harder to bump an LLM.


Last updated: Dec 20 2025 at 21:32 UTC