Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Making more training data

Adomas Baliuka (Dec 04 2023 at 02:22):

Mathlib has a lot of manual review in addition to automation like linters. This is essential for growing a math library aimed at users and has to limit the amount of code that gets in. For increasing the amount of training data in Lean, maybe there could be a different repository that grows much faster? One that's not meant for reading by humans but just as a dataset?

For example, it might have CC license and PR policy "auto-merge everything if CI passes". One could then add some more automation for de-duplicating things, sorting results by topic, auto-formatting, etc.

Looking even further, one could also auto-webcrawl and pull in code snippets from Zulip (or elsewhere) when they contain an appropriate copyright/consent comment, and users on Zulip might get some option to "allow harvesting all their code snippets on Zulip". (I'm not sure if Zulip already assumes this consent, or if current AIs are trained on code from Zulip assuming no consent is necessary, not a lawyer, not legal advice, etc...)

Junyan Xu (Dec 04 2023 at 02:48):

There was mathzoo but it didn't gain much traction. I guess everyone is free to scrape public Lean repos on GitHub, and OctoPack contains a decent amount of Lean data. Morph Prover already trains on Zulip data (see blog post).

Adomas Baliuka (Dec 04 2023 at 02:55):

I hadn't heard of Mathzoo, thanks for pointing it out!. However, it currently doesn't seem to have CI and still feels like it's "aimed at humans", by asking for README's and explicitly encouraging to add statements as axioms. I'm actually surprised it doesn't have CI. Is it due to cost? Or just nobody set one up yet? Figuring out a good way that "anyone can easily set up CI for a project that depends on just Mathlib" is a good thing in any case, isn't it?

Mario Carneiro (Dec 04 2023 at 04:35):

I'm sure it's because no one set one up yet

Mario Carneiro (Dec 04 2023 at 04:35):

Re: "aimed at humans", who do you expect to put content in the repo?

Mario Carneiro (Dec 04 2023 at 04:37):

I also don't really understand your point. It doesn't ask for a readme but it does ask for some provenance information regarding the problem, which I think is a reasonable ask for any dataset

Mario Carneiro (Dec 04 2023 at 04:38):

AFAICT the only READMEs in mathzoo are for entire formalization collections, not for individual files

Mario Carneiro (Dec 04 2023 at 04:39):

As for adding statements as axiom, this is important to be able to make claims that follow from complex facts whose proof is out of scope for the problem in question. Without this you would only be able to talk about things near the foundations

Mario Carneiro (Dec 04 2023 at 04:46):

As far as adding other datasets like Zulip, I think that would be fine although you will have difficulty making most of it compile, many posts are context dependent (so getting imports right will be difficult) and it is normal to post things that don't compile here (because most MWEs have errors in them, that's the point). Regarding general web crawl, that sounds like a major legal issue, but all the cool kids are doing it anyway... I'd advise against it.

Adomas Baliuka (Dec 04 2023 at 04:53):

I guess a serious obstacle to having CI on mathzoo might actually be the difficulty of incompatible Mathlib and Lean versions used in different gems?

Mario Carneiro (Dec 04 2023 at 04:54):

well, the fact that it's in lean 3 is the main obstacle

Mario Carneiro (Dec 04 2023 at 04:55):

The situation with multiple lean versions is acknowledged in the README, I think the idea is to run multiple lean versions to compile them

Mario Carneiro (Dec 04 2023 at 04:55):

certainly that is another confounding factor when scraping zulip, posts almost never come with a lean-toolchain

Adomas Baliuka (Dec 04 2023 at 04:56):

As an aside, how hard is it to set up CI for a Lean4 project? I just tried (blind copy-paste) the script from community docs and got an error:

error: toolchain 'leanprover/lean4:v4.3.0' does not have the binary /github/home/.elan/toolchains/leanprover--lean4---v4.3.0/bin/leanpkg

Eric Wieser (Dec 04 2023 at 10:38):

That's lean4#2872

Adomas Baliuka (Dec 04 2023 at 11:46):

Does that mean that there are no instructions on that website on how to set up CI for Lean4 projects? Are there instructions elsewhere?

Alex J. Best (Dec 04 2023 at 13:58):

There are no instructions for setting up CI for lean 4 projects on the community website that I'm aware of. Copying the CI used by some existing lean 4 project, such as https://github.com/leanprover-community/flt-regular/blob/master/.github/workflows/push.yml seems a good way to start

Alex J. Best (Dec 04 2023 at 13:58):

You probably want to delete all steps relating to docs and blueprint to start

Utensil Song (Dec 04 2023 at 14:11):

IMHO having more Lean code with lower quality threshold is not going to help training. Selected and (semi-auto-)annotated Lean data with higher quality is what's really helpful here.

Adomas Baliuka (Dec 04 2023 at 14:13):

Compared to what I'm generally seeing people train/finetune LLMs on, "code that compiles" seems like a very high quality threshold

Utensil Song (Dec 04 2023 at 14:38):

Not exactly in the area of ITP. To name a few quality-related directions:

Mathlib is mostly lack of comments to its informal counterparts, it's an obstacle for informal-to-formal tasks or vice versa;
Zulip #mwe months ago (or any code without code review) might be already non-idiomatic code, which could create more frustration for newcomers using the AI for self-teaching;
Mathlib proofs might be highly golfed, it's hard for LLMs to reason literally based on such highly compressed information;
Extracting goals from Lean code is existing work, but trying different combinations of tactics using RL and produce more variants of the same proof (and their tactic states) for LLM to truly make use of what ITP can teach about the internals of a proof is not seen yet.

Utensil Song (Dec 04 2023 at 14:38):

Also, the work to make sure a bunch of Lean code to compile is actually non-trivial effort as there are many subtleties involved in non-trivial proofs that depends on the moving parts of Lean/Std/Mathlib, much more than just setting up a CI.

Instead of trying to create a hub to host these code while there is little to no incentive for humans to contribute, if you are simply looking for more Lean data outside Mathlib, you might just start with the emerging repos marked with lean4 or using lean language, which is more decentralized and guaranteed to thrive.

Utensil Song (Dec 04 2023 at 14:45):

Particularly, reservoir will set up CI for all such Lean 4 projects and check if they pass for the latest Lean version. People will also be encouraged to keep it green as there would be benefits like easier package discovery, built binary cache, hosted doc-gen4 documents etc.

Patrick Massot (Dec 04 2023 at 14:52):

@Mac Malone Utensil's message above made me check the current state of reservoir and I was very surprised to see Mathematics in Lean with a big red cross suggesting it is currently broken. Following the link suggests reservoir complains MIL doesn't build with v4.4.0-rc1, which completely expected for a project whose current toolchain is leanprover/lean4:v4.2.0-rc4. I understand you wish to indicate which repo tracks monthly Lean releases, but I think this big red cross is a bit too much and doesn't convey the relevant message.

Patrick Massot (Dec 04 2023 at 14:53):

More generally the landing page of reservoir is full of those red crosses and I think this is both unfair and very bad to communicate what is going on with Lean.

Mario Carneiro (Dec 04 2023 at 14:58):

Yes, this is a known issue, it's terrible PR to have all those red crosses

Patrick Massot (Dec 04 2023 at 15:00):

Meanwhilst:

/home/pmassot/lean/mil> lake update mathlib
error: ./lake-packages/mathlib/lakefile.lean:6:2: error: 'leanOptions' is not a field of structure 'Lake.PackageConfig'
error: ./lake-packages/mathlib/lakefile.lean: package configuration has errors
mathlib: updating repository './lake-packages/mathlib' to revision '19096ec3fc6f290dc088f7f3ffd6318d4a0bd93d'
=> Operation failed. Exit code: 1.
error: > git fetch origin    # in directory ./lake-packages/mathlib
error: stderr:
fatal: impossible d'accéder à 'https://github.com/leanprover-community/mathlib4/' : Failed to connect to github.com port 443 after 130579 ms: Connexion terminée par expiration du délai d'attente
error: external command `git` exited with code 128
warning: package configuration has errors, falling back to plain `lean --server`

TL;DR: reservoir :cross_mark: lake :cross_mark:

Kaiyu Yang (Dec 04 2023 at 15:01):

Instead of showing whether the package builds with the latest lean, why not show the latest lean version that can build the package?

Mario Carneiro (Dec 04 2023 at 15:01):

https://leanprover.zulipchat.com/#narrow/stream/270676-lean4/topic/Invalid.20lake.20configuration/near/405630149

Utensil Song (Dec 04 2023 at 15:02):

reservoir aims to keep the builds up-to-date, but maybe for the current status, it should compile with the declared toolchain, and if the declared toolchain compiles but not the latest toolchain, it's better to be a green :check: with an :info: to indicate it's better to be upgraded.

Patrick Massot (Dec 04 2023 at 15:04):

Of course, this is what the Lean 3 analogue of that page did.

Mario Carneiro (Dec 04 2023 at 15:05):

is there a lean 3 analogue of reservoir?

Patrick Massot (Dec 04 2023 at 15:07):

There was a web page on the community website which listed projects together with information about the latest compatible version of Lean.

Patrick Massot (Dec 04 2023 at 15:07):

As far as I can tell this is strictly more than the current state of reservoir.

Mario Carneiro (Dec 04 2023 at 15:07):

is that still available somewhere?

Patrick Massot (Dec 04 2023 at 15:08):

Sure: https://leanprover-community.github.io/lean3/lean_projects.html

Patrick Massot (Dec 04 2023 at 15:09):

reservoir is meant to do much more in the long run. I think the main issue here is that the current state is public. It should be hidden somewhere until it improves to minimally usable version.

Mario Carneiro (Dec 04 2023 at 15:12):

There was some discussion a few months ago about how to show this kind of information on the reservoir website (the front page / browsing page and also the project pages). We talked about having a dropdown to select your current version or force the latest version. A matrix would work too but it's difficult to scale

Mario Carneiro (Dec 04 2023 at 15:13):

I think Mac is aware that the current display is suboptimal, and last I heard he was working on it this week, so if you have concrete suggestions regarding the site design you should suggest them

Patrick Massot (Dec 04 2023 at 15:20):

https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/Making.20more.20training.20data/near/405843447 is clearly strictly better than the current state.

Mario Carneiro (Dec 04 2023 at 15:22):

There should probably be an issue for this

Utensil Song (Dec 04 2023 at 15:35):

leanprover/reservoir#2

Utensil Song (Dec 04 2023 at 15:39):

Sorry, in retrospect, the issue title was written like a PR title.

Mac Malone (Dec 04 2023 at 16:19):

As Mario said, I am working on Reservoir until the start of my Christmas vacation. Thus, please feel free to send me feature requests or report any issues )GitHub is best, but I am happy to discuss on Zulip as well)!

Junyan Xu (Dec 04 2023 at 17:32):

We talked about having a dropdown to select your current version or force the latest version. A matrix would work too but it's difficult to scale

Could be a list with just two columns (the project name and latest compatible Lean version), ranked by the Lean version, which probably encourages the repo authors to keep up to date.

Last updated: May 02 2025 at 03:31 UTC