Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: What's state of public Lean datasets and augmentation?

Sam (Dec 09 2023 at 09:47):

I'm interested in playing around with developing some Lean models (low budget, so not much chance of interesting success at this stage) and would like to avoid putting too much work into data extraction if it's already been done. A premade dataset would be ideal. I'm aware of the LeanDojo dataset and the MLFMF dataset - which looks a bit harder to use, but maybe more flexible. Are there any others which I've missed?

I'm also curious whether any significant effort has been put into data augmentation. I'm not a Lean expert (nor a mathematician, I'm very much just an interested amateur here) but it seems to me that there are extensive opportunities to expand the size of these datasets through equivalence preserving transformations.

For a trivial example, rw [dropSlice_eq, drop_drop, add_comm] can be split into separate rws. Other examples might include:

Rewrites (and similar) could be reordered when there's no dependency between them.
Expressions in the context could be put into equivalent forms (and then a step inserted into the proof to undo this)
Context statements could be reordered (as far as permitted by dependencies)
Tactics which can be "inlined" into invocations of other tactics could be inlined in the dataset

I'm sure I'm not the first person to think of any of this, but I'm wondering what the state of building a reusable high quality dataset with support for these types of augmentations is? In theory, this should be able to go some way towards mitigating the small dataset issues.

I'm also interested in versions of proofs in which some or all premises are anonymised as a form of augmentation.

Sam (Dec 09 2023 at 09:51):

I should point out that although simple augmentations like the rw one could be done via string manipulation, to be most useful we would probably want the dataset to include contexts, so even these simple augmentations would need to be done within Lean itself.

Sam (Dec 09 2023 at 09:55):

It would also be interesting to have a dataset which includes term expansions for each tactic invocation. At the very least, I'd like to try training tactic expansion as an auxiliary learning task.

Jason Rute (Dec 09 2023 at 13:01):

Some of your first questions about available tools and datasets might be answered in #Machine Learning for Theorem Proving > How to run a simple ATP baseline for Lean but honestly this is moving so fast that I don’t know what all the available tools and datasets are for Lean 4, and how easy they are to work with.

Jason Rute (Dec 09 2023 at 13:04):

As for tactic decomposition, this used to be more common especially in the pre-transformer works like CoqGym and ProverBot9001. The idea is that if you are not using a pure language model solution to generate tactics, then you need some sort of explicit list of tactics, and having fewer might be better.

Jason Rute (Dec 09 2023 at 13:05):

For some other data augmentation see the PACT paper for some ideas, but I agree a lot more can be done in this space.

Jason Rute (Dec 09 2023 at 13:14):

Also maybe see this PA.SE answer which also talks about some of the augmentations you were thinking like anonymizing the goals. The key to making you own augmentations would probably be to learn some Lean metaprogramming through the metaprogramming book.

Sam (Dec 09 2023 at 13:21):

Thanks @Jason Rute, those all look like useful resources!

Do you know if there's interest in, or any effort towards, a community maintained dataset project? It seems that most paper releases do this themselves, and I suspect this slows the pace of progress significantly. I know that I personally am less likely to persue this interest in the short term given the amount of work it would take (including probably requiring me to learn Lean metaprogramming to relatively high proficiency) to get my ideal dataset together.

It would be cool to have one single dataset, along with a standard tool capable of lots of different augmentations.

Sam (Dec 09 2023 at 13:22):

When I briefly looked at the metaprogramming stuff, it seemed like maybe LeanDojo's data extraction pipeline was a good starting point? But I'm not sure yet.

Sam (Dec 09 2023 at 13:25):

It feels like progress in the Lean ML space currently requires quite high proficiency in both Lean and ML, and that's quite a small set intersection. Personally, I lack Lean profiency beyond a superficial level. It would be cool to have the tooling good enough to open this area of research up to the wider ML community.

Jason Rute (Dec 09 2023 at 13:26):

As for proficiently, that is tricky. I recommend ML and ITP pair up to help with this.

Jason Rute (Dec 09 2023 at 13:28):

As for general purpose tooling, I would also like to see more. I personally want to play around with the dataset extractor and ML Aesop extensions mentioned in that other thread. But I also think it is hard to make a one size fits all solution so the best hope is easy to modify code. :smile:

Utensil Song (Dec 09 2023 at 13:28):

Sam said:

When I briefly looked at the metaprogramming stuff, it seemed like maybe LeanDojo's data extraction pipeline was a good starting point? But I'm not sure yet.

LeanDojo's data extraction pipeline and lean-training-data are the best (open-source) references so far.

Jason Rute (Dec 09 2023 at 13:32):

@Utensil Song Is it easy to modify? I’m personally trying to figure out whether to try that or lean-training-data , especially if I need to modify the code to return different types of information from the goal states.

Jason Rute (Dec 09 2023 at 13:33):

I guess I should just look at both.

Jason Rute (Dec 09 2023 at 13:34):

(Oh sorry, you mention both.)

Sam (Dec 09 2023 at 13:39):

Ah, lean-training-data looks interesting! That's one that I'd missed. It seems there's some interest in making that a community tool - the readme invites requests for additional features.

Abhishek Anand (Dec 09 2023 at 13:41):

Sam said:

...

Expressions in the context could be put into equivalent forms (and then a step inserted into the proof to undo this)

Context statements could be reordered (as far as permitted by dependencies)

A alternative/complementary approach could be to investigate encodings that have the desirable properties like premise order invariance. These may require training exclusively on proofs data, which may be a disadvantage compared to pre-training on all of the internet. Some examples are:

it seems that isarstep uses an encoding that has premise order invariance baked in (see fig 2 of https://arxiv.org/abs/2006.09265)
tree encoding https://www.microsoft.com/en-us/research/publication/novel-positional-encodings-to-enable-tree-based-transformers/

Sam (Dec 09 2023 at 13:45):

@Abhishek Anand Yeah, for those particular examples baking invariances into the models can be a good option, although there are many more available augmentations where this doesn't apply.

Actually though, I wonder whether baking these invariances in actually works out as being efficient. You get better inductive biases, but you lose the "extra training data" from training on all of the permuted version. There's probably a paper out there which answers this question.

Utensil Song (Dec 09 2023 at 13:48):

Jason Rute said:

Utensil Song Is it easy to modify? I’m personally trying to figure out whether to try that or lean-training-data , especially if I need to modify the code to return different types of information from the goal states.

Implementation-wise, they are very different.

LeanDojo uses Lean meta-programming to extract everything they possibly care about to a trace IR, then processes the IR in Python, and works with other data pipeline tools. lean-training-data focuses on using pure Lean meta-programming to directly extract different types of target datasets, which is potentially better for the specific tasks as it will inspect and transform things in Lean for that task.

Sam (Dec 09 2023 at 13:56):

If we were to setup a data augmentation pipeline, there would be two general ways of doing it:

Build tooling which can augment samples on request, and put this in your training loop in the normal way of augmentation. This is ultimately the most powerful, but getting it setup to run on the training machine might be a bit of faff, and I don't know whether it could be fast enough not to be a bottleneck. This seems somewhere between hard and impossible.
Build a tool as above, and provide a pre-generated dataset a few orders of magnitude larger than base mathlib by baking in lots of augmentations. This is ultimately less powerful, but provides a much more accessible way of getting started.

Sam (Dec 09 2023 at 14:00):

@Scott Morrison would you consider data augmentation transforms to be in scope for lean-training-data?

Jason Rute (Dec 09 2023 at 14:04):

1 doesn’t seem feasible to me since once you remove the data from Lean you have removed the information you need to augment it. Unless like LeanDojo (from what it seems), you are extracting tons of metadata with the data you need, but even then it might be too hard.

Sam (Dec 09 2023 at 14:07):

@Jason Rute I was assuming the augmentation would be done as part of the extraction process. Essentially the pipeline would have to take some proof, feed it through an augmentor, and then extract multiple augmented versions.

I agree that this seems almost impossible to do in the training loop after the fact.

Jason Rute (Dec 09 2023 at 14:07):

Oh, I think I misunderstood 1. You mean your tool will extract data from Lean in the format you need it (with whatever augmentations you request). This seems best to me. It doesn’t even have to be a tight integration with the ML pipeline, unless you are doing RL.

Sam (Dec 09 2023 at 14:09):

Yeah. Ideally of course you want an effectively infinite amount of data, which means an effectively infinite amount of augmentation - which is best achieved by doing it live. But that seems infeasible. The closest we could get to that would probably be re-running the augmented extraction every epoch.

Sam (Dec 09 2023 at 14:10):

Realistically, just having a 100X size dataset with pre-applied augmentations would be a great start. If you can stick something like that on huggingface, the barrier to entry goes way down.

Sam (Dec 09 2023 at 14:22):

lean-training-data's readme includes TODO: Break out individual steps of rw and simp_rw, with intermediate goals. This is easy to do, just needs some plumbing. which looks like a vague start in the right direction!

Sam (Dec 09 2023 at 14:25):

Related to all of this, it would also be cool to put community effort into a really good Lean tokenizer. ReProver just used bytes because tokenizing lean is hard, but I bet it's possible to do better.

Sam (Dec 09 2023 at 14:26):

As bare minimum, one could use a byte tokenizer + special tokens for common strings like tactic names.

Jason Rute (Dec 09 2023 at 14:56):

You have to be careful. If you use a custom tokenizer, you can’t use pretrained models.

Sam (Dec 09 2023 at 15:05):

Well, I haven't tried it, but I suspect in some cases moving to a custom tokenizer might be reasonable as a finetuning step.

Jason Rute (Dec 09 2023 at 15:08):

So you would not use the original vocabulary embeddings, but just the internal weights?

Sam (Dec 09 2023 at 15:09):

Related to these augmentation discussions, we can also get a lot more data out of mathlib by introducing some other tasks such as

Given some goal state and some future goal state (not necessarily on the proof path); what tactic sequence achieves that state?
Given some goal state and some tactic sequence (not necessarily on the proof path); what goal state results?

Although these aren't exactly augmentations, I think they have their place in the same tool. The method to generate them would be to apply random (but sensible) tactics to each goal state and extract. Obviously you don't want to train the model to generate those random tactics, but you can train it to understand their consequences.

Sam (Dec 09 2023 at 15:10):

Jason Rute said:

So you would not use the original vocabulary embeddings, but just the internal weights?

Yeah, I think that should be possible within reason. Probably. I'm sure somebody's tried it in some domain.

Jason Rute (Dec 09 2023 at 15:14):

It would certainly make sense to extend the PACT approach with whatever auxiliary data you want (if it fits in the context). All your suggestions sound reasonable. The key I think is to find the right data to improve the search overall. That might require not just auxiliary/augmented data, but different ways of searching entirely.

Sam (Dec 09 2023 at 15:15):

One thing to be cautious of (which weirdly I've never really seen discussed) is that more compact tokenization reduces computational capacity. Models do computation per token, so a higher information per token rate should in theory make the model weaker, although to some degree that may be offset by the task being easier.

So if, for example, we tried to finetune the ByT5 ReProver model to use fewer but larger tokens, it may struggle because you've reduced it's internal capacity by a factor of mean_bytes_per_token.

Weirdly I've never seen this information density per token idea discussed, but it seems like it must surely be true.

Sam (Dec 09 2023 at 15:18):

Jason Rute said:

It would certainly make sense to extend the PACT approach with whatever auxiliary data you want (if it fits in the context). All your suggestions sound reasonable. The key I think is to find the right data to improve the search overall. That might require not just auxiliary/augmented data, but different ways of searching entirely.

Even outside of search, I'd love it if we could have a really strong Lean base model to build all of those search type models off of. That's why I'm so interested in extensive augmentation and auxiliary tasks designed to force a really strong internal world model.

Sam (Dec 09 2023 at 15:21):

My interest got renewed today after seeing the hype about Mamba potentially blowing things wide open in this space. If the early indications pan out, and we really do have a faster and stronger alternative to transformers, the barrier to building large strong base models just went down...

Jason Rute (Dec 09 2023 at 15:29):

mamba?: https://arxiv.org/abs/2312.00752

Jason Rute (Dec 09 2023 at 15:30):

Yes, I’ve been saying for a while we need to do more with state space models. Maybe it is time for me to follow through on that. :smile:

Sam (Dec 09 2023 at 15:33):

That's the one! I've only skimmed the paper so far and it's got lots of new concepts to get my head around, but it's picking up some hype for good reason.

Somebody on the machinelearning subreddit implemented it wrong by accident (throwing in a couple of extra feedforward layers in each block) and got some very impressive results on a toy character level shakespeare model. I ran their colab and got very impressive results in 13 minutes of training on a T4 and went "Oh, they might have something here..."

Sam (Dec 09 2023 at 15:35):

I haven't looked into it enough to know if it's actually good, but that result definitely left me wanting to play with it! My first thought was that I'd love to try it on Lean, and then I hit the datasets problem and ended up here.

Sam (Dec 09 2023 at 15:38):

Jason Rute said:

Yes, I’ve been saying for a while we need to do more with state space models. Maybe it is time for me to follow through on that. :smile:

If Mamba fully lives up to the hype, I'm afraid you may be 8 days too late to be the guy who defined the future of AI. :laughing:

Jason Rute (Dec 09 2023 at 15:44):

It is not about the model but how you use it. State space models have long contexts and the ability to checkpoint an initial segment of the conversation. That could be used to incredible effect. See the discussion in #Machine Learning for Theorem Proving > HiPPO papers (efficient sequence modeling).

Sam (Dec 09 2023 at 15:45):

Yeah, but early results are also showing them training faster/better than transformers for the same number of tokens and parameters, which is pretty cool if it scales!

Sam (Dec 09 2023 at 15:50):

Checkpointing isn't exactly exclusive to SSMs is it? You can do it via KV cache, but of course those are much larger.

Jason Rute (Dec 09 2023 at 15:53):

Exactly, it is so much larger.

Sam (Dec 09 2023 at 15:59):

Yeah, part of what got me interested in using Mamba with Lean is some ideas I had for how to make efficient use of that checkpointing. My thinking is that it should make it easier to do multi task learning where at each step (say after each tactic) you branch into a few different learning tasks.

Of course this can be done with a transformer based LLM and a custom mask, and there are some FlashAttention implementations that can make that masking somewhat efficient in some cases, but Mamba should be able to handle it more naturally I think.

Sam (Dec 09 2023 at 16:05):

It's worth noting that part of what makes Mamba so promising is that Tri Dao (of FlashAttention fame) collaborated on a very high performance CUDA kernel for it, which allows it to be very fast despite some of the pitfalls usually present with recurrent models. They use tricks to keep the checkpoints in SRAM which really makes this go fast.

Those tricks probably start to lose their effectiveness if you interrupt the kernel to branch too often.

Sam (Dec 09 2023 at 16:18):

Actually it looks like the fast path used for training doesn't return states at all unfortunately. But that's just from scanning the python code that calls into the kernel - I should actually play with it.

https://github.com/state-spaces/mamba/blob/main/mamba_ssm/modules/mamba_simple.py#L122

Scott Morrison (Dec 11 2023 at 23:06):

Sam said:

Scott Morrison would you consider data augmentation transforms to be in scope for lean-training-data?

Sorry about the slow reply here; yes, this would definitely be in scope, and I'm strongly in favour of doing data augmentation in Lean rather than a post-processing in python. You have the Lean tactic monads available to do meaningful transformations.

I am really hoping to put a lot of work into lean-training-data come January. :-)

Scott Morrison (Dec 11 2023 at 23:07):

There is also the Lean REPL, which gives some flexibility for do data augmentation "post-hoc". E.g. starting from a string you can run a partial declaration with a sorry in it in the REPL, get access to the tactic state, run tactics that modify the tactic state in some "augmenting" way, and then pretty print (perhaps using extract_goal) the new goal back to string for your "offline" store of data.

Jason Rute (Dec 12 2023 at 02:03):

One thing that would be very useful I think is to have a mechanism (if it doesn’t already exist) in lean-training-data and lean-copilot for keeping track of the internal/hidden state of the search. I have this impression that the data in lean-training-data and the tactic prediction API in lean-copilot are stateless. A prediction can’t use the previous tactic history in the partial proof. It also can’t pass hidden information forward, which is important for doing more complicated search algorithms, or for using RNNs in your search.

Jason Rute (Dec 12 2023 at 02:04):

Also if one wants to do everything in Lean, there needs to be a way to store metadata along with new theorems/definitions as they are created. Right now for example, I don’t think lean-copilot can do premise selection for new definitions.

Last updated: May 02 2025 at 03:31 UTC