Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Dataset to rule them all?


Vasily Ilin (Jun 30 2025 at 06:04):

Is there a Lean dataset "to rule them all"? I mean, a dataset that combines all or almost all Lean code available in the public domain? I found Lean-Github, not sure how much it includes.
Alternatively, is there a list of several datasets that ideally do not overlap much and span all public Lean code?

Jason Rute (Jun 30 2025 at 11:54):

Don’t have the full answer, but some considerations: Do you mind if it contains Lean benchmarks like MiniF2F? Also do you want it to contain the synthetic Lean code datasets gotten through autoformalization (as used in say DeepSeek Prover v1)? Do you want just Lean 4, or also Lean 3. Do you want code only or intermediate proof states? Do you want Lean code written just in the past few months too?

Bartosz Piotrowski (Jun 30 2025 at 21:04):

Not a list of Lean repos, but this tool may be useful for you for extracting data: https://github.com/facebookresearch/LeanUniverse

Vasily Ilin (Jul 06 2025 at 23:22):

@Jason Rute
I don't mind if it contains minif2f, etc
I don't mind synthetic data, but would prefer real data
Lean4
either one, I can extract intermediate states myself
recency does not matter


Last updated: Dec 20 2025 at 21:32 UTC