Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: any "open training data; open training code" IMO solvers ?


TongKe Xue (Nov 29 2025 at 07:03):

Let OTDOTC = "open training data, open training code"; e.g., this excludes OpenAI, DeepMind, Anthropic, and Llama, as well as their fine-tunes.

Let's also go ahead and exclude the geometry IMO problems, as I believe they are solved via a separate pipeline (a procedure for adding auxiliary points plus a decision procedure).

  1. subject to the above constraints, I'm curious if there is anything that can solve > 50% of { lean4 formalized, non-geometry } IMO problems
  2. basically I'm looking for a system I can train + run end to end myself and that can solve some portion of the non-geometry IMO problems

I expect this to be easier than training a generic foundation model, as the miniF2F dataset is << all of the internet.

Context: I'm curious if hobbyists, outside of the top math-super-intelligence labs, can build mini-versions of these solvers on their own. In particular, I think this tech is useful for guiding code synthesis;
i.e. if
vibe coding = "English input; maybe-correct code output" and
hoogle / loogle = "type sig input; find def w/ type sig"
then maybe this can help build
input = formal spec of input/output conditions
output = code that satisfies formal conditions

Notification Bot (Nov 29 2025 at 09:12):

This topic was moved here from #general > any "open training data; open training code" IMO solvers ? by Patrick Massot.

Gerben Koopman (Nov 29 2025 at 15:57):

Maybe check out ReProver. It's a ByT5 model with weights on Hugging Face and the dataset can be found here (it's basically a dict of pointers into Mathlib commit 29dcec074de168ac2bf835a77ef68bbe069194c5). You might also be interested in Kimina-Prover, which has various smaller versions, but I haven't had any experience with it yet.
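
If you want to poke at the tactic generator, something like the following should work (untested sketch; the checkpoint name is from memory, so double-check it on the Hugging Face Hub):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Checkpoint name from memory; verify on the Hub before relying on it.
name = "kaiyuy/leandojo-lean4-tacgen-byt5-small"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

# A pretty-printed proof state, roughly as it appears in the LeanDojo data.
state = "n : ℕ\n⊢ Nat.gcd n n = n"
inputs = tokenizer(state, return_tensors="pt")
outputs = model.generate(inputs.input_ids, max_new_tokens=64,
                         num_return_sequences=4, do_sample=True)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))  # candidate tactics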

Gerben Koopman (Nov 29 2025 at 16:01):

(Warning, shameless self-plug ahead haha) You could also check out Lean Reinforcement. It's basically intended to be an MCTS wrapper around the ReProver model, with a custom-trained value head. It's still in development, but last Sunday I got a full end-to-end successful run on the cluster I use. Sadly it is currently undergoing maintenance, but once it's back online I hope to do a full run with proof parallelization implemented (the current parallelization in the last 4 commits is not that).

Gerben Koopman (Nov 29 2025 at 16:03):

Once I have something more concrete I might post it; for now it is very much early days. Of course, systems like HyperTree Proof Search or AlphaProof take many of the same ideas with better / more efficient implementations.

TongKe Xue (Nov 29 2025 at 18:28):

My very limited understanding of AlphaGo / MCTS is:

inductive Cell | white | black | empty
abbrev BoardState := Fin 19 → Fin 19 → Cell

ideally, we want a function evalFn : BoardState → Float (having it would tell us the best next move, via the argmax over successor states)
we don't have such a function, but we learn a NN approximation of it via lots of MCTS self-play

Questions:

  1. is the core of these non-geometry IMO provers "just merely" MCTS applied to proof states & scaled up ?
  2. BoardState is always fixed size; a Lean4 ProofState, however, might have an arbitrary number of hypotheses, each of which might be arbitrarily long; how is "ProofState" represented in the MCTS provers ?

TongKe Xue (Nov 29 2025 at 18:44):

There is something else I am misunderstanding here.

  1. ByT5 refers to https://arxiv.org/abs/2105.13626 right ?
  2. To the best of my knowledge, ByT5 is a normal LLM, with the tokenizer replaced by direct processing of the raw bytes.
  3. https://github.com/lean-dojo/ReProver then states "Our tactic generator is a ByT5 model finetuned to generate tactics given a proof state."
  4. (3) confuses me because -- why is the world knowledge in ByT5 useful for generating Lean tactics ? [I expect only a tiny fraction of ByT5's training data involves Lean tactics, and I expect knowing things like the capitals of the world's states/countries to be useless for Lean tactics]

I don't know what I am misunderstanding, but clearly I am getting something wrong as I hold a set of self-contradictory beliefs. What am I fundamentally misunderstanding ?

Thanks!

Gerben Koopman (Dec 01 2025 at 11:45):

TongKe Xue said:

Questions:

  1. is the core of these non-geometry IMO provers "just merely" MCTS applied to proof states & scaled up ?
  2. BoardState is always fixed size; a Lean4 ProofState, however, might have an arbitrary number of hypotheses, each of which might be arbitrarily long; how is "ProofState" represented in the MCTS provers ?

So, to first answer these questions: in short, no. The idea is, I guess, twofold.

Firstly, LLMs are basically good enough at string generation that they can be used as tactic generators. Nothing in this setup yet forces legal tactics (/moves/actions). The generated tactic is then sent to the Lean interface (LeanDojo / REPL / LeanInteract / etc.), where it is checked for validity. Now, you could stop there: if the LLM is good enough, it will iteratively provide tactics that advance the proof state until the proof is complete.
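
A minimal sketch of that outer loop, in Python (generate_tactics and check_tactic are hypothetical stand-ins for the LLM and for whichever Lean interface you use; don't read this as any real API):

from dataclasses import dataclass
from typing import Optional

@dataclass
class TacticResult:
    ok: bool                     # the tactic was legal and produced a new state
    finished: bool = False       # no goals remain: the proof is complete
    state: Optional[str] = None  # the new proof state, if any

def prove(init_state, generate_tactics, check_tactic, max_steps=50):
    state, proof = init_state, []
    for _ in range(max_steps):
        for tac in generate_tactics(state):    # LLM proposes candidate strings
            result = check_tactic(state, tac)  # Lean checks the candidate
            if result.ok:                      # keep the first legal tactic
                proof.append(tac)
                if result.finished:
                    return proof               # proof complete
                state = result.state
                break
        else:
            return None                        # no candidate was legal: stuck
    return None                                # step budget exhausted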

However, the second idea is that you can do something a bit more elaborate; AlphaProof, for example. Instead of feeding the tactic straight into the Lean interface and discarding a proof attempt once you go down a wrong branch, you keep several branches alive with the MCTS algorithm from reinforcement learning. This way you have more chances to find a correct proof (though this is slightly vacuous, as you also use correspondingly more compute). Importantly, however, you smartly save the results of compute already spent as nodes in your tree. Finally, a bit more on the training/engineering side: by doing this you also generate much richer data that you can later train on again -- either a value head (as you need in certain MCTS implementations), the tactic-generator LLM, or your whole setup if you like. (Rough sketch at the end of this message.)

Regarding board size: we don't need to know it in this case. For Chess or Go you need to simulate an opponent's actions and the board value, but with Lean you only need success/failure/finished/error. That way we can decide on a max number of nodes in our tree, or set a max compute time, and let the model explore as long as it likes within those constraints. 'Moving off' the board is not a thing here.
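
To make the keep-several-branches idea concrete, here is a toy budgeted best-first search, a much simplified stand-in for real MCTS (propose_tactics, apply_tactic, and score are hypothetical stand-ins for the tactic generator, the Lean interface, and a value head; apply_tactic returns the same kind of ok/finished/state result as in the previous sketch):

import heapq
import itertools

def search(init_state, propose_tactics, apply_tactic, score, max_nodes=1000):
    tie = itertools.count()  # tie-breaker so heapq never compares proof states
    frontier = [(-score(init_state), next(tie), init_state, [])]
    expanded = 0
    while frontier and expanded < max_nodes:  # a node budget, not a board edge
        _, _, state, proof = heapq.heappop(frontier)  # most promising node
        expanded += 1
        for tac in propose_tactics(state):
            result = apply_tactic(state, tac)  # success/failure/finished/error
            if not result.ok:
                continue                       # illegal tactic: discard this edge
            if result.finished:
                return proof + [tac]           # complete proof found
            heapq.heappush(frontier, (-score(result.state), next(tie),
                                      result.state, proof + [tac]))
    return None                                # budget exhausted without a proof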

Gerben Koopman (Dec 01 2025 at 11:53):

TongKe Xue said:

  1. ByT5 refers to https://arxiv.org/abs/2105.13626 right ?
  2. To the best of my knowledge, ByT5 is a normal LLM, with the tokenizer replaced by direct processing of the raw bytes.
  3. https://github.com/lean-dojo/ReProver then states "Our tactic generator is a ByT5 model finetuned to generate tactics given a proof state."
  4. (3) confuses me because -- why is the world knowledge in ByT5 useful for generating Lean tactics ? [I expect only a tiny fraction of ByT5's training data involves Lean tactics, and I expect knowing things like the capitals of the world's states/countries to be useless for Lean tactics]

For these questions, let's do them one by one.

  • Yes, that's the ByT5 model I mean.

  • Yes, ByT5 is in that sense a normal LLM. The question is implicit I guess, but we'll get there in point 4 :slight_smile:

  • (This is not really a question either.)

  • Indeed, it carries facts as a side effect of the thing we care about: knowledge of the English language, how words are formed, and what they mean. Think of all the theorem names and tactics that behave roughly like English. That sounds a bit flippant, but algorithms need to be pedantic after all. When we finetune for Mathlib, we don't have to first teach the model English, saving compute. [And this is a huge tangent, but it might also save the creation of certain types of attention heads, see here]. Finally, why ByT5 and not another LLM? Because byte encoding allows us to nicely capture Lean symbols such as ∀, ∃, →, etc. (tiny demo below).
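
A tiny demo of that byte-level handling, using the public google/byt5-small tokenizer (Python, via Hugging Face transformers):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/byt5-small")
print(tok("∀ x, x = x").input_ids)
# ByT5 ids are just the UTF-8 bytes shifted by 3 (ids 0-2 are special tokens),
# so "∀" (bytes e2 88 80) becomes [229, 139, 131]: never out-of-vocabulary.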

Final note: I'm not sure how much world knowledge is still in the ByT5 model after finetuning. I suspect very little compared to the original ByT5. But this is beside the point, I guess.

TongKe Xue (Dec 01 2025 at 15:18):

Thank you for your detailed posts. I found them quite instructive. I want to focus on one thing in particular: how do we generate the tactics to send to Lean4 ?

I still find using LLMs for this a bit counter-intuitive (given how much structure there is in a Lean tactic expr: it's not only an AST like Lisp code, it's a typed AST, with dependent types, and we're throwing all that away and telling an LLM trained on reddit threads: hey, try to generate this for us.)

But to steelman the existing tech, based on your description, it goes something like this, right:

  1. ByT5 is trained on reddit posts, shopify pages, SEO spam, ..., but in doing so it learns certain primitive functions/procedures; this part is expensive, and we don't want to do it ourselves.
  2. During the finetune phase, we feed it so many Lean4 tactics / Springer math books that it's basically a lobotomy: it keeps the basic "primitive funcs/procedures" and has all the knowledge replaced with math / Lean4 tactics.
  3. This is the current best way to generate Lean4 tactics ?

I think the natural followup to this is:

  1. What exactly are the data we use to finetune ByT5 ?
  2. How do we get Lean4 training data? Is it via running all the existing Lean4 proofs and, after every tactic, recording the (hypotheses, goal) state along with the tactic applied ? (See my guess sketched right below.)
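
Concretely, I imagine each training example looks roughly like this (purely my guess at the shape; the field names are made up, not any actual schema):

# My guess at one (proof state, tactic) training pair; illustrative only.
example = {
    "state_before": "n : ℕ\n⊢ Nat.gcd n n = n",  # pretty-printed hypotheses + goal
    "tactic": "induction n",                      # the tactic a human applied here
}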

Lastly, using the word "crutch" non-offensively here: is this whole "LLM generates tactic" approach a bit of a crutch, in that (1) we don't have anything better, (2) we don't really know how to train this, so (3) we just throw away lots of structure, treat the hypotheses/goal/tactic as "merely strings", and feed them to the LLM, hoping for the best ?

Gerben Koopman (Dec 01 2025 at 17:06):

My pleasure. I'm glad you find it helpful :slight_smile:

As for types and ASTs: it's a bit complex, and there certainly have been smart people doing smart stuff, but I don't recall everything I've read. IIRC there was one paper that explicitly used all the Lean metadata that you get in the infoview (and maybe more, e.g. ASTs). I'll post a link when I find/remember it.

Similarly, there are also other approaches that integrate certain priors into the model structure. E.g. Graph2Tac is a nice example, I think; give it a read for graph neural nets used for tactic prediction (in Coq rather than Lean, but the ideas carry over).

As for throwing away the things we learned on reddit etc.: that really is just junk to us. We only care about the embedding space that comes with the pretrained ByT5. When we finetune on Mathlib proofs, the LLM starts learning the tactic names.

As for an explicit example of how that might work, let's take ReProver (which I'm most familiar with). You pass it the theorem's full statement, up to the := by part, and it embeds it into a vector space via an encoder-style transformer. This vector is then used to rank all accessible premises (definitions, theorems, etc.) and retrieve the top-k of these. Next, we pass the theorem statement + these top-k retrieved premises to the encoder-decoder transformer to generate a tactic. If you are familiar with the term in-context learning, we are basically giving the encoder-decoder in-context info on the premises (anyone please correct me if this intuition is wrong). This model generates our full tactic to apply.
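
A rough, untested sketch of the retrieval half (the checkpoint name and the mean-pooling are from memory / my guess; see the ReProver repo for the real implementation):

import torch
from transformers import AutoTokenizer, T5EncoderModel

name = "kaiyuy/leandojo-lean4-retriever-byt5-small"  # checkpoint name from memory
tok = AutoTokenizer.from_pretrained(name)
enc = T5EncoderModel.from_pretrained(name)

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    hidden = enc(**tok(text, return_tensors="pt")).last_hidden_state
    return hidden.mean(dim=1).squeeze(0)  # mean-pool the token embeddings (my guess)

goal = "theorem gcd_self (n : Nat) : Nat.gcd n n = n"
premises = [  # a toy two-premise corpus
    "Nat.mod_self : ∀ (n : Nat), n % n = 0",
    "Nat.gcd_comm : ∀ (m n : Nat), Nat.gcd m n = Nat.gcd n m",
]
scores = torch.stack([embed(p) for p in premises]) @ embed(goal)  # dot products
top = [premises[int(i)] for i in scores.topk(k=1).indices]
# goal + the retrieved premises then go to the encoder-decoder tactic generator.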

To answer your numbered questions:

  1. Yes, it is trained on whatever, and we piggyback on that embedded language understanding for our own purposes. When choosing between pre-trained and random initialization, ByT5 sounds pretty good :slight_smile: (to be clear, models such as AlphaProof presumably train from scratch, but then everything is in-house and purpose-built. Different team sizes, different solutions.)
  2. We usually feed it Lean proofs (+ potentially synthetic data), as Springer textbooks are almost never formal. Though I can perhaps be proven wrong on this; maybe there are smart tricks for using non-formal proofs. If you have an auto-formalizer, then things really kick off for data generation and training. I don't have any experience there yet though. As for knowledge: yeah. I know I mentioned knowledge/memory earlier, but I should really be careful with those terms for LLMs. It's very much the gray matter of any transformer, and trying to understand it by anthropomorphizing it [I just realized why the company is called that! hahah] can be counterproductive. There is plenty of interesting research into the memory of LLMs though, and the difference between reproduction and originality.
  3. I dare not say. All I can say is: for AI, scale = good and freedom = good (in the sense of freedom for the model to learn arbitrary connections), but freedom requires scale. So in practice, if we want better models, transformers are the current way to go. Due to the data restrictions in this field, though, some priors (such as premise retrieval, graph NNs, etc.) are beneficial. I guess "best" needs to be defined more precisely anyway.

Sorry, the post is getting long. Let me re-read and I'll answer the second part.

Gerben Koopman (Dec 01 2025 at 17:15):

For your other two questions, I would just recommend reading either the paper or the website of ReProver. They explain how they trained it, and precisely on what.

TongKe Xue (Dec 01 2025 at 17:17):

Gerben Koopman said:

As for an explicit example of how that might work, let's take ReProver (which I'm most familiar with). You pass it the theorem's full statement, up to the := by part, and it embeds it into a vector space via an encoder-style transformer. This vector is then used to rank all accessible premises (definitions, theorems, etc.) and retrieve the top-k of these. Next, we pass the theorem statement + these top-k retrieved premises to the encoder-decoder transformer to generate a tactic. If you are familiar with the term in-context learning, we are basically giving the encoder-decoder in-context info on the premises (anyone please correct me if this intuition is wrong). This model generates our full tactic to apply.

  1. This is where the "Retrieval-Augmented" part of "Retrieval-Augmented Theorem Provers for Lean" comes from ?

  2. Is this literal Euclidean / cosine distance, or is there something smarter going on? I 100% believe your claim that this is how ReProver works, but the fact that

You pass it the theorem's full statement, up to the := by part, and it embeds it into a vector space via an encoder-style transformer. This vector is then used to rank all accessible premises (definitions, theorems, etc.) and retrieve the top-k of these. Next, we pass the theorem statement + these top-k retrieved premises to the encoder-decoder transformer to generate a tactic.

works at all is astonishing to me (because of how "lossy / imprecise / nothing like a compiler stage" each step of this is).

TongKe Xue (Dec 01 2025 at 17:18):

Gerben Koopman said:

For your other two questions, I would just recommend reading either the paper or the website of ReProver. They explain how they trained it, and precisely on what.

Will do, thanks for your explanations. I think I now have enough "high-level context" for how this is meant to work. Thanks again!

Gerben Koopman (Dec 01 2025 at 17:20):

TongKe Xue said:

Lastly, using the word "crutch" non-offensively here: is this whole "LLM generates tactic" approach a bit of a crutch, in that (1) we don't have anything better, (2) we don't really know how to train this, so (3) we just throw away lots of structure, treat the hypotheses/goal/tactic as "merely strings", and feed them to the LLM, hoping for the best ?

The last part is a bit hard for me to judge, but I think we shouldn't discount the value of a good LLM. After all, much of the inspiration for using LLMs is that, in practice, they are pretty good guiding agents. E.g. ask one to prove an undergraduate result and it can probably give you a reasonable idea. Maybe there are better ways, but hence the research.
And I do seem to remember some paper explicitly using all these rich signals, but I simply can't think of the source right now.

Gerben Koopman (Dec 01 2025 at 17:23):

Basically yes to both the retrieval-augmented part and the distance (not sure of the metric right now, but probably dot product). As for lossy: I guess so. Interactive self-play kind of lets the agent explore these structures itself though, like I mentioned for MCTS. But this can also be done with PPO, for example.

Deleted User 968128 (Dec 01 2025 at 18:54):

If you're not already following it, check this out. The progress is astounding, 11 days into a 4-month competition with a large prize (for Kaggle), though they may have made the questions too easy.

#Machine Learning for Theorem Proving > Kaggle, 38/50 already via open models, $2M prize fund

TongKe Xue (Dec 01 2025 at 21:24):

@Gerben Koopman :

Two followups:

  1. I could not find a separate ReProver paper. Is https://arxiv.org/pdf/2306.15626 (the LeanDojo paper; section 5) the best description of ReProver ?

  2. Is there a benchmark somewhere of ReProver vs. any of the IMO gold-medal systems (unreleased OpenAI, Gemini, Harmonic, the recent DeepSeek v2)? I'm curious how this compares to SOTA. The LeanDojo paper makes a comparison vs. GPT-4, but that is a few generations behind SOTA.

Thanks!

Gerben Koopman (Dec 02 2025 at 07:06):

Yeah, sorry that wasn't entirely clear. That is indeed the right paper; LeanDojo is the gym-like environment for ReProver. As for the benchmark: not that I'm aware of. And I don't think ReProver compares to SOTA anymore.

TongKe Xue (Dec 02 2025 at 08:34):

This was a very instructive discussion (measured by the number of my technical beliefs you changed). Thank you for your time. Cheers!


Last updated: Dec 20 2025 at 21:32 UTC