Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Monte Carlo tree search
(deleted) (Sep 16 2025 at 05:27):
I know that in textbook MCTS the rollout stage involves playing the game randomly and checking whether it results in a win, a loss, or a draw. I don't quite understand how this can be applied in a theorem-proving context, because a random rollout is far more likely to lose than to win, and the proof can also be lengthened indefinitely with no end in sight.
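For reference, the textbook rollout I have in mind is roughly this sketch (the state interface here is just a placeholder, not any real game or prover API):
```python
import random

def random_rollout(state):
    """Textbook MCTS rollout: play random legal moves until the game ends,
    then report the outcome. `is_terminal`, `legal_moves`, `apply`, and
    `result` are placeholders for whatever game interface is available."""
    while not state.is_terminal():
        state = state.apply(random.choice(state.legal_moves()))
    return state.result()  # e.g. +1 win, 0 draw, -1 loss
```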
(deleted) (Sep 16 2025 at 05:30):
My imagination tells me that in the context of Lean, moves can't easily be enumerated, so an LLM is responsible for generating possible moves up to some limit. And because intermediate game states have to be preserved, virtual machine snapshotting is used.
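Something like this sketch is what I imagine for the expansion step; `suggest_tactics` and `try_tactic` are made-up stand-ins for the LLM and the prover interface, not any real API:
```python
from dataclasses import dataclass, field

@dataclass
class Node:
    state: object                 # a saved proof state (e.g. a snapshot)
    action: str | None = None     # the tactic that produced this state
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0

def expand(node, suggest_tactics, max_candidates=8):
    """Ask the LLM for candidate tactics (since moves can't be enumerated)
    and keep one child per tactic that the prover actually accepts."""
    for tactic in suggest_tactics(node.state, max_candidates):
        new_state = node.state.try_tactic(tactic)  # None if the tactic fails
        if new_state is not None:
            node.children.append(Node(state=new_state, action=tactic, parent=node))
```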
(deleted) (Sep 16 2025 at 05:32):
This is why there is Infinibranch
(deleted) (Sep 16 2025 at 05:33):
Don't worry Morph, what I say is based on publicly available information; I'm not divulging your secrets.
(deleted) (Sep 16 2025 at 05:34):
A paper on automated theorem proving tells me that in the rollout stage, as random play is not possible, we need to evaluate the move in a different way.
Kelly Davis (Sep 16 2025 at 05:48):
Huỳnh Trần Khanh said:
A paper on automated theorem proving tells me that in the rollout stage, as random play is not possible, we need to evaluate the move in a different way.
An RL policy does this. The policy selects "good" moves in the case of games and "good" next steps in a proof in the case of Lean.
The literature has many articles that take this path. See, for example, Section 3.4 in A Survey on Deep Learning for Theorem Proving for a starting point on the literature.
PS: I don't think you mean "evaluate" as traditionally that's the role of a value function or multiple runs in the case of GRPO, Dr GRPO, GSPO, or any other GRPO variant.
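For concreteness, here is a sketch of how the policy's output typically enters the search, AlphaZero-style: each child carries a prior from the policy, and selection trades that prior off against the running value estimate. The field names are assumptions, not from any particular codebase:
```python
import math

def puct_select(node, c_puct=1.5):
    """Pick the child maximising Q + U, where Q is the running mean value and
    U is an exploration bonus weighted by the policy's prior probability.
    Children are assumed to carry `prior`, `visits`, and `value_sum` fields."""
    def score(child):
        q = child.value_sum / child.visits if child.visits else 0.0
        u = c_puct * child.prior * math.sqrt(node.visits) / (1 + child.visits)
        return q + u
    return max(node.children, key=score)
```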
Rémy Degenne (Sep 16 2025 at 05:55):
Most applications of MCTS replace the random rollouts with another heuristic. Typically, some form of neural network is trained to predict the value of each state. The AlphaZero paper explains the idea quite well.
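A sketch of that replacement (continuing the node fields from above); `value_net` and `is_solved` are stand-ins for whatever trained model and prover check you actually have:
```python
def evaluate_and_backup(leaf, value_net):
    """AlphaZero-style leaf evaluation: instead of a random playout, a trained
    network estimates how promising the proof state is (say, the probability
    of eventually closing the goal), and the estimate is backed up to the root."""
    v = 1.0 if leaf.state.is_solved() else value_net(leaf.state)  # value in [0, 1]
    node = leaf
    while node is not None:
        node.visits += 1
        node.value_sum += v
        node = node.parent
    return v
```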
Kelly Davis (Sep 16 2025 at 06:41):
Oh, I see, I misunderstood your question. You really did mean "evaluate". In that case, my PS gives the answer, i.e. a value function or multiple runs in the case of GRPO, Dr GRPO, GSPO, or any other GRPO variant.
Wang Jingting (Sep 17 2025 at 01:53):
Huỳnh Trần Khanh said:
I know that in textbook MCTS the rollout stage involves playing the game randomly and checking whether it results in a win, a loss, or a draw. I don't quite understand how this can be applied in a theorem-proving context, because a random rollout is far more likely to lose than to win, and the proof can also be lengthened indefinitely with no end in sight.
Maybe that's why you want some version of curriculum learning? The model might be able to get some of the easiest problems by luck, gradually learn how to solve them during training, and then proceed to harder ones.
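A minimal sketch of the idea; `estimated_difficulty` is a hypothetical proxy (e.g. the model's current solve rate under a fixed sampling budget), not a fixed metric from any paper:
```python
def curriculum(problems, estimated_difficulty):
    """Order training problems from easiest to hardest, so early training sees
    problems the model can occasionally solve by luck."""
    return sorted(problems, key=estimated_difficulty)
```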