Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Reinforcement learning for solving a single theorem


Jason Rute (Jul 28 2024 at 12:06):

I’ve been thinking more about DeepMind’s use of reinforcement learning at inference time in AlphaProof. While none of us know the details yet, it seems to me that this is likely a really powerful ingredient in their approach.

Jason Rute (Jul 28 2024 at 12:07):

One way to think about this is that independent searches (while providing needed diversity) are a really wasteful use of resources. For example, a number of recent papers just use an LLM to independently sample proofs; no information passes from one proof attempt to the next. Tree searches are a bit more efficient, since attempts share initial segments of a proof (and are even more efficient if they properly handle independent subgoals, as done in HOList, HTPS, and Tactician/Graph2Tac), but different paths of the search tree still don’t share information. If one branch struggles on a particular type of subgoal, another branch may also struggle on that same type of goal. (Caching helps here, but doesn’t work if the goals are similar but not equivalent.) Now, with tree search we know it is better to run multiple tree searches instead of one giant tree search, so diversity still helps, but these searches are always run independently, so the trees don’t share information and have to redo search on the same kinds of goals. Finally, we also know that it is often better to run two different methods for X time than the best single method for 2X time. But again, when this is done, neither method shares information with the other.
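
To make the caching point concrete, here is a minimal Python sketch (not from any of the systems named above) of a depth-limited backward search that shares work across branches through a table of already-solved goals; `suggest_tactics` and `apply_tactic` are hypothetical stand-ins for a learned policy and a proof-assistant backend.

```python
def prove(goal, suggest_tactics, apply_tactic, depth, cache):
    """Depth-limited backward search with a cache of solved goals, so a
    subgoal closed in one branch is never re-searched in another.

    `suggest_tactics(goal)` yields candidate tactics for a goal, and
    `apply_tactic(goal, tactic)` returns the remaining subgoals (or None if
    the tactic fails). Both are hypothetical stand-ins, not a real API.
    """
    if goal in cache:                 # already solved, possibly in another branch
        return cache[goal]
    if depth == 0:
        return None
    for tactic in suggest_tactics(goal):
        subgoals = apply_tactic(goal, tactic)
        if subgoals is None:          # tactic did not apply
            continue
        subproofs = [prove(g, suggest_tactics, apply_tactic, depth - 1, cache)
                     for g in subgoals]
        if all(p is not None for p in subproofs):
            proof = (tactic, subproofs)
            cache[goal] = proof       # reusable by every other branch
            return proof
    return None
```

Note that the cache only fires on an exact goal match, which is exactly the limitation mentioned above for goals that are similar but not equivalent.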

Jason Rute (Jul 28 2024 at 12:07):

On the other hand, RL is a mechanism for sharing information and recording what we have learned so far about a proof. It prunes unpromising paths and reinforces promising ones. It also records information about which goals are solvable and how to solve them. (It might also break symmetries, such as preventing multiple proofs of the same thing, but I don’t know for sure.) And it still allows for the sort of diversity we know helps in theorem proving. For example, you can run many tree searches using different goals, different hyperparameters, different agents, etc., and RL can be the mechanism through which these various agents share the findings of their searches.
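
As one way to picture “sharing findings across searches”, here is a tiny, entirely hypothetical Python sketch: many otherwise-independent searches read from and write to one shared value estimate, so evidence about which kinds of goals are solvable is not thrown away between searches. The `featurize` function and the `run_one_tree_search` driver are invented placeholders; a real system would learn a neural policy/value rather than a lookup table.

```python
from collections import defaultdict

class SharedValue:
    """A shared, incrementally updated estimate of how promising a goal is,
    written to by every search and read by every later search."""

    def __init__(self, featurize, lr=0.1):
        self.featurize = featurize             # hypothetical goal featurizer
        self.lr = lr
        self.value = defaultdict(lambda: 0.5)  # neutral prior for unseen goal shapes

    def score(self, goal):
        # Used by a search to decide which open goal or branch to expand next.
        return self.value[self.featurize(goal)]

    def update(self, goal, solved):
        # Reinforce goal shapes that were closed, discount ones that dead-ended.
        key = self.featurize(goal)
        target = 1.0 if solved else 0.0
        self.value[key] += self.lr * (target - self.value[key])

# Hypothetical driver: searches with different seeds/agents all share `V`.
# V = SharedValue(featurize=lambda goal: goal.split()[0])
# for seed in range(32):
#     for goal, solved in run_one_tree_search(seed, priority=V.score):
#         V.update(goal, solved)
```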

Jason Rute (Jul 28 2024 at 12:07):

While RL is expensive, once we are talking about spending hours or days on a single problem, it seems like it is much more efficient than the current approaches, and I look forward to any ablations in the AlphaProof paper investigating this.

Jason Rute (Jul 28 2024 at 12:07):

While RL over a corpus of problems is common, I know of only one paper that does RL to solve a single theorem: TacticZero.

Tyler Josephson ⚛️ (Jul 29 2024 at 04:46):

In this interview, Francois Chollet highlights an approach by Jack Cole that trains an LLM at test time to solve problems in the Abstraction and Reasoning Corpus (ARC) benchmark:

https://www.dwarkeshpatel.com/p/francois-chollet

Tyler Josephson ⚛️ (Jul 29 2024 at 04:54):

I’m not sure how the machinery should work, but the concept is intuitively attractive to me.

How many problems do we wrestle with, and learn how to solve while we are trying to solve them? This is all ostensibly in short-term memory. We go to sleep, process our experience, and get better at solving tasks like them in the future. Reminds me a little of DreamCoder: https://arxiv.org/abs/2006.08381

Jason Rute (Jul 29 2024 at 05:13):

The test-time fine-tuning for ARC seems natural, since ARC is a meta-learning problem.

As for DreamCoder, it does a form of RL over a corpus of problems. RL over a corpus of theorems is not new at all and has been done in a number of works. But maybe DreamCoder is still applicable in the case of testing on one theorem, since DreamCoder also expands its corpus of problems by generating synthetic data during its “dream phase” (and this was before synthetic data was the buzzword it is today). I think what DeepMind is doing is similar, adding synthetic statements at test time, which they cryptically refer to as “self-generated variations of the contest problems”. That way, one likely has a mix of easy and hard problems to work through and learn from.
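
For concreteness, here is a minimal sketch of what a test-time loop over self-generated variants could look like, in the spirit of expert iteration. Every callable (`generate_variants`, `attempt_proof`, `finetune`) is a hypothetical placeholder; this is not a description of AlphaProof’s or DreamCoder’s actual pipeline.

```python
def test_time_rl(target, policy, generate_variants, attempt_proof, finetune, rounds=10):
    """Repeatedly attempt a single target statement, learning along the way
    from whichever self-generated variants the current policy can solve."""
    solved = []                                    # (variant, proof) pairs found so far
    for _ in range(rounds):
        proof = attempt_proof(policy, target)
        if proof is not None:
            return proof                           # original problem closed
        for variant in generate_variants(target):  # e.g. specializations or weakenings
            vproof = attempt_proof(policy, variant)
            if vproof is not None:
                solved.append((variant, vproof))
        if solved:
            policy = finetune(policy, solved)      # learn from the easier wins
    return None
```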

Kevin Lacker (Jul 29 2024 at 17:22):

I like this idea because it seems similar to the approach humans use. Oh, there's this complicated formula with an n. Can I solve it for n = 1, n = 2, n = 3? And then maybe that gives me some ideas for the general problem.

Jason Rute (Aug 03 2024 at 10:56):

I had another thought on test-time RL. In the DeepSeek-Prover paper it isn’t clear that their scraping of problems doesn’t already subsume all of MiniF2F, especially if they scraped AoPS and the MATH benchmark. If so, they would have seen all the test problems in natural-language form during reinforcement learning. So, if true, it might be that what we are measuring with the DeepSeek-Prover paper is not test-time inference but reinforcement-learning inference, since each problem was tried repeatedly during the RL phase. (This might be too cynical a take. It is hard to know without investigating the details, which I hope the authors do.)

Jason Rute (Aug 05 2024 at 14:30):

As for AlphaProof, I just found a podcast with Pushmeet Kohli of DeepMind about AlphaProof. (Here, starting at 27:15.) It is just a high-level interview for a New York Times podcast, but it seems to confirm some of the suspicions mentioned here. For example, he says it first tries to solve variants of the problem, like the statement restricted to even numbers, or to rational numbers, or to numbers below some bound. He also says that previous approaches couldn’t solve these problems even with months of computing, which again makes me think this form of RL-driven inference on variants of the problem is really the main value-add of their approach over previous approaches.
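
To picture the kind of variants Kohli describes, here is an invented toy example in Lean (the statements are illustrative only and have nothing to do with the actual IMO problems), with proofs left as `sorry` placeholders:

```lean
import Mathlib

-- A toy "target" statement.
theorem target (n : ℕ) : 2 ≤ (n + 1) * (n + 2) := by
  sorry

-- Variant restricted to even numbers.
theorem target_even (n : ℕ) (h : Even n) : 2 ≤ (n + 1) * (n + 2) := by
  sorry

-- Variant restricted to a bounded range.
theorem target_small (n : ℕ) (h : n < 10) : 2 ≤ (n + 1) * (n + 2) := by
  sorry
```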


Last updated: May 02 2025 at 03:31 UTC