Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Reinforcement learning for ATP

Ivan Bolotnikov (Jan 10 2021 at 02:53):

Hello! As far as I understand, the currently dominating idea in this community regarding building ML models for ATP (and for IMO grand challenge in particular) is to expand the Mathlib library with new handwritten proofs and then to use them for training models in a supervised manner. Meanwhile, I'm curious whether the reinforcement learning (RL) approach could be more promising for ATP. I mean, it seems to be a very natural idea to run the following open ended two "component" procedure: the first component generates theorems starting from simple ones and proceeding with more and more complex; and the second component tries to prove these theorems continuously retraining on self-generated proofs. Moreover, it might be possible to expand a pool of available tactics in runtime with incrementally more complex tactics by learning to automatically generate tactics from currently generated proofs.
Actually, I'm just interested in opinion of community members regarding the prospects of the above ideas.

Zygimantas Straznickas (Jan 10 2021 at 05:47):

This is actually something I'm trying to work on! As to why the RL approach isn't more popular, my impression is that it's (maybe correctly) perceived to be very very hard. Not that the supervised methods are easy -- but at least there are good pre-existing ideas to build on top of, "proofs of existence" that show achieving decent results is possible, and infrastructure+datasets to train the models. Whereas for RL, it seems that even state-of-the-art foundational RL techniques aren't yet good enough for this problem. I actually know two experienced researchers who tried to do RL for ATP, and both quickly gave up and refocused on other areas of RL. So it might just be that we need to wait for more RL knowledge to build up :) .

One interesting challenge for the adversarial RL setup you describe is, how do we avoid an equilibrium of boring-complicated problems? Mathematicians can easily feel the difference between the Riemann Hypothesis and a random 5000-clause boolean satisfiability problem. But a sufficiently big SAT problem can be harder to solve than any currently-open math problem. How do we stop the pair of agents from just getting better and better at (solving/generating) random boring SAT problems, instead of "interesting", math-y problems? (btw, afaik this problem is still hypothetical -- we're far away from caring about agents learning the wrong math, we still can't get them to learn any :) )

Stanislas Polu (Jan 10 2021 at 12:34):

Add to that the fact that the action space is enumerable. Most vanilla methods such as UCT+MCTS won't work well in that setup which requires more research in RL. Note that simple expert iteration which is easily applicable here has been demonstrated to work and count as RL as well :)

Vladislav Bezhentsev (Jan 10 2021 at 16:30):

One interesting challenge for the adversarial RL setup you describe is, how do we avoid an equilibrium of boring-complicated problems?

I like this point. Here are some thoughts on this matter:

Though it's indeed not obvious how to generate meaningful theorems in general, we can try to solve this problem in some limited setting. For example, we can consider only combinatorics. Then we can automatically design increasingly complex constructions consisting only of finite sets (including mappings and other relations on them) and ask a predefined set of questions about them, like "count number of elements in <construction>", "find a minimal / maximal (in some sense) construction in a given family", "how big must some substructure be to guarantee that a particular property holds? (citation from wiki on Ramsey theory)". Of course, we will still get a lot of "strange" theorems from a human point of view, but at least 1) all of them will indeed have some combinatorics essence (by construction) and 2) a lot of freshman level, math competition level and even research level combinatorics problems seems to be possible to generate in such a simple framework.
Let's imagine that we somehow were able to formalize statements of the majority of currently existing interesting / important theorems and conjectures. Let's call them "reference theorems". But then, back to the adversarial setup, we can try to generate only those "artificial" theorems which in some sense are similar to "reference theorems". That is we can try to use "references" as anchor points which hinder us from from trying to prove some weird stuff and sticks us to useful parts of mathematics. In such setting we can even transfer the concept of usefulness on the artificially generated theorems. That is, useful theorems are those which frequently occur as lemmas in the proofs of "reference theorems".
We can go a little bit further and train a model on the formalized "reference theorems" to generate new theorems in the same vein.

All in all, the 2. and 3. points are intended to say that it might be possible to gain some major benefits from only formalizing the statements of theorems without formalizing their proofs.

mikey liu (Apr 19 2021 at 01:23):

i think some sort of model-based RL is needed. dreamcoder, a paper i suggested in another thread, is similar to muzero at the top level, for example. but i think we need all the tricks in the bags (nlp, type theory, etc.) to "actually" do math.

Jason Rute (Apr 19 2021 at 13:02):

Ivan Bolotnikov said:

Hello! As far as I understand, the currently dominating idea in this community regarding building ML models for ATP (and for IMO grand challenge in particular) is to expand the Mathlib library with new handwritten proofs and then to use them for training models in a supervised manner.

I'm 3 months too late to this conversation, so you may already know this now, but for the IMO grand challenge specifically, there is only one team working on that, lead by @Daniel Selsam and my understanding (which he is free to correct) is that this is decidedly not the approach his team is taking. In particular, what I can piece together is that they are working on basically a big super IMO tactic which will solve most (all?) IMO problems. What they have realized is that to solve a problem, like prove an inequality, there is a general algorithm one follows, but also one has to make a bunch of choices, such as which of the X number of common inequality lemmas to appl. Similarly, in geometry problems, one has to make a number of choices when constructing geometry diagrams, which if not done well will lead to combinatorial explosion. Their tactic framework is specifically designed around these choices. One way to hand these choice points is to try them all in some sort of search algorithm, and that is what a lot of the stuff they have presented so far does. Another is to train a NN to make the choice, possibly using a universal choice-making NN. See the following resources for more on Daniel and team's ideas:

Daniel's AITP Talk where he talks about all this stuff.
@Jesse Michael Han's blog post which came out at the same time and goes into more details about this SearchT tactic framework.
Two papers on solving geometry problems here and here. I have not yet read either closely, and the second I only discovered today.
A paper on a universal oracle for making choices in software. See the discussion at #Machine Learning for Theorem Proving > universal oracle including a summary by me.

Jason Rute (Apr 19 2021 at 15:24):

I may have also given the idea that Daniel's IMO approach is something like "hard coding" solution heuristics. While I think that is somewhat true, there does seem to be something very general in their approach that I don't yet understand. For example, they can add more "abstractions" into the mix to solve more problems with little or no change to the overall algorithm. There is a discussion in #Machine Learning for Theorem Proving > dreamcoder that provides some more details.

Jason Rute (Apr 19 2021 at 18:31):

Now, as for general automation for interactive theorem proving, I think reinforcement learning is actually quite common. But unlike supervised learning, it requires more work to set up. Also, reinforcement learning is a broad concept, so what you have in mind for reinforcement learning is not what possibly others have in mind. Most current work on RL in theorem proving seems to follow the following pattern which is not unlike AlphaGo:
(0) (Optional) Bootstrap your prover by supervised learning on human proofs or simple heuristics (like TFIDF for premise selection).
(1) Loop over all training theorems (not the proofs, but the theorem statements) and use your learned theorem prover (with tree search) to find proofs.
(2) Save the proofs as training data and use that to refine your prover.
(3) Loop starting at step (1) until convergence or timeout.

Jason Rute (Apr 19 2021 at 18:32):

Papers which do this sort of RL loop include:

Reinforcement Learning for Theorem Proving a tableau-based FOL prover for Mizar statements using XGBoost. I think the final result can solve about half of the theorems in the library (both testing and training) with no proof supervision. (They also have a followup paper just for Robinson Arithmetic which is more of toy example demonstrating the power of curriculum learning in this area.)
Learning Heuristics for Quantified Boolean Formulas through Deep Reinforcement Learning
Learning to Reason in Large Theories without Imitation is one of the HOList project papers. With out using human proofs, I think they can solve about 56% of test theorems (and more when starting with human proofs and then improving via RL).
Generative Language Modeling for Automated Theorem Proving for MetaMath. The prover starts off with supervised learning (in a language model), but then improves via Expert Iteration (which is basically the algorithm above, even though they don't call it RL).

Jason Rute (Apr 19 2021 at 18:32):

I personally feel that the RL ideas are still in the AlphaGoZero era and can be improved a lot. There is no uses of competing populations of agents, no adversarial games, no learned world models (although Mathematical Reasoning in Latent Space starts in this direction), no hierarchical RL, no novelty search, no learned curriculum. Also, specific to theorem proving, they only explore a very narrow part of the mathematical landscape, namely proofs of theorems already in the library. There has not been a lot of work done on conjecturing new theorems which I think is what you mean by RL?

Jason Rute (Apr 19 2021 at 18:32):

One very small step in that direction might be counterexample exploration. In tactic theorem proving, when an agent reaches a goal, it could try to prove that goal true and simultaneously try to construct a counterexample. This could be a way to instill the agent with concrete knowledge about specific numbers, spaces, operations, etc.

Jason Rute (Apr 19 2021 at 18:33):

Also, it shouldn't be underestimated how hard it is to get into neural theorem proving. I've seen researchers asking about hooking up their RL algorithm to HOList only to discover that the HOList action space is very different from anything they have seen before. It is easy to come up with ideas, but hard to test them.

Joe Palermo (Apr 19 2021 at 19:46):

I'll share some thoughts I've had about the problem of how to use RL to generate new theorems to help train a neural prover (fyi I've shared similar thoughts previously in the dreamcoder thread).

I think being able to generate new theorems is critical to enable open-ended improvement in neural provers. I also think it may be possible to get a feedback loop working between a neural generator and a neural prover (basically a form of asymmetric self-play, unlike the symmetric forms of self-play seen in systems like AlphaZero/MuZero). One of the main problems is how you optimize the theorem generator. A purely adversarial objective (as in GANs) is probably not going to work. One idea is to optimize the generator to produce theorems that are near the limit of the prover's capability. Since the prover uses tree search (let's assume a gpt-f like model), you can use the branch-factor of the prover's search tree as a metric for how hard the proof was. If the prover fails completely there's nothing you can say, except that it was too hard. But if it succeeds, then branch factor is a measure of how much search was required. An easy proof requires little search (branch factor could be as low as 1, if the search is linear though in practice I suppose this might never happen). The prover's search times out after a certain number of steps, so one could optimize the generator (via RL) to produce theorems that when the prover tries to prove them result in a large branch factor but are still provable within the timeout limit.

I think the approach I sketched is very limited and could be improved in various ways. One thing not addressed is how to bias the generator in the direction of generating "interesting" theorems. It's easy to imagine a failure mode like Vladislav described above in which it generates strange theorems. Arguably, we might still learn interesting things from the dynamics of such a system, and it could serve as a basis for experiments that optimize the generator for different objectives. As Jason noted, more experimentation and new ideas are definitely needed.

I've been working on the problem of how one could generate theorems within Lean, and happy to go into more details with anyone who's curious.

Johan Commelin (Apr 19 2021 at 19:50):

If the generator at the same time tries to minimize the pretty-printed length of the theorem, that might be one constraint pushing it towards "interesting" theorems.

Johan Commelin (Apr 19 2021 at 19:51):

Otherwise I fear you'll quickly end up with 5-kilobyte long trivialities about intersections and unions of subsets.

Jason Rute (Apr 19 2021 at 20:09):

Here is an interesting paper on asymmetric self play in RL. There are three agents. Agent C creates problems which are hard for agent A but easy for agent B. Agents A and B both try to solve the problems and learn from them. This keeps the problems just on the doable side since ideally only one of the two agents can solve them. https://ai.googleblog.com/2021/03/paired-new-multi-agent-approach-for.html?m=1

Jason Rute (Apr 19 2021 at 20:10):

But @Johan Commelin may be correct and other constraints would be needed to both (1) make the problems useful and (2) make the problems explore all the math topics in the library.

Jason Rute (Apr 19 2021 at 20:12):

Maybe the generator agent needs to make problems which look like theorems in the library and maybe it has to contain a randomly sampled declaration.

Jason Rute (Apr 19 2021 at 20:26):

Stanislas Polu said:

Add to that the fact that the action space is enumerable. Most vanilla methods such as UCT+MCTS won't work well in that setup which requires more research in RL. Note that simple expert iteration which is easily applicable here has been demonstrated to work and count as RL as well :)

David Silver's team at Deepmind just released a whole bunch of papers on arXiv about MCTS including one on this subject (which I haven't read yet): Learning and Planning in Complex Action Spaces.

Daniel Selsam (Apr 19 2021 at 21:58):

@Joe Palermo The dual of a nondeterministic tactic tac is a nondeterministic theorem generator gen such that tac proves all theorems generated by gen. For some tactics, it is trivial to deduce a deterministic generator that can enumerate the theorems that the tactic proves, but for others, the obvious generator is constrained, i.e. it cannot easily be represented as a deterministic program and so requires search to execute. A sweet-spot for RL-based data-augmentation may be to map a nondeterministic tactic, plus an input to its dual generator, to successful executions of the generator, i.e. to a theorem that the tactic proves. By coincidence, a successful episode will produce a supervised datapoint in exactly the form one would want in order to learn heuristics for nondeterministic tactics.

Jason Rute (Apr 19 2021 at 22:12):

@Daniel Selsam when you say nondeterministic tactic, you mean a tactic with choice points, right? So something like gptf which is stochastic, would it be clear what the dual is (maybe if the choice points are assumed to be every place where is chooses a tactic)? Also, I assume you are thinking more along the lines of something similar to your work? Do you have a simpler example of a nondeterministic tactic and dual for which this might work?

Jason Rute (Apr 19 2021 at 22:15):

Also, what is the input to a dual generator? Is this like some index of the nth output, or a source or randomness?

Jason Rute (Apr 19 2021 at 22:17):

Or the path through the choice points?

Daniel Selsam (Apr 19 2021 at 22:27):

@Jason Rute Here are some notes I took ~2 years ago introducing the primal-vs-dual distinction in the context of inequalities doc1_primal_trick_vs_dual_trick.html (HTML with MathJax)

Jason Rute (Apr 19 2021 at 23:05):

Ok, I think I understand a little better. For the first unconstrained dual (the bad one if you will), the goal would be to find examples of f, h, and g satisfying those various constraints. The generated functions would be used to generate an example of a problem solvable by that specific trick (the primary tactic). I can see how that would be an ML problem to find such f, g and h. It is much more targeted, of course, than what any of us had in mind since it I think it requires building generator duals for all or many of the nondeterministic tactics in some framework of non-deterministic tactics (which of course is still a work in progress on your part :smile:).

Jason Rute (Apr 19 2021 at 23:06):

I have to admit I'm very interested in how well your ideas are going to play out @Daniel Selsam. They seem very foreign to me. On one hand, it seems like hand-coding heuristics for solving problems which seems very slow and arduous. On the other hand, you do hint at a great level of generality which I don't fully grasp yet. If, for example, your approach both solves a large set of interesting problems (e.g. IMO grand challenge) and it can quickly be adapted to other problems, then I will definitely take note.

Jason Rute (Apr 19 2021 at 23:08):

(If it solves the IMO grand challenge, but doesn't easily generalize to other problems, then it seems like a lot of work to hand code all of the tricks used in the IMO grand challenge, like Mathematica hand coded all sorts of integration tricks to solve integrals.)

Joe Palermo (Apr 19 2021 at 23:16):

Johan Commelin said:

If the generator at the same time tries to minimize the pretty-printed length of the theorem, that might be one constraint pushing it towards "interesting" theorems.

@Johan Commelin I agree, minimizing length seems like a useful constraint.

Johan Commelin said:

Otherwise I fear you'll quickly end up with 5-kilobyte long trivialities about intersections and unions of subsets.

My hope is that if the generator repeatedly produced trivial variations on the same problems then the prover might achieve sufficient generalization within that problem class to push the branch factor of search low enough that the generator would be forced to try something new (since in that case its reward within that problem class would be declining). Admittedly this requires that the prover be very sample efficient and is probably asking too much.

Joe Palermo (Apr 19 2021 at 23:35):

@Daniel Selsam Thanks for sharing your thought, also thanks @Jason Rute for your questions (I had the same ones) and the breakdown above. I suppose one could attempt to find f, h, and g via program synthesis guided by learned models.

Joe Palermo (Apr 19 2021 at 23:42):

Admittedly I'm new to Lean and still learning the basics. Are there many nondeterministic tactics currently in Lean? I assume any tactic that requires search to execute can be considered nondeterministic, e.g. tidy or gptf.

Jason Rute (Apr 20 2021 at 00:33):

Nondeterministic in this setting doesn’t mean stochastic (or dependent on the machine, etc.). (I think it comes from the fact that the list monad is called the nondeterministic monad since you can think of a list as the result of multiple paths being computed at the same time.). Anyway, no tactics in current lean are like this. This is a theoretical framework that Daniel is implementing as part of his project to solve the IMO grand challenge. They are like tactics but have choice points in them. You can find out more with some of the resources I put above.

Daniel Selsam (Apr 20 2021 at 00:33):

Jason Rute said:

I think it requires building generator duals for all or many of the nondeterministic tactics in some framework of non-deterministic tactics

The "grail" of this direction would be to automatically synthesize more general nondeterministic tactics from regular tactic proofs, to automatically synthesize diverse duals from nondeterministic tactics, and then to use ML/RL to learn to execute nondeterministic tactics effectively. But I think many of the most critical tactics---especially the ones that must be explicitly taught to humans---will be relatively easy to implement and yet basically impossible to synthesize, so we should be pragmatic and write them.

Daniel Selsam (Apr 20 2021 at 00:45):

Jason Rute said:

I have to admit I'm very interested in how well your ideas are going to play out Daniel Selsam. They seem very foreign to me. On one hand, it seems like hand-coding heuristics for solving problems which seems very slow and arduous. On the other hand, you do hint at a great level of generality which I don't fully grasp yet. If, for example, your approach both solves a large set of interesting problems (e.g. IMO grand challenge) and it can quickly be adapted to other problems, then I will definitely take note.

One way to phrase my goal regarding generality is that I hope the domain-specific part need not be globally coherent. It is hard to write complex software, but it can be easy to write down a bunch of tricks. Ideally different people can add tricks monotonically without much coordination, and the general-purpose "meta-AI" can "denoise" this web of tricks.

Aside: I thought of calling this "strategy programming" at somepoint, as in a higher-order version of Data Programming, but the name seemed a little too vague.

Johan Commelin (Apr 20 2021 at 05:52):

Joe Palermo said:

Johan Commelin said:

Otherwise I fear you'll quickly end up with 5-kilobyte long trivialities about intersections and unions of subsets.

My hope is that if the generator repeatedly produced trivial variations on the same problems then the prover might achieve sufficient generalization within that problem class to push the branch factor of search low enough that the generator would be forced to try something new (since in that case its reward within that problem class would be declining). Admittedly this requires that the prover be very sample efficient and is probably asking too much.

But I don't see why a lazy generator can't just decide to return a theorem of length $$e^{e^{e^x}}}$$ (add more exponentials if needed) to guarantee that even a reasonably smart prover will have a branch factor $> x$ on such a theorem. But I should also note that I have no idea what I'm talking about. I've never done anything in ML beyond an assignment in a class 10 years ago.

Scott Morrison (Apr 20 2021 at 06:46):

A basic problem for an interesting generator is how to steer it away from "produce a preimage for this one-way function" style problems, which I fear are rather dense in the space of all propositions.

Joe Palermo (Apr 20 2021 at 15:16):

@Johan Commelin That's a valid point. I think it would be important to cap proof length. Hopefully within a given max proof length, a prover can generalize enough to cover trivial variations on some proof types. I wouldn't be surprised to find that some problem classes break this paradigm though

Johan Commelin (Apr 20 2021 at 18:25):

@Joe Palermo I don't know about capping proof length. I was suggesting capping statement length.

Johan Commelin (Apr 20 2021 at 18:25):

Fermat's Last Theorem has a short statement and a loooooooooong proof.

Oliver Nash (Apr 20 2021 at 19:03):

Too long to fit in a margin anyway

Mario Carneiro (Apr 20 2021 at 19:51):

But I don't see why a lazy generator can't just decide to return a theorem of length $e^{e^{e^x}}$ (add more exponentials if needed)

If you put too many exponentials in, won't the generator have trouble returning it? Just writing that statement will take so long that the human operator will just have to step in and kill the generator

Shariq Hashme (May 06 2021 at 03:48):

Hey all, love all the discussions about theorem proving RL. ~4 years ago I worked at OpenAI and was very interested in a similar approach which I shortly discussed with Christian Szegedy at the time. I don't have the time to pursue it myself, but I wrote down the critical parts of the approach which I believe has a chance of achieving "theorem proving escape velocity": a theorem proving system which came up with its own interesting theorems and proved them (at scale, perhaps enough to prove any human math as well). If anyone finds it relatively novel and uses these ideas I'd really appreciate the acknowledgment, since I spent probably around 2 years (maybe more) thinking about this :stuck_out_tongue_closed_eyes: I have a number of thoughts I didn't include, e.g, around how it could be implemented, so if you're interested please email me.

Here's the write-up: https://docs.google.com/document/d/1bqJcWdYfl460Y_TU48f1esOkyHgNkyvbA7leVO-YJ3g/edit?usp=drivesdk

Jason Rute (May 06 2021 at 13:41):

@Shariq Hashme Thanks for the write up! I think there are a lot of great ideas floating around now about treating math as a multiagent game. I've also heard Szegedy talk about having a population of provers (I don't know if he gave any specific details about how this would work). I know David McCallister has also been thinking a lot about the "game of mathematics" and capturing the idea of "interesting" mathematics. I also saw a paper by one of the GamePAD authors with a proposal a number of years ago. (I also have some of my own ideas which I need to write down.) It would probably be helpful for us to find and organize these various proposals.

Jason Rute (May 06 2021 at 13:41):

I also hope researchers start working on this. I imagine smaller logical systems could be a good testing ground, but at the same time it might be easier to just start with a large amount of built-in mathematics to get the ball rolling. Of course, I think it goes without saying that the credit should probably go mostly to those who can get something like this working. I imagine there are a number of important design choices that need to be made to get such a system to learn and avoid collapse.

Jason Rute (May 06 2021 at 22:32):

I think your compression idea @Shariq Hashme is very similar to ideas in the dream coder paper. I want to spend some time fleshing that out and how it relates to theorem proving. For example, I think something like descriptive complexity (calculated as the negative log of the probability your model assigns to the proof) is probably a better notion of proof size.

Last updated: May 02 2025 at 03:31 UTC