# Zulip Chat Archive

## Stream: Machine Learning for Theorem Proving

### Topic: Subgoal Search For Complex Reasoning Tasks

#### Jason Rute (Nov 01 2021 at 00:40):

When I first started writing about ML for TP papers, there were just a few and it was easy to keep track of them, but now there are many more to keep tabs on. I’m thankful that @Stanislas Polu pointed me towards Subgoal Search For Complex Reasoning Tasks (website).

#### Jason Rute (Nov 01 2021 at 00:40):

When I first started to learn how to program, it was when I was a math postdoc. (Actually that is not entirely true, I learned QBasic in high school, and dabbled in other program languages for the next 15 years, but it wasn’t until my postdoc that I really made a commitment to know how to code well.) I learned Python via CodinGames.com which is a really fun way to learn to code. As I was moving up in the lessons, the hardest problems were always search problems. First they started off easy, like find the shortest path in a maze (breadth-first search) or a graph (Dijkstra’s algorithm). Later they started to become harder. I had to avoid visiting the same nodes. I had to automatically spot and prune dead-end paths, and have a good heuristic for what nodes to visit first. (If you think search is easy, try to think how to algorithmically solve a sudoku or a mind sweeper puzzle.)

#### Jason Rute (Nov 01 2021 at 00:40):

The problem with search is that is by its nature an exponential problem. You are exploring a tree or graph of nodes. The only way to make it tractable is to find the right way to divide up your search space.

#### Jason Rute (Nov 01 2021 at 00:40):

Now, let’s think of a puzzle like the Rubik’s cube or Sokoban. These puzzles are much harder than on a typical programming practice problem. Can we come up with a general purpose search algorithm for solving such problems? By general purpose, I mean:

- We don’t just put in a task-specific algorithm like a known solution for the Rubik’s cube.
- However, because these problems are hard, we are allowed to train an algorithm to learn how to solve a class of problems over the course of hours or days (but only using easy-to-generate training data, again not bringing in fancy-pants problem-specific knowledge).

#### Jason Rute (Nov 01 2021 at 00:40):

Of course, I’m not as interested in the Rubik’s cube as I am in theorem proving, which is arguably the ultimate search problem. Nonetheless, the Rubik’s cube is a great playground for search. Indeed, one of my first deep learning projects was training a neural network to solve the Rubik’s cube using only reinforcement learning via the same algorithm as AlphaZero. It worked, but not spectacularly well. It could only find solutions if they were relatively short. Less than a year later, however another team did solve the problem I was working on. Their big idea was that you can work backward, starting at the goal state to generate high quality training data.

#### Jason Rute (Nov 01 2021 at 00:40):

However, their idea wouldn’t work well in theorem proving, since you can’t generate training data by working backwards from the goal. Many tactics and logical axioms are not uniquely reversible. (This is what makes it difficult, for example, to generate synthetic proofs.)

#### Jason Rute (Nov 01 2021 at 00:40):

Tools, like GPT-f, TacticToe, DeepHOL, Tactician, ProverBot9001 and all the others that I’m forgetting to mention, do work reasonably well even by forward reasoning where one just chooses the next tactic, rule, or axiom. This tree search however works mostly because in Lean and other provers, many proofs are not that long. (If it is not obvious, when we say GPT-f solves X% of theorems, the theorems which are AI solvable are heavily skewed towards those with very short proofs.)

#### Jason Rute (Nov 01 2021 at 00:41):

**Summary**

#### Jason Rute (Nov 01 2021 at 00:41):

What this paper does is fairly simple. Instead of having an agent pick the next low-level action (e.g. twist the Rubik’s cube clockwise around the red face), it picks the next *subgoal* and then finds a path to that subgoal. In the Rubik’s cube case, a subgoal is another configuration of the cube that is about 4 low-level actions into the future. This isn’t that different from human reasoning where we don’t think “put one foot in front of the other”, but instead think “let’s go to lunch at the pizza place”.

#### Jason Rute (Nov 01 2021 at 00:41):

An in many ways, this is exactly the kind of insight needed when approaching search problems with machine learning. While we need better machine learning algorithms to make decisions on what to do next, a 20x speedup in a search isn’t going to help if the algorithm is really exponential. What is immensely more important is to give our algorithms a better choice of what to pivot on when searching the space. If an algorithm is just focusing on low level actions, it is going to get lost in the weeds. I know @Daniel Selsam’s approach to this dilemma is to have humans provide a domain-specific canvas by which a human outlines a solution and fills in the routine details. The human specifies the big important choices in the search and has an AI make those choices. This is not a bad approach. However, this paper gives a different task-agnostic solution: have the algorithm itself choose a list of subgoals to work on.

#### Jason Rute (Nov 01 2021 at 00:41):

I consider this not as a final step but only the beginning. In many ways, AI has gone backwards in regards to search. In classical AI, search and specifically finding ways to efficiently divide up and prune a search space are a key part of classical AI subjects, such as SAT solvers, constraint programing, integer programming, and more. How can we incorporate those back into neural AI, but also use neural networks to increase the generality of these approaches? Ideally, instead of hard coding a domain specific solution, we would like our algorithm to find one automatically.

#### Jason Rute (Nov 01 2021 at 00:41):

Now, as for the specific paper, there are a few ingredients one needs for this to work:

*Subgoal generator*: A way to generate subgoals. (They used CNNs or Transformers, both powerful neural networks.)*Low-level conditional generator*: A way to find a path of low-level actions from your current goal to the proposed next subgoal. (They used brute force when tractable, or a neural network when not.)*Value function*A way to score the subgoals by say estimating the distance to a solution. (Using neural networks.)*High-level Planner*This is the algorithm for the high-level search which goes from subgoal to subgoal. (They used best-first search or Monte Carlo Tree Search.)

#### Jason Rute (Nov 01 2021 at 00:41):

In many ways, nothing about the above 4 components themselves are remarkable. By using different technologies for each problem, they show that really what is great about their approach is that it is just the main idea of picking subgoals 3-4 moves ahead that works. (Although, it should be pointed out that modern language generation models like Transformers really make the subgoal generation feasible for a variety of problems.)

#### Jason Rute (Nov 01 2021 at 00:41):

As for training data, while one could certainly combine this with reinforcement learning, they didn’t need to. It was enough to use “off policy data”, which is a fancy way of saying that they don’t need access to the puzzle simulator to train their model. Instead, it was trained on data which came from other data sources. For the Rubik’s cube they just used random reversed paths starting from the goal. In Sokoban, they used some reinforcement learning runs from another algorithm.

#### Jason Rute (Nov 01 2021 at 00:41):

As for the training goals, they are quite simple. The subgoal generator predicts the state 4 (or 3) moves ahead; the low level generator predicts the next action given a current state and the target subgoal; and the value function predicts the number of actions.

#### Jason Rute (Nov 01 2021 at 00:42):

The results seem pretty impressive, especially considering there was no reinforcement learning, and they used a smaller computational budget than many similar solutions.

#### Jason Rute (Nov 01 2021 at 00:42):

**Theorem proving**

#### Jason Rute (Nov 01 2021 at 00:42):

Given how simple this idea is, it should definitely be tried in theorem proving. (And it would be fairly easy to execute with a method similar to GPT-f.) Indeed the authors showed their model performed well on INT, an inequality theorem proving benchmark.

#### Jason Rute (Nov 01 2021 at 00:42):

Now to apply it to Lean, Coq, etc, one has to consider a few factors:

- Theorem proving is not as linear as solving a Rubik’s cube. Tactics can split the tactic goal into multiple goals. Should the predicted “subgoal” (as used in this paper) be a full goal stack or just one of the goals in the stack?
- Tactics are powerful actions which can significantly change the tactic goal in many complicated ways. It is not clear how well we can predict the exact tactic state one move ahead much less three.
- There are a lot of symmetries in tactic goals, such as the renaming variables or changing the order of some hypotheses. In the above search algorithms, there was caching of repeated subgoals since they are graph-search problems. This becomes a lot harder when one has to deal with “pseudo-repeats”. This is even a worse problem in robots and video game research (such as Montezuma’s revenge) where many states are near equivalents to each other.

#### Jason Rute (Nov 01 2021 at 00:42):

To properly get a grip on issues (2) and (3), it might be necessary to train a contrastive model (not too different from self-supervised contrastive learning) which measures how “equivalent” two states are.

#### Jason Rute (Nov 01 2021 at 00:42):

I think this is worth trying (especially as language models get cheaper and easier to train). However, even if this particular algorithm fails on formal theorem proving, I think this paper (and other similar papers) really highlight the need to not get sucked into thinking of theorem proving as manipulating low level tactics, but instead think of it as bouncing back and forth between big ideas / plans / strategies / goals. While this sounds lofty, this paper clearly shows that simple adjustments to the search algorithm can work quite well.

#### Jason Rute (Nov 01 2021 at 00:42):

**Why it works**

#### Jason Rute (Nov 01 2021 at 00:42):

They also try to give an explanation for why this trick works. Since they are using valuations to estimate the distance to the solution, and using those valuations to guide their search (e.g. best first search), they note how unstable and non-monotone those valuation estimates are when going one low-level step at a time, but how that improves with jumps of four low-level steps. While I don’t disagree with this, I think there is a larger picture here which hasn’t been explored. Also, they try to test this hypothesis in section 4.6 using synthetic data, but I think they are changing too many things at once in that experiment.

#### Jason Rute (Nov 01 2021 at 00:42):

I see three possible advantages: (1) lumping multiple path segments into one action, (2) better distance-to-the-goal valuations, , (3) making better use of the policy probabilities. I think if we are careful we could design experiments to test all of these ideas (possibly on synthetic data which would make the evaluation computationally easier).

#### Jason Rute (Nov 01 2021 at 00:43):

For my idea (1), since there are many paths leading to the same subgoal, by thinking of subgoals as the actions instead of the low-level actions we are condensing many paths into one action. This both reduces our search space significantly, and gives our better action probabilities, since now we are considering the probability of all paths which lead to that subgoal instead of just the one we took. (Also, in Sokoban, by considering states of distance exactly 4, one is effectively splitting the state space in half due to the even/odd parity of moving along a lattice.) There are a few ways to measure the effect of this. One is to consider tree search spaces instead of graph search spaces. (An example of a tree search space is filling in a Sudoku puzzle in order.) Another way is to consider generate sequences of four low-level actions as a big action instead of generating subgoals as the “action”. (For many problems this greatly increases the action space since a number of 4-sequence moves lead to the same next subgoal.)

#### Jason Rute (Nov 01 2021 at 00:43):

The other two advantages, (2) better valuations, and (3) using the policy probabilities better, are a bit tricky to disentangle. When scoring a state, there are two compatible approaches. A distance-to-the-goal valuation looks ahead and asks “how good is this position relative to where I want to be?” It ignores the history of the path so far. Another approach is to look backwards, and ask “how likely is this path to be the right path which leads me to the goal?” While none of the search algorithms in the paper do this in full, a standard approach is just to multiply all the probabilities of each action (where here an action is a jump in state), to get a cumulative probability of the path so far. One could replace the valuation in the best first search for example with this measure, which is likely more stable than a distance-to-the-goal valuation. Now, what this paper does do is take the probability of a single action (again an action being a jump in state) into account. A longer jump in state is a bigger chunk of the whole path and therefore is a more accurate representation of the path and how good it is, which is why I’m saying that long distance subgoals are using the policy probabilities better (at least in there approach). I see a number of experiments here: (a) Pick tree search problems with trivial valuations (like filling in a Sudoku in order). If subgoals do better than one step, then this isn’t due to better valuations or combining action paths. (b) Repeat the same but score according to the cumulative probability. See if longer step sizes are better. (c) Find a tree search problem which only requires distance-to-the-goal valuations and just randomly chooses actions according to a uniform policy (for example moving in a grid world but attaching path history to each state). Then measure the effect of larger step sizes.

#### Jason Rute (Nov 01 2021 at 00:43):

Overall, I think there is a lot of science and mathematics of search that we in neural network based AI need to learn, rediscover, or invent. The main models so far for this sort of thing have been simple search algorithms taught in undergraduate CS courses, or algorithms designed for adversarial games. I really love papers like this which challenge the accepted wisdom. Now the next question is, can we develop a theory of neural search based on this and other ideas which take the results to the next level?

#### Johan Commelin (Nov 01 2021 at 06:29):

@Jason Rute Thanks (as usual :wink:) for your intersting summary!

#### Daniel Selsam (Nov 01 2021 at 15:07):

Jason Rute said:

... really highlight the need to not get sucked into thinking of theorem proving as manipulating low level tactics, but instead think of it as bouncing back and forth between big ideas / plans / strategies / goals.

Quick remark: mathlib proofs are filled with `have`

steps, which are subgoal-creating rather than "low-level". Systems trained on mathlib will automatically attempt these actions, though I'd agree that additional look-ahead-informed-data-augmentation could accentuate this direction.

#### Stanislas Polu (Nov 01 2021 at 22:58):

Daniel Selsam said:

Jason Rute said:

... really highlight the need to not get sucked into thinking of theorem proving as manipulating low level tactics, but instead think of it as bouncing back and forth between big ideas / plans / strategies / goals.

Quick remark: mathlib proofs are filled with

`have`

steps, which are subgoal-creating rather than "low-level". Systems trained on mathlib will automatically attempt these actions, though I'd agree that additional look-ahead-informed-data-augmentation could accentuate this direction.

Just wanted to share that this statement is supported by experimental evidence. Models trained on mathlib indeed inherit the capability to introduce cuts with `have`

statements :+1:

Last updated: Dec 20 2023 at 11:08 UTC