Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Using Tree of Thoughts Methodology with Logical Questions


Michal Novomestsky (Jun 06 2023 at 10:13):

Hello all,

I've recently been giving a lot of thought to how AI-driven theorem proving can be approached through the lens of LMs, which increasingly seem to be the way to go for this problem. I'm writing this post in light of the recently published Tree of Thoughts (ToT) and "Let's Verify Step by Step" papers, which I believe have both reaffirmed many of my core ideas and, more importantly, provide a way to practically implement them. Furthermore, I believe my thoughts follow as a reasonable next step, as they touch on some of the parts both papers have left to future work (creating a better proofstep generator in the case of Let's Verify, and a slightly more generalised methodology than the one proposed by ToT). I've already reached out to the authors of both works (including @Daniel Selsam ) for their thoughts. I'd love to hear any discussion or feedback you might have, since these ideas are still very surface-level and there is much for me to learn in this field.

First, I'd like to very briefly touch on a general observation: I find that almost all successful lecturers and textbooks in mathematics (and logical fields in general) tend to encourage the student to ask questions about a problem or conjecture - "What happens as x gets very large?", "Can I think of edge cases where this behaves differently or interestingly?", "Does the problem or its resulting behaviour seem familiar to anything else I've encountered in maths (or anywhere else for that matter)?" Asking such questions often helps us understand the problem better by surfacing more contextual information about it, and offers insight into how it may relate to previously-seen problems (a methodology which I believe ties into deep learning very naturally). Doing so often allows us to A. start a problem if we don't know how to and/or B. keep moving forward if we get stuck. Both of these make for, at least in my view, an excellent way to sample richer and more relevant thoughts and proofsteps from an LM when solving a mathematical (or indeed any generally logical) problem, and allow it to continue searching for a solution from a "different perspective" if it gets stuck.

ToT provides a very natural way to implement this: by prompting an LM to ask itself these logical/investigative questions, it can seed branches in the thought tree which can then be searched and pruned based on the reward they generate. The significance of this is that it essentially breaks the problem up into recursive sub-tasks which the model is more likely to approach successfully (e.g. tasking a model with investigating the first N states of some formula, or getting it to search for the poles of a complex-valued function, is much easier and more straightforward than asking it to prove an entire conjecture at face value).
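
To make that concrete, here's a minimal sketch of the kind of loop I have in mind (Python; `generate` and `score` are stand-in stubs for LM calls, and the prompt wording is entirely my own - none of this is from either paper's code):

```python
import heapq

def generate(prompt, n):
    """Stub for an LM sampling call - swap in a real model here."""
    return [f"[sampled thought {i}]" for i in range(n)]

def score(state):
    """Stub value function - swap in an LM-based evaluator here."""
    return 0.5

QUESTION_PROMPT = (
    "Given the problem and progress so far:\n{state}\n"
    "Pose one investigative question (edge cases, limiting behaviour, "
    "familiar structures) and answer it as a next step."
)

def tot_search(problem, breadth=3, depth=4, beam=5):
    """Breadth-first ToT-style search where every branch is seeded by a
    self-posed investigative question rather than a raw proof step."""
    frontier = [(1.0, problem)]
    for _ in range(depth):
        candidates = []
        for _, state in frontier:
            # Each sampled completion is one question plus its answer,
            # forming a child node in the thought tree.
            for step in generate(QUESTION_PROMPT.format(state=state), n=breadth):
                child = state + "\n" + step
                candidates.append((score(child), child))
        # Prune: keep only the top `beam` states by heuristic value.
        frontier = heapq.nlargest(beam, candidates, key=lambda c: c[0])
    return max(frontier, key=lambda c: c[0])[1]
```

(The value-based pruning here is just the BFS variant from the ToT paper; a DFS variant with backtracking would serve equally well.)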

Here I think the ToT methodology can be generalised slightly further: currently, ToT and its ancestor paradigms (IO, CoT and CoT-SC) all construct a single path through the thought space to reach a solution (at least to my understanding), the difference being that ToT weighs up options either breadth-wise or depth-wise before proceeding at each branch. But often, when proving/solving a problem, we need to construct several initially unrelated lemmas or side-thoughts which work in unison to lead to the final solution. In light of this, I would argue that an even more robust and general paradigm would be to value branches based on their relationship with other branches (e.g. a possible value function could be V(p_theta, s) ~ p_theta(s | s's relationship with its neighbours)). LMs could evaluate how separate chains of thought stack up when considered in unison, and combine sub-trees into a single chain of thought if doing so looks worthwhile. One potential issue, however, is the size of the search space - I'm unsure how expensive such a search would be (it may only be practical early on, when there are very few branches and only high-level questions posed (e.g. "investigate what happens when x = 0"), using that stage to devise a high-level "best strategy" and proceeding from there).
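
A toy version of that neighbour-aware valuation might look like this (reusing the `score` stub from the first sketch; the pairwise `compatibility` scorer and the blending weight are my own inventions, purely for illustration):

```python
from itertools import combinations

def compatibility(a, b):
    """Stub: an LM call rating how well two partial chains of thought
    would combine into a single argument, in [0, 1]."""
    return 0.5

def neighbour_value(branch, neighbours, alpha=0.5):
    """Blend a branch's solo score with how well it composes with its
    sibling branches - roughly V(s) ~ p_theta(s | its neighbours)."""
    solo = score(branch)  # the solo evaluator from the first sketch
    if not neighbours:
        return solo
    synergy = sum(compatibility(branch, n) for n in neighbours) / len(neighbours)
    return (1 - alpha) * solo + alpha * synergy

def best_pair_to_merge(branches):
    """Pick the two sub-trees whose union looks most promising, as
    candidates for combining into one chain of thought."""
    return max(combinations(branches, 2), key=lambda pair: compatibility(*pair))
```

Note the pairwise comparisons are where the cost blows up - O(n^2) LM calls over n branches - which is why I suspect it's only feasible at the shallow, high-level stage of the tree.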

In essence, this would guide the model by initially proposing abstract, human-readable, high-level strategies, which gradually reduce into smaller and more concrete subtasks, with this extended-ToT methodology providing an efficient way to search through and chain these high-level ideas (perhaps tune the "abstractness" of a prompt to decay exponentially with a sub-tree's depth, net reward, or similar).
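
That abstractness schedule could be as simple as the following (the decay form, constants and template wording are all arbitrary choices on my part):

```python
import math

# Prompt templates ordered from most abstract to most concrete
# (hypothetical examples of the kind of thing I mean).
TEMPLATES = [
    "Propose a high-level strategy for: {state}",
    "Suggest a lemma or sub-goal that advances: {state}",
    "Give the next concrete formal step for: {state}",
]

def prompt_for(state, depth, decay=0.7):
    """Pick a template by exponentially decaying 'abstractness' with
    tree depth: the root gets strategic prompts, deep nodes get
    concrete, low-level ones."""
    abstractness = math.exp(-decay * depth)  # 1.0 at the root, -> 0 with depth
    idx = min(int((1 - abstractness) * len(TEMPLATES)), len(TEMPLATES) - 1)
    return TEMPLATES[idx].format(state=state)
```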

Successful strategies could be picked out in one (or both) of two ways:

  1. Simply value them based on the net reward their sub-tree generates, or perhaps the fraction of the total reward they contribute.

  2. Since LMs are quite good at summarising and condensing large amounts of text/structured data (anyone who's used ChatGPT knows this), I wonder if a way to train and tune these models would be to prompt the LM to "summarise" a series of successful training cases (e.g. prompt it to summarise the key high-level strategies, and then how the low-level, concrete mathematical steps come into play). By getting it to summarise in this fashion, it may learn to propose the right high-level reasoning required to approach problems. Likewise, it could be prompted to summarise false positives and false negatives and act to error-correct itself (similarly to what Let's Verify proposes, and in a more generalised fashion than ToT's simple reasonability checking on arithmetic and the like). A rough sketch of such a summarisation prompt follows this list.
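
For point 2, I imagine a prompt and distillation step along these lines (the wording is entirely hypothetical, and `generate` is the LM stub from the first sketch):

```python
SUMMARISE_PROMPT = (
    "Below are {k} solved training problems with full solution traces:\n"
    "{traces}\n"
    "1. Summarise the recurring high-level strategies.\n"
    "2. For each strategy, explain how the concrete low-level steps realised it.\n"
    "3. Repeat for the attached false positives/negatives, stating what "
    "check would have caught each error."
)

def distil_strategies(solved_traces, failure_traces):
    """Ask the LM to condense traces into reusable strategy descriptions,
    which could then serve as fine-tuning data or in-context guidance."""
    prompt = SUMMARISE_PROMPT.format(
        k=len(solved_traces),
        traces="\n---\n".join(solved_traces + failure_traces),
    )
    return generate(prompt, n=1)[0]  # generate: the LM stub from the first sketch
```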

On a more speculative note, I wonder if getting the LM to prompt itself by asking why it thinks certain strategies work would be useful in teaching it to understand reasoning better (it knows that some strategies are successful because they've just been valued using either of the previously mentioned approaches, but can it figure out why?). This ties into ideas of metacognition and could better teach it how to reason in a generalised way, although I worry that it might just hallucinate and not really lead anywhere.

I wonder if this could generalise even further. By getting a model not only to look at concrete, low-level maths (e.g. using theorem X to simplify an equation), but also to consider the efficacy of these high-level thought-structures and "logical questioning", we may even help the model understand logical reasoning in general.

Furthermore, many concepts in maths, whilst different on the surface, are fundamentally related through symmetry and isomorphism (e.g. Pascal's triangle and the binomial expansion are fundamentally the same, although they appear in very different contexts). So by training a model to look at these high-level ideas, we may stop it overfitting by simply memorising the particular structure of the question it is posed, and instead get it to think about and investigate how the underlying mathematical system behaves (which I would argue is the essence of IMO problems, since they can't draw on any content beyond a high-school level afaik). To that end, I wonder if it would be even more beneficial to get the LM to summarise a training epoch without access to the actual mathematical objects, focusing instead on the strategies and behaviour-structures themselves (which may further promote an understanding of reasoning).

I'd like to close with a remark on hallucination - given how aggressively LM-driven this methodology is, it would be highly susceptible to hallucination (I suspect the reason the ToT paper had so much success is that its value functions used fairly rudimentary heuristics, like checking whether it's still possible to reach 24 from the current state, which don't require large leaps of logic). As such, I think it would be necessary to use a verification paradigm similar to that of the Let's Verify paper.
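
Concretely, I imagine gating every candidate step through a step-level verifier before it can enter the tree, in the spirit of Let's Verify's process supervision (`verify_step` below is a stand-in stub, not their actual reward model, and `generate` is again the LM stub from the first sketch):

```python
def verify_step(context, step, threshold=0.9):
    """Stub for a process-reward-model call: returns True iff the model's
    estimated probability that `step` validly follows `context` clears
    the threshold."""
    p_valid = 0.95  # placeholder; a real PRM would compute this
    return p_valid >= threshold

def expand_with_verification(state, breadth=3):
    """Only verified steps become children of `state`; everything else is
    discarded before it can seed further (possibly hallucinated) branches."""
    children = []
    for step in generate(f"Propose a next step for:\n{state}", n=breadth):
        if verify_step(state, step):
            children.append(state + "\n" + step)
    return children
```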

Sorry for such a long wall of text! I'd love to hear any thoughts or feedback you may have on this. I'm still very much a novice in this field, but I would be absolutely thrilled to turn this into a research paper and contribute to the field, and am very open to any collaboration :smiley: .


Last updated: Dec 20 2023 at 11:08 UTC