Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: ICML
Jason Rute (Jul 21 2024 at 08:31):
ICML (one of the big AI conferences) is happening this week. I’m attending and will present a poster on our recent Graph2Tac work (updated from when I last talked about it here).
Also, there is going to be an AI for Math workshop, which has some invited talks and lots of papers. If anyone else here is coming, I’d love to attend your talk or poster and/or chat about AI for theorem proving!
Logan Murphy (Jul 21 2024 at 20:13):
I will also be presenting a poster on our paper on autoformalization. Looking forward to seeing you there Jason!
Jason Rute (Jul 24 2024 at 08:22):
I don't think many of us are here, but for those who are (and for those who want to follow along from home), here are the most relevant posters I've found (not including all the many great works at the workshop on Friday):
- Wed 1:30 PM CET: Graph2Tac: Online Representation Learning of Formal Math Concepts (@Lasse Blaauwbroek, @Mirek Olšák, @Jason Rute, Fidel I. Schaposnik, Jelle Piepenbrock, and Vasily Pestun)
- Thu 1:30 PM CET: Subgoal-based Demonstration Learning of Formal Theorem Proving (Xueliang Zhao, @Wenda Li, Lingpeng Kong)
- Thu 1:30 PM CET: Autoformalizing Euclidean Geometry (@Logan Murphy, @Kaiyu Yang, Jialiang Sun, Zhaoyu Li, Anima Anandkumar, and Xujie Si)
Jason Rute (Aug 07 2024 at 14:21):
Besides all the above-mentioned papers, here are two posters I saw at ICML that I thought were very applicable to AI for formal theorem proving.
Jason Rute (Aug 07 2024 at 14:21):
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models
This paper uses Monte Carlo Tree Search to solve a number of tasks, including generating computer code (the HumanEval benchmark). In the code-generation setting, each action is generating a complete Python function. If the unit tests fail, the agent sees the test errors and gives a self-reflection in natural language about what it could do differently. All of those things (the previous attempt, the test errors, and the self-reflection) are added to the next state, which is used as the prompt for the LLM. Since it is a tree-search algorithm, it can choose to try a different branch of the search or to continue down the current branch.
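A rough sketch of what that loop might look like (a simplified best-first variant rather than the paper's full MCTS; `llm(prompt) -> str` is an assumed stand-in for any text-completion call, and `run_tests` is a hypothetical helper):

```python
# Minimal LATS-style loop for code generation (illustrative sketch, not the
# paper's exact algorithm).  `llm` is an assumed text-completion callable.
import heapq
import itertools
import traceback

def run_tests(code: str, tests: str) -> tuple[bool, str]:
    """Execute a candidate solution against unit tests; return (passed, error text)."""
    env: dict = {}
    try:
        exec(code, env)
        exec(tests, env)
        return True, ""
    except Exception:
        return False, traceback.format_exc(limit=1)

def lats_sketch(task: str, tests: str, llm, width: int = 3, budget: int = 12) -> str | None:
    tie = itertools.count()
    # A frontier entry is (num_errors, tiebreak, state); the state carries the
    # previous attempt, its test errors, and a self-reflection -- the pieces
    # that get folded into the next prompt.
    frontier = [(0, next(tie), {"attempt": "", "errors": "", "reflection": ""})]
    for _ in range(budget):
        if not frontier:
            break
        _, _, state = heapq.heappop(frontier)   # continue the most promising branch
        prompt = (f"Task:\n{task}\n\nPrevious attempt:\n{state['attempt']}\n\n"
                  f"Test errors:\n{state['errors']}\n\n"
                  f"Reflection:\n{state['reflection']}\n\nWrite the complete function:")
        for _ in range(width):                  # expand: sample several children
            code = llm(prompt)
            passed, errors = run_tests(code, tests)
            if passed:
                return code
            reflection = llm(f"This code failed:\n{code}\nErrors:\n{errors}\n"
                             "In one sentence, what should be done differently?")
            # Crude heuristic value (fewer errors = better); the paper instead uses
            # an LLM-based value estimate plus backpropagated search statistics.
            heapq.heappush(frontier, (errors.count("Error"), next(tie),
                                      {"attempt": code, "errors": errors,
                                       "reflection": reflection}))
    return None
```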
Jason Rute (Aug 07 2024 at 14:22):
Screen-Shot-2024-08-07-at-8.35.01-AM.png
Jason Rute (Aug 07 2024 at 14:22):
Many of the current LLM approaches to formal theorem proving either predict individual tactics (which can get expensive) or do whole-proof generation with many independent samples that ignore system feedback (which defeats the point of an interactive theorem prover). I wonder if a technique like this could work well in interactive theorem proving, or if it would just get expensive quickly.
Jason Rute (Aug 07 2024 at 14:23):
MathScale: Scaling Instruction Tuning for Mathematical Reasoning
(Video: https://www.youtube.com/watch?v=BS4rrzajSdg)
Jason Rute (Aug 07 2024 at 14:23):
The goal of this paper is to create a synthetic natural-language mathematical question-and-answer dataset for training an LLM to solve math problems.
Jason Rute (Aug 07 2024 at 14:23):
They do this by extracting mathematical data from an LLM (GPT-3.5) in the form of a graph of mathematical concepts. This graph isn’t very sophisticated. It includes a topic graph made up of high-level concepts, like Calculus and Trigonometry, and a Knowledge Point graph made up of specific examples, like the chain rule and the derivative of trigonometric functions. Related nodes (both within the topic and knowledge point graphs as well as between them) are connected by an edge.
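To make the structure concrete, here is a toy illustration of such a two-level graph (the nodes are made-up examples, not data from the paper):

```python
# Toy sketch of a topic / knowledge-point concept graph as an undirected
# adjacency map.  Node names are invented for illustration.
from collections import defaultdict

graph: dict[str, set[str]] = defaultdict(set)

def add_edge(a: str, b: str) -> None:
    graph[a].add(b)
    graph[b].add(a)

# Topic-level nodes linked to each other ...
add_edge("topic:Calculus", "topic:Trigonometry")
# ... and to knowledge-point nodes (specific facts/techniques).
add_edge("topic:Calculus", "kp:chain rule")
add_edge("topic:Calculus", "kp:derivative of trig functions")
add_edge("topic:Trigonometry", "kp:derivative of trig functions")
# Knowledge points can also be linked to each other.
add_edge("kp:chain rule", "kp:derivative of trig functions")
```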
Jason Rute (Aug 07 2024 at 14:24):
Screen-Shot-2024-08-07-at-9.22.40-AM.png
Jason Rute (Aug 07 2024 at 14:24):
Screen-Shot-2024-08-07-at-9.22.54-AM.png
Jason Rute (Aug 07 2024 at 14:24):
Once one has this graph, one can do random walks on it to brainstorm topics and knowledge points. Then from those brainstormed nodes, one generates new math questions and uses GPT-3.5 to answer those questions. I’m pretty sure there are no guarantees that the answers are correct, and they even say it doesn’t matter in their experiments. Just having the additional data helps.
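A hypothetical sketch of that generation step, reusing the toy graph above (`llm` is again an assumed text-completion callable; nothing here checks that the generated answer is correct, matching their observation that it doesn't need to be):

```python
# Random-walk brainstorming over the concept graph, then question/answer
# generation with an LLM.  Illustrative sketch only.
import random

def brainstorm(graph: dict[str, set[str]], start: str, steps: int = 4) -> list[str]:
    """Random walk over topic/knowledge-point nodes, collecting visited concepts."""
    node, visited = start, [start]
    for _ in range(steps):
        neighbors = list(graph[node])
        if not neighbors:
            break
        node = random.choice(neighbors)
        visited.append(node)
    return visited

def make_qa(graph, llm, start: str) -> tuple[str, str]:
    """Turn the brainstormed concepts into a new (question, answer) pair."""
    concepts = ", ".join(brainstorm(graph, start))
    question = llm(f"Write a math exercise combining these concepts: {concepts}")
    answer = llm(f"Solve step by step:\n{question}")   # no correctness guarantee
    return question, answer
```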
Jason Rute (Aug 07 2024 at 14:24):
I can’t help but wonder if an approach like this could be used to extract data in a more formal structure, like better knowledge graphs (using triples) or even in Lean. One could imagine a system like this iteratively coming up with concepts, theorems, definitions, and small exercise problems for training an automated formal theorem prover. This could be the next frontier in auto-formalization.
Yufan Zhao (Aug 08 2024 at 08:08):
Jason Rute said:
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models
This paper uses Monte Carlo Tree Search to solve a number of tasks, including generating computer code (the HumanEval benchmark). In the code-generation setting, each action is generating a complete Python function. If the unit tests fail, the agent sees the test errors and gives a self-reflection in natural language about what it could do differently. All of those things (the previous attempt, the test errors, and the self-reflection) are added to the next state, which is used as the prompt for the LLM. Since it is a tree-search algorithm, it can choose to try a different branch of the search or to continue down the current branch.
I have been following the Language Agent Tree Search (LATS) paper for a while now. It was initially submitted to (and rejected from) ICLR 2024 (https://openreview.net/forum?id=6LNTSrJjBe). A core concern of the ICLR reviewers was that LATS requires external environment feedback (mentioned in the LATS abstract: "... A key feature of our approach is the incorporation of an environment for external feedback, ..."), and this is problematic for datasets like HotpotQA (reasoning in natural language) because the environment won't actually tell you the correct answer. This concern wasn't completely addressed in the ICML submission. In section 5.1 on page 6, they mention "... Note that consistent with previous work (Yao et al., 2023b; Shinn et al., 2023), we use an oracle setup for HotPotQA, in which the environment provides feedback about the answer’s correctness upon receiving an answer. This enables a fair comparison between our method and baselines ...", so the comparison is fair now, but it doesn't really solve the problem at hand.
It seems to me that the main challenge of applying this method is getting the crucial "external feedback from the environment" mentioned in the abstract. In the context of Machine Learning for Theorem Proving, this likely means a way to assess the value of proof states. From other works it seems that this is a highly non-trivial task: there is the proofsize objective in Formal Mathematics Statement Curriculum Learning, the critic model in HyperTree Proof Search, and probably many more that I am unaware of right now.
Jason Rute (Aug 08 2024 at 10:54):
I would say it is non-trivial but not “highly non-trivial” to get a value function from Lean’s feedback. Sure, you need a model which takes Lean’s feedback and converts it into a scalar. The best way would be to use RL. The AlphaZero algorithm provides a way to learn that scalar: start with a value of one if solved and zero otherwise, and then use the backpropagated MCTS value as the training target for the value function. (Of course RL is expensive, especially in this setting.)
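As a purely schematic sketch of that recipe (my assumptions, not any specific paper's code): solved terminal states get reward 1, everything else 0, the reward is backed up along the visited path, and each node's backed-up average becomes a regression target for the value model.

```python
# AlphaZero-style value targets from proof search, illustrative sketch only.
from dataclasses import dataclass, field

@dataclass
class Node:
    state: str                      # e.g. a pretty-printed Lean proof state
    value_sum: float = 0.0
    visits: int = 0
    children: list["Node"] = field(default_factory=list)

    def value(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0

def backpropagate(path: list[Node], solved: bool) -> None:
    """After a rollout, push the terminal reward (1 if solved, 0 otherwise) up the path."""
    reward = 1.0 if solved else 0.0
    for node in path:
        node.visits += 1
        node.value_sum += reward

def training_targets(root: Node) -> list[tuple[str, float]]:
    """Collect (proof state, backed-up value) pairs to train the value function on."""
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.visits:
            out.append((node.state, node.value()))
        stack.extend(node.children)
    return out
```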
But I’ve also just seen papers use an off-the-shelf LLM as a rough scorer of quality. The DeepSeek-Prover paper does this, and the LATS paper does as well. If it works for LATS to turn Python unit-test errors into a scalar, it may work fine for Lean errors as well. (Although Python errors are more common than Lean errors in LLM training data.)
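A small sketch of that idea, with an assumed prompt and a hypothetical `llm` callable: ask the model to rate how promising the Lean feedback looks and parse the reply into a scalar for the search.

```python
# Off-the-shelf LLM as a rough value estimator for Lean feedback (sketch).
import re

def llm_score(llm, goal: str, lean_feedback: str) -> float:
    """Ask the LLM for a 0-10 rating and normalize it to [0, 1]."""
    reply = llm(
        "On a scale of 0 to 10, how promising is this partial Lean proof?\n"
        f"Goal:\n{goal}\nLean feedback:\n{lean_feedback}\n"
        "Answer with a single number."
    )
    match = re.search(r"\d+(\.\d+)?", reply)
    return min(float(match.group()), 10.0) / 10.0 if match else 0.0
```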
Jason Rute (Aug 08 2024 at 10:55):
(Also, most tree-search algorithms need some kind of scalar value, so it isn’t like this is a problem unique to their method.)