Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: InternLM2.5-StepProver


Zijian Wu (Oct 22 2024 at 02:33):

We have submitted a technical report on arXiv introducing a new set of open-source language models designed for theorem proving in Lean 4, including our newest policy model, InternLM2.5-StepProver, and its corresponding critic model, InternLM2.5-StepProver-critic. Our models achieve new state-of-the-art results on the miniF2F benchmark (65.9%) and the ProofNet benchmark (27.0%).

We utilize Lean-Workbook, one of the largest open-source problem collections available, for large-scale expert iteration. We prove or disprove 17.0% of the problems in Lean-Workbook-Plus, yielding an enhanced theorem-proving dataset; this is a significant improvement over the 9.5% of problems proved when Lean-Workbook-Plus was first released. With the search trajectories obtained during the iteration, we train a critic model, InternLM2.5-StepProver-critic, to guide the policy model toward deeper (and, on average, better) proofs. Integrating the critic model yields a clear performance improvement over naïve best-first search.
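
For a sense of how the critic steers the search, here is a minimal sketch of critic-guided best-first search. The names (`propose`, `apply_tactic`, `score`, `is_proved`) and the budget are hypothetical placeholders rather than the actual interfaces; replacing `score` with the policy's accumulated log-probability recovers the naïve best-first baseline.

```python
import heapq
import itertools
from typing import Callable, Iterable, Optional

Tactic = str
State = str  # a pretty-printed Lean proof state, for illustration

def best_first_search(
    init_state: State,
    propose: Callable[[State, list[Tactic]], Iterable[Tactic]],  # policy model
    apply_tactic: Callable[[State, Tactic], Optional[State]],    # Lean; None = tactic failed
    score: Callable[[State], float],                             # critic; higher = more promising
    is_proved: Callable[[State], bool],
    budget: int = 600,                                           # tactic attempts; illustrative
) -> Optional[list[Tactic]]:
    tie = itertools.count()  # break score ties without comparing states
    # heapq is a min-heap, so negate the score to expand the best state first.
    frontier = [(-score(init_state), next(tie), init_state, [])]
    seen = {init_state}
    while frontier and budget > 0:
        _, _, state, path = heapq.heappop(frontier)
        for tactic in propose(state, path):
            budget -= 1
            nxt = apply_tactic(state, tactic)
            if nxt is None or nxt in seen:
                continue  # failed tactic, or a state already explored
            if is_proved(nxt):
                return path + [tactic]  # tactic sequence closing the goal
            seen.add(nxt)
            heapq.heappush(frontier, (-score(nxt), next(tie), nxt, path + [tactic]))
    return None  # budget exhausted without finding a proof
```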

The policy and critic models, as well as the enhanced Lean-Workbook-Plus dataset, have been open-sourced on Hugging Face.

Huang Suozhi (Oct 22 2024 at 07:06):

Also see our work on Twitter: https://x.com/hungsuzh143318/status/1848604474555961804 :heart_kiss:

Sid (Oct 22 2024 at 19:09):

Interesting paper! I had a couple of questions:

  1. Do you put just the last tactic, or all previous tactics, as part of the prompt in addition to the state?
  2. In theory the state should have all the information, right? So do you have a sense of why putting in the past tactics helps?

Heather Macbeth (Oct 22 2024 at 19:16):

Very minor quibble: the report says that your score of 6/640 on PutnamBench is state-of-the-art, but it seems that as of a few weeks ago the leading score is 7/640.

Huang Suozhi (Oct 23 2024 at 01:38):

Sorry, we didn't notice that. We will update that soon.

Zijian Wu (Oct 24 2024 at 02:06):

Sid said:

Interesting paper! I had a couple of questions:

  1. Do you put just the last tactic, or all previous tactics, as part of the prompt in addition to the state?
  2. In theory the state should have all the information, right? So do you have a sense of why putting in the past tactics helps?
  1. We opt to put all previous tactics in the prompt (a sketch of the two formats follows below).
  2. Though we have not studied this thoroughly, we did test a policy model whose prompts contain only the state information, and we observed a performance drop, especially on problems that require many power/logarithm manipulations. Such problems can be solved in different ways, depending on how the model simplifies the equations (in exponential form or in logarithmic form), but the model has to stick to one of them throughout. We believe that including all previous tactics in the prompt provides a "direction" for the proof, whereas models trained on state information alone are more likely to "swing" back and forth, going in circles. We observe similar effects on problems that require estimating the upper bound of an integer variable and then enumerating over it, such as mathd_numbertheory_296.
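
To make the two prompt formats concrete, here is a minimal sketch. The template strings and the example Lean state are illustrative assumptions, not the exact format used to train InternLM2.5-StepProver.

```python
def state_only_prompt(state: str) -> str:
    # Prompt containing only the current proof state.
    return f"PROOF STATE:\n{state}\nNEXT TACTIC:"

def state_with_history_prompt(state: str, past_tactics: list[str]) -> str:
    # Prompt containing every tactic applied so far, in order, plus the state.
    history = "\n".join(past_tactics)
    return f"TACTICS SO FAR:\n{history}\nPROOF STATE:\n{state}\nNEXT TACTIC:"

# The history commits the proof to one route, so the model is less
# likely to "swing" back to an alternative simplification.
print(state_with_history_prompt(
    "x : ℝ\nhx : 0 < x\n⊢ Real.exp (Real.log x) = x",
    ["intro x", "intro hx"],
))
```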

Jason Rute (Oct 24 2024 at 02:23):

As for swinging back and forth: in PACT (which conditions only on the current state), I've observed the model suggesting a rewrite with commutativity, and then suggesting a rewrite with commutativity again, effectively undoing the previous rewrite. (I've also heard others say that adding extra information like this can sometimes hurt performance, so it probably depends on the model and approach, and needs to be experimented with thoroughly in practice.)
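
For concreteness, here is a small self-contained Lean 4 example of that cycling behaviour (`Nat.add_comm` stands in for any commutativity rewrite):

```lean
-- Rewriting with commutativity twice undoes itself, so a prover that
-- sees only the current state can loop without making progress.
example (a b : Nat) (h : b + a = 0) : a + b = 0 := by
  rw [Nat.add_comm]  -- goal: b + a = 0
  rw [Nat.add_comm]  -- goal: a + b = 0  (back where we started)
  rw [Nat.add_comm]  -- goal: b + a = 0  (again)
  exact h
```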


Last updated: May 02 2025 at 03:31 UTC