Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Autoformalization of Coding Problems


GasStationManager (Nov 29 2024 at 02:55):

As part of my push for creating coding AIs able to prove the correctness of their own code,
I am working on autoformalization of coding problem datasets, i.e. translating problem descriptions into formal specifications in Lean.

A major challenge of autoformalization efforts is the evaluation of the quality of the translations in a scalable manner.
For coding problems, the situation may be better, especially if the dataset comes with test cases for each problem. In short, we can repurpose these test cases to serve as checks on the fidelity of the translations. If the translation is faithful, the formal specification and the test cases should be compatible, and this check can be automated.
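To make the compatibility check concrete, here is a minimal Lean sketch (a hypothetical toy problem with made-up names, not taken from the dataset): the specification is a predicate relating input and output, and a test case from the dataset becomes a small theorem that automation must discharge.

-- Hypothetical problem: "given n, output 2 * n".
def ProblemSpec (input output : Nat) : Prop := output = 2 * input

-- A dataset test case (input 3, expected output 6), repurposed as a
-- compatibility check on the formal specification.
example : ProblemSpec 3 6 := by
  unfold ProblemSpec
  rfl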

See here for details and preliminary results on my proof of concept. With a sample of 30 problems (taken from the code_contests data set), I used Sonnet to translate them into formal specifications in Lean, resulting in 17 specifications without syntax errors. My verification procedure was able to prove that 10 of these are compatible with their test cases.

I think we can already scale this up to produce autoformalized data sets of a pretty good size.

GasStationManager (Nov 29 2024 at 03:01):

Comments are welcome! A couple of questions:

  • right now my proof procedure (beginning of this file) consists of trying a kitchen sink of all the tactics I know about (a rough sketch of the idea is below). Any advice on building a good automated proof script for simple proof tasks? I did try the lean-smt solver, but perhaps I'm not using it in the best way. Same with Aesop. What about lean-auto / duper?
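A rough, hypothetical sketch of such a kitchen-sink script (the tactic list is illustrative only; assumes Mathlib is available for aesop):

import Mathlib.Tactic

-- `first` tries each tactic in order until one succeeds.
-- Tactics that can make progress without closing the goal (e.g. simp_all)
-- get a trailing `done` so failure falls through to the next option.
example (a b : Nat) (h : a ≤ b) : a ≤ b + 1 := by
  first
  | rfl
  | omega
  | (simp_all; done)
  | aesop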

GasStationManager (Nov 29 2024 at 03:14):

  • anyone else here working on autoformalization? With autoformalization of mathematical theorems, a similar approach could work partially: use e.g. Plausible to try to generate counterexamples (see the sketch after this list). But if your formalized theorem statement passes the test cases, it could still be a different statement from the natural language version, one that just happens to also be true. With coding problems that is not an issue, because the problem asks for a function, so a wrong translation is very unlikely to satisfy the test cases by accident.
  • So if you are training an autoformalization agent, it might be beneficial to also train on coding problems, even if your goal is math, because you get much nicer ground truth feedback from the coding problem test cases...
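A toy sketch of the Plausible idea (assuming the Plausible package is available as a dependency; the exact import may differ):

import Plausible

-- A deliberately wrong "formalization": `plausible` searches for a
-- counterexample and reports one (e.g. n = 2), flagging the statement as suspect.
example : ∀ n : Nat, n * n = n := by
  plausible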

Siddhartha Gadgil (Nov 29 2024 at 09:38):

I am working on autoformalization, and the test I use is "roundtrip": informalize and get an LLM to compare the statements. This is not perfect, but it is mostly correct with "o1-mini" comparing the statements (using "GPT-4o" for both formalization and informalization).

Jason Rute (Nov 29 2024 at 12:33):

The main techniques I’ve seen in the literature are:

  • (a) trying to automatically prove the statement correct, incorrect, or vacuous (hypotheses imply false),
  • (b) round tripping (either at the level of text, at an embedding level, or using an LLM to verify),
  • (c) automatically proving that two translations are equivalent and using the largest cluster of pairwise equivalent translations,
  • (d) using an LLM to score the translation,
  • (e) using feedback from Lean (and possibly automatic fixes) to fix bugs and make sure the code compiles.

But this whole field is new and there is a lot of room for new ideas and better implementations of existing ideas.
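For (a), the vacuity check amounts to proving that a formalization's hypotheses are contradictory. A minimal sketch with a made-up statement:

-- If this proof succeeds, the formalized statement is vacuously true:
-- its hypotheses can never hold at the same time, so it carries no content.
example (n : Nat) (h1 : n < 2) (h2 : 5 < n) : False := by
  omega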

GasStationManager (Nov 29 2024 at 19:32):

Then there’s also David Silver’s assertion that autoformalization doesn’t have to be correct; as long as some theorem statements are produced, they can be used as training data, and that’s how they trained AlphaProof. Has anyone else tried that approach to train a theorem prover?

Jason Rute (Nov 29 2024 at 19:40):

DeepSeek-Prover (and DeepSeek-Prover v1.5, which uses the same training) is very close to AlphaProof in that respect. They formalize a large number of scraped math problems in different ways and use both the correct and incorrect formalizations. (That is also where I first heard about also checking for vacuously true formalizations.)

Jason Rute (Nov 29 2024 at 19:42):

(Also there has been a large number of autoformalization papers in the last few months and I’m behind on reading them all.)

GasStationManager (Nov 29 2024 at 19:46):

Re: using feedback from Lean, I can confirm that it is very useful. A lot of the time the LLM outputs will have syntax errors. Some LLMs (like Sonnet) are almost there, and getting feedback from Lean allows them to fix their code. I made a simple script, LeanTool, for this; it might be of interest to others.

Jason Rute (Nov 29 2024 at 19:47):

Also as for using incorrect formalizations, the first paper using GPT-style LLMs for autoformalization (https://arxiv.org/abs/2205.12615) also showed you can use a large number of mostly wrong formalizations to train a theorem prover with RL, but I don’t think they used proving the theorem false as a training objective. (Although some papers also use successful parts of partial proofs, like say if you can prove one case but not the other, to add more training data. I don’t know if that paper did that.)

Jason Rute (Nov 29 2024 at 19:50):

Re feedback: This is also a big technique now in theorem proving. A number of papers use symbolic methods and/or editor feedback to fix broken LLM generated proofs. The pass rates are much higher than the LLM proof alone. I survey such papers in another thread.

Jason Rute (Nov 29 2024 at 19:56):

The list is in this thread, but there are already a number of newer papers like CobbleStone (in Coq). https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/Proper.20LLM.20Lean4.20Integration.20with.20recursive.20checks.20for.20error. It is just so easy for folks to experiment with this approach.

GasStationManager (Nov 29 2024 at 20:18):

Thanks very much for the pointers!

GasStationManager (Nov 29 2024 at 22:55):

I've always thought that the Draft-Sketch-Prove type of approach holds a lot of promise; if we want to utilize automated tools like SMT/SAT solvers (and I think we do), then that is a natural way to divide the labor. However, most papers using this type of approach appear to be implemented in Isabelle, apparently due to the availability of the sledgehammer tool there. Does Lean have a comparable set of tools for automatic proofs today? (Even if only a subset is available, perhaps some tasks can be handled by LLMs...)

Thomas Zhu (Nov 29 2024 at 23:15):

GasStationManager said:

I've always thought that the Draft-Sketch-Prove type of approach holds a lot of promise; if we want to utilize automated tools like SMT/SAT solvers (and I think we do), then that is a natural way to divide the labor. However, most papers using this type of approach appear to be implemented in Isabelle, apparently due to the availability of the sledgehammer tool there. Does Lean have a comparable set of tools for automatic proofs today? (Even if only a subset is available, perhaps some tasks can be handled by LLMs...)

A hammer in Lean is under development (for some pointers, see lean-auto and Duper). There are also purely LLM-based tools like LLMLean and LeanCopilot.

Orthogonally, a significant issue for re-creating Draft Sketch Prove in Lean is that there is very little sketch-style data in Lean. To illustrate, many proofs in Isabelle look like (when translated to Lean syntax):

example : G :=
  have h1 := by sledgehammer
  have h2 := by sledgehammer [h1]
  have h3 := by sledgehammer [h1, h2]
  show G by sledgehammer [h1, h2, h3]

and therefore it is natural in Isabelle to train models that generate only the informal/formal "sketch" part h1, h2, h3, because of the abundance of such data; then you can finish off with the hammer. On the other hand, Lean proofs are usually

example : G := by
  tactic1 -- suffices to show G'
  tactic2 -- suffices to show G''
  tactic3 -- solves G''

and here these tactics cannot be replaced by 3 hammer invocations (note that hammers work by translating to a format for external provers, and can therefore either completely solve a goal or completely fail). So the problem of generating "sketches" that can be finished off by hammers is much harder in Lean than in Isabelle (where you have papers like Draft, Sketch, Prove and SubgoalXL), simply due to the lack of training data.

Thomas Zhu (Nov 29 2024 at 23:22):

Perhaps this is getting far from the original topic. On autoformalization, my opinion is that for RL training data, the correctness of the formalized version doesn't matter. If I interpret correctly, AlphaProof tries 100 formalizations per problem (1M informal problems -> 100M formal problems). Correctness begins to matter for generating benchmarks from e.g. high school competitions, where it is crucial.

Joseph Myers (Nov 30 2024 at 00:55):

Arguably the sketch of a good Lean proof is the sequence of lemmas involved (in that long proofs are best split up into separate lemmas rather than having a single long by tactic block).

Joseph Myers (Nov 30 2024 at 00:56):

You might not be able to prove each individual lemma by a hammer, but the smaller the lemmas are, surely the easier it would be for an AI to reconstruct proofs for each of them.
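A minimal toy sketch of that decomposition (hypothetical lemma names): each small lemma is something a hammer, a decision procedure, or an LLM might plausibly prove on its own.

-- Small, self-contained lemmas standing in for the "sketch" of a larger proof.
theorem step1 (n : Nat) : n + 0 = n := by omega
theorem step2 (n : Nat) : 0 + n = n := by omega

-- The main result is assembled from the lemmas.
theorem main (n : Nat) : n + 0 = 0 + n := by
  rw [step1, step2]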

GasStationManager (Nov 30 2024 at 01:11):

Actually my main interest is in code generation (and proving its correctness). In this setting, the LLM can often already articulate in natural language why its code is correct. So perhaps we just need a good formalizer to translate that argument into a formal sketch (coming back to the topic of this thread).
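In that setting the target artifact could look like this minimal sketch (hypothetical function and theorem names): generated code, a specification, and a proof whose steps are the sort of thing a natural-language correctness argument could be formalized into.

-- Hypothetical generated code: maximum of two naturals.
def myMax (a b : Nat) : Nat := if a ≤ b then b else a

-- Correctness properties; `split` cases on the `if`, `omega` closes each case.
theorem myMax_ge_left (a b : Nat) : a ≤ myMax a b := by
  unfold myMax
  split <;> omega

theorem myMax_ge_right (a b : Nat) : b ≤ myMax a b := by
  unfold myMax
  split <;> omega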

Thomas Zhu (Nov 30 2024 at 03:41):

Joseph Myers said:

You might not be able to prove each individual lemma by a hammer, but the smaller the lemmas are, surely the easier it would be for an AI to reconstruct proofs for each of them.

I agree there is some equivalence between have-style, declarative proofs and short lemmas preceding a proof. Plus, if a lemma is not solvable by a hammer, one can always propose lemmas for that lemma and recurse. On the machine learning side, one would need to train a lemma generator, which requires extracting meaningful (theorem, lemma) pairs from human-written proofs; this recursive lemma pipeline is explored in this paper (though in Isabelle): https://arxiv.org/pdf/2411.01829

GasStationManager (Dec 03 2024 at 10:51):

Small update: I have ported these autoformalized problems to Code with Proofs: the Arena demo site. They go from challenge 38 "beautifulTable" to challenge 47 "dice_game". You're welcome to try them out!


Last updated: May 02 2025 at 03:31 UTC