Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: More papers on autoformalization

Jason Rute (Sep 29 2022 at 21:03):

Two new papers dropped (submissions to ICLR 2023) on the topic of auto-formalization of statements and proofs in ITPs using off-the-shelf pre-trained models (like Codex and Minerva).

Jason Rute (Sep 29 2022 at 21:11):

Towards a Mathematics Formalisation Assistant using Large Language Models
- Lean 3 and 4
- Tried translating 120 informal theorem statements to Lean 4
- Tried translating 13 natural language proofs to Lean 3
- Some novel features include
  - input dependent prompts (similar to retrieval augmented language models)
  - using the Lean 3 to Lean 4 syntax translator (from mathport?) and the Lean 4 elaborator to filter candidate theorem translations

Jason Rute (Sep 29 2022 at 21:11):

Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs
- Isabelle
- Focus on miniF2F problems
- Gives an end-to-end solution which takes in informal statement and formal statement and then generates a proof (by first generating an informal proof and a formal Isabelle proof sketch, and finally applying hammer-like tactics to fill in the missing steps in the sketch).

Jason Rute (Oct 02 2022 at 02:45):

First, I want to say that I think both papers are very nice. They really show the potential of out-of-the-box large pre-trained language models to write proofs in ITPs. And like the recent auto-formalization paper, these techniques could be immediately coded up and given to Lean, Isabelle, Coq users to play with.

Jason Rute (Oct 02 2022 at 02:45):

But I'm also a bit concerned by a particular element in the second paper ("Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs"). This paper shows that one can solve many more proofs using LLMs to first make an informal proof, then convert that to a formal proof sketch, and finally fill in the holes with hammer-like tactics. Since this paper uses large pre-trained language models, I think it should at least have a small discussion about leakage. This is even more important since many problems in the miniF2F benchmark are famous and have been discussed on the internet repeatedly. In particular, there is a lot of discussion on the 1959 IMO p1 which the model got correct. (This is both the first IMO problem ever, and the one which some consider the easiest ever).

Jason Rute (Oct 02 2022 at 02:46):

Of course, some types of leakage are worse than others. For example, if Codex had the miniF2F repo in its training data, then any problems with an Isabelle proof in that repo would be much easier for the model. (Now Isabelle proofs were just added recently to miniF2f, and the Codex code-davinci-002 data is from Jun 2021, so I don't think there is this type of leakage, but this should be discussed in the paper.) Also, some of these proofs could also occur in other repos or in other ITPs. Finally, it is guaranteed that at least some of these theorems are in Minerva's and Codex's training data in their natural language form.

Jason Rute (Oct 02 2022 at 02:46):

Now this problem isn't unique to this paper. Most of the benchmarks on miniF2F use pre-trained language models specifically pre-trained on math-heavy data. This is very much desirable as it is an easy way to instill the model with "math and code awareness", but it also means that the miniF2F benchmark with it's heavy use of competition problems, many years old, should be thought of as not purely a reasoning benchmark, but an auto-formalization benchmark. This means that models like sledgehammer which don't even have the ability to memorize human-written proofs are at a disadvantage, especially for famous problems which have made their way into the training data (in natural language form). A cynical view (to play the devil's advocate) would be that this disadvantage might be reduced heavily if evaluated on problems which are guaranteed to be novel.

Jason Rute (Oct 02 2022 at 02:46):

I do think this is a tricky matter. The majority of formalization in mathematics is formalizing stuff which has already been published, some if it decades old and ubiquitous on the Internet. It is quite advantageous to have a model like this one, which given a theorem, automatically retrieves and reconstructs a number of informal proofs of that theorem, and uses those to make a number of formal proof sketches which are tested to see if they can be automatically filled in.

Jason Rute (Oct 02 2022 at 02:46):

By the way, the first paper (Towards a Mathematics Formalisation Assistant using Large Language Models) goes very much out of their way to test on problems unlikely to be in the training data of Codex.

Albert Jiang (Oct 02 2022 at 13:30):

I agree with Jason's assessment that the leakage problem should be addressed so that the readers are aware of this possibility. But to qualify it: the leakage, if any, is more of a Minerva problem (informal prover) than a Codex (autoformaliser) problem since the miniF2F/Isabelle partition contains not many proofs. For the DSP paper in particular, an ablation study shows that if one simply lets Codex write down the full formal proof, the performance is pretty bad. This to me suggests that pure memoisation does not work out-of-the-box.

Guillaume (Oct 03 2022 at 08:25):

@Jason Rute I think the problem you are mentioning is indeed more specific to Minerva, where it's true that the majority of the test problems can be found on Google. But here, the main point of the paper was more to show that if an informal proof is available (whether it's written by a human or a by model), then you can use it to guide a formal prover. So the first application here is not necessarily to prove new problems, but more to facilitate the formalization / translation in Lean when informal proofs exist (which is the case for all theorems on mathlib I imagine).

Albert Jiang (Oct 05 2022 at 12:44):

After some research and analysis, I conclude that it is not likely that the abilities of large language models in the DSP paper can be attributed to memorisation, although I agree entirely with @Jason Rute that this is a very tricky matter and should be treated with care. My reasons for the conclusion are:

For coming up with informal proofs, Minerva and Codex are used. In section 5 of the Minerva paper, the authors analysed and discussed the possibility of memorisation by rephrasing the problems or changing the numbers/variables in the problems, plus a few more text-similarity examinations. They did not find convincing evidence that Minerva benefits from memorisation on their evaluation dataset. Codex is trained on code data and is less likely to remember mathematical proof data (it also performs inferiorly to Minerva models).
For the autoformalisation of proof sketches, the specific Codex version (code-davinci-002) was trained on data up to June 2021, at which point the miniF2F benchmark was not publicly released yet! Also there is very little formalisation on the internet in general on these high-school maths competition problems. So it seems highly unlikely to me that Codex should remember anything about miniF2F.

I hope this is useful for the discussion of memorisation/data contamination with respect to this paper.

Yuhuai Tony Wu (Oct 14 2022 at 12:13):

I want to mention we had done an analysis in the Minerva paper about memorization, that is worth reading if you are concerned about this. We showed Minerva did a lot more than just memorization. In addition, Minerva attained great scores on new math exams this year (2022 UK, Polish math exams).

Jason Rute (Oct 14 2022 at 13:08):

I just looked at Section 5 of the Minerva paper. I feel like it is stating two things:

The MATH dataset wasn't in the training data for MInerva so memorization is impossible.
Minerva does about the same with small changes to a question (it has not seen before).

But does this say anything about competition level problems which may be in the dataset for PaLM or Minerva? Those are not small changes from other problems. Those are very different questions from each other. Also, I don't think you looked at what happens when a problem does end up in Minerva's training data. (IMHO section 5.2 should have taken problems which were found in Minerva's training data, not problems that Minerva happened to get correct.) Maybe the accuracy shoots up really high to almost 100%.

Here are a few follow up hypotheticals:

Do you think that Minerva would have done better or the same on the 2010-2020 polish math exams as the 2022 polish math exam, especially if those exams and their solutions made it into the training data for PaLM or Minerva (in English)?
Do you think that Minerva would do the same on 1959 IMO p1 if it was or was not in the Minerva training data?
Do you think the results in this ICLR paper which use Minerva would come out the same if the mini-F2F problems where replaced with different but similar caliber problems which are nowhere on the internet?

I would also love to see the techniques of this ICLR paper applied to the 2022 Polish math exam. That would be a good follow up and a cleaner experiment. (But I'm not asking the authors to do that. I know I'm being a Reviewer #2 right now. I think the autoformalization part of the paper stands on its own.)

Adam Topaz (Oct 25 2022 at 20:58):

Here is the arXiv version of one of the papers mentioned above:
https://arxiv.org/abs/2210.12283

Jason Rute (Nov 30 2022 at 04:40):

The first paper is now public as both a preprint and a note for MATH-AI. Moreover, the authors @Ayush Agrawal, @Siddhartha Gadgil, @Navin Goyal, @Ashvni Narayanan, and @Anand Rao just released their LeanAid tool which is like LeanChat, but seems to have more powerful integration with Lean including checking if a translation typechecks and using Lean's environment docstrings to do adaptive prompting. They include a repo and a demo video.

Jason Rute (Nov 30 2022 at 04:45):

Here is also another MATH-AI paper along these lines:

ProofNet: A Benchmark for Autoformalizing and Formally Proving Undergraduate-Level Mathematics Problems by @Zhangir Azerbayev, @Bartosz Piotrowski, and @Jeremy Avigad

Siddhartha Gadgil (Nov 30 2022 at 04:49):

Jason Rute said:

The first paper is now public as both a preprint and a note for MATH-eAI. Moreover, the authors Siddhartha Gadgil Anand Rao just released their LeanAid tool which is like LeanChat, but seems to have more powerful integration with Lean including checking if a translation typechecks and using Lean's environment docstrings to do adaptive prompting. They include a repo and a demo video.

#lean4 > LeanAide translation: Natural language to Lean 4

#general > LeanAide translation: Natural language to Lean 4

#general > LeanAide translation: Natural language to Lean 3

One clarification: we use docstrings extracted from mathlibin addition to those from the environment (the latter is not currently used by default in Lean 4). This is a large set so allows good prompt engineering. These were extracted by running docgen and using the intermediate JSON file (with some cleanup).

Albert Jiang (Nov 30 2022 at 09:25):

This is extremely cool!

Tyler Josephson ⚛️ (Feb 27 2023 at 15:43):

ProofNet paper now on arXiv - nice work @Zhangir Azerbayev @Bartosz Piotrowski Hailey Schoelkopf @Edward Ayers @Dragomir Radev @Jeremy Avigad

Kevin Buzzard (Feb 27 2023 at 16:22):

So is "Yale College" different from "Yale university"? Different authors have different affiliations (but all the relevant ones have @yale.edu email addresses)

Kevin Buzzard (Feb 27 2023 at 16:25):

page 2: the Sylow theorem example at the top of the page has an example of a group of order 312, and the proof starts "since the order of G is 351, ...". Is this supposed to be a demonstration of how woefully bad language models are still, or is this a typo?

Sebastien Gouezel (Feb 27 2023 at 16:31):

The next characters are 351 = 3^2.13 (which is again crazy) and then G has 3-Sylow subgroup of order 9 (which is wrong for the initial group of cardinality 312 as this is not divisible by 9)...

Zhangir Azerbayev (Feb 27 2023 at 20:15):

Undergraduates are technically members of the college, and both the college and the department of computer science are part of the university.

Zhangir Azerbayev (Feb 27 2023 at 20:23):

Kevin Buzzard said:

page 2: the Sylow theorem example at the top of the page has an example of a group of order 312, and the proof starts "since the order of G is 351, ...". Is this supposed to be a demonstration of how woefully bad language models are still, or is this a typo?

This is my mistake. I copied the example from the dataset into the figure, and made some silly errors while doing so.

Fabian Glöckle (Feb 28 2023 at 18:00):

Thank you!
as a tiny remark: the Pythia arxiv link doesn't seem to lead to the intended paper

Junyan Xu (Mar 26 2023 at 01:14):

Just noticed Draft, Sketch, Prove shares the same acronym as Demonstrate, Search, Predict from Stanford two months later :)

Albert Jiang (Mar 26 2023 at 11:48):

eh I might go back to norse mythology names if acronym collision is so frequent

Praneeth Kolichala (Apr 14 2023 at 04:08):

Re ProofNet, I was just scrolling through the paper and realized that case study 1 is actually wrong. Exercise 10.7.4 asks you to prove that if $M\subseteq R$ is an ideal which contains all non-units of $R$ , then $M$ is the unique maximal ideal. The issue is that of course this fails if $M=R$ . A human would understand that implicitly, $M$ has to be a proper ideal, or that it must contain only non-units of $R$ as well (or else $M$ trivially cannot be maximal). However, the formalization does not contain this subtlety:

theorem exercise_10_7_10 {R : Type*} [ring R] (M : ideal R)
  (hM : ∀ x : R, x ∉ M → is_unit x) :
  is_maximal M ∧ ∀ (N : ideal R), is_maximal N → N = M := sorry

example : false :=
(exercise_10_7_10 (⊤ : ideal ℤ) (by simp)).1.ne_top rfl

This could be seen as a one-off mistake, but I am tempted to read into it more (especially since it was specifically selected as a successful case study by the authors). I wonder if it would be good to train models to either provide a proof or a counterexample, so that situations like this are still valuable. It also highlights the subtlety of translating text to formal mathematics -- sometimes you need a good grasp of the content already in order to translate the exercise correctly.

Zhangir Azerbayev (Apr 15 2023 at 20:37):

Praneeth Kolichala said:

Re ProofNet, I was just scrolling through the paper and realized that case study 1 is actually wrong. Exercise 10.7.4 asks you to prove that if $M\subseteq R$ is an ideal which contains all non-units of $R$ , then $M$ is the unique maximal ideal. The issue is that of course this fails if $M=R$ . A human would understand that implicitly, $M$ has to be a proper ideal, or that it must contain only non-units of $R$ as well (or else $M$ trivially cannot be maximal). However, the formalization does not contain this subtlety:
theorem exercise_10_7_10 {R : Type*} [ring R] (M : ideal R)
  (hM : ∀ x : R, x ∉ M → is_unit x) :
  is_maximal M ∧ ∀ (N : ideal R), is_maximal N → N = M := sorry

example : false :=
(exercise_10_7_10 (⊤ : ideal ℤ) (by simp)).1.ne_top rfl
This could be seen as a one-off mistake, but I am tempted to read into it more (especially since it was specifically selected as a successful case study by the authors). I wonder if it would be good to train models to either provide a proof or a counterexample, so that situations like this are still valuable. It also highlights the subtlety of translating text to formal mathematics -- sometimes you need a good grasp of the content already in order to translate the exercise correctly.

Thanks for spotting this! We are preparing a conference submission that will hopefully rectify any errors.

I haven't decided whether or not to convert everything in the paper to Lean 4 before the conference submission. If I do, we can run sagredo on negations of theorem statements to catch errors.

If I don't end up doing that, do you think something like tidy would help catch silly errors?

Scott Morrison (Apr 16 2023 at 08:27):

I doubt tidy on negations will get very far. If someone were to systematically improve slim_check by running it on the negations of things in mathlib, that would be awesome. There is a lot of low-hanging fruit there.

RexWang (Apr 16 2023 at 13:56):

Hi @Zhangir Azerbayev , I just noticed an error in the proof of the first example on ProofNet. Could this be due to an error in the problem source?

Exercise ID: Dummit-Foote|exercise_4_5_14

nl_statement
Prove that a group of order 312 has a normal Sylow $p$-subgroup for some prime $p$ dividing its order.

nl_proof
\begin{proof}
Since $|G|=351=3^{2}.13$, $G$ has $3-$Sylow subgroup of order $9$,
as well as $13-$Sylow subgroup of order $13$.
Now, we count the number of such subgroups.
Let $n_{13}$ be the number of $13-$Sylow subgroup and $n_{3}$ be the number of  $3-$Sylow subgroup.
Now $n_{13}=1+13k$ where $1+13k|9$. The choices for $k$ is $0$.
Hence, there is a unique $13-$Sylow subgroup and hence is normal.
\end{proof}

Corrected verison: $|G| = 312=2^3\cdot 3 \cdot 13$ and $351=3^3\cdot 13$.

I haven't checked the other exercises yet, but this could have small impact on the accuracy of the formalized proof.

EDIT: Sorry, I have just noticed that this has already been pointed out.

Last updated: May 02 2025 at 03:31 UTC