Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: A (semi)-autoformalization challenge: 650=>448


Terence Tao (May 16 2025 at 19:26):

In #Machine Learning for Theorem Proving > Tools for semi-automated line-by-line formalization we discussed different ways to formalize a certain implication 1689=>2 studied in the Equational Theories Project, in which we had multiple proofs using multiple tools. While the implication was initially discovered in the 1990s using computer assistance, this particular implication is now relatively easy, with many different semi-automated or fully automated approaches being either partially or fully successful in establishing it.

I would now like to pose a more difficult challenge. There is a more difficult implication 650=>488 that we were able to establish by first verifying the implication in Vampire, and then developing enough of the theory of superposition calculus within Lean that one could translate that proof into a Lean proof, albeit one which is quite lengthy. Vampire took 3.6 seconds to establish this particular implication, which is unusually long (by comparison, 1689=>2 took 0.003 seconds for Vampire). In this regard, this is the hardest positive implication in the ETP.

Anyway, the challenge is to find an alternate proof, by whatever means you wish - human, semi-automated, or automated, with additional imports or without. In this file I have prepared a version of the Vampire proof with all the intermediate steps kept, but with the proofs replaced with sorries (and a comment describing the original proof in the superposition calculus). All in all there are 66 sorries to fill. It would be interesting to have people test out various tools and report their performance. For instance, directly applying canonical to each sorry without any parameter tuning works initially (it closes off eq9, eq10, and eq12) but after that it times out - apparently there are too many hypotheses in the proof state after that for canonical to work efficiently. How do other automated or semi-automated approaches fare?

David Renshaw (May 16 2025 at 20:14):

I was able to get grind to work on all but one of those sorries. (eq3685 is the one that remains.)

David Renshaw (May 16 2025 at 20:15):

Though it gives me the warning

The grind tactic is experimental and still under development. Avoid using it in production projects.

so I'm not sure how much to trust it.

Patrick Massot (May 16 2025 at 20:59):

You can totally trust proofs found by grind. I think that what you are asked to not trust is that grind will still find those proofs in the next release.

Terence Tao (May 16 2025 at 22:56):

Just to keep the discussion centralized: @Jeremy Avigad was able to clear all sorries (and even shorten the proof somewhat) purely through duper, see #Machine Learning for Theorem Proving > Tools for semi-automated line-by-line formalization @ πŸ’¬

Michael Kinyon (May 18 2025 at 21:42):

Using Prover9, I was able to get a 27 step proof of 650=>448 (actually xy=x). At Terry's request, I am posting here the Prover9 proof itself (as a text file) and my "humanization" of it (as a PDF).

I don't speak Leanish, but if anyone is interested in Leanifying this proof, feel free.

I should say I did not look at the Vampire proof. I started from scratch and followed the methodology described in a paper on proof simplification I wrote half a dozen years ago: https://arxiv.org/abs/1808.04251

650implies448.txt

650implies448.pdf

David Michael Roberts (May 18 2025 at 22:01):

I'd like to claim this mini project (I mentioned it to @Terence Tao on Mathstodon under my nom de plume)

Terence Tao (May 18 2025 at 22:24):

Good luck! I look forward to seeing your "field report". On my Youtube channel https://www.youtube.com/@TerenceTao27 I describe three different types of workflows to use AI tools to autoformalize these sorts of things, but of course you are welcome to come up with a fourth type instead, which may in fact be a more valuable data point.

Bruno Le Floch (May 19 2025 at 00:07):

I already have a Lean proof, which I am in the process of converting to human-readable LaTeX (probably tomorrow, it's 2am here). Here is the Lean proof. It is loosely based on several prover9 runs with intermediate results thrown in as assumptions or as goals.

Lean proof of 650=>448

Terence Tao (May 19 2025 at 00:35):

Thanks! I think it will be of interest to gather different proofs obtained by different methodologies, as this may be a useful point of comparison that may inform future autoformalization work.

Jason Rute (May 19 2025 at 02:41):

The term β€œautoformalization” is usually used to mean translating natural language to formal language (often that formal language is a theorem proving language like Lean, and the natural language is often a theorem statement, definition, or proof). Is this what we mean here, or are we talking about β€œautomated theorem proving” (ATP), i.e. filling in (formal) proofs automatically. (ATP has, at least historically, meant formal proof since the field is quite old, and it’s only been in the last two years or so that it has even seemed conceivable for computers to construct fully correct natural language proofs.)

Jason Rute (May 19 2025 at 02:41):

I don’t mean to be pedantic. I just want to make sure I understand the goal here. I think this Zulip thread is more about ATP while the YouTube videos are about autoformalization. Of course, auto-formalization and ATP are closely related, and they are becoming more and more blended. To autoformalize a proof you usually need ATP to fill in the gaps, and many LLM approaches to ATP blend autoformalization and ATP by generating a natural language argument first and then translate it to formal language. Also code assistants like the ones you use in the video naturally do both (to various degrees of success).

Jason Rute (May 19 2025 at 02:41):

I am enjoying this discussion and I hope we keep it up, both with these problems as well as others! However, I also want to remark that the problems in this thread, as I understand them, are the sort of ATP problems that symbolic methods have been traditionally good at, first order reasoning without a lot of relevant background theory or big ideas. I’m looking forward to seeing other problems that arise naturally which symbolic solvers struggle with, and seeing what methods (if any) ultimately succeed.

Terence Tao (May 19 2025 at 02:47):

Well, as you say, this particular problem is already well within the range of existing ATPs, and we have multiple ATP-generated proofs at this point (from Vampire and Prover9). So it's more of a formal -> formal problem (going from an ATP proof to Lean), though it can be factored as formal -> informal -> formal (e.g., going through the "humanized prover9 proof" that @Michael Kinyon just supplied). Given the length of the proofs, any of these arrows would probably require at least some semi-automation, but I don't know which of these tasks would be classified as "autoformalization" in the accepted terminology.

One particular feature of these positive equational implications, which is very much not present for most math formalization tasks, is that they can be stated and proved without appeal to external libraries (other than minimal Mathlib tactic libraries). So the whole "lemma search" problem is eliminated for this particular type of task. This will of course change the playing field to favor some approaches and disadvantage others, but still I think it is a useful toy problem to start with, which can then partially inform the challenges involved for more realistic formalization problems.

[Some of the negative equational implications are significantly harder and required actual Mathlib input, e.g., the theory of the free group on countably many generators, or linear algebra over finite fields, but that's another story...]

Jason Rute (May 19 2025 at 03:38):

I neglected to think of the formal-to-formal side of autoformalization. Yes, I can see this problem fitting firmly in that niche.

Bruno Le Floch (May 19 2025 at 10:06):

Here is a human-understandable proof,
as a 3-page pdf file 650d.pdf and LaTeX file. In the process of writing it up I've revised some steps in the lean proof.

How I reached that proof

Obviously, automated methods are much much faster than doing this by hand like I did (roughly 6h getting to a working Lean proof, 5Β½h cleaning it up, 2h making it human-readable).

Bas Spitters (May 19 2025 at 10:15):

@Terence Tao about formal-to-formal. Have you considered proof producing ATP? Fx for Vampire can emit proofs in Dedukti, which can likely be adapted to lean:
https://arxiv.org/abs/2503.15541

Michael Kinyon (May 19 2025 at 18:18):

@Bruno Le Floch Very nice work with the humanization! You did two things I usually do in my own work, but didn't do this time because of time constraints: spot useful lemmas and identify frequently occurring term functions. I didn't use lemmas in my runs, just simplification via Prover9's hints mechanism. I did notice repeating terms like x(yx) but decided to skip it since the proof was already short. So I just focused on converting paramodulations to "substitute then rewrite" and left it at that. (My final run was with back demodulation turned off so the proof was with paramodulations only.) Also, I completely agree that when Prover9 includes a clause in more generality than it ends up using, it's aesthetically better to humanize to a less general form (fewer variables). Anyway, great work!

Adam Kurkiewicz (May 24 2025 at 19:41):

I took @Bruno Le Floch's PDF file, and I was able to auto-formalise Lemma 1 using Claude 4 Opus without any meaningful human input.

Here is the approach I followed.

I used the following system prompt:

You are a professor of mathematics working at the Imperial College of London specialising in formalising mathematics in Lean 4. Help the user with their formalisation task.

I used the following initial prompt (I provided Bruno's blueprint, 650d_proof.pdf, as an attachment):

Can you finish the formalisation of the following proof in Lean 4?

650d_proof.pdf


``
class Magma (Ξ± : Type _) where
  op : Ξ± β†’ Ξ± β†’ Ξ±

infix:65 " β—‡ " => Magma.op

abbrev Equation650 (G: Type _) [Magma G] := βˆ€ x y z : G, x = x β—‡ (y β—‡ ((z β—‡ x) β—‡ y))

abbrev Equation448 (G: Type _) [Magma G] := βˆ€ x y z : G, x = x β—‡ (y β—‡ (z β—‡ (x β—‡ z)))

abbrev f {G: Type _} [Magma G] (x y : G) := x β—‡ (y β—‡ x)

abbrev g {G: Type _} [Magma G] (x y : G) := x β—‡ ((y β—‡ x) β—‡ x)

abbrev square {G: Type _} [Magma G] (x : G) := (x β—‡ x)

abbrev cube {G: Type _} [Magma G] (x : G) := ((x β—‡ x) β—‡ x)

abbrev lcube {G: Type _} [Magma G] (x : G) := (x β—‡ (x β—‡ x))

theorem Equation650_implies_Equation448 (G: Type _) [Magma G] (h: Equation650 G) : Equation448 G := by

Claude Opus would then produce a formalisation, which I would run through lake build, and I would paste the errors (which could easily have been done by a script, I just did it manually to see the responses in real time). After several rounds of this, Claude 4 Opus produced the following Lean 4 formalisation, which correctly formalises Lemma 1:

theorem Equation650_implies_Equation448 (G: Type _) [Magma G] (h: Equation650 G) : Equation448 G := by
  -- First, let's establish Equation 4
  suffices eq4 : βˆ€ x y : G, x β—‡ y = x by
    intro x y z
    simp only [eq4]

  -- The proof establishes several lemmas leading to Equation 4

  -- Lemma 1 (Easy identities)
  have h3 : βˆ€ x y z : G, x β—‡ f y (z β—‡ x) = x := fun x y z => (h x y z).symm

  have h4 : βˆ€ x z : G, x β—‡ lcube (z β—‡ x) = x := by
    intro x z
    have := h3 x (z β—‡ x) z
    simp only [f, lcube] at this ⊒
    exact this

  have h5 : βˆ€ x y z w : G, f x (y β—‡ w) β—‡ f z w = f x (y β—‡ w) := by
    intro x y z w
    have := h (f x (y β—‡ w)) z w
    rw [h3 w x y] at this
    exact this.symm

  have h6 : βˆ€ x y z w : G, y β—‡ w = y β†’ f x y β—‡ f z w = f x y := by
    intro x y z w hyw
    rw [← hyw]
    exact h5 x y z w

  have h7 : βˆ€ x y z : G, g x y β—‡ f z x = g x y := by
    intro x y z
    -- Note: g x y = f x (y β—‡ x) by definition
    simp only [g, f]
    exact h5 x y z x

  -- I'll complete the rest of the proof using sorry for now
  -- The key remaining steps are:
  -- 1. Prove h8 (idempotence of f under certain conditions)
  -- 2. Prove h9 (projection property for g)
  -- 3. Prove the various intermediate results leading to Equation 58
  -- 4. Finally derive Equation 4

  sorry

I unfortunately ran out of credit, so couldn't keep going with this experiment and see if throwing more money at the problem would have produced a correct autoformalisation of the entire blueprint provided by Bruno, but at least we know that the tool is capable of producing formalisation of Lemma 1.


Last updated: Dec 20 2025 at 21:32 UTC