Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Paper: Autoformalizing Euclidean Geometry
Logan Murphy (May 28 2024 at 17:15):
Hi all,
@Kaiyu Yang and I are excited to share our paper Autoformalizing Euclidean Geometry which will appear at ICML 2024. The paper and code are publicly available.
Summary
We extend Lean with an SMT-based reasoning engine to tackle two challenges faced when autoformalizing Euclidean geometry:
- handling the abundance of implicit diagrammatic reasoning encountered in natural language proofs, and
- comparing the semantics of autoformalized propositions against ground truth formalizations.
We provide a dataset of 173 theorems and proofs in natural language, Lean, and diagrams, including the first book of Euclid's Elements. Based on experiments with GPT-4 and GPT-4 Vision, we observe that
- Autoformalized proofs targeting our language, while rarely exactly correct, are often quite reasonable and easy to repair, and
- While our symbolic reasoning engine makes autoformalization in Euclidean geometry a rigorous and controllable domain, our specification language is a challenging target for formalizing theorem statements.
We hope that our language and benchmark will be useful for researchers interested in AI-supported geometry, and that our overall approach may serve as a proof-of-concept for leveraging symbolic techniques in other autoformalization domains.
In addition to Kaiyu and myself, this work is a collaboration with Jack Sun (UofT), Zhaoyu Li (UofT), Anima Anandkumar (Caltech) and Xujie Si (UofT).
Logan Murphy (May 28 2024 at 17:16):
Here are some more details, if you want an overview before diving into the paper:
Proof Automation for Euclidean Geometry in Lean
At the heart of our work is an implementation of @Jeremy Avigad's formal system E, a theory for Euclidean geometry which faithfully models Euclid's particular style of reasoning in Elements. In our implementation, we distinguish explicit reasoning steps (e.g., construction of geometric objects) from implicit reasoning steps (i.e., preconditions for constructions/inferences which are often intuited from diagrams). We design a primarily declarative proof language in which the explicit reasoning steps are written as tactics, and implicit reasoning steps are discharged to SMT solvers. The result is that our formalizations of Euclid's proofs are very similar to the natural language proofs, avoiding much of the overhead typically encountered when formalizing geometric proofs.
In the context of autoformalization, this allows the model to focus only on translating the steps which are actually present in the text.
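To give a flavor of the language, here is roughly what Euclid's Proposition I.1 (constructing an equilateral triangle on a segment) looks like in our style. This is a simplified sketch; the identifiers and notation below are illustrative rather than a verbatim excerpt from our repo:
/- Sketch of Proposition I.1 in our declarative style (illustrative names). -/
theorem proposition_1 (a b : Point) (AB : Line)
    (h : distinctPointsOnLine a b AB) :
    ∃ c : Point, |(c─a)| = |(a─b)| ∧ |(c─b)| = |(a─b)| := by
  -- Explicit construction steps, written as tactics mirroring Euclid's text:
  euclid_apply circle_from_points a b as BCD
  euclid_apply circle_from_points b a as ACE
  euclid_apply intersection_circles BCD ACE as c
  -- The implicit diagrammatic precondition (that the two circles actually
  -- intersect) is discharged to the SMT solvers rather than spelled out.
  use c
  euclid_finish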
Semantic Evaluation of Autoformalized Theorem Statements
Syntactic metrics such as BLEU are commonly used to evaluate autoformalized theorem statements, but syntactically similar theorem statements can obviously have very different meanings. As a result, syntactic metrics are usually accompanied by manual semantic evaluation, which is expensive, time-consuming, and error-prone. Moreover, without an automated evaluation procedure, it is impossible to incorporate semantic evaluation into any kind of model fine-tuning. To address this challenge, we repurposed the symbolic reasoning engine underlying our proof automation to analyze the semantics of autoformalized theorem statements. Our Lean-implemented tool E3 (Euclidean Equivalence Engine) performs standard equivalence checking (i.e., proving an iff) as well as a more granular approximate equivalence analysis, which attempts to quantify how "close" an incorrect prediction is to a ground truth formalization.
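Concretely, the exact check asks the engine to prove an iff between the predicted statement and the ground truth. As a toy sketch (with illustrative predicate names), this is the kind of pair a syntactic metric would penalize but an equivalence check accepts:
-- Toy sketch: syntactically different but semantically equivalent
-- statements; equivalence is certified by proving the biconditional
-- (in practice, both directions are discharged to the SMT solvers).
example (a b : Point) (AB : Line) :
    (onLine a AB ∧ onLine b AB) ↔ (onLine b AB ∧ onLine a AB) := by
  constructor <;> intro h <;> exact ⟨h.2, h.1⟩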
Experiments with GPT-4 and GPT-4 Vision
To demonstrate how our benchmark and framework can assist in testing autoformalization models, and to gauge how challenging our problems are, we conducted several experiments with GPT-4 and GPT-4 Vision.
When it comes to autoformalizing theorem statements, we found that our automated evaluation procedure correlates closely with manual evaluation. On the other hand, since our language is fairly low-level, actually autoformalizing statements correctly is challenging for the off-the-shelf models we evaluated. We found only a marginal improvement from GPT-4 Vision over GPT-4.
When it comes to autoformalizing proofs, we did not do any sort of iterative/search-based formalization or theorem proving. With the goal of establishing a baseline for proof autoformalization, we just took a single query and analyzed the resulting proof. As one would expect, very few proofs were correct out-of-the-box. But encouragingly, many of the autoformalized proofs were very close to correct, requiring only a small number of modifications to repair them (e.g., rearranging the arguments to a lemma). We did this by hand, but we predict that this repair process should be quite amenable to automation. Some examples of autoformalized and repaired proofs are given in the Appendix of the paper.
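For a concrete sense of what "repair" means here, a typical fix is on the order of the following (the step and names are hypothetical; the pattern is representative):
-- Sketch of a typical one-step repair to a model-generated proof:
--   euclid_apply intersection_circles ACE BCD as c   -- model output:
--                                                    -- arguments swapped
--   euclid_apply intersection_circles BCD ACE as c   -- repaired step
-- The rest of the proof is usable as-is.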
Other Notes
- We did not do any kind of model development or fine-tuning aside from few-shot learning. Part of our goal was to see how much we could do by just focusing on the symbolic side of the problem.
- It is natural to ask how this work relates to AlphaGeometry. AlphaGeometry tackles the theorem proving problem, while we focus on autoformalization. While the problems are of course related, our work focuses on translating authentic pieces of natural-language geometry into "faithful" formal theorems and proofs, whereas AlphaGeometry uses synthetic data for training. We view the two works as complementary rather than one subsuming the other.
- Ours is not the only formalization of System E in Lean, cf. this PR by @André Hernández-Espiet (Rutgers). But to the best of our knowledge, ours is the first to systematically offload implicit and diagrammatic reasoning to proof automation. EDIT: And, as Zhangir pointed out and I should have made clearer, our proofs are not totally certified; we treat the SMT solvers as a trusted black box.
Finally: if you find this work interesting, please consider sharing Kaiyu's announcement.
Zhangir Azerbayev (May 28 2024 at 20:58):
Really impressive work! One thing I think is worth pointing out is that the SMT reasoning engine introduces an additional axiom to admit goals that are discharged to the solver. https://github.com/loganrjmurphy/LeanEuclid/blob/master/SystemE/Meta/Smt/Solver.lean#L44-L47
/- Unsound axiom we use to admit results from the solvers
Dangerous!
-/
axiom SMT_VERIF (P : Prop) : P
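So, schematically, every goal handed to the solver is ultimately closed like this (sketch):
-- Once the external solver reports success, the goal is admitted via the
-- axiom; the kernel never checks an actual proof of P.
example (P : Prop) : P := SMT_VERIF P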
Therefore, the comparison to Hernandez-Espiet (2023) and Beeson et al. (2019) is not quite fair, because those formalizations are end-to-end verified without introducing unsound axioms.
Logan Murphy (May 28 2024 at 21:02):
Yes, this is a really good point Zhangir, thank you! Our proofs are not end-to-end certified. We do make note of this in the body of the paper, but we will add another note in the appendix where we compare our formalization to others' (and in the repo).
Last updated: May 02 2025 at 03:31 UTC