## Stream: Machine Learning for Theorem Proving

### Topic: autoformalization?

#### duck_master (Mar 01 2021 at 05:41):

I searched the Zulip for stuff about autoformalization but didn't find much except for a bit of discussion of this paper of Szegedy, so I'm creating this topic.

What has been done on autoformalization (i.e. automatically translating natural-language mathematics to formal proofs)?

This seems pretty important to me because there's a lot of wisdom around in the mathematical literature, but we currently have no better way to do this than to hire programmer experts to painstakingly transcribe everything. (Don't get me wrong - this sounds like a pretty fun activity - but I think a lot of it is boring and routine.) Using autoformalization tools (even imperfect ones that produce partially-correct/-accurate transcriptions) would enable formal mathematics to be done at scale, and competitively with the pace of traditional research.

One way we could go about this is by creating synthetic data: from a dataset of verified proofs (which could be human-written or itself automatically generated combinatorially), use a compiler-like program to turn this into natural-language content (possibly with LaTeX added), and optionally mutate it with word-dropping, replacing words with BERT, deleting minor lemmas, etc. to simulate the non-rigorous-ness of human mathematicians. Combined with progress in machine translation, I think that autoformalization is looking more feasible than ever.

As for concrete stuff I found:

#### Roman Bars (Mar 01 2021 at 12:24):

I think one non-trivial challenge you will find is the word "canonical". Mr. Buzzard I think had an issue with that word in the context of class field theory and also gluing schemes (look at the blog).

Also parsing sentences involving some overused word like "cohomology" might be an issue (it is often used without much clarification yet there is so many different cohomologies out there: singular, analytic de Rham, algebraic de Rham, etale, crystalline, Galois, rigid, group and so on and so on).

#### Jason Rute (Mar 01 2021 at 13:24):

@duck_master This is great! And if you are interested in giving more of a summary of some of these papers, I'm sure we would love to read it.

#### Jason Rute (Mar 01 2021 at 13:24):

The one thing I'm familiar with that you haven't mentioned is some recent talks by @Christian Szegedy talking about the work of the N2Formal team at Google Research. Besides that manifesto you have already mentioned, Christian gave a talk at AITP 2020 on partial progress so far. The video is here. His talk starts at about 2:13:00. I'm not aware of any publications of his team yet.

#### Jason Rute (Mar 01 2021 at 13:51):

I think auto-formalization is really key to advancing automated theorem proving and helping with library creation. However, just as with "automated reasoning", there is a wide spectrum of possibilities. I don't think that new papers in mathematics, or advanced esoteric topics will be autoformalized anytime soon. However, it is well known by now that one can train a reasonably good ML theorem prover in your favorite logic or ITP if you have a reasonably large and diverse set of theorem statement examples (search for papers with the terms "Reinforcement Learning" and "theorem proving"). And I think our provers would do even better if we could formalize a lot of basic facts about certain areas of mathematics. The library in Lean for example is in some sense very sparse. Lean doesn't need (or want) to add lots of little variations of every theorem in the library, if they are easy consequences of the main theorems, but that may very well help in training mathematical proof agents. Also, such autoformalization would be possible help to an ITP user trying to formalize a particular paper or book.

#### Jason Rute (Mar 01 2021 at 13:52):

Here are some topics I would imagine come up in the space. (It is mostly my random thoughts, but some of it is informed by what I've read or seen.)

• Aligned or unaligned. In machine translation this is a big topic. It is easier to train translation if we have examples of English and French sentences which say the same thing. How, we have more data if we just take a lot of English text and a lot of French text, and a lot of work has been done lately on unaligned machine translation. This is not only in natural language translation, but also in translation from say Python to C++. In formal mathematics, I think there is very little natural aligned text. I think some of the papers you mention use synthetic examples which may help.
• Vocabulary. I think one of the most difficult issues is that there is still not a lot of common mathematics in ITPs. For example, I'm not aware of any ITP with extensive probability theory or calculus. Even though these are part of an undergraduate curriculum, they are difficult to formalize and design notation for. I doubt any current AI would be able to align some ideas until they are designed out in ITPs. And these are just examples of common mathematical ideas. Unfortunately, nearly every paper in mathematics invents its own terms (even if it is just temporary to that paper). Alignment would be very difficult with research level papers for that reason.
• Mismatches in corpus sizes. There is vastly more data on informal math than formal math and this is going to be a challenge. Indeed, Szegedy talks a lot about this challenge in his talk.
• Side conditions. This is already an issue in Lean. For example, forall n m : nat, (n - m) + n = m is not technically true unless n >= m, but is morally true. What happens if the autoformalizer returns something like that. Is that useful?
• Clean data. If auto-formalization of raw arXiv doesn't make much progress, then I wonder if we need to step back and ask the following question. What is the type of mathematics would we would like to autoformalize today. IMO problems? Textbooks? Textbook exercises? Made up examples just for ITP training that is easier to type out in LaTeX than a formal prover. Can we then curate an extensive and clean dataset. It won't be as large as arXiv, but it might be much more useful.

#### Kevin Buzzard (Mar 01 2021 at 14:02):

A lot of my work is based on the assumption that auto-formalisation of e.g. the work of Clausen and Scholze which I've been formalising this morning is simply way too hard for a machine to even contemplate right now. I think that the people who think that this is close are just reading mathematics papers of a completely different nature to the ones I am trying to formalise.

I've seen what computers can do and right now my personal opinion (and others will have different opinions) is that it is a complete joke to think that we are anywhere near being able to get a computer to understand what a research algebraic number theorist writes in a modern paper. Even PhD students struggle to understand what an expert writes -- they leave so much out, with no indication about whether the gaps are to be filled in with a trivial calculation or an application of a standard theorem with a highly complex proof. If I were less sure about this opinion then I would be an awful lot less sure about my current program of trying to help to train humans to do this kind of work. I would love to be proved wrong.

#### Ziyu Zhou (Mar 10 2022 at 06:48):

Hi everyone! I am a senior student, and I plan to make a machine translation algorithm from LaTeX to Lean as my graduation project. It doesn't need to be too complicated and powerful. For example, I will be satisfied with a model that if it has learned a chapter of a textbook, it has the ability to auto-formalize the exercises of that chapter. I think the memory network and some pre-training language models like CodeT5 can do that. I would like to ask if there is any available dataset at present. I would also appreciate any other good suggestions! Thank you!

#### Jason Rute (Mar 10 2022 at 11:52):

@Ziyu Zhou Just to set expectations, what you propose, just formalizing exercises in the back of a book, would be a big deal if you could make it work. I don't mean to discourage you. Nothing shocks me anymore with pre-trained transformer models (foundation models). If you can get something working reasonably well, I would strongly encourage you to put in on arXiv, share it widely (including here), and submit it to an ML conference it (like ICLR, NeurIPS, or ICML), and submit it to non-publication workshops (like AITP). Now as for references, the references mentioned above are all a good place to start (including that talk by Szegedy that I mention). I have to admit that I'm not sure any of those works use pre-trained models (like CodeT5) and that might be the game changer here. (Also, if your method is quite general purpose, you may want to also try Coq, Isabelle, and HOL-Light as those are older languages so they have more examples on the web, especially Coq I think.)

#### Jason Rute (Mar 10 2022 at 12:18):

As for datasets, there is a lot of ways to get theorem statements out of lean using metaprogramming. (Maybe ask in the #metaprogramming / tactics channel.) I know of no aligned datasets between LaTeX and Lean and that is what makes this problem difficult.

#### Ziyu Zhou (Mar 10 2022 at 17:35):

@Jason Rute Thank you for your advice! I know it is difficult, and I'll probably find an easy angle to start with. I intend to carry out experiments with some small neural networks with memory. It looks like I might have to write the dataset myself, even if it's a tough job.

#### Junyan Xu (Mar 10 2022 at 23:08):

There are a some formalized olympiad problems at https://github.com/leanprover-community/mathzoo/tree/main/src/mathzoo/olympiads ; not all problems come with the original statements, but I assume you can find them easily. I think at one point OpenAI paid a bunch of undergraduates to help formalize them.

#### Ziyu Zhou (Mar 11 2022 at 05:38):

@Junyan Xu Thank you! That is useful.

#### Kunhao Zheng (Mar 11 2022 at 10:40):

Maybe can start from building your small synthetic latex-lean-aligned dataset. Like what MATH dataset did for many easy questions. Concretely you can start with writing a script that produces what is the addition of 3 and 5? and example : 3 + 5 = 8 := sorry in the same time. with many scripts you can produce many elements like addition, modulo, or even some inequalities. Then you can play with everything in this aligned dataset, and see how things transfer to mathlib.

#### Arthur Paulino (Mar 11 2022 at 10:53):

Another idea: a model/script that can turn some atomic phrases into specific tactics applications. Your dataset would have tactics as lines instead of entire proofs

Last updated: Dec 20 2023 at 11:08 UTC