## Stream: Machine Learning for Theorem Proving

### Topic: autoformalization?

#### duck_master (Mar 01 2021 at 05:41):

I searched the Zulip for stuff about autoformalization but didn't find much except for a bit of discussion of this paper of Szegedy, so I'm creating this topic.

What has been done on autoformalization (i.e. automatically translating natural-language mathematics to formal proofs)?

This seems pretty important to me because there's a lot of wisdom around in the mathematical literature, but we currently have no better way to do this than to hire programmer experts to painstakingly transcribe everything. (Don't get me wrong - this sounds like a pretty fun activity - but I think a lot of it is boring and routine.) Using autoformalization tools (even imperfect ones that produce partially-correct/-accurate transcriptions) would enable formal mathematics to be done at scale, and competitively with the pace of traditional research.

One way we could go about this is by creating synthetic data: from a dataset of verified proofs (which could be human-written or itself automatically generated combinatorially), use a compiler-like program to turn this into natural-language content (possibly with LaTeX added), and optionally mutate it with word-dropping, replacing words with BERT, deleting minor lemmas, etc. to simulate the non-rigorous-ness of human mathematicians. Combined with progress in machine translation, I think that autoformalization is looking more feasible than ever.
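A rough sketch of what that pipeline could look like, in Python. Everything here is an assumption for illustration: the toy proof format, the `TEMPLATES` table, and the helper names are hypothetical, and the BERT-based word replacement is left out; only word-dropping and step-deletion are shown.

```python
import random

# Hypothetical toy "formal proof": a list of (rule, statement) steps.
# A real pipeline would start from proof terms exported by an ITP.
PROOF = [
    ("assume", "n is an even natural number"),
    ("obtain", "n = 2 * k for some natural number k"),
    ("rewrite", "n * n = 2 * (2 * k * k)"),
    ("conclude", "n * n is even"),
]

# Hypothetical sentence templates, one per rule.
TEMPLATES = {
    "assume": "Assume that {}.",
    "obtain": "Then {}.",
    "rewrite": "It follows that {}.",
    "conclude": "Hence {}.",
}


def render(proof):
    """Compiler-like pass: turn each formal step into an English sentence."""
    return [TEMPLATES[rule].format(stmt) for rule, stmt in proof]


def drop_words(sentence, rate, rng):
    """Noise pass: drop each word independently with probability `rate`.

    If every word would be dropped, the sentence is returned unchanged.
    """
    kept = [w for w in sentence.split() if rng.random() >= rate]
    return " ".join(kept) if kept else sentence


def drop_steps(sentences, rate, rng):
    """Noise pass: omit intermediate steps, always keeping the first and last."""
    if len(sentences) <= 2:
        return list(sentences)
    middle = [s for s in sentences[1:-1] if rng.random() >= rate]
    return [sentences[0]] + middle + [sentences[-1]]


def make_pair(proof, rng, word_rate=0.1, step_rate=0.3):
    """Return one aligned (noisy informal text, formal proof) training pair."""
    clean = render(proof)
    noisy = [drop_words(s, word_rate, rng) for s in drop_steps(clean, step_rate, rng)]
    return " ".join(noisy), proof


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    informal, formal = make_pair(PROOF, rng)
    print(informal)
```

The point of returning the pair is that the formal side stays exact while only the informal side is degraded, which gives aligned training data for free.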

As for concrete stuff I found:

#### Roman Bars (Mar 01 2021 at 12:24):

I think one non-trivial challenge you will find is the word "canonical". Mr. Buzzard I think had an issue with that word in the context of class field theory and also gluing schemes (look at the blog).

Also parsing sentences involving some overused word like "cohomology" might be an issue (it is often used without much clarification yet there are so many different cohomologies out there: singular, analytic de Rham, algebraic de Rham, etale, crystalline, Galois, rigid, group and so on and so on).

#### Jason Rute (Mar 01 2021 at 13:24):

@duck_master This is great! And if you are interested in giving more of a summary of some of these papers, I'm sure we would love to read it.

#### Jason Rute (Mar 01 2021 at 13:24):

The one thing I'm familiar with that you haven't mentioned is some recent talks by @Christian Szegedy talking about the work of the N2Formal team at Google Research. Besides that manifesto you have already mentioned, Christian gave a talk at AITP 2020 on partial progress so far. The video is here. His talk starts at about 2:13:00. I'm not aware of any publications of his team yet.

#### Jason Rute (Mar 01 2021 at 13:51):

I think auto-formalization is really key to advancing automated theorem proving and helping with library creation. However, just as with "automated reasoning", there is a wide spectrum of possibilities. I don't think that new papers in mathematics, or advanced esoteric topics will be autoformalized anytime soon. However, it is well known by now that one can train a reasonably good ML theorem prover in your favorite logic or ITP if you have a reasonably large and diverse set of theorem statement examples (search for papers with the terms "Reinforcement Learning" and "theorem proving"). And I think our provers would do even better if we could formalize a lot of basic facts about certain areas of mathematics. The library in Lean for example is in some sense very sparse. Lean doesn't need (or want) to add lots of little variations of every theorem in the library, if they are easy consequences of the main theorems, but that may very well help in training mathematical proof agents. Also, such autoformalization would be a possible help to an ITP user trying to formalize a particular paper or book.

#### Jason Rute (Mar 01 2021 at 13:52):

Here are some topics I would imagine come up in the space. (It is mostly my random thoughts, but some of it is informed by what I've read or seen.)

• Aligned or unaligned. In machine translation this is a big topic. It is easier to train translation if we have examples of English and French sentences which say the same thing. However, we have more data if we just take a lot of English text and a lot of French text, and a lot of work has been done lately on unaligned machine translation. This is not only in natural language translation, but also in translation from say Python to C++. In formal mathematics, I think there is very little natural aligned text. I think some of the papers you mention use synthetic examples which may help.
• Vocabulary. I think one of the most difficult issues is that there is still not a lot of common mathematics in ITPs. For example, I'm not aware of any ITP with extensive probability theory or calculus. Even though these are part of an undergraduate curriculum, they are difficult to formalize and design notation for. I doubt any current AI would be able to align some ideas until they are designed out in ITPs. And these are just examples of common mathematical ideas. Unfortunately, nearly every paper in mathematics invents its own terms (even if it is just temporary to that paper). Alignment would be very difficult with research level papers for that reason.
• Mismatches in corpus sizes. There is vastly more data on informal math than formal math and this is going to be a challenge. Indeed, Szegedy talks a lot about this challenge in his talk.
• Side conditions. This is already an issue in Lean. For example, forall n m : nat, (n - m) + m = n is not technically true unless n >= m (subtraction on nat truncates at zero), but is morally true. What happens if the autoformalizer returns something like that? Is that useful?
• Clean data. If auto-formalization of raw arXiv doesn't make much progress, then I wonder if we need to step back and ask the following question. What type of mathematics would we like to autoformalize today? IMO problems? Textbooks? Textbook exercises? Made-up examples just for ITP training that are easier to type out in LaTeX than in a formal prover? Can we then curate an extensive and clean dataset? It won't be as large as arXiv, but it might be much more useful.
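
To make the side-conditions point concrete, here is a small sketch of the truncated-subtraction identity (n - m) + m = n. It is written in Lean 4 syntax, which is an assumption on my part (the statement in the list is in Lean 3 style); `Nat.sub_add_cancel` is the core library lemma that carries the side condition.

```lean
-- Subtraction on Nat truncates at zero: 2 - 5 evaluates to 0,
-- so the unguarded statement fails at n = 2, m = 5.
example : (2 - 5) + 5 ≠ 2 := by decide

-- With the side condition m ≤ n, the statement holds.
example (n m : Nat) (h : m ≤ n) : (n - m) + m = n :=
  Nat.sub_add_cancel h
```

An autoformalizer that emits the unguarded version has produced something false but one hypothesis away from true, which is arguably still a useful draft for a human to repair.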

#### Kevin Buzzard (Mar 01 2021 at 14:02):

A lot of my work is based on the assumption that auto-formalisation of e.g. the work of Clausen and Scholze which I've been formalising this morning is simply way too hard for a machine to even contemplate right now. I think that the people who think that this is close are just reading mathematics papers of a completely different nature to the ones I am trying to formalise.

I've seen what computers can do and right now my personal opinion (and others will have different opinions) is that it is a complete joke to think that we are anywhere near being able to get a computer to understand what a research algebraic number theorist writes in a modern paper. Even PhD students struggle to understand what an expert writes -- they leave so much out, with no indication about whether the gaps are to be filled in with a trivial calculation or an application of a standard theorem with a highly complex proof. If I were less sure about this opinion then I would be an awful lot less sure about my current program of trying to help to train humans to do this kind of work. I would love to be proved wrong.

Last updated: May 09 2021 at 23:10 UTC