# Zulip Chat Archive

## Stream: Machine Learning for Theorem Proving

### Topic: Automatic Theorem Proving Competitions

#### Jason Rute (Aug 09 2023 at 00:39):

I want to start a conversation about automatic theorem proving competitions, especially ones that neural-based AI can compete in. I don't know much about this space, so I want to know more. But at the same time I feel there is really nothing right now that fits anything like I have in mind. First, I do know of a few competitions:

- (CASC) CADE ATP System Competition I think is a big one and it focuses (if I'm not mistaken) on first order logic.
- The SAT conference seems to have a few competitions every year around SAT, Model checking, and QBF.
- The IMO Grand Challenge is less of a competition than an aspirational competition.
- There are of course numerous benchmarks out there like MiniF2F, and at least one for every ITP based on the main libraries of that ITP, but these aren't competitions.
- I think there are also human ITP competitions, like ProofGround. These are not intended for ATPs, but I don't see why one couldn't allow ATP entrants.

#### Jason Rute (Aug 09 2023 at 00:39):

Overall, I find competitions intriguing. I think in machine learning at least, benchmarks are more common and useful than competitions, because benchmarks can be worked on at any point. But competitions do have a certain advantage. For example, if the problems are new, one doesn't have to worry about the model cheating by seeing the problems already. It can also be a bit more exciting especially if the models do well (like in AlphaFold2 winning CASP).

#### Jason Rute (Aug 09 2023 at 00:40):

Do folks here know of more such competitions? And if so, are there any that machine learning models can compete in, not just symbolic AI systems?

#### Jason Rute (Aug 09 2023 at 00:40):

My understanding (possibly wrong) about CASC is that while it is not impossible for neural and other machine learning models to compete, there are a lot of onerous restrictions. From the CASC website:

All techniques used must be general purpose, and expected to extend usefully to new unseen problems. The precomputation and storage of information about individual problems that might appear in the competition, or their solutions, is not allowed. Strategies and strategy selection based on individual problems or their solutions are not allowed. If machine learning procedures are used to tune a system, the learning must ensure that sufficient generalization is obtained so that no there is no specialization to individual problems. The system description must explain any such tuning or training that has been done. The competition panel may disqualify any system that is deemed to be problem specific rather than general purpose. If you are in doubt, contact the competition organizer.

#### Jason Rute (Aug 09 2023 at 00:40):

I think the fundamental problem, if I understand correctly, is that the same problems are used from year to year. So any system which learns from those problems ahead of time is cheating. I personally think this is really unfortunate (again if I understand correctly). It prevents the building and testing of anything but purely symbolic AI where everything has to be directly programmed in. I think the one exception is a category called Large Theory Batch Problems where the model is allowed to look at a range of problems and solve as many as it can in a long period of time, say 48 hours. A model can look at all the problems in the batch and solve them in any order. While one can't do any learning ahead of time, this does let the model apply its learnings from problems it solves early to solve later problems. The MALARea model is a machine learning model which has won this category a few times, but it is not a neural model since you would need a fast learning algorithm for this to work. Unfortunately I think this prohibits the newer FOL neural reinforcement based learning algorithms for which there are a few good ones now.

#### Jason Rute (Aug 09 2023 at 00:40):

I do however appreciate that coming up with new problems is hard, but I still wonder if there is a way to have better theorem proving competitions where AI models can compete.

#### Jason Rute (Aug 09 2023 at 00:49):

@Geoff Sutcliffe Feel free to let me know if I'm mischaracterized CASC and its requirements on machine learning models.

#### Geoff Sutcliffe (Aug 09 2023 at 13:04):

Hi Jason, all,

Here is some more about CASC wrt this suggestion ...

I want to start a conversation about automatic theorem proving competitions, especially ones that neural-based AI can compete in. I don't know much about this space, so I want to know more. But at the same time I feel there is really nothing right now that fits anything like I have in mind.

Yep, there is nothing specifically for NN-based ATP. I would support you

starting one. An appropriate conference to host it would be AITP.

First, I do know of a few competitions:

* (CASC) CADE ATP System Competition I think is a big one and it focuses (if I'm not mistaken) on first order logic.

It uses first-order and higher-order classical logic (and non-classical

logic is on its way!).

Overall, I find competitions intriguing.

The certainly help stimulate progress. And it's fun.

I think in machine learning at least, benchmarks are more common and useful than competitions, because benchmarks can be worked on at any point. But competitions do have a certain advantage. For example, if the problems are new, one doesn't have to worry about the model cheating by seeing the problems already. It can also be a bit more exciting especially if the models do well (like in AlphaFold2 winning CASP).

My understanding (possibly wrong) about CASC is that while it is not impossible for neural and other machine learning models to compete, there are a lot of onerous restrictions. From the CASC website:

NN-guided ATP systems do compete in CASC, and the LTB division is discussed

below was designed especially to support ML to learn search guidance for

large homogeneous problem sets.

I think the fundamental problem, if I understand correctly, is that the same problems are used from year to year.

That's true to some extent, in that for many of the divisions the TPTP

problem library is used. Note however that the TPTP grows each year, and

new problems that the systems have not seen are selected in preference

to problems that were in the previous release. Additionally, the SLH and

LTB division use new non-TPTP problem sets each year.

I think the one exception is a category called Large Theory Batch Problems where the model is allowed to look at a range of problems and solve as many as it can in a long period of time, say 48 hours. A model can look at all the problems in the batch and solve them in any order.

Correct.

While one can't do any learning ahead of time,

Incorrect. For LTB a largish sample of problems and their proofs is released

ahead of the competition to allow systems to be tuned (including ML trained)

ready for the competition.

this does let the model apply its learnings from problems it solves early to solve later problems.

Correct.

The MALARea model is a machine learning model which has won this category a few times, but it is not a neural model since you would need a fast learning algorithm for this to work.

I believe you are wrong - MALARea did update its model as it progressed

through the problems. Read the papers, or chat withg Josef. But in a sense

you are right, because MALARea was the only NN based system that enetered

the LTB division, and non-NN adaptive strategies did better.

I do however appreciate that coming up with new problems is hard, but I still wonder if there is a way to have better theorem proving competitions where AI models can compete.

I have lots of problem sets you can play with.

Geoff Sutcliffe Feel free to let me know if I'm mischaracterized CASC and its requirements on machine learning models.

I hope my comments help. I'd be happy to help you set up a competition

(separate from CASC) if you decide to go ahead.

Cheers,

Geoff

#### Jason Rute (Aug 09 2023 at 14:42):

@Geoff Sutcliffe Thanks for the thorough reply!

Yep, there is nothing specifically for NN-based ATP. I would support you

starting one. An appropriate conference to host it would be AITP.

It would be a nice conversation for the upcoming AITP. (Unfortunately it is at a bad time for me every year, so others would have to discuss it without me.)

#### Jason Rute (Aug 09 2023 at 14:42):

While one can't do any learning ahead of time,

Incorrect. For LTB a largish sample of problems and their proofs is released

ahead of the competition to allow systems to be tuned (including ML trained)

ready for the competition.

So you are saying it is possible to train on this set of problems ahead of time? How big is it? (To be clear, most NN based systems I know use at least 2k problems for training.) And can the neural network flat out memorize this set of problems (which frankly neural networks are prone to do)? Also, to be clear, it is not possible to train the model on any other set of problems (like past LTB problem sets or the TPTP dataset), only this "largish" sample, right?

#### Jason Rute (Aug 09 2023 at 14:42):

For SLH, is any pre-training allowed, and if so, on what set of problems? Could you train on all previous TPTP problems?

#### Jason Rute (Aug 09 2023 at 14:42):

NN-guided ATP systems do compete in CASC

Are there any machine learning systems besides MALARea which compete in any of the categories? (And to be clear, while MALARea uses machine learning, specifically naive Bayes and maybe k-nearest neighbors, I don't think it uses neural networks.)

#### Jason Rute (Aug 09 2023 at 14:43):

hope my comments help. I'd be happy to help you set up a competition (separate from CASC) if you decide to go ahead.

I personally am most interested not in FOL or HOL theorem proving in a pure form, but in theorem proving in practical ITP systems in the presence of large existing libraries of mathematics. The idea that one would be given theorems in Isabelle, Coq, or Lean (or even Metamath, HOL-Light, or Mizar) to prove. They could use and pre-train on the existing mathematics libraries, and use the tactics of the system. Indeed, there are already a number of provers out there which do this (like Lean GPT-f, Thor, MagnusHammer, ASTactic, Tactician, and so on). And there are already benchmarks, like MiniF2F or just reproving theorems in the library not used for training. I wonder if this can/should be turned into a competition.

#### Jason Rute (Aug 09 2023 at 14:43):

I guess there are also many practical concerns. Neural network systems often need GPUs (although not all do, especially for inference), and they likely use a bit higher number of resources than traditional ATPs (but again, not all do as some can run on a few cores). Also, it is hard to restrict what data one can pre-train on. Many LLM based NN solutions pre-train on basically a large chunk of the Internet and Github. While this could be banned, it also prevents evaluation of cutting edge approaches. But if you allow the Internet as pre-training data, you really have to be careful to make sure your problems are brand new.

#### Jason Rute (Aug 09 2023 at 14:43):

Most practically, I might be interested in competing so I should be careful if having too much input into a competition makes me ineligible to compete. :smile:

#### Aren Güralp (Aug 09 2023 at 16:36):

Hey Jason, a competition like this would be something I'd want to see happen.

I think one of the interesting challenges here is the relative scarcity of training data (formalized theorems + proofs) and the difficulty of manually producing more original data. This would present a requirement for competitors to be more creative with their models as they wouldn't be able to simply scale up (until we have reliable auto formalization, of course, but it is hard to know when that might be developed, and the implications of it would be huge). As a result, this might actually stimulate innovation more than other competitions.

A recent addition to the space that brings this closer to reach is LeanDojo, which is an open-sourced set of tools, data, models and benchmarks that aims to remove the barriers of creating theorem proving LLMs. This is something that I think would allow more people to join a competition like this, increasing the impact created by it and fostering interest.

It could also have knock-on effects for direct problem solution (ie the creation of a solution theorem by the model in response to a problem prompt), which happens to be something I'm far more interested in and that I'm building `mathlete.ai`

to do. Still very much so in the early idea stages, but it's something I'm excited about.

#### Zhangir Azerbayev (Aug 09 2023 at 22:43):

Thanks for starting this thread Jason. I can think of a number of reasons such a competition would be good for the state of research in ML for theorem proving and for the deep learning community in general.

- The most important reason is that it would battle test infrastructure for interaction and data extraction. LeanDojo looks very promising, and iirc it's close to having full Lean 4 support. But it's hard to know what the limitations of these tools are until you put them in the hands of creative users.
- If designed correctly, such a competition would draw a lot of interest from the wider deep learning community. Right now, there is something of a crisis in LLM evaluation because the one-shot IID benchmarks we use to evaluate language models don't reflect the most useful abilities of something like chatgpt or copilot. I'm sure that OAI/Anthropic have great internal evaluations for open-ended/multi-step/context-heavy tasks but they aren't public and likely require human annotators to score outputs. The field is crying out for a benchmark that assesses long horizon, context-laden tasks and can also be scored automatically. A theorem proving competition structured like the natural number game (with sequentially dependent problems, as opposed to minif2f, where each theorem is independent) would fit this bill.
- If the competition were based on problems that were private until the date of the competition (which it really should be), then it would be a source of guaranteed clean test data, which can be hard to come by for LLMs.

I wouldn't have a strong interest in competing myself, so I would be happy to help organize.

#### Adam Kurkiewicz (Aug 11 2023 at 16:54):

Thanks Jason, Geoff, Aren and Zhangir.

I've been thinking about organising a competition like this too.

There are several challenges I could think of. Some of the thoughts below are directly influenced by chatting with @Kevin Buzzard in person a few weeks ago and lurking on the Zulip chat and NOT from deep expertise in the subject. But here they are FWIW.

**High quality formalisations**. Formalising certain kinds of "high school" problems (e.g. geometry) is still relatively niche expertise. I think the key expertise lies with @Joseph Myers and a few others, but to the best of my knowledge it's a very narrow group of people who have the expertise needed to do this well. The unfortunate downside is that a lot of problem statements in existing benchmarks are incorrect, and often trivially true or false. There are a lot of challenges here (e.g. what does it actually mean to "find x" -- is proof of existence enough, or does one have to actually construct the object? What if "x" follows from a simple decidable search?).**Previously unseen problems**. Being able to come up with a good volume of previously unseen problems of the right kind of difficulty. In practice even a talented/motivated formaliser will struggle to produce more than a couple of examples a week. I think scaling that will require a larg-ish group of motivated folks.**Format**. What kind of format would be most useful to the community of LLM researchers? Zhangir proposes long horizon, context-laden tasks, like a natural numbers game. How about "machine translation" from informal proofs in latex to formal proofs in lean? I remember reading a paper from facebook research about this -- perhaps this is more within reach than coming up with original reasoning.

I don't think these challenges are insurmountable and I've got some ideas on how to work through them. Happy to chat with anybody interested.

#### Adam Kurkiewicz (Aug 11 2023 at 17:16):

@Zhangir Azerbayev -- I can see you're ahead of me here when it comes to **3. Format** -- I just became aware of your work on ProofNet -- looks really great! I'm reading through the paper just now...

#### Aren Güralp (Aug 11 2023 at 18:53):

Hi Adam, it's great to see more people interested In this topic.

I've been doing some background reading on the subject for the past week or so. Here are some of my thoughts on the points you make.

- The specifics of what types of theorems are harder/easier to formalize in lean is beyond me, but I've been thinking a lot about how to formalize an actual problem- not just a theorem to be proved, but one where the answer must be known before a claim can be made. I think of this in the context of the IMO Challenge, where the competing model will have to produce the theorem itself. I've been experimenting a little with starting at a proof of existence and prompting ChatGPT to use that to find the solution step-by-step, with no success so far. I believe a competition like this should require the construction or at least provision of "x". My other idea for this is to somehow construct a "dummy theorem" that can stand in for any specific "x" of its type, then carrying out tactics on that theorem until the value of "x" can be verifiably solved for.
- I don't see why it wouldn't be possible at the smaller scale to model these problems off of real mathematics competitions like the IMO or Putnam that already have problems produced for them every year. Scaling up, it's always possible to tap into short- and long-lists for these competitions. The challenge of course lies in the actual formalization of these, which would take human expertise. The alternative would be autoformalization, which is not entirely reliable at the moment, and might require production of a dedicated model trained and finetuned for theorem/proof translating.
- I am most intrigued by a competition format similar to the IMO, where problems are given without solution and both proof and solution must be produced. I think this would be a much broader measure of LLM reasoning capability, as opposed to the single task of ATP or translation, and I believe it would be open to more avenues of attack, which would make for richer stimulation of the field. This is also the one element that is most open to variation, as it's possible to simply have a competition for each format that there is value to be found in.

#### Junyan Xu (Aug 11 2023 at 20:48):

If you go into the repo of the Liquid Tensor Experiment, and search the Git history (can't be done on GitHub though), you'd find lots of `sorry`

s at every stage of development, which are later filled in. Could those not be used as training data / benchmark? There's even the blueprint that connects theorems and proofs with natural language references. To make a competition, could we just use some private repo under active development?

#### Zhangir Azerbayev (Aug 11 2023 at 22:06):

Adam Kurkiewicz said:

Zhangir Azerbayev -- I can see you're ahead of me here when it comes to

3. Format-- I just became aware of your work on ProofNet -- looks really great! I'm reading through the paper just now...

ProofNet more or less failed in its goal of encouraging interesting research because no new ideas will be required to solve it. All we'll have to do is wait for a few more iterations of GPT. I think something similar applies to domains like IMO or Putnam: there won't be much buy in from the deep learning community because GPT-n for not too high a value of n will one-shot these problems in natural language.

The more interesting test would be one that points in the direction of a lifelong learning agent that can accumulate mathematical knowledge from experience. For example, we could create a test that forces the model to

- Accumulate a library of theorems during test time that it must use as deftly as those it saw during pretraining.
- Formalize some definition and be on the hook for the consequences of a poor one,

and so forth.

I like the approach of the natural number game, which is structured in terms of sequentially dependent worlds and levels.

One good source of material might be constructing some object differently from how it is done in mathlib. For example, if the p-adic numbers in mathlib are defined algebraically (I haven't checked), we could create a "p-adic number game" that does the construction via the p-adic metric.

#### Junyan Xu (Aug 11 2023 at 22:36):

If you search Eudoxus on Zulip you will see multiple construction of the real numbers ...

#### Adam Kurkiewicz (Aug 12 2023 at 06:35):

Hmm, the insight about GPT-n seems a little pessimistic! Historically people tend to over-rate short-term AI progress due to hype, which later inevitably leads to deflated expectations (e.g. AI winter in the 80s, or neural networks winter in the 90s and the 00s).

I wouldn't be surprised if the IMO grand challenge was still unbroken in 20 years with many more ideas needed to break it!

If we start this competition now, we could still be running it in 20 years...

And I would definitely give more credit to ProofNet. For example, LeanDojo used it as a benchmark of their ReProver -- now we have an open source baseline and two hand-made benchmarks (miniF2F, ProofNet), and a synthetic benchmark due to LeanDojo (mathlib split). The field seems to be going strong!

"p-adic number game" is a good idea! There are some other exotic structures we could use, like the "Arithmetree" by Jean-Louis Loday. This could scale in principle -- we should be able to regularly come up with recreational maths examples that will be largely independent of what's been formalised out there. But I would still include the informal proofs at this stage for in-context learning.

If the competition spurs the creation of a high-end autoformaliser within the next 20 years, I think we will have achieved a lot!

#### Zhangir Azerbayev (Aug 12 2023 at 19:33):

Adam Kurkiewicz said:

Hmm, the insight about GPT-n seems a little pessimistic! Historically people tend to over-rate short-term AI progress due to hype, which later inevitably leads to deflated expectations (e.g. AI winter in the 80s, or neural networks winter in the 90s and the 00s).

I wouldn't be surprised if the IMO grand challenge was still unbroken in 20 years with many more ideas needed to break it!

If we start this competition now, we could still be running it in 20 years...

And I would definitely give more credit to ProofNet. For example, LeanDojo used it as a benchmark of their ReProver -- now we have an open source baseline and two hand-made benchmarks (miniF2F, ProofNet), and a synthetic benchmark due to LeanDojo (mathlib split). The field seems to be going strong!

"p-adic number game" is a good idea! There are some other exotic structures we could use, like the "Arithmetree" by Jean-Louis Loday. This could scale in principle -- we should be able to regularly come up with recreational maths examples that will be largely independent of what's been formalised out there. But I would still include the informal proofs at this stage for in-context learning.

If the competition spurs the creation of a high-end autoformaliser within the next 20 years, I think we will have achieved a lot!

I expect that even if there are zero new conceptual breakthroughs in machine learning the IMO grand challenge (in natural language) won't last very long. Already, GPT-4 + verifier scores over 50% on the AMC competition. The precise timeline is contingent, but I find it hard to imagine GPT is more than a few iterations of increased data+compute away from the IMO.

#### Scott Morrison (Aug 13 2023 at 03:19):

I'm pretty skeptical here that "just more data and compute" will get GPT to IMO level. I don't think there's meaningful evidence that GPT-4 level models can do the exploratory working-out-what-to-prove step of IMO problems, even augmented by verifiers or agents. I would be happy to pointed to such evidence (performance on the AMC wouldn't count without a pretty careful account of the extent to which this exploratory work is necessary on the 50% that is currently within range). Without that, I think there is a qualitative gap between current technology and the IMO grand challenge, and just adding data+compute won't get us off the ground.

That said, I don't particularly doubt that RL focused on math problems, with a good curriculum leading up to IMO level, could cross that gap in the next few years.

#### Adam Kurkiewicz (Aug 13 2023 at 18:09):

@Scott Morrison -- I think I'm on your side of the argument here -- could you share more thoughts about AMC and GPT-4's performance?

In particular are questions multiple choice? I don't know much about those american competitions.

#### Bolton Bailey (Aug 13 2023 at 21:50):

Here is a list of problems on a recent AMC10A

Here is a list of problems on a recent IMO

#### Bolton Bailey (Aug 13 2023 at 23:08):

In particular, I feel skeptical that we are just need more data and compute to get an LLM to output something like, for example, this solution to problem 5 here. I encourage you all to try the problem for yourselves before looking at the solution, especially if you've never looked at the IMO before, just to make sure you have an accurate sense of the kind of reasoning you have to employ to solve an IMO problem.

Find all triples $(a,b,p)$ of positive integers with $p$ prime and\[a^p = b! + p\]

It seems to me that whatever AI system eventually wins the IMO, it will have to be able to do a lot of backtracking and case analysis. This strikes me as the kind of thing that is much better suited to a formal environment than LLM human-language reasoning, but if you are going use an LLM speaking human language to itself, I think there will definitely have to be a lot of structure in the prompts that are being fed to the LLM.

#### Zhangir Azerbayev (Aug 13 2023 at 23:31):

Scott Morrison said:

I'm pretty skeptical here that "just more data and compute" will get GPT to IMO level. I don't think there's meaningful evidence that GPT-4 level models can do the exploratory working-out-what-to-prove step of IMO problems, even augmented by verifiers or agents. I would be happy to pointed to such evidence (performance on the AMC wouldn't count without a pretty careful account of the extent to which this exploratory work is necessary on the 50% that is currently within range). Without that, I think there is a qualitative gap between current technology and the IMO grand challenge, and just adding data+compute won't get us off the ground.

That said, I don't particularly doubt that RL focused on math problems, with a good curriculum leading up to IMO level, could cross that gap in the next few years.

Isn't Sagredo an example of GPT doing exploratory work? We've seen a few examples of it fumbling around for a while before finding the correct tactic step.

#### Zhangir Azerbayev (Aug 13 2023 at 23:40):

Bolton Bailey said:

It seems to me that whatever AI system eventually wins the IMO, it will have to be able to do a lot of backtracking and case analysis. This strikes me as the kind of thing that is much better suited to a formal environment than LLM human-language reasoning, but if you are going use an LLM speaking human language to itself, I think there will definitely have to be a lot of structure in the prompts that are being fed to the LLM.

I don't think formal has a discernible advantage over NL at IMO level. For graduate/research level math, it might be very hard to train good verifiers, but for IMO it will be much more feasible.

#### Bolton Bailey (Aug 13 2023 at 23:50):

Zhangir Azerbayev said:

I don't think formal has a discernible advantage over NL at IMO level. For graduate/research level math, it might be very hard to train good verifiers, but for IMO it won't be terribly hard.

The way I see it, it doesn't just have to be good, it has to be essentially perfect. If I have to think through 100 different subproblems to solve an IMO question, and if a mistake verifying any one of them has the potential to make my proof wrong, then my natural language verifier needs an accuracy on the order of 99%.

#### Bolton Bailey (Aug 13 2023 at 23:58):

Also the "graduate level math" comment is always funny to me. I am a graduate student. I am not a mathematician, but I'm a computer scientist, and I do proofs in my research. I think of IMO problems as much more difficult than my research problems.

A fair response to this might be that the math I do and the math that math grad students do is very different. But in some sense this is a point in favor of the difficulty of the task on its own. What we need is not just more training data but more training data of writing on the form of math that is used to make IMO problems. Even loading all of the Springer yellow books into the machine might not do it because even though they are math books, they are not necessarily the right kind of math book.

Last updated: Dec 20 2023 at 11:08 UTC