Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: AITP 2020


Jason Rute (Mar 31 2020 at 23:31):

I noticed the abstracts for AITP 2020 got posted recently (not sure when). I'm really looking forward to reading through them. I thought I'd also add any comments I have here. Feel free to add your own as well.

Jason Rute (Mar 31 2020 at 23:32):

(I also assume there is no other public place on the internet where people talk about AITP 2020, right?)

Jason Rute (Apr 01 2020 at 23:09):

The first abstract I want to talk about is Reinforcement Learning for Interactive Theorem Proving in HOL4 by @Minchao Wu, Michael Norrish, Christian Walder and Amir Dezfouli. They built a gym environment for HOL4. I don't think the code or paper is public yet, but the abstract has a lot of details. The main idea is that they created a Python environment for reinforcement learning of theorem proving in HOL4. There are a number of points which make it different from HOList and the other projects. Rather than go through all the details, I'll give a few quick thoughts:

  • The main thing I found interesting about this is that they set this up as a Gym-compatible environment (so that they could use the REINFORCE algorithm). By Gym compatible I mean OpenAI's Gym interface, which is a really simple interface: you have to be able to start an environment, perform a step in that environment, and then read an observation about the state of that environment (see the sketch at the end of this message). What makes this different from other interactive theorem proving "gyms" that I'm aware of is that it doesn't seem to allow backtracking. Backtracking (in the form of a tree search) is very important for projects like AlphaGo, solving the Rubik's cube, and HOList. From what I can tell, they allow for two ways around this:

    1. They actually do have back-tracking in a sense, since a "state" is not a single goal stack (with a local context), but instead a "fringe". I'm not entirely clear what a fringe is. They describe a fringe as all unvisited goal stacks, but it isn't clear to me what "visited" means in this context. Unlike, say, A* search, each goal stack has a very large number of "neighbors", since there are a very large number of tactic applications that can be applied to a goal. It would be unreasonable to visit all of the neighbors. So maybe every goal stack remains in the fringe, but then I don't know what "unvisited" means. (Also, later they talk about "winning" if the fringe becomes empty, but that also wouldn't make sense, since I only need to solve some goal stack, not all of them.) Another interpretation is that a goal stack is visited when any tactic is applied to it, but then the fringe would only have size 1, right, because each tactic application only creates one new goal stack? (Clearly I don't understand this well enough.) Anyway, I think there is some notion of being able to choose which goal stack to explore more.
    2. The default behavior is to completely ignore the fringe and to just apply your tactic application to the first goal stack in the fringe. In this case it is basically pure model-free reinforcement learning. (Also, since they are using REINFORCE, it seems they are using this default behavior.) This however is surprising, since I was under the impression that the model-based reinforcement learning used in theorem proving (i.e. using the fact that the agent has access to an accurate simulator of the theorem proving environment) is very important to getting good results. If that is not the case, that would be surprising. (I was just having a conversation about this for HOList.)
  • Unlike a lot of theorem proving RL setups which only assign rewards at the end (solved or not), they assign partial negative rewards for picking bad tactics (which fail or don't change the goal) and for running out of time. This probably helps the agent learn faster.

  • It is interesting that environments written in Python (like this one) sell themselves as making it easier to use machine learning tools like PyTorch/TensorFlow, while environments written in OCaml (like the new TacticToe-for-Coq paper) sell themselves as being closer to the theorem prover. It gets at the following dichotomy: if we ever hope to use these tools in practice, we need to have them available in the language the theorem prover is written in, but if we want to try things quickly or get machine learning experts involved, then we need them in Python environments with clean and well-documented APIs. I think we need both.
  • While the environment seems fairly grand, what they have done with it so far seems fairly limited. I think they have only used it on 10 theorems so far. Also, I think each theorem is like a whole new world where the agent has to relearn how to prove theorems from scratch. This is very different from the DeepHOL Zero work, where the agent learns things from one theorem which it can apply to another. I hope that this work can eventually be brought to that level. However, even if this is all that they have planned, I think it will serve as a good Gym environment for machine learning researchers to try out theorem proving, but not as competition for, say, DeepHOL Zero or any of the supervised-learned ML agents like TacticToe, which is currently the state of the art for ML in HOL4.
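
To make "Gym compatible" concrete, here is roughly the kind of interface I have in mind (purely illustrative; their code isn't public, so every name below is my own guess rather than their actual API):

    # Hypothetical sketch of a Gym-style interface to an ITP backend; none of
    # these names come from the actual HOL4 environment.
    class TheoremProvingEnv:
        def reset(self, theorem_statement):
            """Start a new proof attempt and return the initial observation,
            e.g. the fringe of open goals rendered as strings."""
            raise NotImplementedError  # would call out to HOL4 here

        def step(self, tactic):
            """Apply one tactic (the action) to the current goal and return
            (observation, reward, done, info): reward might be +1 on success
            and a small negative value for failed or timed-out tactics, and
            done is True when the proof attempt ends."""
            raise NotImplementedError  # would call out to HOL4 here

    # Intended usage, so that any reset/step RL agent can drive the prover:
    #   obs = env.reset("!x. x + 0 = x")
    #   obs, reward, done, info = env.step("simp[]")

The appeal is that any off-the-shelf RL agent written against reset/step can drive the prover without knowing anything about HOL4.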

Jason Rute (Apr 01 2020 at 23:11):

Also, if anyone can fix this, the name of their project is spelled incorrectly at http://aitp-conference.org/2020/ : "leaerning" -> "learning"

Minchao Wu (Apr 02 2020 at 09:10):

A fringe is a set of goals (with their corresponding local contexts) that haven't been proved during a single proof attempt. One starts with the main goal. As one applies tactics, the fringe is expanded with the new (sub)goals generated. The main goal is proved when the fringe becomes empty. There is no backtracking because this approach treats theorem proving as a single-pass exploration of the tree. The hope is that the "backtracking flavor" will be learned by the policy. In the current setting of the environment, one can still implement backtracking oneself if one wants to. However, we are working on a new version that would support backtracking in a neater way.
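
In pseudocode, one step of a proof attempt looks roughly like this (simplified; apply_tactic is just a stand-in for calling HOL4 and returning the list of new subgoals, [] if the goal is closed):

    # Simplified illustration of how the fringe evolves during a proof attempt.
    def fringe_step(fringe, goal, tactic, apply_tactic):
        assert goal in fringe
        subgoals = apply_tactic(goal, tactic)  # [] means the tactic closed the goal
        return (fringe - {goal}) | set(subgoals)
    # The main goal is proved exactly when the fringe becomes empty.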

Jason Rute (Apr 02 2020 at 12:12):

Ok. I clearly misunderstood. So a "fringe" is what is called a "goal stack" in HOL4, right? (Looking back, I misread that sentence. Although, it is confusing to say that an edge of the tree is a tactic application, since if a node is a single (sub)goal, all the edges coming out of that node are going to be the same tactic application.)

Jason Rute (Apr 02 2020 at 12:20):

There is no backtracking because this approach treats theorem proving as a single-pass exploration of the tree.

Is this "single pass exploration" approach viable? It would be as if AlphaZero was only allowed to play go and chess with a single pass through (no MCTS). Do you think one can train a state of the art HOL4 theorem prover without a tree search? (Or maybe I'm misunderstanding the goal of this project.) However, I do admit that by restricting to model-free reinforcement learning that you are opening up a wide field of prebuilt and tuned algorithms which you can apply (as long as you can handle embedding the formulas since that is nonstandard).

Jason Rute (Apr 02 2020 at 12:23):

In the current setting of the environment, one can still implement backtracking oneself if one wants to.

Does the gym-like environment have a mechanism to backtrack? Maybe get_state and set_state methods (which I've seen in other Gym-like environments)? (Of course the user could program their own HOL4 simulator, but that seems like overkill. :slight_smile:)

Jason Rute (Apr 02 2020 at 12:25):

However, we are working on a new version that would support backtracking in a neater way.

Looking forward to seeing this! As well as any other papers, presentations, code, and most of all, public gym-like APIs that come out of this project!

Minchao Wu (Apr 02 2020 at 12:53):

Is this "single pass exploration" approach viable? It would be as if AlphaZero was only allowed to play go and chess with a single pass through (no MCTS). Do you think one can train a state of the art HOL4 theorem prover without a tree search?

I wouldn't say that there is no tree search, but rather that the tree search is the learned policy. If you really want some explicit tree search (say MCTS), you still have the option to add it afterwards by treating it as a policy improvement operator.

However, I do admit that by restricting to model-free reinforcement learning you are opening up a wide field of prebuilt and tuned algorithms which you can apply (as long as you can handle embedding the formulas, since that is nonstandard).

I don't see how prebuilt and tuned algorithms can be applied directly here at this stage. The learning of the argument policy is non-standard and would require quite some flexibility from the prebuilt algorithms. We wrote our own training algorithms for the AITP version.

Does the gym-like environment have a mechanism to backtrack? Maybe get_state and set_state methods (which I've seen in other Gym-like environments)?

There is get_state but no set_state in the current AITP version. The one we are working on would make it available. Some of us still believe that the non-backtracking version would be much easier to understand and train (and shouldn't perform too badly), though.

Jason Rute (Apr 02 2020 at 12:58):

If you really want some explicit tree search (say MCTS), you still have the option to add it

To have MCTS (or any tree search), you need a way to jump between states of the environment. Do you have that? If I apply a tactic and later decide I didn't want to do that, can I back up (or say "go to such and such state")? Most of the OpenAI Gym environments don't support this functionality, so it isn't obvious that yours does either. (EDIT: Maybe they support it in that you can make a deep copy of the state, but I don't know whether that would be possible with an ITP like HOL4 in the backend.)

Jason Rute (Apr 02 2020 at 12:59):

I don't see how prebuilt and tuned algorithms can be applied directly here at this stage.

That is a good point. I guess you were saying you used REINFORCE, so I didn't think about it too carefully, but yes, the action space is quite unique.

Minchao Wu (Apr 02 2020 at 13:05):

Most of the OpenAI Gym environments don't support this functionality, so it isn't obvious that yours does either.

No, at least not at this stage. One would need to do it themselves. The environment can tell you what the resulting state is, but the user needs to manage the states themselves.

Jason Rute (Apr 02 2020 at 13:08):

Can a user save a copy of a state and later apply actions to it? In functional programming terms, I'm trying to figure out if the states are persistent data structures. If not, I don't see how the user can manage the tree search themselves. (Although, I guess the user can also just start over and apply the same sequence of tactics again to get back to a previously visited state.)
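
For example, even without set_state I could imagine managing the tree search on the user's side by replaying tactic sequences, along these lines (a rough sketch assuming a reset/step interface like the one above; wasteful, but simple):

    # Hypothetical replay-based backtracking on top of a reset/step API.
    # Each search node is identified by the tactic sequence that reaches it.
    def goto(env, theorem, tactic_sequence):
        """Put the environment back into the state reached by this sequence."""
        obs = env.reset(theorem)
        for tactic in tactic_sequence:
            obs, reward, done, info = env.step(tactic)
        return obs

    # To "backtrack", drop the last tactic and replay the shorter prefix:
    #   goto(env, thm, path[:-1])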

Minchao Wu (Apr 02 2020 at 13:24):

For that you just need something like set_states, which is not too difficult to implement. Then one can maintain a history of states on their own and call the environment to roll out. But as you said, one can always do it with minimal help from the environment.

Rongmin Lu (Apr 06 2020 at 03:05):

Thanks for posting this!

Stanislas Polu (Apr 06 2020 at 14:34):

I found Wu/Jiang/Grosse/Ba's abstract on proving inequalities very interesting. Their generator is straightforward to reproduce, and while the proved equalities/inequalities still appear a bit easy (at least for the k/l values they chose), the results and the study are pragmatic and well covered.

Brando Miranda (Jul 29 2020 at 16:34):

Does anyone know if it's still possible to attend AITP? (or if it's likely to go online and one can register?)

Jason Rute (Jul 29 2020 at 16:40):

I think the best thing to do would be to ask Josef Urban directly.

Jason Rute (Sep 14 2020 at 17:05):

How is AITP 2020 going? Any exciting news? Is it in-person, remote, or a mix? Will videos and slides get posted on the website as usual? I hope it is a good meeting. So much has happened in the last year!

Mikoláš Janota (Sep 14 2020 at 17:31):

It's a mix, we managed to get to Aussois from CZ

Thibault Gauthier (Sep 14 2020 at 21:34):

Unfortunately, Lasse's talk and mine were not recorded because we were distracted by some technical problems. For those interested, here are my slides about self-learned formula synthesis in set theory: AITP2020-setsynt.pdf

Stanislas Polu (Sep 15 2020 at 05:47):

Thanks @Thibault Gauthier! Out of curiosity, is there similar work out there on propositional logic, since there seems to be a close connection between your graphs and truth tables? And maybe refactoring a proposition into a simpler form might be useful for some applications?

Thibault Gauthier (Sep 15 2020 at 09:55):

There is something about finite model builders, but I am not sure it's going in the same direction.
https://easychair.org/publications/paper/rZKt

Thibault Gauthier (Sep 15 2020 at 14:34):

@Stanislas Polu
I really like your talk today about NN-guided search for Metamath.
The multiplication of probabilities makes sense.
Are you planning to go for reinforcement learning from scratch (AlphaZero-style) with no training on Metamath (and maybe some pretraining on other datasets)?

Stanislas Polu (Sep 15 2020 at 14:43):

@Thibault Gauthier thanks! What we do today is Expert Iteration with a value function, which is preeeety close to a full-fledged RL loop. The challenge with applying AlphaZero is that the action space is... enumerable here, so the exploration budget for AlphaZero to work would be humongous :) But luckily DeepMind published a paper recently that will certainly help with that: https://arxiv.org/abs/2007.12509
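
Schematically, the loop looks something like this (very much a sketch with placeholder helper names, not our actual training code):

    # Schematic expert-iteration loop for theorem proving (illustrative only).
    # All helpers (guided_search, finetune, ...) are placeholders supplied by
    # the caller; only the shape of the loop is the point here.
    def expert_iteration(model, statements, guided_search, finetune,
                         num_iterations=10, search_budget=128):
        for _ in range(num_iterations):
            new_proofs = []
            for statement in statements:
                # proof search guided by the current policy/value ("the expert")
                proof = guided_search(model, statement, budget=search_budget)
                if proof is not None:
                    new_proofs.append(proof)
            # distill the expert's successful searches back into the model
            model = finetune(model, new_proofs)
        return model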

Stanislas Polu (Sep 15 2020 at 14:43):

That's something we'll definitely explore once we reach the limits of Expert Iteration :+1:

Stanislas Polu (Sep 15 2020 at 14:45):

@Kevin Buzzard hi hi! To follow up on your question: anybody who has applied state-of-the-art technology to code generation knows how useful comments in code are for training models to translate from informal prompts to usable code. If I were to make one recommendation for the Lean community, it would be to comment Lean code as much as conceivably possible. Future ML experts will thank you for that :)

Johan Commelin (Sep 15 2020 at 14:49):

Why has nobody ever told me that before?

Kevin Buzzard (Sep 15 2020 at 14:50):

:D

Stanislas Polu (Sep 15 2020 at 19:57):

@Thibault Gauthier I realize I misread your question. I thiiiink we're more interested in tackling external real-world benchmarks these days (such as exercise statements that have been manually formalized). In that setup, access to training proofs will be even scarcer, and we plan to explore the ability to train against a curriculum of exercise statements without the proofs. We have early results that suggest that it works well; but we do start from the base models that are trained on the Metamath database, not from scratch.

Stanislas Polu (Sep 15 2020 at 20:00):

I think the RL HOList papers demonstrate that training "from scratch" (without proofs, though "scratch" is still a bit of an abuse of language since you do have the statements) just works in principle with a finite action space. But as far as we're concerned, we do need to bootstrap the generative language models so that they are able to generate sensible terms for the substitutions.

Stanislas Polu (Sep 15 2020 at 20:01):

I hope this time I answered your question :grimacing:

Stanislas Polu (Sep 15 2020 at 20:04):

Pre-training on other datasets, I think (not verified), would probably not suffice to stumble on correct substitutions early enough to go full AlphaZero-style.

Jason Rute (Sep 16 2020 at 00:15):

Stanislas Polu said:

But luckily DeepMind published a paper recently that will certainly help with that: https://arxiv.org/abs/2007.12509

I just had a look at this paper. It's really good. It's amazing how looking at the mathematical underpinnings of an ad hoc solution can help to optimize it and make it more rigorous. We must spread the Gospel. :smile: If you are going to implement MCTS for RL, use the new and improved version in this paper!

Jason Rute (Sep 16 2020 at 00:19):

Also, speaking of tree search, has anyone else noticed that tactic-based proving is a two-player zero-sum game? Player A chooses the tactics. Player B chooses one subgoal to work on. If Player A has a perfect strategy the theorem is provable. If Player B does, the theorem is not provable. I think this could speed up RL for theorem proving (since one doesn't have to explore all subgoals to get meaningful policy and value updates). It also probably means that maybe we should start on the hardest subgoal instead of the first subgoal (or the easiest subgoal). Anyway, food for thought...
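
Concretely, the value of a goal g in this game would satisfy something like

    \mathrm{Val}(g) \;=\; \max_{t \in \mathrm{Tactics}(g)} \;\min_{g' \in \mathrm{Subgoals}(g,\,t)} \mathrm{Val}(g')

with the convention that a min over an empty set of subgoals is 1 (the tactic closes the goal) and a max over an empty set of applicable tactics is 0, ignoring nontermination. Player A picks the maximizing tactic, Player B the minimizing subgoal, and Val(g) = 1 exactly when g is provable.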

Stanislas Polu (Sep 16 2020 at 09:27):

To follow up on yesterday's talk on the IMO Grand Challenge, which is, no question about it, a very well designed ultimate test of the capabilities of ATPs, I think it's important to recognize that IMOs are also still far out of reach of the current state of the art in general (even if some classes of problems are admittedly within reach, such as some types of inequalities, some types of geometry questions, ...). As a consequence, one thing the GC does not address is that it does not let us effectively measure our progress as a community today, if we are, to exaggerate a bit, stuck at 0%.

On the path to IMOs, I strongly believe (and am happy to discuss this point) that it would be beneficial for the community to have a set of unified benchmarks of increasing difficulty against which we can measure ourselves and collaborate, without relying on the current go-to solution of test/valid splits of existing libraries, which are siloed by construction, heavily conditioned on how lemmas were split in each of these libraries, and as such not directly comparable across efforts.

As such, if there is interest in it, I'd love to collaborate on a mini F2F benchmark consisting, in its first iteration, of problem statements that are "very easy" but do require a minimal amount of mathematical thinking and human effort to be proved formally, and that are challenging enough that current ATPs will fail to reach 100% as of today. This would create a medium for our different teams to compare themselves against each other, which would IMHO foster collaboration and innovation. It should be simple enough that it can measure our progress effectively, but hard enough that reaching 100% on the benchmark would convincingly demonstrate that we're making useful progress towards the grand challenge.

Pragmatically, I think that would simply mean selecting, say, 50-100 exercises that we believe fit the bill, publicizing them, and putting in the small amount of work needed to formalize their statements (only) in various systems.

I'm very curious to hear what people think about this proposal. Am I missing something? Do people feel convinced that accuracy on test splits of formal libraries is good enough? Or that we're close enough to IMOs to use them as a measure of progress?

Minchao Wu (Sep 16 2020 at 09:36):

Is the F2F benchmark supposed to be in a specific language (say, Lean), or would it be in some other forms?

Kevin Buzzard (Sep 16 2020 at 09:43):

There is a huge source of Olympiad problems available, and Lean is getting to the point where it can understand pretty much all of them, with Joseph Myers' recent work on affine geometry enabling formalisations of problems involving e.g. circles, triangles etc (indeed Joseph just finished formalising all the solutions to this year's British Maths Olympiad round 1).

I run a formalizing club at Imperial and I also run a puzzle-solving club where we work on Olympiad problems; when it all starts up again in October I could see if I could get the puzzle-solvers formalising olympiad questions.

Scott Morrison (Sep 16 2020 at 09:49):

Stanislas Polu said:

Do people feel convinced that accuracy on test splits of formal libraries is good enough?

I think that mathematicians mostly think that testing against formalised libraries is ... barking up the wrong tree? (I'm trying to be politic here --- certainly it's a start.)

A formalised library looks almost completely unlike "the process" of doing maths (wherever in the spectrum of theory-building / problem-solving you mean). I mean, it's great if ML based super-tactics can reproduce and/or help in preparation of formalised libraries, but the "millions of tiny lemmas" that you see in mathlib (and most other libraries I've looked at) is a strange artifact of wanting to talk to the computer, not much of a reflection on what human mathematicians do. :-)

Stanislas Polu (Sep 16 2020 at 10:04):

@Minchao Wu I think it can easily be cross-language/system: we just formalize the statements in all systems that make sense (say Lean, Metamath, HOL Family, Coq).

Reid Barton (Sep 16 2020 at 10:04):

One place to start could be to take human solutions to IMO problems and formalize the outline of the argument (including any auxiliary definitions it uses) as statements without formal proofs. Use that as the input, and have the system fill in the proofs, or just use them somehow to produce a proof of the original problem if it prefers (e.g., maybe one of the intermediate statements isn't quite correct).

Stanislas Polu (Sep 16 2020 at 10:04):

@Scott Morrison completely agreed with that. Hence the motivation for an "out-of-library" / "exercise-type" benchmark?

Stanislas Polu (Sep 16 2020 at 10:06):

@Kevin Buzzard I think formalizing the statements only here would be sufficient to produce a benchmark we can share. But again, it does not solve the problem that IMOs may be too hard to serve as an efficient measure of progress?

Stanislas Polu (Sep 16 2020 at 10:07):

@Reid Barton that's an interesting take. Is there a particular motivation for this objective vs simpler exercises?

Kevin Buzzard (Sep 16 2020 at 10:09):

This particular mathematician has no clue whatsoever about machine learning, which is why I am so interested in engaging with the ML community to figure out what they want. I want a program which is capable of helping PhD students to do algebraic geometry in some way which is recognised by other staff members as being tangibly useful.

Scott Morrison (Sep 16 2020 at 10:09):

I think the longer-term goal of IMOs is great. Reid's suggestion provides something intermediate to aim for that is hopefully actually building in the right direction.

Stanislas Polu (Sep 16 2020 at 10:14):

@Kevin Buzzard to make sure I understand, are you looking for a setup where the student provides the high-level "hard" reasoning steps in a natural/easy way and the system provides the glue? Being fine with a system that is somewhat dumb at maths but great at glueing formal proofs together?

Kevin Buzzard (Sep 16 2020 at 10:15):

Oh my need is much broader. I need any kind of system where someone who has no idea about solvers or provers or anything but has a training in mathematics can look at it and say "hey, that looks useful". I just care about making these kinds of tools more mainstream in the mathematics community -- this is in some sense the biggest problem I face right now.

Reid Barton (Sep 16 2020 at 10:15):

I guess one could say that this problem is something like getting computers to check the solutions to IMO problems on the same level as humans. It's hard for me to imagine that this isn't a necessary step along the way--but also a system capable of this could be very useful for human and even automated formalization of proofs, which in turn could increase the training data for a system that wants to solve bigger problems, etc.

Kevin Buzzard (Sep 16 2020 at 10:16):

A mathematician looks at something like Z3 or a SAT solver or TPTP and doesn't recognise anything which they could use there.

Kevin Buzzard (Sep 16 2020 at 10:16):

I am currently regarded as an oddity within my community, and getting funding is difficult because mathematicians have 0 use for any of these tools right now.

Reid Barton (Sep 16 2020 at 10:17):

Giving intermediate statements along the path to a proof can also be tuned for difficulty by dropping out some of the steps (or even introducing error in the statements).

Kevin Buzzard (Sep 16 2020 at 10:18):

Saying "we have SAT solvers which can solve gigantic tedious logic problems" does not buy much in my community. Even solving this boolean pythag triple problem does not buy you much here, because solving fantastically combinatorially complicated problems in basic number theory using brute force techniques is not passing on the kind of understanding which we are interested in -- we want to see high-level structure.

Reid Barton (Sep 16 2020 at 10:19):

Finally, Lean's current automation would score very poorly on most problems of this form but I think that "hammers" available in other systems like Isabelle would come in comfortably far from both 0% and 100%.

Stanislas Polu (Sep 16 2020 at 10:19):

This is interesting because I believe it uncovers some form of tension here. You want assistance to make formalization easy and natural, while ML researchers want to demonstrate that their machine can do non-trivial maths.

Kevin Buzzard (Sep 16 2020 at 10:19):

To crack the world of pure mathematics as it actually is (as opposed to how some CS people want to see it) we need stuff which looks more impressive. An Olympiad problem-solver might turn some more heads; a database which can hammer through some sections of the stacks project might too.

Stanislas Polu (Sep 16 2020 at 10:21):

@Reid Barton This is interesting and might indeed be a sweet spot with respect to the tension above. What would be the artifact? What do you think of an IMO problem statement plus a set of intermediate lemma statements? And we can arbitrarily mask these lemmas to challenge systems, or simply benchmark against the lemmas independently?

Kevin Buzzard (Sep 16 2020 at 10:24):

Anything which would convince a mathematician that machines are capable of making some kind of progress on Olympiad style problems would I'm sure be of interest, whether it's a box which outputs "I think proving these intermediate lemmas would be helpful" or "here is a choice point -- can I have some human input (in particular input from a human who can't solve the problem) to see which way to go next" or "here are some lemmas which I can prove and which might be helpful" -- anything which comes up with a human-understandable step towards solving the problem which clearly looks like it might be going in the right direction would turn a few heads.

Kevin Buzzard (Sep 16 2020 at 10:25):

It is important to note that there are plenty of researchers in maths who have no clue how to solve most olympiad problems. Puzzle-solving (and in particular speed puzzle solving, which is what the IMO is) is a particular kind of skill, which is certainly not an essential prerequisite for being a research mathematician (although of course it can sometimes help in some areas).

Reid Barton (Sep 16 2020 at 10:25):

Yes, this is the sort of thing I had in mind. You'll also need some way to represent auxiliary definitions (e.g. geometric constructions) which some of the intermediate lemmas talk about; so then the data set also needs to know when a lemma's statement depends on a definition. I guess it's also possible that two lemmas depend on each other for their statements, for example, if you prove that some set is nonempty and then prove something about its maximum.

Kevin Buzzard (Sep 16 2020 at 10:26):

So computers making any kind of observations which humans could independently verify as "progress" would be of interest to the maths community.

Kevin Buzzard (Sep 16 2020 at 10:28):

Let me right now make a broad and over-general criticism of the ML papers I look at which are claiming to do mathematics. They might say "OK so we took a bunch of topology lemmas in Mizar, and then we fed them into our prover, and it completely solved 37% of them and made partial progress on another 37%"

Stanislas Polu (Sep 16 2020 at 10:28):

@Reid Barton is that the current plan for the IMO GC?

Kevin Buzzard (Sep 16 2020 at 10:28):

and in my experience it is extremely rare that they give explicit examples of things they achieved.

Kevin Buzzard (Sep 16 2020 at 10:29):

This is one trap which the metamath paper did not fall into, because there it says "here is a proof we found which is better than the proof in the metamath library"

Stanislas Polu (Sep 16 2020 at 10:29):

Yes completely agreed with that. But to be fair we're all lacking a meaningful benchmark we can use to do research

Kevin Buzzard (Sep 16 2020 at 10:29):

I see

Kevin Buzzard (Sep 16 2020 at 10:30):

As a mathematician I find these papers which just say 37% extremely difficult to judge, because it is extremely rare to see explicit examples of achievements -- the ML people just regard the dataset as a dataset and I suspect many researchers do not really ever look at it in some sense.

Stanislas Polu (Sep 16 2020 at 10:30):

Yes

Kevin Buzzard (Sep 16 2020 at 10:30):

When I talk to Urban about what is going on under the hood, my impression is that in some of these papers you put your 50 lemmas into some kind of order, and then for each lemma you make a note of which of the earlier lemmas were used to prove it

Kevin Buzzard (Sep 16 2020 at 10:31):

and then you tell your prover to go and re-prove them but you also might tell it which of the previous lemmas to use, for example.

Patrick Massot (Sep 16 2020 at 10:31):

I agree with everything Kevin and Reid wrote. In particular I like the IMO challenge idea but Kevin is right to point out that a lot of mathematicians (maybe 99%) have no IMO skill at all.

Kevin Buzzard (Sep 16 2020 at 10:31):

And then 37% could be interpreted as a very poor return on this

Kevin Buzzard (Sep 16 2020 at 10:33):

And then when you actually do look at some of the things they proved, maybe they proved a trivial corollary to a theorem; maybe in fact 37% of the results in the database were each a trivial corollary of the lemma before; this is often what a formalised database can look like.

Kevin Buzzard (Sep 16 2020 at 10:33):

This is what I see when I read ML papers and these are not results I can sell to my community.

Kevin Buzzard (Sep 16 2020 at 10:34):

There is this team in Oxford trying to compute invariants of mathematical objects and they say things like "we predicted the rank of 80% of the elliptic curves in this database"

Stanislas Polu (Sep 16 2020 at 10:34):

Very aligned and in complete agreement with that.

Kevin Buzzard (Sep 16 2020 at 10:34):

but I have an algorithm which will accurately compute the rank of 99% of elliptic curves in any reasonable database.

Kevin Buzzard (Sep 16 2020 at 10:35):

1) Compute the root number 2) Assume the Birch and Swinnerton-Dyer conjecture 3) if the root number is odd, output "1" and if it's even output "0"

Kevin Buzzard (Sep 16 2020 at 10:35):

It's the same fallacy as the AI which can predict with 80% certainty whether you are straight or gay from a photo of your face. I can beat that by predicting "straight" in all cases.

Kevin Buzzard (Sep 16 2020 at 10:36):

I'm not attempting to be accurate, I'm just attempting to play the odds.

Stanislas Polu (Sep 16 2020 at 10:36):

So I believe we all agree that there is urgency for a benchmark to emerge against which ML researchers can do research, designed such that progress on that benchmark is regarded as progress by the math community.

Kevin Buzzard (Sep 16 2020 at 10:36):

What could an IMO benchmark look like?

Kevin Buzzard (Sep 16 2020 at 10:36):

Or more generally an Olympiad benchmark?

Stanislas Polu (Sep 16 2020 at 10:36):

I think @Reid Barton 's suggestion is good

Kevin Buzzard (Sep 16 2020 at 10:37):

(because for each IMO there are 100 other Olympiad questions, probably publicly available, e.g. the British Maths Olympiad has two rounds, and easier questions; it is used as training for the UK IMO team)

Stanislas Polu (Sep 16 2020 at 10:38):

Statements of exercises + statements of the definitions and lemmas required to solve each exercise, in such a way that if a system closes a large number of lemmas, a mathematician would say: OK, it didn't go all the way to solving the exercise, but these lemmas were non-trivial and interesting to prove.

Stanislas Polu (Sep 16 2020 at 10:38):

Eventually, as the success rate on lemmas increases and we close more and more exercises, we can start masking the intermediate lemmas and report progress.

Kevin Buzzard (Sep 16 2020 at 10:38):

and is the idea that you feed the solver the intermediate lemmas, or hope it isolates them?

Stanislas Polu (Sep 16 2020 at 10:38):

I think step 0 is to attack the lemmas independently

Stanislas Polu (Sep 16 2020 at 10:39):

then we can condition on the lemmas to solve exercises, and progressively remove them

Kevin Buzzard (Sep 16 2020 at 10:39):

@Daniel Selsam was talking about these choice points. In an ideal world the computer would sometimes isolate some simpler statement and ask whether it was worth attacking

Kevin Buzzard (Sep 16 2020 at 10:39):

Some IMO problems are just a "one trick" thing -- e.g. "colour the points in the following way and now observe it's easy"

Stanislas Polu (Sep 16 2020 at 10:41):

Right, but we're already diving into design/system decisions here. I think we need something that is applicable to humans as well.

I'm a beginner, I'm lost, but I see some lemmas I can solve, so I start with them. Then I get really good at solving these lemmas. Eventually I close all proofs with the help of the lemmas, and in the end I don't need the lemmas anymore.

Stanislas Polu (Sep 16 2020 at 10:41):

In some instances the lemma might indeed divulge the trick, but if the lemma itself is not so trivial to prove, we already demonstrate capability?

Stanislas Polu (Sep 16 2020 at 10:43):

These intermediate lemmas serve as both hints and technical mini-challenges on which we can make progress.

Stanislas Polu (Sep 16 2020 at 10:43):

The latter is how they will be used first; the former is how they will be used later on, and eventually they won't be needed anymore.

Stanislas Polu (Sep 16 2020 at 10:48):

Arguably, this representation is already how someone would formalize an IMO problem today. I argue here that we don't need the full proof to have a benchmark that is useful and productive; just formalizing the statements would already go a very long way.

Minchao Wu (Sep 16 2020 at 10:54):

There are some additional challenges for it to be cross-system: some IMO problems may be easily formalizable in one system but not another, due to the different states of development of a particular theory required along the way

Minchao Wu (Sep 16 2020 at 10:56):

We don’t know what exactly these problems are, though, until we actually try to formalize them

Reid Barton (Sep 16 2020 at 11:04):

Stanislas Polu said:

Reid Barton is that the current plan for the IMO GC?

These are just my own thoughts on your question of what an intermediate version of the IMO GC goal might look like. However, since we already have a small collection of olympiad problem and solution formalizations, it should be fairly easy to try to produce a small data set along these lines, and see whether there are any immediate roadblocks.

Stanislas Polu (Sep 16 2020 at 11:06):

@Minchao Wu Starting with very simple problems alleviates that challenge, but does not fit the bill of making mathematicians turn their heads. I'm pretty sure most systems out there are still incapable of solving the most trivial inequalities if they are even remotely obfuscated (such as, say, a*(a-2) <= 1 or a^2 + b^2 = 1 => a*b + a - b <= 1). I think we have to acknowledge that and create an environment that fosters rapid progress towards the grand challenge; that's why I was proposing a very easy benchmark to begin with, and given the discussion I still believe that would be valuable as a prelude to a benchmark that would align with the objective of updating mathematicians' perspective on formal systems.
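
For instance, a statement-only Lean formalization of the second one might look roughly like this (an untested sketch, just to show how little work the statement side is):

    -- rough Lean 3 sketch (untested); only the statement is formalized,
    -- the proof is exactly what the benchmarked system would have to produce
    import data.real.basic

    example (a b : ℝ) (h : a^2 + b^2 = 1) : a * b + a - b ≤ 1 :=
    sorry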

Stanislas Polu (Sep 16 2020 at 11:06):

@Reid Barton yep totally true. Do we have others like this one: https://github.com/jsm28/bmo2-2020-lean ?

Reid Barton (Sep 16 2020 at 11:10):

There's also https://github.com/leanprover-community/mathlib/blob/master/archive/imo1988_q6.lean and I know some other IMO problems like 2018/5, 2019/1, 2019/4 have been formalized, but perhaps those have not been collected anywhere.

Minchao Wu (Sep 16 2020 at 11:11):

Stanislas Polu said:

that's why I was proposing a very easy benchmark to begin with, and given the discussion I still believe that would be valuable as a prelude to a benchmark that would align with the objective of updating mathematicians' perspective on formal systems.

Totally agreed here

Reid Barton (Sep 16 2020 at 11:11):

But it might be better to start from natural-language solutions (e.g. the official solutions) and try to translate them as faithfully as possible into formalized versions, rather than work from solutions that are already formalized and maybe as a result fit to the particular prover/library.

Johan Commelin (Sep 16 2020 at 11:12):

Getting an AI to prove 1988q6 would be pretty sick. It's one of the most famous IMO problems, known for being impenetrable.

Reid Barton (Sep 16 2020 at 11:49):

Regarding Kevin's points, I agree it's hard to understand the real significance of proving 37% of the results in the Mizar standard library, even for someone who is in the area of formalizing mathematics.
But as far as reception in the broader mathematical community is concerned: my conjecture is that there's no technical achievement feasible in the next couple of years that would really force the generic mathematician to sit up and pay attention. It's just too easy for them to move the goalposts and say that what has been achieved was easy--an effect that must be familiar to AI researchers. Even if we solve the IMO GC, while many mathematicians will surely be excited, there will be others who say: "I was never any good at IMO problems anyways, so what could they have to do with real mathematics?"
So, as much as I think advocacy efforts like Kevin's are important, and while it might make Kevin's work more difficult, I feel we should not optimize too much for what we could "sell" to mathematicians in general, at least at the moment.

Reid Barton (Sep 16 2020 at 11:51):

I'm reminded of a story that Kevin tells in which he met some famous mathematician at a conference and told the mathematician about how he had been working on formalizing mathematics in Lean or whatever, and the mathematician's first question was: "Can computers tell us anything new?"
Well your undergrads can't tell you anything new, can they? But you still teach them Galois theory anyways.
Why not teach a computer Galois theory, and problem-solving strategies, and so on?

Stanislas Polu (Sep 16 2020 at 11:54):

(Poor undergrads :laughing:)

Stanislas Polu (Sep 16 2020 at 11:55):

I think what we should focus on is a benchmark that would make people like you or Kevin update their perspective on what these systems can do.

Reid Barton (Sep 16 2020 at 12:12):

By the way, I really like the example problem in this post by @Daniel Selsam as an example of why filling in "obvious" steps in proofs can be a lot harder than it might seem (though I suspect he would be more optimistic about our prospects of solving problems like this today :upside_down:)

Kevin Buzzard (Sep 16 2020 at 12:14):

It was Bryan Birch who asked me if they could tell us anything new

Kevin Buzzard (Sep 16 2020 at 12:14):

It was his first and only question about my new endeavour

Kevin Buzzard (Sep 16 2020 at 12:16):

We were at Swinnerton-Dyer's memorial service. Birch and Swinnerton-Dyer used an early computer to compute a bunch of solutions to cubic equations modulo lots of small primes in the 1960s and then made a conjecture which caught on

Jason Rute (Sep 16 2020 at 12:35):

I agree that there are really two advocacy efforts that are fairly disjoint. One is to convince mathematicians that formalization is of value. Like Reid said above, I don't think much in the way of technical feats is going to help in the short term, not until we come up with a way to actually help mathematicians solve more new problems in such a way that it builds understanding and theory, which is still down the road in my opinion.

The other advocacy is convincing ITP users like Kevin and ITP designers like Leo that ML methods will make a tangible difference in their systems. Benchmarks are just numbers which can be gamed in a number of different ways. I think the most important step is to put usable ML-backed tools in the hands of users like Kevin and see if they are useful. I like what @Lasse, for example, is doing with Tactician, where he uses older-but-more-CPU-friendly ML techniques to create a system which can be integrated with Coq, and is testing it on Coq users. Also @Stanislas Polu is making a Metamath assistant website which can suggest next steps. We really need to convince, say, Kevin that ML is the next sledgehammer and the next library_search, and find ways to integrate it into systems like Lean.

Stanislas Polu (Sep 16 2020 at 13:13):

Completely tangential: I thiiink this would be a :metamath: formalization of the problem in Daniel's post.
Screen-Shot-2020-09-16-at-15.12.14.png

Johan Commelin (Sep 16 2020 at 13:14):

Generated by GPT-f?

Stanislas Polu (Sep 16 2020 at 13:14):

Haha, no, myself. You would have to ask Szegedy for an autoformalization system :D

Johan Commelin (Sep 16 2020 at 13:15):

Note that you cheated... you recognized that 13 = 12 + 1 and 25 = 2 * 12 + 1. That's an important step, I think.

Johan Commelin (Sep 16 2020 at 13:15):

(For an AI, at least.)

Stanislas Polu (Sep 16 2020 at 13:16):

True that, and I also made an error doing so: it should be K + 1 in the last hypothesis

Stanislas Polu (Sep 16 2020 at 13:18):

I wanted to go for a general lemma to prevent brute force enumeration

Jesse Michael Han (Sep 16 2020 at 13:25):

For those of you who saw Dan's talk on the IMO Grand Challenge yesterday and would like to learn more, I have a blog post where I discuss the SearchT monad transformer, ARC, and our work on building an IMO geometry solver: https://jesse-michael-han.github.io/blog/imo-gc-geo/

Reid Barton (Sep 16 2020 at 13:31):

What I like so much about this problem is: how did you know the right generalization was to a string of length 2n+1 containing n+1 'a's unless you already knew roughly how the argument should go?

Reid Barton (Sep 16 2020 at 13:32):

Or is it that we can guess a relationship between 25 and 13 and this guides our search for a proof?

Reid Barton (Sep 16 2020 at 13:32):

There's a lot to unpack in this little step.

Daniel Selsam (Sep 16 2020 at 13:57):

Patrick Massot said:

I agree with everything Kevin and Reid wrote. In particular I like the IMO challenge idea but Kevin is right to point out that a lot of mathematicians (maybe 99%) have no IMO skill at all.

Might these mathematicians be even more appreciative of an automatic "puzzle"-solving assistant?

Johan Commelin (Sep 16 2020 at 14:20):

They might think that it is cute.

Johan Commelin (Sep 16 2020 at 14:21):

Until computers start proving lemmas that they need in their daily work, but for which they don't see immediately what the proof is.... they will probably not care too much.

Daniel Selsam (Sep 16 2020 at 14:28):

Kevin Buzzard said:

This is one trap which the metamath paper did not fall into, because there it says "here is a proof we found which is better than the proof in the metamath library"

But even here, how many people looked into the simpler proofs they found, and the existing minimization tools that didn't find them? Here is the first one I looked at: http://us.metamath.org/mpegif/ndmima.html

Abstractly, the proof is just the trivial two-step proof that:

P x y
P x y <-> Q x y
Q x y <-> R x y
|- R x y

There were only a few thousand lemmas in the environment at the time, and chances are only a few of them unify with the goal, so it seems the simplest possible back-chaining baseline would find this proof in a few milliseconds. As far as I can tell, the only reason the original manual proof was longer is that it was written in 1998 and the lemma P x y <-> Q x y was not added until 2007.

http://us.metamath.org/mpegif/ndmimaOLD.html
http://us.metamath.org/mpegif/imadisj.html

Not all of the simplified proofs look this trivial, but clearly the existing minimization tools do not constitute a strong baseline. I still find the work very difficult to assess in its current state.
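
To be concrete, by the "simplest possible back-chaining baseline" I mean something like the following toy sketch (illustrative only; a real baseline would unify against the Metamath statements rather than compare strings):

    # Toy depth-limited backward chaining over a lemma database.
    def backchain(goal, lemmas, facts, depth=3):
        if goal in facts:
            return True
        if depth == 0:
            return False
        for conclusion, premises in lemmas:
            if conclusion == goal:  # a real prover would unify here
                if all(backchain(p, lemmas, facts, depth - 1) for p in premises):
                    return True
        return False

    # The ndmima-style example: P holds, P <-> Q and Q <-> R are library lemmas,
    # and the goal is R (each biconditional encoded as two one-way rules).
    facts = {"P"}
    lemmas = [("Q", ["P"]), ("P", ["Q"]), ("R", ["Q"]), ("Q", ["R"])]
    print(backchain("R", lemmas, facts))  # True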

Mario Carneiro (Sep 16 2020 at 14:31):

I agree that it is probably possible to accomplish what OpenAI did in Metamath with a simpler / less computationally intensive framework. But to me the important fact is that they did it: they actually built a system and contributed things to a formal library, and now OpenAI is credited with a set of theorems quite respectable for a human contributor

Mario Carneiro (Sep 16 2020 at 14:32):

yes, it's easy to look at the problems post hoc and say "that's easy" but isn't all math like that?

Daniel Selsam (Sep 16 2020 at 14:33):

Mario Carneiro said:

But to me the important fact is that they did it: they actually built a system and contributed things to a formal library, and now OpenAI is credited with a set of theorems quite respectable for a human contributor

I agree, this part is great.

Mario Carneiro (Sep 16 2020 at 14:34):

I'm not an ML researcher so I can't make any observations regarding what GPT-f means in terms of technical theorem proving ability, but it looks like a step forward in terms of ML/ITP interaction

Johan Commelin (Sep 16 2020 at 14:35):

We only need a solid Lean - MM0 interface, and GPT-f0.
Then we're all set (-;

Mario Carneiro (Sep 16 2020 at 14:35):

:P

Patrick Massot (Sep 16 2020 at 14:36):

I have a question about geometry in the IMO challenge that neither watching the talk nor skimming through the blog post answers. Do you actually have Lean code formalizing geometry problems? Did you formalize Euclidean geometry in Lean 4?

Daniel Selsam (Sep 16 2020 at 14:37):

This thread is largely about how to measure incremental progress, with solving IMO problems as one of the long-term goals. My point was only that finding simpler proofs of theorems in a large corpus is not an ideal metric, especially in the absence of ablations.

Daniel Selsam (Sep 16 2020 at 14:43):

Johan Commelin said:

Until computers start proving lemmas that they need in their daily work, but for which they don't see immediately what the proof is.... they will probably not care too much.

Yes, I meant that, assuming IMO-style "puzzles" do arise in mathematical work periodically, the mathematicians who are less interested in or skilled at puzzle-solving may derive the greatest utility from a (hypothetical) automatic puzzle-solving assistant.

Kevin Buzzard (Sep 16 2020 at 16:32):

Mario Carneiro said:

yes, it's easy to look at the problems post hoc and say "that's easy" but isn't all math like that?

When we were looking at formalising a whole bunch of things about localisations using only the universal property, it looked pretty daunting. Then Strickland came along and pointed out that f : R -> A was a localisation at S iff (three simple-to-write conditions on f which involved no quantification over all R-algebras) and this was a crucial step. I explain this observation, which unblocked schemes II, in talks in seminars sometimes and I always say "Proof: trivial". Because it is -- the hard part was isolating the key basic API.

Daniel Selsam (Sep 16 2020 at 16:37):

@Stanislas Polu @Reid Barton

I agree it would be great to have easier milestones along the way, though I feel very strongly that even for these easier milestones, systems must be checksummed before the problem statements are known. How about this as a starting point:

Every year, before the IMO shortlist is made public, we have a deadline for submitting system checksums. Then, once the shortlist is released, systems can be evaluated on several relaxations of the problems, for example:

  • the original problems
  • the desired theorem statements to determine problems
  • the original problems, where all terms that are used in the official solution are provided
  • all intermediate steps of each problem used in the official solution
  • various interpolations thereof (e.g. some intermediate steps there, some removed)

Thoughts?

Mario Carneiro (Sep 16 2020 at 16:49):

what do you mean by checksumming here?

Daniel Selsam (Sep 16 2020 at 17:09):

Mario Carneiro said:

what do you mean by checksumming here?

Think of it like a hash of the system. Suppose I release a checksum before the problems are released, and then once they are released, I claim my system solves them and I make the system available. A third party can now run the system, confirm that it solves the problems, and also compare the checksum they compute for the system with the submitted checksum to confirm that it is indeed the exact system as it existed before the problems were released. However, I am not suggesting this approach specifically -- I used "checksummed" broadly to mean some way of proving the artifact was finalized before the problems were released.

Ideally, teams directly submit Docker images before the problems are released, and then we just run them all on Azure once the problems are released. We could give every team an initial gratis resource budget (say, $100 of Azure credit per submission) and teams whose systems require more resources could pay extra accordingly.

Daniel Selsam (Sep 16 2020 at 17:15):

Note: I am okay with teams hiding arbitrary expenditures during the training process, and I am also not inclined to impose cost restrictions on test-time execution, but I think it is important that the cost of running the system at test time at least be transparent.

Daniel Selsam (Sep 16 2020 at 17:23):

Also, although I said "team", I want to stress that I see the IMO-GC as first and foremost a community collaboration, not a competition. It is great if some group introduces some special sauce at the end to cross the finish line, but I think all plausible approaches will need to be built on top of the community formalization effort. As I stressed in my talk, I don't see formalizing and compressing historical proofs as merely producing "training data"; I see it as effectively building the solver itself.

Stanislas Polu (Sep 16 2020 at 17:31):

Daniel Selsam said:

  • the desired theorem statements to determine problems

Not sure I understand this. What do you mean by that?

Thoughts?

That seems like a great way to proceed, especially as this will produce a continuous yearly stream of problems against which we can benchmark ourselves as a community.

Daniel Selsam (Sep 16 2020 at 17:35):

Stanislas Polu said:

Daniel Selsam said:

  • the desired theorem statements to determine problems

Not sure I understand this. What do you mean by that?

The issue is explained at https://github.com/IMO-grand-challenge/formal-encoding/blob/master/design/determine.lean

Joseph Myers (Sep 16 2020 at 19:46):

Kevin Buzzard said:

There is a huge source of Olympiad problems available, and Lean is getting to the point where it can understand pretty much all of them, with Joseph Myers' recent work on affine geometry enabling formalisations of problems involving e.g. circles, triangles etc (indeed Joseph just finished formalising all the solutions to this year's British Maths Olympiad round 1).

I run a formalizing club at Imperial and I also run a puzzle-solving club where we work on Olympiad problems; when it all starts up again in October I could see if I could get the puzzle-solvers formalising olympiad questions.

Actually it was round 2 (easier problems sometimes seem harder to come up with a good formal statement of, because more of the difficulty for the human contestants is in understanding the problem and translating the words into maths, rather than in what goes into the formal proof - though BMO1 is generally hard enough that this shouldn't be an issue, even if sometimes there are problems that can be solved by norm_num). But the geometry problem on the most recent BMO round 1 can't be stated until we define "tangent", and what it means for a point to be outside a triangle.

Joseph Myers (Sep 16 2020 at 19:53):

Neither of the geometry problems from IMO 2019 can yet be stated cleanly. Problem 2 for fairly trivial reasons: we need to define "between" and "concyclic". (I have a definition of cospherical, and some basic lemmas for it, ready to PR once #4127 is in; "concyclic" is then cospherical plus coplanar.) Problem 6 will need a bit more (incircle plus tangency). Exercise for readers who haven't seen the IMO 2019 shortlist (which becomes public next week): how many of the eight geometry problems on it do you guess could be stated naturally in the current state of mathlib (without needing to work around the lack of definitions that belong in mathlib but aren't there yet)?

Joseph Myers (Sep 16 2020 at 20:31):

A few more comments for people thinking of formalising olympiad problems. (a) Any definitions and lemmas plausibly useful in mathlib should be added there rather than just left as part of the statement of or solution to an individual problem. (b) There are lots of choices to make about how to state a problem, not just about how to solve it, and it may be worth thinking about the relative merits of different styles of formal problem statements. E.g. in my formal statements, I tended to separate out the different pieces of the problem statement into separate defs (convenient for then proving lemmas about those defs separately) which then get combined into the problem statement, whereas the examples in the IMO Grand Challenge repository tend to go for the approach of putting the whole thing in a single very long def. How much should subtypes versus separate hypotheses be used when a statement involves e.g. positive integers? How much should the formal statements avoid depending on things such as nat subtraction or division that deviate from mathematical conventions, to ensure the formal statement means something as close as possible to the informal statement? How close should combinatorial geometry statements be to the geometrical formulation (take IMO 2018 problem 4 as an example; the shortlist version was clearly purely combinatorial, the final IMO wording ended up with the geometrical statement involving √5 because one leader noted their students wouldn't be familiar with chess)? (c) This is clearly an area where it makes sense to formalise multiple different solutions to the same problem. (In the IMO 2019 shortlist I encouraged the inclusion of lots of alternative solutions.)

Michael R Douglas (Sep 16 2020 at 21:22):

To do something useful for mathematical scientists, one could try to take an existing systematic proof method which is very laborious to use, and make it easier to use. An example I have thought about is interval arithmetic, which is used as part of many computer-assisted proofs in ODE and PDE. In some sense one is dividing the problem into cases and proving bounds, but the computer can handle millions of cases. Some mathematicians find this interesting and useful; see for example https://aimath.org/workshops/upcoming/compproofs/. It is a lot of work to set up such computations and a lot of work to turn the results into a rigorous proof - see Fabian Immler's thesis https://europepmc.org/articles/PMC6044317/ for one of the few (only?) cases where this was done using a theorem prover. There are a lot of similar dynamical systems problems and even more PDE problems, so systematizing this would produce a lot of new proofs.
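
As a toy illustration of the flavor (nothing like the rigorous, outward-rounding implementations used in actual computer-assisted proofs):

    # Toy interval arithmetic: split a problem into cases and prove bounds.
    from dataclasses import dataclass

    @dataclass
    class Interval:
        lo: float
        hi: float

        def __add__(self, other):
            return Interval(self.lo + other.lo, self.hi + other.hi)

        def __mul__(self, other):
            ps = [self.lo * other.lo, self.lo * other.hi,
                  self.hi * other.lo, self.hi * other.hi]
            return Interval(min(ps), max(ps))

    def f(x):
        # enclosure of f(x) = x * (x - 2) when x ranges over the interval
        return x * (x + Interval(-2.0, -2.0))

    # "Prove" that x * (x - 2) <= 0 on [0, 1] by checking 1000 subintervals.
    n = 1000
    assert all(f(Interval(i / n, (i + 1) / n)).hi <= 0 for i in range(n))

The real applications do this with verified rounding and for much harder ODE/PDE estimates, but the divide-into-cases-and-bound structure is the same.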

Junyan Xu (Sep 17 2020 at 01:13):

Related: http://www.crm.math.ca/camp-nonlinear/#videos computer-assisted mathematical proofs in nonlinear analysis

Junyan Xu (Sep 17 2020 at 01:26):

Jason Rute said:

Also, speaking of tree search, has anyone else noticed that tactic-based proving is a two-player zero-sum game? Player A chooses the tactics. Player B chooses one subgoal to work on. If Player A has a perfect strategy the theorem is provable. If Player B does, the theorem is not provable. I think this could speed up RL for theorem proving (since one doesn't have to explore all subgoals to get meaningful policy and value updates). It also probably means that maybe we should start on the hardest subgoal instead of the first subgoal (or the easiest subgoal). Anyway, food for thought...

Holophrasm (also for Metamath) https://arxiv.org/abs/1608.02644 has red and blue nodes; your player A selects action at red nodes and B at blue nodes. It uses UCT. "At a red node, the traversal proceeds to the highest valuation child blue node. At a blue node, the traversal proceeds to the worst-performing child." So it's sort of adversarial, though not explicitly stated as a two-player zero-sum game.

Junyan Xu (Sep 17 2020 at 02:13):

Kevin Buzzard said:

There is this team in Oxford trying to compute invariants of mathematical objects and they say things like "we predicted the rank of 80% of the elliptic curves in this database"

Don't you find such work interesting? "Deep Learning Gauss--Manin connections", https://arxiv.org/abs/2007.13786, involving students of John Voight and Gerard van der Geer.
And is this speaker a member of the Oxford team? http://www.physicsmeetsml.org/posts/sem_2020_07_15/ (String data and machine learning, by Andre Lukas) Apparently elliptic curves aren't their main focus.

Kevin Buzzard (Sep 17 2020 at 06:42):

My experience with this quintic threefold / elliptic curve work is that when I saw a talk about it, I had some questions about trying to make sense of the results, and the speaker was not able to answer them because they said they had never looked at the data, they were just reporting the percentages, so I found it very hard to judge.

I would revisit this but I am in work hell at the minute with a ton of undergraduate videos to make, so I need to focus on that for the next three weeks.

Kevin Buzzard (Sep 17 2020 at 08:24):

@Junyan Xu I've now found the time to look at the Gauss--Manin work and I find it much more impressive. They explain what they're doing and are not just treating it all as a push-button technique. They give tables of ranks of Picard groups which make it clear that the answer is not "96% of them are 0" (this is my general fear in this area).

Minchao Wu (Sep 18 2020 at 02:02):

Mario Carneiro said:

I agree that it is probably possible to accomplish what OpenAI did in Metamath with a simpler / less computationally intensive framework. But to me the important fact is that they did it: they actually built a system and contributed things to a formal library, and now OpenAI is credited with a set of theorems quite respectable for a human contributor

Actually @Thibault Gauthier 's TacticToe has also contributed proofs to the HOL4 library:
https://github.com/HOL-Theorem-Prover/HOL/commit/8990db3c7743feb190d98f4bce666d3163022557
But since TacticToe is not deep-learning based, OpenAI's claim (i.e., the first time a deep-learning-based system has contributed proofs that were adopted by a formal mathematics community) is still legit.

Jason Rute (Sep 21 2020 at 13:07):

It looks like the videos and (some of) the slides are posted. http://aitp-conference.org/2020/

Minchao Wu (Sep 21 2020 at 15:05):

It seems that the slides of my talk are not correctly linked. Here they are: https://minchaowu.github.io/assets/pdf/aitp2020talk.pdf


Last updated: Dec 20 2023 at 11:08 UTC