Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Solving (Some) Formal Olympiad Problems


Bolton Bailey (Feb 02 2022 at 19:27):

An OpenAI blog post today that Kevin tweeted. Very exciting!
The mathd_train_algebra_217 theorem seems to have a typo in the formalization: (h₁ : ∀ x, f x = b * x + a) should be (h₁ : ∀ x, g x = b * x + a). Hopefully this doesn't exist in the actual dataset.
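For reference, here is roughly what the corrected statement should look like; apart from the h₁ fix, the exact formulation is my reconstruction from the informal problem, not a verbatim copy of the dataset:

import data.real.basic

-- Reconstruction of the corrected statement; hypothesis names other than
-- the h₁ fix are my best guess at the miniF2F formulation.
theorem mathd_train_algebra_217 (a b : ℝ) (f g : ℝ → ℝ)
  (h₀ : ∀ x, f x = a * x + b)
  (h₁ : ∀ x, g x = b * x + a)  -- was `f x = b * x + a`, constraining f twice
  (h₂ : a ≠ b)
  (h₃ : ∀ x, f (g x) - g (f x) = b - a) :
  a + b = 0 :=
sorry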

Stanislas Polu (Feb 02 2022 at 19:52):

Bolton Bailey said:

An OpenAI blog post today that Kevin tweeted. Very exciting!
The mathd_train_algebra_217 theorem seems to have a typo in the formalization: (h₁ : ∀ x, f x = b * x + a) should be (h₁ : ∀ x, g x = b * x + a). Hopefully this doesn't exist in the actual dataset.

Oh! Very nice catch! The statement is indeed inaccurate and, as a consequence, much easier than the original one. The proof remains valid for the statement we show, though. This is what happens when you release things and have a large number of people looking at your work! :) Thank you for that.

This is not the first time this has happened. These statements were formalized by student contractors and, as humans, they naturally make mistakes.

Let me think about what to do here. I could replace it with another statement, but this proof makes a very clear point about a capability of the model that we want to showcase. Maybe I'll add a footnote disclaimer; would you feel that's fair?

Bolton Bailey (Feb 02 2022 at 20:08):

I think that's perfectly fair. I don't think a minor typo like this is a very big deal, since it doesn't actually affect the truth of the theorem itself.

Stanislas Polu (Feb 02 2022 at 20:13):

Bolton Bailey said:

I think that's perfectly fair. I don't think a minor typo like this is a very big deal, since it doesn't actually affect the truth of the theorem itself.

Well, it does make the problem trivial where it was supposed to be somewhat challenging, so we do need to be explicit about it :+1:

Bolton Bailey (Feb 02 2022 at 20:17):

Alternative interpretation: it actually makes the problem more challenging, in that it includes this unnecessary h₃ as a red herring for the prover to puzzle over :grinning:.

Stanislas Polu (Feb 02 2022 at 20:21):

That's a very positive way to spin this :D

Mario Carneiro (Feb 02 2022 at 21:56):

@Stanislas Polu There is also a typo in the informal proof of problem 1: it ends with x - p = x - 2 - 2p but it should say x - p = 2 - 2p

Stanislas Polu (Feb 02 2022 at 23:03):

Mario Carneiro said:

Stanislas Polu There is also a typo in the informal proof of problem 1: it ends with x - p = x - 2 - 2p but it should say x - p = 2 - 2p

Indeed! Let me fix it :+1: Thanks thanks!

Stanislas Polu (Feb 03 2022 at 00:29):

@Mario Carneiro fixed now :+1:
@Bolton Bailey added the footnote here: https://openai.com/blog/formal-math/#fn3

Thanks to both of you for catching these :+1:

Johan Commelin (Feb 03 2022 at 06:25):

@Stanislas Polu @Jesse Michael Han Congrats! Very nice blog post. Could you please indicate whether the examples posted in the blog post are "the first thing the AI came up with" or "solution n out of X attempts"?

Fabian Glöckle (Feb 03 2022 at 10:19):

Impressive!
In appendix F, problem mathd_algebra_140, what do you mean by "This proof is interesting as it demonstrates the model's ability to evaluate symbolic expressions implicitly."?
It seems to me that evaluating h₀ at any three points would work (but of course 1, 2, 3 is a very natural choice).

Jason Rute (Feb 03 2022 at 10:51):

Johan Commelin said:

Stanislas Polu Jesse Michael Han Congrats! Very nice blog post. Could you please indicate whether the examples posted in the blog post are "the first thing the AI came up with" or "solution n out of X attempts"?

I'm not sure this is as meaningful as you think. If I understand correctly, there are two search loops in the paper. The inner loop is a best-first tree search where the model comes up with tactics, tries them out in Lean, and chooses which branch of the search tree to explore next. I assume once they reach "no goals" they stop, since they have found a proof. The outer loop continually retrains their model (expert iteration) so it gets better based on proofs it found for other problems. So if I understand correctly, that proof is the first proof it found, but only after a possibly long search. The number of "attempts" might then be interpreted as how many rounds of expert iteration it took, but a failed attempt is not a suggested proof; it is a tree search which didn't reach "no goals".
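For concreteness, here is a minimal sketch of the kind of inner loop I mean; the model/Lean API names (model.propose, model.score, lean.apply) are hypothetical stand-ins, not taken from the paper:

import heapq
import itertools

def best_first_proof_search(root_goal, model, lean, max_expansions=512):
    """Best-first search over proof states: the model proposes and scores
    tactics, Lean applies them. The API here is hypothetical."""
    tie = itertools.count()  # tie-breaker so the heap never compares states
    frontier = [(-model.score(root_goal), next(tie), root_goal, [])]
    for _ in range(max_expansions):
        if not frontier:
            return None  # search space exhausted without reaching "no goals"
        _, _, state, script = heapq.heappop(frontier)
        for tactic in model.propose(state):      # sampled tactic candidates
            result = lean.apply(state, tactic)   # run the tactic in Lean
            if result is None:
                continue                         # tactic failed to apply
            if result.no_goals:                  # proof complete
                return script + [tactic]
            heapq.heappush(frontier,
                           (-model.score(result), next(tie),
                            result, script + [tactic]))
    return None  # budget exhausted: this is what a "failed attempt" means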

Johan Commelin (Feb 03 2022 at 10:55):

Aha, but after retraining (in the outer loop) there's a chance that a proof for P that was found before is no longer found. Or maybe a different proof is found. Does it make sense to ask whether the current model can "routinely" solve all the problems that it ever solved?

Fabian Glöckle (Feb 03 2022 at 11:28):

"Performance is reported in terms of pass rate (percentage of successful proof searches) as a function of the number of attempts per statement, noted pass@k where k is the number of attempts per statement at test time. To reduce noise in these metrics we generally run more than k attempts at test time (generally 32 to compute pass@1 and pass@8), averaging across attempts as needed to obtain a smoother pass@k value"

Fabian Glöckle (Feb 03 2022 at 11:31):

And the graphs in the paper report the pass rates as a function of the number of expert iterations (outer loops)

Will Sawin (Feb 03 2022 at 15:49):

I have some random questions:
(1) There are example proofs found by the model of theorems with proofs already in mathlib. I can tell that the model proofs are sometimes shorter, but I can't really evaluate the code style. Are they less clearly written than mathlib proofs? Are some of them better?
(2) Some of the proofs have lines that don't do anything in them. Would it make sense to prune these lines before using the proofs as training data for the next iteration, to provide higher-quality work to train on?
(3) Could one test the quality of the proofsize objective by running it on both true and false statements and seeing how quickly it realizes the true statements are more promising? Specifically, I'm thinking of the multiple-choice questions in the data set. If the model runs a search starting with a tree consisting of leaves for the true answer and also the wrong answers, what % of the time will be spent exploring the wrong answers before finding the proof for the correct answer?

Kevin Buzzard (Feb 03 2022 at 16:16):

(3) is interesting. As people who have looked at the work carefully will realise, there's a problem with multiple-choice questions, in the sense that you have to decide how to formalise them. The simplest approach is taken here -- just ask the computer to prove the true statement.

Fabian Glöckle (Feb 03 2022 at 16:51):

One could also formalize the "(a) or (b) or (c) or (d)" statement and hope that the model picks one, instead of dark-magically proving that one of the options must be true without stating which ;) If it uses a deconstructing tactic early in the proof search, this should be more or less equivalent to (3).
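For instance, a toy Lean statement like this (made up for illustration, not from the dataset) forces the prover to commit to a disjunct before proving it:

-- Toy four-choice problem formalized as a disjunction.
example : (2 : ℕ) + 2 = 3 ∨ 2 + 2 = 4 ∨ 2 + 2 = 5 ∨ 2 + 2 = 6 :=
by { right, left, refl }  -- commits to choice (b) before proving it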

Fabian Glöckle (Feb 03 2022 at 16:53):

Concerning Will's other points, a "proof linter" might also be a tool our CS people could aim for without any ML magic in it.
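A minimal version could simply replay the tactic script with each line removed and keep the deletions Lean still accepts. A sketch, assuming a hypothetical check callback that reruns a script in Lean and reports whether it still closes the goal:

def prune_proof(tactics, check):
    """Greedily drop tactic lines the proof does not need. `check(script)`
    is a hypothetical callback that replays the script in Lean."""
    pruned = list(tactics)
    i = 0
    while i < len(pruned):
        candidate = pruned[:i] + pruned[i + 1:]  # try dropping line i
        if check(candidate):                     # proof still goes through
            pruned = candidate                   # keep the shorter script
        else:
            i += 1                               # line i was actually needed
    return pruned

# Example with a fake checker: only the `nlinarith` line is kept.
print(prune_proof(["intro h", "ring_nf", "nlinarith"],
                  lambda s: "nlinarith" in s))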

Stanislas Polu (Feb 03 2022 at 18:09):

Johan Commelin said:

Aha, but after retraining (in the outer loop) there's a chance that a proof for P that was found before is no longer found. Or maybe a different proof is found. Does it make sense to ask whether the current model can "routinely" solve all the problems that it ever solved?

Thanks @Johan Commelin

Throughout the expert iterations there is no guarantee that the model will re-prove something that it proved before. Generally speaking, the performance as measured by pass rate (as quoted by @Fabian Glöckle) does improve, but some challenging theorems may get proved only once throughout the entire process. That being said, it is generally easy to prove such a theorem again by taking the latest model and attempting proof search for it many times.
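Schematically, the outer loop looks something like this (a simplified sketch with hypothetical stand-ins for proof search and fine-tuning, not our actual training code):

def expert_iteration(model, statements, search, rounds=8):
    """Alternate proof search and retraining. `search(stmt, model)` returns
    a tactic script or None; `model.finetune(...)` returns an updated
    model. Both are hypothetical stand-ins."""
    proved = {}
    for _ in range(rounds):
        for stmt in statements:
            proof = search(stmt, model)
            if proof is not None:
                proved[stmt] = proof  # no guarantee it is re-found next round
        model = model.finetune(list(proved.items()))  # train on proofs so far
    return model, proved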

Hope that answers your question?

Johan Commelin (Feb 03 2022 at 18:12):

Yeah, it does. Thanks!

Stanislas Polu (Feb 03 2022 at 18:14):

Will Sawin said:

I have some random questions:
(1) There are example proofs found by the model of theorems with proofs already in mathlib. I can tell that the model proofs are sometimes shorter, but I can't really evaluate the code style. Are they less clearly written than mathlib proofs? Are some of them better?
(2) Some of the proofs have lines that don't do anything in them. Would it make sense to prune these lines before using the proofs as training data for the next iteration, to provide higher-quality work to train on?
(3) Could one test the quality of the proofsize objective by running it on both true and false statements and seeing how quickly it realizes the true statements are more promising? Specifically, I'm thinking of the multiple-choice questions in the data set. If the model runs a search starting with a tree consisting of leaves for the true answer and also the wrong answers, what % of the time will be spent exploring the wrong answers before finding the proof for the correct answer?

(1) I would defer to the mathlib maintainers to answer that :grimacing:
(2) We're actively exploring this these days. Completely agreed that it should be beneficial :+1:
(3) Gathering all the multiple-choice questions and testing the value function on them sounds extremely interesting indeed :+1: It would be a very neat metric for the quality of the value function, and one that people can easily relate to.

Johan Commelin (Feb 03 2022 at 18:35):

About (1): the proofs that I've seen so far are very much "mathlib style" but with more comments (which is a good thing!)

Arthur Paulino (Feb 03 2022 at 18:38):

I think the comments are added by hand afterwards just to explain what is happening to people who don't understand Lean
(or are you talking about different comments?)

Johan Commelin (Feb 03 2022 at 18:44):

Yeah, I guess that makes sense. Sorry, I should have taken another look before commenting. I just remember that I looked at them yesterday and thought: "that looks a lot like mathlib style, but with more comments". But of course the model would have to be very self-aware to write things like

-- The model directly proposes n + 1 as the solution.

Arthur Paulino (Feb 03 2022 at 18:52):

The blog post has reached the Data Science team at the company I work for, and it wasn't me who posted it. People are loving it :smiley:

Yuri de Wit (Feb 04 2022 at 00:25):

Arthur Paulino said:

The blog post has reached the Data Science team at the company I work for, and it wasn't me who posted it. People are loving it :smiley:

Get some of them into Lean too! :-)

