Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Evaluation of Formal proofs


Siva Gollapalli (Jul 18 2024 at 03:08):

Someone please help me with this,
If I have generated a formal proof, how do I evaluate if it is correct or not, can you provide an example also

Kim Morrison (Jul 18 2024 at 03:15):

Unless you give us more information about your background and what you've tried already, I think the best response is that you should read the learning resources at

https://leanprover-community.github.io/learn.html

Siva Gollapalli (Jul 18 2024 at 04:22):

I have used the machine translation evaluation metrics like BLEU but it seems, that is not that good for the proof evaluation.
In MiniF2F we don't have ground truth formal proofs right, so how are evaluating the proofs generated by our models

Jason Rute (Jul 18 2024 at 04:31):

There are a few ways. BLEU is absolutely not the right way. The point of Lean is that it can check your work. Please take some time to at least learn a bit more about lean. If you understand how lean works it will make it easier to explain how to check things from the command line.

Jason Rute (Jul 18 2024 at 04:32):

(For those who don’t know, BLUE is a similarity metric used in machine translation scoring.)

Kim Morrison (Jul 18 2024 at 12:16):

I'd suggest reading some basic tutorials so you understand how the proof checking interaction works for humans. After that it's easy to either pipe a proof to the lean executable and look at exit codes or error messages, or to use the https://github.com/leanprover-community/repl.

Jason Rute (Jul 18 2024 at 13:56):

So your first concern is whether to use Lean 3 or Lean 4. The original MiniF2F and Facebook’s fixed version are in Lean 3, but Lean 3 has been replaced with Lean 4 and this year’s papers have been using Lean 4. Note, GPT-4 I think still likes to output syntax closer to Lean 3 and needs some coaching to get Lean 4 syntax.

The next question is which version of MiniF2F to use. For Lean 3, use the Facebook version. For Lean 4, I guess the yangkai11 repo is the current de facto standard, but it has many errors. Don’t use the Hoskinsons Center’s huggingface version!

The rest is assuming you are using Lean 4. I assume you aren’t using a language agent and just have an LLM that outputs whole proofs. That will make your life easier for benchmarking. Then you have many options. The simplest is to have a file for each problem, replace sorry with the proof, and then just to Lean from the command line on that file and check for errors. I don’t remember the command off-hand, but I could find it. You will need to make sure you have installed Mathlib appropriately (including with compiled binaries).

Finally, there are a few types of proofs that are bad. sorry is cheating and Lean will warn you. There are other tactics that are also cheating and Lean won’t warn you. This may not come up, but ideally, you would also print the axiom list with #print axioms theorem_name to be sure no weird axioms got in.

I highly recommend Machine Learning researchers either partner with a Lean expert or take the time to understand Lean well.

Jason Rute (Jul 18 2024 at 14:21):

Also, maybe some other ML researcher is willing to share their checking code. Many of them are doing the same thing (checking LLM whole proof suggestions for MiniF2F), so it wouldn't hurt to have a good template to get started.

Ralf Stephan (Jul 18 2024 at 16:34):

It just depends when they scraped the input. Claude 3.5 does full Lean4, I think, for example.

Eric Wieser (Jul 18 2024 at 23:50):

Jason Rute said:

Don’t use the Hoskinsons Center’s huggingface version!

If you feel strongly about this, is it worth contacting the Hoskinson center and suggesting it be taken down / have a warning attached? This might be better resolved privately between you and them, since asking people to throw away their work is not something you want a dogpile to form around.

Eric Taucher (Jul 19 2024 at 11:13):

Jason Rute said:

Note, GPT-4 I think still likes to output syntax closer to Lean 3 and needs some coaching to get Lean 4 syntax.

Yes, this is still true when using ChatGPT.


Last updated: May 02 2025 at 03:31 UTC