Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: MiniF2F


Stanislas Polu (May 10 2021 at 10:23):

Hi everyone! Today we're releasing MiniF2F, a cross-system formal-to-formal benchmark consisting of olympiad, undergraduate, and high-school maths exercises. The goal of MiniF2F is to provide a shared benchmark to evaluate deep-learning approaches across formal systems. It currently targets Lean and Metamath, with an eye on HOL Light and Isabelle.

We see MiniF2F as a stepping stone towards the IMO Grand Challenge, as its statements are typically much easier than IMO problems, but generally still out of reach of deep-learning-based automated theorem provers. We also hope that it will help foster innovation in this space by allowing direct and fair comparison between existing methods.

The benchmark is still very early and there's a lot of work to do. Contributions are more than welcome (there's an indicative TODO list in the README). The Lean statements are released under the Apache license to make sure they can flow freely to and from mathlib.

Jason Rute (May 10 2021 at 22:31):

Thanks for putting this together Stan and team! I think overall this will be a good project. I’ve said for a while that we need standard benchmarks. I look forward to seeing the various neural, ML, decision procedure, and hammer-based systems put through this test. Right now there is no benchmark that I’m aware of that can compare DeepHOL to lean GPT-f to Isabelle Hammer, and it is great that we are creating such benchmarks.

Jason Rute (May 10 2021 at 22:32):

I could imagine this would be a good dataset to put on Papers with Code. (I used to keep the Papers with Code SoTA up to date for automated reasoning, but I haven't touched it in a long time.) Although, I think it would be even better just to maintain a list on this GitHub repo (maybe with some detailed CSVs in the repo too). This way one can not only see the results, but also all the caveats that come with each method.

Jason Rute (May 10 2021 at 22:32):

For the name MiniF2F, I assume it is a reference to Daniel Selsam calling the IMO Grand Challenge a Formal to Formal (F2F) variant of the IMO. That is, a formal statement is given to the agent as input, and the agent produces a formal proof as output. This project is similar, but contains simpler problems as well, hence “Mini” in the name.

Jason Rute (May 10 2021 at 22:32):

You mention HOL-Light and Isabelle. I wonder if you would be able to recruit some Coq, HOL4, and Mizar folks as well to help with translation to those systems. Then you can target CoqGym/ASTactic, Tactician, TacticToe, ProverBot9001, CoqHammer, ENIGMA, and rlCoP.

Jason Rute (May 10 2021 at 22:32):

As for your initial baseline tests with Lean, I'd love to see you test a variety of Lean tactics including tidy, refl, simp, finish, suggest, library_search, omega, norm_num and more. I think this would be a good place to ask the Lean community what sort of tactics would make good baselines.
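
Even something as simple as trying a fixed list of tactics in order would be an informative baseline. Here's a minimal sketch of what I mean (my own illustration in Lean 3 / mathlib, not an actual miniF2F baseline; the name baseline is made up):

import tactic

namespace tactic.interactive

-- try each candidate tactic in turn; succeed on the first one that closes the goal
meta def baseline : tactic unit :=
`[refl] <|> `[norm_num] <|> `[simp] <|> `[linarith] <|> `[finish]

end tactic.interactive

-- an easy goal the list closes (here via linarith, which uses the hypothesis)
example (x : ℤ) (h : 0 < x) : 0 ≤ x := by baseline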

Jason Rute (May 10 2021 at 22:33):

I think it would be nice to have a paragraph (or a whole paper or tech report) detailing what sort of problems you included in your dataset and why. Although, it should be noted to anyone reading here that right now there are only 122 Lean problems, so it is easy to look over manually.

Jason Rute (May 10 2021 at 22:33):

Every benchmark is flawed, including this one, and we should be careful not to read too much into a good or bad score on this metric. (Although I'm sure that will be impossible in practice.) Right away, I think it should be noted strongly that good performance on this benchmark isn't necessarily indicative of how useful a given system is to a Lean or Metamath user. One reason is that, after a quick glance at the theorems, I don't think they are typical theorems that a Lean user would put into Lean. Moreover, when comparing across systems it is just as much a comparison of the library and tools in each system as it is of the training methods used. A theorem might be easier for Lean GPT-f than for Metamath GPT-f not because of subtle differences in how the two projects were trained, but because of differences in libraries and automation. The same can be said of comparing either GPT-f project to DeepHOL.

Jason Rute (May 10 2021 at 22:34):

Also, for any system designed specifically against this benchmark, I'm afraid there could be significant issues with data leakage and overfitting to the test set. If someone uses a pure machine learning approach where they train on a separate dataset and test on the test theorems, I think the issues will be minimal (at least no worse than the current leakage and overfitting issues with the ImageNet and Atari datasets). (I may, however, be too optimistic here. With only about 50 test examples, even casually looking at the test set could heavily influence the design choices of the prover and lead to a better score.) However, the larger issue is if someone writes a tactic to solve these types of problems, similar to Daniel's IMO Grand Challenge project. One has to have a lot of self-restraint not to look too much at the test theorems.

Jason Rute (May 10 2021 at 22:35):

When talking about these results we really need to take this benchmark with a grain of salt. I hope to write up a longer document on the issues and challenges of testing and benchmarks, but the short of it is this: there are a lot of theorem provers. Comparing across them is hard. Also, there are a lot of subtly different goals, and it is not clear that all theorem provers are optimizing for the same ones. Apples-to-apples comparisons are likely impossible. Instead we need better (but not perfect) benchmarks and a lot of humility and self-reflection.

Jason Rute (May 11 2021 at 12:03):

Update: Looking back at what I wrote above, it may have come across as too negative. I think this is a great start at a benchmark! Obviously the more problems it has, and the more diverse they are, the better. I think making datasets and benchmarks is one of the best ways for many people to help with AI research. Unlike training a large neural network or making an end-to-end AI agent, building datasets and benchmarks is the sort of project that a variety of people can help with. One could argue that every time you put a theorem into Lean (or any ITP library) you are contributing to a dataset. Moreover, we don't even need proofs: theorem statements alone are a great benefit to both training and testing mathematical agents. One success story of making datasets was Lean GPT-f. It basically came about when Jesse and I were both separately working on Lean-related datasets. When those datasets reached maturity, it was easy to combine them and start collaborating with others.

Stanislas Polu (May 12 2021 at 10:01):

Agreed with all your points. A few quick comments:

  • The goal of the benchmark is not to evaluate the usefulness of ATPs for the typical Lean user, but rather to evaluate the reasoning capabilities of such agents using maths exercises as a metric. Of course we're also evaluating the underlying libraries/formal systems/tactics at the same time, but I think that's fine. As an example, applying GPT-f to Lean and Metamath and eventually other systems with the same methodology, and measuring against a benchmark like MiniF2F, is a very interesting endeavor IMHO. Usefulness for formalization efforts, I think, is actually well covered by testing on a temporal split of libraries?
  • The benchmark is obviously in a prototype state and we very much hope that people from various labs and formal communities will help grow it into a more finalized state.

Stanislas Polu (Sep 12 2021 at 11:22):

In case useful, last week we released a paper to accompany the miniF2F benchmark. It's available here: https://arxiv.org/abs/2109.00110

MiniF2F now has 488 statements fully translated into Lean and Metamath, plus work in progress on Isabelle.

We present miniF2F, a dataset of formal Olympiad-level mathematics problems statements intended to provide a unified cross-system benchmark for neural theorem proving. The miniF2F benchmark currently targets Metamath, Lean, and Isabelle and consists of 488 problem statements drawn from the AIME, AMC, and the International Mathematical Olympiad (IMO), as well as material from high-school and undergraduate mathematics courses. We report baseline results using GPT-f, a neural theorem prover based on GPT-3 and provide an analysis of its performance. We intend for miniF2F to be a community-driven effort and hope that our benchmark will help spur advances in neural theorem proving.

Scott Morrison (Sep 12 2021 at 11:36):

I'm terrified that tidy is now being used as a baseline. :-)

Kevin Buzzard (Sep 12 2021 at 11:50):

I find these ML papers very hard to read (nothing to do with your exposition, it's just that it's so far from my area that I don't even know the basic vocabulary). Here's a dumb question. You give a sample theorem on p7:

import data.real.basic
import tactic

theorem algebra_sqineq_2unitcircatblt1
  (a b : ℝ)
  (h₀ : a^2 + b^2 = 2) :
  a * b ≤ 1 :=
begin
  nlinarith [sq_nonneg a,sq_nonneg b,sq_nonneg (a - b)]
end

Just to be clear -- does your AI find this one-line proof by itself with absolutely no human input?

Second (and final) dumb question: more generally, there are 244 theorems formalised in Lean here: https://github.com/openai/miniF2F/blob/v1/lean/src/test.lean , some with proofs filled in and some with sorry (I am unclear about when a proof is filled in and when it's not). Using both a GPT-related approach which has not seen solutions to these questions before, and also a tidy-like tactic, and taking as much time as you like, what percentage of these goals can you completely solve with no human input whatsoever? I see 10.5% on p5, 29.2% on p6, and figures nearer 40% on p7, all coming from slightly different methods which I find hard to distinguish between.

My reason for asking is that my impression of previous results in this area is that sometimes CS researchers sneakily do things which a mathematician would regard as "cheating" -- for example they do things like "well we looked at a human proof and then told the system which tactics to use but not how to use them" or "we looked at 100 consecutive theorems in an API and for each theorem we tried to prove it assuming all the previous theorems, thus implicitly but crucially using the human-generated ordering of the theorems". I'm certainly not claiming that you're doing this! I'm observing that I've seen others doing this in the past, and this is the sort of thing which one would not be able to do in, say, an IMO challenge or a computer attempt at a hard human conjecture for which no route is yet known. Basically I just want to check that there's not any "small print" which I'm missing.

Thanks in advance and sorry for the very basic questions!

Kevin Buzzard (Sep 12 2021 at 11:54):

PS does the $a^2 + b^2 = 2 \implies ab \leq 1$ question show up in some kind of school Olympiad test?

Kevin Buzzard (Sep 12 2021 at 11:57):

Scott Morrison said:

I'm terrified that tidy is now being used as a baseline. :-)

It's a shame there's not more category theory on these high school olympiad exams!

Stanislas Polu (Sep 12 2021 at 19:48):

@Kevin Buzzard the AI indeed finds the one-line proof of algebra_sqineq_2unitcircatblt1 with no human input other than training on mathlib. That problem is not part of any olympiad as far as we know, but it is one we used as a test for quite some time, especially when working with Metamath. That's how it landed in miniF2F.
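
In case it helps decode the nlinarith hints, here is a hand-spelled version of essentially the same argument (my own illustration, not something the model produced):

import data.real.basic
import tactic

-- the hints encode: (a - b)^2 ≥ 0 and (a - b)^2 = a^2 - 2*(a*b) + b^2,
-- so with a^2 + b^2 = 2 we get 0 ≤ 2 - 2*(a*b), i.e. a*b ≤ 1
example (a b : ℝ) (h₀ : a^2 + b^2 = 2) : a * b ≤ 1 :=
begin
  have h₁ : 0 ≤ (a - b)^2 := sq_nonneg (a - b),
  have h₂ : (a - b)^2 = a^2 - 2 * (a * b) + b^2 := by ring,
  linarith
end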

The statements in miniF2F may indeed be accompanied by proofs, but it's not a requirement, just a nice-to-have.

  • The 10% on p5 is a reference to the PACT paper. That's the pass rate of the tidy baseline on a test split of statements from mathlib.
  • 29.6% on p6 is the pass rate on miniF2F / Lean of GPT-f pre-trained with PACT. PACT is an automated way to extract low-level artifacts from Lean proofs to train the model on as a secondary objective, on top of the tactic-mode tracing data we train on.
  • 40% is the pass rate of the same model on the subset of problems that come from the MATH dataset.

The only bit of "cheating" I can think of here is related to the fact that the model was pre-trained on data coming from the web that may include informal proofs of these problems (that dataset does not include any Lean at all, though). The model is later fine-tuned on mathlib data. I strongly believe (and we will definitely work on verifying this) that the results would remain the same if we were to mine for these and filter them out.

Stanislas Polu (Sep 12 2021 at 19:50):

Happy to answer more questions on our methodology! To give you a bit more color on the pre-training / presence of informal data: the dataset we used was built up in 2020. As you may have noticed, I formalized the AMC A 2021 problems last Friday, and our models didn't underperform on them in any way despite the guarantee that we never saw these statements or proofs, even in their informal form.

Stanislas Polu (Sep 12 2021 at 19:53):

Also, the goal of the paper is really to present the benchmark. It's common practice to provide some baseline numbers as well so that groups working on the benchmark have some numbers to compare to. That's why we built that modified tidy + applied the methodologies we had already published in the past for Metamath and Lean. But pushing these methods to their limits is not really the focus of the paper.

Stanislas Polu (Sep 12 2021 at 19:58):

(This is the PACT paper: https://arxiv.org/abs/2102.06203)

Kevin Buzzard (Sep 12 2021 at 20:19):

Thanks for the further explanations!

Scott Morrison (Sep 12 2021 at 22:19):

@Stanislas Polu (or other coauthors), I'm curious about your experience modifying tidy. Do you think you learnt anything in modifying it that could be contributed back to the mathlib version?

I ask in part because I know tidy is embarrassingly bad. I started writing it before mathlib existed, and it was only ever really "trained" (tuned, perhaps) on basic theorems from category theory. We now use it on other problems, but as far as I'm aware no one has ever really gone through any list of plausible targets for it and analysed failure modes outside of that category theory domain. I am quite certain it could be made much better! (And likely much less slow.)

Having good baselines is important, so your use of a variant of it adds to the motivation for this to happen.

Scott Morrison (Sep 12 2021 at 22:20):

(I realise, and apologise, that this question is not about the real point of your paper. :-)

Stanislas Polu (Sep 13 2021 at 07:14):

@Scott Morrison (cc @Jesse Michael Han @Kunhao Zheng) To be very honest, we didn't thoroughly study the dynamics here. As mentioned in the paper, we simply added a few tactics to tidy's list (namely linarith, nlinarith, ring_nf and norm_num) that are empirically useful for the kind of problems in miniF2F. The goal was really to get a decent non-neural baseline, and the result we got with this definitely matched that objective.
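
For concreteness, here's a rough sketch of that kind of modification (an illustration only, not our exact code; it assumes Lean 3 / mathlib, where tidy takes a cfg whose tactics field is a list (tactic string)):

import tactic

namespace tactic.interactive

-- append arithmetic tactics to tidy's default list; each entry also returns
-- the name that tidy's trace_result reporting would print when that step fires
meta def minif2f_tidy : tactic unit :=
tactic.tidy { tactics := tactic.tidy.default_tactics ++
  [ `[linarith]  >> pure "linarith",
    `[nlinarith] >> pure "nlinarith",
    `[ring_nf]   >> pure "ring_nf",
    `[norm_num]  >> pure "norm_num" ] }

end tactic.interactive

-- a goal the arithmetic additions (linarith) can close
example (a b : ℤ) (h : a ≤ b) : a ≤ b + 1 := by minif2f_tidy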

I presume there are many things we could do to improve the state of affairs here. Some ideas:

  • Some pattern matching to select from the tactic list in a bit more opinionated way?
  • Upweight shorter goal states?
  • Attempts to pass hypotheses as arguments when applicable?
  • Whatever sledgehammer does in Isabelle :shrug:

Happy to explore this further. Our immediate plans wrt non-neural baselines are the following:

  • Run sledgehammer once the Isabelle effort is complete and compare.
  • (aspirational) add TPTP as a supported format (not sure we can translate all problems easily) and run E prover on them.

In any case, very happy to collaborate and help on that front; I think it would be nice to have a stronger tidy for sure! One simple experiment we could run is to query the model inside the tidy loop to get extra arguments to tactics, and see if patterns emerge that could easily be implemented as non-neural heuristics?

Yichen Xu (Apr 06 2022 at 07:19):

Hi miniF2F authors! @Kunhao Zheng @Stanislas Polu I am trying to evaluate a GPT model (using the lean-gym environment) on the miniF2F benchmark these days, but I am confused about one technical detail: it seems that when attempting the proof of a theorem we need its decl name and open namespaces. For mathlib, these can be obtained from the .names files produced by the proof recording pipeline; but how can we obtain them for miniF2F?

Stanislas Polu (Apr 06 2022 at 17:43):

Hi @Yichen Xu! You indeed need the decl name. We parse it out of the minif2f repo by simply matching for the theorem keyword with a rather simple regex :+1:

Stanislas Polu (Apr 06 2022 at 17:44):

This is what we use:

import os
import random
from typing import Optional

import jsonlines
from smokey import Smokey

"""
Script for creating benchmarks, allows you to specify either a folder containing a bunch of lean
files, or a single lean file for which you would like to create benchmarks for. It also can create
fractional splits of that data.

Usage:

python -m formal.lean.datasets.create_benchmarks_from_folder --folder_path <FOLDER_PATH>

or

python -m formal.lean.datasets.create_benchmarks_from_folder --lean_path <LEAN_FILE_PATH>

"""


def recursively_grab_lean_files(folder_path: str):
    files = []
    for r, _, f in os.walk(folder_path):
        for file in f:
            if ".lean" in file:
                files.append(os.path.join(r, file))
    return files


def run(lean_path: Optional[str] = None, folder_path: Optional[str] = None):
    if folder_path and lean_path:
        err_msg = "Please specify *only* either a folder_path or a path to a specific lean file"
        raise ValueError(err_msg)
    if lean_path:
        file_paths = [os.path.expanduser(lean_path)]
    elif folder_path:
        file_paths = recursively_grab_lean_files(folder_path)
    else:
        err_msg = "Please specify either a folder_path or a path to a specific lean file"
        raise ValueError(err_msg)

    data = []

    # Create benchmark from list of lean file_paths
    for benchmark_path in file_paths:
        with open(benchmark_path, "r") as f:
            opens = []
            imports = []
            begin_count = 0

            for line in f.readlines():
                if line.startswith("import "):
                    imports += [line[7:].strip(" \n")]
                if line.strip() == "begin":
                    begin_count += 1
                    continue
                if line.strip() == "end":
                    if begin_count > 0:
                        begin_count -= 1
                        continue
                if line.startswith("section"):
                    opens = []
                if line.startswith("open "):
                    opens += line[5:].strip().split(" ")
                if line.startswith("end"):
                    opens = []
                if line.startswith("theorem"):
                    name = line[7:].strip(" :\n")
                    # print(f"DECL {name}")
                    data.append({"name": name, "opens": opens, "imports": imports})

    random.shuffle(data)
    with jsonlines.open("out_benchmark.jsonl", "w") as w:
        for d in data:
            entry = {"decl": d}
            w.write(entry)


if __name__ == "__main__":
    Smokey(run)

Yichen Xu (Apr 07 2022 at 03:21):

A lot of thanks!


Last updated: Dec 20 2023 at 11:08 UTC