Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Lyra


Jason Rute (Sep 30 2023 at 19:48):

Lyra: Orchestrating Dual Correction in Automated Theorem Proving by Chuanyang Zheng, Haiming Wang, Enze Xie, @Zhengying Liu, Jiankai Sun, @Huajian Xin, @Jianhao Shen, Zhenguo Li, Yu Li

Mostly the same authors of DT-Solver (#Machine Learning for Theorem Proving > DT-Solver).

Uses large language models like GPT-4 to prove theorems in Isabelle. They claim SoTA on MiniF2F including solving 3 IMO problems. (Of course one should take this with a grain of salt as has been discussed in the past.) I think it has similarities to Draft Sketch Prove (#Machine Learning for Theorem Proving > More papers on autoformalization). They have two mechanisms I don't understand yet. Tool correction somehow uses SledgeHammer, and Conjecture Correction somehow uses previous Isabelle error messages. (I wonder if it is similar to Baldur, #Machine Learning for Theorem Proving > Baldur: Whole-Proof Generation and Repair with Large Lang....)
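From the abstract alone, I'd guess the overall loop looks roughly like the sketch below (speculative only; every name in it is hypothetical and may not match what the paper actually does):

```python
# Speculative sketch of a Draft-Sketch-Prove-style loop with two repair passes.
# All names here (prove, draft_proof, check, revise, ...) are hypothetical;
# this is a guess at the shape of the pipeline, not the Lyra paper's algorithm.

def prove(statement, llm, isabelle, max_rounds=3):
    proof = llm.draft_proof(statement)          # initial proof sketch from e.g. GPT-4
    for _ in range(max_rounds):
        report = isabelle.check(statement, proof)
        if report.success:
            return proof
        # "Tool Correction": retry failed low-level steps with Isabelle's own
        # automation (e.g. sledgehammer) instead of the LLM's guessed justification.
        proof = isabelle.retry_failed_steps_with_sledgehammer(proof, report.failed_steps)
        # "Conjecture Correction": show the LLM its previous proof together with
        # Isabelle's error messages and ask it to revise the intermediate conjectures.
        proof = llm.revise(statement, proof, report.errors)
    return None  # give up after max_rounds attempts
```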

Kevin Buzzard (Sep 30 2023 at 22:21):

Yeah, about the two new IMO problems they claim to solve: the first is imo_1981_p6, but this is not IMO 1981 problem 6. The IMO problem ends "determine f(4,1981)", whereas the Isabelle formalisation proves ∀ y, f(4, y+1) = 2^(f(4,y)+3) - 3, which is most definitely not the same thing (and in particular the formalisation contains crucial information not in the question, although it is unfortunately still an open problem to figure out how to actually turn these kinds of "work out this number" questions into a formal theorem statement without giving away information).

The second one they claim to solve is IMO 1974 P5, but this is not in the dataset as far as I can see (I can't find it here: https://github.com/openai/miniF2F/tree/main). On the other hand, on p9 they do state an Isabelle formalisation of something related to IMO 1974 problem 5; however, it is extremely far from the IMO problem. The IMO problem says "there's some function S of 4 positive real inputs; find its image". The answer is "the open interval (1,2)", and the hard part is to prove that everything in that range is attained; the fact that the image is obviously contained in (1,2) is just a remark in the first line of the human solution. However, the formalisation only proves this trivial inclusion.

This is perhaps the most egregious abuse of the phrase "we successfully solve an IMO problem" which we've seen yet in this area. They successfully solve the trivial direction of an if and only if. This is not to take anything away from the research! But we are not seeing any new solutions of what humans would call IMO problems here, whatever miniF2F says an IMO problem is.
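To make the gap concrete, here is a purely illustrative Lean sketch of the imo_1981_p6 situation (the actual miniF2F statement is in Isabelle, and the names below are invented):

```lean
-- Purely illustrative sketch; the real miniF2F statement is in Isabelle and the
-- names here are invented.  `Spec` encodes the defining recursion of the
-- (Ackermann-style) function from IMO 1981 P6.
def Spec (f : Nat → Nat → Nat) : Prop :=
  (∀ y, f 0 y = y + 1) ∧
  (∀ x, f (x + 1) 0 = f x 1) ∧
  (∀ x y, f (x + 1) (y + 1) = f x (f (x + 1) y))

-- Benchmark-style statement: the closed-form recursion for `f 4` appears in the
-- goal itself, i.e. the key insight a contestant has to discover is given away.
theorem benchmark_style (f : Nat → Nat → Nat) (hf : Spec f) :
    ∀ y, f 4 (y + 1) = 2 ^ (f 4 y + 3) - 3 := by
  sorry

-- The actual contest question ("determine f(4, 1981)") has no obvious faithful
-- formalisation: writing `f 4 1981 = <closed form>` also gives the answer away,
-- while `∃ n, f 4 1981 = n` is trivially true.
```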

Kevin Buzzard (Sep 30 2023 at 22:27):

I guess there's also the issue that as part of the process they're asking an LLM to give a sketch proof of a question which it's seen already.

Notification Bot (Oct 02 2023 at 06:32):

16 messages were moved from this topic to #IMO-grand-challenge > fill in the blank by Johan Commelin.

Scott Morrison (Oct 02 2023 at 08:03):

While not at all disputing Kevin's analysis of these purported IMO solutions, I'll say again that it would be great if the IMO Grand Challenge committee could clearly specify what would constitute a successful solution.

Even a document with a checklist of items ("does your proposed formalised statement contain crucial information about the solution that is not present in the informal statement?", "does your statement include both directions of any iff?", etc.) would be useful. It doesn't need to definitively answer the "find" problem.

Then at least when claims are made about IMO solutions it is straightforward to ask authors "are you claiming your solution meets this standard?".

Jason Rute (Oct 02 2023 at 13:28):

That could be a start. For the miniF2F benchmark, however, that ship may have sailed…

Kevin Buzzard (Oct 02 2023 at 15:20):

Right -- the authors are just saying "we proved something labelled IMO_... in miniF2F", but they naturally translate this as "we solve an IMO problem" which is very much not the same thing in the two new examples in the paper.

Johan Commelin (Oct 02 2023 at 15:28):

Could miniF2F release a new version of the benchmark? I suppose that comes with its own set of problems, because benchmarks aren't supposed to change. But if those statements could at least come with an *, that would already help a bit.

Albert Jiang (Oct 03 2023 at 06:56):

Johan Commelin said:

Could miniF2F release a new version of the benchmark? I suppose that comes with its own set of problems, because benchmarks aren't supposed to change. But if those statements could at least come with an *, that would already help a bit.

There's a complicated history to miniF2F. It was initially an OpenAI project: https://github.com/openai/miniF2F. Then all of its authors left OpenAI (apart from Dan?) and the library became unmaintained; no PRs can be merged.

Last year during my internship, I found ~10-20 mistakes in the formalisations (unsolvable statements, trivial typos, and incorrect indices; I know there are non-trivial mistakes as discussed above), so I created a new repo for it, since I cannot push to the original: https://github.com/facebookresearch/miniF2F. This is v2 and Fabian Glöckle is maintaining it at the moment. For future versions one should PR to the Facebook repo.

Johan Commelin (Oct 03 2023 at 08:07):

Thanks for pointing this out! And thanks to both you and @Fabian Glöckle for creating and maintaining v2.

Min-Hsien Weng (Oct 04 2023 at 22:23):

Jason Rute said:

Conjecture Correction somehow uses previous Isabelle error messages.

The Lyra Conjecture Correction procedure reminds me of the Kaggle competition (LLM Science Exam)
https://www.kaggle.com/competitions/kaggle-llm-science-exam/overview

In that competition, we are given 200 multiple-choice questions and must use an LLM to choose the correct answer to each question. The questions are very challenging and require some physics knowledge. To answer each question, we collect relevant Wikipedia texts as context and pass them to the LLM along with the question. This context plays a similar role to the content (the previous formal proof) used in the Conjecture Correction procedure. The issues with this approach are:

(1) The context used as input to the LLM.
(a) Text-similarity search works for Wikipedia, but mathematical reasoning requires a different approach because of mathematical notation and symbols.
(b) LLMs have a maximum input length. If the context for a question is too long, the model will truncate it; if it is too short, it may not give the model enough information to make an accurate prediction.

(2) The computational resources required to run the LLM (see the offloading sketch below).
7B parameter LLMs require at least 160GB of memory. To run such a large model on limited GPU memory, we need to run the model one layer after another. The model has 87 layers (plus input/output layers), and running through all of them is time-consuming (200 questions take over 5 hours on 2 T4 GPUs).
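For reference, one common way to get this layer-by-layer behaviour is Hugging Face-style offloading, where layers are spread over GPU, CPU RAM and disk and streamed through the GPU during the forward pass. A minimal sketch of that approach (generic transformers/accelerate usage, not necessarily the exact competition setup):

```python
# Minimal sketch of offloaded inference with transformers + accelerate.
# device_map="auto" spreads the layers over GPU / CPU RAM / disk, and weights are
# moved onto the GPU layer by layer during the forward pass -- slow, but it makes
# a model much larger than GPU memory usable at all.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "garage-bAInd/Platypus2-70B-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # ~2 bytes per parameter
    device_map="auto",           # place layers on GPU / CPU / disk automatically
    offload_folder="offload",    # spill layers that don't fit in RAM to disk
)

prompt = "Context: <retrieved Wikipedia passages>\nQuestion: ...\nAnswer (A-E):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=1)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```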

Junyan Xu (Oct 05 2023 at 14:29):

7B parameter LLMs require at least 160GB of memory.

One of these numbers must be off! 7B unquantized should only require around 16GB. Given that the model has 87 layers, I bet you're talking about 70B models (for comparison Llama 65B has 80 layers).
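A quick back-of-the-envelope check on the weight memory (ignoring activations and the KV cache):

```python
# Rough weight-memory estimates; activations and KV cache add more on top.
BYTES_FP16, BYTES_INT4 = 2.0, 0.5
for billions in (7, 65, 70):
    print(f"{billions}B params: ~{billions * BYTES_FP16:.0f} GB fp16, "
          f"~{billions * BYTES_INT4:.1f} GB int4")
# 7B  -> ~14 GB fp16 (fits on one 16GB GPU), ~3.5 GB int4
# 70B -> ~140 GB fp16 (needs multiple GPUs or offloading), ~35 GB int4
```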

Adam Topaz (Oct 05 2023 at 14:33):

Min-Hsien Weng said:

7B parameter LLMs require at least 160GB of memory.

Is this really true? I have 64GB of RAM on my laptop, and I can run llama2 7B on my CPU just fine (albeit quite slowly)

Albert Jiang (Oct 07 2023 at 09:06):

If you just want to run inference, a 7B-parameter model requires 16GB of memory max :)

Quantised versions (int4/int8) can run on a Mac M1 Max / M2 no problem.

Min-Hsien Weng (Oct 08 2023 at 21:53):

Thanks for pointing that out. Sorry for my typo. I was using Platypus2-70B-instruct, which has 70 billion parameters. A 7B-parameter model is not that big :sweat_smile:
Here is the LLM model: https://huggingface.co/garage-bAInd/Platypus2-70B-instruct

This 70-billion-parameter large language model can infer multiple-choice answers from relevant Wikipedia contexts (like in an open-book exam) with 86% accuracy. Some teams have achieved over 93% accuracy, and they may find ways to further improve performance.

I found it interesting and learned more about LLMs. The competition is ending soon, so join if you're interested!


Last updated: Dec 20 2023 at 11:08 UTC