## Stream: Machine Learning for Theorem Proving

### Topic: New paper: Magnushammer

#### Albert Jiang (Mar 09 2023 at 13:47):

We released a paper using contrastively-trained transformers for premise selection: https://twitter.com/PiotrRMilos/status/1633802335053905920.

1. The model is very small (38 or 86 million params). It can run on your laptop.
2. The model takes purely textual inputs, which means that the same neural network setup can be used for Lean.

#### Gabriel Ebner (Mar 09 2023 at 18:00):

Actual paper: https://arxiv.org/abs/2303.04488

#### Jason Rute (Mar 11 2023 at 04:16):

Here is a summary of the paper. To begin, it’s helpful to know how hammer systems work. A hammer—like Isabelle’s Sledgehammer, Coq’s CoqHammer, or Lean’s experimental hammer of @Gabriel Ebner—have the following parts:

• Premise selection: Select a number of lemmas in the library that might be related to the current goal.
• Translation: Translate the lemmas and goal into first order logic
• ATP: Use powerful external FOL provers like E or Vampire to prove the FOL translated goal using the FOL translated lemmas .
• Premise Selection (again): Select the subset of lemmas which the prover used to prove the goal.
• Reconstruction: Reprove the goal in the ITP using the selected lemmas and powerful tactics (like the MESON first order logic tactic).

#### Jason Rute (Mar 11 2023 at 04:16):

This quote from the paper sums up the situation nicely:

If successful, these external provers generate complete proofs, but the proofs are not trusted by the Isabelle system. Instead, the facts used in the external proofs are extracted and used to reconstruct the proof using native Isabelle methods. This process is known as proof reconstruction (see Figure 2). This means that, in essence, Sledgehammer is a premise selection tool.

#### Jason Rute (Mar 11 2023 at 04:16):

The premise selection in the initial phase of hammers usually has some form of machine learning in it, and recent work, The Isabelle ENIGMA, has even used neural networks to improve this part. But this paper shows you can basically replace the ATP-powered hammer altogether with just neural premise selection.

#### Jason Rute (Mar 11 2023 at 04:16):

The algorithm is fairly simple and most of it is in the picture in the tweet above. You take every theorem in the library and assign it a vector embedding using a Transformer. (Actually this isn’t too different from what @Zhangir Azerbayev has been working on for Lean.) You also take the goal and assign it a vector embedding with the same Transformer (but with a different projection layer at the end). You train these embeddings so that they point in the same direction if the theorem is a useful premise for the goal. Formally, one uses cosine similarity to compare and rank the premises. (While it may seem expensive to compute embeddings for all lemmas in the library, it can be amortized and just done once per lemma, either packaged along with the library, or computed on the fly as new lemmas are added. I don’t know if and how Magnushammer plans to do this in practice as it hasn't been released.)

#### Jason Rute (Mar 11 2023 at 04:16):

This cosine similarity could be enough (and they have ablations showing that just cosine similarity does pretty well), but after using cosine similarity to rank 1024 lemmas they rerank them again using a slightly more expensive method. They put both the goal and the premise together and pass that pair through a Transformer to get a score between 0 and 1, where 1 means it is a likely premise needed to solve that goal. They repeat this 1024 times, once for each pair, to rerank the premises.

#### Jason Rute (Mar 11 2023 at 04:16):

Now, with the final set of lemmas, Magnushammer (in parallel) tries a number of different tactics with different subsets of their 1024 lemmas. I’m not clear on all the details here yet, but they claim it is very similar to what SledgeHammer does during the reconstruction phase.

#### Jason Rute (Mar 11 2023 at 04:16):

So in summary, there is no external ATP or FOL translation, but it still relies on powerful tactics like MESON in the target ITP. Also, like SledgeHammer, the resulting proofs aren’t going to look very pretty.

#### Jason Rute (Mar 11 2023 at 04:16):

Overall, MagnusHammer seems to do much better than SledgeHammer. The comparison is messy as this sort of thing always is, since I think MagnusHammer uses a GPU and Sledgehammer doesn’t. But they have a nice plot on the first page which computes the proofs solved for a number of computational budgets. Even if the MagnusHammer and SledgeHammer budgets don’t perfectly align (which I’m sure they don’t!), it seems that even giving SledgeHammer a lot more resources isn’t going to help too much. (I really like this compute budget aspect of the paper!)

#### Jason Rute (Mar 11 2023 at 04:17):

Unlike GPT-f and HTPS, MagnusHammer doesn’t use any tree search or backtracking (other than what is in the tactics it uses, and the fact that it tries lots of tactics in parallel). They do however have additional experiments using Thor which is a neural tree-search-based theorem prover which can use SledgeHammer as one of its tactics. They replace any call to SledgeHammer with MagnusHammer. This proves many more theorems. (I’ve heard from Albert that this doesn’t come with a much higher computation cost, but I’m unsure on the details.)

#### Albert Jiang (Mar 11 2023 at 09:13):

Great summary!! Thanks Jason!

#### Kevin Buzzard (Mar 11 2023 at 10:59):

Jason's long posts here and on the proof assistants stackexchange site, where he explains modern advances in a jargon-free way, are indeed extremely helpful both to me and I'm sure to others.

Here is something I'm not clear about. If one is my PhD students is doing Masters level commutative algebra or number theory (elliptic curves, modular forms etc) in Lean then how much difference would a hammer actually make to our working environment in practice? Is the answer to this question simply unknown? Could the answer vary a huge amount depending on the exact nature of the work? Is it likely that in practice these tools are not going to be helpful at all at this level, now we have a solid API for a lot of the basics, or do they actually get better as the library grows? Perhaps @Sebastien Gouezel has experience with developing mathematics at this level in Isabelle? If @Manuel Eberl is listening -- is sledgehammer helpful for making a theory of modular forms in Isabelle/HOL?

#### Jason Rute (Mar 11 2023 at 11:16):

@Kevin Buzzard I personally feel that learned models like this one (along with GPT-f, Thor, DSP, and Baldur) have a lot of ability to apply to esoteric settings like your favorite types of math or different logics (which don’t already have powerful ATPs). One of Sledgehammer’s deficiencies are that it forces everything through FOL which can be a messy translation for HOL and DTT.

#### Jason Rute (Mar 11 2023 at 11:16):

Also, a frustration you’ve expressed, and one that you are not alone in, is that the vast majority of these systems are just closed research projects. Even when they are made available, they are often weaker than what is in the paper (like how the lean GPTf tactic didn’t do any proof search). Also the theorems they test on and talk about in the paper are not the ones you are interested in. I don’t have a perfect solution. But I would encourage you to start making a list of the sorts of little lemmas, theorems, and goals you would want a tool to be able to solve. Encourage the authors of these papers, like @Albert Jiang , to try them out and let you know how it goes. Maybe one day we can even turn this list into a sort of formal benchmark for these systems.

#### Zhangir Azerbayev (Mar 13 2023 at 21:15):

My favorite way to benchmark neural provers is the "future mathlib" evaluation, which was used in PACT.

In between the when you download your training data and when you run final evaluations, there will be a few thousand new theorems in mathlib, so you can use those as your benchmark. This kind of metric is not very useful for ML researchers, since we care about comparing results across papers . However, I think this metric is great for people like @Kevin Buzzard who want to know how whether these systems are useful for their real-world formalization work.

@Albert Jiang I wonder if there is enough new afp to try this kind of evaluation with magnushammer?

#### Albert Jiang (Mar 13 2023 at 21:18):

The rate of AFP growth is roughly 8K lines every 3-5 days. So I do think there is a possibility for this with AFP.

#### Albert Jiang (Mar 13 2023 at 21:20):

The problem I'm worried about is topic bias with AFP. There's a new entry every 3-5 days so if you wait for a month, you get 6-10 entries. This sample might not be very representative of the research you want to do.

#### Jason Rute (Mar 13 2023 at 21:24):

A physicist friend of mine once called this a “physical split” of the data, where the testing data is in the “future light cone” of the agent. It is essentially the same sort of evaluation the IMO grand challenge is going for, testing past agents on future IMO problems.

#### Jason Rute (Mar 13 2023 at 21:28):

(And to be fair, it wasn’t our primary evaluation in PACT. Our primary evaluation was a random split of all of mathlib. It was just a secondary evaluation we reported.)

#### Zhangir Azerbayev (Mar 15 2023 at 18:01):

Albert Jiang said:

The problem I'm worried about is topic bias with AFP. There's a new entry every 3-5 days so if you wait for a month, you get 6-10 entries. This sample might not be very representative of the research you want to do.

I see, if new afp is mostly new theories this would be an issue. In mathlib, new PRs are a good mix of refactors of existing math new additions to existing theories, and entirely new theories. For example, analysis.calculus.fderiv gets edited every month or so.

Although magnushammer uses Isabelle from 2021, so if I am not misinterpreting (maybe you're using the 2021 proof checker on new data?) there is over a year of new data.

#### Albert Jiang (Mar 15 2023 at 22:00):

Zhangir Azerbayev said:

Albert Jiang said:

The problem I'm worried about is topic bias with AFP. There's a new entry every 3-5 days so if you wait for a month, you get 6-10 entries. This sample might not be very representative of the research you want to do.

I see, if new afp is mostly new theories this would be an issue. In mathlib, new PRs are a good mix of refactors of existing math new additions to existing theories, and entirely new theories. For example, analysis.calculus.fderiv gets edited every month or so.

Although magnushammer uses Isabelle from 2021, so if I am not misinterpreting (maybe you're using the 2021 proof checker on new data?) there is over a year of new data.

Ah yes there is indeed one year of new data that can be used for evaluation. Very good point. I think we're gonna evaluate on new data when (and if) MH comes out as a tool to give people some idea of how well it does in a research scenario. The dream would be that the user writes down 100 theorems filled with sorries, launches MH, goes to sleep, and wake up to solve <=80.

#### Rahul Chalamala (Apr 19 2023 at 00:47):

What tokenization was used in the BM25 baseline?

Last updated: Dec 20 2023 at 11:08 UTC