Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: ML on tactics vs terms?
Sam (Feb 26 2024 at 09:12):
It seems that most current approaches to ML proving are making the implicit assumption that it's better to work with tactics than raw terms. This isn't intuitively obvious to me, for a few reasons:
- The internal behaviour of tactics can be complex, and during training the model only sees the result of tactic applications, which is a fairly weak signal from which to infer that behaviour. In some sense, the tactics are actually hiding what's really happening - especially when they make use of implicit knowledge from the library.
- The relatively large number of tactics means that the model needs to learn a wide range of tactic behaviours, some of which are seen quite sparsely in the data. This may limit the ability of the model to generalise/share knowledge between proofs using different tactics.
- Training directly on terms is potentially more data efficient, because removing the abstraction means more data. Every theorem would be an instance of the same problem, which potentially leads to better generalisation.
Is there a good reason that tactic proofs are getting the focus at the moment, or is this one of those things that's just the current trend?
My personal hunch is that tactic proofs are the easiest way to get to 35% of problems solved, because they automate so much of the proof process that the model can get away with not really "understanding" what it's doing. Many proofs are solvable by a sloppy application of common tactics. The success of the k-NN techniques shows how far we can get with really dumb methods. However, if we want models to tackle the harder problems, I suspect we need to go lower level.
Tactics have the advantage that they tend to produce nicer proofs for the user. If we care about that, we could always add a term proof -> tactic proof translation step later on, so I don't think this should be considered as an important factor.
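(A toy illustration of the contrast being discussed, not from the thread: the same trivial lemma as a tactic script and as the explicit term it elaborates to:)

```lean
-- Tactic proof: the training data records only the script,
-- not the elaboration it produces.
example (p q : Prop) (hp : p) (hq : q) : p ∧ q := by
  constructor
  · exact hp
  · exact hq

-- Term proof: the underlying evidence, written out directly.
example (p q : Prop) (hp : p) (hq : q) : p ∧ q :=
  And.intro hp hq
```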
Sam (Feb 26 2024 at 09:22):
I should probably also mention that I'm currently playing around with some ideas (in the notes stage - I haven't even got to code yet) for an alternative way to represent term proofs to allow for a more iterative solving process.
The basic idea is to take inspiration from e-graphs to represent partial term proofs, where a single proof step consists of non-destructively adding a new term to the e-graph. This would potentially allow iterative and out-of-order solving, and remove the need for backtracking in the proof search. I don't actually know if this idea works out yet, as it's still just a germ of a thought for now.
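(A minimal Python sketch of that idea, under the assumption of hash-consed terms plus a union-find; a real e-graph would also need congruence closure, which is omitted here:)

```python
# Illustrative sketch only: terms are added non-destructively, and a
# rewrite step merges two equivalence classes instead of mutating terms.
class EGraph:
    def __init__(self):
        self.parent = {}  # union-find over term ids
        self.terms = {}   # term id -> (op, child class ids)
        self.memo = {}    # hash-consing: structure -> term id

    def add(self, op, *children):
        """Non-destructively add a term; return its class id."""
        key = (op, tuple(self.find(c) for c in children))
        if key not in self.memo:
            tid = len(self.parent)
            self.parent[tid] = tid
            self.terms[tid] = key
            self.memo[key] = tid
        return self.find(self.memo[key])

    def find(self, tid):
        while self.parent[tid] != tid:
            tid = self.parent[tid]
        return tid

    def merge(self, a, b):
        """Record that two terms are equal (e.g. after a rewrite)."""
        self.parent[self.find(a)] = self.find(b)

g = EGraph()
two = g.add("2")
x = g.add("x")
mul = g.add("*", x, two)           # x * 2
shl = g.add("<<", x, g.add("1"))   # x << 1
g.merge(mul, shl)                  # one rewrite step: x * 2 = x << 1
assert g.find(mul) == g.find(shl)  # equal, with no upstream changes
```

The point of the sketch is the last line: the merge is recorded once, and any term containing `x * 2` is automatically in the same class as the `x << 1` version, so nothing upstream is rebuilt.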
Johan Commelin (Feb 26 2024 at 10:37):
Tactics allow reasoning at a higher level of abstraction. This seems crucial to me. If you want to solve "theorem proving", then I think an AI might need to discover new abstractions on many levels:
- intermediate steps, `have`, forward reasoning, "cut"
- lemmas, theorems
- definitions
- proof techniques, tactics (or some analogue/substitute)
So I think it's fine to consider proof terms and train on those, but I think that avoiding tactics will not speed up the path to strong AI for theorem proving.
Sam (Feb 26 2024 at 10:50):
To what degree are the higher levels of abstraction offered by tactics not just common patterns that recur in term proofs, which the model can learn to mimic?
It's certainly true that tactics which internally do a search can make things easier for a model, because they reduce its burden. That's very helpful if you're doing a bruteforcy proof search with the model acting as a heuristic. However, if we want models which are good enough to put together working proofs without deep searches, it's less obvious to me that that's useful.
For the model to be able to predict whether a tactic will work, it will usually need to be able to foresee what that tactic will do, and to know the outcome of the search. At that point, what purpose does the tactic serve?
Kim Morrison (Feb 26 2024 at 11:19):
@Sam, you should look at the insanely large and near-uninterpretable terms that even `simp` and `rw` produce, let alone `omega`, `ring`, `linarith`, etc.
Kim Morrison (Feb 26 2024 at 11:19):
It doesn't make sense to expect any model to synthesize these page long term mode proofs.
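(For readers who want to see this for themselves: the elaborated term behind a tactic proof can be inspected with `#print`. A minimal illustration, assuming a recent Lean 4 where `omega` is available:)

```lean
-- A one-line tactic proof...
theorem demo (a b : Nat) : a + b = b + a := by omega

-- ...whose elaborated proof term is typically far larger
-- than the tactic script that produced it:
#print demo
```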
Sam (Feb 26 2024 at 11:22):
@Scott Morrison Ah so perhaps this is a bad assumption on my part, but I assumed that in most cases those large term proofs are put together very inefficiently due to the rote way the tactics are implemented, and could usually be optimised to something much smaller. Is that not the case?
Kim Morrison (Feb 26 2024 at 11:27):
Not really. You can find some examples of @Joachim Breitner making some very minor improvements to the terms produced by `rw` recently, but it's just tinkering around the edges.
Kim Morrison (Feb 26 2024 at 11:27):
Tactics like `rw` have had a lot of effort put into making them efficient. But term size is not really a direct consideration for this!
Sam (Feb 26 2024 at 11:29):
I wasn't imagining optimising those tactics - I was imagining taking the term proofs they produce, and then running those through some kind of simplifier. But yeah - I don't actually know how viable this is!
Kim Morrison (Feb 26 2024 at 11:29):
There's not much to simplify. You need to write out a long sequence of congruence lemmas that allow you to modify a subexpression of a potentially large expression.
Sam (Feb 26 2024 at 11:36):
Ah, so that deep `rw` problem I maybe have a solution for... But probably not. (Disclaimer: I'm definitely the least educated person on these matters in this conversation.)
My instinct is to try representing the partial term proofs as something e-graph like, meaning that deep sub-terms don't quite exist in that same way. A `rw` operation becomes a merging of two terms into the same equivalence class, but nothing needs to change "upstream". A proof is then said to be completed when there's an extractable term in the e-graph that gives the goal - but technically this can be concluded without even extracting the full term.
But all of the above may be nonsense.
Sam (Feb 26 2024 at 11:39):
I'm not 100% sure that actually addresses the problem you're referring to!
Mario Carneiro (Feb 26 2024 at 12:54):
Scott Morrison said:
There's not much to simplify. You need to write out a long sequence of congruence lemmas that allow you to modify a subexpression of a potentially large expression.
No, the proofs generated by most Lean tactics are massively inefficient compared to what a human would do
Mario Carneiro (Feb 26 2024 at 12:55):
it's just (1) difficult to find shorter proofs without taking longer, and (2) less uniform and more complex for the tactic code
Sam (Feb 26 2024 at 12:56):
@Mario Carneiro Do you have a sense of how practical it would be to optimise those generated proofs using rewrite rules, simp style?
Mario Carneiro (Feb 26 2024 at 12:57):
It won't work because `simp` explicitly disables itself on proofs
Mario Carneiro (Feb 26 2024 at 12:58):
but assuming it was actually treated as a goal, I think you could get quite a bit closer to hand-optimized proofs that way
Sam (Feb 26 2024 at 12:59):
Hmm, that `simp` issue seems like the type of thing that would probably be workaroundable, right? Like if we just pull the theorem values as `Expr`s it should be possible to run `simp` on those? Or maybe there's some additional complexity here I'm not aware of.
Jason Rute (Feb 26 2024 at 15:30):
@Sam Or you can do both. :) PACT added the term proofs to the training data in two forms, both `exact <whole proof>` and `apply <next step>` for every subtree in the full proof. This data (along with other data we added) seems to really help the results. The other Lean papers by OpenAI as well as the HTPS paper by Meta all used this data. And you could see it in the results (when these models used to be publicly available for Lean users to play with). Often the model predicted `exact <some long term proof>` and it was correct.
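(A loose sketch of that subtree-extraction idea, using a made-up nested-tuple representation of proof terms; PACT itself of course works on actual Lean expressions:)

```python
# Illustrative only: a proof term is a nested tuple (head, *args).
# Emit one `exact`-style example for the whole proof, and one
# `apply`-style example for the head of every subtree.
def subtrees(term):
    yield term
    if isinstance(term, tuple):
        for arg in term[1:]:
            yield from subtrees(arg)

def pact_examples(proof_term):
    examples = [("exact", proof_term)]
    for sub in subtrees(proof_term):
        head = sub[0] if isinstance(sub, tuple) else sub
        examples.append(("apply", head))
    return examples

# And.intro hp hq, as a nested tuple:
proof = ("And.intro", "hp", "hq")
print(pact_examples(proof))
```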
Jason Rute (Feb 26 2024 at 15:31):
I gave a talk on PACT in this youtube video: https://www.youtube.com/watch?v=EXpmbAfBNnw
Jason Rute (Feb 26 2024 at 15:33):
It is sort of funny that many recent tools like Lean Copilot don't use data like this. It is pretty easy to gather and it seems to help the performance.
Jason Rute (Feb 26 2024 at 15:35):
Oh, and here is another recent paper which reimplemented PACT: DT-Solver
Jason Rute (Feb 26 2024 at 15:38):
Agda is interesting in that it doesn't have tactics (I think). So it likely has better term-proof data. Here is an Agda dataset, but honestly I haven't looked into it closer: https://arxiv.org/pdf/2402.02104.pdf
Jason Rute (Feb 26 2024 at 15:38):
(I should probably make a new thread for that paper.)
Sam (Feb 26 2024 at 15:43):
Ah, thanks for pointing out PACT @Jason Rute! I'm actually just realising that I watched that talk back when it came out, but 2021 is ancient history so it slipped my mind. I don't know if I ever fully read the paper though - so that's on my reading list now!
Jason Rute (Feb 26 2024 at 15:43):
Finally, I sometimes wonder if we should work more on AI for lower-level proof languages like Metamath and term proofs. The reason is not that we will solve more theorems without tactics. Tactics clearly help, ... a lot. (Just look at the miniF2F results for GPT-f in Metamath.) But because our current approaches rarely generate proofs that are more than, say, 3-5 tactics long, it seems we need new techniques, and the utter failure on non-tactic benchmarks is maybe a good place to start.
Jason Rute (Feb 26 2024 at 15:48):
Also, to add to my list of results above, I should mention this short project on generating synthetic term proofs in Lean.
Sam (Feb 26 2024 at 15:49):
Yeah your recent Graph2Tac work seems to be in the direction which I've always assumed was promising, so that's very cool! I've been wondering whether a pure term approach makes more sense when you've got a graph of the library as input. If we treat the proof process as being (approximately) a pathfinding task, then operating directly on low level terms seems in many ways simpler. It (almost) becomes a subgraph extraction problem, + synthesised terms.
Sam (Feb 26 2024 at 15:52):
I (unreasonably) view tactics as cheating. It seems to me that they allow the model to shortcut actually solving the problem, because the model just "guesses" that a certain powerful tactic will find a solution, but doesn't know what solution it will find.
Sam (Feb 26 2024 at 15:54):
Doesn't that mean it's incapable of scaling to larger proofs at some point? When there's a trial-and-error process involved, there's a deep search, and it's very easy to hit combinatorial explosion.
Sam (Feb 26 2024 at 15:58):
My guess would be that, for sufficiently complex tactics, the model will never be able to reliably predict whether a tactic is correct in a given state. If it were good enough, it would necessarily have modeled the tactic's behaviour so well that the tactic itself is redundant because the model knows what it will do.
Sam (Feb 26 2024 at 16:05):
Or I'm wrong, and the tactic approach heuristic can get good enough to keep searches reasonably bounded even for large proofs. :shrug:
Xiyu Zhai (Feb 26 2024 at 16:32):
Suppose I have an informal representation of things, say, natural language; then terms in that informal representation can be seen as tactics in Lean. In this sense, tactics actually should be the output target. It's too hard to directly output a detailed low-level proof. It's sensible to first output a high-level proof sketch that is correct with high probability, then lower it to a low-level rigorous proof. This is what mathematicians do in their brain.
Sam (Feb 26 2024 at 16:41):
To be clear, I'm not advocating generating whole term proofs directly with an autoregressive model. I'd assume some type of hierarchical decomposition of terms which can be generated out of order, so there's still some ability to do high-level -> low-level reasoning, preferably using a learned strategy.
Sam (Feb 26 2024 at 16:42):
I would also say that, to some degree, a high level proof sketch can exist in latent space.
Xiyu Zhai (Feb 26 2024 at 17:00):
I would say the whole thing is very exciting. A lot of room for breakthroughs.
Gerard Calvo Bartra (Feb 27 2024 at 09:47):
Sam said:
I would also say that, to some degree, a high level proof sketch can exist in latent space.
This is very interesting. Maybe there is some way to force the latent space to form some "distribution" that structures it as high-level proof sketches, similar to what VAE models do, but surely not by forcing a Gaussian distribution. Does anyone have any idea what it would look like? I imagine it would be like a graph or tree sketch... This might be achievable by playing with the latent space and the loss or regularization term.
Gerard Calvo Bartra (Feb 27 2024 at 12:17):
Maybe forcing the latent space to be represented by graphs is too restrictive, but similar ideas could be used to force the latent space to follow certain valuable properties in a more "loose" way. For instance, Topological Data Analysis (TDA) could be used to encourage the latent space to organize itself in a way that preserves the high-level structure of the proof space, right? (I'm no expert in TDA, but it intuitively sounds good to me).
Jason Rute (Feb 27 2024 at 12:29):
As for latent representations, I had some ideas in that direction here: https://github.com/jasonrute/thoughts-on-ai-for-theorem-proving/blob/main/search.md (Most of those ideas I’ve never had a chance to work on.)
Last updated: May 02 2025 at 03:31 UTC