Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Is restricting the set of tactics a good idea for RL?
Zhanrong Qiao (Oct 25 2024 at 03:53):
As hinted in a previous post, I am just interested in this question...
More specifically, I imagine a reinforcement learning agent interacting with Lean, which only emits the following tokens:
- `induction`: as the universal elimination rule.
- `constructor` (or maybe `refine`): as the universal introduction rule.
- `let`: as the universal cut rule.
- `rw`: for better handling of equalities.
- A few language primitives including `fun` (lambda), `->` (Pi types), local variables, etc.: for constructing data.
- A few pre-defined inductive types including `PSigma`, natural numbers and well-founded trees: they are used in the original Martin-Löf type theory and can be used to encode most inductive types.
- `apply?`: for unification search within the whole context (I imagine being able to refer to theorems by statements instead of names could be a boost to performance; this serves a similar role as retrieval augmentation). Maybe there could be an interactive exploration of the discrimination tree, where the agent chooses to expand some node and Lean lists its child nodes like in a file-explorer tree.
- A special tactic which calls all important automations in parallel (e.g. `simp`, `linarith`, `aesop`, etc.), trying to close the current goal.
This should still be complete, in that any well-typed term is constructible, and it could cut down the number of branches at each step, with fewer "duplicated" actions (e.g. `use` and `constructor` essentially do the same thing). So, for example, trajectory sampling could be more effective...?
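To illustrate the duplication, a minimal sketch in plain Lean 4 (my own toy example, no Mathlib needed):

```lean
-- On a conjunction, `constructor` and `refine ⟨?_, ?_⟩` produce the same
-- two subgoals, so a restricted agent would only need one of them.
example : 1 + 1 = 2 ∧ 2 + 2 = 4 := by
  constructor
  · rfl
  · rfl

example : 1 + 1 = 2 ∧ 2 + 2 = 4 := by
  refine ⟨?_, ?_⟩
  · rfl
  · rfl
```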
Alex Meiburg (Oct 25 2024 at 21:42):
I mean, I think you could restrict to just `intro [freshname]`, `let [freshname] := [localname] [localname]`, `let [freshname] := [libraryname]`, and `apply [localname]` -- where `[localname]` is a local hypothesis, and `[libraryname]` is a named theorem already declared. This lets you construct any expression tree, so it's obviously _complete_ in the sense you give! But also, obviously, very impractical. :)
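For instance (a toy sketch; the hypothesis names are arbitrary), a whole proof can be written with just those moves:

```lean
-- Only the restricted moves are used: `intro`, `let ... := ...` applied
-- to local names, and `apply` of a local name.
example (p q r : Prop) : (p → q) → (q → r) → p → r := by
  intro hpq          -- intro [freshname]
  intro hqr
  intro hp
  let hq := hpq hp   -- let [freshname] := [localname] [localname]
  apply hqr          -- apply [localname]
  apply hq
```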
Alex Meiburg (Oct 25 2024 at 21:43):
My point is that when you try to reduce the search space, you also lose out on some things that might make life easier; _and_ you lose out on side information. A model trained on existing proofs will see that humans use `use` and `constructor` in different places, even though they're equivalent, and that will (ironically?) teach it additional useful(!) facts about problem structure.
Alex Meiburg (Oct 25 2024 at 21:47):
For example, why do you want to have `rw` but not `conv` or `simp_rw`? What about `measurability`?
And while `let` may be a universal cut rule, if you use it to produce local values of type `Prop`, then (at least in the default Lean infoview) those proof terms will clutter up the context a lot, making it very hard to actually understand and advance the local context. For that reason `have` could be good.
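A small illustration of the difference (my own example; the infoview comments are approximate):

```lean
-- With `let`, the proof term itself stays in the context; with `have`,
-- only the statement is recorded.
example (n : Nat) (h : n = 0 ∧ 0 = 0) : n = 0 := by
  let h1 := h.1     -- infoview shows roughly `h1 : n = 0 := h.1`
  exact h1

example (n : Nat) (h : n = 0 ∧ 0 = 0) : n = 0 := by
  have h2 := h.1    -- infoview shows just `h2 : n = 0`
  exact h2
```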
Zhanrong Qiao (Oct 26 2024 at 00:25):
Yes, that is equivalent! Actually many find term mode and tactic mode equally usable, as long as stronger tactics like `simp_rw` etc. are not needed (I agree I should put `simp_rw` instead of `rw` in my list). Mathlib contains many lemmas written entirely without tactics. As for `have`: maybe we can change the infoview to hide terms with types in `Prop`? There is definitional proof irrelevance, after all.
However, I doubt the importance of the "side information" the tactics might carry, as different people already prefer different tactics (e.g. some prefer `refine` instead of `constructor` and `use` because it lets you write shorter proofs) and Mathlib is already a mixture of styles. And code comments would make up for the side information.
For things like `ring` or `measurability`, I would add them to the "special tactic of all available automations" if they do not take a significant amount of time to run (the heavier tactics will be the bottleneck anyway).
Sid (Oct 26 2024 at 01:00):
Zhanrong Qiao said:
Yes, that is equivalent! Actually many find term mode and tactic mode equally usable, as long as stronger tactics like `simp_rw` etc. are not needed (I agree I should put `simp_rw` instead of `rw` in my list). Mathlib contains many lemmas written entirely without tactics. As for `have`: maybe we can change the infoview to hide terms with types in `Prop`? There is definitional proof irrelevance, after all.
However, I doubt the importance of the "side information" the tactics might carry, as different people already prefer different tactics (e.g. some prefer `refine` instead of `constructor` and `use` because it lets you write shorter proofs) and Mathlib is already a mixture of styles. And code comments would make up for the side information.
For things like `ring` or `measurability`, I would add them to the "special tactic of all available automations" if they do not take a significant amount of time to run (the heavier tactics will be the bottleneck anyway).
How many basic tactics would you need to simulate the "heavier" tactics? Because if it's a lot, then you're just creating more branch points. Or if you do as @Alex Meiburg is saying, then you just defer the bigger branch points to the selection of the right "localname" and "libraryname", which would have a huge number of possibilities.
Zhanrong Qiao (Oct 26 2024 at 01:09):
Sid said:
How many basic tactics would you need to simulate the "heavier" tactics? Because if it's a lot, then you're just creating more branch points. Or if you do as Alex Meiburg is saying, then you just defer the bigger branch points to the selection of the right "localname" and "libraryname" which would have a huge number of possibilities
I let simple and heavy automation tactics run in parallel, so:
- When simple tactics, e.g. `measurability`, are able to prove the claim, the other running tactics can be terminated immediately;
- If the case cannot be handled by simple tactics, they should not take too much time to fail, and the heavier tactics might find a proof.
And I suggest an interactive discrimination tree exactly for reducing the number of possibilities when choosing an appropriate libraryname: one might claim a proposition by (partial) statement, which only involves data, and data constructors are normally smaller in number than theorems; then unification search like `exact?` is used to check if this claim is a known proposition.
Zhanrong Qiao (Oct 26 2024 at 01:17):
The construction of a proposition can be split into multiple steps; indeed, this just creates more branch points. But a low branching factor also means that a policy table (or anything approximating it) can be much smaller (simpler). If there is a value function instead, choosing a policy involves far fewer evaluations.
Jason Rute (Oct 26 2024 at 01:23):
My point of view is that this sort of thing needs to be rigorously experimented with to know for sure either way. I am doubtful it would work, since big tactics give a lot of power, but maybe it could. Also a few more points: (1) it would be harder to get training data for a restricted set of tactics; however, it is possible by decomposing tactics or using RL with synthetic theorems. (2) I think proper RL would help to better handle the issue of multiple equivalent tactics by making the model more opinionated. (3) At least one use case of AI for Lean would be to have it output idiomatic proofs. If you mess with the tactics too much, it could be far from idiomatic (but as we have seen with the AlphaProof proofs, that also happens with RL).
Zhanrong Qiao (Oct 26 2024 at 01:30):
Jason Rute said:
My point of view is that this sort of thing needs to be rigorously experimented with to know for sure either way. I am doubtful it would work, since big tactics give a lot of power, but maybe it could. Also a few more points: (1) it would be harder to get training data for a restricted set of tactics; however, it is possible by decomposing tactics or using RL with synthetic theorems. (2) I think proper RL would help to better handle the issue of multiple equivalent tactics by making the model more opinionated. (3) At least one use case of AI for Lean would be to have it output idiomatic proofs. If you mess with the tactics too much, it could be far from idiomatic (but as we have seen with the AlphaProof proofs, that also happens with RL).
Thank you! I just want to confirm if this is not obviously wrong... Surely big tactics are included, I am just considering combining them into one (or a few); IIRC there is similar practice in Isabelle (?). As for idiomatic proofs, they might be reconstructed later, but during proof search they are not the primary concern.
Sid (Oct 26 2024 at 05:23):
One thing that might be relevant is -- if you're using a probabilistic policy to generate the next actions, there's already going to be a soft version of selection going on -- as some actions will have a much lower probability than others. Also a low branching factor is good but if it's leading to longer proof length, then the search space will not necessarily decrease.
Kim Morrison (Oct 26 2024 at 07:19):
Is anyone actually using the internals of `apply?` to do lookup during tree search? Once it has built the discrimination tree it is very fast, and it computes lazily, so you can ask for as many results as needed.
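(For concreteness, the user-facing side; this sketch assumes Mathlib is imported, and the exact suggestion text may differ:)

```lean
import Mathlib

-- `exact?` looks the goal up in the discrimination tree of all lemmas
-- in the environment and suggests a closing term
-- (something like `exact Nat.add_comm a b`).
example (a b : Nat) : a + b = b + a := by
  exact?
```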
Zhanrong Qiao (Oct 26 2024 at 08:37):
Sid said:
One thing that might be relevant is -- if you're using a probabilistic policy to generate the next actions, there's already going to be a soft version of selection going on -- as some actions will have a much lower probability than others. Also a low branching factor is good but if it's leading to longer proof length, then the search space will not necessarily decrease.
Yes. But (1) selection requires more learning, and (2) the search space in mathematical theorem proving is always recursively enumerable (has cardinality ω) and there is no way to essentially decrease it.
Jason Rute (Oct 26 2024 at 10:55):
Kim Morrison said:
Is anyone actually using the internals of `apply?` to do lookup during tree search? Once it has built the discrimination tree it is very fast, and it computes lazily, so you can ask for as many results as needed.
Not that I’m aware, but also that would only be relevant to a few tactics like `exact`, `apply`, and `refine`, right?
Jason Rute (Oct 26 2024 at 10:55):
Premise/lemma/argument selection is much more general.
Zhanrong Qiao (Oct 26 2024 at 11:17):
Jason Rute said:
Premise/lemma/argument selection is much more general.
The point of premise/lemma/argument selection is that the selected items are more likely to be used somewhere in the proof, where they (conceptually) close a subgoal. Using `apply?` at the point where some known theorem might be used should have similar effects?
Jason Rute (Oct 26 2024 at 11:22):
The problem with `apply?` and `exact?` is that they require a very specific syntactic match, so they wouldn’t be able to find a good simp lemma, for example. It might be possible to make good use of `apply?` in some system, but I don’t think it would replace neural argument selection.
Zhanrong Qiao (Oct 26 2024 at 11:28):
Tagging simp lemmas is indeed difficult (a concept/design of some "normal form" is needed). But in the case of a congruence closure tactic, in theory it can just take the whole context into consideration by checking the discrimination tree for each individual rewrite?
Zhanrong Qiao (Oct 26 2024 at 11:31):
I agree that neural argument selection will not be replaced, since sometimes a fuzzy match might suggest a possible direction for a proof.
Jason Rute (Oct 26 2024 at 13:06):
Yes, for a number of tactics, like `rw`, `exact`, `apply`, you could do some form of constrained decoding with discrimination trees.
Kim Morrison (Oct 27 2024 at 09:16):
I suspect it is a mistake to say that `apply?` only suggests `apply` tactics! Just because `apply` is so fundamental and general.
Kim Morrison (Oct 27 2024 at 09:16):
Similarly, there is `rw?`, which suggests rewrites which will actually work.
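For example (a sketch of my own; `Nat.add_zero` is a core lemma, and the suggestion list may contain other options):

```lean
-- On this goal, `rw?` would list rewrites that actually fire,
-- e.g. the following:
example (a b : Nat) : a * (b + 0) = a * b := by
  rw [Nat.add_zero]
```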
Kim Morrison (Oct 27 2024 at 09:16):
And `have?` for forward reasoning.
Kim Morrison (Oct 27 2024 at 09:17):
Unless someone has specific evidence that these are too slow, it seems crazy to leave these suggestions on the ground.
Eric Wieser (Oct 27 2024 at 09:55):
Kim Morrison said:
Once it has built the discrimination tree it is very fast,
What's the recommended way to preheat this cache?
Kim Morrison (Oct 27 2024 at 22:49):
Excellent question, to which I don't know the answer. :-)
Sid (Oct 28 2024 at 00:12):
Zhanrong Qiao said:
Sid said:
One thing that might be relevant is -- if you're using a probabilistic policy to generate the next actions, there's already going to be a soft version of selection going on -- as some actions will have a much lower probability than others. Also a low branching factor is good but if it's leading to longer proof length, then the search space will not necessarily decrease.
Yes. But (1) selection requires more learning, and (2) the search space in mathematical theorem proving is always recursively enumerable (has cardinality ω) and there is no way to essentially decrease it.
(1) Agreed (2) Sure but in practice we have finite resources and so the number of nodes in the search tree that can be enumerated is going to be bounded by a constant. Given that constraint, if you replace a node with a large number of children with multiple nodes, each with a small number of children, is the latter better? It's not clear to me.
Sid (Oct 28 2024 at 00:21):
Kim Morrison said:
Is anyone actually using the internals of `apply?` to do lookup during tree search? Once it has built the discrimination tree it is very fast, and it computes lazily, so you can ask for as many results as needed.
Apologies for the basic question (still early in terms of learning Lean) -- my understanding was that `apply?` works on the current active goal out of the list of goals, and so for a particular application of `apply?`, only one match query is required. What do you mean by multiple results? Do you mean that if an index is maintained for all the expressions `apply?` has been used for, then every time `apply?` is tried, one can try to match against all past such expressions in the index?
Kim Morrison (Oct 28 2024 at 00:33):
`apply?` returns a list of suggestions (unless one actually closes the goal --- then it stops). You can ask for this list to be any length.
Kim Morrison (Oct 28 2024 at 00:33):
If I remember correctly, you can probably actually get a continuation out of it that lets you keep asking for one more, but the code was refactored at some point and this may require some tweaks to get access to the continuation.
Sid (Oct 28 2024 at 01:54):
Kim Morrison said:
`apply?` returns a list of suggestions (unless one actually closes the goal --- then it stops). You can ask for this list to be any length.
Ahh ok. How does it decide which term to try to match the goal with next when generating the list of suggestions? From what I understand, matching is a unification procedure, right? So we don't know beforehand which terms might match the goal.
Zhanrong Qiao (Oct 28 2024 at 03:48):
Sid said:
(1) Agreed (2) Sure but in practice we have finite resources and so the number of nodes in the search tree that can be enumerated is going to be bounded by a constant. Given that constraint, if you replace a node with a large number of children with multiple nodes, each with a small number of children, is the latter better? It's not clear to me.
Also not clear to me, but many RL methods simply cannot effectively handle too many children (whether value-based or Q-function-based), and a policy based on an LLM is equivalent to creating multiple nodes anyway (LLMs output tokens one by one, and each token is conceptually a node with a smaller number of children). And say each node has 10 children: the added internal nodes will only take up ~1/10 of the total number of nodes.
Kim Morrison (Oct 28 2024 at 03:56):
Sid said:
Kim Morrison said:
`apply?` returns a list of suggestions (unless one actually closes the goal --- then it stops). You can ask for this list to be any length.
Ahh ok. How does it decide which term to try to match the goal with next when generating the list of suggestions? From what I understand, matching is a unification procedure, right? So we don't know beforehand which terms might match the goal.
It builds a discrimination tree of all lemmas in the environment. We can then look up the goal in this discrimination tree to efficiently find lemmas that could possibly match, then filter by actually trying to apply them. Both the loop and `apply` are surprisingly fast.
Kim Morrison (Oct 28 2024 at 03:57):
(similar story for `rw?`)
Zhanrong Qiao (Oct 28 2024 at 04:52):
Sid said:
Ahh ok. How does it decide which term to try to match the goal with next when generating the list of suggestions? From what I understand, matching is a unification procedure, right? So we don't know beforehand which terms might match the goal.
Unification depends on both terms having the same shape (except for metavariables, which might match anything). Most lemmas, even after substituting metavariables for the universally quantified variables, still have e.g. a non-meta head symbol, so if that does not match the goal we can skip the lemma; the same goes for all subterms. A tree can be used to classify lemmas by their non-meta parts, so in many cases we can still filter out lots of things which definitely do not unify... (There are probably many details which I do not know about how Lean actually implements this, though.)
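As a toy illustration of the filtering idea (not Lean's actual `DiscrTree`; the flat index and all names here are made up), indexing lemmas by the head constant of their conclusion already prunes most candidates:

```lean
-- A miniature expression type: constants, applications, metavariables.
inductive Expr' where
  | const (name : String)
  | app (f a : Expr')
  | mvar

-- The head symbol of an expression, if it is not a metavariable.
def headOf : Expr' → Option String
  | .const n => some n
  | .app f _ => headOf f
  | .mvar    => none

-- Hypothetical lemmas, keyed by the head constant of their conclusion.
def index : List (String × String) :=
  [("Eq", "Nat.add_comm"), ("Eq", "Nat.mul_comm"), ("Iff", "not_not")]

-- Only lemmas sharing the goal's head can possibly unify with it;
-- a goal headed by a metavariable might match anything.
def candidates (goal : Expr') : List String :=
  match headOf goal with
  | some h => (index.filter (·.1 == h)).map (·.2)
  | none   => index.map (·.2)

-- A goal of shape `Eq ?a ?b` only consults the `Eq` bucket:
#eval candidates (.app (.app (.const "Eq") .mvar) .mvar)
-- ["Nat.add_comm", "Nat.mul_comm"]
```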
Zhanrong Qiao (Jan 08 2025 at 15:51):
I am guessing this looks more like the Emacs agda-mode, just with a list of predictions for what can be used to "refine"... Has anyone tried applying machine learning to an interface like the agda-mode?
Last updated: May 02 2025 at 03:31 UTC