Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Automated theorem proving as program synthesis?


Rish Vaishnav (May 31 2025 at 11:41):

I've been wondering about this for a while now: to what extent can the task of automated theorem proving be interpreted as a task in program synthesis? The vast majority of Lean theorems are proven using tactics, which are metaprograms operating on the proof state (specifically, in Lean's MetaM monad). Could the task of general automated theorem proving be framed as one of synthesizing a provably correct MetaM program (i.e. sequence of tactics) that brings the proof state from some initial state, consisting of the initial goal (originating from the theorem's annotated type), to a final state where all goals have been solved?
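To make the framing concrete: every tactic proof already is such a metaprogram. Here's a toy sketch, assuming Mathlib's run_tac -- the do block is literally a TacticM Unit program driving the proof state to "no goals":

import Mathlib.Tactic.RunCmd
open Lean Elab Tactic

-- the `do` block is an ordinary `TacticM Unit` metaprogram; "proving the
-- theorem" amounts to synthesizing exactly this kind of program
example (p q : Prop) (hp : p) (hq : q) : p ∧ q := by
  run_tac do
    evalTactic (← `(tactic| constructor))  -- split `p ∧ q` into two goals
    evalTactic (← `(tactic| assumption))   -- close `p` with `hp`
    evalTactic (← `(tactic| assumption))   -- close `q` with `hq`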

One obvious benefit of this approach could be that, in the context of program correctness, the task of automated proving is provided with the stateful semantics of metaprograms in the form of tactic specifications. The AI agent is entirely aware of the effect that different tactics have on the proof state, and effectively solves the problem of "tactic selection" (analogous to premise selection) to narrow down the search space of possibly applicable tactics given the current proof state. This could improve on techniques used by other AI agents such as AlphaProof, which (as far as I understand) develop some notion of the effect of tactics through repeated trial and error -- though I'm not 100% sure it's enough to "close the gap" between the automation offered by proof assistants and the more haphazard nature of AI agents, which operate largely by "throwing stuff at the wall to see what sticks" and making incremental progress towards a goal in that way.

I'm not sure exactly what the "shortest path" to this dream could be. But one thing that seems pretty clear to me is that, sooner or later, we'll have to start writing tactic specifications. Though I don't necessarily envision a grand effort to specify every Lean tactic that has ever been written. We'll probably want to start with a few simpler tactics, just enough to "flesh out" the definitions, tooling, etc. for specifying tactics -- some of what's already been done in Lean4Lean will very likely be relevant here. Then, we could maybe attempt to train an AI to do some kind of "specification inference" on other tactics. Or instead, perhaps, in attempting to synthesize tactic sequences for proving theorems from an initial, very limited set of tactics, we could alight upon probable specifications for certain "highly desired" tactics towards which we can direct program synthesis for constructing the tactics themselves (i.e. "tactic synthesis").

Thoughts?

Rish Vaishnav (May 31 2025 at 12:10):

Another thing I've been thinking about: when proving the correctness of these metaprograms in solving goals, it may be very useful to assume a large set of axioms that make the task of program synthesis easier. Some of these axioms may be unsound, but still useful for setting up the proper proof state to direct program synthesis. We could perhaps also have some basic safeguards to ensure that this unsoundness isn't exploited. The kicker: even if the proof of correctness of a metaprogram rests on unsound foundations, the metaprogram may still produce a proof term that is correct in the original theory (with no reliance on the axioms that were used to prove the program correct).
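To sketch what the safeguard could look like: Lean can already report a theorem's axiom dependencies (this is what #print axioms does, via Lean.collectAxioms), so after synthesis we could reject any proof term that mentions the helper axioms. A minimal sketch, where the helper axiom name is hypothetical:

import Lean
open Lean

-- hypothetical "helper" axioms: usable while proving metaprograms
-- correct, but banned from final proof terms
def helperAxioms : List Name := [`Synthesis.helperAx]

-- accept a proof of `thm` only if it depends on none of the helpers
def usesOnlyTrustedAxioms (thm : Name) : CoreM Bool := do
  let axs ← collectAxioms thm
  return axs.all fun ax => !helperAxioms.contains ax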

Rish Vaishnav (May 31 2025 at 13:37):

Rish Vaishnav said:

Or instead, perhaps, in attempting to synthesize tactic sequences for proving theorems from an initial, very limited set of tactics, we could alight upon probable specifications for certain "highly desired" tactics towards which we can direct program synthesis for constructing the tactics themselves (i.e. "tactic synthesis").

To elaborate a bit further on this: I can imagine a "sorry tactic" that operates on the meta level, whose effect is specified by stateful preconditions and postconditions, and which the AI agent can insert when it "gives up" on trying to find an applicable tactic. The postconditions of the tactic represent the AI's "guess" at what the next proof state should be. This effectively connects this approach to the Draft-Sketch-Prove strategy. Applying the AI across many different theorems, we could possibly identify common patterns in the sorry tactics that it inserts, and determine an encompassing tactic specification that we send to a program synthesizer. If the synthesizer fails, then there's a good chance that we can't design a program satisfying the specification, which signals to the AI to try to find a different one, or possibly take another step back and change the specifications/instances of the sorry tactics themselves.
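As a toy illustration of the shape such a sketch takes (with ordinary sorrys standing in for the guessed intermediate states):

-- the agent commits to intermediate states and leaves a hole per step
example (n : Nat) : n + 0 + 0 = n := by
  have h1 : n + 0 + 0 = n + 0 := sorry  -- guessed intermediate state
  have h2 : n + 0 = n := sorry          -- another guessed step
  exact h1.trans h2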

Rish Vaishnav (May 31 2025 at 14:23):

Or rather, the sorry tactic I have in mind would be more like a generalized version of the sorry tactic we already have. While the existing sorry tactic always takes you to the "current subgoal completed" state, this generalized sorry' [state] tactic would take as its argument the exact next MetaM state to transition to (not pre- and postconditions as I said above). After writing the sketches for a set of theorems, the AI would then go over all of the instances of sorry', look at the states before and after running them, perform some kind of "clustering" on them, and then attempt to come up with a tactic specification (i.e. a Hoare triple: the preconditions and postconditions) of a tactic that covers all of them. This triple is then input to a program synthesizer. And if at any point we fail, we can backtrack to each of the previous steps, retrying in turn the specification synthesis, the clustering, or the initial tactic proof sketching.
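Here's a very rough sketch of what such a tactic could look like -- nothing like this exists today, and as a simplification it takes the next goal (a proposition) rather than a full MetaM state:

import Lean
open Lean Elab Tactic Meta

-- hypothetical `sorry'`: admit the current goal and continue from a
-- user-supplied "next state" (simplified here to a single new goal)
elab "sorry' " stx:term : tactic => withMainContext do
  let ty ← elabTerm stx (some (mkSort levelZero))  -- the claimed next state
  let g' ← mkFreshExprMVar (some ty)
  (← getMainGoal).admit          -- give up on the current goal
  replaceMainGoal [g'.mvarId!]   -- resume from the claimed state

example (n : Nat) : n + 0 = n := by
  sorry' (0 + n = n)   -- jump to a claimed proof state
  exact Nat.zero_add n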

Rish Vaishnav (Jun 01 2025 at 19:11):

Coming back here to contradict myself yet again. I think that some of what I was getting at with the sorry' [state] tactic above is in fact quite general to the problem of program synthesis, not specific to synthesizing tactic sequences. Let me try to lay things out quite explicitly here.

Suppose we have a program with a "monadic hole" ?X:

F = A >>= fun a => ?X a >>= fun x => B a x

We are trying to synthesize F such that it obeys the specification {F1} F {F2} (using Hoare triple syntax, where {P} R {fun r => Q r} means that under the stateful precondition P, the return value (bound by r) and resulting state of running R satisfy the postcondition Q).
Suppose that we already have a proof of {F1} A {fun a => A2 a} and ∀ a x, {B1 a x} B a x {F2}.
It suffices to find some program M to fill in the hole such that ∀ a, {A2 a} M a {B1 a}, but we can do better than that -- we can try to find M satisfying the spec {M1 a} M a {M2 a} with weaker preconditions (such that A2 a implies M1 a) and stronger postconditions (such that M2 a implies B1 a).
I did a small formalization of this using the new monadic program logic library, MPL:

import MPL

axiom T : Type
axiom S : Type
abbrev ps := (PostShape.arg S PostShape.pure)

axiom F1 : Assertion ps
axiom F2 : PostCond T ps
axiom A2 : PostCond T ps
axiom B1 : T → PostCond T ps

-- suppose `A` and `B` are programs with known specs
axiom A : StateM S T
axiom B : T → T → StateM S T
axiom hA : ⦃F1⦄ A ⦃⇓ a => A2.1 a⦄
axiom hB : ∀ a x, ⦃(B1 a).1 x⦄ B a x ⦃⇓ b => F2.1 b⦄

-- code sketch; `X` is a hole
noncomputable def F (X : T → StateM S T) := do
  let a ← A
  let x ← X a
  B a x

-- the hole spec must "slot into" the postcondition of `A` and the precondition of `B`
theorem hF (X : T → StateM S T)
  (hX : ∀ a, ⦃A2.1 a⦄ X a ⦃⇓ x => (B1 a).1 x⦄)
  : ⦃F1⦄ F X ⦃⇓ b => F2.1 b⦄ := by
  unfold F
  mintro _
  mspec hA
  mspec hX r
  next r' =>
  mspec hB r' r

-- now, suppose we can synthesize `M1` and `M2` such that:
axiom M1 : T → Assertion ps
axiom M2 : T → PostCond T ps
-- M1 is a weaker precondition
theorem hM1 : ∀ a, A2.1 a ⊢ₛ M1 a := sorry
-- M2 is a stronger postcondition
theorem hM2 : ∀ a, M2 a ⊢ₚ ⇓ x => (B1 a).1 x := sorry

-- we try to synthesize `M` such that:
axiom M : T → StateM S T
axiom hM : ∀ a, ⦃M1 a⦄ M a ⦃M2 a⦄

example : ⦃F1⦄ F M ⦃⇓ b => F2.1 b⦄ := by
  refine hF M fun a => ?_
  mintro h
  mwp
  mspec hM
  exact (hM2 a).1 r
  exact hM1 a

In order to arrive at a general subroutine M that we can use elsewhere in our code, we'd want the preconditions to be as weak as possible, and the postconditions to be as strong as possible. That might be one of the main challenges here. The "clustering" scheme I described above -- grouping similar hole specs together, comparing their similarities, and then arriving at a general spec that encompasses all of them -- might be one way to go about this.
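To make "covering" precise on a tiny example: if two call sites demand {P₁} M {Q₁} and {P₂} M {Q₂}, then the single spec with the disjunction of the preconditions and the conjunction of the postconditions entails both. A minimal sketch over bare state functions σ → α × σ, sidestepping MPL's machinery:

-- a Hoare triple for a pure state function
def Triple {σ α : Type} (P : σ → Prop) (m : σ → α × σ) (Q : α → σ → Prop) : Prop :=
  ∀ s, P s → Q (m s).1 (m s).2

-- the "join" spec (weaker pre, stronger post) covers both instances
theorem joinCovers {σ α : Type} {P₁ P₂ : σ → Prop} {Q₁ Q₂ : α → σ → Prop}
    {m : σ → α × σ}
    (h : Triple (fun s => P₁ s ∨ P₂ s) m (fun a s => Q₁ a s ∧ Q₂ a s)) :
    Triple P₁ m Q₁ ∧ Triple P₂ m Q₂ :=
  ⟨fun s hs => (h s (Or.inl hs)).1, fun s hs => (h s (Or.inr hs)).2⟩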

Anyways, this is just something I wanted to write down for myself/throw into the ether and possibly return to a bit later on. If anyone found my monologue intriguing, or knows of any existing work that goes in this direction, please do let me know!

Bas Spitters (Jun 02 2025 at 11:12):

This work uses GenAI for Program Synthesis (in F*)
https://arxiv.org/pdf/2405.01787


Last updated: Dec 20 2025 at 21:32 UTC