Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: HOList or Lean?

Aidan Swope (Nov 10 2020 at 20:40):

Hello! I'm interested in getting started developing ML-based search strategies for ATP, but I'm having some trouble deciding between hacking on HOList vs Lean. Lean is appealing because it's going to be used for the IMO challenge and I'm more familiar with it as a prover, but HOList seems easier to get started with because it's got a "gym environment" with a fixed training set and a leaderboard of existing results. I'm also worried about developing for Lean 3 knowing that Lean 4 is coming.

Does anyone here have advice on whether Lean is in a good state to begin integrating with ML-based search? In particular I'm curious as to whether there's work on a similar "gym environment" for Lean, and whether proofs & methods developed for Lean 3 will be portable to Lean 4.

Kevin Lacker (Nov 10 2020 at 21:20):

I am interested in the same area. I don't think Lean 3 is in a great state to integrate with ML right now. The problem is that any sort of interfacing with external processes and doing some moderate amount of work in Lean is pretty slow. Gabriel Ebner has done the most work like this - see his talk in January - that was integrating with ATPs rather than with a machine-learning-driven system, but the challenges are similar. But he's waiting on Lean 4 for more work and I think his judgment is correct there

Kevin Lacker (Nov 10 2020 at 21:21):

To me, HOList doesn't seem promising to work with either. There's a leaderboard but the only people who interact with it in any way are people at one particular group at Google. So it isn't so much a big public thing as it is a specific approach by that group. I just suspect they will abandon it at some point, it's not like there's a community using it for anything. Maybe it is a good fit for what you're doing, though, if you're interested in work that would map nicely onto what they are doing

Aidan Swope (Nov 10 2020 at 21:23):

I see, thanks. After looking into it more it does look like Lean 4 will have nicer utilities for integrating with external ATP. Have you found other environments you see as more promising?

Kevin Lacker (Nov 10 2020 at 21:24):

I think it depends what you want to do - do you want to publish an AI paper, or do you want to publish an ATP paper, or do you want to make some tool that will help mathematicians, or something else entirely

Aidan Swope (Nov 10 2020 at 21:25):

I'm primarily from an ML background, and I'm interested in developing reinforcement-learning-based search strategies that will help big ATP efforts like the IMO challenge (the work would look much more like an ML paper than an ATP paper)

Kevin Lacker (Nov 10 2020 at 21:29):

if you have some theory for how to beat the results in the "Learning to Reason in Large Theories without Imitation" paper, then hacking your idea into holist would make sense

Kevin Lacker (Nov 10 2020 at 21:31):

kind of "lower in the stack" is work like https://arxiv.org/abs/2005.02094 on the automated theorem provers - those are aiming at simpler logical forms like first-order logic instead of all this meta-inductive-type stuff. there, the state of the art generally does not use machine learning. it seems possible for machine learning to help out there though. i dunno the best way to approach, maybe find one of the open source theorem provers and tweak it, depends on your idea

Kevin Lacker (Nov 10 2020 at 21:33):

"higher in the stack" would be theorem proving that lives in the interface layer. afaict lean 3 is slower than isabelle so the coolest automated stuff happens in isabelle there, but neither one is really suitable for modern AI techniques unless you pipe all the data over to some other process that uses the gpu, or something else that is kind of mucking around in the internals

Aidan Swope (Nov 10 2020 at 21:35):

My work would probably be most appropriate "higher in the stack" -- at a high level I want to help turn interactive provers into automated ones with learned policies, possibly by imitating human proofs + RL, similar to the existing work on HOList. Mainly this is because proofs are shorter there.

I think I read that Lean 4 will have utilities to help with "symbolic simulation" so you don't have to translate Lean terms into your library before working with them, am I interpreting that right?

Kevin Lacker (Nov 10 2020 at 21:38):

I don't know - I don't think the features of Lean 4 are quite all figured out yet

Kevin Lacker (Nov 10 2020 at 21:41):

so I think that it will just not be easy to integrate Lean 3 with an external AI process. hopefully Lean 4 makes this a lot better, but don't hold your breath waiting for that to happen. with Isabelle it is certainly possible to integrate with external processes, sledgehammer does this, but I don't know enough more about the Isabelle system to firmly recommend for or against it.

Aidan Swope (Nov 10 2020 at 21:44):

This was very very helpful, thank you!

Jason Rute (Nov 10 2020 at 23:22):

My short answer is that ML for Lean would be bleeding edge. There is no good interface or dataset yet, but I'm optimistic that there will be more infrastructure soon. Having said that, I hope you keep an interest in Lean. It has a fast growing database, and I think ML for theorem proving does not need to be exclusive to any single theorem prover.

Jason Rute (Nov 10 2020 at 23:22):

There is a lot of interest in ML for Lean, some from large research groups like yourself (I'm assuming you are at NVIDIA from your LinkedIn), and some from hobbyists, or those firmly in the theorem proving community. I think one of the best things you and others can do now is clearly state what you would want if an interface for Lean 3 existed (or if a dataset existed). There are a few of us interested in building such things, and we would be happy to discuss further.

Jason Rute (Nov 10 2020 at 23:22):

As for Lean 4, it is unclear how long it will take. I do think that there will be advantages in theory, but there won't be an ML framework when it is built that I am aware of, so I'm inclined to start building something for Lean 3.

Jason Rute (Nov 10 2020 at 23:22):

As for the Grand Challenge, I'm not convinced that ML and RL done on human written Lean proofs will get very far in proving Olympiad problems, but I do think it will go really far in improving the ability of Lean to do trivial and simple tasks. Also, I see the appeal for at least some dataset in the same language as the Grand challenge.

Jason Rute (Nov 10 2020 at 23:26):

As for other ML fo ITP projects, it really depends what you want. HOList and CoqGym have some of the best datasets and RL interfaces, but unfortunately no one besides the developers uses them yet. Also look into ProverBot9001 for Coq. There are more datasets (for Isabelle for example), but I don't think they have interactive RL environments. Also, you can be like OpenAI and build your own Metamath environment since the logic is really simple to implement.

Aidan Swope (Nov 11 2020 at 00:58):

I definitely agree that naive RL on individual proof steps won't get us all the way to IMO problems. One hope I have is that ML-guided "mid-level provers" will enable the kind of "very-high-level" tactics Daniel Selsam described in his IMO challenge talk by dispatching intermediate goals. Another is that hierarchical procedural abstraction would allow learning multiple levels of guidance at once.

As for a Lean automation interface / dataset: I'd definitely be happy to discuss more. I have personal interest in Lean from a type theory and formal proving perspective, and I think big datasets with well-defined measures of progress are responsible for a lot of success in ML.

In the immediate future I'm leaning towards HOList for this project, partly for performance reasons as Kevin brought up. After discussing with the HOList authors, time taken to run the provers seems to be one of the biggest bottlenecks; their latest work uses a lot of compute already, mostly for generating experience. Fortunately, I don't think any given ML algorithm is likely to be too specialized to the prover it was developed in.

Kevin Lacker (Nov 11 2020 at 04:33):

Jason Rute said:

I think one of the best things you and others can do now is clearly state what you would want if an interface for Lean 3 existed (or if a dataset existed). There are a few of us interested in building such things, and we would be happy to discuss further.

I would like functionality similar to the Isabelle sledgehammer, where you could export a theorem statement to some external prover, and then have the Lean kernel verify that proof. On top of that, I would like a script that you could give your own theorem prover to it, ML-based or otherwise, and it would just check against all the existing theorems in mathlib to see which ones your theorem prover could handle. That combo would basically let you use "all the theorems in mathlib" as training data. I suspect over time that would become the largest & best ML-for-math training dataset, based on the velocity of the mathlib community.

Brando Miranda (May 20 2021 at 17:48):

Sorry for potentially hijacking the HOList vs Lean thread (though my question will probably be relevant)...but I was wondering, what is wrong with Coq vs HOList or Lean?

Jason Rute (May 20 2021 at 22:08):

Since this is resurrected and I had totally forgot about this conversation, @Aidan Swope, let me mention that since this conversation, we have produced two Lean datasets, a theorem prover evaluation harness, a trained Transformer model, and a tactic for users to use that model to assist their proofs. Here is the paper and a talk. We would be happy to assist anyone looking to use the dataset or evaluation harness for their own projects.

Jason Rute (May 20 2021 at 22:27):

Brando Miranda said:

Sorry for potentially hijacking the HOList vs Lean thread (though my question will probably be relevant)...but I was wondering, what is wrong with Coq vs HOList or Lean?

I don't think anyone here is knocking on Coq. (I even mention it above as a possibility.) I think Adrian expressed his choice as (1) HOList because it is the easiest to use right now, or (2) Lean because he knows it the best.

Jason Rute (May 20 2021 at 22:28):

I think however this highlights a big issue with this line of research. It is very hard to get started. Let's say you are interested in developing a neural theorem prover for Coq (or Lean, Metamath, HOL Light, Isabelle, HOL4, or Mizar).

First you have to learn the system, downloading it and playing around with it enough to get familiar. It is much easier if you already know the system.
Then you have to look for datasets and evaluation frameworks or build your own. Right away, you may run into a problem. You could be too constrained to other researchers choices. For example, CoqGym doesn't have premise selection if I understand correctly, meaning it does half the work for you by already giving you the premises of the proof. Tactician has a very specific way that a machine learning agent plugs into the framework, meaning a Transformer model wouldn't be useful. Neither support reinforcement learning. The same issue also applies to any other project including ours. Unless you want to be unduly constrained, you likely have to start hacking the theorem prover to make it do what you want. (Metamath and FOL (Mizar) have a slight advantage here since it is fairly easy for reconstruct the logic from scratch without having to plug it into another system.)
If you want to create a tool in the end which is helpful for users, that might be more work as well.

Jason Rute (May 20 2021 at 22:28):

For a while I dreamed of a gym like system that one could hook their neural prover into and it would work for all the standard ITPs, but I've since become more pessimistic. The best I hope for now is that a research institute like FAIR would put the effort into instrumenting all the theorem provers so they could try out their approach on all of them. However, I doubt other researchers would be satisfied with their approach and willing to use their system without modifications.

Brando Miranda (May 20 2021 at 22:52):

Thanks for the reply Jason! I am curious, is there something particularly bad about Coq?

Mario Carneiro (May 20 2021 at 23:10):

I think there could be a project going in the other direction, where ML researchers document their requirements from the prover so that the prover can support these use cases. Currently theorem provers are basically universally designed for interfacing with humans, not ML systems, which means there are a lot of square-peg designs in these ML interfaces where you try to drive the theorem prover in a way it fundamentally wasn't designed to be used, and as a result you have significant overheads and low quality data to work with, meaning less learning and poorer results.

As someone who has been designing a high performance theorem prover backend this is definitely on my mind, but I'm not an ML researcher and I don't want to make things harder for future ML systems that want to target MM0.

Aidan Swope (May 21 2021 at 02:19):

Jason Rute said:

Since this is resurrected and I had totally forgot about this conversation, Aidan Swope, let me mention that since this conversation, we have produced two Lean datasets, a theorem prover evaluation harness, a trained Transformer model, and a tactic for users to use that model to assist their proofs. Here is the paper and a talk. We would be happy to assist anyone looking to use the dataset or evaluation harness for their own projects.

Thank you for the mention and info, Jason! Over the last few months my group has been working with HOL Light, but I'll bring up this new benchmark with them. Nothing in our code or method is too specific to HOL Light, so I could see us switching.

I've noticed that Lean 4 seems to be nearing a stable release. I don't know very much about Lean's internals, so I'm wondering: when and how do you expect Lean 4 to change the ML + ATP landscape? I've heard that many of the proofs in Mathlib may be ported to Lean 4, but I don't know how long I'd expect to wait until a sizable dataset is available.

Aidan Swope (May 21 2021 at 02:24):

Brando Miranda said:

Sorry for potentially hijacking the HOList vs Lean thread (though my question will probably be relevant)...but I was wondering, what is wrong with Coq vs HOList or Lean?

I'll echo Jason's thoughts on it being hard to get started, with very little developed infrastructure for gathering data and interacting with an environment relative to, say, doing reinforcement learning on Atari games. Another big factor is that benchmarks face a sort of Schelling point problem -- it's hard to produce a convincing paper if you're not comparing to existing results, so most ML research tends to cluster around already-existing benchmarks (aside from the time and effort it takes to set up a new one).

Jason Rute (May 21 2021 at 04:00):

@Aidan Swope I think it will be a long while before mathlib is ported to Lean4, but I could be mistaken. It will be much easier to interface with Lean, but at the same time someone is going to have to put in the effort to provide the infrastructure for proof data gathering, evaluation, and reinforcement learning. Also the IMO grand challenge is in Lean 4 which attracts folks, but at the same time that is more of a showcase of Microsoft Research than a completion or benchmark.

Jason Rute (May 21 2021 at 04:07):

@Aidan Swope also to be fair to Brando, Coq has right now the most benchmarks (at least by different groups). There is ProverBot9001, Tactician, and CoqGym (ASTactic), as well as the CoqHammer benchmarks.

Daniel Selsam (May 21 2021 at 04:27):

Jason Rute said:

Also the IMO grand challenge is in Lean 4 which attracts folks, but at the same time that is more of a showcase of Microsoft Research than a completion or benchmark.

I can't speak for everybody involved, but I aim to develop the IMO-GC into the ImageNet of the 2020s, i.e. the premiere AI benchmark that helps shape the global AI research agenda.

Stanislas Polu (May 21 2021 at 07:54):

In case useful, I spent some time this week refreshing the PACT paper related projects (@Jason Rute lean_proof_recording and @Jesse Michael Han lean-step). I was able to successfully apply both techniques to lean-liquid on top of updated recent mathlib. @Mario Carneiro I think these two projects are a good specification by example of what ML researchers would need to extract ground-truth human examples. I would look in particular at Jason's code modification to capture tactic applications and Jesse's approach to extracting proof terms. They are both implemented in pure lean, Jesse's approach does not need any modification to Lean but Jason's does. Maybe exposing an improved version of what Jason is doing is a good idea for Lean4? I guess basically provide a way to hook some code at each tactic execution that changes the tactic state?

As for having an environment to interact with Lean3, I plan to hack one this week based on @Jesse Michael Han 's lean-tpe. It'll be slow-ish but also designed to be easy to parallelize; should give you again a good specification by example of what we're looking for. It'll be mostly a REPL on top of pure Lean metaprogramming, so again, Lean3 is already good enough here and speed will be the main feature request for Lean4 :)

Hope this helps!

Stanislas Polu (May 21 2021 at 08:02):

I think the very best we could do, is optimize the generation of these datasets, converging toward a standard version we all like and automatize their generation as part of Lean's CI pipeline, to make them one download away from any ML researcher + having a simple versatile gym like environment.

As for the environment, I think exposing an interactive tactic mode REPL (as is my plan) is a very solid MVP for this. We don't want to use the lean-server for that as we don't want to send diffs triggering reparse/rebuild as it makes backtracking in proofs a living hell. We instead want to have a stateful REPL instantiated on a declaration name which returns serialized tactic states along with persistent identifiers for them and accept applying tactic code to previously returned tactic state identifiers.

Jason Rute (May 21 2021 at 13:25):

Minor correction: The lean proof recording repo is currently here: https://github.com/jasonrute/lean-proof-recording-public. (The above link is missing the dashes.)

Jason Rute (May 21 2021 at 13:29):

Stanislas Polu said:

Maybe exposing an improved version of what Jason is doing is a good idea for Lean4? I guess basically provide a way to hook some code at each tactic execution that changes the tactic state?

Actually, in Lean4 I've been told, there is a built in way to do this sort of stuff, so one doesn't need as much hacking. See this message from Leo. I haven't had a chance to play around with it yet.

Jason Rute (May 21 2021 at 13:35):

But I agree, we already have what we need in Lean 3. As you said, the big advantages of Lean 4 are speed (and possibly more official support from MR).

Jason Rute (May 21 2021 at 13:42):

Stanislas Polu said:

I think the very best we could do, is optimize the generation of these datasets, converging toward a standard version we all like and automatize their generation as part of Lean's CI pipeline, to make them one download away from any ML researcher + having a simple versatile gym like environment.

Do you think we can come up with a format that "we all like"? That would be great, but I'm a bit skeptical. The current solutions out there are so different from each other that it is hard to know if we can all agree on a single environment. I however totally agree that we can come up with a good proof-of-concept/MVP. I'd also love more collaboration between research labs on data formats and environments. (Maybe that is already going on behind the scenes, but it doesn't seem like it to me.)

Jason Rute (May 21 2021 at 14:06):

Oh, I missed the "converging towards". So the "we all like" is in the limit. That I can get behind.

Stanislas Polu (May 21 2021 at 16:30):

Well all the existing datasets predate the uprise of the Transformer. I’m ready to bet that they’ll all look the same in the coming years ;-)

Jason Rute (May 21 2021 at 17:10):

I'm not sure the Transformer solves premise selection. Metamath GPT-f uses a Transformer to dream up the statement of premises, and uses a separate library search to match that statement to an available premise. That is a fairly specialized feature which is harder to implement in other provers like Lean. Lean GPT-f memorizes all the premises, which works well, but doesn't exactly scale to new projects. I think other future systems may well still use a query-key lookup like HOList (likely where a Transformer makes the keys), or similarly a cross attention like Perceiver. But I guess adding in support for listing all the premises in the environment (in an efficient way where one can cache premise embeddings) is going to be a fairly standard request. (I could also imagine going further and doing cross attention over all the proof data. This is essentially the TacticToe and Tactician approach. There, they search through all the proofs for the ones which best match the current situation. [They use k-NN instead of attention.] The analogy is looking up the relevant literature instead of just memorizing it all. I could imagine this reducing the model size and making it more scalable and adaptable to changes in the library. I could also imaging this sort of thing is already being worked on in regards to information retrieval from documents corpuses in NLP.)

Also, I'm not convinced that pretty printed text representations are always the way to go for tree/graph structured data. It is definitely easy to build a system using those, and it allows very powerful and flexible pre-training, but one is throwing away a lot of potentially useful data in the process.

Anyway, I guess I don't know if we've all settled on the same approach, or if there are still a lot of new directions to explore. But my bet is that we can't accurately predict what other labs are going to do and what data/representations they are going to use, especially if they are willing to hack the prover system to get the data they want. Of course, that doesn't stop us from building something right now.

Stanislas Polu (May 21 2021 at 17:13):

Agreed that listing all premises to pre-cache embeddings will be a feature request :+1:

Stanislas Polu (May 21 2021 at 17:14):

But my bet remains that moving forward there wont'be lot of interest for S-expressions :p

Stanislas Polu (May 21 2021 at 17:14):

(but that's just a bet)

Brando Miranda (May 21 2021 at 22:40):

Jason Rute said:

Aidan Swope also to be fair to Brando, Coq has right now the most benchmarks (at least by different groups). There is ProverBot9001, Tactician, and CoqGym (ASTactic), as well as the CoqHammer benchmarks.

Hi Jason! I didn't mean to sound like a criticism to Lean or HOList or that we should use Coq. I was genuinely curious how people were choosing the ITP language for ML. Though I did like Mario's train of thought - something deliberate perhaps for ML would be nice (or a perspective paper that argues the case for one would be nice for us non ITP experts).

Jason Rute (May 22 2021 at 19:30):

@Mario Carneiro @Brando Miranda I tried to specify my thoughts on what such an interface would look like between an ITP and an ML agent. There are a lot of considerations and I hopefully made the issues more clear, even if not providing a specific template. I'd also be curious on thoughts from other ML researchers, like @Stanislas Polu, @Jesse Michael Han, @Yuhuai Tony Wu, @Aidan Swope. (There are many more researchers which sometimes visit here, but I don't want to spam them.) Also, I'm making the assumption that the ML agent is a separate system from the theorem prover, so it doesn't cover advanced tactics which are also a form of AI (and a big part of Daniel Selsam's work).

Brando Miranda (May 24 2021 at 16:08):

Looks great! I will take a look at it. The page says secret at the top so I assume I should not share it with anyone? (btw thanks for putting the effort to do that!)

Jason Rute (May 24 2021 at 16:41):

Feel free to share. I usually leave my gists private, since I guess I don't feel a need to spam the world with my thoughts unless others think they are helpful.

Aidan Swope (May 25 2021 at 00:53):

Jason Rute said:

Mario Carneiro Brando Miranda I tried to specify my thoughts on what such an interface would look like between an ITP and an ML agent. There are a lot of considerations and I hopefully made the issues more clear, even if not providing a specific template. I'd also be curious on thoughts from other ML researchers, like Stanislas Polu, Jesse Michael Han, Yuhuai Tony Wu, Aidan Swope. (There are many more researchers which sometimes visit here, but I don't want to spam them.) Also, I'm making the assumption that the ML agent is a separate system from the theorem prover, so it doesn't cover advanced tactics which are also a form of AI (and a big part of Daniel Selsam's work).

This looks good! A couple of comments:

Communication architecture
I would bias towards ease of parallelization as much as possible, as long as that doesn't unduly sacrifice efficiency. In my mind, the ideal system consists of a pool of (GPU-backed) stateless ML model workers, a pool of (CPU-backed) stateless environment / ITP workers, and one stateful environment + tree search worker per theorem. I know that my group (and I suspect others) would be willing to throw a lot of parallelism at the problem.

That scalability depends on being able to efficiently serialize proof state and treat the ITP as a stateless "(proof state, action) --> next proof state" function. Is replaying the entire previous proof currently the only way to do this? Would it be possible to have the system "blindly trust" a given proof state, without needing to re-run the proof steps that constructed it?

Data format
I think the way to go would be a structured data format with all the gory details available (say, protobuf with all namespaces fully-qualified), plus a library for efficiently pretty-printing samples. It would be a shame for the data format to rule out all approaches other than text-to-text transformers. Even those kinds of models have reasons to care about structured data -- in my work we parse the HOL Light S-expressions to provide a more compressed tokenization for the model than plain BPE would allow. Additionally, some recent work has cleverly processed structured data with transformers in a way that pretty-printed data wouldn't allow.

Mario Carneiro (May 25 2021 at 02:04):

Aidan Swope said:

That scalability depends on being able to efficiently serialize proof state and treat the ITP as a stateless "(proof state, action) --> next proof state" function. Is replaying the entire previous proof currently the only way to do this? Would it be possible to have the system "blindly trust" a given proof state, without needing to re-run the proof steps that constructed it?

From the ITP side, this might be tricky to support, since it has to simultaneously be able to access and switch between several proof states. This could be accomplished if proof states are pure functional data structures which are allocated and given identifiers as part of the communication protocol, although with an approach like that it's not clear when it is safe to reclaim proof state objects.

That is to say, the protocol would be something like this:

The ITP maintains a map proof_data(id: ProofStateID) -> ProofStateData that can be queried by the ML system to get information about proof states.
You can call a function apply(old_state: ProofStateID, a: Action) -> ProofStateID to apply a tactic or proof operation (of some description) to any allocated proof state to produce a new proof state; the ITP returns the ID of the proof state and you can use that ID to query more information about it or apply more operations from that starting point.
Since the ITP doesn't know when the ML agent is done with a given proof state, it can't ever reclaim things, so perhaps we need a discard(id: ProofStateID) function to be able to tell the ITP that it can safely invalidate the proof state ID, and future uses of the proof_data or apply functions on that ID will fail. (It could reclaim the ID itself, but this is probably not necessary, and not reclaiming the ID means additional protection if the ML driver is sloppy.)

Mario Carneiro (May 25 2021 at 02:06):

Note that with this version, the ITP doesn't have to blindly trust any proof states, because it is responsible for maintaining the proof data map and can ensure that only validly produced states have IDs associated to them.

Jason Rute (May 25 2021 at 11:31):

As for treating the proof state as stateless (or not), as I emphasis in that document, there are two states to think about:

The environment
The immediate proof state

Jason Rute (May 25 2021 at 11:32):

The latter will depend on the former. In HOList, one can't use a definition in a proof state or apply a theorem (inside SIMP_TAC say) to a proof state unless it is already loaded in the environment. In Lean, one can't use notation, definitions, theorems, tactics unless they are in the environment. (And the behavior of some tactics like simp depending on the environment.)

Jason Rute (May 25 2021 at 11:32):

Serializing the environment and reloading it from scratch every time is likely not feasible for most any prover. However, there may be ways to preload the environment and identify the parts needed for each theorem. In HOList they load all theorems of the environment, but another agent (not the ITP agent) is careful to not use theorems from the future to prove a theorem. This is more doable in LFC style theorem proving such as the HOL family of provers. In Lean, we have a hack where we can load the environment associated with a theorem name, but IIRC it has an initialization cost so you don't want to do it on every proof step.

Jason Rute (May 25 2021 at 11:32):

Now, as for serializing the intermediate proof state, this depends a lot on the system. In HOL-Light, they have observed that modulo the environment, a proof state is just a list of sequents [HYP1, HYP2, ...] |- GOAL and therefore easy to serialize and load. Also, each sequent is independent of the others (unlike Lean) so you just load one sequent at a time. In Lean it is more complicated. Technically the proof state (the tactic state) is a complicated beast containing the environment (already discussed), sequents, extra metavariables, user settings, and a few other things. We worked out that one can probably get by with just serializing the list of sequents. There are a few caveats: This is a not small amount of information to serialize and pass around. One has to replace all the metavariables with fresh metavariables when loading which is some additional work. Any macros can't be serialized, so not all proof states can be serialized. A few tactics change the environment. We don't serialize the environment so those changes are lost. We don't (and likely can't) deserialize the intermediate term proof generated so far. This also means we lose strong guarantees of correctness. We are carefully checking each step, but information might get corrupted subtly in each serialization/deserialization of the proof state.

Jason Rute (May 25 2021 at 11:32):

What @Mario Carneiro proposes is easier for most ITPs including Lean and is likely a more practical model for other ITPs. I believe it is what Stan is proposing for a Lean server. Still @Mario Carneiro, one still has to deal with the environment unless certain environment check points can also be turned into an id as well which can be quickly loaded. (As I said, in Lean, it can be loaded by declaration name, but I don't know how quickly.)

Jason Rute (May 25 2021 at 11:32):

@Aidan Swope Do you think you could make such a system using proof ids work and still get the parallelization you need/want?

Jason Rute (May 25 2021 at 11:57):

As for data structure, if the information is coming straight from the theorem prover, like a goal state, it is fairly easy to give a fully elaborated s-expression, JSON, or protobuf as well as pretty printed version. (Of course, the more information provided, the larger the data being passed will be, unless there are settings to turn some things on or off. And with Lean at least, sometimes these terms can be huge. Also, it isn't always obvious how to match the pretty printed expression to its raw form, e.g. [0, 1, 2] verse @list.cons.{0} nat (@has_zero.zero.{0} nat nat.has_zero) (@list.cons.{0} nat (@has_one.one.{0} nat nat.has_one) (@list.cons.{0} nat (@bit0.{0} nat nat.has_add (@has_one.one.{0} nat nat.has_one)) so I don't know how useful this would be to a tokenizer.)

A more difficult thing to do is parse the training data when that data is coming from human written code, especially Lean tactics which each come with their own parsing syntax. For example, we could theoretically parse lean tactics down to their components, and we have the code to do it, but they parse in a funny way which looks very different from how they are written. The tactic rw [nat.add_comm, ←add_assoc] looks a lot different when fully parsed:

tactic.interactive.rw
  {rules := [{pos := {line := 8, column := 6}, symm := ff, rule := ``(nat.add_comm)},
                     {pos := {line := 8, column := 20}, symm := tt, rule := ``(add_assoc)}],
   end_pos := some {line := 8, column := 30}}
  (interactive.loc.ns [none])

Mario Carneiro (May 25 2021 at 16:38):

Jason Rute said:

What Mario Carneiro proposes is easier for most ITPs including Lean and is likely a more practical model for other ITPs. I believe it is what Stan is proposing for a Lean server. Still Mario Carneiro, one still has to deal with the environment unless certain environment check points can also be turned into an id as well which can be quickly loaded. (As I said, in Lean, it can be loaded by declaration name, but I don't know how quickly.)

I didn't mention the environment in that protocol, and I was envisioning it to be entirely proof-local. I am less sure about how to perform environment changes, and they are much less likely to be parallelizable short of running multiple copies of the ITP, although some ITPs (like lean) might be able to handle multiple environments simultaneously. Here's a sketch:

load_file(file: String): Initialize the environment to the end of a given source file
load_file_upto(file: String, thm: String): Initialize the environment to just before a given theorem in a given source file
add_theorem(name: String, id: ProofStateId): Assuming id is a finished proof state, commit it to the environment
add_sorry_theorem(name: String, sig: ThmSignature): Commit a theorem with the given signature to the environment without a proof
new_theorem(sig: ThmSignature) -> ProofStateId: start a new proof in the current environment, receiving the initial proof state

All of these commands invalidate and clear the proof state map, and there are no IDs for navigation here because the ITP maintains the environment statefully. The only way to jump to a random other location is to use load_file_upto which might be expensive since it runs all dependent files (although the ITP might be able to share work here with previous calls). There might also be additional add_ functions for other kinds of environment items.

Aidan Swope (May 25 2021 at 19:24):

Jason Rute said:

Aidan Swope Do you think you could make such a system using proof ids work and still get the parallelization you need/want?

I could see a parallel ID-based system working with some strategy to keep multiple environment workers synchronized. For example, the ML model might try several steps in parallel with different workers, and then re-run any successful steps on all workers in the pool to keep the state updated. This likely captures a lot of the benefit of parallelism without needing to serialize proof state and reload the environment to a point at every step.

Aidan Swope (May 25 2021 at 19:28):

Jason Rute said:

As for data structure, if the information is coming straight from the theorem prover, like a goal state, it is fairly easy to give a fully elaborated s-expression, JSON, or protobuf as well as pretty printed version. (Of course, the more information provided, the larger the data being passed will be, unless there are settings to turn some things on or off. And with Lean at least, sometimes these terms can be huge. Also, it isn't always obvious how to match the pretty printed expression to its raw form, e.g. [0, 1, 2] verse @list.cons.{0} nat (@has_zero.zero.{0} nat nat.has_zero) (@list.cons.{0} nat (@has_one.one.{0} nat nat.has_one) (@list.cons.{0} nat (@bit0.{0} nat nat.has_add (@has_one.one.{0} nat nat.has_one)) so I don't know how useful this would be to a tokenizer.)

A more difficult thing to do is parse the training data when that data is coming from human written code, especially Lean tactics which each come with their own parsing syntax. For example, we could theoretically parse lean tactics down to their components, and we have the code to do it, but they parse in a funny way which looks very different from how they are written. The tactic rw [nat.add_comm, ←add_assoc] looks a lot different when fully parsed:
tactic.interactive.rw
  {rules := [{pos := {line := 8, column := 6}, symm := ff, rule := ``(nat.add_comm)},
                     {pos := {line := 8, column := 20}, symm := tt, rule := ``(add_assoc)}],
   end_pos := some {line := 8, column := 30}}
  (interactive.loc.ns [none])

It'd require more effort from the interface developer, but the nicest solution for the ML researcher would probably be a parameter (either set at the start or per-request) determining whether to return fully-elaborated responses or pretty-printed ones.

Mario Carneiro (May 25 2021 at 19:28):

@Aidan Swope How important is parallelism across multiple proof states for different theorems / different environments as opposed to multiple proof states for subgoals in the tree search for a single theorem?

Aidan Swope (May 25 2021 at 19:31):

I personally expect the IMO challenge (and similarly hard problems) will require massive parallelism on a single problem. As far as simpler benchmarks, within-problem parallelism may be just a nice-to-have. But, given that the tree search is likely very fast and the ML model is very parallel, it would seem to be a shame to introduce a slow, CPU-bound, sequential bottleneck on each problem.

Mario Carneiro (May 25 2021 at 19:33):

Do you think that the workaround of having multiple ITP instances running is viable? It would probably require a lot more memory

Mario Carneiro (May 25 2021 at 19:37):

The interface I gave is inherently serial wrt environment changes, but if you added EnvId similar to ProofStateId for environment changes the ITP could at least in principle support parallel environment access, and then a serial ITP would just have an expensive switching process if you use the wrong environment ID

Aidan Swope (May 25 2021 at 19:38):

Yes, it's probably viable if synchronizing state across multiple ITP instances isn't too hard. I don't think memory will be too much of a bottleneck if a CPU server farm is available. How important are environment changes within a single problem? I was assuming the environment was mostly fixed during a given proof.

Mario Carneiro (May 25 2021 at 19:43):

As I'm conceptualizing it, "within a proof" is synonymous with "fixed environment / theorem statement", although as Jason points out this isn't quite true for lean, which is capable of running tactics that add definitions to the environment. (Note that since architecturally a lean proof can't actually change the environment, what really happens is that the modifications to the environment are discarded when the proof is complete, and any definitions that are created are inlined during proof post-processing.)

Aidan Swope (May 25 2021 at 19:45):

In that case (assuming it's safe to ignore environment-modifying tactics) an EnvId likely isn't necessary and the ITP can just load the environment once at the start of a proof.

Mario Carneiro (May 25 2021 at 19:47):

In lean it is pretty safe to ignore environment changing tactics; most tactics users use directly don't do this (although @Jason Rute might have some counterexamples)

Mario Carneiro (May 25 2021 at 19:58):

Here's an example of an environment changing tactic (actually a term constructor) that comes up:

example : true :=
begin
  try { let : ℕ × ℕ → ℕ := λ ⟨x, y⟩, 0, revert this, done },
  try { let : ℕ × ℕ → ℕ := λ ⟨x, y⟩, 0, revert this, done },
  trivial
end

The λ ⟨x, y⟩, _syntax, as well as any use of match, will create an auxiliary definition so it can use the equation compiler, which only works on top level defs. If you look at the intermediate states you will notice that the first one creates _example._match_1 and the second one creates _example._match_2, even though the state was rewound after the first let. So this can create some nondeterminism in the names that tactics produce but at least it doesn't spuriously fail after creating _example._match_1 twice.

Aidan Swope (May 25 2021 at 20:04):

Since we're talking about an interface for ML agents to communicate with Lean: what is the current best way of doing this? Is it @Jesse Michael Han 's lean-tpe?

Mario Carneiro (May 25 2021 at 20:07):

Assuming that this is the interface that GPT-f uses (as it appears to be), I would say that's probably the best option that currently exists. I don't think lean C++ has an IPC protocol

Jesse Michael Han (May 25 2021 at 20:58):

currently lean-gptf and lean-tpe assume the proof search is driven from Lean; @Stanislas Polu is currently adapting the code to create a REPL-like interface which maintains a similar protocol to the one described by Mario

Jason Rute (May 25 2021 at 21:02):

Also, @Stanislas Polu is cleaning up the dataset generation tools we used for lean-step to make them more accessible to other researchers. If you are interested, we can ping you when these are ready.

Aidan Swope (May 25 2021 at 21:52):

Please do, thank you!

Stanislas Polu (May 26 2021 at 05:34):

:+1:I plan to release the REPL this week. It’s still a prototype but is already usable.

Stanislas Polu (Jun 01 2021 at 09:11):

We've just open-sourced our new REPL (based on @Jesse Michael Han's lean-gptf) for Lean 3 loosely following @Mario Carneiro's proposed interface. It exposes a JSON interface easy to integrate from your language of choice:

$ lean --run src/repl.lean

["init_search", ["int.prime.dvd_mul", ""]]
{"error":null,"search_id":"0","tactic_state":"⊢ ∀ {m n : ℤ} {p : ℕ}, nat.prime p → ↑p ∣ m * n → p ∣ m.nat_abs ∨ p ∣ n.nat_abs","tactic_state_id":"0"}

["run_tac",["0","0","intros"]]
{"error":null,"search_id":"0","tactic_state":"m n : ℤ,\tp : ℕ,\thp : nat.prime p,\th : ↑p ∣ m * n\t⊢ p ∣ m.nat_abs ∨ p ∣ n.nat_abs","tactic_state_id":"1"}

["run_tac",["0","1","apply (nat.prime.dvd_mul hp).mp"]]
{"error":null,"search_id":"0","tactic_state":"m n : ℤ,\tp : ℕ,\thp : nat.prime p,\th : ↑p ∣ m * n\t⊢ p ∣ m.nat_abs * n.nat_abs","tactic_state_id":"2"}

["run_tac",["0","2","rw ← int.nat_abs_mul"]]
{"error":null,"search_id":"0","tactic_state":"m n : ℤ,\tp : ℕ,\thp : nat.prime p,\th : ↑p ∣ m * n\t⊢ p ∣ (m * n).nat_abs","tactic_state_id":"3"}

["run_tac",["0","3","simp"]]
{"error":"run_tac_failed: pos=(some ⟨1, 2⟩) msg=simplify tactic failed to simplify","search_id":null,"tactic_state":null,"tactic_state_id":null}

["run_tac",["0","3","exact int.coe_nat_dvd_left.mp h"]]
{"error":null,"search_id":"0","tactic_state":"no goals","tactic_state_id":"4"}

It smoothly handles and return tactic application errors and allows multiplexing of proof searches on a single REPL (interleaving of run_tac and init_search commands). It is still subject to a few crashes, but empirically they happen less than 0.01% of the time, which is within bound for a production use at scale. We've successfully run pools of 128+ REPLs each multiplexing 64+ searches in parallel.

Hopefully, this will help make Lean 3 (and surely 4 very soon) a prime choice for machine learning research.

The project is hosted here: https://github.com/openai/lean-gym -- Contributions and feedback welcome!

Stanislas Polu (Jun 01 2021 at 09:14):

(@Mario Carneiro @Daniel Selsam hope it'll be informative and useful for the development of Lean 4!)

Stanislas Polu (Jun 01 2021 at 09:16):

(Also hope it'll provide a clear answer to this topic: Lean :p)

Eric Bond (Jul 16 2021 at 23:41):

Hello.
W.r.t the transformer approach, how would you handle custom notation and syntax frequenctly used in proof developments?

Eric Bond (Jul 16 2021 at 23:42):

Also I see alot of discussion about premise selection.. but what about proof term selection?

Eric Bond (Jul 16 2021 at 23:44):

Many proofs require induction targeting a specific proof term, unfolding a proof term, or performing a partial substitution of a proof term.

Jason Rute (Jul 17 2021 at 08:44):

As far as syntax, with the gptf approach, it (theoretical) handles syntax in the training data, and also can adapt to syntax in the goal, say if that same syntax would be used in the tactic command.

Jason Rute (Jul 17 2021 at 08:45):

It currently doesn’t know about syntax not found in either location, say earlier in a new file not used for training.

Jason Rute (Jul 17 2021 at 08:47):

Also gptf can generate terms as needed for exact tactics or existentials.

Jason Rute (Jul 17 2021 at 08:48):

None of this works perfectly, but it isn’t a hard barrier like in some systems.

Last updated: May 02 2025 at 03:31 UTC