Zulip Chat Archive

Stream: general

Topic: Lean code into AST/XML/JSON for parsing/generation for ML?

Alex Meyer (Mar 28 2021 at 02:03):

Hello! Is there BNF grammar for Lean code? Can I use it for generating ANTLR parser for Lean code? In that way I could parse Lean theories and convert them into graph neural networks (from AST to GNN) and use those GNN for machine learning the proof and mathematical reasoning strategies. And I can also use AST manipulation for subsequent Lean code generation. So - is such BNF avilable? Or similar interface for converting Lean theory code into programmatically manipulable structures? Or maybe there is no such grammar/APi available, but is it possible to export Lean into XML/JSON and I can manipulate those? Any hints in this direction? I have made similar question in Coq club about Coq https://sympa.inria.fr/sympa/arc/coq-club/2021-03/msg00105.html and in that message I have elaborated on the 1) incrementality of grammar; 2) incrementality of loaded sessions. I don't know how this can be applied in Lean, but if Lean has facilities to define additional grammatical/notational rules in the code then this Coq question applied to Lean as well. All in all - I understand that I am asking a much, but - maybe there is good interface for machine learning the Lean theories, theories written in Lean? Thanks, Alex.

Mario Carneiro (Mar 28 2021 at 02:33):

The short answer is no. I think Coq is in a slightly better situation because there is a Coq XML export (I think by the MMT folks, similar to the isabelle export)

Bryan Gin-ge Chen (Mar 28 2021 at 02:36):

For the long answer, search for posts by @Jason Rute , e.g. this most recent one.

Mario Carneiro (Mar 28 2021 at 02:37):

I'm thinking of https://link.springer.com/chapter/10.1007/978-3-030-23250-4_17 for Coq XML output BTW; I'm not sure where the project itself is

Roman Bars (Mar 28 2021 at 03:41):

Maybe read the paper https://arxiv.org/abs/2102.06203? They seem to do some machine learning.

Jason Rute (Mar 28 2021 at 11:39):

@Alex Meyer I’m unavailable for the next few days, but I’ll respond with some additional thoughts (and questions) in the middle of next week.

Jason Rute (Mar 30 2021 at 16:36):

@Alex Meyer All the above advice is good. One thing to keep in mind is that there are two types of data in Lean. There is the top level code including tactic proofs, human written Lean code (with notation, type class inference, and special Lean syntax). Lean takes this, parses it, and "compiles" it down to a more low level syntax, throwing away most information in the process. It is hard to get back anything resembling and AST for the human written code. Parsing raw Lean code is quite complicated. For the LeanStep dataset, we did manage to get back ASTs for the tactic proofs, but it was a lot of work and the techniques wouldn't work on other types of human written Lean code (like the human written term proofs). However, as I said Lean does "compile" this down to a low-level syntax which is stored by the running Lean process and available for introspection and export. Such information includes the statements and low-level-term-proofs of all theorems, the types and values of all definitions, and some other stuff that the Lean parser and tactic environment needs. This is stored in the environment, and can be gotten out by working in the meta-programming framework. However, it should be noted that whenever someone needs this data, they typically write their own custom code. There isn't a one-size-fits-all exporter. Besides the code snippets I shared in that other thread, a more complete exporter of this information (which we used too the PACT/GPT-f paper is here, but again that was the information we needed for our purposes.

Jason Rute (Mar 30 2021 at 17:18):

@Alex Meyer I don't really know what you mean by "parse Lean theories". When you say "Lean theory" what do you have in mind? For this GNN, do you envision the goal as being given by an AST, and then from (a GNN embedding of that AST) you can predict the tactic? If so, I'm planning on adding some support for this in the tactic recoding repo.

Jason Rute (Mar 30 2021 at 17:18):

I also have a similar question what you mean by "machine learning the proof and mathematical reasoning strategies". Do you mean predicting tactic proofs, term proofs, or something different entirely? For tactic proofs, that is exact the data gotten by the tactic recoding repo. Right now the data is exported as plain human-written text, but it could also be exported as a sort of AST. However, there are a number of subtleties there, including reading a generated AST back into Lean, as well as how to generate a tactic AST.

Jason Rute (Mar 30 2021 at 17:19):

I'm not sure how familiar you are with machine learning proofs for Coq, Lean, Isabelle, HOL-Light, Metamath, and Mizar. There are generally four approaches right now:

Covert the problem to be proved into some sort of first-order logic thing and then try to prove it with an AST. Where machine learning comes in is in predicting which theorems in the library one should use a premise to proof the goal. This is what is used in various hammer systems in Isabelle, Coq, and HOL as well as much of the ML theorem proving for Mizar.
From an encoding of the goal to prove, build the AST of the next tactic. For example, given a goal, first use an ML agent to decide which tactic to predict next, e.g. rw. Since rw takes in one or more theorems, then use a separate premise selection component to select which theorems in the library are most relevant. This is the approach in CoqGym/ASTactic, ProverBot9001, HOList/DeepHOL, and others. The problem here is that you need to have a special component for building each type of tactic argument. (This approach and all the rest, are usually combined with a tree search.)
From an encoding of the goal, find the most similar goal in the training data (or in all the goals which have come earlier in the environment) and then use that tactic command used on that goal. This approach is often very fast, but obviously the tactic found may not be the best. Often one will further modify the variables or premises in the tactic command to match the current goal better. (Again, this is combined with a tree search.) This is the approach of TacticToe and Tactician.
From an encoding of the goal, generate the tactic command as plain text (using a language model, like a transformer). Either the language model memorizes the theorems in the library and synthesizes them (like Lean GPT-f), or instead of producing a theorem name it produces the statement of the theorem name and a library search is used to look up that theorem (if it exists) in the library (like MetaMath GPT-f). The Lean and Metamath GPT-f projects fall in this camp as well as the older Holophrasm projects.

Jason Rute (Mar 30 2021 at 17:19):

There is no reason to think this is the end-all-be-all of ideas in this area, and more ideas are appreciated, but at the same time it isn't as easy as say playing around with an OpenAI gym. A lot of knowledge of the system is needed to get started, but I'd be happy to talk with you more about what has been done in Lean and what could be done.

Jason Rute (Mar 30 2021 at 17:19):

Last, check out #Machine Learning for Theorem Proving if you haven't already. Possibly this thread should be moved there.

Kevin Buzzard (Mar 30 2021 at 18:19):

Whether or not it gets moved, thank you Jason for such an informative post!

Last updated: May 02 2025 at 03:31 UTC