Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Coq machine learning dataset based on MathComp 1.9.0

Karl Palmskog (Jun 28 2020 at 21:43):

Dear Lean ML community,

My colleagues and I have recently released a dataset for use in machine learning, based on Coq 8.10.2 and Mathematical Components version 1.9.0. It is based on 449 source files from 21 Coq projects, which are in total over 297k lines of code, with projects divided into tiers based on quality. Each source file has two machine-readable files which are lists of S-expressions: one file with Coq tokens and one file with Coq abstract syntax trees. Each token/AST contains detailed information on its source file provenance. These representations were obtained directly from Coq's implementation via (OCaml PPX) metaprogramming. We also provide raw and processed ("chopped") Coq kernel terms as S-expressions for each lemma statement (proof terms were not our focus in this work).

So far, we have used the dataset for deep generation of lemma names, as discussed here, and for suggesting spacing. We would be interested in if similar datasets/techniques can be done for Lean (3 or 4).

For precision, our work uses ML for proof assistant user productivity and not for proof automation (although the dataset might be useful for proof automation by others).

Jason Rute (Jun 28 2020 at 23:03):

I’ll be sure to take a look!

Jason Rute (Jun 28 2020 at 23:07):

I have a superficial question. This is at least the fourth dataset for machine learning for Coq (after GamePad, CoqGym, ProverBot, and TacticToe for Coq). Do you all use the same tools to mine your data or are they all different?

Karl Palmskog (Jun 28 2020 at 23:09):

to my knowledge, none of the other datasets have machine-readable source files. We did lots of engineering work together with a Coq developer to properly serialize all the tokens and the ASTs and the kernel terms. AST serialization has even been roundtrip tested - Coq can be made to process the serialized ASTs instead of source files.

Karl Palmskog (Jun 28 2020 at 23:13):

I believe CoqGym and ProverBot also use a previous version of the library (SerAPI) we used for serialization, but they use it to interact directly with Coq and produce their own intermediate data (e.g., capturing the proof context before and after executing a tactic). Based on my understanding of GamePad, they hacked Coq 8.6 and produced their own ad-hoc tactic tracer which is obsolete now. TacticToe for Coq is based on a homegrown plugin which hooks directly into Coq's internals. Our ML toolchain is completely independent of Coq for training/generation, since the dataset doesn't need Coq to be processed. The Coq (de)serialization pipeline is actively maintained and works, e.g., for Coq 8.11.

Karl Palmskog (Jun 28 2020 at 23:29):

Also a note about data quality: this is a curated set of projects which we divided into tiers based on discussion with MathComp devs. ProverBot's dataset comes from a single project (CompCert), and CoqGym selected projects from Coq's OPAM index without any curation other than that they could be built easily with the Coq version they were using. TacticToe for Coq seemingly only uses Coq's stdlib, which is acknowledged by Coq devs as being of variable quality overall.

Jason Rute (Jun 28 2020 at 23:49):

That is a very helpful answer!

Jason Rute (Jun 28 2020 at 23:53):

Karl Palmskog said:

Coq source files are known to be near-impossible to lex/parse accurately by tools external to Coq.

This is definitely the state we are in with Lean 3. I've written some tools, and am writing more, to work with Lean programmatically, but definitely it is not possible to parse a Lean file by itself without recreating Lean. (Lean has an export format, but it throws too much away for machine learning. It's only good for checking proofs with external checkers.) I know Lean 4 is supposed to have an intermediate AST layer. I don't know if that will be easy to access or work with. I've been told it is, but I haven't seen any examples proving it.

Karl Palmskog (Jun 28 2020 at 23:59):

Jason Rute said:

I know Lean 4 is supposed to have an intermediate AST layer.

Right, I think something like this has to be available for viability of ML-for-productivity (which tends to use ASTs a lot). The "trick" used in SerAPI is to introspect on the OCaml data structures used to implement the Coq source AST. Coq does have abstract term-level definitions/structures in MetaCoq, but the interesting stuff usually happens long before a proof becomes a term.

Jason Rute (Jun 29 2020 at 00:04):

Just so I am clear, what is ML-for-productivity? Does "productivity" here mean cleaning up code by suggesting variable names, spacing etc? And I assume the ML is because it is hard or impossible to formalize all this stuff into a perfect linter, so it helps to learn some of it automatically? Is this the idea of the term and your work?

Karl Palmskog (Jun 29 2020 at 00:07):

I would define it along those lines, yes, basically we work on tools/techniques to assist proof assistant users that are not directly related to proof automation. The general ideas come from software engineering and NLP, and they also consider other areas than just coding conventions: https://ml4code.github.io

Jason Rute (Jun 29 2020 at 00:09):

Also, the question I have for every ML for TP project, is this a usable product?. Most projects are research and information flows out of the theorem prover, but can't be "put back in" in a reasonable way. In other words, a user of that theorem prover couldn't use it without a lot of work.

Jason Rute (Jun 29 2020 at 00:12):

Sorry, my thoughts are bouncing around a lot. Also, I'd be curious if Lean's extremely flexible custom notation and custom parsing would make it difficult to do something like ML-for-productivity. Maybe not, since even though a user could go crazy with notation, they usually pick something reasonable, and especially the ML part would conform to how notation is usually created and used. It wouldn't have to cover every corner case.

Karl Palmskog (Jun 29 2020 at 00:14):

Jason Rute said:

Most projects are research and information flows out of the theorem prover, but can't be "put back in" in a reasonable way.

Currently, the output format in our name generation tool is a simple list of current/suggested lemma name pairs. However, my ML co-author has been able to replace lemma names directly in source files, since we have exact location information via SerAPI. It will be based on community feedback if we implement this in a convenient interface. For example, a user may pinpoint files or whole projects and get lemma names (or spacing) instantly replaced with top-1 ML suggestions.

Karl Palmskog (Jun 29 2020 at 00:19):

Jason Rute said:

Also, I'd be curious if Lean's extremely flexible custom notation and custom parsing would make it difficult to do something like ML-for-productivity. Maybe not, since even though a user could go crazy with notation, they usually pick something reasonable, and especially the ML part would conform to how notation is usually created and used. It wouldn't have to cover every corner case.

Coq also has flexible notations, and it was a major challenge to handle this (that's how we ended up extending the SerAPI library with a lot of functionality and three CLI tools). In our ablation studies on lemma naming, it's very clear that having elaborated terms as input helped a lot, since a lot of information used by Coq users for naming is hidden by notations.

Jason Rute (Jun 29 2020 at 00:25):

As for useability, what you have actually seems fairly usable. Maybe this is more of an issue with automatic theorem proving, but most such projects could not be used to help with proof creation in new development without basically rebuilding the whole project in OCaml.

Also, many projects are tested and trained on data from the same file (and used a fixed vocabulary, which can't handle a new definition). While that might be fine to show that a NN can learn theorem proving, it prompts the question: What would happen if you used this trained agent on a new project? Would you have to retrain it? If so, would that be practical or useful? Actually, that is a good question for you: Can your tool suggest new lemma names for new projects without retraining or does its performance significantly degenerate?

Jason Rute (Jun 29 2020 at 00:26):

Sorry. I used "project" in two different ways. That last sentence is referring to new coq projects.

Karl Palmskog (Jun 29 2020 at 00:37):

Jason Rute said:

What would happen if you used this trained agent on a new project? Would you have to retrain it? If so, would that be practical or useful? Actually, that is a good question for you: Can your tool suggest new lemma names for new projects without retraining or doesn't its performance significantly degenerate?

For our lemma name generation work, we did try our tool on a project that was not in our training/validation sets and investigated generalizability. Firstly, we did a manual evaluation where we sent a bunch of suggestions to the project maintainer and asked him to rate them :thumbs_up: / :thumbs_down: . This is described starting on p. 13 in the paper. Around 24% of (top-1) suggestions were rated as good/adequate. Secondly, we did a generalizability case study (p. 30) to automatically measure how well our tool works on a never-seen project and how performance increases as training data from that project is added.

The tool is obviously only as good as the training/validation sets can make it, so as naming conventions evolve, one will have to retrain (for our best model trained on four core projects, this takes about a day of machine time). There may not always be a connection between a good score on a testing set and good names in practice.

Jason Rute (Jun 29 2020 at 00:42):

I’ll have to read the paper! It sounds well thought out. (And I haven’t even asked about the ML ideas yet.) I also hope more people start posting their papers here!

Karl Palmskog (Jun 29 2020 at 00:44):

As far as I know (I mainly provided the Coq expertise for this work, my ML knowledge is limited), the models and evaluation approach are fairly standard in the context of NLP / machine translation, which seems distinct from ML proof automation. One of the senior co-authors does research in linguistics.

Jason Rute (Jun 29 2020 at 00:49):

It seems in the last few months there have been a lot of papers showing that modern NLP tools (transformers) are capable of a lot of simple reasoning tasks. I could imagine that we will see more overlap between ML for productivity and ML for theorem proving and ML for code generation.

Karl Palmskog (Jun 29 2020 at 09:52):

Looks like there is some transformer stuff in this paper using Isabelle: https://arxiv.org/abs/2006.09265

Jason Rute (Jun 29 2020 at 10:44):

Yeah, I’ve been meaning to write about that one.

Karl Palmskog (Jul 03 2020 at 18:11):

My co-author presented the paper today at virtual IJCAR: https://youtu.be/Ysd-dcizw1A?list=PLl1dj5prwUJz6KV4Q1EPG9Gx0bjiKa_ey&t=3462

LEZHI HU (Dec 09 2020 at 13:28):

@Karl Palmskog hello! I am really interested in the "tree chopping" methods mentioned in your paper. However I know very little about the internal data structures of Coq, I wonder is there any documentation easy to read about this?
for example, when I use coq_serapi to input :
forall xyz:nat, xyz<3
I got output:
(Prod (Name (Id xyz)) (Ind (((Mutind (MPfile (DirPath ((Id Datatypes) (Id Init) (Id Coq)))) (DirPath ()) (Id nat)) 0) (Instance ()))) (App (Const ((Constant (MPfile (DirPath ((Id Peano) (Id Init) (Id Coq)))) (DirPath ()) (Id lt)) (Instance ()))) ((Rel 1) (App (Construct ((((Mutind (MPfile (DirPath ((Id Datatypes) (Id Init) (Id Coq)))) (DirPath ()) (Id nat)) 0) 2) (Instance ()))) ((App (Construct ((((Mutind (MPfile (DirPath ((Id Datatypes) (Id Init) (Id Coq)))) (DirPath ()) (Id nat)) 0) 2) (Instance ()))) ((App (Construct ((((Mutind (MPfile (DirPath ((Id Datatypes) (Id Init) (Id Coq)))) (DirPath ()) (Id nat)) 0) 2) (Instance ()))) ((Construct ((((Mutind (MPfile (DirPath ((Id Datatypes) (Id Init) (Id Coq)))) (DirPath ()) (Id nat)) 0) 1) (Instance ()))))))))))))
This looks so terrifying! I really wonder about the meanings of Mutind, MPfile and Rel blablabla....
So I can better understand what parts of the kernel tree is important for information extraction.
looking forward to your reply, thanks a lot!

Kevin Buzzard (Dec 09 2020 at 13:33):

Just to let you know, there is a Coq Zulip as well, and it's also pretty active; you might be more likely to get a response there. https://coq.zulipchat.com

LEZHI HU (Dec 09 2020 at 14:17):

@Kevin Buzzard although I knew that Zulip, thanks for your help!

Last updated: May 02 2025 at 03:31 UTC