Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Learning based space formatting of proof assistant code

Karl Palmskog (Jul 08 2020 at 15:21):

Dear Lean ML community,

My collaborators and I recently presented our work-in-progress on learning-based space formatting of Coq files at the Coq Workshop (see abstract on arXiv and slides). We used a dataset that was discussed elsewhere on Lean's Zulip.

We believe similar techniques can easily be applicable to Lean files (especially Lean 4, through its accessible ASTs). Our main limitation right now that we don't handle hierarchical formatting of proof scripts. We would welcome any feedback from the Lean ML community on this, in particular how one can define features to expose formatting/indenting based on the proof goal structure.

The best idea I have right now is that one could attach a tuple or list of integers, e.g., "(2, 3)" or "2 3", to tokens inside a tactic expression that is operating on the 3rd subgoal of the 2nd subgoal. But there is likely a general way to handle this, perhaps even across proof assistants like Lean and Coq.

Gabriel Ebner (Jul 08 2020 at 15:29):

It is not customary to indent tactic scripts in Lean based on the goal structure. If you have multiple goals, then it is common practice to put each of the subgoal into a { ... } block (which ensures that the goal is solved).

example : A ∧ B :=
begin
 split,
 { unfold A, ring },
 { simp only [c, d, e],
   intro n,
   induction n; simp * },
end

Karl Palmskog (Jul 08 2020 at 15:31):

but you can have nested { ... } though, right? Doesn't that bring "implicit" hierarchical indentation?

Gabriel Ebner (Jul 08 2020 at 15:31):

Right, but the indentation should always follow the {} blocks.

Gabriel Ebner (Jul 08 2020 at 15:32):

Maybe I misunderstood your question. I thought this shouldn't be a problem because it is syntactically obvious (and doesn't require any knowledge of the tactic states).

Karl Palmskog (Jul 08 2020 at 15:38):

so Lean itself enforces that the meaning of the following is always different? (In Coq similar sequences could have the same meaning depending on options)

{  ring, /- subgoal with one subgoal -/
   { ring } }

vs.

{ ring ,  /- two subgoals at the outset -/
  ring }

Mario Carneiro (Jul 08 2020 at 15:50):

what does /- two subgoals at the top -/ mean? After a { there is always exactly one goal. If the result of running ring produces two goals then it is conventional to put ring, { proof1 }, { proof2 } after it, otherwise (if there is only one goal) we just write ring, proof

Karl Palmskog (Jul 08 2020 at 16:01):

If the result of running ring produces two goals then it is conventional to put ring, { proof1 }, { proof2 } after it

But this is not mandated by Lean itself, right? You would have to use a linter or similar to enforce this? One goal of our work was to learn (in lieu of specifying precisely) when subgoals should be newlined, e.g., based on discussion here one could either write:

ring, { ring }, { ring }

ring,
{ ring },
{ ring }

But I guess you might be already be using linting to get the latter?

Johan Commelin (Jul 08 2020 at 16:07):

No, we currently have no linter for checking style.

Johan Commelin (Jul 08 2020 at 16:08):

Usually, { foo, bar, baz } would go on a new line, but if it's really short, sometimes we put it after the tactic that created the multiple goals.

Johan Commelin (Jul 08 2020 at 16:09):

Example:

induction n, { simp }, -- the case n = 0 is short enough to deal with on this line
-- now there is only one goal left. We sometimes omit the {}
tactic_foo,
bar,
rw IH,
simp

Karl Palmskog (Jul 08 2020 at 16:13):

thanks, that's very helpful. I guess the only way to figure out if adding the proof goal structure as a feature will actually help Lean formatting is to do a machine learning eval, then. It might go either way here, but for Coq it was very clear that we were missing a lot of indents/newlines

Johan Commelin (Jul 08 2020 at 16:13):

Of course we could slightly change our style guide, if this means we get an auto-formatter (-;

Karl Palmskog (Jul 08 2020 at 16:20):

to give you an idea of the current status:

here is the suggested formatting we got based only on observing Coq token streams, and
here is the manually formatted file.

Unfortunately, tuning is lacking for things like getting the Qed newlines right.

Since our team doesn't have Lean competence, we are open to working with a Lean (ML) collaborator to do something similar for Lean. In particular, the main limitation may be that we rely on independently readable code with spacing information to learn from (which is very hard to obtain for Lean 3).

Alex J. Best (Jul 08 2020 at 16:20):

Yeah there are a few things it would be interesting to see if ML would learn, like work_on_goals and all_goals tactics, which are an exception to the "rules".

Mario Carneiro (Jul 08 2020 at 16:23):

In particular, the main limitation may be that we we rely on independently readable code with spacing information to learn from (which is very hard to obtain for Lean 3).

I'm not sure I understand. Why doesn't mathlib satisfy this requirement?

Karl Palmskog (Jul 08 2020 at 16:26):

From the dataset topic I referenced at the initial post in this topic:
Jason Rute said:

Karl Palmskog said:

Coq source files are known to be near-impossible to lex/parse accurately by tools external to Coq.

This is definitely the state we are in with Lean 3. I've written some tools, and am writing more, to work with Lean programmatically, but definitely it is not possible to parse a Lean file by itself without recreating Lean. (Lean has an export format, but it throws too much away for machine learning. It's only good for checking proofs with external checkers.) I know Lean 4 is supposed to have an intermediate AST layer. I don't know if that will be easy to access or work with. I've been told it is, but I haven't seen any examples proving it.

Karl Palmskog (Jul 08 2020 at 16:28):

@Mario Carneiro if there was an independent format of .lean files with their tokens and their locations, we could definitely do it. If there is interest I can give very specific examples of what data we use from serialized Coq code.

Mario Carneiro (Jul 08 2020 at 16:28):

What tool does this for coq?

Gabriel Ebner (Jul 08 2020 at 16:28):

Looking at the two Coq files, Lean doesn't have the same issue with indentation in the the tactic scripts. Tactics should be indented if and only if they are in a {} block. But the end still needs to be on a new line with the same indentation as the begin.

Mario Carneiro (Jul 08 2020 at 16:29):

I think lean tactic fomatting should be about as simple as formatting languages like java, once you get the relevant parsing info

Karl Palmskog (Jul 08 2020 at 16:29):

We currently use SerAPI to serialize Coq files, to obtain both ASTs and tokens as S-expressions (it uses OCaml PPX metaprogramming): https://github.com/ejgallego/coq-serapi

Karl Palmskog (Jul 08 2020 at 16:30):

Java was one of the first languages to get a learning-based formatter: http://groups.inf.ed.ac.uk/naturalize/

Mario Carneiro (Jul 08 2020 at 16:32):

the only thing that might require more intelligence in tactic formatting is determining when to put a newline after a tactic, i.e.

  tac1,
  tac2,

  tac1, tac2,

Mario Carneiro (Jul 08 2020 at 16:34):

Relevant correlates might include the author, the available line length, as well as the identity of the tactics / whether they "form a unit"

Mario Carneiro (Jul 08 2020 at 16:34):

which might involve looking at the tactic states

Mario Carneiro (Jul 08 2020 at 16:35):

however, a decent number of users just put line breaks after every tactic and that's not an unreasonable style (though it is a bit vertically verbose)

Johan Commelin (Jul 08 2020 at 16:36):

Mario Carneiro said:

Relevant correlates might include the author, the available line length, as well as the identity of the tactics / whether they "form a unit"

If you find stuff that goes on till column 235, it's a safe bet that it was written by Scott.

Jason Rute (Jul 09 2020 at 02:41):

@Karl Palmskog what specifically do you need from Lean to do this? The lean server can provide some information like which tactic state a character is associated with, but in my experience it is slow and it is not always aligned correctly. Also it can’t handle hierarchical states like tactics inside tactics. Alternately, one can modify the interactive lean code (which governs the behavior of tactics) to record positions. If I recall, you might be able to record the starting and ending positions of tactics using that. This is more accurate but involves a bit of hacking (just the Lean, not the C++). There are probably other approaches as well.

Last updated: May 02 2025 at 03:31 UTC