Zulip Chat Archive

Stream: Lean for teaching

Topic: Lean for Waterproof file format design


Pim Otte (Sep 23 2025 at 09:11):

We are currently about to attempt to add (Verbose) Lean support to the Waterproof project. We're having some internal discussion about the file format, and we're looking for some community input. (In particular, @David Thrane Christiansen and @Patrick Massot ) Any input is welcome, but especially if you see any major advantages/problems with these approaches.

The key question: What file structure should we use for exercise sheets as the basis Lean for Waterproof?
Options:

  • A (custom) Verso Genre
  • A regular lean file with magic comments
  • A Lean Verbose file, with magic comments (if needed?)
  • Something else entirely?

If there is a feasible migration path and a good reason for it, we could also start with one format and migrate to another, but this is not our preference.

Example of the current file format here

Current file format features:

  • Marking input areas for students
  • Collapsible sections with titles for pedagogical hints
  • Collapsible sections with titles for hiding technical code
  • Markdown with LaTeX for explanatory text.

Goals of the choice for the Lean format:

  • Good LSP support for the raw file format
  • Well-specified format for interleaving code and text
  • Connect to the Lean ecosystem in the appropriate "spot"
  • Future possibilities of building on top of this format (e.g. interactive websites)

Some optional context:

  • We currently split code cells for input areas, which is currently not supported by Verso (issue). Example here
  • There's prior art for using magic comments for Anchors in Verso. We suspect this means that we could in the future likely render a magic-comment version with Verso with a similar setup as in for example Tao's analysis
  • Currently, our editor connects to the LSP and uses that information. Building the whole document with Verso is likely too long of a feedback loop at this point, and is only relevant for the prospect of building future (interactive) website versions.

David Thrane Christiansen (Sep 23 2025 at 09:15):

I want to give this a proper think before responding, but I do want to let you know that I've seen the question and am pondering

Patrick Massot (Sep 24 2025 at 19:42):

I am very happy that you are working on this, but indeed I think this is a question for David rather than me.

Jon Eugster (Sep 25 2025 at 06:39):

As a side note, in lean4game we use regular Lean files with magic, custom commands (elabs), but this dates from a time before verso existed. In a recent discussion with @Jason Reed from the FRO, it has been discussed that this could change to a custom verso genre in the far future.

The issue with that is that last time I checked, verso was still very much in progress with documentation only being created, so I haven't yet learned how it works in detail.

David Thrane Christiansen (Sep 25 2025 at 13:45):

We discussed this thoroughly at today's team meeting. I have to go pick up my kid but I will report back soon!

Alex Kontorovich (Sep 25 2025 at 22:27):

@Jon Eugster , what are the chances that verbose can be combined with the game server?... :)

David Thrane Christiansen (Sep 26 2025 at 09:53):

OK, first off, here's a summary of my understanding of your needs. If these assumptions are wrong, let me know so I can revise the thoughts!

Assumptions:

  1. You need to edit these files both as text files and through the Waterproof interface
  2. Your input areas are always editable code, rather than e.g. a sequence of rich text and code that's interleaved
  3. It's important to be able to interject explanatory text into the middle of code. A proof by cases might need to have substantial text along with each case.
  4. The borders between code and rich text are always aligned with a newline (there's no inline rich text in the middle of a single line of code, eg let x := /-RICHTEXT-/ 5; x).
  5. "Good LSP support for the raw file format" includes support for the non-Lean parts as well as the Lean parts, e.g. giving code actions, catching markup syntax errors with red squigglies, etc.

Next, a key bit of info: Verso is really a collection of fairly reusable parts. It's a Lean library first and foremost, rather than a program you run. In particular, you can use the Verso markup language without using any of the other parts of it. The genre system is probably not relevant to you, as it's concerned with things like generating output, controlling user extensibility, and tracking cross-references within and between documents.

One thing that's not a huge impact here, but would be good to know: should we think of a Waterproof document as being a sequence of interleaved snippets of code and rich text, or should we think of it as hierarchically organized, so that a bit of text in the middle of a proof is considered "part of" or "subordinate to" said proof? We suspected the former but confirmation would be good.

All right, all that said: you should be able to use the Verso markup language to build the frontend you need, for interactive use. This isn't something you'll get out of the box, but we thought about the likely challenges:

  1. No "proper" literate programming support. We don't yet have a way to start the Lean parser out in Verso mode, which means that you'll have to use a hack of some kind to control the rendering of imports. If it's just a simple "show/no show" flag, then this won't be so bad; if you need something more, then it might require a bit of special-purpose comment processing.
  2. Processing Lean blocks independently won't work well. Right now, "pausing" the Lean parser at the end of a block and resuming in the next one is not so easy to do. Because a Lean proof is a single syntactic form, rather than a sequence of top-level commands, a solution must be found. Here's some ways to explore:
  • Have a pre-processing step on the Verso syntax tree that produces "plain" Lean syntax, and use this with the language server. There's a few ways to do this, including replacing each character in the original file that's not part of Lean code with a " " or " " (depending on weird UTF-16 specifics), causing the LSP source positions to still line up. If you don't have inline code with column positions to preserve, then you can just replace whole lines with blank lines and the whole thing is easier. You can do all this without generating extra files, but incrementality will be a bit of a challenge. The Verso LSP server extensions can be mostly reused here, I think - if not, we can generalize them.
  • Write an LSP proxy for your frontend syntax that forwards the Lean commands on to the Lean language server. This could use the Verso parser. You can write this in Lean - we have a good library for writing LSP servers, after all ;-) This means that the student code is truly "pristine Lean", it gets rid of any challenges with import syntax not being extensible. I've never written such a thing, but it seems manageable, if substantial.
  • Use magic comments to interleave into Lean. This seems to be a bad choice, but it's here for completeness. This would mean there were two ways of including text, and the LSP support in the comments would be bad.
  • Use custom syntax to escape Lean to Markup, as @Jon Eugster mentioned. I tried this pretty seriously for code examples in Verso, and it didn't work well in practice - I'm working towards eliminating the feature. The problem is that we'd need these custom syntax formers in many, many syntax categories, some of which aren't extensible, and they are hard to make look nice while reading the code. I really think some kind of approach outside Lean's syntax is best for what you're doing here.

You probably can't get parsing to be resumable by serializing parser states to restart them on the next block. While most parser state is represented in Parser.State, there's also an implicit call stack that we don't have a good performant way to capture, which is relevant for things like <|>. Making the parser pauseable while still being fast would be a cool research project, but I suspect it's not one that you're interested in embarking on.

So, in summary: it seems that using Verso's markup language is a fairly short path to getting something with good LSP support, because the Verso parser already does things like tracking source positions, and it already integrates into Lean. The rest of Verso is probably not relevant to you right now and can be ignored, but it'll be there waiting if you ever find a use for it (e.g. turning some Waterproof-based materials into a printed textbook, or automatically adding links to the reference manual). These pieces are fairly orthogonal and can be used on their own.

A downside is that using Verso means your file is "mostly Markdown" rather than "markdown". If you need to process it with something else, parsing it outside Lean will be difficult. I can help you write an exporter, though.

Happy to have a call about this sometime if that'd be easier, or to answer questions here.

Pim Otte (Sep 26 2025 at 14:46):

Thanks so much, this is very thorough and enlightening!

About the assumptions:

  1. Correct
  2. Is effectively true right now, but the Rocq version does support the sequence. (To clarify, this means input areas for students, teachers can edit the whole document through our interface)
  3. The reason the splitting is an issue is because we have exercise sheets with partial proofs followed by an input area. That being said, the thing you are mentioning is something that is very nice to have, but not a blocker if we don't.
  4. Correct
  5. Correct

On interleaved vs hierarchical:

I think functionally most current exercise sheets fit the interleaved model. We translate the document to prosemirror, which does have a more hierarchical bend. My personal view is: interleaved code and text, with a hierachy for input areas and hints. These effectively give only one level though, we're not nesting these.

Challenge 1: We've thought about this and we figured we need some sort of hack for this, but we also suspect it won't yield any fundamental issues.
Challenge 2: The pre-processing option seems the most promising to me. A couple of questions about this option:

  1. The whitespace is single-space and double-space? or is it something more esoteric? (I suspect we can do whole lines, but still wanted to check this)
  2. Do you have any pointers to code that might be useful as examples for this approach? (I have found this and this, but we might not even need to dig in so deep if the LSP extensions can be reused)
  3. What do you suspect are the consequences of incrementality being a challenge? Would this even be a challenge at the lemma level or just inside proofs?

Pim Otte (Sep 26 2025 at 15:01):

Also, am I right in assuming we could trade off some complexity by initially dropping Assumption 3, disallowing text in proofs, and doing input areas with magic comments? Then we could still use the verso parser to structure most of the document, but we don't have to do the LSP extension business.

Pim Otte (Oct 02 2025 at 07:36):

@David Thrane Christiansen friendly tag to ensure you saw the messages above:)

David Thrane Christiansen (Oct 02 2025 at 08:00):

Thanks for the ping!

Pim Otte said:

Challenge 2: The pre-processing option seems the most promising to me. A couple of questions about this option:

  1. The whitespace is single-space and double-space? or is it something more esoteric? (I suspect we can do whole lines, but still wanted to check this)

It's just single vs double space, depending on whether the character being replaced takes one or two bytes in 16-bit Unicode. If all boundaries are on line boundaries, this may not even be necessary, and you can try just having the line numbers line up. That should be a quick experiment.

  1. Do you have any pointers to code that might be useful as examples for this approach? (I have found this and this, but we might not even need to dig in so deep if the LSP extensions can be reused)

The former is an example of writing custom LSP handlers that hook in to the Lean LSP server. It may have some useful techniques, but it's not really what I was thinking of. The latter is part of Lean, which gives you good datatypes for the protocol.

An LSP proxy would be a language server just for Waterproof that invoked the Lean language server as a subprocess. When it received Lean messages from either side, it'd forward them on. This would be a significant amount of work, and result in something flexible, but there's likely to be many corner cases.

  1. What do you suspect are the consequences of incrementality being a challenge? Would this even be a challenge at the lemma level or just inside proofs?

Both. By "incrementality" I mean that Lean doesn't need to re-check a whole file in response to changes. It can "rewind" to a known-good state prior to the changes and respond much faster. Incrementality exists at the level of commands (lemmas, definitions, etc) and at the level of tactics within proofs. Without it, Lean is much less pleasant to use. The preprocessing version would need to take this seriously to give a good user experience. Delegating to Lean's own incrementality is doable, but constrains the design space; writing your own is possible but very challenging and I'm not sure we have the bandwidth to provide quite enough support on that.

David Thrane Christiansen (Oct 02 2025 at 08:02):

Pim Otte said:

Also, am I right in assuming we could trade off some complexity by initially dropping Assumption 3, disallowing text in proofs, and doing input areas with magic comments? Then we could still use the verso parser to structure most of the document, but we don't have to do the LSP extension business.

Yes, this would make it much simpler. If Lean commands don't span code blocks (that is, if it's more tree structured than interleaved segments) then much of the difficulty evaporates. Having an implementation to iterate on is always good, and if the restriction is relaxed, then it should be quick to write a Python script to migrate older documents.

Pim Otte (Oct 02 2025 at 08:13):

@David Thrane Christiansen Thanks! Maybe to clarify, in my question 2 with "This approach" I meant the preprocessing idea, not the proxy.

David Thrane Christiansen (Oct 02 2025 at 08:20):

Oh, sorry!

Here's an implementation of the string whitespace edits in Verso.

David Thrane Christiansen (Oct 02 2025 at 08:21):

I don't have an implementation of the whole thing anywhere that I know of.

David Thrane Christiansen (Oct 02 2025 at 08:22):

I'm likely to get rid of that code soon now, because the Lean parser now supports stop positions prior to the actual EOF, so I'll be able to just parse Lean blocks directly instead parsing a "shadow string" like this

Pim Otte (Oct 24 2025 at 15:30):

I ended up coding up a hacky version of the multilean directive, and built two sample files: one with Verbose and one more to have some syntax examples.

Pim Otte (Oct 24 2025 at 15:32):

@David Thrane Christiansen Thanks for the pointers, they helped a lot:)

Unless I am mistaken, it is not possible to nest directives. Is this something you would consider in scope for Verso?

David Thrane Christiansen (Oct 27 2025 at 05:07):

They can absolutely be nested - just like you nest code blocks by using more backticks, you nest directives by using more colons. E.g.

````
```lean
blah
```
````

and

::::outer
Text text
:::inner
 * More text
 * And yet more
:::
and outer directive again
::::

Pim Otte (Oct 27 2025 at 05:13):

Ahh, great! I did try varying the number of colons, but I put more on the inside rather than the outside.

David Thrane Christiansen (Oct 27 2025 at 05:14):

That makes sense!

I think we should relax the syntax a bit and just require a distinct number of colons for each matching pair, rather than a strict hierarchy. That would accept both variants, and be more convenient.


Last updated: Dec 20 2025 at 21:32 UTC