Zulip Chat Archive

Stream: Lean Together 2024

Topic: A Pretty Expressive Printer - Sorawee Porncharoenwase

David Thrane Christiansen (Jan 10 2024 at 16:44):

Here's the paper: https://arxiv.org/abs/2310.01530

namibj (Jan 10 2024 at 16:52):

The conceptual flexibility here w.r.t. the layout-vs-cost correspondence together with how flexible the layout choices can be, makes me hopeful about getting an easy extension to support for proportional fonts.

David Thrane Christiansen (Jan 10 2024 at 16:54):

I implemented support for proportional fonts in a variant of a Wadler-style pretty printer once, in Haskell. The technique was that the renderer (Doc -> String) had to run in a monad that supported the ability to measure the empirical width of a string

David Thrane Christiansen (Jan 10 2024 at 16:55):

I had a backend with ghcjs that would reflow documents in response to browser resizing

namibj (Jan 10 2024 at 16:57):

Yeah, "measure empirical width" is what I'd have reached for as the naive way, too. Does neglect kerning, though: e.g. width of spaces is somewhat flexible, and mild variation there usually trumps mild vertical unalignment.

Henrik Böving (Jan 10 2024 at 16:58):

Here's my not (yet) so well optimized clone: https://github.com/hargoniX/pfmt/

David Thrane Christiansen (Jan 10 2024 at 17:00):

This is why it's necessary to measure strings rather than characters - this is needed for scripts like Arabic or Devanagari even moreso than for Latin

Alexandre Rademaker (Jan 10 2024 at 17:02):

So @Henrik Böving, to make it clear, this is your ongoing implementation of a lean library from the ideas of the paper, right? despite the performance, is it feature complete?

Henrik Böving (Jan 10 2024 at 17:03):

Yes, it doesnt have any proofs etc. but it can do the stuff from the paper, see the unit tests in Main.lean: https://github.com/hargoniX/pfmt/blob/main/Main.lean#L5-L128

The two things that have to be done are the performance and actually integrating the algorithm with Lean.

David Thrane Christiansen (Jan 10 2024 at 17:03):

Though vertical alignment is even more complicated with variable-width fonts - I pretty-printed Haskell and had to have ways of aligning = and | vertically such that the | was centered WRT the =

David Thrane Christiansen (Jan 10 2024 at 17:05):

The other thing we did was add semantic annotations (like the tags in the Lean pprinter), and we needed a way of having annotations affect font choice and make that interact with the measurement (e.g. bolding keywords makes them wider)

Alexandre Rademaker (Jan 10 2024 at 17:11):

Thank you @Henrik Böving; I am looking for a pretty printer for my https://github.com/arademaker/delphin. It can parse structures like https://github.com/delph-in/docs/wiki/MrsRFC#overview. But I also need to serialize them.

David Thrane Christiansen (Jan 10 2024 at 17:12):

Lean itself contains a pretty printer that's not as fancy as @Sorawee Porncharoenwase's - the document datatype is called Std.Format

Alexandre Rademaker (Jan 10 2024 at 17:14):

Yes @David Thrane Christiansen , I started to read Wadler's paper to understand it. I do need more time, I could not make it work yet.

David Thrane Christiansen (Jan 10 2024 at 17:19):

Quick demo to get started:

import Lean

inductive Expr where
  | atom (name : String)
  | list (entries : List Expr)
deriving Inhabited

partial def pp (e : Expr) : Std.Format :=
  match e with
  | .atom x => .text x
  | .list xs => .text "(" ++ .group (.nest 2 (.joinSep (xs.map pp) .line) ++ ")")

partial def ex : Nat → Expr
  | 0 => .atom "x"
  | n+1 =>
    let ns := n.fold (· :: ·) []
    .list (ns.map ex)


/-
((() x) (x) () x)

-/
#eval pp (ex 5)

/-
(((((x) () x) (() x) (x) () x) ((() x) (x) () x) ((x) () x) (() x) (x) () x)
  (((() x) (x) () x) ((x) () x) (() x) (x) () x)
  (((x) () x) (() x) (x) () x)
  ((() x) (x) () x)
  ((x) () x)
  (() x)
  (x)
  ()
  x)

-/
#eval pp (ex 10)

David Thrane Christiansen (Jan 10 2024 at 17:19):

Note that Wadler's implementation won't work in a strict language - the code needs refactoring for that

David Thrane Christiansen (Jan 10 2024 at 17:20):

But I think I'd learn this not by reading the paper, but instead by working with a few examples to get a feel for the API

David Thrane Christiansen (Jan 10 2024 at 17:23):

You can write a reasonable pretty printer just with .line (newline), .group (undoes newlines inside itself as long as it fits on the line), .text (inserts a literal string), ++ (append two documents), .nest (controls indentation following newlines in its body)

Patrick Massot (Jan 10 2024 at 17:30):

Do these tools also allow to enforce code style, such as forbidding space after an opening parenthesis?

David Thrane Christiansen (Jan 10 2024 at 17:30):

Here's an annotated version of that program that might be helpful:

partial def pp (e : Expr) : Std.Format :=
  match e with
  -- Emit atoms literally
  | .atom x => .text x
  -- A list is parens surrounding its contents. Put them on one line if
  -- possible, or one item per line otherwise with two spaces indentation.
  | .list xs =>
    -- Start with a (
    .text "(" ++
    -- "group" means to put the contents on the same line if possible. It does
    -- this by replacing all the "line" calls immediately under itself (that is,
    -- not under a nested "group") with spaces
    .group
      -- "nest 2" means to indent by two extra spaces after newlines
      (.nest 2
        -- joinSep takes a list of documents and a separator document, and
        -- appends the list with the separator between them. Here, the separator
        -- is "line", putting newlines between the items. The final closing
        -- paren is appended here to put it on the same line as the last list
        -- element.
        (.joinSep (xs.map pp) .line) ++ ")")

David Thrane Christiansen (Jan 10 2024 at 17:31):

Patrick Massot said:

Do these tools also allow to enforce code style, such as forbidding space after an opening parenthesis?

Only indirectly, by comparing the output of the formatter to its input

Henrik Böving (Jan 10 2024 at 17:31):

Patrick Massot said:

Do these tools also allow to enforce code style, such as forbidding space after an opening parenthesis?

the idea is that we take your code as a syntax tree together with a series of valid ways to format it and pick the optimal one. If your valid ways to format it dont contain spaces after an opening parenthesis you won't get that in the result.

Justin Pombrio (Jan 10 2024 at 17:32):

Note that Wadler's implementation won't work in a strict language - the code needs refactoring for that

Yeah, as you said it's not too hard to work out how to do so oneself. There's also a short paper describing how to implement Wadler's algorithm in a strict language: https://lindig.github.io/papers/strictly-pretty-2000.pdf

David Thrane Christiansen (Jan 10 2024 at 17:33):

Ken Friis Larsen has an SML port that's very short, with few features, but that also illustrates the technique

David Thrane Christiansen (Jan 10 2024 at 17:36):

https://github.com/kfl/wpp

Sorawee Porncharoenwase (Jan 10 2024 at 17:37):

Here's my slides:
lean-together-2024.pdf.

It's my working hours right now, so I won't be able to participate in this thread yet. Though @Justin Pombrio (a co-author of this work) should feel free to chime in :)

Justin Pombrio (Jan 10 2024 at 17:42):

namibj said:

The conceptual flexibility here w.r.t. the layout-vs-cost correspondence together with how flexible the layout choices can be, makes me hopeful about getting an easy extension to support for proportional fonts.

Is there a reason to want proportional fonts for Lean in particular?

Adding support for proportional fonts in Wadler-style printing is "easy" in terms of the algorithm, because the only thing the algorithm needs to check is "is this line longer than the max width". (The hard part just being figuring out how big text is before you render it.)

Adding support for proportional fonts in any printer that supports vertical alignment (e.g. Bernardy's printer, or Sorawee and my printer) is harder in terms of the algorithm, because the running time of the algorithm is worst-case polynomial in the number of _possible_ line widths. With a fixed-width font, there are 80 possible line widths between 0 and 80 characters long. With a proportional-width font, there are very many possible line widths between 0.0 and 80.0 em. I think what you would want to do is clump similar widths together, saying e.g. that a width of 40.12 and 40.17 are "close enough" and treat them as equal when determining the pareto frontier.

Justin Pombrio (Jan 10 2024 at 17:48):

Henrik Böving said:

Here's my not (yet) so well optimized clone: https://github.com/hargoniX/pfmt/

Henrik, this is great. We should stay in touch. Sorawee in particular has spent a lot of time optimizing PrettyExpressive and probably has helpful advice. Definitely email us if you have any questions; I'll DM you our email addresses.

David Thrane Christiansen (Jan 10 2024 at 17:54):

@Justin Pombrio The Wadler-Leijen pprinter does have vertical alignment, right? How does that fit into your framework?

David Thrane Christiansen (Jan 10 2024 at 17:57):

Are they not considered due to breaking optimality?

Henrik Böving (Jan 10 2024 at 17:58):

There is a vcat operation in the OCaml implementation that is based on the other operators.

David Thrane Christiansen (Jan 10 2024 at 17:59):

By "framework" here I meant "conceptual framework" - sorry for being unclear

Justin Pombrio (Jan 10 2024 at 18:13):

Leijen's pretty printer does technically have vertical alignment, but you really don't want to use it because it can make some very bad choices due to being greedy. For example, it thinks that at max-width=4 this format:

AAAB
   C
   D

is better than this format:

AAA
BCD

This isn't a weird edge case that would rarely come up in practice, I think you'd very commonly hit situations like this.

David Thrane Christiansen (Jan 10 2024 at 18:15):

Interesting - I've used it a lot in the past without encountering many bad layouts in practice. But it's certainly theoretically poorly behaved

Justin Pombrio (Jan 10 2024 at 18:16):

Oh I'm interested to hear this. Where did you use vertical alignment, and could it ever be far to the right? That's where it runs into issues, when the alignment is near the right margin. It certainly behaves very badly for Racket, but Racket has a _ton_ of possible alignment points.

David Thrane Christiansen (Jan 10 2024 at 18:19):

I used it in Idris, for doing the traditional layout of do, like:

main =
  repeat $ do x <- getLine
              print x

In practice, I think it occurred fairly leftwards, which explains the lack of pain

David Thrane Christiansen (Jan 10 2024 at 18:20):

Also for types in REPL output

David Thrane Christiansen (Jan 10 2024 at 18:20):

https://github.com/search?q=repo%3Aidris-lang%2FIdris-dev+align+language%3AHaskell&type=code&l=Haskell

Justin Pombrio (Jan 10 2024 at 18:20):

Yeah, the place it would get painful would be if the do was far right and you got (near the right margin):

do x
     <-
     getline
   print
     x

David Thrane Christiansen (Jan 10 2024 at 18:21):

The do would in practice be preceded by a grouped linebreak that would break first, IIRC

David Thrane Christiansen (Jan 10 2024 at 18:21):

It's about a decade since I worked on this though

Justin Pombrio (Jan 10 2024 at 18:22):

Well the trouble is when do x _does_ fit on the line. Wadler's algorithm is greedy so at that point it commits. Unless there's something more complicated going on I don't understand.

David Thrane Christiansen (Jan 10 2024 at 18:23):

Right!

David Thrane Christiansen (Jan 10 2024 at 18:24):

We only used this for compiler output, not for formatting user code, and I suppose that do blocks in compiler output are typically in the left in e.g. a unification error

David Thrane Christiansen (Jan 10 2024 at 18:38):

Another question - reference equality (a la eq?) is important for observing the graph structure in the Racket implementation. Would hash-consing be an alternative technique? That is, is it important that accidentally equal subdocuments not be eq? with each other?

Justin Pombrio (Jan 10 2024 at 18:49):

David Thrane Christiansen said:

Another question - reference equality (a la eq?) is important for observing the graph structure in the Racket implementation. Would hash-consing be an alternative technique? That is, is it important that accidentally equal subdocuments not be eq? with each other?

Would it be possible to give each document a unique id when it's constructed, without needing to do something horribly tedious like use a DocIdGenerator each time? That would be the simplest/fastest. Though understandable if Lean doesn't allow you to violate the property that identical values are indistinguishable.

If you can't do that, I'm not sure hash-consing would be sufficient? You might need to compare hash-conses and then fall back to a full equality check if the hashes are the same.

Another option would be to scan the document once and give each node a unique id, if there's an appropriate time to do so. I forget when exactly these eq? checks are used.

This might be a question for @Sorawee Porncharoenwase (who's unavailable at the moment but will respond later).

Henrik Böving (Jan 10 2024 at 18:51):

We can do pointer equlaity in Lean with a bit of unsafe.

David Thrane Christiansen (Jan 10 2024 at 18:51):

The situation is basically like it would be in Haskell - adding unique IDs could be done with unsafe features, or in a monad

David Thrane Christiansen (Jan 10 2024 at 18:52):

Lean also has a nice computed fields feature that could be used to make document hashes immediately manifest on the constructors

Justin Pombrio (Jan 10 2024 at 18:53):

Yeah that's probably what you want to do (pointer equality). It's _technically_ an optimization, the behavior shouldn't change if you duplicated a node with a different pointer value.

David Thrane Christiansen (Jan 10 2024 at 18:53):

I was more curious as to whether _inequality_ ever mattered - sounds like it doesn't

Justin Pombrio (Jan 10 2024 at 18:54):

You should wait until Sorawee responds to be too sure about this though.

David Thrane Christiansen (Jan 10 2024 at 18:54):

For sure :-)

David Thrane Christiansen (Jan 10 2024 at 18:56):

The hash-consing would be interesting because a document would be represented _by_ its uniqueID/hash, rather than using the hashes to compare. I think the gain would be usability rather than performance - it wouldn't require so much care to maintain sharing

David Thrane Christiansen (Jan 10 2024 at 18:57):

Anyway, thanks for answering so many questions!

Justin Pombrio (Jan 10 2024 at 19:05):

Yeah. There's a very nice implementation of tree-shaped data, where you store the nodes in a vector and "pointers" from one node to another as indices into that vector. I've seen this give something absurd like a 2x speedup over regular data structures in Rust, presumably due to the lack of allocation and maybe also cache locality.

Justin Pombrio (Jan 10 2024 at 19:06):

Happy to answer questions, it's exciting to see our work being used!

Sorawee Porncharoenwase (Jan 12 2024 at 07:02):

David Thrane Christiansen said:

That is, is it important that accidentally equal subdocuments not be eq? with each other?

It's totally fine for accidentally equal subdocuments to be eq? with each other. In fact, we ideally want the document to be maximally shared!

I think hash consing could work, but we should never hash the whole document or compare documents structurally on the whole value, because a document could be large. The right way would be to compute a hash incrementally based on children node's hashes.

Lookup cost

One thing that I want to point out, is that we switched away from using hash table whose key is document (or document identity) for resolve in the current implementations of PrettyExpressive, due to the lookup cost.

Just so that we are on the same page, what we would like to do is to (essentially) memoize the function resolve : Doc -> Config -> Set Layout.

Our earliest approach was to create a memoization table with the type:

table : HashEQTbl Doc (HashTbl Config (Set Layout))

where HashEQTbl is a mutable hash table whose key is (efficiently) compared by reference equality and HashTbl is a regular mutable hash table. Not all languages provide HashEQTbl. Racket does have it (hasheq), and we used this approach for the Racket PrettyExpressive. For languages without HashEQTbl, we can instead forge an identity integer key for each newly constructed document (say, in the smart constructors of documents). This approach was employed by the OCaml PrettyExpressive. The memoization table in this case would have the following type:

table : HashTbl Integer (HashTbl Config (Set Layout))

What we discovered, though, is that for a large document, there will be many hash table lookups (on the document / identity integer key), which is pretty costly. So the current version of PrettyExpressive (both Racket and OCaml) instead stores memoized results in the document structure itself.

type Doc = { table : HashTbl Config (Set Layout),
             val: DocCase }
data DocCase =
| Text String
| Concat Doc Doc
| Choice Doc Doc
| ...

This eliminates the hash table lookup cost: when we resolve on a document, the memoized results are already in its field. Note, though, that it introduces several complications:

(1) The tables must be cleared manually, which matters if we want to pretty print multiple times with different cost factories, since results for one factory is not applicable for another factory.
(2) Garbage collection can't claim space until the tables are all clear.
(3) Since documents are now stateful, they can't be pretty-printed concurrently.
(4) Layout actually contains its cost, so it is parameterized by the cost type. This means Doc must now be parameterized by the cost type as well.

These complications are no big deal for PrettyExpressive though.

For (1) and (2), it turns out that this inconvenience can sometimes be a feature: it makes _incremental formatting_ easy! In code formatting "on editing", the language server would invoke the pretty printer multiple times on syntax trees that differ only slightly. Provided that we reuse the existing (sub-)documents for identical (sub-)trees, the memoized results are automatically transferred. Therefore, subsequent formatting does not need to compute as much.

In the OCaml PrettyExpressive in particular, the pretty printer is a module functor, parameterized by the cost factory module. Therefore, there is not even a need to do the manual table clearing. If you change a cost factory, you need to construct a new document anyway.

For (3), we can workaround easily by copying the document, and I personally never need to pretty print concurrently.

For (4), PrettyExpressive has an extra feature cost, which allows us to arbitrarily add extra cost to a document. This already makes the type of Doc parameterized by the cost type anyway, so it's not an issue for PrettyExpressive.

Henrik Böving (Jan 12 2024 at 07:40):

I think hash consing could work, but we should never hash the whole document or compare documents structurally on the whole value, because a document could be large. The right way would be to compute a hash incrementally based on children node's hashes.

This is something that we can do nicely in lean with the computed fields features, see src#Lean.Expr

where HashEQTbl is a mutable hash table whose key is (efficiently) compared by reference equality and HashTbl is a regular mutable hash table. Not all languages provide HashEQTbl

This is also something that we can do in Lean with a bit of unsafe I would say, the ocaml approach would also be viable though.

What we discovered, though, is that for a large document, there will be many hash table lookups (on the document / identity integer key), which is pretty costly. So the current version of PrettyExpressive (both Racket and OCaml) instead stores memoized results in the document structure itself.

This is the thing I was wondering about when I was porting the OCaml version. We can't really do this nicely in Lean since it doesn't really have a mutable HashMap, instead if the HashMap is used linearly you have a guarantee that it gets updated in place. But this is of course not the case if we employ document sharing. I guess we could put an STRef (HashMap k v) in here? but I have no idea if that would be nice/how good that would work or if something else is a better approach here, maybe someone can chime in on that regard.

Sorawee Porncharoenwase (Feb 18 2024 at 20:26):

@David Thrane Christiansen asked in the talk if it's possible to support a "tabular layout". The answer is ... kind of. I prototyped this new two_columns feature in the OCaml implementation at https://github.com/sorawee/pretty-expressive-ocaml. It allows you to format something like:

def Doc.size : Doc → ℕ
  | Doc.text _       => 1
  | Doc.this_is_a_super_long_name _
                     => 1
  | Doc.cc d₁ d₂     =>
    Doc.size d₁ + Doc.size d₂ + 1
  | Doc.choice d₁ d₂ =>
    Doc.size d₁ + Doc.size d₂ + 1

where both | and => across rows are aligned.

Note that this is actually not a table. For one, it only supports two columns. Also, in each row, The first line of the right column starts at the last line of the left column (see the Doc.this_is_a_super_long_name row, where the left column has two lines). And the column separator is only "inserted" to the lines where both columns meet (so other lines like Doc.size d₁ + Doc.size d₂ + 1 don't matter, and can pass through the column separator position).

This two-columns layout follows most conventional programming syntax (top to bottom, left to right). So it's actually good that it's not an actual table (though this means we can't format something like 2D syntax, which is an actual table).

My initial implementation is horribly inefficient, but I later improved that a bit via the zipper data structure and caching (pushed to the git repo, but not yet published on opam). It should work really well if there are not too many rows.

Last updated: May 02 2025 at 03:31 UTC