Zulip Chat Archive

Stream: lean4

Topic: RFC: "String gaps" for multiline string literals


Kyle Miller (Nov 05 2023 at 03:36):

I was looking at creating an RFC for implementing "string gaps" to be able to break long string literals. It would be good to get some community input before creating the actual RFC. There's a draft PR at lean4#2821, and in it there's a draft RFC. I'll paste it here on Zulip for comment.

Kyle Miller (Nov 05 2023 at 03:36):

Updated from Haskell-style gaps to Rust-style ones after discussion. Switched to more restrictive syntax and parsing after more discussion.

Proposal. We should implement "gaps" for string literals. These are escape sequences of the form "\" newline whitespace+ that have the interpretation of an empty string. For example,

  "this is \
     a string"

is equivalent to "this is a string". These are like string continuations in Rust. It is a parse error for the whitespace after the newline to contain newlines; this is to help prevent confusions such as users believing that they can put comments in the gap.

The name "gap" comes from the Haskell-style lexical syntax, which has the same purpose but instead with the syntax "\" whitespace+ "\".

Rationale. In projects that have a line length limit (such as std and mathlib), long strings need to be wrapped to respect this limit. A workaround one sees is using string interpolation with the empty string, since the expression between curly braces may itself contain whitespace that is ignored:

  s!"this is {
  ""}a string"

However, this term produces "this is " ++ "" ++ "a string", which is not an atomic string literal. Note that users do not write

  "this is
   a string"

because this includes extraneous whitespace in the string. This particular example is "this is\n a string".

Caveats.

  1. Note that the whitespace in a string gap must be whitespace. Comments are not permitted. The parser does not interpret string literals (that is, it does not do escape sequence replacements), and they are instead interpreted during elaboration time. If the gaps were to admit comments, this interpretation process would need to be able to handle comment syntax. This limitation is also found in the Haskell 2010 language specification.

  2. The implementation needs to be careful not to allow string gaps inside character literals. String literals and character literals share code for handling escapes, but the string gap feature only makes sense for strings.

  3. The Rust implementation uses a warning instead of a parser error when its gaps contain multiple newlines, since they have found allowing multiple newlines to be confusing. For simplicity, we have this condition raise a parser error (rather than waiting for elaboration time to emit a warning). Note that the lexical syntax for gaps is still "\" newline whitespace*, but we choose to have the parser throw an error if whitespace* contains a newline.

Other designs for multiline strings include:

  • Trying to adapt C's automatic string concatenation, where the juxtaposition "this is" " a string" is equivalent to the concatenated version. However, this creates an ambiguity when passing strings as arguments to functions.
  • Allowing interpolated strings to include expression-free {} notation
    s!"this is { }a string"
    However, this forces users to switch to using an interpolated string if they want a multiline string literal.

  • Using Haskell-style gaps. While they do give the user more precise control over whitespace, the second \ can easily be confused for an escape. Everything you can do with Haskell-style gaps can be represented in some way with Rust gaps, with small variations in line lengths.

  • Using unrestricted "\" whitespace+ gaps. These are easy to parse, but here is an example confusion this can lead to.
    "this is \ -- that's the first half a string" -- and that's the second
    The first "comment" is actually part of the string. Requiring that the first whitespace character be a newline prevents this. Even requiring that the first whitespace character be a newline, another example confusion is
    "this is \ -- that's the first half a string" -- and that's the second
    Again, the first "comment" terminates the string gap and becomes part of the string.

Summary. There is no need to reinvent lexical syntax for multiline string literals. The Rust solution satisfactorily solves the problem of long strings that need to be wrapped over multiple lines. There are no downsides to adding gaps to Lean's lexical syntax, and the feature has a simple implementation.

Julian Berman (Nov 05 2023 at 03:55):

I don't know anything about Haskell, but is it worth considering whether this fits in with yet more kinds of useful string literals, specifically -- dedenting string literals where the common whitespace is trimmed? Some other languages have "tags" or other funky prefixes you can put before literals (I have vague memories of some friends telling me the programming language E has some specifically elaborate/nice version of this).

But perhaps it's worth considering whether s!, dedenting, and this kind of single-line literal can/should have common form?

Timo Carlin-Burns (Nov 05 2023 at 03:58):

Swift's multiline string literals ignore leading indentation up to the indentation of the closing quote, which would be nice to consider for Lean as well. If we're already ignoring that indentation, then the approach with just a \ at the end of the line is more concise.

Julian Berman (Nov 05 2023 at 04:01):

nod -- there's a collection of some dedent syntaxes here in the TC39 proposal.

Kyle Miller (Nov 05 2023 at 04:13):

@Julian Berman Could you expand on what you mean by "But perhaps it's worth considering whether s!, dedenting, and this kind of single-line literal can/should have common form?", in particular with s!? You'll find this {""} trick in s! and m! strings all over std and mathlib too, so multiline literals aren't just about plain string literals.

Julian Berman (Nov 05 2023 at 04:18):

Essentially I'm asking about "standardizing" on <modifiers>*!"<literal>" -- (to expose some ignorance I don't know what m! literals do! EDIT: OK I guess I half understand what they do from MessageData). But e.g. if s! did what s! did today, perhaps your string could/should be

j!"foo bar
baz quux"

j! for "joined" perhaps.

And then similarly dedented strings could/should be

    d!"
       foo
       bar
    "

and when someone dreams up some other one it too becomes whatever!.

And mixing them perhaps then is just mixing the prefixes, e.g. sd!"foo" is a dedented string with string interpolation, sj!"..." is a Kyle one-line string with interpolation, etc.

Julian Berman (Nov 05 2023 at 04:27):

I don't know, maybe the above fails your "has a simple implementation" -- I suppose it's likely it does, given that some combinations don't / will not make sense. But just threw it out as it was the first reaction to your message (of course the idea regardless looks totally reasonable to me.)

Kyle Miller (Nov 05 2023 at 04:28):

@Timo Carlin-Burns Swift's multiline string literals seem nice, though they also seem to be somewhat orthogonal. I think the question is whether string gaps now would interfere with multiline string literals later.

Something I like about gaps is they give you freedom to indent the continuing line as much as you want.

  "this is \
       \a string"

Your single-\ proposal is concise, but then you would definitely have to write

  """
  this is \
  a string
  """

Maybe your proposal and gaps are compatible, though: you can add the rule for """ strings that if the next line has no additional whitespace then you can omit the second \ for the gap. That way you can write

  """
  this is \
    \a string
  """

if you want.

Kyle Miller (Nov 05 2023 at 04:39):

@Julian Berman I like designs with composable features -- though I think you can get this easily with Lean macros right now. There are two syntax types, str and interpolatedStr, for string literals, and for example s! strings are implemented here.

You could define d! strings in a similar way. If there were a String.dedent function, then you could define

syntax "d!" str : term

macro_rules | `(d! $s) => `(String.dedent $s)

and then get d!"..." syntax.

Mario Carneiro (Nov 05 2023 at 04:43):

Allowing a \ at the end of a line to mean string continuation. For example
"this is\ a string"
However, such continuations would not respect indentation since any whitespace at the start of the next line would be included. Furthermore, there is the small question of \r\n vs \n for the line ending.

Note that in rust, this exact syntax is supported, but "any whitespace at the start of the next line would be included" is not true, the \ consumes both the following newline and all leading indentation on the following line. IMO that's the much better default behavior, although I can see advantages to the haskell style if we want to also include leading indentation in the string itself. (A trick I will use in rust to get leading indentation is to start the next line with \n , indented to taste, as the \n escape inserts a newline but also doesn't count as whitespace for the purpose of the \ escape)

Kyle Miller (Nov 05 2023 at 04:50):

Would anything go wrong making the second \ optional? That gives you the Rust behavior while giving you finer-grained whitespace preservation if you need it.

The gotcha is if the second line starts with an escape, like say

  "this is\
  \ta string"

which would come out as "this ista string". You could instead write

  "this is\
  \\ta string"

to get "this is\ta string"

Mario Carneiro (Nov 05 2023 at 04:51):

I don't like optional because of the ambiguity around escapes, in particular if the last line is \"

Mario Carneiro (Nov 05 2023 at 04:51):

I suspect this will break regex syntax highlighters everywhere

Mario Carneiro (Nov 05 2023 at 04:52):

Not to mention that it looks like an escape on quick reading

Mario Carneiro (Nov 05 2023 at 04:52):

or if the trailing \ is off the screen

Kyle Miller (Nov 05 2023 at 04:54):

Yeah, that's something I don't like about Haskell's syntax. The main point in favor is that it's in an established language.

One alteration is to have the closing of a gap be a different character from \. Like perhaps "\ whitespace+ >"

  "this is \
       >a string"

Kyle Miller (Nov 05 2023 at 04:55):

Or \ followed by newline is "rust-style", and \> followed by whitespace followed by > is "haskell-style"

Mario Carneiro (Nov 05 2023 at 04:56):

Note that rust's syntax and several of these other points of comparison have been brought up in a recent RFC for multiline string literals in rust, although as you say this is a somewhat orthogonal feature

Mario Carneiro (Nov 05 2023 at 04:58):

in my proposal the "closing of a gap character" is \n (the character escape)

Mario Carneiro (Nov 05 2023 at 04:58):

  "this is \
       \n  a string with two space indent"

Kyle Miller (Nov 05 2023 at 04:59):

Ah, so "\ whitespace+ \n" would be the lexical syntax for a string gap? (the \n wouldn't be an actual newline, right?)

Mario Carneiro (Nov 05 2023 at 05:00):

the \n is an actual newline

Timo Carlin-Burns (Nov 05 2023 at 05:00):

Mario Carneiro said:

IMO that's the much better default behavior, although I can see advantages to the Haskell style if we want to also include leading indentation in the string itself.

When you say "include leading indentation", the indentation that Haskell style allows you to include is in the middle of a line. For example, Haskell style

  "this is \
    \  a string"

is this is a string.

It seems rare to want many spaces in the middle of a line, but If you really want that in Rust-style, you can always move them to the previous line:

  "this is   \
    a string"

Mario Carneiro (Nov 05 2023 at 05:00):

@Kyle Miller that example string has a newline in the middle with one trailing space on the first line and two spaces at the beginning of the second line

Mario Carneiro (Nov 05 2023 at 05:02):

If you don't want a newline, then this syntax is not so convenient if you want to have leading spaces on the second line, you would have to use a hex escape \x20 or something to make a space that isn't literally a space character. But this isn't usually a problem, you just put all literal spaces you want before the \ instead of after

Kyle Miller (Nov 05 2023 at 05:03):

Ok, I get it. This seems reasonable, and I'm convinced that you can do everything you'd want out of the Haskell-style gaps.

Mario Carneiro (Nov 05 2023 at 05:03):

"there are three spaces between the brackets <   \
                         >"

Mario Carneiro (Nov 05 2023 at 05:09):

One thing I will definitely agree with though is that anything is better than

"a long string {""
  }with no line break"

Kyle Miller (Nov 05 2023 at 05:41):

I've updated the PR and the RFC with Rust-style gaps, but I didn't make it so that the first whitespace character needs to be a newline like it does in Rust (I don't have any strong opinion on this, other than it's easier to implement without that check)

Mario Carneiro (Nov 05 2023 at 05:41):

does that mean that "foo \ bar" is also syntax for "foo bar"?

Kyle Miller (Nov 05 2023 at 05:43):

Yeah

Mario Carneiro (Nov 05 2023 at 05:43):

I think it's a bit weird for a line continuation character to be allowed in the middle of a line, but I guess it's mostly harmless... unless the \ is followed by some other escape letter like \n or \"

Mario Carneiro (Nov 05 2023 at 05:44):

it might restrict options for adding more escapes later

Mario Carneiro (Nov 05 2023 at 05:44):

or is the rule that the \ has to be followed by "a whitespace character" (including \n \r \t etc)

Kyle Miller (Nov 05 2023 at 05:45):

It requires whitespace after the \

Mario Carneiro (Nov 05 2023 at 05:46):

seems fine to me, the only related example I can think of is latex where \ is an actual escape sequence

Sebastian Ullrich (Nov 05 2023 at 08:29):

The Rust-style gaps look great to me, I wasn't aware of them before

David Thrane Christiansen (Nov 05 2023 at 18:28):

The linked Rust documentation explicitly calls out that it's confusing that a consequence of this rule is that

"One string \
      with a gap"

denotes the same string as

"One string \

      with a gap"

and they recommend against using the latter form, as it is a confusing feature that may be changed in the future.

Why should the rule for a gap be to skip all whitespace, rather than skip spaces, then a single newline, then spaces? This seems to run less risk of obscure corner cases in the language. You state that we could consider adopting a warning mechanism like Rust's later, but I think it would be easier to start with a parser that solves the problem users are having (convenient multi-line string literals that denote long single-line strings) and then add additional flexibility if there is every a use for it.

That is, I'd propose amending the proposal to replace

   string_item  : string_char | char_escape | string_gap
   string_char  : [^"\\]
   char_escape  : "\" ("\" | '"' | "'" | "n" | "t" | "x" hex_char{2} | "u" hex_char{4})
   hex_char     : [0-9a-fA-F]
   string_gap   : "\" whitespace+

with

   string_item  : string_char | char_escape | string_gap
   string_char  : [^"\\]
   char_escape  : "\" ("\" | '"' | "'" | "n" | "t" | "x" hex_char{2} | "u" hex_char{4})
   hex_char     : [0-9a-fA-F]
   string_gap   : "\" space* newline space+

What do you think? Is there an important use case for the multi-newline version that Rust warns about that I'm missing?

Mario Carneiro (Nov 05 2023 at 18:54):

If it's going to be like that I would prefer just string_gap : "\" newline space* (possibly including other non-newline whitespace like tabs), I think trailing spaces after the \ should be an error. Or should it allow line comments? :thinking:

Mario Carneiro (Nov 05 2023 at 18:55):

That behavior of rust-style gap consuming multiple lines is news to me, I would have guessed that it only consumes one line and agree that makes more sense

Mario Carneiro (Nov 05 2023 at 18:58):

On the other hand, I can get behind the rust behavior of just warning on multiline gaps independently of interpretation - it seems like the kind of thing that will inevitably be unclear / misinterpreted

Timo Carlin-Burns (Nov 05 2023 at 19:06):

It's worth noting that

"One string\

      with a gap"

with the proposed one linebreak rule would denote the same string as

"One string
      with a gap"

which seems like a much clearer way to write it.

Kyle Miller (Nov 05 2023 at 19:31):

Regarding an error/warning, a question is whether the error should be at the lexical level or at the elaborator level. I would lean toward the elaborator level since you can surface richer warnings/errors that way. The elaborator is what is responsible for decoding escapes.

Making the lexical grammar of string_gap be "\" space* newline space+ would mean that Timo's example would be the interpretation of the first string, unless there's an additional string_gap_error that matches "\" whitespace+ minus string_gap. @Mario Carneiro's suggestion of "\" newline space* seems good, with a lexical error if there's not newline immediately after the "\" (just to check -- when Lean loads a file, it translates all line endings to \n right?)

Why should the rule for a gap be to skip all whitespace, rather than skip spaces, then a single newline, then spaces?

I could easily flip that question. A technical reason for the reverse is that the proposal's implementation is very simple, and limiting the allowable whitespace that comes after \ means adding restrictions (i.e., adding flexibility later would mean removing code.)

One good justification for adding restrictions is that it might be surprising to see that

"this is \ -- that's the first half
  a string" -- and that's the second

would be equivalent to

"this is -- that's the first half
 a string"

with the first "comment" included within the string.

Kyle Miller (Nov 05 2023 at 22:06):

I've pushed a change (and adjusted the RFC) to make the lexical syntax be "\" newline whitespace* while making the parser throw an error if whitespace* contains a newline.

Mario Carneiro (Nov 06 2023 at 02:25):

At the lexical level, I think it would be simplest to just say that \<newline> is an escape sequence in the same way that \n is an escape sequence, and there is nothing else one has to do to find the end of the string, which is all the lexer is responsible for. The elaborator later parses this string and that's when the "\" newline space* rule becomes relevant.

Regarding \n vs \r\n newlines in the source text: I think you do have to handle both cases. Actually this is a good question because the non escaping behavior of just including the newline literally would mean that the same source text would end up denoting a different string if line endings are converted to CRLF. Quick question for any windows users: are .lean files using CRLF or LF most of the time? (It says one of the two in the bottom right when the lean file is open.)

Mario Carneiro (Nov 06 2023 at 02:28):

actually this is easy to test:

#eval "
".length

prints 1 when the line endings are LF and 2 when they are CRLF

Kyle Miller (Nov 06 2023 at 03:51):

@Mario Carneiro I've tested the PR on windows already -- line endings appear to be normalized when read. (I'm assuming the CI changes line endings for windows... I'll have to check this)

Re lexer vs elaborator: I found this was a lot quicker to implement on the lexer side. Touching the elaborator means changing the metaprogramming api for string Syntax literals, which we could look at revisiting later.

Kyle Miller (Nov 06 2023 at 04:55):

Mario Carneiro said:

actually this is easy to test:

#eval "
".length

prints 1 when the line endings are LF and 2 when they are CRLF

Oh, I think I misread your message. Did you have access to a Windows machine and did you observe 2 for CRLF line endings?

Mario Carneiro (Nov 06 2023 at 05:05):

No, I used vscode to change the line endings to CRLF and observed that the output changed

Mario Carneiro (Nov 06 2023 at 05:07):

What is not clear to me is whether windows users normally see CRLF or LF lean files when interacting with them in vscode

Kyle Miller (Nov 06 2023 at 05:31):

Hm, I see that Lean's file open sets O_BINARY on the windows platform, which means it turns off line ending normalization (without that flag set, your test should show 1 on Windows no matter if it's LF or CRLF -- the flag has no effect on posix so it's not surprising you'd see 2). I'm also curious if anyone is using Lean with CRLFs (and what the result of your #eval test is for them; I'd expect 2...)

James Gallicchio (Nov 07 2023 at 00:18):

Kyle Miller said:

One alteration is to have the closing of a gap be a different character from \. Like perhaps "\ whitespace+ >"

this reminds me of scala's multiline syntax, where the default margin character is a pipe:

val whatever = """this is a
                 |multiline string
                 |wahoo""".stripMargin

which I found pretty nice and readable. hard to have any unexpected behavior when the syntax is so explicit.

Kyle Miller (Nov 07 2023 at 00:24):

That is equivalent to "this is a\nmultiline string\nwahoo" right? For gaps, what we're looking for is a way to continue a string onto the next line without inserting any \n's.

James Gallicchio (Nov 07 2023 at 00:37):

yeah, it suggests syntax along the lines of

d!"""blahblah\
    |blahblah\
    |blah
    |now this is a newline"""

to cover the multiline dedent and multiline without \n needs.

James Gallicchio (Nov 07 2023 at 00:40):

all of these are :bike::house: but figured it's another datapoint to throw on the pile

Kyle Miller (Nov 07 2023 at 07:38):

@James Gallicchio I added a variation on that in the tests for lean#2821. It has

#eval d!"this is \
            line 1
        |  line 2, indented
        |line 3"

give "this is line 1\n line 2, indented\nline 3". (This d! syntax is just defined within the tests.)


Last updated: Dec 20 2023 at 11:08 UTC