Zulip Chat Archive

Stream: lean4

Topic: Basic string slicing


Eric Wieser (Jan 31 2026 at 01:25):

What's the new way of spelling the following?

#eval "hello world".extract 6 11

this fails, since I'm no longer allowed to construct the index in this way.

Eric Wieser (Jan 31 2026 at 01:28):

Is there a better answer than

#eval (fun s => s.extract (s.startPos.nextn 6) (s.startPos.nextn 11)) "hello world"

which is very verbose?

Eric Wieser (Jan 31 2026 at 01:30):

Is

instance {s : String} {n} : OfNat s.Pos n where
  ofNat := s.startPos.nextn n

reasonable?

Robin Arnez (Jan 31 2026 at 01:32):

If you limit yourself to ascii, you can use

#eval String.Pos.Raw.extract "hello world" 6 11

But what do you specifically need this for?

Eric Wieser (Jan 31 2026 at 01:36):

I have a string of the form foo_YYYY_bar_Z and want to extract the components

Eric Wieser (Jan 31 2026 at 01:37):

Since foo and bar are known at compile-time and the fields are fixed-width, I just want to write something short

Eric Wieser (Jan 31 2026 at 01:37):

String.Pos.Raw.extract seems like a weird place for that function to live, considering dot notation

Thomas Murrills (Jan 31 2026 at 02:28):

Eric Wieser said:

String.Pos.Raw.extract seems like a weird place for that function to live, considering dot notation

I think the reason for this is simply that you're not really supposed to use it, and are supposed to work with String.Slices now instead?

Though it does seem like there's no easy way to construct a Slice or even an s.Pos by character count without that verbose startPos in there...

Eric Wieser (Jan 31 2026 at 02:31):

Is the OfNat instance a bad idea (and therefore missing from core) because it encourages doing the linear utf8 search twice?

Thomas Murrills (Jan 31 2026 at 03:04):

I think so! My guess is that we would ideally have something like

def String.slice' (s : String) (start end : Nat) (h : start  end) (startingFrom : s.Pos := s.startPos) : String.Slice := ...

(but with a better name).

Even more ideally, we would have dependent slices (I asked about this on zulip a little while ago and I believe the answer was that they're in the works!) so that this could return a String.SliceOf s (or equivalent), and we could reuse the resulting endPos as a starting pos for the next String.slice' call :)

Markus Himmel (Jan 31 2026 at 05:59):

We don't want to make this too short, because it needs to be clear from the code that getting the byte index of the n-th character is a linear-time operation.

If you know that you're dealing with Ascii, then #eval (fun s => s.extract (s.pos! ⟨6⟩) (s.pos! ⟨11⟩)) "hello world" is also a possibility, or, if you don't actually want to copy out the result, #eval (fun s => s.slice! (s.pos! ⟨6⟩) (s.pos! ⟨11⟩)) "hello world!".

Markus Himmel (Jan 31 2026 at 06:01):

(And if you're not dealing with Ascii, I don't think there is much you can do, because what you perceive as YYYY in foo_YYYY_bar_Z could be any number of codepoints.)

Eric Wieser (Jan 31 2026 at 06:23):

I guess what I'd like here is to be able to write s.slice! (.pos! ⟨6⟩) (.pos! ⟨11⟩)) to avoid repeating s

Markus Himmel (Jan 31 2026 at 07:01):

That's reasonable. I'll see what I can do.

Eric Wieser (Jan 31 2026 at 07:05):

I suppose open String (pos!) then s.slice! (pos! _ x) (pos! _ y), letting unification repeat the string, is not the worst

Eric Wieser (Jan 31 2026 at 07:07):

Maybe having a less verbose way to write the following would also be good

#eval "hello world!".sliceFrom (String.pos! _ 6) |>.sliceTo <| (String.Slice.pos! _ 5)

which I guess is an argument for moving the .pos! namespace so I can write the same thing for both positions

Thomas Murrills (Jan 31 2026 at 07:13):

Markus Himmel said:

We don't want to make this too short, because it needs to be clear from the code that getting the byte index of the n-th character is a linear-time operation.

True, but one thing I'd like to mention here is that Eric's first "lawful" approach was equivalent to String.slice (s.startPos.nextn a) (s.startPos.nextn b), which counts from the start twice. If it existed, a hypothetical definition of the form of slice' could be more efficient by remembering aPos := s.startPos.nextn a and finding aPos.nextn (b-a).

Anecdotally, I also saw something of the form s.positions.count - (s.dropWhile patt).positions.count recently, to count the number of characters dropped by dropWhile.

So the general point I'm trying to make is that while I certainly agree that it should be hard to construct a footgun, I think it is a bit of a balance, as too little API will cause people to reach for the simplest inefficient things when they need to anyway—and these tend to become more inefficient the more you combine them! :sweat_smile:

Michael Rothgang (Jan 31 2026 at 10:47):

My unverified guess is that this was some version of mathlib's whitespace linter (i.e., #30658). :-)

Thomas Murrills (Jan 31 2026 at 18:01):

Correct! :grinning_face_with_smiling_eyes: (Namely, the present version on master version on master last night!) Also, I tried briefly to find a better way, and couldn't come up with one that wasn't more awkward or required new API, so I'm "people" here too. :)

Thomas Murrills (Jan 31 2026 at 18:07):

(Context for other readers of this thread: the PR that upgraded to slices from (sub)strings was reverted because it hurt performance! I'm not sure if it was this line specifically, but positions.count seemed related based on the thread.)


Last updated: Feb 28 2026 at 14:05 UTC