Zulip Chat Archive

Stream: lean4

Topic: current definition of `String.splitOn` is incorrect

Bulhwi Cha (Apr 03 2024 at 06:52):

#eval "012".splitOn "12" -- output: ["0", ""]
#eval "007".splitOn "07" -- ["007"]

The value of the second expression above should be ["0", ""]. For more context, see https://github.com/leanprover/lean4/issues/3829.

It took more than 160 hours for me to find this counterexample. :melting_face:

Issue

Context

I was trying to prove String.splitOn_of_valid in Std.Data.List.Lemmas, but I realized that the current definition of String.splitOn is flawed. See https://github.com/leanprover/std4/pull/719.

I think Mario Carneiro's definition of List.splitOnList in Std.Data.List.SplitOnList is correct, as can be seen below:

import Std.Data.List.SplitOnList

#eval ("012".1.splitOnList "12".1).map String.mk -- output: ["0", ""]
#eval ("007".1.splitOnList "07".1).map String.mk -- ["0", ""]

Version

4.7.0-rc2

Bulhwi Cha (Apr 03 2024 at 07:06):

@Mario Carneiro, do you think we should replace the current definition of String.splitOn with the one below?

import Std.Data.List.SplitOnList

def String.splitOn (l sep : String) : List String :=
  (List.splitOnList l.1 sep.1).map String.mk

Mario Carneiro (Apr 03 2024 at 07:08):

Maybe for modeling purposes, but that's a wasteful implementation

Mario Carneiro (Apr 03 2024 at 07:11):

actually we can't do that either, because String has so many non-WF cases that the equality theorem is false in full generality (maybe not for String because the inputs are simple, but definitely for Substring)

Mario Carneiro (Apr 03 2024 at 07:11):

so it would need to be a lemma

Bulhwi Cha (Apr 03 2024 at 07:13):

Then I guess we should come up with a new definition.

Bulhwi Cha (Apr 03 2024 at 07:16):

I'm also worried there may be other flawed definitions in Lean core.

Mario Carneiro (Apr 03 2024 at 07:17):

no need to worry about it, they will come out in the wash when we prove characterizing theorems

Bulhwi Cha (Apr 03 2024 at 07:20):

Right, but it seems I'm the only one willing to prove the eleven theorems on the to-do list in Std.Data.String.Lemmas. It'll be a painful journey. :joy:

Bulhwi Cha (Apr 03 2024 at 07:22):

Still, thanks for guiding me all along!

Mario Carneiro (Apr 03 2024 at 07:25):

I've been on another lean project for a while, but I'm planning to return to Std development soon and fill out some of the lemmas. I don't have an immediate need for those extra String lemmas but I can try to write the fixed version of the String and Substring functions for splitOn

Mauricio Collares (Apr 03 2024 at 12:17):

This bug was discussed before, I think (https://leanprover.zulipchat.com/#narrow/stream/270676-lean4/topic/should.20we.20redefine.20.60String.2EsplitOnAux.60.3F). I'd argue this is not a bug, since it's likely that all actual uses of splitOn use single characters as the second argument, and fixing it to work for all arguments would incur a performance penalty. I think adding extra hypotheses on the input makes more sense in this case.

Mario Carneiro (Apr 03 2024 at 12:18):

I think this is definitely a bug, in that the function does not do what it says on an input it apparently allows. Possibly it could get an additional precondition if we don't expect it to work on this input, but if the split string was only supposed to be a single character we already have a type for that

Mario Carneiro (Apr 03 2024 at 12:19):

the fact that fixing the bug could have a performance overhead doesn't make it any less of a bug

Mario Carneiro (Apr 03 2024 at 12:19):

but we also won't know that until we have an implementation for benchmarking

Mario Carneiro (Apr 03 2024 at 12:22):

I don't think these functions have gone through enough use and revisions to necessarily make it always reasonable to assume that any unusual behavior is there for a good reason

Mauricio Collares (Apr 03 2024 at 12:24):

I guess implementing naive string search would be just as fast for single character delimiters

Mauricio Collares (Apr 03 2024 at 12:25):

But I can't imagine Boyer-Moore/Boyer-Moore-Horspool/KMP not being overkill if the main use case is single-character delimiters

Mario Carneiro (Apr 03 2024 at 12:26):

I'm not sure where you got the idea that we'd be using that algorithm

Mario Carneiro (Apr 03 2024 at 12:27):

a plain loop seems appropriate here, since we're optimizing for the small pattern case

Mario Carneiro (Apr 03 2024 at 12:27):

Plus, this has to be implemented in core and I'd rather not move the KMP code upstream

Mauricio Collares (Apr 03 2024 at 12:28):

I think I was primed by the fact that the upstream code looks like incomplete KMP

Mario Carneiro (Apr 03 2024 at 12:28):

can you elaborate on what's incomplete about it?

Bulhwi Cha (Apr 03 2024 at 12:28):

Mauricio Collares said:

This bug was discussed before, I think (https://leanprover.zulipchat.com/#narrow/stream/270676-lean4/topic/should.20we.20redefine.20.60String.2EsplitOnAux.60.3F).

That was a different ~~bug~~ from this one.

Mauricio Collares (Apr 03 2024 at 12:29):

How is this bug different from "ababcab".splitOn "abc" failing? I don't get it

Mario Carneiro (Apr 03 2024 at 12:31):

that one was fixed, no? There is at least a merged PR that came out of that discussion

Mauricio Collares (Apr 03 2024 at 12:31):

There were two bugs discussed in the other thread, I'm talking about the second one.

Bulhwi Cha (Apr 03 2024 at 12:32):

Mauricio Collares said:

That being said, the original function looks a bit buggy. Does "ababcab".splitOn "abc" work there? It does look like it is missing some partial match handling (KMP-style or otherwise).

Oh, you meant this one.

Bulhwi Cha (Apr 03 2024 at 12:36):

Bulhwi Cha said:

That was a different ~~bug~~ from this one.

What I fixed last year was one of the many weird behaviors of String.splitOnAux. If I remember correctly, this had nothing to do with String.splitOn. So you can say this wasn't a bug.

Mauricio Collares (Apr 03 2024 at 12:36):

Mario Carneiro said:

can you elaborate on what's incomplete about it?

KMP would increase b (and decrement j appropriately) instead of increasing i at this line, according to a precomputed partial matching function: https://github.com/leanprover/lean4/blob/master/src/Init/Data/String/Basic.lean#L262

Mario Carneiro (Apr 03 2024 at 12:38):

Oh I thought you meant that comment in regards to Std's actual KMP implementation

Mauricio Collares (Apr 03 2024 at 12:38):

Oh, sorry, I didn't know Std had a KMP implementation! I now understand what you mean

Bulhwi Cha (Apr 03 2024 at 12:39):

Mauricio Collares said:

I'd argue this is not a bug, since it's likely that all actual uses of splitOn use single characters as the second argument, and fixing it to work for all arguments would incur a performance penalty. I think adding extra hypotheses on the input makes more sense in this case.

Give me a week to take a good rest. I have a hunch that we only need to slightly modify the current definition of String.splitOnAux. By the way, I won't use KMP to redefine the function because I don't know what it is.

Mauricio Collares (Apr 03 2024 at 12:42):

Yes, you can add an extra parameter to splitOnAux that records where the current match starts, and then restart at the next position in case of failure (b is not where the current match starts, but rather where the current "block" starts)

Bulhwi Cha (Apr 03 2024 at 12:42):

I don't think I'll need more parameters. But we'll see.

Mario Carneiro (Apr 03 2024 at 12:43):

The type signature of splitOnAux is subject to change, it's not meant for direct use and should just do what it needs to do to get a correct result for splitOn

Mario Carneiro (Apr 03 2024 at 12:44):

which is why I find it a bit odd to be studying them separately like this

Bulhwi Cha said:

What I fixed last year was one of the many weird behaviors of String.splitOnAux. If I remember correctly, this had nothing to do with String.splitOn. So you can say this wasn't a bug.

Mauricio Collares (Apr 03 2024 at 12:44):

Bulhwi Cha said:

I don't think I'll need more parameters. But we'll see.

Fair. I guess you can do s.prev i and sep.prev j a bunch of times until you hit the beginning of the separator string, then advance i once (assuming there is a prev function). But that's more overhead. It would be easier to implement if you could do index operations in O(1) time, but that would introduce another class of bugs for strings which contain multi-byte characters.

Mario Carneiro (Apr 03 2024 at 12:45):

you can do indexing in O(1)?

Mario Carneiro (Apr 03 2024 at 12:45):

you should just be counting byte offsets not characters

Mauricio Collares (Apr 03 2024 at 12:46):

Oh, is docs#Pos a byte offset?

Mauricio Collares (Apr 03 2024 at 12:47):

Got it, so it should be enough to go from b i j to something like b (s.next (i - j)) 0 in case of failure

Mario Carneiro (Apr 03 2024 at 12:47):

no you don't want to use nextn, that's counting characters

Bulhwi Cha (Apr 03 2024 at 12:48):

Mauricio Collares said:

How is this bug different from "ababcab".splitOn "abc" failing? I don't get it

(Now I'm a bit sad because you already knew about the bug I found a few hours ago. :cry:)

Mauricio Collares (Apr 03 2024 at 12:48):

Mario Carneiro said:

no you don't want to use nextn, that's counting characters

That's what we want to do in this case

Mario Carneiro (Apr 03 2024 at 12:48):

unclear why, but I haven't dug into this algorithm in depth

Mario Carneiro (Apr 03 2024 at 12:49):

there shouldn't be a single character count involved here

Mauricio Collares (Apr 03 2024 at 12:49):

splitOnAux s b i j means "try to find the j-th character of the pattern starting at position i of the text, where the block starts at b"

Mario Carneiro (Apr 03 2024 at 12:49):

where both i and j are byte offsets

Mauricio Collares (Apr 03 2024 at 12:49):

Yes, but if you discovered that something doesn't match, you need to start again at the next character, not next byte

Mario Carneiro (Apr 03 2024 at 12:50):

so the difference between them is also a byte offset

Mario Carneiro (Apr 03 2024 at 12:51):

you use next and prev to get the next byte offset, but there is no local variable saying this thing is N characters past that thing

Mario Carneiro (Apr 03 2024 at 12:51):

every variable is in terms of byte measurements between parts of the string

Mauricio Collares (Apr 03 2024 at 12:54):

i means "the currently matching text character is i bytes past the start". So, for example, suppose you matched X characters (Y bytes) so far successfully, and then you got a failure. You want to look go back to the start of the string (that is, go back X characters or Y bytes starting from byte offset i) to get where the beginning of the match was, then advance that by one character and try to match the first character of the pattern again.

Mauricio Collares (Apr 03 2024 at 12:54):

This looks a lot more complicated than it is because the start of the current match is not currently a parameter (b is something else)

Mario Carneiro (Apr 03 2024 at 12:54):

right, so that's i - Y

Mario Carneiro (Apr 03 2024 at 12:55):

no nextn because we aren't storing X and it wouldn't be useful if we did

Mauricio Collares (Apr 03 2024 at 12:56):

But that's the beginning of the current match, which we know failed. We want to start again at the next possible beginning position.

Mario Carneiro (Apr 03 2024 at 12:56):

(i -Y).next

Mauricio Collares (Apr 03 2024 at 12:57):

Oh, are you saying Y and j are different things because j is not an offset wrt the beginning of the pattern?

Mauricio Collares (Apr 03 2024 at 12:57):

j in the code is a Pos referring to the separator/pattern

Mauricio Collares (Apr 03 2024 at 12:58):

(We're trying to match the character at offset j from sep to the character at offset i from s)

Mario Carneiro (Apr 03 2024 at 12:58):

it measures two things, the byte offset in the pattern and also in the search string (relative to i), we know that they agree so far hence it's a valid offset in both

Mauricio Collares (Apr 03 2024 at 12:59):

Yes, I agree, but you're claiming s.next (i-j) was wrong and I don't understand why (thanks for being patient by the way!)

Mario Carneiro (Apr 03 2024 at 12:59):

I was claiming s.nextn (i - j) was wrong for "typing reasons"

Mauricio Collares (Apr 03 2024 at 13:01):

Oh, but I didn't mean nextn, I really did mean next (I didn't realize nextn existed, I thought it was just a typo)

Mario Carneiro (Apr 03 2024 at 13:01):

nextn is how you move N characters forward

Mario Carneiro (Apr 03 2024 at 13:01):

so if you had X around you would need it

Mauricio Collares (Apr 03 2024 at 13:02):

Yes, but I think s.next x finds the character immediately after the one at offset x (at least that's how the code uses it iiuc)

Mauricio Collares (Apr 03 2024 at 13:02):

the offset of that character, that is

Mauricio Collares (Apr 03 2024 at 13:12):

I don't understand how nextn entered the conversation in the first place. But regardless: Thanks for the explanation, it was very useful! I didn't know Pos was a byte offset.

Mario Carneiro (Apr 03 2024 at 13:15):

Mauricio Collares said:

Got it, so it should be enough to go from b i j to something like b (s.next (i - j)) 0 in case of failure

I misunderstood this as being about nextn because I think of next as being a unary function

Mauricio Collares (Apr 03 2024 at 13:21):

Ah, got it! I was just cargo-culting splitOnAux's coding convention. I now know about nextn, which is nice :)

Mauricio Collares (Apr 03 2024 at 13:26):

Bulhwi Cha said:

Mauricio Collares said:

How is this bug different from "ababcab".splitOn "abc" failing? I don't get it

(Now I'm a bit sad because you already knew about the bug I found a few hours ago. :cry:)

That's my fault, I didn't emphasize it because I thought fixing it would be a performance regression in some cases (as Mario pointed out, I saw the need for KMP where there was none). But replacing splitOnAux s sep b (s.next i) 0 r by splitOnAux s sep b (s.next (i - j)) 0 r is clearly "free" for single-character separators since j is always 0, so I'm really glad you investigated this issue.

Bulhwi Cha (Apr 03 2024 at 15:13):

Mario Carneiro said:

which is why I find it a bit odd to be studying them separately like this

Bulhwi Cha said:

What I fixed last year was one of the many weird behaviors of String.splitOnAux. If I remember correctly, this had nothing to do with String.splitOn. So you can say this wasn't a bug.

The weirder the auxiliary function, the harder it is to prove theorems about the main function.

Bulhwi Cha (Apr 03 2024 at 15:28):

List.splitOnList

import Std.Data.String.Lemmas
import Std.Tactic.OpenPrivate

namespace List

theorem head_cons_tail : ∀ l h, @head α l h :: l.tail = l
  | _::_, _ => rfl

theorem IsPrefix.iff_isPrefixOf [BEq α] [LawfulBEq α] {l₁ l₂ : List α} :
    l₁ <+: l₂ ↔ isPrefixOf l₁ l₂ := by
  match l₁, l₂ with
  | [],   _  => simp [nil_prefix]
  | _::_, [] => simp
  | a::as, b::bs =>
    constructor
    · intro ⟨t, ht⟩
      simp only [cons_append, cons.injEq] at ht
      have hpf : as <+: bs := ⟨t, ht.2⟩
      simpa [isPrefixOf, ht.1] using IsPrefix.iff_isPrefixOf.mp hpf
    · intro h
      simp only [isPrefixOf, Bool.and_eq_true, beq_iff_eq] at h
      let ⟨t, ht⟩ := IsPrefix.iff_isPrefixOf.mpr h.2
      exists t
      simpa [h.1]

end List

namespace String
open private add_csize_pos from Std.Data.String.Lemmas

theorem utf8Len_pos : 0 < utf8Len l ↔ l ≠ [] := by
  cases l <;> simp [add_csize_pos]

end String

open String

def List.splitOnList' (l sep : List Char) : List (List Char) :=
  if h : l = [] ∨ sep = [] then
    [l]
  else
    let ⟨hls, hsp⟩ := not_or.mp h
    if hpf : sep.isPrefixOf l then
      have : utf8Len (l.drop sep.length) < utf8Len l := by
        let ⟨t, ht⟩ := IsPrefix.iff_isPrefixOf.mpr hpf
        simp only [← ht, drop_left, utf8Len_append, gt_iff_lt]
        exact Nat.lt_add_of_pos_left (utf8Len_pos.mpr hsp)
      [] :: splitOnList' (l.drop sep.length) sep
    else
      have : utf8Len l.tail < utf8Len l := by
        conv => rhs; rw [← head_cons_tail l hls, utf8Len_cons]
        exact Nat.lt_add_of_pos_right <| csize_pos (l.head hls)
      modifyHead (l.head hls :: ·) (splitOnList' l.tail sep)
termination_by utf8Len l

def String.splitOn' (s sep : String) := (List.splitOnList' s.1 sep.1).map mk

#eval "012".splitOn' "12" -- output: ["0", ""]
#eval "007".splitOn' "07" -- ["0", ""]

#eval Lean.versionString -- "4.7.0-rc2"

I think I found an easier way to define List.splitOnList. See the above code.

Mario Carneiro said:

Maybe for modeling purposes, but that's a wasteful implementation

actually we can't do that either, because String has so many non-WF cases that the equality theorem is false in full generality (maybe not for String because the inputs are simple, but definitely for Substring)

Since we can't define String.splitOn as (List.splitOnList s.1 sep.1).map mk, I'll figure out a new definition of it.

Notification Bot (Apr 03 2024 at 15:29):

Bulhwi Cha has marked this topic as resolved.

Notification Bot (Apr 03 2024 at 15:30):

Bulhwi Cha has marked this topic as unresolved.

Bulhwi Cha (Apr 03 2024 at 15:57):

If we have a proof that String.IsPrefix sep s is equivalent to String.isPrefixOf sep s, we can use it to define String.splitOn in a way similar to my definition of List.splitOnList above. We probably don't need to deal with byte offsets because String.isPrefixOf will do it for us.

I really should rest now. :innocent:

Mario Carneiro (Apr 04 2024 at 05:54):

lean4#3832

Bulhwi Cha (Apr 04 2024 at 06:16):

Mauricio Collares said:

But replacing splitOnAux s sep b (s.next i) 0 r by splitOnAux s sep b (s.next (i - j)) 0 r is clearly "free" for single-character separators since j is always 0, so I'm really glad you investigated this issue.

Now I understand why we need this change. :sweat_smile:

Bulhwi Cha (Apr 04 2024 at 06:23):

Allow me to prove String.splitOnAux_of_valid and String.splitOn_of_valid. I can do this.

Number Eighteen (Apr 04 2024 at 13:20):

Another issue with String.splitOn is that it causes crashes (in compiled code as well) when it is fed a very large String (GB+), because of the limitations of the List type. For myself, I had to write a String.splitArrayOn to do my work.

Similar situations may occur in other places where things get turned to List; since Core functions are supposed to be the "common standard", is there any plan to either fix List crashes or make Array the default container for these actions?

Mario Carneiro (Apr 04 2024 at 23:49):

As long as you handle the result well, the consequence of using List should not be that bad. List should still work with a billion elements in the list, but it has more overhead than array in that case. Are you sure the issue was actually with splitOn and not with subsequent calculation (like printing it to the screen)?

Number Eighteen (Apr 05 2024 at 12:01):

Mario Carneiro said:

As long as you handle the result well, the consequence of using List should not be that bad. List should still work with a billion elements in the list, but it has more overhead than array in that case. Are you sure the issue was actually with splitOn and not with subsequent calculation (like printing it to the screen)?

No, splitOn as a function was fine, it was subsequent handling of the List structure that caused the crash and made me write splitArray; but that is my point: since List is causing crashes where Array does not, either List needs to be reworked or Array needs to become the standard container for Core collections. Core, or Std or however you call it, should to be as correct as possible since it is the reference library and the "common dialect" of Lean code (my opinion, at least this is how it works in all other languages I know).

Mario Carneiro (Apr 05 2024 at 12:02):

you can convert the List to an Array using toArray and handle it as such if that helps

Mario Carneiro (Apr 05 2024 at 12:03):

the issues that can occur with large lists can also occur for large arrays, depending on how the function is written

Mario Carneiro (Apr 05 2024 at 12:05):

Choosing data structures for standard types is not an easy task, because workloads can vary significantly and functions have multiple constraints on them: being usable in proofs, having good kernel performance, good interpreter performance and good compiled performance

Mario Carneiro (Apr 05 2024 at 12:05):

not everyone is making gigabyte large lists with splitOn

Mario Carneiro (Apr 05 2024 at 12:06):

it may be the case that you have an unusual workload and the best option is to write a bespoke version of the function

Mario Carneiro (Apr 05 2024 at 12:07):

Having an array version of splitOn seems like a reasonable ask, although I would want to first review the existing uses of splitOn to see whether lists or arrays are better suited to the existing use cases in lean core

Mario Carneiro (Apr 05 2024 at 12:11):

since List is causing crashes where Array does not

just to be clear, it's not "List causing crashes", it's the use of a non-tail-recursive function on lists. I don't know which function you ran into but Lean/Std try to ensure that they don't handle list arguments in a way that will explode on long lists. One place which I know is non-tail-recursive is the list formatter, but this is unavoidable because it ends up in another, tree-like inductive type called Format which can explode when traversed. We would need to rewrite the formatter significantly or replace it entirely, and I am sure there is no appetite for that in lean core

Mario Carneiro (Apr 05 2024 at 12:12):

especially when the obvious answer is "stop trying to print a gigabyte long list"

Mario Carneiro (Apr 05 2024 at 12:17):

Core, or Std or however you call it

These are different projects. Lean 4 core is terminology referring to the lean4 repo https://github.com/leanprover/lean4/, and Std is https://github.com/leanprover/std4/ . The first repo is responsible for the language itself and everything you get with no imports at all, and the second repo provides things that you can import with import Std.Data.Array.Basic or the like. They both share some responsibility for defining core functions on Lists and Arrays. I am a Std maintainer, but this particular function String.splitOn happens to live in core and I have no authority to be making decisions of this kind on their behalf.

Timo Carlin-Burns (Apr 05 2024 at 14:35):

Number Eighteen said:

either List needs to be reworked or Array needs to become the standard container for Core collections.

I made an RFC recently with a very related proposal: https://leanprover.zulipchat.com/#narrow/stream/113488-general/topic/.5BRFC.5D.20single.20list.20type/near/430304738

Last updated: May 02 2025 at 03:31 UTC