Zulip Chat Archive
Stream: lean4
Topic: current definition of `String.splitOn` is incorrect
Bulhwi Cha (Apr 03 2024 at 06:52):
#eval "012".splitOn "12" -- output: ["0", ""]
#eval "007".splitOn "07" -- ["007"]
The value of the second expression above should be ["0", ""]
. For more context, see https://github.com/leanprover/lean4/issues/3829.
It took more than 160 hours for me to find this counterexample. :melting_face:
Issue
Bulhwi Cha (Apr 03 2024 at 07:06):
@Mario Carneiro, do you think we should replace the current definition of String.splitOn
with the one below?
import Std.Data.List.SplitOnList
def String.splitOn (l sep : String) : List String :=
(List.splitOnList l.1 sep.1).map String.mk
Mario Carneiro (Apr 03 2024 at 07:08):
Maybe for modeling purposes, but that's a wasteful implementation
Mario Carneiro (Apr 03 2024 at 07:11):
actually we can't do that either, because String has so many non-WF cases that the equality theorem is false in full generality (maybe not for String because the inputs are simple, but definitely for Substring)
Mario Carneiro (Apr 03 2024 at 07:11):
so it would need to be a lemma
Bulhwi Cha (Apr 03 2024 at 07:13):
Then I guess we should come up with a new definition.
Bulhwi Cha (Apr 03 2024 at 07:16):
I'm also worried there may be other flawed definitions in Lean core.
Mario Carneiro (Apr 03 2024 at 07:17):
no need to worry about it, they will come out in the wash when we prove characterizing theorems
Bulhwi Cha (Apr 03 2024 at 07:20):
Right, but it seems I'm the only one willing to prove the eleven theorems on the to-do list in Std.Data.String.Lemmas
. It'll be a painful journey. :joy:
Bulhwi Cha (Apr 03 2024 at 07:22):
Still, thanks for guiding me all along!
Mario Carneiro (Apr 03 2024 at 07:25):
I've been on another lean project for a while, but I'm planning to return to Std development soon and fill out some of the lemmas. I don't have an immediate need for those extra String lemmas but I can try to write the fixed version of the String and Substring functions for splitOn
Mauricio Collares (Apr 03 2024 at 12:17):
This bug was discussed before, I think (https://leanprover.zulipchat.com/#narrow/stream/270676-lean4/topic/should.20we.20redefine.20.60String.2EsplitOnAux.60.3F). I'd argue this is not a bug, since it's likely that all actual uses of splitOn use single characters as the second argument, and fixing it to work for all arguments would incur a performance penalty. I think adding extra hypotheses on the input makes more sense in this case.
Mario Carneiro (Apr 03 2024 at 12:18):
I think this is definitely a bug, in that the function does not do what it says on an input it apparently allows. Possibly it could get an additional precondition if we don't expect it to work on this input, but if the split string was only supposed to be a single character we already have a type for that
Mario Carneiro (Apr 03 2024 at 12:19):
the fact that fixing the bug could have a performance overhead doesn't make it any less of a bug
Mario Carneiro (Apr 03 2024 at 12:19):
but we also won't know that until we have an implementation for benchmarking
Mario Carneiro (Apr 03 2024 at 12:22):
I don't think these functions have gone through enough use and revisions to necessarily make it always reasonable to assume that any unusual behavior is there for a good reason
Mauricio Collares (Apr 03 2024 at 12:24):
I guess implementing naive string search would be just as fast for single character delimiters
Mauricio Collares (Apr 03 2024 at 12:25):
But I can't imagine Boyer-Moore/Boyer-Moore-Horspool/KMP not being overkill if the main use case is single-character delimiters
Mario Carneiro (Apr 03 2024 at 12:26):
I'm not sure where you got the idea that we'd be using that algorithm
Mario Carneiro (Apr 03 2024 at 12:27):
a plain loop seems appropriate here, since we're optimizing for the small pattern case
Mario Carneiro (Apr 03 2024 at 12:27):
Plus, this has to be implemented in core and I'd rather not move the KMP code upstream
Mauricio Collares (Apr 03 2024 at 12:28):
I think I was primed by the fact that the upstream code looks like incomplete KMP
Mario Carneiro (Apr 03 2024 at 12:28):
can you elaborate on what's incomplete about it?
Bulhwi Cha (Apr 03 2024 at 12:28):
Mauricio Collares said:
This bug was discussed before, I think (https://leanprover.zulipchat.com/#narrow/stream/270676-lean4/topic/should.20we.20redefine.20.60String.2EsplitOnAux.60.3F).
That was a different bug from this one.
Mauricio Collares (Apr 03 2024 at 12:29):
How is this bug different from "ababcab".splitOn "abc"
failing? I don't get it
Mario Carneiro (Apr 03 2024 at 12:31):
that one was fixed, no? There is at least a merged PR that came out of that discussion
Mauricio Collares (Apr 03 2024 at 12:31):
There were two bugs discussed in the other thread, I'm talking about the second one.
Bulhwi Cha (Apr 03 2024 at 12:32):
Mauricio Collares said:
That being said, the original function looks a bit buggy. Does
"ababcab".splitOn "abc"
work there? It does look like it is missing some partial match handling (KMP-style or otherwise).
Oh, you meant this one.
Bulhwi Cha (Apr 03 2024 at 12:36):
Bulhwi Cha said:
That was a different
bugfrom this one.
What I fixed last year was one of the many weird behaviors of String.splitOnAux
. If I remember correctly, this had nothing to do with String.splitOn
. So you can say this wasn't a bug.
Mauricio Collares (Apr 03 2024 at 12:36):
Mario Carneiro said:
can you elaborate on what's incomplete about it?
KMP would increase b
(and decrement j
appropriately) instead of increasing i
at this line, according to a precomputed partial matching function: https://github.com/leanprover/lean4/blob/master/src/Init/Data/String/Basic.lean#L262
Mario Carneiro (Apr 03 2024 at 12:38):
Oh I thought you meant that comment in regards to Std's actual KMP implementation
Mauricio Collares (Apr 03 2024 at 12:38):
Oh, sorry, I didn't know Std had a KMP implementation! I now understand what you mean
Bulhwi Cha (Apr 03 2024 at 12:39):
Mauricio Collares said:
I'd argue this is not a bug, since it's likely that all actual uses of splitOn use single characters as the second argument, and fixing it to work for all arguments would incur a performance penalty. I think adding extra hypotheses on the input makes more sense in this case.
Give me a week to take a good rest. I have a hunch that we only need to slightly modify the current definition of String.splitOnAux
. By the way, I won't use KMP to redefine the function because I don't know what it is.
Mauricio Collares (Apr 03 2024 at 12:42):
Yes, you can add an extra parameter to splitOnAux that records where the current match starts, and then restart at the next position in case of failure (b
is not where the current match starts, but rather where the current "block" starts)
Bulhwi Cha (Apr 03 2024 at 12:42):
I don't think I'll need more parameters. But we'll see.
Mario Carneiro (Apr 03 2024 at 12:43):
The type signature of splitOnAux
is subject to change, it's not meant for direct use and should just do what it needs to do to get a correct result for splitOn
Mario Carneiro (Apr 03 2024 at 12:44):
which is why I find it a bit odd to be studying them separately like this
Bulhwi Cha said:
What I fixed last year was one of the many weird behaviors of
String.splitOnAux
. If I remember correctly, this had nothing to do withString.splitOn
. So you can say this wasn't a bug.
Mauricio Collares (Apr 03 2024 at 12:44):
Bulhwi Cha said:
I don't think I'll need more parameters. But we'll see.
Fair. I guess you can do s.prev i
and sep.prev j
a bunch of times until you hit the beginning of the separator string, then advance i
once (assuming there is a prev
function). But that's more overhead. It would be easier to implement if you could do index operations in O(1) time, but that would introduce another class of bugs for strings which contain multi-byte characters.
Mario Carneiro (Apr 03 2024 at 12:45):
you can do indexing in O(1)?
Mario Carneiro (Apr 03 2024 at 12:45):
you should just be counting byte offsets not characters
Mauricio Collares (Apr 03 2024 at 12:46):
Oh, is docs#Pos a byte offset?
Mauricio Collares (Apr 03 2024 at 12:47):
Got it, so it should be enough to go from b i j
to something like b (s.next (i - j)) 0
in case of failure
Mario Carneiro (Apr 03 2024 at 12:47):
no you don't want to use nextn
, that's counting characters
Bulhwi Cha (Apr 03 2024 at 12:48):
Mauricio Collares said:
How is this bug different from
"ababcab".splitOn "abc"
failing? I don't get it
(Now I'm a bit sad because you already knew about the bug I found a few hours ago. :cry:)
Mauricio Collares (Apr 03 2024 at 12:48):
Mario Carneiro said:
no you don't want to use
nextn
, that's counting characters
That's what we want to do in this case
Mario Carneiro (Apr 03 2024 at 12:48):
unclear why, but I haven't dug into this algorithm in depth
Mario Carneiro (Apr 03 2024 at 12:49):
there shouldn't be a single character count involved here
Mauricio Collares (Apr 03 2024 at 12:49):
splitOnAux s b i j
means "try to find the j-th character of the pattern starting at position i of the text, where the block starts at b"
Mario Carneiro (Apr 03 2024 at 12:49):
where both i and j are byte offsets
Mauricio Collares (Apr 03 2024 at 12:49):
Yes, but if you discovered that something doesn't match, you need to start again at the next character, not next byte
Mario Carneiro (Apr 03 2024 at 12:50):
so the difference between them is also a byte offset
Mario Carneiro (Apr 03 2024 at 12:51):
you use next
and prev
to get the next byte offset, but there is no local variable saying this thing is N characters past that thing
Mario Carneiro (Apr 03 2024 at 12:51):
every variable is in terms of byte measurements between parts of the string
Mauricio Collares (Apr 03 2024 at 12:54):
i
means "the currently matching text character is i
bytes past the start". So, for example, suppose you matched X characters (Y bytes) so far successfully, and then you got a failure. You want to look go back to the start of the string (that is, go back X characters or Y bytes starting from byte offset i
) to get where the beginning of the match was, then advance that by one character and try to match the first character of the pattern again.
Mauricio Collares (Apr 03 2024 at 12:54):
This looks a lot more complicated than it is because the start of the current match is not currently a parameter (b is something else)
Mario Carneiro (Apr 03 2024 at 12:54):
right, so that's i - Y
Mario Carneiro (Apr 03 2024 at 12:55):
no nextn
because we aren't storing X and it wouldn't be useful if we did
Mauricio Collares (Apr 03 2024 at 12:56):
But that's the beginning of the current match, which we know failed. We want to start again at the next possible beginning position.
Mario Carneiro (Apr 03 2024 at 12:56):
(i -Y).next
Mauricio Collares (Apr 03 2024 at 12:57):
Oh, are you saying Y
and j
are different things because j
is not an offset wrt the beginning of the pattern?
Mauricio Collares (Apr 03 2024 at 12:57):
j
in the code is a Pos
referring to the separator/pattern
Mauricio Collares (Apr 03 2024 at 12:58):
(We're trying to match the character at offset j
from sep
to the character at offset i
from s
)
Mario Carneiro (Apr 03 2024 at 12:58):
it measures two things, the byte offset in the pattern and also in the search string (relative to i
), we know that they agree so far hence it's a valid offset in both
Mauricio Collares (Apr 03 2024 at 12:59):
Yes, I agree, but you're claiming s.next (i-j)
was wrong and I don't understand why (thanks for being patient by the way!)
Mario Carneiro (Apr 03 2024 at 12:59):
I was claiming s.nextn (i - j)
was wrong for "typing reasons"
Mauricio Collares (Apr 03 2024 at 13:01):
Oh, but I didn't mean nextn
, I really did mean next
(I didn't realize nextn
existed, I thought it was just a typo)
Mario Carneiro (Apr 03 2024 at 13:01):
nextn
is how you move N characters forward
Mario Carneiro (Apr 03 2024 at 13:01):
so if you had X
around you would need it
Mauricio Collares (Apr 03 2024 at 13:02):
Yes, but I think s.next x
finds the character immediately after the one at offset x
(at least that's how the code uses it iiuc)
Mauricio Collares (Apr 03 2024 at 13:02):
the offset of that character, that is
Mauricio Collares (Apr 03 2024 at 13:12):
I don't understand how nextn
entered the conversation in the first place. But regardless: Thanks for the explanation, it was very useful! I didn't know Pos
was a byte offset.
Mario Carneiro (Apr 03 2024 at 13:15):
Mauricio Collares said:
Got it, so it should be enough to go from
b i j
to something likeb (s.next (i - j)) 0
in case of failure
I misunderstood this as being about nextn
because I think of next
as being a unary function
Mauricio Collares (Apr 03 2024 at 13:21):
Ah, got it! I was just cargo-culting splitOnAux
's coding convention. I now know about nextn
, which is nice :)
Mauricio Collares (Apr 03 2024 at 13:26):
Bulhwi Cha said:
Mauricio Collares said:
How is this bug different from
"ababcab".splitOn "abc"
failing? I don't get it(Now I'm a bit sad because you already knew about the bug I found a few hours ago. :cry:)
That's my fault, I didn't emphasize it because I thought fixing it would be a performance regression in some cases (as Mario pointed out, I saw the need for KMP where there was none). But replacing splitOnAux s sep b (s.next i) 0 r
by splitOnAux s sep b (s.next (i - j)) 0 r
is clearly "free" for single-character separators since j
is always 0, so I'm really glad you investigated this issue.
Bulhwi Cha (Apr 03 2024 at 15:13):
Mario Carneiro said:
which is why I find it a bit odd to be studying them separately like this
Bulhwi Cha said:
What I fixed last year was one of the many weird behaviors of
String.splitOnAux
. If I remember correctly, this had nothing to do withString.splitOn
. So you can say this wasn't a bug.
The weirder the auxiliary function, the harder it is to prove theorems about the main function.
Bulhwi Cha (Apr 03 2024 at 15:28):
List.splitOnList
I think I found an easier way to define List.splitOnList
. See the above code.
Mario Carneiro said:
Maybe for modeling purposes, but that's a wasteful implementation
actually we can't do that either, because String has so many non-WF cases that the equality theorem is false in full generality (maybe not for String because the inputs are simple, but definitely for Substring)
Since we can't define String.splitOn
as (List.splitOnList s.1 sep.1).map mk
, I'll figure out a new definition of it.
Notification Bot (Apr 03 2024 at 15:29):
Bulhwi Cha has marked this topic as resolved.
Notification Bot (Apr 03 2024 at 15:30):
Bulhwi Cha has marked this topic as unresolved.
Bulhwi Cha (Apr 03 2024 at 15:57):
If we have a proof that String.IsPrefix sep s
is equivalent to String.isPrefixOf sep s
, we can use it to define String.splitOn
in a way similar to my definition of List.splitOnList
above. We probably don't need to deal with byte offsets because String.isPrefixOf
will do it for us.
I really should rest now. :innocent:
Mario Carneiro (Apr 04 2024 at 05:54):
Bulhwi Cha (Apr 04 2024 at 06:16):
Mauricio Collares said:
But replacing
splitOnAux s sep b (s.next i) 0 r
bysplitOnAux s sep b (s.next (i - j)) 0 r
is clearly "free" for single-character separators sincej
is always 0, so I'm really glad you investigated this issue.
Now I understand why we need this change. :sweat_smile:
Bulhwi Cha (Apr 04 2024 at 06:23):
Allow me to prove String.splitOnAux_of_valid
and String.splitOn_of_valid
. I can do this.
Number Eighteen (Apr 04 2024 at 13:20):
Another issue with String.splitOn
is that it causes crashes (in compiled code as well) when it is fed a very large String
(GB+), because of the limitations of the List
type. For myself, I had to write a String.splitArrayOn
to do my work.
Similar situations may occur in other places where things get turned to List
; since Core functions are supposed to be the "common standard", is there any plan to either fix List
crashes or make Array
the default container for these actions?
Mario Carneiro (Apr 04 2024 at 23:49):
As long as you handle the result well, the consequence of using List
should not be that bad. List
should still work with a billion elements in the list, but it has more overhead than array in that case. Are you sure the issue was actually with splitOn
and not with subsequent calculation (like printing it to the screen)?
Number Eighteen (Apr 05 2024 at 12:01):
Mario Carneiro said:
As long as you handle the result well, the consequence of using
List
should not be that bad.List
should still work with a billion elements in the list, but it has more overhead than array in that case. Are you sure the issue was actually withsplitOn
and not with subsequent calculation (like printing it to the screen)?
No, splitOn
as a function was fine, it was subsequent handling of the List
structure that caused the crash and made me write splitArray
; but that is my point: since List
is causing crashes where Array
does not, either List
needs to be reworked or Array
needs to become the standard container for Core collections. Core, or Std or however you call it, should to be as correct as possible since it is the reference library and the "common dialect" of Lean code (my opinion, at least this is how it works in all other languages I know).
Mario Carneiro (Apr 05 2024 at 12:02):
you can convert the List to an Array using toArray
and handle it as such if that helps
Mario Carneiro (Apr 05 2024 at 12:03):
the issues that can occur with large lists can also occur for large arrays, depending on how the function is written
Mario Carneiro (Apr 05 2024 at 12:05):
Choosing data structures for standard types is not an easy task, because workloads can vary significantly and functions have multiple constraints on them: being usable in proofs, having good kernel performance, good interpreter performance and good compiled performance
Mario Carneiro (Apr 05 2024 at 12:05):
not everyone is making gigabyte large lists with splitOn
Mario Carneiro (Apr 05 2024 at 12:06):
it may be the case that you have an unusual workload and the best option is to write a bespoke version of the function
Mario Carneiro (Apr 05 2024 at 12:07):
Having an array version of splitOn seems like a reasonable ask, although I would want to first review the existing uses of splitOn to see whether lists or arrays are better suited to the existing use cases in lean core
Mario Carneiro (Apr 05 2024 at 12:11):
since
List
is causing crashes whereArray
does not
just to be clear, it's not "List
causing crashes", it's the use of a non-tail-recursive function on lists. I don't know which function you ran into but Lean/Std try to ensure that they don't handle list arguments in a way that will explode on long lists. One place which I know is non-tail-recursive is the list formatter, but this is unavoidable because it ends up in another, tree-like inductive type called Format
which can explode when traversed. We would need to rewrite the formatter significantly or replace it entirely, and I am sure there is no appetite for that in lean core
Mario Carneiro (Apr 05 2024 at 12:12):
especially when the obvious answer is "stop trying to print a gigabyte long list"
Mario Carneiro (Apr 05 2024 at 12:17):
Core, or Std or however you call it
These are different projects. Lean 4 core is terminology referring to the lean4 repo https://github.com/leanprover/lean4/, and Std is https://github.com/leanprover/std4/ . The first repo is responsible for the language itself and everything you get with no imports at all, and the second repo provides things that you can import with import Std.Data.Array.Basic
or the like. They both share some responsibility for defining core functions on Lists and Arrays. I am a Std maintainer, but this particular function String.splitOn
happens to live in core and I have no authority to be making decisions of this kind on their behalf.
Timo Carlin-Burns (Apr 05 2024 at 14:35):
Number Eighteen said:
either
List
needs to be reworked orArray
needs to become the standard container for Core collections.
I made an RFC recently with a very related proposal: https://leanprover.zulipchat.com/#narrow/stream/113488-general/topic/.5BRFC.5D.20single.20list.20type/near/430304738
Last updated: May 02 2025 at 03:31 UTC