Zulip Chat Archive
Stream: batteries
Topic: ten unproved string theorems
Bulhwi Cha (Apr 16 2024 at 10:04):
I plan to prove the eleven remaining theorems on the TODO
list in Std.Data.String.Lemmas
. If I'm lucky, I'll be supported this summer or next year by NIPA's OSS Support Center (Open UP) to teach Korean students and programmers how to prove these string theorems.
I've just proved the first theorem: String.splitOn_of_valid
. See std#743. It took me around 170 hours (December 2023–April 2024) to prove it. There was a bug in the definition of String.splitOnAux
. This bug was one reason why it took so long. Thankfully, @Mario Carneiro fixed it. See lean#3832.
Now we have ten theorems to go, but I think I'll work on other projects for a while. :smile:
Bulhwi Cha (Apr 16 2024 at 10:17):
Since Lean 4.7.0 still has the bug in String.splitOnAux
, we'll have to wait for Lean 4.8.0-rc1 to come.
Mario Carneiro (Apr 16 2024 at 10:18):
you can either change the base branch to nightly-testing or add a copy of splitOnAux and prove properties about that instead
Bulhwi Cha (Apr 16 2024 at 10:20):
Right. I copied the function for now.
Bulhwi Cha (Apr 16 2024 at 10:31):
By the way, I proved that List.IsPrefix
is equivalent to List.isPrefixOf
, and List.IsSuffix
is equivalent to List.isSuffixOf
. It'll be nice if Lean can automatically generate lemmas for isPrefixOf
and isSuffixOf
whenever I prove those for IsPrefix
and IsSuffix
. Do you think this is a good idea? If so, how do we do that?
Mario Carneiro (Apr 16 2024 at 10:32):
the idea is to have all the lemmas about IsPrefix
, and have a simp lemma saying isPrefixOf l1 l2 <-> IsPrefix l1 l2
, and then you never have to worry about isPrefixOf
again
Bulhwi Cha (Apr 16 2024 at 10:33):
I'll try it a few hours later. Thanks! (Edit: done!)
Bulhwi Cha (Apr 19 2024 at 15:18):
Here are my four pull requests awaiting reviews:
- refactor: move theorems about lists from mathlib (std#756)
feat: add lemmas aboutI closed it.Nat.add
andList
(std#757)- refactor: move
Function.id_def
from mathlib (std#755) - feat: add
String.splitOn_of_valid
(std#743)
Bulhwi Cha (May 03 2024 at 02:28):
When I proved that the theorem String.splitOnAux_of_valid
terminates, I had to hide more than ten illegible hypotheses from each goal: https://github.com/leanprover-community/batteries/pull/743/files#diff-38e93b9f694cdc66d675df840d70c3c0d47cc4b5a1813c4aa7d4d7744efff544R487-R510.
I wonder why these unwanted hypotheses show up. Here's the hypothesis _₀
of the first goal:
_₀ : ∀
(y :
(_ : List Char) ×'
(sep₂ : List Char) ×' (_ : List Char) ×' (_ : List Char) ×' (_ : List Char) ×' (_ : List String) ×' sep₂ ≠ []),
(invImage
(fun x =>
PSigma.casesOn x fun sep₁ sep₂ =>
PSigma.casesOn sep₂ fun sep₂ l =>
PSigma.casesOn l fun l m =>
PSigma.casesOn m fun m r =>
PSigma.casesOn r fun r acc =>
PSigma.casesOn acc fun acc h => (utf8Len sep₁ + utf8Len r, utf8Len sep₂))
Prod.instWellFoundedRelation).1
y ⟨sep₁, sep₂✝⟩ →
{ data := y.2.2.1 ++ y.2.2.2.1 ++ y.1 ++ y.2.2.2.2.1 }.splitOnAux { data := y.1 ++ y.2.1 }
{ byteIdx := utf8Len y.2.2.1 } { byteIdx := utf8Len (y.2.2.1 ++ y.2.2.2.1 ++ y.1) } { byteIdx := utf8Len y.1 }
y.2.2.2.2.2.1 =
y.2.2.2.2.2.1.reverse ++
List.map mk (List.modifyHead (fun x => y.2.2.2.1 ++ x) (y.2.2.2.2.1.splitOnListAux y.1 y.2.1 ⋯))
Bulhwi Cha (May 06 2024 at 14:37):
Currently, I have two pull requests in Batteries awaiting reviews and one in Mathlib that bumps Batteries:
- refactor: move theorems about lists from mathlib (batteries#756, #12540)
- feat: add
String.splitOn_of_valid
(batteries#743)
Bulhwi Cha (May 09 2024 at 17:21):
I made another PR, batteries#790, from a commit in batteries#756.
refactor: remove(Merged)@[simp]
fromList.modifyHead
Bulhwi Cha (May 23 2024 at 14:20):
- -- TODO: substrEq
- -- TODO: isPrefixOf
Please don't remove the TODO
lines in Batteries.Data.String.Lemmas
! These are the ten unproved theorems specifying string operations, which I'll prove in the future.
Bulhwi Cha (May 23 2024 at 14:23):
I added a review to batteries#782 but it's labeled "Pending." So, I guess it's currently visible only to me.
Ruben Van de Velde (May 23 2024 at 14:27):
I see a review starting "I'm pretty sure String.IsPrefix
and String.isPrefixOf
are equivalent..."
Bulhwi Cha (May 23 2024 at 14:28):
That's another comment I made.
Yaël Dillies (May 23 2024 at 14:29):
Go to the diff (the url of batteries#782 + /files
) and submit your review
Bulhwi Cha (May 23 2024 at 14:32):
Oh, I didn't know I had to submit it. Thanks! I think this was my first time reviewing someone else's pull request.
tjf801 (May 23 2024 at 20:45):
Sorry about that! I implemented all of these suggestions, and verified your hunch about String.isPrefixOf
and Mathlib
's String.IsPrefix
in batteries#809, and closed batteries#782.
Bulhwi Cha (May 24 2024 at 06:11):
Thanks! I'll add some code to your PR or open a new PR. It'll take some time.
Bulhwi Cha (May 30 2024 at 04:51):
import Batteries.Data.String.Lemmas
namespace String
theorem substrEq_iff_extract_eq_extract : substrEq s₁ off₁ s₂ off₂ sz ↔
s₁.extract off₁ (off₁ + ⟨sz⟩) = s₂.extract off₂ (off₂ + ⟨sz⟩) := by
sorry
theorem substrEq_of_valid (h : utf8Len l₁ = utf8Len l₂) :
substrEq ⟨pre₁ ++ l₁ ++ suf₁⟩ ⟨utf8Len pre₁⟩ ⟨pre₂ ++ l₂ ++ suf₂⟩ ⟨utf8Len pre₂⟩ (utf8Len l₁) ↔
l₁ = l₂ := by
rw [substrEq_iff_extract_eq_extract]
simp only [Pos.add_eq]
conv => lhs; rhs; arg 3; rw [h]
repeat rw [extract_of_valid]
exact ext_iff
end String
@Mario Carneiro, do you think my statement of String.substrEq_of_valid
is correct?
Mario Carneiro (May 30 2024 at 05:01):
theorem substrEq_of_valid :
substrEq ⟨pre₁ ++ suf₁⟩ ⟨utf8Len pre₁⟩ ⟨pre₂ ++ suf₂⟩ ⟨utf8Len pre₂⟩ n ↔
∃ l, l <+: suf₁ ∧ l <+: suf₂ ∧ utf8Len l = n := sorry
Bulhwi Cha (May 30 2024 at 05:06):
Thanks for your suggestion!
Bulhwi Cha (Jun 08 2024 at 11:29):
Bulhwi Cha said:
When I proved that the theorem
String.splitOnAux_of_valid
terminates, I had to hide more than ten illegible hypotheses from each goal: https://github.com/leanprover-community/batteries/pull/743/files#diff-38e93b9f694cdc66d675df840d70c3c0d47cc4b5a1813c4aa7d4d7744efff544R487-R510.I wonder why these unwanted hypotheses show up. Here's the hypothesis
_₀
of the first goal:Hypothesis
I'm moving the Lean toolchain in std#743 from v4.8.0-rc2
to v4.9.0-rc1
. Sadly, I still get these unintelligible hypotheses: https://paste.sr.ht/~chabulhwi/17bd2b05d8d6ea181db82bc5b39b6f93cb40fe3b.
Bulhwi Cha (Jun 16 2024 at 13:48):
How do we know what lemmas are worth being included in the Lean standard library? Are lemmas for proving these ten specification theorems about string operations worth it?
Ruben Van de Velde (Jun 16 2024 at 14:03):
We put them in mathlib or batteries and at some point we get errors upgrading the nightly version because core took them
Bulhwi Cha (Jun 16 2024 at 14:06):
Um, is this an answer to my question?
Ruben Van de Velde (Jun 16 2024 at 14:13):
Depends on what you mean by "Lean standard library" - do you mean batteries or core?
Bulhwi Cha (Jun 16 2024 at 14:14):
I meant the new standard library, part of the Lean core library.
Ruben Van de Velde (Jun 16 2024 at 14:14):
Then yes, it was meant to be an answer
Bulhwi Cha (Jun 16 2024 at 14:32):
Sorry, I don't understand. These ten unproved theorems were added to the old Std by Mario Carneiro.
Kim Morrison said:
My basic target is "specification lemmas, simp lemmas for ext/operation combinations, lemmas describing pairwise interactions of basic operations".
Kim said that they'd like to upstream specification theorems about lists and arrays. So, the specification theorems about strings might also get moved to the new Std. But I'm less sure about whether the lemmas for proving these specification theorems are worth being included in Std as well. Some of them feel too hacky.
Mario Carneiro (Jun 16 2024 at 14:33):
I think you should just focus on putting them in batteries, and if they are needed they will be upstreamed to the new std
Kim Morrison (Jun 16 2024 at 23:00):
There will be more news coming about this later, and when a design document is available we will post it, but it is possible there will be a radical overhaul of the String
library (indeed, even the definition of String
), so I would hesitate to encourage significant work on String
at the moment as it may become obsolete (even to the point of not being able to meaningfully migrate results).
If you are interested in contributing to the development of basic data types, useful directions where help would be appreciated are:
BitVec
(where Tobias Grosser's group have been contributing results, and Henrik Boving is using these results to verify his bitblasting tactic that is coming soon for everyone). Here there are many missing results about interactions of the basic operations still.List
. I'm doing a big reorganisation at the moment. One thing that I won't get to in the short term is basic lemmas aboutintersperse
,intercalate
,eraseDups
,eraseReps
,span
,groupBy
,removeAll
, which currently are only used in meta code and have essentially no verification. Also, I'll have the first reorganisation PR up soon (edit: now lean#4469), and it will include some documentation on the range of lemmas we want about list operations, and I can imagine that going through Mathlib's List development and identifying a list of results there that fill gaps in what is described by that documentation would be helpful.- Cleaning up basic data type code in Mathlib:
- e.g. moving
Array.permute!
into a subfolder of the tactic code that uses it, to make clear this is not for outside use and that there is no associated verification - upstreaming
Mathlib.Data.Array.ExtractLemmas
to Lean (I will take care of this eventually, but could happily merge a PR) - move
LinearOrder Bool
andBool.injective_iff
into other files, after whichMathlib.Data.Bool.Basic
doesn't need to be imported anywhere in Mathlib - finish getting
LazyList
,ByteSlice
out of Mathlib. These were PR'd to Batteries batteries#835 and batteries#836, but there were problems identified in review. If there are real downstream users, find out what they would like done. Otherwise, can we just deprecate in Mathlib and be done? - similarly,
DList
should depart from Mathlib - similarly, most of the
MLList
material should depart from Mathlib. Unfortunately some of it is used inrw_search
, so a decision would have to be made about that part. Mathlib.Data.BinaryHeap
is not used in Mathlib and perhaps should move?
- e.g. moving
Bulhwi Cha (Jun 17 2024 at 02:49):
Kim Morrison said:
There will be more news coming about this later, and when a design document is available we will post it, but it is possible there will be a radical overhaul of the
String
library (indeed, even the definition ofString
), so I would hesitate to encourage significant work onString
at the moment as it may become obsolete (even to the point of not being able to meaningfully migrate results).
I'm glad that there may be a substantial overhaul of definitions and theorems about strings in the future. But it also means that I don't have to prove the ten specification theorems and review Batteries#809.
Bulhwi Cha (Jun 17 2024 at 02:53):
@tjf801 I'd like to hear what you think about this possible overhaul.
Bulhwi Cha (Jun 17 2024 at 03:17):
I think it's time for me to move on to other work: learning mathematics with Mathlib, translating #nng and #tpil into Korean, and editing an interview video that discusses formalization of mathematics in Lean.
François G. Dorais (Jun 17 2024 at 03:24):
I've had a need for binary heap recently. I will move it to Batteries. WIP batteries#849
Kim Morrison (Jun 17 2024 at 04:57):
Looks good to me.
Bulhwi Cha (Jun 17 2024 at 08:54):
@Kim Morrison Can I ask what's the motivation behind the planned overhaul of the library work on strings?
Bulhwi Cha (Jun 17 2024 at 08:55):
I'm also curious about when the Lean core team began considering it.
Henrik Böving (Jun 17 2024 at 09:31):
Bulhwi Cha said:
I'm also curious about when the Lean core team began considering it.
The refactoring of String
is on the way only very recently, the person that's working on it began approximately 2 weeks ago as you can see from the PR tracker of lean4
Eric Wieser (Jun 17 2024 at 09:41):
Eric Wieser (Jun 17 2024 at 09:43):
Is this likely to cause @Mario Carneiro's lean4#3963 to bitrot, or is this motivation to get that PR in quickly?
Kim Morrison (Jun 17 2024 at 10:37):
I think there's no reason we can proceed with that in the meantime.
tjf801 (Jun 17 2024 at 14:42):
(deleted)
tjf801 (Jun 17 2024 at 14:42):
Bulhwi Cha said:
tjf801 I'd like to hear what you think about this possible overhaul.
Honestly? I don't really know how String
could really end up being that much different, and for a lot of the lemmas in Batteries#809, I think they'd still be true for any reasonable definition of String
. And to be honest, I really only was even proving these lemmas because I kept needing them to verify pieces of code I wrote. So if people (who are probably much smarter and more experienced than me!) can remove some of the crust around Strings
and make them more usable for verification stuffs, I'm all for it! But I think that in the meantime, having lemmas to make the current solutions usable is also not necessarily a bad thing.
Bulhwi Cha (Jun 17 2024 at 16:19):
tjf801 said:
Honestly? I don't really know how
String
could really end up being that much different, and for a lot of the lemmas in Batteries#809, I think they'd still be true for any reasonable definition ofString
.
I agree with what you said, but the proofs of these lemmas will likely change. When I ported Mathlib.Data.String.Basic
from Lean 3 to Lean 4 last year, I had to rewrite the proof of the theorem String.lt_iff_toList_lt
from scratch. This was due to the new implementation of the type String.Iterator
.
Eric Wieser said:
Why did you choose to rewrite it from scratch?
Bulhwi Cha said:
The previous proof didn't work, and its comment said, "TODO This proof probably has to be completely redone."
tjf801 (Jun 17 2024 at 16:53):
Bulhwi Cha said:
I agree with what you said, but the proofs of these lemmas will likely change. When I ported
Mathlib.Data.String.Basic
from Lean 3 to Lean 4 last year, I had to rewrite the proof of the theoremString.lt_iff_toList_lt
from scratch. This was due to the new implementation of the typeString.Iterator
.
Yeah that's fair. But I feel like with proofs (and programming in general) once you've done something once, doing it again from scratch is much faster. At least in my experience
Bulhwi Cha (Jun 17 2024 at 17:24):
I've spent too much time proving String.splitOn_of_valid
, which hasn't been reviewed for two months. I want to wait until the overhaul is finished.
Bulhwi Cha (Jun 17 2024 at 17:39):
By the way, String.lt_iff_toList_lt
took me more than 30 hours to prove, and String.splitOn_of_valid
took about 170 hours.
Bulhwi Cha (Jun 18 2024 at 02:23):
Bulhwi Cha said:
@Kim Morrison Can I ask what's the motivation behind the planned overhaul of the library work on strings?
I guess I have to wait for the design document.
Matthew Ballard (Aug 27 2024 at 08:45):
What is the status of the overhaul of String
? I've already started smoothing some edges for use in my class and am wondering if I should spend any energy on PRing them.
Kim Morrison (Aug 27 2024 at 11:19):
Unfortunately overhauling String
by the FRO has fallen back to medium priority, and in particular no one is actively working on it now.
Bulhwi Cha (Aug 27 2024 at 23:21):
@Kim Morrison When did the FRO decide to postpone the overhaul?
Kim Morrison (Aug 28 2024 at 02:47):
~3-4 weeks ago.
Bulhwi Cha (Aug 28 2024 at 04:25):
I wish the FRO had made a small announcement about this decision.
Bulhwi Cha (Sep 03 2024 at 04:54):
@Kim Morrison Have you been moving lemmas about Nat
or List
from my pull requests to the Lean core library?
Kim Morrison (Sep 03 2024 at 05:51):
I don't think so -- I don't recall looking at your PRs at all, sorry.
Kim Morrison (Sep 03 2024 at 05:51):
I've only had time for maintenance and reorganizing in Batteries for some time, I haven't been looking at PRs with any new material.
Bulhwi Cha (Sep 03 2024 at 05:55):
Then someone must've proven lemmas that I'd already done proving.
Bulhwi Cha (Sep 03 2024 at 06:00):
How can we prevent people from proving the same theorems so that they won't duplicate work already done?
Bulhwi Cha (Sep 03 2024 at 06:10):
I wish we had more Batteries maintainers who would review stale pull requests.
Bulhwi Cha (Sep 03 2024 at 06:18):
Here are my PRs awaiting reviews:
- refactor: move theorems about lists from mathlib (batteries#756)
- refactor: move theorems about lists to batteries (mathlib#12540)
- feat: add
String.splitOn_of_valid
(batteries#743)
Ruben Van de Velde (Sep 03 2024 at 06:33):
I would suggest adding your new lemmas to mathlib instead
Bulhwi Cha (Sep 03 2024 at 06:42):
I'll think about it.
Bulhwi Cha (Sep 03 2024 at 06:45):
Bulhwi Cha said:
Then someone must've proven lemmas that I'd already done proving.
It turned out that Kim proved List.isPrefixOf_iff_prefix
on July 26:
@[simp] theorem isPrefixOf_iff_prefix [BEq α] [LawfulBEq α] {l₁ l₂ : List α} :
l₁.isPrefixOf l₂ ↔ l₁ <+: l₂ := by
induction l₁ generalizing l₂ with
| nil => simp
| cons a l₁ ih =>
cases l₂ with
| nil => simp
| cons a' l₂ => simp [isPrefixOf, ih]
Bulhwi Cha (Sep 03 2024 at 06:50):
Bulhwi Cha said:
By the way, I proved that
List.IsPrefix
is equivalent toList.isPrefixOf
, andList.IsSuffix
is equivalent toList.isSuffixOf
. It'll be nice if Lean can automatically generate lemmas forisPrefixOf
andisSuffixOf
whenever I prove those forIsPrefix
andIsSuffix
. Do you think this is a good idea? If so, how do we do that?
I proved List.isPrefixOf_iff_IsPrefix
on April 16:
@[simp] theorem isPrefixOf_iff_IsPrefix [BEq α] [LawfulBEq α] {l₁ l₂ : List α} :
isPrefixOf l₁ l₂ ↔ l₁ <+: l₂ := by
match l₁, l₂ with
| [], _ => simp [nil_prefix]
| _::_, [] => simp
| a::as, b::bs =>
constructor
· intro h
simp only [isPrefixOf, Bool.and_eq_true, beq_iff_eq] at h
let ⟨t, ht⟩ := isPrefixOf_iff_IsPrefix.mp h.2
exists t
simpa [h.1]
· intro ⟨t, ht⟩
simp only [cons_append, cons.injEq] at ht
have hpf : as <+: bs := ⟨t, ht.2⟩
simpa [isPrefixOf, ht.1] using isPrefixOf_iff_IsPrefix.mpr hpf
Bulhwi Cha (Sep 03 2024 at 06:52):
Their proof is much more elegant than mine, so I think the duplicated work was worth it in this case.
Bulhwi Cha (Oct 14 2024 at 12:36):
Kim Morrison said:
Unfortunately overhauling
String
by the FRO has fallen back to medium priority, and in particular no one is actively working on it now.
Do you think it's time for me to resume work on proving the remaining theorems on the TODO
list in Batteries.Data.String.Lemmas
? It seems Batteries#743 alone is too huge for the maintainers to review, so I might as well wait for the PR to get merged.
Kim Morrison (Oct 15 2024 at 00:52):
Personally, I don't think there is currently much value in adding theorems about String
, unless you are have an application in mind.
Bulhwi Cha (Oct 15 2024 at 02:07):
The unexpected value of proving specification theorems about operations of a datatype was finding bugs in the definitions of these operations.
Bulhwi Cha said:
I've just proved the first theorem:
String.splitOn_of_valid
. See std#743. It took me around 170 hours (December 2023–April 2024) to prove it.
However, it takes a lot of time when the definitions haven't been 'optimized' for proving theorems about them. I'd like to see the overhaul of the String
library first.
Bulhwi Cha (Oct 15 2024 at 02:41):
I also want to mention that the incorrect definition of String.splitOnAux
had been around since March 23, 2019, until I suggested fixing it on April 3: https://leanprover.zulipchat.com/#narrow/stream/270676-lean4/topic/current.20definition.20of.20.60String.2EsplitOn.60.20is.20incorrect/near/430930535.
Bulhwi Cha (Oct 15 2024 at 02:47):
I think it's desirable that at least one person in this community should inspect the old code to find and fix bugs.
Kim Morrison (Oct 15 2024 at 03:23):
Yes, that's true. But I'd prefer that work in this area is driven by applications (e.g. code that does large scale text processing).
Bulhwi Cha (Oct 15 2024 at 03:49):
Maybe that's another reason I should continue learning programming after I finish my Lean mentorship program.
Bulhwi Cha (Oct 15 2024 at 04:30):
Nevertheless, wouldn't software developers prefer a programming language whose core library had already been examined thoroughly?
Kim Morrison (Oct 15 2024 at 06:48):
Sorry yes, of course it's fine to use String
for practice.
Last updated: May 02 2025 at 03:31 UTC