Zulip Chat Archive

Stream: batteries

Topic: ten unproved string theorems


Bulhwi Cha (Apr 16 2024 at 10:04):

I plan to prove the eleven remaining theorems on the TODO list in Std.Data.String.Lemmas. If I'm lucky, I'll be supported this summer or next year by NIPA's OSS Support Center (Open UP) to teach Korean students and programmers how to prove these string theorems.

I've just proved the first theorem: String.splitOn_of_valid. See std#743. It took me around 170 hours (December 2023–April 2024) to prove it. There was a bug in the definition of String.splitOnAux. This bug was one reason why it took so long. Thankfully, @Mario Carneiro fixed it. See lean#3832.

Now we have ten theorems to go, but I think I'll work on other projects for a while. :smile:

Bulhwi Cha (Apr 16 2024 at 10:17):

Since Lean 4.7.0 still has the bug in String.splitOnAux, we'll have to wait for Lean 4.8.0-rc1 to come.

Mario Carneiro (Apr 16 2024 at 10:18):

you can either change the base branch to nightly-testing or add a copy of splitOnAux and prove properties about that instead

Bulhwi Cha (Apr 16 2024 at 10:20):

Right. I copied the function for now.

Bulhwi Cha (Apr 16 2024 at 10:31):

By the way, I proved that List.IsPrefix is equivalent to List.isPrefixOf, and List.IsSuffix is equivalent to List.isSuffixOf. It'll be nice if Lean can automatically generate lemmas for isPrefixOf and isSuffixOf whenever I prove those for IsPrefix and IsSuffix. Do you think this is a good idea? If so, how do we do that?

Mario Carneiro (Apr 16 2024 at 10:32):

the idea is to have all the lemmas about IsPrefix, and have a simp lemma saying isPrefixOf l1 l2 <-> IsPrefix l1 l2, and then you never have to worry about isPrefixOf again

Bulhwi Cha (Apr 16 2024 at 10:33):

I'll try it a few hours later. Thanks! (Edit: done!)

Bulhwi Cha (Apr 19 2024 at 15:18):

Here are my four pull requests awaiting reviews:

  • refactor: move theorems about lists from mathlib (std#756)
  • feat: add lemmas about Nat.add and List (std#757) I closed it.
  • refactor: move Function.id_def from mathlib (std#755)
  • feat: add String.splitOn_of_valid (std#743)

Bulhwi Cha (May 03 2024 at 02:28):

When I proved that the theorem String.splitOnAux_of_valid terminates, I had to hide more than ten illegible hypotheses from each goal: https://github.com/leanprover-community/batteries/pull/743/files#diff-38e93b9f694cdc66d675df840d70c3c0d47cc4b5a1813c4aa7d4d7744efff544R487-R510.

I wonder why these unwanted hypotheses show up. Here's the hypothesis _₀ of the first goal:

_₀ : 
  (y :
    (_ : List Char) ×'
      (sep₂ : List Char) ×' (_ : List Char) ×' (_ : List Char) ×' (_ : List Char) ×' (_ : List String) ×' sep₂  []),
  (invImage
          (fun x =>
            PSigma.casesOn x fun sep₁ sep₂ =>
              PSigma.casesOn sep₂ fun sep₂ l =>
                PSigma.casesOn l fun l m =>
                  PSigma.casesOn m fun m r =>
                    PSigma.casesOn r fun r acc =>
                      PSigma.casesOn acc fun acc h => (utf8Len sep₁ + utf8Len r, utf8Len sep₂))
          Prod.instWellFoundedRelation).1
      y sep₁, sep₂ 
    { data := y.2.2.1 ++ y.2.2.2.1 ++ y.1 ++ y.2.2.2.2.1 }.splitOnAux { data := y.1 ++ y.2.1 }
        { byteIdx := utf8Len y.2.2.1 } { byteIdx := utf8Len (y.2.2.1 ++ y.2.2.2.1 ++ y.1) } { byteIdx := utf8Len y.1 }
        y.2.2.2.2.2.1 =
      y.2.2.2.2.2.1.reverse ++
        List.map mk (List.modifyHead (fun x => y.2.2.2.1 ++ x) (y.2.2.2.2.1.splitOnListAux y.1 y.2.1 ))

Bulhwi Cha (May 06 2024 at 14:37):

Currently, I have two pull requests in Batteries awaiting reviews and one in Mathlib that bumps Batteries:

Bulhwi Cha (May 09 2024 at 17:21):

I made another PR, batteries#790, from a commit in batteries#756.

  • refactor: remove @[simp] from List.modifyHead (Merged)

Bulhwi Cha (May 23 2024 at 14:20):

From https://github.com/leanprover-community/batteries/pull/782/files/8895e1928f93ae0feace47a7d9e006d2e2e66ce8#diff-38e93b9f694cdc66d675df840d70c3c0d47cc4b5a1813c4aa7d4d7744efff544L782-L783:

- -- TODO: substrEq
- -- TODO: isPrefixOf

Please don't remove the TODO lines in Batteries.Data.String.Lemmas! These are the ten unproved theorems specifying string operations, which I'll prove in the future.

Bulhwi Cha (May 23 2024 at 14:23):

I added a review to batteries#782 but it's labeled "Pending." So, I guess it's currently visible only to me.

Ruben Van de Velde (May 23 2024 at 14:27):

I see a review starting "I'm pretty sure String.IsPrefix and String.isPrefixOf are equivalent..."

Bulhwi Cha (May 23 2024 at 14:28):

That's another comment I made.

Yaël Dillies (May 23 2024 at 14:29):

Go to the diff (the url of batteries#782 + /files) and submit your review

Bulhwi Cha (May 23 2024 at 14:32):

Oh, I didn't know I had to submit it. Thanks! I think this was my first time reviewing someone else's pull request.

tjf801 (May 23 2024 at 20:45):

Sorry about that! I implemented all of these suggestions, and verified your hunch about String.isPrefixOf and Mathlib's String.IsPrefix in batteries#809, and closed batteries#782.

Bulhwi Cha (May 24 2024 at 06:11):

Thanks! I'll add some code to your PR or open a new PR. It'll take some time.

Bulhwi Cha (May 30 2024 at 04:51):

import Batteries.Data.String.Lemmas

namespace String

theorem substrEq_iff_extract_eq_extract : substrEq s₁ off₁ s₂ off₂ sz 
    s₁.extract off₁ (off₁ + sz⟩) = s₂.extract off₂ (off₂ + sz⟩) := by
  sorry

theorem substrEq_of_valid (h : utf8Len l₁ = utf8Len l₂) :
    substrEq pre₁ ++ l₁ ++ suf₁ utf8Len pre₁ pre₂ ++ l₂ ++ suf₂ utf8Len pre₂ (utf8Len l₁) 
      l₁ = l₂ := by
  rw [substrEq_iff_extract_eq_extract]
  simp only [Pos.add_eq]
  conv => lhs; rhs; arg 3; rw [h]
  repeat rw [extract_of_valid]
  exact ext_iff

end String

@Mario Carneiro, do you think my statement of String.substrEq_of_valid is correct?

Mario Carneiro (May 30 2024 at 05:01):

theorem substrEq_of_valid :
    substrEq pre₁ ++ suf₁ utf8Len pre₁ pre₂ ++ suf₂ utf8Len pre₂ n 
       l, l <+: suf₁  l <+: suf₂  utf8Len l = n := sorry

Bulhwi Cha (May 30 2024 at 05:06):

Thanks for your suggestion!

Bulhwi Cha (Jun 08 2024 at 11:29):

Bulhwi Cha said:

When I proved that the theorem String.splitOnAux_of_valid terminates, I had to hide more than ten illegible hypotheses from each goal: https://github.com/leanprover-community/batteries/pull/743/files#diff-38e93b9f694cdc66d675df840d70c3c0d47cc4b5a1813c4aa7d4d7744efff544R487-R510.

I wonder why these unwanted hypotheses show up. Here's the hypothesis _₀ of the first goal:

Hypothesis

I'm moving the Lean toolchain in std#743 from v4.8.0-rc2 to v4.9.0-rc1. Sadly, I still get these unintelligible hypotheses: https://paste.sr.ht/~chabulhwi/17bd2b05d8d6ea181db82bc5b39b6f93cb40fe3b.

Bulhwi Cha (Jun 16 2024 at 13:48):

How do we know what lemmas are worth being included in the Lean standard library? Are lemmas for proving these ten specification theorems about string operations worth it?

Ruben Van de Velde (Jun 16 2024 at 14:03):

We put them in mathlib or batteries and at some point we get errors upgrading the nightly version because core took them

Bulhwi Cha (Jun 16 2024 at 14:06):

Um, is this an answer to my question?

Ruben Van de Velde (Jun 16 2024 at 14:13):

Depends on what you mean by "Lean standard library" - do you mean batteries or core?

Bulhwi Cha (Jun 16 2024 at 14:14):

I meant the new standard library, part of the Lean core library.

Ruben Van de Velde (Jun 16 2024 at 14:14):

Then yes, it was meant to be an answer

Bulhwi Cha (Jun 16 2024 at 14:32):

Sorry, I don't understand. These ten unproved theorems were added to the old Std by Mario Carneiro.

Kim Morrison said:

My basic target is "specification lemmas, simp lemmas for ext/operation combinations, lemmas describing pairwise interactions of basic operations".

Kim said that they'd like to upstream specification theorems about lists and arrays. So, the specification theorems about strings might also get moved to the new Std. But I'm less sure about whether the lemmas for proving these specification theorems are worth being included in Std as well. Some of them feel too hacky.

Mario Carneiro (Jun 16 2024 at 14:33):

I think you should just focus on putting them in batteries, and if they are needed they will be upstreamed to the new std

Kim Morrison (Jun 16 2024 at 23:00):

There will be more news coming about this later, and when a design document is available we will post it, but it is possible there will be a radical overhaul of the String library (indeed, even the definition of String), so I would hesitate to encourage significant work on String at the moment as it may become obsolete (even to the point of not being able to meaningfully migrate results).

If you are interested in contributing to the development of basic data types, useful directions where help would be appreciated are:

  • BitVec (where Tobias Grosser's group have been contributing results, and Henrik Boving is using these results to verify his bitblasting tactic that is coming soon for everyone). Here there are many missing results about interactions of the basic operations still.
  • List. I'm doing a big reorganisation at the moment. One thing that I won't get to in the short term is basic lemmas about intersperse, intercalate, eraseDups, eraseReps, span, groupBy, removeAll, which currently are only used in meta code and have essentially no verification. Also, I'll have the first reorganisation PR up soon (edit: now lean#4469), and it will include some documentation on the range of lemmas we want about list operations, and I can imagine that going through Mathlib's List development and identifying a list of results there that fill gaps in what is described by that documentation would be helpful.
  • Cleaning up basic data type code in Mathlib:
    • e.g. moving Array.permute! into a subfolder of the tactic code that uses it, to make clear this is not for outside use and that there is no associated verification
    • upstreaming Mathlib.Data.Array.ExtractLemmas to Lean (I will take care of this eventually, but could happily merge a PR)
    • move LinearOrder Bool and Bool.injective_iff into other files, after which Mathlib.Data.Bool.Basic doesn't need to be imported anywhere in Mathlib
    • finish getting LazyList, ByteSlice out of Mathlib. These were PR'd to Batteries batteries#835 and batteries#836, but there were problems identified in review. If there are real downstream users, find out what they would like done. Otherwise, can we just deprecate in Mathlib and be done?
    • similarly, DList should depart from Mathlib
    • similarly, most of the MLList material should depart from Mathlib. Unfortunately some of it is used in rw_search, so a decision would have to be made about that part.
    • Mathlib.Data.BinaryHeap is not used in Mathlib and perhaps should move?

Bulhwi Cha (Jun 17 2024 at 02:49):

Kim Morrison said:

There will be more news coming about this later, and when a design document is available we will post it, but it is possible there will be a radical overhaul of the String library (indeed, even the definition of String), so I would hesitate to encourage significant work on String at the moment as it may become obsolete (even to the point of not being able to meaningfully migrate results).

I'm glad that there may be a substantial overhaul of definitions and theorems about strings in the future. But it also means that I don't have to prove the ten specification theorems and review Batteries#809.

Bulhwi Cha (Jun 17 2024 at 02:53):

@tjf801 I'd like to hear what you think about this possible overhaul.

Bulhwi Cha (Jun 17 2024 at 03:17):

I think it's time for me to move on to other work: learning mathematics with Mathlib, translating #nng and #tpil into Korean, and editing an interview video that discusses formalization of mathematics in Lean.

François G. Dorais (Jun 17 2024 at 03:24):

I've had a need for binary heap recently. I will move it to Batteries. WIP batteries#849

Kim Morrison (Jun 17 2024 at 04:57):

Looks good to me.

Bulhwi Cha (Jun 17 2024 at 08:54):

@Kim Morrison Can I ask what's the motivation behind the planned overhaul of the library work on strings?

Bulhwi Cha (Jun 17 2024 at 08:55):

I'm also curious about when the Lean core team began considering it.

Henrik Böving (Jun 17 2024 at 09:31):

Bulhwi Cha said:

I'm also curious about when the Lean core team began considering it.

The refactoring of String is on the way only very recently, the person that's working on it began approximately 2 weeks ago as you can see from the PR tracker of lean4

Eric Wieser (Jun 17 2024 at 09:41):

lean4#4461?

Eric Wieser (Jun 17 2024 at 09:43):

Is this likely to cause @Mario Carneiro's lean4#3963 to bitrot, or is this motivation to get that PR in quickly?

Kim Morrison (Jun 17 2024 at 10:37):

I think there's no reason we can proceed with that in the meantime.

tjf801 (Jun 17 2024 at 14:42):

(deleted)

tjf801 (Jun 17 2024 at 14:42):

Bulhwi Cha said:

tjf801 I'd like to hear what you think about this possible overhaul.

Honestly? I don't really know how String could really end up being that much different, and for a lot of the lemmas in Batteries#809, I think they'd still be true for any reasonable definition of String. And to be honest, I really only was even proving these lemmas because I kept needing them to verify pieces of code I wrote. So if people (who are probably much smarter and more experienced than me!) can remove some of the crust around Strings and make them more usable for verification stuffs, I'm all for it! But I think that in the meantime, having lemmas to make the current solutions usable is also not necessarily a bad thing.

Bulhwi Cha (Jun 17 2024 at 16:19):

tjf801 said:

Honestly? I don't really know how String could really end up being that much different, and for a lot of the lemmas in Batteries#809, I think they'd still be true for any reasonable definition of String.

I agree with what you said, but the proofs of these lemmas will likely change. When I ported Mathlib.Data.String.Basic from Lean 3 to Lean 4 last year, I had to rewrite the proof of the theorem String.lt_iff_toList_lt from scratch. This was due to the new implementation of the type String.Iterator.

Eric Wieser said:

Why did you choose to rewrite it from scratch?

Bulhwi Cha said:

The previous proof didn't work, and its comment said, "TODO This proof probably has to be completely redone."

tjf801 (Jun 17 2024 at 16:53):

Bulhwi Cha said:

I agree with what you said, but the proofs of these lemmas will likely change. When I ported Mathlib.Data.String.Basic from Lean 3 to Lean 4 last year, I had to rewrite the proof of the theorem String.lt_iff_toList_lt from scratch. This was due to the new implementation of the type String.Iterator.

Yeah that's fair. But I feel like with proofs (and programming in general) once you've done something once, doing it again from scratch is much faster. At least in my experience

Bulhwi Cha (Jun 17 2024 at 17:24):

I've spent too much time proving String.splitOn_of_valid, which hasn't been reviewed for two months. I want to wait until the overhaul is finished.

Bulhwi Cha (Jun 17 2024 at 17:39):

By the way, String.lt_iff_toList_lt took me more than 30 hours to prove, and String.splitOn_of_valid took about 170 hours.

Bulhwi Cha (Jun 18 2024 at 02:23):

Bulhwi Cha said:

@Kim Morrison Can I ask what's the motivation behind the planned overhaul of the library work on strings?

I guess I have to wait for the design document.

Matthew Ballard (Aug 27 2024 at 08:45):

What is the status of the overhaul of String? I've already started smoothing some edges for use in my class and am wondering if I should spend any energy on PRing them.

Kim Morrison (Aug 27 2024 at 11:19):

Unfortunately overhauling String by the FRO has fallen back to medium priority, and in particular no one is actively working on it now.

Bulhwi Cha (Aug 27 2024 at 23:21):

@Kim Morrison When did the FRO decide to postpone the overhaul?

Kim Morrison (Aug 28 2024 at 02:47):

~3-4 weeks ago.

Bulhwi Cha (Aug 28 2024 at 04:25):

I wish the FRO had made a small announcement about this decision.

Bulhwi Cha (Sep 03 2024 at 04:54):

@Kim Morrison Have you been moving lemmas about Nat or List from my pull requests to the Lean core library?

Kim Morrison (Sep 03 2024 at 05:51):

I don't think so -- I don't recall looking at your PRs at all, sorry.

Kim Morrison (Sep 03 2024 at 05:51):

I've only had time for maintenance and reorganizing in Batteries for some time, I haven't been looking at PRs with any new material.

Bulhwi Cha (Sep 03 2024 at 05:55):

Then someone must've proven lemmas that I'd already done proving.

Bulhwi Cha (Sep 03 2024 at 06:00):

How can we prevent people from proving the same theorems so that they won't duplicate work already done?

Bulhwi Cha (Sep 03 2024 at 06:10):

I wish we had more Batteries maintainers who would review stale pull requests.

Bulhwi Cha (Sep 03 2024 at 06:18):

Here are my PRs awaiting reviews:

Ruben Van de Velde (Sep 03 2024 at 06:33):

I would suggest adding your new lemmas to mathlib instead

Bulhwi Cha (Sep 03 2024 at 06:42):

I'll think about it.

Bulhwi Cha (Sep 03 2024 at 06:45):

Bulhwi Cha said:

Then someone must've proven lemmas that I'd already done proving.

It turned out that Kim proved List.isPrefixOf_iff_prefix on July 26:

@[simp] theorem isPrefixOf_iff_prefix [BEq α] [LawfulBEq α] {l₁ l₂ : List α} :
    l₁.isPrefixOf l₂  l₁ <+: l₂ := by
  induction l₁ generalizing l₂ with
  | nil => simp
  | cons a l₁ ih =>
    cases l₂ with
    | nil => simp
    | cons a' l₂ => simp [isPrefixOf, ih]

Bulhwi Cha (Sep 03 2024 at 06:50):

Bulhwi Cha said:

By the way, I proved that List.IsPrefix is equivalent to List.isPrefixOf, and List.IsSuffix is equivalent to List.isSuffixOf. It'll be nice if Lean can automatically generate lemmas for isPrefixOf and isSuffixOf whenever I prove those for IsPrefix and IsSuffix. Do you think this is a good idea? If so, how do we do that?

I proved List.isPrefixOf_iff_IsPrefix on April 16:

@[simp] theorem isPrefixOf_iff_IsPrefix [BEq α] [LawfulBEq α] {l₁ l₂ : List α} :
    isPrefixOf l₁ l₂  l₁ <+: l₂ := by
  match l₁, l₂ with
  | [],   _  => simp [nil_prefix]
  | _::_, [] => simp
  | a::as, b::bs =>
    constructor
    · intro h
      simp only [isPrefixOf, Bool.and_eq_true, beq_iff_eq] at h
      let t, ht := isPrefixOf_iff_IsPrefix.mp h.2
      exists t
      simpa [h.1]
    · intro t, ht
      simp only [cons_append, cons.injEq] at ht
      have hpf : as <+: bs := t, ht.2
      simpa [isPrefixOf, ht.1] using isPrefixOf_iff_IsPrefix.mpr hpf

Bulhwi Cha (Sep 03 2024 at 06:52):

Their proof is much more elegant than mine, so I think the duplicated work was worth it in this case.

Bulhwi Cha (Oct 14 2024 at 12:36):

Kim Morrison said:

Unfortunately overhauling String by the FRO has fallen back to medium priority, and in particular no one is actively working on it now.

Do you think it's time for me to resume work on proving the remaining theorems on the TODO list in Batteries.Data.String.Lemmas? It seems Batteries#743 alone is too huge for the maintainers to review, so I might as well wait for the PR to get merged.

Kim Morrison (Oct 15 2024 at 00:52):

Personally, I don't think there is currently much value in adding theorems about String, unless you are have an application in mind.

Bulhwi Cha (Oct 15 2024 at 02:07):

The unexpected value of proving specification theorems about operations of a datatype was finding bugs in the definitions of these operations.

Bulhwi Cha said:

I've just proved the first theorem: String.splitOn_of_valid. See std#743. It took me around 170 hours (December 2023–April 2024) to prove it.

However, it takes a lot of time when the definitions haven't been 'optimized' for proving theorems about them. I'd like to see the overhaul of the String library first.

Bulhwi Cha (Oct 15 2024 at 02:41):

I also want to mention that the incorrect definition of String.splitOnAux had been around since March 23, 2019, until I suggested fixing it on April 3: https://leanprover.zulipchat.com/#narrow/stream/270676-lean4/topic/current.20definition.20of.20.60String.2EsplitOn.60.20is.20incorrect/near/430930535.

Bulhwi Cha (Oct 15 2024 at 02:47):

I think it's desirable that at least one person in this community should inspect the old code to find and fix bugs.

Kim Morrison (Oct 15 2024 at 03:23):

Yes, that's true. But I'd prefer that work in this area is driven by applications (e.g. code that does large scale text processing).

Bulhwi Cha (Oct 15 2024 at 03:49):

Maybe that's another reason I should continue learning programming after I finish my Lean mentorship program.

Bulhwi Cha (Oct 15 2024 at 04:30):

Nevertheless, wouldn't software developers prefer a programming language whose core library had already been examined thoroughly?

Kim Morrison (Oct 15 2024 at 06:48):

Sorry yes, of course it's fine to use String for practice.


Last updated: May 02 2025 at 03:31 UTC