Zulip Chat Archive

Stream: std4

Topic: Requirements for regex library


Chris Wong (Sep 25 2023 at 02:58):

Following on from this discussion –
https://leanprover.zulipchat.com/#narrow/stream/287929-mathlib4/topic/github.20permission/near/392839244

There are many options for regex libraries. To help choose, can we list our requirements for such a library?
So far I have:

  1. Written in a language already used in Std (C, C++, Lean)
  2. Supports capture clauses and find-and-replace
  3. Resistant to denial-of-service attacks

Another consideration is compile times. As an example, the CI for RE2 takes ~ 4 minutes, though I'm not sure how much of that is build vs test. Is that a price we're willing to pay for Std?

Mario Carneiro (Sep 25 2023 at 06:20):

If it's not pure lean and implemented in std, I would prefer it to be a separate library. There isn't any particular need to use a regex library in std itself, and I don't think it is a strong requirement that it be an exported functionality of std, especially if it is a binding to an external library like re2.

Ioannis Konstantoulas (Sep 25 2023 at 12:07):

Mario Carneiro said:

If it's not pure lean and implemented in std, I would prefer it to be a separate library. There isn't any particular need to use a regex library in std itself, and I don't think it is a strong requirement that it be an exported functionality of std, especially if it is a binding to an external library like re2.

As a heavy user of regexes, I would love to see a Lean implementation. But writing a module that is both performant and has all the modern niceties is a big challenge; can pure Lean code do it? For instance, what would be the steps to rewrite BurntSushi's ripgrep in Lean?

Mario Carneiro (Sep 25 2023 at 12:13):

of course it's a challenge and a lot of work, and I wouldn't be surprised if there is a performance hit because you need a really good inner loop and it's not clear to me that lean can express it

Chris Wong (Sep 25 2023 at 13:38):

There's a three-way trade off here – [a] easy to implement, [b] fast most of the time, [c] doesn't blow up
Backtracking has [a] and [b]
NFA has [a] and [c]
Hybrid NFA + DFA cache has [b] and [c]

Chris Wong (Sep 26 2023 at 00:37):

If you'd like to implement it in pure Lean, Antimirov derivatives look promising:
https://semantic-domain.blogspot.com/2013/11/antimirov-derivatives-for-regular.html

ocaml-re uses derivatives too, but to lazily build a DFA :astonished:
https://news.ycombinator.com/item?id=35055667

Chris Wong (Sep 27 2023 at 13:50):

I ported that blog post (minus the DFA construction) to Lean here - https://gist.github.com/lambda-fairy/efc6657e34d977b1eeba27495e802301
So a regex library in 100 lines would be possible, if slow.

François G. Dorais (Sep 28 2023 at 00:06):

FWIW, my students and I are currently working on a provably correct regex library, including an efficient version using boolean matrix algebra. This is an undergraduate research project so progress is slow for educational reasons. I have zero objection to competing projects and I encourage these! If you can, please send me a notification of your work so we can include it in our benchmarks.

Chris Wong (Sep 28 2023 at 02:09):

Ooh, what do you mean by boolean matrix algebra? Is that the "everything is the same algorithm" thing?
http://r6.ca/blog/20110808T035622Z.html


Last updated: Dec 20 2023 at 11:08 UTC