Zulip Chat Archive
Stream: std4
Topic: Requirements for regex library
Chris Wong (Sep 25 2023 at 02:58):
Following on from this discussion –
https://leanprover.zulipchat.com/#narrow/stream/287929-mathlib4/topic/github.20permission/near/392839244
There are many options for regex libraries. To help choose, can we list our requirements for such a library?
So far I have:
- Written in a language already used in Std (C, C++, Lean)
- Supports capture clauses and find-and-replace
- Resistant to denial-of-service attacks
Another consideration is compile times. As an example, the CI for RE2 takes ~ 4 minutes, though I'm not sure how much of that is build vs test. Is that a price we're willing to pay for Std?
Mario Carneiro (Sep 25 2023 at 06:20):
If it's not pure lean and implemented in std, I would prefer it to be a separate library. There isn't any particular need to use a regex library in std itself, and I don't think it is a strong requirement that it be an exported functionality of std, especially if it is a binding to an external library like re2.
Ioannis Konstantoulas (Sep 25 2023 at 12:07):
Mario Carneiro said:
If it's not pure lean and implemented in std, I would prefer it to be a separate library. There isn't any particular need to use a regex library in std itself, and I don't think it is a strong requirement that it be an exported functionality of std, especially if it is a binding to an external library like re2.
As a heavy user of regexes, I would love to see a Lean implementation. But writing a module that is both performant and has all the modern niceties is a big challenge; can pure Lean code do it? For instance, what would be the steps to rewrite BurntSushi's ripgrep
in Lean?
Mario Carneiro (Sep 25 2023 at 12:13):
of course it's a challenge and a lot of work, and I wouldn't be surprised if there is a performance hit because you need a really good inner loop and it's not clear to me that lean can express it
Chris Wong (Sep 25 2023 at 13:38):
There's a three-way trade off here – [a] easy to implement, [b] fast most of the time, [c] doesn't blow up
Backtracking has [a] and [b]
NFA has [a] and [c]
Hybrid NFA + DFA cache has [b] and [c]
Chris Wong (Sep 26 2023 at 00:37):
If you'd like to implement it in pure Lean, Antimirov derivatives look promising:
https://semantic-domain.blogspot.com/2013/11/antimirov-derivatives-for-regular.html
ocaml-re uses derivatives too, but to lazily build a DFA :astonished:
https://news.ycombinator.com/item?id=35055667
Chris Wong (Sep 27 2023 at 13:50):
I ported that blog post (minus the DFA construction) to Lean here - https://gist.github.com/lambda-fairy/efc6657e34d977b1eeba27495e802301
So a regex library in 100 lines would be possible, if slow.
François G. Dorais (Sep 28 2023 at 00:06):
FWIW, my students and I are currently working on a provably correct regex library, including an efficient version using boolean matrix algebra. This is an undergraduate research project so progress is slow for educational reasons. I have zero objection to competing projects and I encourage these! If you can, please send me a notification of your work so we can include it in our benchmarks.
Chris Wong (Sep 28 2023 at 02:09):
Ooh, what do you mean by boolean matrix algebra? Is that the "everything is the same algorithm" thing?
http://r6.ca/blog/20110808T035622Z.html
Last updated: Dec 20 2023 at 11:08 UTC