Zulip Chat Archive

Stream: Program verification

Topic: Formalizing and verifying a regular experssion engine

pandaman (Aug 04 2024 at 06:49):

Hi everyone,

I'm the author of lean-regex, a library that brings regular expression searches to Lean and verifies their correctness. Recently, I've been thinking about how to formalize and verify the correctness of regex searches with capture groups and thought this problem might interest the community as well. While I don't have specific questions yet, I would appreciate any thoughts, suggestions, or references on this topic.

pandaman (Aug 04 2024 at 06:49):

Capture groups introduce new kinds of complexities to the semantics of regular expressions, which are not typically covered in textbook. In particular:

The regex semantics need to track these positions as an "output" of the predicate.
Capture groups require tracking positions in the original string, which is not very compositional but a global property.

To address this, I developed an Expr.captures semantics. I was able to prove a correspondence between this new semantics and the existing semantics without capture groups, which I verified for correctness before. So I believe this is a good starting point for further verification using this semantics.

pandaman (Aug 04 2024 at 06:50):

The regex implementation utilizes NFA (Non-deterministic Finite Automaton) under the hood. So, the next step is to formalize the semantics of capture groups in terms of transitions within the compiled NFA. For example, a regex (e)ₙ (which means n'th capture group wrapping e) is compiled to NFA as follows:

| .save (2 * n) | -> | e | -> | .save (2 * n + 1) |

.save is a state that saves the current position in the string, with two .save states recording the start and end positions. The challenge is that the NFA semantics need to handle mutations made by the .save states. I'm wondering whether Separation Logic could help manage this.

pandaman (Aug 04 2024 at 06:53):

Moreover, my regex definition allows assigning the same n to different parts of the regex, and when verifying the correctness of the NFA semantics, I would need to constrain the regexes to those that don't have such conflicts. What I'm thinking of is using Separation Logic like this:

-- `owns e` means the set of capture groups owned by `e` (without overlaps)
def owns : Expr → Nat → SepProp
-- example
owns (.concat e₁ e₂) := owns e₁ * owns e₂
owns (.group n e) := {n} * owns e

I haven't used Separation Logic before, so I'm unsure if this approach is feasible or how to integrate it with Lean. I definitely need to learn more about it and would greatly appreciate your feedback.

Ira Fesefeldt (Aug 05 2024 at 08:08):

First of all, very cool project!

Secondly, maybe some questions about the Separation Logic part, because I do not quite understand it yet and they may help with the discussion:

Why does owns have this type? I would have expected something like Expr → Set Nat → Prop --- where you could also call Set Nat → Prop just CaptureProp or something like this. Set Nat is then your object to separate --- i.e. the list of positions the expressions captured and disjoint union on sets is the disjoint operator used for the partial commutatitive monoid. But I am unsure what the Nat is supposed to do?
I am not sure that Separation Logic is particular useful for mutation, but it is useful for framing --- that is beeing able to ignore certain parts of your object under only mild side-conditions. I am happy for anyone correcting me with this one.
Separation Logic is usually facilitated in some high-level Prop-transformer semantics. Thus it usually requires you to establish these high-level semantics in the first place. Is that something that is actually worth the effort? For program verification this is useful, as we require new proofs for each program --- thus high-level proofs simplify the process. Do you think something similar also holds for your capture groups? Or how do you actually planned to facilitate Separation Logic?

pandaman (Aug 05 2024 at 10:39):

Hi. Thank you for sharing your thoughts and questions!

Why does owns have this type? I would have expected something like Expr → Set Nat → Prop --- where you could also call Set Nat → Prop just CaptureProp or something like this. Set Nat is then your object to separate --- i.e. the list of positions the expressions captured and disjoint union on sets is the disjoint operator used for the partial commutatitive monoid. But I am unsure what the Nat is supposed to do?

Nat corresponds to the indices of capture groups, so if we have a regex like (ab)c(def), (ab) owns {1} and (def) owns {2} (0 is owned by the entire regex). And yeah, I'm not sure how exactly this stuff will be represented using SL. Thank you for the suggestion!

I am not sure that Separation Logic is particular useful for mutation, but it is useful for framing --- that is beeing able to ignore certain parts of your object under only mild side-conditions. I am happy for anyone correcting me with this one.

I think you are right and I could have explained a bit better why I want framing. My regex search implementation will keep track of positions inside the haystack for each capture groups the search has encountered. So for example, when reasoning about the search for regex (e), the search for the inner regex e shouldn't affect the positions tracked by the outer capture group, which I think framing would shine.

Separation Logic is usually facilitated in some high-level Prop-transformer semantics. Thus it usually requires you to establish these high-level semantics in the first place. Is that something that is actually worth the effort? For program verification this is useful, as we require new proofs for each program --- thus high-level proofs simplify the process. Do you think something similar also holds for your capture groups? Or how do you actually planned to facilitate Separation Logic?

My ultimate goal is to prove my regex search function is correct. It should return .some if and only if there is a match in the string and the returned capture groups should store the corresponding positions. I'll need to develop a theory about capture groups that will be used during the program verification, and that's why I started looking into Separation Logic.

Jannis Limperg (Aug 05 2024 at 12:51):

Ira Fesefeldt (Aug 05 2024 at 14:21):

That sounds indeed more reasonable and also exciting!
If you are not aware, Lean offers an ITP for Separation Logic here: https://github.com/leanprover-community/iris-lean
Despite its name, it currently features mostly MoSEL functionalities and not Iris, but can be nice to have when proving theorems in a Separation Logic on (a subset of) the given BI structure there.

Siddhartha Gadgil (Aug 06 2024 at 06:56):

I often want to use regex's in Lean, so thanks for this. I just had a request on a practical point - can you tag a version with each stable lean toolchain?

pandaman (Aug 10 2024 at 05:56):

@Siddhartha Gadgil Hi. I added v4.10.0 tag and will add tags for later versions. Thank you the suggestion.

pandaman (Aug 10 2024 at 06:21):

I have been thinking about how to formalize the NFA matching. In my previous verification attempt, the behavior is formalized in terms of graphs. For example, I defined a predicate that represents a path in the NFA. However, extending these graph concepts with capture groups seem too complicated to carry.

My NFA complilation is heavily based on Russ Cox's famous regex articles, and each state of an NFA can be regarded as an instruction of a virtual machine. So I realized that we can define an operational semantics for the virtual machine as an imperative language with side effects to arrays (corresponding to capture groups) instead of using the vocabulary from graph theory. Moreover, reasoning about the NFAs as programs can be carried using Separation Logic as the axiomatic semantics, which feels a good fit to me.

pandaman (Aug 10 2024 at 06:27):

I guess my next step is to read "Separation Logic Foundations" and consider if we can port it to Lean (with iris-lean or whatever)

Ira Fesefeldt (Aug 10 2024 at 06:34):

When defining the operational semantics, keep in mind that they should satisfy a certain framing condition. For example, in heap-manipulating programs, (non-symbolic) deterministic allocation doesn't satisfy the framing condition. That's why you always see a non-deterministic (or a symbolic) allocation when using separation logic.

pandaman (Aug 10 2024 at 06:39):

Interesting. I don't think the operational semantics is going to have dynamic allocations. Rather, it will operate on an array with fixed size (each cell corresponding to start/end positions of capture groups)

pandaman (Aug 10 2024 at 06:40):

Maybe it's more like assigining to variables s1, e1, s2, e2, ....

Ira Fesefeldt (Aug 10 2024 at 06:45):

I don't know how this will translate. It is something I stumbled upon during the last years while defining operational semantics myself and considering eliminating non-determinism for different reasons. After failing I got more sensitive to this problem.

Luisa Cicolini (Aug 11 2024 at 06:25):

That's a very cool project!! Potentially interesting: https://dl.acm.org/doi/abs/10.1145/3591287

Last updated: May 02 2025 at 03:31 UTC