Zulip Chat Archive

Stream: lean4

Topic: Parsec parser for decoding Uri escapes


Chris Lovett (Aug 05 2022 at 18:45):

Any ideas why the following crashes the Lean Server with a stack overflow? I'm trying to decode Uri escaping using Parsec .

def textData : Parsec Char := satisfy fun c =>
  '%'  c

def hexDigitToNat (c : Char) : Nat :=
  if '0'  c  c  '9' then c.toNat - '0'.toNat
  else if 'a'  c  c  'f' then c.toNat - 'a'.toNat + 10
  else c.toNat - 'A'.toNat + 10

def digitsToNat (base : Nat) (digits : Array Nat) : Nat :=
  digits.foldl (λ r d => r * base + d) 0

@[inline]
partial def exactlyNCore (p : Parsec α) (n : Nat) (acc : Array α) : Parsec $ Array α :=
  if n > 0 then do
    exactlyNCore p (n-1) (acc.push $ p)
  else
    pure acc

@[inline]
def exactlyN (p : Parsec α) (n : Nat) : Parsec $ Array α := exactlyNCore p n #[]

def escaped : Parsec String := do
  skipString "%"
  let c?  peek?
  if c? = some '%' then
    skip
    return "%"
  let charCode 
    digitsToNat 16 <$> exactlyN (hexDigitToNat <$> hexDigit) 2
  return (Char.ofNat charCode).toString

def unescaped : Parsec String := manyChars textData
def uriChunk : Parsec String := escaped <|> unescaped
def parsedUri : Parsec (Array String) := many uriChunk

#eval match parsedUri "file:///d%3a/temp/test.xml".mkIterator with
  | Parsec.ParseResult.success _ res => Except.ok res
  | Parsec.ParseResult.error it err  => Except.error s!"offset {it.pos}: {err}"

and can this code be simplified?

Mario Carneiro (Aug 05 2022 at 18:52):

You have a many uriChunk where uriChunk matches the empty string via unescaped

Mario Carneiro (Aug 05 2022 at 18:53):

so the many parser will end up matching an infinite number of empty strings

Henrik Böving (Aug 05 2022 at 18:54):

Are there parser combinator libraries from other dependently typed languages out there that can guarantee termination if they are used in a correct way with subtypes or similar?

Chris Lovett (Aug 05 2022 at 18:58):

Mario Carneiro said:

so the many parser will end up matching an infinite number of empty strings

Thanks for the hint, now how do I fix it? I thought textData was greedy and consumed as much as it can, why would it ever match empty, except when it hits '%' in which case escaped should take over... ?

Mario Carneiro (Aug 05 2022 at 18:59):

what if there is no text remaining?

Mario Carneiro (Aug 05 2022 at 18:59):

it will consume as much as it can, which is nothing

Mario Carneiro (Aug 05 2022 at 19:00):

It is possible to make many not loop in this case by checking that the contained parser matches at least one character

Mario Carneiro (Aug 05 2022 at 19:01):

The simplest fix in this case would be to use something like many1Chars in unescaped

Mario Carneiro (Aug 05 2022 at 19:02):

Henrik Böving said:

Are there parser combinator libraries from other dependently typed languages out there that can guarantee termination if they are used in a correct way with subtypes or similar?

I believe Gabriel actually proved termination for the lean 3 parsec IIRC

Mario Carneiro (Aug 05 2022 at 19:03):

Yakov Pechersky also did some work on provable parsers in mathlib

Mario Carneiro (Aug 05 2022 at 19:04):

Indeed, I just checked and https://github.com/leanprover-community/lean/blob/master/library/data/buffer/parser.lean contains only terminating functions

Yakov Pechersky (Aug 05 2022 at 19:05):

Mario, what's you definition of terminating?

Henrik Böving (Aug 05 2022 at 19:05):

Functions that the Lean compiler accepts without stuff like partial for me

Mario Carneiro (Aug 05 2022 at 19:06):

yeah that's what I mean

Mario Carneiro (Aug 05 2022 at 19:06):

I guess the proof is a bit of a cheat because it uses a fuel parameter (even though the fuel parameter is guaranteed to be large enough, that's not proved)

Yakov Pechersky (Aug 05 2022 at 19:07):

As a side note, the TC-based parser properties led to the "scary" graph that was shared at lftcm

Mac (Aug 05 2022 at 19:08):

I imagine such a scheme would not work for parsers that want to match the empty string (i.e., work in a similar manner to skip/ check* functions in the Lean parser).

Yakov Pechersky (Aug 05 2022 at 19:09):

docs#parser.bounded

Mario Carneiro (Aug 05 2022 at 19:10):

@Mac If you are doing the check in many itself I don't see how there could be a false positive unless the parser function is not pure

Yakov Pechersky (Aug 05 2022 at 19:10):

contrast with docs#parser.unfailing, related to docs#parser.conditionally_unfailing

Mario Carneiro (Aug 05 2022 at 19:10):

if you call p and it returns success and does not advance the cursor, then calling p again will result in the same thing and you will definitely loop

Yakov Pechersky (Aug 05 2022 at 19:11):

many (pure x)

Chris Lovett (Aug 05 2022 at 19:16):

Mario Carneiro said:

The simplest fix in this case would be to use something like many1Chars in unescaped

Thanks, many1Chars fixed it.

Mac (Aug 05 2022 at 19:20):

Mario Carneiro said:

Mac If you are doing the check in many itself I don't see how there could be a false positive unless the parser function is not pure

Yes, but this assumes that one wants many to fail at the end of the input. If one has many acceptsEmptyString >> optional otherParser, one ideally does not want many acceptsEmptyString to fail because it just reach the end of the input. Even break upon reaching the end does not solve this, as it not necessarily clear how many invocations of acceptsEmptyString one wants (especially if the parser acceptsEmptyString is not pure).

Mario Carneiro (Aug 05 2022 at 19:22):

many acceptsEmptyString would not fail, it would return the empty list and parse nothing

Mario Carneiro (Aug 05 2022 at 19:22):

and the exit condition is not reaching the end, it is parsing 0 characters

Mario Carneiro (Aug 05 2022 at 19:23):

if the parser is not pure then you are hosed

Chris Lovett (Aug 05 2022 at 19:37):

dangerang, this code I borrowed from Lean.Xml.Parser.hexDigitToNat doesn't work, can anyone spot the problem?

#eval Lean.Xml.Parser.CharRef "&#x3a;".mkIterator -- ':'
#eval Lean.Xml.Parser.CharRef "&#x3b;".mkIterator -- error expected ';'

is there something wrong with if 'a' ≤ c ∧ c ≤ 'f' ? The weird thing is this works:

#eval Lean.Xml.Parser.hexDigitToNat 'b'   -- 11

Mario Carneiro (Aug 05 2022 at 19:46):

It's a bug in core

Mario Carneiro (Aug 05 2022 at 19:47):

https://github.com/leanprover/lean4/blob/deafc315c783ac8069508f72b2815f2713d06f6a/src/Lean/Data/Parsec.lean#L126-L128

def hexDigit : Parsec Char := attempt do
  let c  anyChar
  if ('0'  c  c  '9')
    ('a'  c  c  'a')
    ('A'  c  c  'A') then return c else fail s!"hex digit expected"

Chris Lovett (Aug 05 2022 at 19:48):

hilarious, I filed a bug https://github.com/leanprover/lean4/issues/1429 and tested the fix https://github.com/leanprover/lean4/pull/1430, and it works.

Reid Barton (Aug 05 2022 at 19:48):

A correct implementation for recognizing undecimal digits, though

Kevin Buzzard (Aug 06 2022 at 09:18):

11 is the new 16 -- oh the fix is already merged! I love the way things move fast here.

David Thrane Christiansen (Aug 06 2022 at 19:02):

Henrik Böving said:

Are there parser combinator libraries from other dependently typed languages out there that can guarantee termination if they are used in a correct way with subtypes or similar?

There's at least https://github.com/gallais/agdarsec to look at, along with its ports to Coq and Idris.

Alexandre Rademaker (Oct 18 2023 at 19:22):

Can the https://github.com/leanprover-community/lean/blob/master/library/data/buffer/parser.lean be used to construct a parser for any DSL?

My goal is to write a parser for a string initially provided by an external tool. The string need to be parsed into a inductive data type that later can be used to produce Lean code. The str is a natural language sentence. The data type is the semantic representation of each and the conversion from it to Lean dependent type language will possibly need the meta features of Lean.

Scott Morrison (Oct 18 2023 at 21:52):

@Alexandre Rademaker you linked to something from Lean 3. Not only is this obsolete, but everything about parsing is new (and better) in Lean 4. It was written explicitly with supporting DSLs in mind.

Alexandre Rademaker (Oct 18 2023 at 22:19):

Oh yes, my bad about the wrong reference. Still, my question is about the lean4 support for DSL parsing being the right match for my use case.

I am getting a string from IO and need to parse into a term, instance of an inductive data type. For now, it is not about parsing a DSL written directly in the lean file … but maybe the two scenarios are actually more similar than I can see.

Joachim Breitner (Oct 19 2023 at 06:31):

A good way to use Lean's syntax declarations to describe parsers would be great! The best approach I know of so far is https://github.com/tydeu/lean4-partax, maybe you can try that?

Alexandre Rademaker (Oct 19 2023 at 12:18):

No sure if i got your point. Sorry. As I said, I am not parsing Lean code. Say I have a sentence in natural language

“The cat is write”

I sent this string to a process via IO channel and I got back another string with a serialization of a data structure representing the sentence semantics/syntax. This structure can be represented by an inductive data type and I want to manipulate that structure before eventually convert it to Lean definitions.

Scott Morrison (Oct 19 2023 at 23:38):

Sorry, I think we are lacking tutorial style examples on writing parsers in Lean. What you want is definitely supported and shouldn't be hard --- but not my expertise.

The closest I know of to a tutorial is https://github.com/leanprover-community/lean4-samples/tree/main/CSVParser, but I'm not sure how good it is.

François G. Dorais (Oct 20 2023 at 12:22):

In addition to Lean.Data.Parsec, consider these alternatives:

Abhijith Settipalli (Oct 20 2023 at 21:37):

Related question, I've been using Lean.Data.Parsec to parse C. Is there a TLDR of why one would pick one over the other? From a cursory glance it looks like both work on token streams rather than raw strings and megaparsec has more facilities for error reporting. On the other hand Parsec is just less of a conceptual overhead (175 loc and no monad transformers! :smile: ). I'm also aware of https://github.com/opencompl/C-parsing-for-Lean4 which uses lean macros but I'd rather stay away from metaprogramming if I can help it.

François G. Dorais (Oct 22 2023 at 03:56):

Every Lean 4 package out there is just a few months old, so there is no TLDR at this time. As for Lean.Data.Parsec vs alternatives, there is one clear issue that Lean core will not add new features unless they affect Lean core itself, the two alternatives I mentioned are open to feature requests.

If you need to flip a coin then I would say use mine... but that is evidently a biased coin!

Alexandre Rademaker (Nov 05 2023 at 23:50):

@François G. Dorais, the BNF of the data I want to parse is https://github.com/delph-in/docs/wiki/MrsRFC#simple; I started with Lean.Data.Parsec trying to keep dependencies as minimal as possible, but I wonder if the alternatives provide any real advantage here.

François G. Dorais (Nov 06 2023 at 05:09):

@Alexandre Rademaker For your use case, I can't see any major advantage either way. What would make a difference for you is regex support. This is coming and will be available for my parser package but I don't have a specific timeline for that right now.

Alexandre Rademaker (Nov 10 2023 at 23:45):

The escaped string parser defined in https://github.com/leanprover-community/lean4-samples/blob/main/CSVParser/CSVParser.lean#L29-L32 is not actually working with a escaped double quote inside a string, b is ignored:

#eval escaped "\"a\\\"b\"".mkIterator

I believe I created a more robust escaped string, but I'm unsure if it could be more straightforward. Do I really need the auxiliary function QSA?

def QSA : Parsec String := do
 let a  pchar '\\'
 let b  (satisfy $ fun c => c  '\n')
 let c  many $ satisfy $ fun c => ['"', '\\'].notElem c
 return (#[a,b] ++ c).asString

def QuotedString : Parsec String := do
  let g := many $ satisfy $ fun c => ['"', '\\'].notElem c
  let a1  pchar '"'
  let a2  g
  let a3  many QSA
  let a4  pchar '"'
  return (a1.toString ++ a2.asString ++
    (Array.foldl (λ s c => s ++ c) "" a3) ++ a4.toString)

#eval QuotedString "\"a\\\"b\"".mkIterator

The regex used as reference is /"[^"\\]*(?:\\.[^"\\]*)*"/

François G. Dorais (Dec 18 2023 at 10:16):

@Alexandre Rademaker As promised, I've started adding some rudimentary regex support to lean4-parser. The regex compiler is extremely limited right now but I'm hoping to change that soon.


Last updated: Dec 20 2023 at 11:08 UTC