Zulip Chat Archive

Stream: lean4

Topic: Compile-time large data


Xubai Wang (Feb 19 2022 at 06:18):

Lean gives me a Stack overflow detected. Aborting. error when I try to load the unicode data file into a HashMap at compile time.

The data file has following semicolon separated syntax, with 34626 lines in all.

001F;<control>;Cc;0;S;;;;;N;INFORMATION SEPARATOR ONE;;;;
0020;SPACE;Zs;0;WS;;;;;N;;;;;
0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;

I use the following code to parse it.

syntax (name := includeDataFile) "include_data_file" str : term

@[termElab includeDataFile] def includeDataFileImp : TermElab := λ stx expectedType? => do
    let str := stx[1].isStrLit?.get!
    let path := FilePath.mk str
    if path.pathExists then
      if path.isDir then
        throwError s!"{str} is a directory"
      else
        let mut s  `(Std.HashMap.empty)
        for line in ( FS.lines path).filter (·  "") do
          let splits := line.splitOn ";" |>.map String.trim
          s  `(($s).insert $(Syntax.mkStrLit (splits.get! 0)) $((Syntax.mkStrLit (splits.get! 2))))
        elabTerm s none
    else
      throwError s!"\"{str}\" does not exist as a file"

set_option maxRecDepth 50000 in
def generalCategoryMap : HashMap String String := include_data_file "./UCD/UnicodeData.txt"

Xubai Wang (Feb 19 2022 at 06:21):

The code builds successfully when I provide a smaller file, and the error comes from elabTerm. Are there any hacks that can work round the stack overflow?

Mario Carneiro (Feb 19 2022 at 06:54):

I assume you already recognize that the term you are constructing is linear depth, since it has the form empty |>.insert k v |>.insert k v |>.insert k v ...

Mario Carneiro (Feb 19 2022 at 06:56):

Instead, you should use [(k, v), ..., (k', v')].foldl (fun m (k, v) => m.insert k v) empty, since the [a, ..., z] syntax already elaborates into a nice balanced tree for logarithmic depth

Xubai Wang (Feb 19 2022 at 08:11):

It partially works, but now it produces another error.

error: PANIC at Lean.Expr.mkData Lean.Expr:136:44: bound variable index is too big
PANIC at Lean.Expr.mkData Lean.Expr:136:44: bound variable index is too big

Stack overflow detected. Aborting.

Xubai Wang (Feb 19 2022 at 08:12):

Postponing parsing to run time may be a good workaround for me now.

Mario Carneiro (Feb 19 2022 at 08:20):

Oh, now I'm curious what the bound variable index limit is (apparently 2^16-1 = 65535)

Mario Carneiro (Feb 19 2022 at 08:21):

You can also make your own balanced tree of operators (I think the one for lists uses let which introduces bound variables)

Xubai Wang (Feb 19 2022 at 08:46):

https://github.com/xubaiw/Unicode.lean/tree/compile-time

Mario Carneiro said:

You can also make your own balanced tree of operators (I think the one for lists uses let which introduces bound variables)

I'll try that when I learn more about lean metaprogramming.

Mario Carneiro (Feb 19 2022 at 09:05):

This seems to work okay, but it's now hitting a timeout in isDefEq

import Lean
open Std Lean Elab Term System IO
syntax (name := includeDataFile) "include_data_file" str : term

abbrev Diff := HashMap String String  HashMap String String
def map₁ (k v : String) : Diff := λ m => m.insert k v
def append (m₁ m₂ : Diff) : Diff := fun m => m₂ (m₁ m)
def mk (d : Diff) : HashMap String String := d Std.HashMap.empty

partial def mkTree (f : Nat  String × String) (i j : Nat) : TermElabM Syntax := do
  if j - i > 1 then
    let mid := i + (j-i) / 2
    let lhs  mkTree f i mid
    let rhs  mkTree f mid j
    `(append $lhs $rhs)
  else
    let (k, v) := f i
    `(map₁ $(Syntax.mkStrLit k) $(Syntax.mkStrLit v))

@[termElab includeDataFile] def includeDataFileImp : TermElab := λ stx expectedType? => do
    let str := stx[1].isStrLit?.get!
    let path := FilePath.mk str
    if path.pathExists then
      if path.isDir then
        throwError s!"{str} is a directory"
      else
        let arr := ( FS.lines path).filter (·  "")
        let s  mkTree (fun i =>
          let line := arr[i]
          let splits := line.splitOn ";" |>.map String.trim
          (splits.get! 0, splits.get! 2))
          0 arr.size
        elabTerm ( `(mk $s)) none
    else
      throwError s!"\"{str}\" does not exist as a file"

def generalCategoryMap : HashMap String String := include_data_file "./UCD/UnicodeData.txt"

Mario Carneiro (Feb 19 2022 at 09:09):

making it a structure works:

structure Diff where
  f : HashMap String String  HashMap String String
def map₁ (k v : String) : Diff := λ m => m.insert k v
def append (m₁ m₂ : Diff) : Diff := fun m => m₂.1 (m₁.1 m)⟩
def mk (d : Diff) : HashMap String String := d.1 Std.HashMap.empty

...

set_option maxHeartbeats 100000 in
def generalCategoryMap : HashMap String String := include_data_file "./UCD/UnicodeData.txt"

Mario Carneiro (Feb 19 2022 at 09:17):

Actually, this can be sped up significantly by generating an Expr directly instead of building a syntax to pass to ElabTerm. Note the maxHeartbeats is gone now and it is visibly much faster

import Lean
open Std Lean Elab Term System IO
syntax (name := includeDataFile) "include_data_file" str : term

structure Diff where
  f : HashMap String String  HashMap String String
def map₁ (k v : String) : Diff := λ m => m.insert k v
def append (m₁ m₂ : Diff) : Diff := fun m => m₂.1 (m₁.1 m)⟩
def mk (d : Diff) : HashMap String String := d.1 Std.HashMap.empty

partial def mkTree (f : Nat  String × String) (i j : Nat) : Expr :=
  if j - i > 1 then
    let mid := i + (j-i) / 2
    let lhs := mkTree f i mid
    let rhs := mkTree f mid j
    mkApp2 (mkConst ``append) lhs rhs
  else
    let (k, v) := f i
    mkApp2 (mkConst ``map₁) (mkStrLit k) (mkStrLit v)

@[termElab includeDataFile] def includeDataFileImp : TermElab := λ stx expectedType? => do
    let str := stx[1].isStrLit?.get!
    let path := FilePath.mk str
    if path.pathExists then
      if path.isDir then
        throwError s!"{str} is a directory"
      else
        let arr := ( FS.lines path).filter (·  "")
        pure $ mkApp (mkConst ``mk) <| mkTree (fun i =>
          let line := arr[i]
          let splits := line.splitOn ";" |>.map String.trim
          (splits.get! 0, splits.get! 2))
          0 arr.size
    else
      throwError s!"\"{str}\" does not exist as a file"

def generalCategoryMap : HashMap String String := include_data_file "./UCD/UnicodeData.txt"

Xubai Wang (Feb 19 2022 at 09:38):

Thanks very much!! although it's still stack overflow on my machine. I'll try to optimize it later.

Mario Carneiro (Feb 19 2022 at 10:22):

where is the stack overflow occurring? Does it produce C code?

Xubai Wang (Feb 19 2022 at 10:31):

No. Both build/ir and build/lib are empty, and it produces error very quickly (about 3 seconds).

Mario Carneiro (Feb 19 2022 at 10:33):

Oh, it looks like my synthetic benchmark had too much duplication in it, leading to fewer distinct subexpressions than expected. When I use the real data it times out

Mario Carneiro (Feb 19 2022 at 10:34):

or at least it's taking a while, it hasn't crashed yet on my machine

Mario Carneiro (Feb 19 2022 at 10:37):

I do hope we can find a better story for big compile time data. I think it should at least be possible to use an extern function

Xubai Wang (Feb 19 2022 at 10:42):

Yes. Also I think runtime parsing and Thunk may be a better solution now as lean builds the whole package. Compiling all the data file will lead to unacceptably long build time.


Last updated: Dec 20 2023 at 11:08 UTC