Zulip Chat Archive
Stream: lean4
Topic: Compile-time large data
Xubai Wang (Feb 19 2022 at 06:18):
Lean gives me a Stack overflow detected. Aborting.
error when I try to load the unicode data file into a HashMap
at compile time.
The data file has following semicolon separated syntax, with 34626 lines in all.
001F;<control>;Cc;0;S;;;;;N;INFORMATION SEPARATOR ONE;;;;
0020;SPACE;Zs;0;WS;;;;;N;;;;;
0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;
I use the following code to parse it.
syntax (name := includeDataFile) "include_data_file" str : term
@[termElab includeDataFile] def includeDataFileImp : TermElab := λ stx expectedType? => do
let str := stx[1].isStrLit?.get!
let path := FilePath.mk str
if ←path.pathExists then
if ←path.isDir then
throwError s!"{str} is a directory"
else
let mut s ← `(Std.HashMap.empty)
for line in (← FS.lines path).filter (· ≠ "") do
let splits := line.splitOn ";" |>.map String.trim
s ← `(($s).insert $(Syntax.mkStrLit (splits.get! 0)) $((Syntax.mkStrLit (splits.get! 2))))
elabTerm s none
else
throwError s!"\"{str}\" does not exist as a file"
set_option maxRecDepth 50000 in
def generalCategoryMap : HashMap String String := include_data_file "./UCD/UnicodeData.txt"
Xubai Wang (Feb 19 2022 at 06:21):
The code builds successfully when I provide a smaller file, and the error comes from elabTerm
. Are there any hacks that can work round the stack overflow?
Mario Carneiro (Feb 19 2022 at 06:54):
I assume you already recognize that the term you are constructing is linear depth, since it has the form empty |>.insert k v |>.insert k v |>.insert k v ...
Mario Carneiro (Feb 19 2022 at 06:56):
Instead, you should use [(k, v), ..., (k', v')].foldl (fun m (k, v) => m.insert k v) empty
, since the [a, ..., z]
syntax already elaborates into a nice balanced tree for logarithmic depth
Xubai Wang (Feb 19 2022 at 08:11):
It partially works, but now it produces another error.
error: PANIC at Lean.Expr.mkData Lean.Expr:136:44: bound variable index is too big
PANIC at Lean.Expr.mkData Lean.Expr:136:44: bound variable index is too big
Stack overflow detected. Aborting.
Xubai Wang (Feb 19 2022 at 08:12):
Postponing parsing to run time may be a good workaround for me now.
Mario Carneiro (Feb 19 2022 at 08:20):
Oh, now I'm curious what the bound variable index limit is (apparently 2^16-1
= 65535)
Mario Carneiro (Feb 19 2022 at 08:21):
You can also make your own balanced tree of operators (I think the one for lists uses let
which introduces bound variables)
Xubai Wang (Feb 19 2022 at 08:46):
https://github.com/xubaiw/Unicode.lean/tree/compile-time
Mario Carneiro said:
You can also make your own balanced tree of operators (I think the one for lists uses
let
which introduces bound variables)
I'll try that when I learn more about lean metaprogramming.
Mario Carneiro (Feb 19 2022 at 09:05):
This seems to work okay, but it's now hitting a timeout in isDefEq
import Lean
open Std Lean Elab Term System IO
syntax (name := includeDataFile) "include_data_file" str : term
abbrev Diff := HashMap String String → HashMap String String
def map₁ (k v : String) : Diff := λ m => m.insert k v
def append (m₁ m₂ : Diff) : Diff := fun m => m₂ (m₁ m)
def mk (d : Diff) : HashMap String String := d Std.HashMap.empty
partial def mkTree (f : Nat → String × String) (i j : Nat) : TermElabM Syntax := do
if j - i > 1 then
let mid := i + (j-i) / 2
let lhs ← mkTree f i mid
let rhs ← mkTree f mid j
`(append $lhs $rhs)
else
let (k, v) := f i
`(map₁ $(Syntax.mkStrLit k) $(Syntax.mkStrLit v))
@[termElab includeDataFile] def includeDataFileImp : TermElab := λ stx expectedType? => do
let str := stx[1].isStrLit?.get!
let path := FilePath.mk str
if ←path.pathExists then
if ←path.isDir then
throwError s!"{str} is a directory"
else
let arr := (← FS.lines path).filter (· ≠ "")
let s ← mkTree (fun i =>
let line := arr[i]
let splits := line.splitOn ";" |>.map String.trim
(splits.get! 0, splits.get! 2))
0 arr.size
elabTerm (← `(mk $s)) none
else
throwError s!"\"{str}\" does not exist as a file"
def generalCategoryMap : HashMap String String := include_data_file "./UCD/UnicodeData.txt"
Mario Carneiro (Feb 19 2022 at 09:09):
making it a structure works:
structure Diff where
f : HashMap String String → HashMap String String
def map₁ (k v : String) : Diff := ⟨λ m => m.insert k v⟩
def append (m₁ m₂ : Diff) : Diff := ⟨fun m => m₂.1 (m₁.1 m)⟩
def mk (d : Diff) : HashMap String String := d.1 Std.HashMap.empty
...
set_option maxHeartbeats 100000 in
def generalCategoryMap : HashMap String String := include_data_file "./UCD/UnicodeData.txt"
Mario Carneiro (Feb 19 2022 at 09:17):
Actually, this can be sped up significantly by generating an Expr
directly instead of building a syntax to pass to ElabTerm
. Note the maxHeartbeats
is gone now and it is visibly much faster
import Lean
open Std Lean Elab Term System IO
syntax (name := includeDataFile) "include_data_file" str : term
structure Diff where
f : HashMap String String → HashMap String String
def map₁ (k v : String) : Diff := ⟨λ m => m.insert k v⟩
def append (m₁ m₂ : Diff) : Diff := ⟨fun m => m₂.1 (m₁.1 m)⟩
def mk (d : Diff) : HashMap String String := d.1 Std.HashMap.empty
partial def mkTree (f : Nat → String × String) (i j : Nat) : Expr :=
if j - i > 1 then
let mid := i + (j-i) / 2
let lhs := mkTree f i mid
let rhs := mkTree f mid j
mkApp2 (mkConst ``append) lhs rhs
else
let (k, v) := f i
mkApp2 (mkConst ``map₁) (mkStrLit k) (mkStrLit v)
@[termElab includeDataFile] def includeDataFileImp : TermElab := λ stx expectedType? => do
let str := stx[1].isStrLit?.get!
let path := FilePath.mk str
if ←path.pathExists then
if ←path.isDir then
throwError s!"{str} is a directory"
else
let arr := (← FS.lines path).filter (· ≠ "")
pure $ mkApp (mkConst ``mk) <| mkTree (fun i =>
let line := arr[i]
let splits := line.splitOn ";" |>.map String.trim
(splits.get! 0, splits.get! 2))
0 arr.size
else
throwError s!"\"{str}\" does not exist as a file"
def generalCategoryMap : HashMap String String := include_data_file "./UCD/UnicodeData.txt"
Xubai Wang (Feb 19 2022 at 09:38):
Thanks very much!! although it's still stack overflow on my machine. I'll try to optimize it later.
Mario Carneiro (Feb 19 2022 at 10:22):
where is the stack overflow occurring? Does it produce C code?
Xubai Wang (Feb 19 2022 at 10:31):
No. Both build/ir
and build/lib
are empty, and it produces error very quickly (about 3 seconds).
Mario Carneiro (Feb 19 2022 at 10:33):
Oh, it looks like my synthetic benchmark had too much duplication in it, leading to fewer distinct subexpressions than expected. When I use the real data it times out
Mario Carneiro (Feb 19 2022 at 10:34):
or at least it's taking a while, it hasn't crashed yet on my machine
Mario Carneiro (Feb 19 2022 at 10:37):
I do hope we can find a better story for big compile time data. I think it should at least be possible to use an extern function
Xubai Wang (Feb 19 2022 at 10:42):
Yes. Also I think runtime parsing and Thunk
may be a better solution now as lean builds the whole package. Compiling all the data file will lead to unacceptably long build time.
Last updated: Dec 20 2023 at 11:08 UTC