Zulip Chat Archive
Stream: lean4
Topic: Simple Lean programm 100x slower than equivalent Python
Michael Rothgang (May 26 2024 at 22:07):
First off, sorry for the click-bait title (which I'd have preferred to end with "what's wrong?"). Rewriting mathlib's style linters (written in Python) in Lean, I encountered one part where my Lean code is about 100x slower than the equivalent Python code. I can't imagine it has to be like this: please help me understand what I'm "doing wrong". Or how I can start to find out...
Here's a minimal Python programm doing what I want.
#!/usr/bin/env python3
# Minimal reduction from the "semicolon" style linter in Python.
from pathlib import Path
import sys
ERR_SEM = 13 # the substring " ;"
argv = sys.argv[1:]
for filename in argv:
path = Path(filename)
errors = []
with path.open(encoding="utf-8", newline="") as f:
for line in f:
if " ;" in line:
errors.append((ERR_SEM, path))
n = len(errors) # do something with errors
Running this on all of mathlib takes about half a second on my machine.
$ time python3 ./scripts/lint-minimal.py $(find Mathlib -name '*.lean')
real 0m0,640s
user 0m0,537s
sys 0m0,108s
Here's the Lean code I wrote: very much minimised. (In case you're wondering: all the set-up is necessary; this is already extremely simplified. But the set-up itself is very fast.) It takes about a minute, i.e. 100x slower than Python.
-- This file is minimised from rewriting the style linters:
-- it illustrates that "just looking for semicolons" is *really* slow.
import Lean.Elab.Command
import Batteries.Data.String.Basic
def lint_all_files : IO Unit := do
-- Read all module names from `Mathlib.lean`.
let allModules ← IO.FS.lines (System.mkFilePath ["Mathlib.lean"])
for module in allModules do
let module := module.stripPrefix "import "
-- Convert the module name to a file name, then lint that file.
let path := (System.mkFilePath (module.split fun c ↦ (c == '.'))).addExtension "lean"
let lines ← IO.FS.lines path
let mut number_semicolons := 0
for line in lines do
-- This check is *extremely* slow: why?
if line.containsSubstr " ;" then
number_semicolons := number_semicolons + 1
if number_semicolons > 0 then
IO.println "Found a semicolon" -- full programm: format all errors nicely...
--run_cmd lint_all_files <-- this step takes about a minute on my computer
Michael Rothgang (May 26 2024 at 22:07):
Finally, for an idea what could be possible, here's the timing of calling ripgrep
, from the mathlib root directory. (Ripgrep parses the .gitignore, hence this counts all mathlib files, but ignore the lake directory.)
$ time rg -F " ;"
real 0m0,041s
user 0m0,056s
sys 0m0,056s
Here's an invocation searching just all *.lean files (and the full output).
$ time rg --type-add 'lean:*.lean' -t lean " ;"
test/Monotonicity.lean
139:-- ; intros ys zs h h'
140:-- ; cases ys with y ys
141:-- ; cases zs with z zs
142:-- ; try { cases h; cases h'; done },
real 0m0,039s
user 0m0,054s
sys 0m0,054s
Standard grep
takes a similar amount of time.
Henrik Böving (May 26 2024 at 22:08):
First thing I would recommend before doing any optimization here is to actually compile your program, that will easily speed it up a lot.
Michael Rothgang (May 26 2024 at 22:09):
:exploding_head: I can compile a Lean program? How do I do this? Pointer to docs is fine.
Henrik Böving (May 26 2024 at 22:10):
We usually create a lake project with a lean_exe target as provided by the default project that you get when you run lake new
, run lake build
and you'll have a binary in .lake/bin
Michael Rothgang (May 26 2024 at 22:10):
Ah - so creating a new executable. I know how to do this.
Henrik Böving (May 26 2024 at 22:10):
If you really want to you can also do this by hand but I don't recommend it
Michael Rothgang (May 26 2024 at 22:11):
... this would end up as another executable in mathlib - but I guess that's fine. I don't need by hand :-)
Henrik Böving (May 26 2024 at 22:11):
Running something with run_cmd (without precompileModules) will use the bytecode interpreter which is not very well optimized at all. I can very much imagine python's bytecode interpreter to be much better than ours. After you've compiled the program we can start doing actual optimizations if it is still too slow for your taste
Eric Wieser (May 26 2024 at 22:14):
Another difference is that the Python code reads one line at a time, while the Lean code above reads it all into memory at once
Henrik Böving (May 26 2024 at 22:15):
That should give Lean an advantage though. Reading one line at a time would mean having n syscalls as opposed to have 1 syscall per file.
Eric Wieser (May 26 2024 at 22:15):
I think Python does some buffering to make that not 1:1
Henrik Böving (May 26 2024 at 22:20):
@Michael Rothgang if it's still not fast enough I would suggest to profile the binary with perf
and hotspot
or whatever tool is the one that you prefer (personally I enjoy Intel VTune). And then based on the results you can do further things. If it's really the string stuff that you are spending most time in I guess the next best thing would be to write either a better containsSubstr
or come up with a Parsec
based parser to go through whole files and find the wrong semicolons.
Michael Rothgang (May 26 2024 at 22:20):
Okay, a few minutes later: after compiling, it's almost instant. Much better, wohoo!
Michael Rothgang (May 26 2024 at 22:23):
(More detailed: running all the linters on all of mathlib works within 2 seconds - the Python script took ~10.)
Michael Rothgang (May 26 2024 at 22:44):
An update: the minimal programme (i.e., just the semicolon) takes about 2s. With other linters, adding the semicolon check turns the runtime from ~1.5 to ~2.5s: i.e., that check hurts a bit, but by itself is still tolerable.
In other words: using a profiler doesn't seem like a bad idea :-) I'm very busy until the end of the month, though.
Number Eighteen (May 28 2024 at 17:46):
The following code may be more performant :
private partial def String.hasSemicolon (source : String) : Bool :=
let rec loop (i : String.Iterator) :=
if i.atEnd then
false
else
let posOfSemi := i.find (. == ';')
if posOfSemi == i.toEnd then false else
if posOfSemi.prev.curr == ' ' then
true
else
loop posOfSemi.next
loop source.iter
/- Change accordingly. -/
def pathPrefix := "/home/data/programming-data/practice/lake-packages/mathlib/"
def pathString := "/home/data/programming-data/practice/lake-packages/mathlib/Mathlib.lean"
/-! Assumes that modules is a small list of files. -/
def my_linter : IO Unit := do
let modules := (← IO.FS.lines pathString).map
fun s ↦ s.drop 7
|>.replace "." "/"
|>.append ".lean"
for module in modules do
let blob ← IO.FS.readFile <| pathPrefix ++ module
if blob.hasSemicolon then
IO.println "Semicolons detected; here, split the blob and lint."
Last updated: May 02 2025 at 03:31 UTC