Zulip Chat Archive

Stream: lean4

Topic: Simple Lean programm 100x slower than equivalent Python


Michael Rothgang (May 26 2024 at 22:07):

First off, sorry for the click-bait title (which I'd have preferred to end with "what's wrong?"). Rewriting mathlib's style linters (written in Python) in Lean, I encountered one part where my Lean code is about 100x slower than the equivalent Python code. I can't imagine it has to be like this: please help me understand what I'm "doing wrong". Or how I can start to find out...

Here's a minimal Python programm doing what I want.

#!/usr/bin/env python3
# Minimal reduction from the "semicolon" style linter in Python.
from pathlib import Path
import sys

ERR_SEM = 13 # the substring " ;"

argv = sys.argv[1:]
for filename in argv:
    path = Path(filename)
    errors = []
    with path.open(encoding="utf-8", newline="") as f:
        for line in f:
            if " ;" in line:
                errors.append((ERR_SEM, path))
    n = len(errors) # do something with errors

Running this on all of mathlib takes about half a second on my machine.

$ time python3 ./scripts/lint-minimal.py $(find Mathlib -name '*.lean')
real    0m0,640s
user    0m0,537s
sys 0m0,108s

Here's the Lean code I wrote: very much minimised. (In case you're wondering: all the set-up is necessary; this is already extremely simplified. But the set-up itself is very fast.) It takes about a minute, i.e. 100x slower than Python.

-- This file is minimised from rewriting the style linters:
-- it illustrates that "just looking for semicolons" is *really* slow.
import Lean.Elab.Command
import Batteries.Data.String.Basic

def lint_all_files  : IO Unit := do
  -- Read all module names from `Mathlib.lean`.
  let allModules  IO.FS.lines (System.mkFilePath ["Mathlib.lean"])
  for module in allModules do
    let module := module.stripPrefix "import "
    -- Convert the module name to a file name, then lint that file.
    let path := (System.mkFilePath (module.split fun c  (c == '.'))).addExtension "lean"
    let lines  IO.FS.lines path
    let mut number_semicolons := 0
    for line in lines do
      -- This check is *extremely* slow: why?
      if line.containsSubstr " ;" then
        number_semicolons := number_semicolons + 1
    if number_semicolons > 0 then
      IO.println "Found a semicolon" -- full programm: format all errors nicely...
--run_cmd lint_all_files <-- this step takes about a minute on my computer

Michael Rothgang (May 26 2024 at 22:07):

Finally, for an idea what could be possible, here's the timing of calling ripgrep, from the mathlib root directory. (Ripgrep parses the .gitignore, hence this counts all mathlib files, but ignore the lake directory.)

$ time rg -F " ;"
real    0m0,041s
user    0m0,056s
sys 0m0,056s

Here's an invocation searching just all *.lean files (and the full output).

$ time rg --type-add 'lean:*.lean' -t lean " ;"
test/Monotonicity.lean
139:--  ; intros ys zs h h'
140:--  ; cases ys with y ys
141:--  ; cases zs with z zs
142:--  ; try { cases h; cases h'; done },
real    0m0,039s
user    0m0,054s
sys 0m0,054s

Standard grep takes a similar amount of time.

Henrik Böving (May 26 2024 at 22:08):

First thing I would recommend before doing any optimization here is to actually compile your program, that will easily speed it up a lot.

Michael Rothgang (May 26 2024 at 22:09):

:exploding_head: I can compile a Lean program? How do I do this? Pointer to docs is fine.

Henrik Böving (May 26 2024 at 22:10):

We usually create a lake project with a lean_exe target as provided by the default project that you get when you run lake new, run lake build and you'll have a binary in .lake/bin

Michael Rothgang (May 26 2024 at 22:10):

Ah - so creating a new executable. I know how to do this.

Henrik Böving (May 26 2024 at 22:10):

If you really want to you can also do this by hand but I don't recommend it

Michael Rothgang (May 26 2024 at 22:11):

... this would end up as another executable in mathlib - but I guess that's fine. I don't need by hand :-)

Henrik Böving (May 26 2024 at 22:11):

Running something with run_cmd (without precompileModules) will use the bytecode interpreter which is not very well optimized at all. I can very much imagine python's bytecode interpreter to be much better than ours. After you've compiled the program we can start doing actual optimizations if it is still too slow for your taste

Eric Wieser (May 26 2024 at 22:14):

Another difference is that the Python code reads one line at a time, while the Lean code above reads it all into memory at once

Henrik Böving (May 26 2024 at 22:15):

That should give Lean an advantage though. Reading one line at a time would mean having n syscalls as opposed to have 1 syscall per file.

Eric Wieser (May 26 2024 at 22:15):

I think Python does some buffering to make that not 1:1

Henrik Böving (May 26 2024 at 22:20):

@Michael Rothgang if it's still not fast enough I would suggest to profile the binary with perf and hotspot or whatever tool is the one that you prefer (personally I enjoy Intel VTune). And then based on the results you can do further things. If it's really the string stuff that you are spending most time in I guess the next best thing would be to write either a better containsSubstr or come up with a Parsec based parser to go through whole files and find the wrong semicolons.

Michael Rothgang (May 26 2024 at 22:20):

Okay, a few minutes later: after compiling, it's almost instant. Much better, wohoo!

Michael Rothgang (May 26 2024 at 22:23):

(More detailed: running all the linters on all of mathlib works within 2 seconds - the Python script took ~10.)

Michael Rothgang (May 26 2024 at 22:44):

An update: the minimal programme (i.e., just the semicolon) takes about 2s. With other linters, adding the semicolon check turns the runtime from ~1.5 to ~2.5s: i.e., that check hurts a bit, but by itself is still tolerable.

In other words: using a profiler doesn't seem like a bad idea :-) I'm very busy until the end of the month, though.

Number Eighteen (May 28 2024 at 17:46):

The following code may be more performant :

private partial def String.hasSemicolon (source : String) : Bool :=
  let rec loop (i : String.Iterator) :=
    if i.atEnd then
      false
    else
      let posOfSemi := i.find (. == ';')
      if posOfSemi == i.toEnd then false else
      if posOfSemi.prev.curr == ' ' then
        true
      else
        loop posOfSemi.next
  loop source.iter

/- Change accordingly. -/
def pathPrefix := "/home/data/programming-data/practice/lake-packages/mathlib/"
def pathString := "/home/data/programming-data/practice/lake-packages/mathlib/Mathlib.lean"

/-! Assumes that modules is a small list of files. -/
def my_linter : IO Unit := do
  let modules := ( IO.FS.lines pathString).map
    fun s  s.drop 7
    |>.replace "." "/"
    |>.append ".lean"
  for module in modules do
    let blob  IO.FS.readFile <| pathPrefix ++ module
    if blob.hasSemicolon then
      IO.println "Semicolons detected; here, split the blob and lint."

Last updated: May 02 2025 at 03:31 UTC