Zulip Chat Archive

Stream: lean4

Topic: Matching c++ on simple for loop


Tomas Skrivan (Apr 20 2023 at 04:53):

I have been messing around with for loops, trying to get them as fast as possible. I may have gone a bit too far, as I'm now matching C++ speed, which I find a bit suspicious.

The test is to compute the triangular number the dumb way: T(n) = ∑ i∈[0:n], ∑ j∈[0:i], 1.

The naive way, using Lean's built-in for loop (the IO monad is there to prevent the compiler from computing the result at compile time):

def triangularNumberNaive (n : USize) : IO USize := pure $ Id.run do
  let mut sum : USize := 0
  for i in [0:n.toNat] do
    for _ in [0:i] do
      sum := sum + 1
  pure sum

Computing T(100 000) takes ~10s, which is not great.

So I decided to implement my own version of the for loop, one that uses USize instead of big integers.

def triangularNumberForIn (n : USize) : IO USize := pure $ Id.run do
  let mut sum : USize := 0
  for i in (fullRange n) do
    for _ in fullRange i.1 do
      sum := sum + 1
  pure sum

Here fullRange n just denotes the range [0:n] for some n : USize.
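
For reference, here is a minimal sketch of what such a range could look like. The names and details are guesses based on the description above, not the repo's actual code (the repo's version also carries a proof that the index is in bounds, which is why the code above writes i.1; this simplified sketch just yields a raw USize):

structure USizeRange where
  stop : USize

@[inline]
def fullRange (n : USize) : USizeRange := ⟨n⟩

-- Loop over 0, 1, ..., stop - 1 on raw machine words, honoring early
-- exit via ForInStep. `partial` sidesteps the termination proof that
-- a polished version would supply.
@[specialize]
partial def USizeRange.forIn [Monad m] (r : USizeRange) (init : β)
    (f : USize → β → m (ForInStep β)) : m β :=
  go 0 init
where
  go (i : USize) (b : β) : m β := do
    if i < r.stop then
      match ← f i b with
      | .done b' => pure b'
      | .yield b' => go (i + 1) b'
    else
      pure b

instance : ForIn m USizeRange USize := ⟨USizeRange.forIn⟩

With such an instance, for i in fullRange n do ... elaborates to USizeRange.forIn, and after inlining and specialization the compiler sees a plain machine-word counting loop.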

This version computes T(1 000 000 000) in ~250ms, which is surprisingly fast: an identical implementation in C++ takes ~700ms (with g++) or ~150ms (with clang).

I have also implemented a fold operation over a USize range, to see if I could squeeze out even more speed by not allowing early return from the loop, but I didn't get any significant speedup. In fact, the monadic version of the fold takes double the time for some reason.
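
The fold in question might have roughly this shape (a hypothetical sketch, not the repo's actual code; USize.fold is a made-up name):

-- Hypothetical shape of the pure fold: no ForInStep, hence no early
-- exit for the compiler to account for.
@[specialize]
partial def USize.fold (n : USize) (init : β) (f : USize → β → β) : β :=
  go 0 init
where
  go (i : USize) (b : β) : β :=
    if i < n then go (i + 1) (f i b) else b

#eval USize.fold 10 (0 : USize) (fun i acc => acc + i) -- 45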

If you want to mess around with it, I have set up a github repo: https://github.com/lecopivo/ForLoopSpeed

I'm just sharing it as I find it interesting that you can go that fast.

Mario Carneiro (Apr 20 2023 at 04:59):

FYI, compilers know the closed-form formula for triangular numbers, so you can get some really impossible-looking speedups if you aren't careful.
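
Concretely, with the exclusive upper bounds used above, the double loop computes ∑ i∈[0:n], i = n*(n-1)/2, so an optimizer that spots this can replace both loops with a couple of arithmetic instructions. A Lean rendering of that closed form (exact as long as n*(n-1) fits in a machine word):

-- Closed form of the benchmark: Σ_{i<n} Σ_{j<i} 1 = n*(n-1)/2.
-- n = 0 is still correct: the wrapped n - 1 is absorbed by the
-- multiplication by 0.
def triangularNumberClosed (n : USize) : USize :=
  n * (n - 1) / 2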

Tomas Skrivan (Apr 20 2023 at 05:00):

Ahh I was a bit worried about hitting some crazy optimization. Maybe I should make it a bit more complicated.

Mario Carneiro (Apr 20 2023 at 05:00):

the easiest way to find out if this happened is to look at the generated assembly to see if it still has a loop in it

Tomas Skrivan (Apr 20 2023 at 05:06):

I think it optimizes away one of the for loops: https://godbolt.org/z/51d9c37o5. But the Lean code must be doing the same thing, which is nice.
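
What "optimizing away one of the loops" amounts to: the inner ∑ j∈[0:i], 1 is just i, so the compiler can collapse the nest into a single loop, roughly like this sketch:

def triangularNumberOneLoop (n : USize) : IO USize := pure $ Id.run do
  let mut sum : USize := 0
  -- the inner loop counted to i, so just add i directly
  for i in [0:n.toNat] do
    sum := sum + i.toUSize
  pure sum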

Tomas Skrivan (Apr 20 2023 at 05:18):

When I change the inner loop computation to something more complicated, like sum += sin(i) + cos(j), I get the exact same speed as C++.
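
A sketch of that harder benchmark in Lean (using the simplified fullRange from above, so the loop variables are raw USize; Float.ofNat is just the most portable conversion, not necessarily the fastest):

-- Float math in the inner loop defeats the closed-form rewrite, so
-- both loops actually have to run.
def trigSum (n : USize) : IO Float := pure $ Id.run do
  let mut sum : Float := 0.0
  for i in fullRange n do
    for j in fullRange i do
      sum := sum + Float.sin (Float.ofNat i.toNat) + Float.cos (Float.ofNat j.toNat)
  pure sum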

Mac Malone (Apr 23 2023 at 17:37):

I would expect that for loop code to produce much the same result as C++ (after optimization), so I take the similar speeds as a good sign. Well-written low-level Lean can easily match C/C++.

François G. Dorais (Apr 23 2023 at 21:01):

I agree with @mac, but I think the key in this case is using USize instead of Nat. A small Nat n is conveniently represented as the machine word 2*n+1, which never conflicts with a bignum pointer because of pointer alignment. But that encoding has a cost that shows up most for simple operations. (For example, adding two raw small Nats has an unwanted carry from the low tag bits, which must be corrected either before or after.) Compilers will try to do a great job optimizing these details, but they can only do so much.
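
A small illustration of that tag arithmetic (a toy model of the encoding, not Lean's actual runtime code; taggedAdd is a made-up name):

-- A small Nat n is stored as the odd machine word 2*n + 1. Adding the
-- raw words gives (2*a + 1) + (2*b + 1) = 2*(a + b) + 2, so one tag
-- bit must be subtracted to restore the encoding.
def taggedAdd (x y : UInt64) : UInt64 := x + y - 1

#eval taggedAdd (2*3 + 1) (2*4 + 1) -- 15, i.e. 2*7 + 1, the encoding of 3 + 4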

Tomas Skrivan (Apr 23 2023 at 21:35):

Yes, using USize did it, and I had to mark a few things with @[inline], but unfortunately it was more or less trial and error. I found it hard to predict what needs to be inlined and what doesn't.


Last updated: Dec 20 2023 at 11:08 UTC