Zulip Chat Archive

Stream: general

Topic: Self-hosted Github runner


view this post on Zulip Jannis Limperg (May 14 2020 at 13:51):

Breaking this out of the speed-up project returns thread.

It seems like it might be beneficial to have a custom Github runner which could build mathlib much faster than Github's runners. I'm currently running some experiments in this direction with initial funding graciously provided by @Jalex Stark.

To determine what sort of build server we'd want, a key factor is how mathlib scales with additional cores. I'm currently running benchmarks on a dedicated server with 8 cores and a virtual server with 16 cores. @Scott Morrison, could you perhaps run the same benchmark on your machine? Just so that we have a bit more data.

Here's the benchmark script:

#!/bin/bash

function die() {
  echo $1
  echo "Usage: mathlib-bench <mathlib-dir> <n1> [n2 ...]"
  exit 1
}

test -n "$1" || die "Mathlib directory not specified."
mathlib_dir="$1"
shift

test -n "$1" || die "Thread count not specified"

echo "===== Cloning mathlib repo in '$mathlib_dir'"
git clone https://github.com/leanprover-community/mathlib "$mathlib_dir" && \
  pushd "$mathlib_dir" && \
  git checkout d0beb496140b5506530e18033591c711030546ad && \
  popd

test "$?" -eq 0 || die "Unable to clone mathlib."

echo "===== Installing Lean toolchain"
pushd "$mathlib_dir" && \
  leanpkg configure && \
  popd

test "$?" -eq 0 || die "Unable to install Lean toolchain."

while test -n "$1"; do
  test "$1" -gt 0 || die "Not a positive number: $1."
  n_threads="$1"

  echo "===== Building mathlib with $n_threads threads" && \
    find "$mathlib_dir" -name '*.olean' -delete && \
    pushd "$mathlib_dir" && \
    command time leanpkg build -- --threads="$n_threads" --memory=16000 && \
    popd

  test $? -eq 0 || die "Error while building mathlib."
done

Usage: mathlib-bench <dir> <n1> ... <nm> where dir is the directory where mathlib will be checked out (must be empty or nonexistent) and the ni are thread counts. I'm using mathlib-bench <dir> 1 2 4 6 8 10 12 14 16.

view this post on Zulip Jannis Limperg (May 14 2020 at 14:00):

Also, if anyone knows an organisation that would give us build servers for free/cheap, I'm all ears.

view this post on Zulip Ryan Lahfa (May 14 2020 at 14:03):

Jannis Limperg said:

Also, if anyone knows an organisation that would give us build servers for free/cheap, I'm all ears.

We have nothing to lose to try:
— Nix/NixOS community
— Packet.com
— University

Also, thanks for the benchmark script, I'll be able to play around with Packet's EPYC now :-)

view this post on Zulip Scott Morrison (May 14 2020 at 14:03):

(the same machine is at the moment running two single-threaded builds of old commits of mathlib, which will presumably interfere for the higher core numbers...)

view this post on Zulip Scott Morrison (May 14 2020 at 14:06):

Running!

view this post on Zulip Jannis Limperg (May 14 2020 at 14:51):

Thanks Scott and Ryan! Those numbers will be very interesting.

view this post on Zulip Johan Commelin (May 14 2020 at 14:54):

And thanks Jalex and Jannis!

view this post on Zulip Jannis Limperg (May 14 2020 at 16:22):

sigh The security issues of self-hosted Github runners are much worse than I thought. Essentially: Anyone can make a PR that changes the GH actions workflow file, then this new file is run (because it can say, 'run me on a self-hosted runner when a PR is created'), and the runner does no sandboxing whatsoever. I was able to create a PR (as a random user with no affiliation with the repo) that changed the workflow file to cat /etc/passwd and this was executed and displayed without a hitch. Whoever thought this was a remotely sensible design deserves to be slapped.

What this means for the project is that I need to figure out whether I can sandbox the runner myself in a VM or something. Fun times...

view this post on Zulip Johan Commelin (May 14 2020 at 16:49):

That's quite ridiculous indeed.

view this post on Zulip Johan Commelin (May 14 2020 at 16:50):

And running it inside some docker sandbox probably hinders the performance

view this post on Zulip Johan Commelin (May 14 2020 at 16:50):

How about a qemu bare metal VM?

view this post on Zulip Johan Commelin (May 14 2020 at 16:51):

But it comes with extra maintenance ~~> headaches

view this post on Zulip Reid Barton (May 14 2020 at 16:52):

All solutions for CI involve security issues like this to various extents, as well as the reverse issue: how do I know I can trust the binaries produced by your server?

view this post on Zulip Reid Barton (May 14 2020 at 16:52):

How about just not building PRs?

view this post on Zulip Yury G. Kudryashov (May 14 2020 at 16:52):

As far as I understand, a PR can say " let's build PRs"

view this post on Zulip Yury G. Kudryashov (May 14 2020 at 16:53):

Am I right?

view this post on Zulip Reid Barton (May 14 2020 at 16:53):

oh, I don't know about this github self-hosted thing in particular

view this post on Zulip Yury G. Kudryashov (May 14 2020 at 16:54):

Does it run under root user account (otherwise it shouldn't have access to /etc/passwd)?

view this post on Zulip Reid Barton (May 14 2020 at 16:56):

I agree if it can't even be configured to only build commits to master, that's an extra level of dumb.

view this post on Zulip Rob Lewis (May 14 2020 at 16:58):

I think the solution is going to depend on exactly what you want to do with this. Is this supposed to be a replacement for the CI running on Actions right now? Or is it separate, just for benchmarking?

view this post on Zulip Rob Lewis (May 14 2020 at 16:58):

If it's just for benchmarking, it doesn't have to be tied to the mathlib repo at all.

view this post on Zulip Jannis Limperg (May 14 2020 at 17:08):

Yury G. Kudryashov said:

As far as I understand, a PR can say " let's build PRs"

Yes, that's one part of the problem. The workflow itself defines when it is run, and there are no repository-wide configuration options to restrict this. I thought it might be possible to disable Actions for PRs from forks, but no such luck.

Yury G. Kudryashov said:

Does it run under root user account (otherwise it shouldn't have access to /etc/passwd)?

No, you can run it as an unprivileged user. (/etc/passwd is world-readable by default, at least on this installation.) But then it still has access to any world-readable files as well as the network, which is way too much attack surface.

Rob Lewis said:

I think the solution is going to depend on exactly what you want to do with this. Is this supposed to be a replacement for the CI running on Actions right now? Or is it separate, just for benchmarking?

No, I want this to be a faster replacement for the default Github Actions runners. The idea is to get mathlib compile times to a more tolerable level for all contributors.

Johan Commelin said:

How about a qemu bare metal VM?

The problem is that you really want to spin up a new VM/container for every job, to make sure that they can't leave anything behind. This would have to be done by the GH runner, which, as far as I can tell, can't do it.

view this post on Zulip Yury G. Kudryashov (May 14 2020 at 17:10):

(/etc/passwd is world-readable by default
I should read more carefully: it's "just" /etc/passwd, not /etc/shadow

view this post on Zulip Gabriel Ebner (May 14 2020 at 17:13):

Jannis Limperg said:

No, I want this to be a faster replacement for the default Github Actions runners. The idea is to get mathlib compile times to a more tolerable level for all contributors.

Have you done the calculations on this? How many machines would we need to build all PR commits? (Say one build takes 20 minutes, then we can only do 72 commits per day. This seems potentially low.)

view this post on Zulip Yury G. Kudryashov (May 14 2020 at 17:14):

An alternative approach: make github just wget some URL on the server. Then a script can checkout the specified commit, compile it using a hardcoded script, publish oleans, and post a status update.

view this post on Zulip Gabriel Ebner (May 14 2020 at 17:17):

Please be aware that you can execute arbitrary commands when compiling Lean code. (See io.cmd.)

view this post on Zulip Reid Barton (May 14 2020 at 17:19):

and, for example, you do not want that arbitrary code to have access to your cloud storage credentials

view this post on Zulip Rob Lewis (May 14 2020 at 17:22):

With that setup, we can at least restrict to people we've offered write access to mathlib branches. External PRs can't access GH secrets, so you just need the script to be called with a stored passphrase.

view this post on Zulip Johan Commelin (May 14 2020 at 17:22):

@Jannis Limperg What's wrong with the current Github actions + leanproject setup?

view this post on Zulip Rob Lewis (May 14 2020 at 17:23):

But, like Gabriel, I'm skeptical about the scalability here. Github gives us a crazy amount of resources for free, even if they aren't blazing fast.

view this post on Zulip Jannis Limperg (May 14 2020 at 17:37):

Gabriel Ebner said:

Have you done the calculations on this? How many machines would we need to build all PR commits? (Say one build takes 20 minutes, then we can only do 72 commits per day. This seems potentially low.)

No, I'm waiting for the benchmark results to see what the numbers are like. I'm also not sure that they're going to be good enough for this whole thing to be worth it. However, on my machine, compilation with 4 threads already takes under 1h (around half the time GH actions currently needs), so I could run 2 jobs in parallel.

Johan Commelin said:

Jannis Limperg What's wrong with the current Github actions + leanproject setup?

That it takes 2h until you know whether your commit broke anything (and 3h for linting etc.). I was under the impression that people were unhappy with this and would be happier if it took significantly less long. However, if there's no demand for this project, I'll gladly drop it; the compile times don't bother me personally much (because my machine will likely remain faster than CI).

view this post on Zulip Jannis Limperg (May 14 2020 at 18:23):

@Scott Morrison @Ryan Lahfa The benchmark script was missing a shift, so it'll run the 1-thread benchmark forever. Sorry! (Why is there no facepalm emoji?)

Corrected script:

#!/bin/bash

function die() {
  echo $1
  echo "Usage: mathlib-bench <mathlib-dir> <n1> [n2 ...]"
  exit 1
}

test -n "$1" || die "Mathlib directory not specified."
mathlib_dir="$1"
shift

test -n "$1" || die "Thread count not specified"

echo "===== Cloning mathlib repo in '$mathlib_dir'"
git clone https://github.com/leanprover-community/mathlib "$mathlib_dir" && \
  pushd "$mathlib_dir" && \
  git checkout d0beb496140b5506530e18033591c711030546ad && \
  popd

test "$?" -eq 0 || die "Unable to clone mathlib."

echo "===== Installing Lean toolchain"
pushd "$mathlib_dir" && \
  leanpkg configure && \
  popd

test "$?" -eq 0 || die "Unable to install Lean toolchain."

while test -n "$1"; do
  test "$1" -gt 0 || die "Not a positive number: $1."
  n_threads="$1"

  echo "===== Building mathlib with $n_threads threads" && \
    find "$mathlib_dir" -name '*.olean' -delete && \
    pushd "$mathlib_dir" && \
    command time leanpkg build -- --threads="$n_threads" --memory=16000 && \
    popd

  test $? -eq 0 || die "Error while building mathlib."
  shift
done

view this post on Zulip Johan Commelin (May 14 2020 at 18:25):

@Jannis Limperg It's :face palm:. I've needed it a lot.

view this post on Zulip Jannis Limperg (May 14 2020 at 19:27):

What's the correct reaction when you're too dumb to find an emoji? :face_palm:

view this post on Zulip Ryan Lahfa (May 14 2020 at 19:56):

Jannis Limperg said:

sigh The security issues of self-hosted Github runners are much worse than I thought. Essentially: Anyone can make a PR that changes the GH actions workflow file, then this new file is run (because it can say, 'run me on a self-hosted runner when a PR is created'), and the runner does no sandboxing whatsoever. I was able to create a PR (as a random user with no affiliation with the repo) that changed the workflow file to cat /etc/passwd and this was executed and displayed without a hitch. Whoever thought this was a remotely sensible design deserves to be slapped.

What this means for the project is that I need to figure out whether I can sandbox the runner myself in a VM or something. Fun times...

That's why I think we should go for a Hydra (Nix/NixOS) model

view this post on Zulip Ryan Lahfa (May 14 2020 at 19:57):

The GH Actions are cool for simple stuff, beyond that, it's becoming annoyingly difficult to do it

view this post on Zulip Ryan Lahfa (May 14 2020 at 19:59):

Johan Commelin said:

Jannis Limperg What's wrong with the current Github actions + leanproject setup?

That it takes 2h until you know whether your commit broke anything (and 3h for linting etc.). I was under the impression that people were unhappy with this and would be happier if it took significantly less long. However, if there's no demand for this project, I'll gladly drop it; the compile times don't bother me personally much (because my machine will likely remain faster than CI).

In addition, if it takes 2 hours now, it might slowly getting longer and longer.
Faster feedback is nice, but also makes it future-proof.

view this post on Zulip Johan Commelin (May 14 2020 at 20:07):

I see. But maybe we can first focus on detailed and reliable profiling. So that we better understand where the slow bits of code are?

view this post on Zulip Johan Commelin (May 14 2020 at 20:07):

Still, of course I would love "instant CI"

view this post on Zulip Ryan Lahfa (May 14 2020 at 20:07):

Johan Commelin said:

I see. But maybe we can first focus on detailed and reliable profiling. So that we better understand where the slow bits of code are?

Agreed, I'm not jumping yet on the running stuff yet, I just want to setup simple experiments

view this post on Zulip Ryan Lahfa (May 14 2020 at 20:07):

Ideally, reproducible experiments would be arguably nice

view this post on Zulip Scott Morrison (May 15 2020 at 01:53):

I'm running the updated benchmark script.


Last updated: May 11 2021 at 22:14 UTC