Zulip Chat Archive

Stream: mathlib4

Topic: Fine-tuning bors batching


Yaël Dillies (Feb 13 2024 at 14:54):

Batches like https://mathlib-bors-ca18eefec4cb.herokuapp.com/batches/1584 seemed doomed to fail due to their large size. What do people think of setting max_batch_size to something like 16 to avoid huge useless runs?

Yaël Dillies (Feb 13 2024 at 15:02):

This is probably day-dreaming, but is it possible in any way to use the logs of a failed bors batch (or mergify batch, when that becomes a thing) in order to influence how it's going to be rebatched? Put another way: A human can reasonably quickly read a log file and tell Johan what he should take off the queue; Is it possible to replace both the human and Johan by a script?

Yaël Dillies (Feb 13 2024 at 15:08):

It's pretty easy to list which files a PR touches and also pretty easy to calculate the import graph on master, so a somewhat naive criterion for rebatching is something like "Put all the PRs that don't touch anything upstream of the failing file in a batch. Split the other ones into two equal size batches.". A more refined version could be looking at the distance along the import graph (either in terms of file, or using a more complicated metric like the number of declarations used) and make sure the "didn't touch the file" batch does not contain more than 1/3 of the batch's PRs.
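A minimal Python sketch of that naive criterion, assuming hypothetical helpers `touched_files` (e.g. backed by `git diff --name-only`) and `upstream_of` (a reachability query on the import graph of master); neither exists in bors itself:

def rebatch(prs, failing_file, touched_files, upstream_of):
    # Files a change to which could explain the failure: the failing
    # file itself plus everything it (transitively) imports.
    suspect_files = upstream_of(failing_file) | {failing_file}
    # PRs touching none of those files cannot have caused the failure.
    innocent = [pr for pr in prs if not (touched_files(pr) & suspect_files)]
    suspects = [pr for pr in prs if touched_files(pr) & suspect_files]
    # One batch of innocents; the suspects are bisected as usual.
    half = len(suspects) // 2
    return [b for b in (innocent, suspects[:half], suspects[half:]) if b]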

Yaël Dillies (Feb 13 2024 at 15:09):

Because it's the trend, one could even think of training an AI to recognise what PR is causing the build error and single it out...

Eric Rodriguez (Feb 13 2024 at 15:12):

Mergify doesn't support something like this, I don't think. I mean, at the current moment we seem to have enough CI capacity - are you finding yourself waiting a lot?

Yaël Dillies (Feb 13 2024 at 15:13):

No, but @Johan Commelin and I were discussing how our CI setup is going to scale up. As we seem to be able to get Mergify to (slowly) work on features we need, it seems good to think through what our future needs will be.

Patrick Massot (Feb 13 2024 at 15:14):

Yaël Dillies said:

Put another way: A human can reasonably quickly read a log file and tell Johan what he should take off the queue; Is it possible to replace both the human and Johan by a script?

Johan, it seems you’ve been promoted(?) to non-human.

Ruben Van de Velde (Feb 13 2024 at 15:16):

There's another 25-PR batch running, let's see... Nope, it's failed

Ruben Van de Velde (Feb 13 2024 at 15:18):

Which is to say, I agree

Johan Commelin (Feb 13 2024 at 15:19):

Where should we put this setting? In some file? Or via some web UI?

Yaël Dillies (Feb 13 2024 at 15:19):

Johan certainly provides super-human reviewing!

Eric Rodriguez (Feb 13 2024 at 15:19):

I've not heard anything from mergify wrt making the merge behaviour as we desire - am I out of the loop here?

Eric Rodriguez (Feb 13 2024 at 15:19):

bors.toml, I'd assume

Yaël Dillies (Feb 13 2024 at 15:20):

Here: https://leanprover.zulipchat.com/#narrow/stream/113488-general/topic/bors.20going.20away.3F/near/412131903

Johan Commelin (Feb 13 2024 at 15:25):

#10499

Johan Commelin (Feb 14 2024 at 09:08):

Here is a short analysis of what happened:

- Start: 2/13/2024, 13:28:55
- End:   2/14/2024, 08:40:06
- Duration: 19 hours 11 minutes
- Total number of merged PRs: 87
- Total batches: 26
- Total canceled batches: 1 (size 13)
- Total failed batches: 12
- Total succeeded batches: 13
- Average batch size: 10.3 = 268/26 (268 = 87 merged + 168 across failed batches + 13 canceled)
- Average successful batch size: 6.7 = 87/13
- Average failed batch size: 14 = 168/12

Johan Commelin (Feb 14 2024 at 09:09):

Which means that under this kind of stress, bors is about 3.5x more efficient than a linear merging strategy (87 PRs landed in 25 completed runs, versus 87 runs merging one PR at a time).

Ruben Van de Velde (Feb 14 2024 at 09:17):

Probably would even have been a bit more efficient if we'd had the batch size limit in place earlier

Yaël Dillies (Feb 14 2024 at 09:18):

I conjecture that a max batch size of 16 is still too big

Johan Commelin (Feb 14 2024 at 09:35):

Distribution of failed batch sizes

#
####
#######
########
##########
############
##############
################
###################
#########################
##########################
##########################

Anne Baanen (Feb 14 2024 at 10:29):

I would like to try a numerical argument for batch sizes. Let's simulate some options and see what looks best!

#!/usr/bin/env python3

import statistics
import random

def make_batch(batch_size, failure_mode):
    # Let's assume each PR in the batch has an independent chance of `prob_failure` to contain an error.
    prob_failure = failure_mode
    return [random.uniform(0, 1) > prob_failure for i in range(0, batch_size)]

def batch_succeeds(batch):
    # A batch succeeds if all PRs in this batch succeed.
    # We assume there are no inter-batch failures.
    return all(batch)

def split_on_failure(batch):
    # If a single PR fails, we drop it from the queue.
    # Otherwise we bisect the batch.
    if len(batch) <= 1:
        return []
    else:
        return [batch[:len(batch)//2], batch[len(batch)//2:]]

def simulate(batch_size, failure_mode, attempts):
    # Return the number of merged PRs during a number of build attempts.
    # We'll assume each batch is filled up to `batch_size`,
    # and the way a batch may fail is controlled by `failure_mode`.

    successes = 0
    queue = []
    for i in range(0, attempts):
        if not queue:
            queue = [make_batch(batch_size, failure_mode)]
        # We'll operate the queue like a true queue, which seems to be Bors' model.
        batch, *queue = queue
        if batch_succeeds(batch):
            successes += len(batch)
        else:
            queue.extend(split_on_failure(batch))

    return successes

if __name__ == "__main__":
    for batch_size in range(1, 32+1):
        for prob_failure in [0, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5]:
            outcomes = [simulate(batch_size, prob_failure, 1000) for i in range(0, 100)]
            print(batch_size, prob_failure, statistics.mean(outcomes), statistics.median(outcomes), statistics.stdev(outcomes))

Mario Carneiro (Feb 14 2024 at 10:33):

I'm using the following simulation:

sim[p_, \[Mu]_, n_] := (
  fails = {};
  successes = {};
  bisect[vals_] :=
   With[{ok = And @@ vals},
    If[ok, AppendTo[successes, Length[vals]],
     AppendTo[fails, Length[vals]];
     If[! ok && Length[vals] > 1,
      With[{b = Floor[Length[vals]/2]}, bisect[vals[[;; b]]];
       bisect[vals[[b + 1 ;;]]]]]]];
  Do[bisect[
    Table[RandomReal[] <= p,
     RandomVariate[PoissonDistribution[\[Mu]]]]], n];
  Mean /@ {fails, successes, Join[fails, successes]} // N)

This assumes no batch limit, where the initial batch is Poisson-distributed with mean $\mu$, and the batch is bisected to find the errors (each PR succeeds independently with probability $p$). For the given data I get something like:

Manipulate[
 Show[ListPlot@
   Transpose@
    Table[{\[Mu], #} & /@ sim[1 - 1/p, \[Mu], 30], {\[Mu], 10, 200}],
  Plot[{14, 6.7, 10.3}, {\[Mu], 10, 200}]], {{p, 27}, 1, 50}]

image.png

Mario Carneiro (Feb 14 2024 at 10:34):

which is to say the predicted initial batch is $\mu = 80$ (with success probability $1 - 1/27$), which sounds a bit unrealistically high; I must be missing an aspect of the process

Anne Baanen (Feb 14 2024 at 10:37):

Based on running my script: if each PR has an i.i.d. chance p of failing (and therefore causing any batch it's in to fail), then the batch sizes for optimal throughput, followed by everything within one std. dev. are approximately as follows:

p     optimal   within 1 std. dev.
0.00  >32       32
0.01  >32       19-32
0.02  >32       11-32
0.05  14        5-32
0.10  4         3-9
0.20  2         2-2
0.50  1         1-1

Anne Baanen (Feb 14 2024 at 10:39):

So if we assume more than one in 20 PRs fails, then our batch size is too big.

Mario Carneiro (Feb 14 2024 at 10:39):

AFAICT your code is using the same simulation as mine except that you are assuming fixed batch size and I'm using poisson-distributed batch sizes

Mario Carneiro (Feb 14 2024 at 10:39):

you are also not tracking number of failed batches

Mario Carneiro (Feb 14 2024 at 10:40):

my original goal was to try to fit the simulation to find the parameters

Mario Carneiro (Feb 14 2024 at 10:40):

and then you can use that to test batch sizes

Anne Baanen (Feb 14 2024 at 10:45):

Mario Carneiro said:

AFAICT your code is using the same simulation as mine except that you are assuming fixed batch size and I'm using poisson-distributed batch sizes

Yes, I think our models agree. My question is about maximum throughput during congested periods (how many merges is the upper limit we can survive), and I suppose yours is more about uncongested periods.

Anne Baanen (Feb 14 2024 at 10:46):

Is Poisson a reasonable assumption? I think our behaviour is somewhat more towards the tails and away from the middle, with periods of inactivity and of high activity.

Mario Carneiro (Feb 14 2024 at 10:48):

the landing time of PRs is not particularly correlated, except that there are times of day and of the week for high activity

Mario Carneiro (Feb 14 2024 at 11:07):

Ooh, I incorporated another aspect and got some interesting behavior:

sim[p_, \[Mu]_, n_] := (
  fails = {};
  successes = {};
  time = 1;
  bisect[vals_] := With[{ok = And @@ vals},
    time += 1;
    If[ok, AppendTo[successes, Length[vals]],
     AppendTo[fails, Length[vals]];
     If[! ok && Length[vals] > 1,
      With[{b = Floor[Length[vals]/2]}, bisect[vals[[;; b]]];
       bisect[vals[[b + 1 ;;]]]]]]];
  Do[batch =
    Table[RandomReal[] <= p,
     RandomVariate[PoissonDistribution[\[Mu] time]]];
   time = 0;
   bisect[batch], n];
  Mean /@ {fails, successes, Join[fails, successes]} // N)

Here the mean for the distribution is $\mu t$ where $t$ is the number of batches, on the reasoning that batches take time, and the longer it takes to bisect, the more PRs queue up for the next run. What I didn't expect is that if $\mu$ is larger than a constant then the simulation diverges with exponentially growing batch sizes: at a certain point the batch size $n$ is much larger than $1/q$ (where $q$ is the per-PR failure rate), so we have to bisect about $qn$ times to find all the failures in the batch, and so if $\mu q > 1$ then the batches grow like $(\mu q)^k n$.
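A rough Python port of this second simulation, under the same assumptions (each PR succeeds independently with probability `p`; a fresh top-level batch is Poisson with mean `mu` times the number of runs spent since the last one):

import math
import random

def poisson(lam):
    # Knuth's algorithm for small rates; a normal approximation for
    # large ones, which is plenty for a rough simulation like this.
    if lam > 500:
        return max(0, round(random.gauss(lam, math.sqrt(lam))))
    threshold, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= threshold:
            return k
        k += 1

def simulate_arrivals(p, mu, rounds):
    # Returns the batch sizes seen over `rounds` CI runs. Failing batches
    # are bisected; the longer a bisection takes, the larger the next
    # top-level batch (mean mu * elapsed runs), as described above.
    sizes, time, queue = [], 1, []
    for _ in range(rounds):
        if not queue:
            queue = [[random.random() <= p for _ in range(poisson(mu * time))]]
            time = 0
        batch = queue.pop(0)
        time += 1
        sizes.append(len(batch))
        if not all(batch) and len(batch) > 1:
            mid = len(batch) // 2
            queue += [batch[:mid], batch[mid:]]
    return sizes

With $q = 1/27$, `simulate_arrivals(1 - 1/27, 20, 2000)` stays bounded ($\mu q \approx 0.74$), while `mu = 40` ($\mu q \approx 1.48$) shows the exponential blow-up.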

Mario Carneiro (Feb 14 2024 at 11:08):

I suppose at some level this is expected, this is what happens if the people outrun bors

Mario Carneiro (Feb 14 2024 at 11:09):

IIRC we had a situation like this with the benchmark bot, where the queue was plodding along a few months behind real time

Mario Carneiro (Feb 14 2024 at 13:09):

Given fixed probability p, we can actually work out a recursive formula for the cost of bisection:

cost[n_] := cost[n] = 1 + (1 - p^n) failcost[n]
failcost[n_] :=
 If[n <= 1, 0, With[{m = Floor[n/2]}, cost[m] + cost[n - m]]]
maxcost[n_, m_] :=
 If[n > 2 m, cost[m] + maxcost[n - m, m],
  If[n > m, cost[Floor[n/2]] + cost[n - Floor[n/2]], cost[n]]]

That is, $\mathrm{cost}_n(p) = 1 + (1 - p^n)\,\mathrm{failcost}_n(p)$ and $\mathrm{failcost}_n(p) = \mathrm{cost}_m(p) + \mathrm{cost}_{n-m}(p)$ where $m = \lfloor n/2 \rfloor$. $\mathrm{cost}_n(p)$ is a polynomial in $p$ for each $n$, which is $2n - 1$ near $p = 0$ and $1 + 2n(1 - p)$ near $p = 1$.
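A direct Python transcription of the recursion, with $p$ fixed numerically rather than kept symbolic:

from functools import lru_cache

P = 1 - 1 / 27  # per-PR success probability

@lru_cache(maxsize=None)
def cost(n):
    # Expected CI runs to fully process a batch of n PRs.
    return 1 + (1 - P**n) * failcost(n)

def failcost(n):
    # Once a batch of n is known to fail: drop a singleton, else bisect.
    if n <= 1:
        return 0
    m = n // 2
    return cost(m) + cost(n - m)

def maxcost(n, m):
    # Cost of a backlog of n PRs under a batch size limit of m.
    if n > 2 * m:
        return cost(m) + maxcost(n - m, m)
    if n > m:
        return cost(n // 2) + cost(n - n // 2)
    return cost(n)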

Mario Carneiro (Feb 14 2024 at 13:09):

Here's a plot of the relative benefit of a batch limit $m$ compared to no limit, when $n$ items are in the queue, assuming $p = 1 - 1/27$ and for $m = 16, 27, 40, 56$:

ListPlot[
 Transpose[
  Table[cost[n] - maxcost[n, m] /. p -> 1 - 1/27., {n, 1,
    200}, {m, {16, 27, 40, 56}}]], Joined -> True]

image.png

The bottom value is $m = 16$, which is clearly too low. Playing around with this shows that the benefit becomes strictly positive when $m$ is more than twice the mean number of PRs per failure (the red line). Also shown are the intermediate values $m = 27$ (yellow, equal to the mean number of PRs per failure), which gets off to a weak start, and $m = 40$ (in green), which seems to have the best long-term behavior (for large $n$, i.e. when there is a large backlog).

Yaël Dillies (Feb 14 2024 at 13:10):

From experience, a failure probability of 1/27 sounds too low. Can we gather more bors data?

Ruben Van de Velde (Feb 14 2024 at 13:12):

:up: mathematicians using CI

Mario Carneiro (Feb 14 2024 at 13:13):

This is the failure probability for individual PRs, not batches. The failure rate for batches is about 55% in these simulations (although it depends on the batching strategy of course)

Damiano Testa (Feb 14 2024 at 13:13):

... and then we should formalise the results.

Yaël Dillies (Feb 14 2024 at 13:14):

Mario Carneiro said:

This is the failure probability for individual PRs, not batches.

I meant that 1/27 sounds too low for the failure of an individual PR.

Mario Carneiro (Feb 14 2024 at 13:15):

I don't think bors actually runs batches of size 1 which it thinks will fail by process of elimination, which might skew the data a little

Yaël Dillies (Feb 14 2024 at 13:16):

It does seem to run the second batch of size m/2 even when the first batch of size m/2 has already hit master though.

Mario Carneiro (Feb 14 2024 at 13:17):

yes, that's a necessary part of the bisection

Yaël Dillies (Feb 14 2024 at 13:18):

Why? We know the second batch will fail, so why not split further into two batches of size m/4 ?

Mario Carneiro (Feb 14 2024 at 13:21):

oh I see. That would require more state I guess, it would need to share information between siblings in the bisection tree

Mario Carneiro (Feb 14 2024 at 13:22):

after the initial bisection you would mark the second batch as "failing if the first batch succeeds"

Yaël Dillies (Feb 14 2024 at 13:22):

Yes, exactly

Mario Carneiro (Feb 14 2024 at 13:22):

my cost function doesn't take that into account either

Yaël Dillies (Feb 14 2024 at 13:23):

I suspect those "doomed to fail" batches account for quite a lot of CI time overall.

Josha Dekker (Feb 14 2024 at 13:26):

Just a quick thought: are failure probabilities for a PR roughly constant or do we have some property (e.g. total diff etc) that is predictive for failure probabilities? This way, you could make the bisection split "balanced" in the sense that each split has about equal estimated probability of failing rather than equal number of PRs.
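For illustration, here is what such a balanced split could look like, assuming a per-PR failure estimate `prob_fail` is available (from diff size, file count, branch age, or similar):

import math

def balanced_split(prs, prob_fail):
    # A batch fails iff any PR in it fails, so balancing the failure
    # probability of the two halves means balancing the products of the
    # per-PR success probabilities, i.e. the sums of -log(1 - p_i).
    weight = lambda pr: -math.log(max(1e-9, 1.0 - prob_fail(pr)))
    left, right, wl, wr = [], [], 0.0, 0.0
    # Greedy number partitioning: heaviest PRs first, into the lighter half.
    for pr in sorted(prs, key=weight, reverse=True):
        if wl <= wr:
            left.append(pr)
            wl += weight(pr)
        else:
            right.append(pr)
            wr += weight(pr)
    return left, right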

Yaël Dillies (Feb 14 2024 at 13:27):

Currently, if you have $2^n$ PRs, all but one of which are good (and mutually compatible) and one is bad, you will run one batch with $2^n$ PRs, two batches with $2^{n-1}$ PRs, ..., two batches with $1$ PR, namely $2n + 1$ batches. If you do my optimisation, you will run one batch with $2^n$ PRs, one batch with $2^{n-1}$ PRs, ..., one batch with $1$ PR, namely $n + 1$ batches. That's a 50% improvement!
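One can check the count with a toy recursion; `skip_doomed_sibling` models the proposed optimisation, and `bad` is the position of the single bad PR:

def runs_to_clear(n_prs, bad, skip_doomed_sibling):
    # CI runs needed to clear a batch with one bad PR at index `bad`,
    # bisecting on failure. With the optimisation, a batch whose sibling
    # succeeded is known to fail, so it is split (or dropped) without a run.
    def go(lo, hi, known_bad):
        runs = 0 if known_bad else 1
        if not (lo <= bad < hi) or hi - lo == 1:
            return runs
        mid = (lo + hi) // 2
        if bad < mid:
            return runs + go(lo, mid, False) + go(mid, hi, False)
        return runs + go(lo, mid, False) + go(mid, hi, skip_doomed_sibling)
    return go(0, n_prs, False)

For the bad PR in the last slot, `runs_to_clear(32, 31, False)` gives 11 = 2·5 + 1 and `runs_to_clear(32, 31, True)` gives 6 = 5 + 1, matching the counts above.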

Anne Baanen (Feb 14 2024 at 13:29):

Here are raw throughput numbers where we ignore "doomed to failure" batches: afbeelding.png
Here where we drop single PRs that are "doomed to failure":
afbeelding.png
And here where we bisect batches that are "doomed to fail" (singletons are dropped, like above):
afbeelding.png

Anne Baanen (Feb 14 2024 at 13:31):

(Not entirely sure why the third table has lower throughput than the second one. It might be that I implemented the logic incorrectly...)

Yaël Dillies (Feb 14 2024 at 13:33):

Josha Dekker said:

Just a quick thought: are failure probabilities for a PR roughly constant or do we have some property (e.g. total diff etc) that is predictive for failure probabilities? This way, you could make the bisection split "balanced" in the sense that each split has about equal estimated probability of failing rather than equal number of PRs.

Here are my thoughts from the reviewer stream a few days ago:
Yaël Dillies said:

This is probably day-dreaming, but is it possible in any way to use the logs of a failed bors batch (or mergify batch, when that becomes a thing) in order to influence how it's going to be rebatched? Put another way: A human can reasonably quickly read a log file and tell Johan what he should take off the queue; Is it possible to replace both the human and Johan by a script?
It's pretty easy to list which files a PR touches and also pretty easy to calculate the import graph on master, so a somewhat naive criterion for rebatching is something like "Put all the PRs that don't touch anything upstream of the failing file in a batch. Split the other ones into two equal size batches.". A more refined version could be looking at the distance along the import graph (either in terms of file, or using a more complicated metric like the number of declarations used) and make sure the "didn't touch the file" batch does not contain more than 1/3 of the batch's PRs.
Because it's the trend, one could even think of training an AI to recognise what PR is causing the build error and single it out...

Yaël Dillies (Feb 14 2024 at 13:36):

Yaël Dillies said:

Why? We know the second batch will fail, so why not split further into two batches of size m/4 ?

This is not always true. It can happen that a bad PR becomes good because another PR improved performance on master, or whatever. But this is rare enough that I think it's fair to reject the bad-turned-good PR until the author merges master and checks that everything is indeed alright.

Anne Baanen (Feb 14 2024 at 13:37):

Anne Baanen said:

(Not entirely sure why the third table has lower throughput than the second one. It might be that I implemented the logic incorrectly...)

I added logic to drop predicted-to-fail batches of size 1 outright. Now the raw numbers look like:
afbeelding.png

Yaël Dillies (Feb 14 2024 at 13:38):

Those numbers are very encouraging!

Anne Baanen (Feb 14 2024 at 13:42):

One conclusion we might draw is that if Bors is smart and drops batches that are predicted to fail, we _don't_ want a batch size that is a power of two. Multiples of 3 like 12 or 24 seem to be slightly better!

Anne Baanen (Feb 14 2024 at 13:45):

Should we rename and/or merge the two active bors threads by the way? Maybe this can become "Optimizing bors batch size" and the other "Bors failure investigation".

Yaël Dillies (Feb 14 2024 at 13:46):

Actually, could you move my "Fine-tuning bors batching" thread out of #mathlib reviewers and then move the relevant messages from here to it?

Mario Carneiro (Feb 14 2024 at 13:46):

The formula only changes a little bit accounting for the "doomed to fail" optimization:

cost2[n_] := cost2[n] = 1 + (1 - p^n) failcost2[n]
failcost2[n_] :=
 If[n <= 1, 0, With[{m = Floor[n/2]}, cost2[m] + cost2[n - m] - p^m]]

This has the effect of increasing throughput by about 1/25. Here it is compared to the batching optimizations:
image.png

Mario Carneiro (Feb 14 2024 at 13:47):

and of course the batching optimizations still apply on top of this

Yaël Dillies (Feb 14 2024 at 13:47):

Anne Baanen said:

Multiples of 3 like 12 or 24 seem to be slightly better!

Does that mean we should try trisecting? Try a batch of size m. It fails. Try a batch of size m/3. It works. Try a batch of size m/3. It works. Trisect the last m/3 batch.
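One way to test that within the cost model above: extend the recursion to a three-way split, with the analogous saving that when the first two parts succeed the last one is known bad (mirroring the approximate $-p^m$ correction; $P$ fixed numerically):

from functools import lru_cache

P = 1 - 1 / 27  # per-PR success probability

@lru_cache(maxsize=None)
def cost_bi(n):
    # Bisection with the "doomed sibling" saving, as in cost2 above.
    if n <= 1:
        return 1
    m = n // 2
    return 1 + (1 - P**n) * (cost_bi(m) + cost_bi(n - m) - P**m)

@lru_cache(maxsize=None)
def cost_tri(n):
    # Trisection: if the first two parts pass, the third's run is saved.
    if n <= 2:
        return cost_bi(n)
    a = n // 3
    b = (n - a) // 2
    fail = cost_tri(a) + cost_tri(b) + cost_tri(n - a - b) - P**(a + b)
    return 1 + (1 - P**n) * fail

# e.g. [(n, round(cost_bi(n), 2), round(cost_tri(n), 2)) for n in (12, 24, 32)]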

Anne Baanen (Feb 14 2024 at 13:48):

Yaël Dillies said:

Actually, could you move my "Fine-tuning bors batching" thread out of #mathlib reviewers and then move the relevant messages from here to it?

How about the "Bors batch size" thread in the reviewers stream?

Yaël Dillies (Feb 14 2024 at 13:48):

You could merge it too, I guess. It might become a little confusing.

Notification Bot (Feb 14 2024 at 13:49):

This topic was moved here from #mathlib reviewers > Fine-tuning bors batching by Anne Baanen.

Notification Bot (Feb 14 2024 at 13:50):

This topic was moved here from #mathlib reviewers > Bors batch size by Anne Baanen.

Notification Bot (Feb 14 2024 at 13:50):

52 messages were moved here from #mathlib4 > bors failing by Anne Baanen.

Mario Carneiro (Feb 14 2024 at 14:28):

More plots. The question of where to bisect is the same as asking what is the minimum of the following function:
image.png

Here it is for more values of nn:

ListPlot[
 Table[{m/n,
   Evaluate[(cost[m] + cost[n - m] - p^m) - (cost[1] + cost[n - 1] -
        p^1) /. p -> 1 - 1/27.]}, {n, {10, 20, 30, 40, 80, 300,
    2000}}, {m, 1, n - 1}], Joined -> True, PlotRange -> {-1, 1}]

image.png

There are two interesting features of this function compared to the usual one. Normally this would be a symmetric function which bows in the middle, hence the minimum is in the middle and that's where you split. Here, however, the function has an overall tilt down to the left because of the additional $p^m$ factor, meaning that we should prefer to split more leftish. Moreover, the case $m = 1$ gets a little boost because it saves on the cost of a bisection. As a result, the minimum is at $m = 1$ for a significant range of values, all the way up to $n = 30$; then it hangs somewhere around $n/3$ for a while, and for very large $n$ the bottom starts to turn into a fractal, so I'm not so sure.

Mario Carneiro (Feb 14 2024 at 14:32):

And here's a plot of the relative advantages to throughput of applying the splitting strategy $n/3$, or $\mathrm{if}(n < 30, 1, n/3)$, recursively, over the usual $n/2$:
image.png

Mario Carneiro (Feb 14 2024 at 15:43):

Yaël Dillies said:

Currently, if you have $2^n$ PRs, all but one of which are good (and mutually compatible) and one is bad, you will run one batch with $2^n$ PRs, two batches with $2^{n-1}$ PRs, ..., two batches with $1$ PR, namely $2n + 1$ batches. If you do my optimisation, you will run one batch with $2^n$ PRs, one batch with $2^{n-1}$ PRs, ..., one batch with $1$ PR, namely $n + 1$ batches. That's a 50% improvement!

This analysis isn't quite correct. It's true if the failing PR is the last one in the list, but if it is the first one then you get no improvement. In general it's somewhere in between

Mario Carneiro (Feb 14 2024 at 15:44):

because the optimization only applies when the first half is good and the second half is bad

Mario Carneiro (Feb 14 2024 at 15:58):

Just pulled some bors data regarding 1-commit batches: there are 316 successes and 45 failures in the database right now, i.e. about one failure for every seven successes

Sébastien Gouëzel (Feb 14 2024 at 16:02):

Isn't this biased by the fact that correct commits are more likely to be merged in bigger batches, while buggy commits will end up alone and will therefore be part of a 1-commit batch?

Eric Rodriguez (Feb 14 2024 at 16:07):

Can we not run the simulations on historical data? Maybe just grab the past week and see what would happen

Jireh Loreaux (Feb 14 2024 at 16:07):

I'm wondering if we should take more advantage of priority, especially in common cases where we expect things to break when part of a batch. I'm betting that a small amount of thoughtful human interaction can drastically decrease the failure rate.

I'm thinking of, for instance, PRs renaming very common lemmas (even with deprecation, bors will fail because Mathlib doesn't allow using deprecated lemmas), PRs deleting / renaming a file, etc.

Jireh Loreaux (Feb 14 2024 at 16:08):

I think one source of failures from yesterday was due to a file rename.

Ruben Van de Velde (Feb 14 2024 at 16:54):

One of the issues I think I saw multiple times was PRs not being updated since the FunLike refactor. I wonder if we could have a warning or something for PRs that are very outdated

Johan Commelin (Feb 14 2024 at 16:59):

@Mario Carneiro at some point automatically commented on all open PRs pinging them to update...

Jireh Loreaux (Feb 14 2024 at 17:02):

I think that's because he updated all the toolchains for them in order to avoid data loss from v4.x.x-rc3, whichever one it was.

Jireh Loreaux (Feb 14 2024 at 17:02):

Oh but I guess your point is more that such a pinging of all PRs is possible.

Ruben Van de Velde (Feb 14 2024 at 17:30):

Yeah, that could have helped with this case, though my point was more that old branches are more prone to failing in general

Yaël Dillies (Feb 14 2024 at 17:35):

Do we have data about probability of a PR failing plotted according to its age (as measured by the latest master merge)?

Eric Rodriguez (Feb 14 2024 at 18:29):

I really think we just need to find some number that seems sensible and stick with it. Fiddling with very specific numbers is not going to fix all issues. One thing that mergify does which could be helpful is that it can do batches speculatively, i.e. in parallel. I think that's much more likely to give us a boost than trying to dig deeper and deeper into this

Mario Carneiro (Feb 14 2024 at 20:08):

Jireh Loreaux said:

I'm wondering if we should take more advantage of priority, especially in common cases where we expect things to break when part of a batch.

There is the bors single on command, which says "don't put this PR in a batch, compile it on its own". I recommend using this for PRs that have lots of merge conflicts

Jireh Loreaux (Feb 14 2024 at 21:07):

Can a bot find the most recent common ancestor of a PR branch and master, and then use the age of that commit to add a label?
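That is exactly what `git merge-base` computes; a minimal sketch of such a bot's logic (the 30-day threshold and label name are made up):

import subprocess
from datetime import datetime, timezone

def branch_age_days(branch, base="origin/master"):
    # Most recent common ancestor of the PR branch and master...
    sha = subprocess.run(["git", "merge-base", base, branch],
                         capture_output=True, text=True, check=True).stdout.strip()
    # ...and the age of that commit, from its committer timestamp.
    ts = int(subprocess.run(["git", "show", "-s", "--format=%ct", sha],
                            capture_output=True, text=True, check=True).stdout)
    return (datetime.now(timezone.utc)
            - datetime.fromtimestamp(ts, timezone.utc)).days

# e.g.: if branch_age_days("origin/my-pr-branch") > 30, attach a
# hypothetical "stale-branch" label via the GitHub API.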


Last updated: May 02 2025 at 03:31 UTC