Zulip Chat Archive
Stream: mathlib4
Topic: Fine-tuning bors batching
Yaël Dillies (Feb 13 2024 at 14:54):
Batches like https://mathlib-bors-ca18eefec4cb.herokuapp.com/batches/1584 seem doomed to fail due to their large size. What do people think of setting max_batch_size to something like 16 to avoid huge useless runs?
Yaël Dillies (Feb 13 2024 at 15:02):
This is probably day-dreaming, but is it possible in any way to use the logs of a failed bors batch (or mergify batch, when that becomes a thing) in order to influence how it's going to be rebatched? Put another way: A human can reasonably quickly read a log file and tell Johan what he should take off the queue; Is it possible to replace both the human and Johan by a script?
Yaël Dillies (Feb 13 2024 at 15:08):
It's pretty easy to list which files a PR touches and also pretty easy to calculate the import graph on master, so a somewhat naive criterion for rebatching is something like "Put all the PRs that don't touch anything upstream of the failing file in a batch. Split the other ones into two equal size batches.". A more refined version could be looking at the distance along the import graph (either in terms of file, or using a more complicated metric like the number of declarations used) and make sure the "didn't touch the file" batch does not contain more than 1/3 of the batch's PRs.
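A minimal sketch of that naive criterion in Python, under stated assumptions: `failing_file` would be parsed from the failed bors log, `touched_files[pr]` would be the set of files a PR modifies, and `upstream_of(f)` would return the files that `f` transitively imports; all three inputs are hypothetical here, not existing mathlib tooling.
def rebatch(prs, touched_files, failing_file, upstream_of):
    # Files whose changes could plausibly have broken the failing file.
    danger_zone = upstream_of(failing_file) | {failing_file}
    # PRs that touch nothing upstream of the failure go back in as one big batch.
    innocent = [pr for pr in prs if not (touched_files[pr] & danger_zone)]
    # The remaining suspects are split into two equal-size batches, as usual.
    suspects = [pr for pr in prs if touched_files[pr] & danger_zone]
    half = len(suspects) // 2
    return [b for b in (innocent, suspects[:half], suspects[half:]) if b]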
Yaël Dillies (Feb 13 2024 at 15:09):
Because it's the trend, one could even think of training an AI to recognise what PR is causing the build error and single it out...
Eric Rodriguez (Feb 13 2024 at 15:12):
Mergify doesn't support something like this, I don't think. I mean, at the current moment we seem to have enough CI capacity - are you finding yourself waiting a lot?
Yaël Dillies (Feb 13 2024 at 15:13):
No, but @Johan Commelin and I were discussing how our CI setup is going to scale up. As we seem to be able to get Mergify to (slowly) work on features we need, it seems good to think through what our future needs will be.
Patrick Massot (Feb 13 2024 at 15:14):
Yaël Dillies said:
Put another way: A human can reasonably quickly read a log file and tell Johan what he should take off the queue; Is it possible to replace both the human and Johan by a script?
Johan, it seems you’ve been promoted(?) to non-human.
Ruben Van de Velde (Feb 13 2024 at 15:16):
There's another 25-PR batch running, let's see... Nope, it's failed
Ruben Van de Velde (Feb 13 2024 at 15:18):
Which is to say, I agree
Johan Commelin (Feb 13 2024 at 15:19):
Where should we put this setting? In some file? Or via some web UI?
Yaël Dillies (Feb 13 2024 at 15:19):
Johan certainly provides super-human reviewing!
Eric Rodriguez (Feb 13 2024 at 15:19):
I've not heard anything from Mergify wrt making the merge behaviour work as we desire - am I out of the loop here?
Eric Rodriguez (Feb 13 2024 at 15:19):
bors.toml, I'd assume
Yaël Dillies (Feb 13 2024 at 15:20):
Johan Commelin (Feb 13 2024 at 15:25):
Johan Commelin (Feb 14 2024 at 09:08):
Here is a short analysis of what happened:
- Start: 2/13/2024, 13:28:55
- End: 2/14/2024, 08:40:06
- Duration: 19 hours 11 minutes
- Total number of merged PRs: 87
- Total batches: 26
- Total canceled batches: 1 (size 13)
- Total failed batches: 12
- Total succeeded batches: 13
- Average batch size: 10.3 = 268/26
- Average successful batch size: 6.7 = 87/13
- Average failed batch size: 14 = 168/12
Johan Commelin (Feb 14 2024 at 09:09):
Which means that under this kind of stress, bors is about 3.5x more efficient than a linear merging strategy (87 PRs merged using 26 batch runs, versus the 87 runs a one-PR-at-a-time strategy would need).
Ruben Van de Velde (Feb 14 2024 at 09:17):
Probably would even have been a bit more efficient if we'd had the batch size limit in place earlier
Yaël Dillies (Feb 14 2024 at 09:18):
I conjecture that a max batch size of 16 is still too big
Johan Commelin (Feb 14 2024 at 09:35):
Distribution of failed batch sizes
#
####
#######
########
##########
############
##############
################
###################
#########################
##########################
##########################
Anne Baanen (Feb 14 2024 at 10:29):
I would like to try a numerical argument for batch sizes. Let's simulate some options and see what looks best!
#!/usr/bin/env python3
import statistics
import random

def make_batch(batch_size, failure_mode):
    # Let's assume each PR in the batch has an independent chance of `prob_failure` to contain an error.
    prob_failure = failure_mode
    return [random.uniform(0, 1) > prob_failure for i in range(0, batch_size)]

def batch_succeeds(batch):
    # A batch succeeds if all PRs in this batch succeed.
    # We assume there are no inter-batch failures.
    return all(batch)

def split_on_failure(batch):
    # If a single PR fails, we drop it from the queue.
    # Otherwise we bisect the batch.
    if len(batch) <= 1:
        return []
    else:
        return [batch[:len(batch)//2], batch[len(batch)//2:]]

def simulate(batch_size, failure_mode, attempts):
    # Return the number of merged PRs during a number of build attempts.
    # We'll assume each batch is filled up to `batch_size`,
    # and the way a batch may fail is controlled by `failure_mode`.
    successes = 0
    queue = []
    for i in range(0, attempts):
        if not queue:
            queue = [make_batch(batch_size, failure_mode)]
        # We'll operate the queue like a true queue, which seems to be Bors' model.
        batch, *queue = queue
        if batch_succeeds(batch):
            successes += len(batch)
        else:
            queue.extend(split_on_failure(batch))
    return successes

if __name__ == "__main__":
    for batch_size in range(1, 32+1):
        for prob_failure in [0, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5]:
            outcomes = [simulate(batch_size, prob_failure, 1000) for i in range(0, 100)]
            print(batch_size, prob_failure, statistics.mean(outcomes), statistics.median(outcomes), statistics.stdev(outcomes))
Mario Carneiro (Feb 14 2024 at 10:33):
I'm using the following simulation:
sim[p_, \[Mu]_, n_] := (
  fails = {};
  successes = {};
  bisect[vals_] :=
    With[{ok = And @@ vals},
      If[ok, AppendTo[successes, Length[vals]],
        AppendTo[fails, Length[vals]];
        If[! ok && Length[vals] > 1,
          With[{b = Floor[Length[vals]/2]},
            bisect[vals[[;; b]]];
            bisect[vals[[b + 1 ;;]]]]]]];
  Do[bisect[
    Table[RandomReal[] <= p,
      RandomVariate[PoissonDistribution[\[Mu]]]]], n];
  Mean /@ {fails, successes, Join[fails, successes]} // N)
This assumes no batch limit, where the initial batch is Poisson-distributed with mean μ, and the batch is bisected to find the errors (each PR failing independently with probability 1 − p). For the given data I get something like:
Manipulate[
Show[ListPlot@
Transpose@
Table[{\[Mu], #} & /@ sim[1 - 1/p, \[Mu], 30], {\[Mu], 10, 200}],
Plot[{14, 6.7, 10.3}, {\[Mu], 10, 200}]], {{p, 27}, 1, 50}]
Mario Carneiro (Feb 14 2024 at 10:34):
which is to say the predicted initial batch is (with success probability ) which sounds a bit unrealistically high; I must be missing an aspect of the process
Anne Baanen (Feb 14 2024 at 10:37):
Based on running my script: if each PR has an i.i.d. chance p of failing (and therefore causing any batch it's in to fail), then the batch sizes for optimal throughput, followed by everything within one std. dev., are approximately as follows:
0.00 - >32 - 32
0.01 - >32 - 19-32
0.02 - >32 - 11-32
0.05 - 14 - 5-32
0.10 - 4 - 3-9
0.20 - 2 - 2-2
0.50 - 1 - 1-1
Anne Baanen (Feb 14 2024 at 10:39):
So if we assume more than one in 20 PRs fails, then our batch size is too big.
Mario Carneiro (Feb 14 2024 at 10:39):
AFAICT your code is using the same simulation as mine except that you are assuming fixed batch size and I'm using poisson-distributed batch sizes
Mario Carneiro (Feb 14 2024 at 10:39):
you are also not tracking number of failed batches
Mario Carneiro (Feb 14 2024 at 10:40):
my original goal was to try to fit the simulation to find the parameters
Mario Carneiro (Feb 14 2024 at 10:40):
and then you can use that to test batch sizes
Anne Baanen (Feb 14 2024 at 10:45):
Mario Carneiro said:
AFAICT your code is using the same simulation as mine except that you are assuming fixed batch size and I'm using poisson-distributed batch sizes
Yes, I think our models agree. My question is maximum throughput during congested periods (how many merges is the upper limit we can survive), and I suppose yours is more similar to uncongested periods.
Anne Baanen (Feb 14 2024 at 10:46):
Is Poisson a reasonable assumption? I think our behaviour is somewhat more towards the tails and away from the middle, with periods of inactivity and of high activity.
Mario Carneiro (Feb 14 2024 at 10:48):
the landing time of PRs is not particularly correlated, except that there are times of day and of the week for high activity
Mario Carneiro (Feb 14 2024 at 11:07):
Ooh, I incorporated another aspect and got some interesting behavior:
sim[p_, \[Mu]_, n_] := (
  fails = {};
  successes = {};
  time = 1;
  bisect[vals_] := With[{ok = And @@ vals},
    time += 1;
    If[ok, AppendTo[successes, Length[vals]],
      AppendTo[fails, Length[vals]];
      If[! ok && Length[vals] > 1,
        With[{b = Floor[Length[vals]/2]},
          bisect[vals[[;; b]]];
          bisect[vals[[b + 1 ;;]]]]]]];
  Do[
    batch = Table[RandomReal[] <= p,
      RandomVariate[PoissonDistribution[\[Mu] time]]];
    time = 0;
    bisect[batch], n];
  Mean /@ {fails, successes, Join[fails, successes]} // N)
Here the mean for the distribution is μ·t, where t is the number of batches run while processing the previous one, on the reasoning that batches take time, and the longer it takes to bisect, the more PRs queue up in the next run. What I didn't expect is that if μ is larger than a constant then the simulation diverges with exponentially growing batch sizes: at a certain point the batch size is much larger than the failure rate, so we have to bisect a number of times roughly proportional to the batch size to find all the failures in it, and once μ times that proportionality constant exceeds 1 the batches grow geometrically from one round to the next.
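A rough Python port of this time-coupled variant (my own translation rather than Mario's code; numpy is used for the Poisson draws, and all names are mine). It shows the same effect: below some arrival rate the batch sizes stay bounded, above it they blow up.
import numpy as np

def bisect(vals, counter):
    # One CI run per (sub-)batch; bisect on failure, as in the Mathematica code.
    counter["builds"] += 1
    if all(vals):
        counter["merged"] += len(vals)
    elif len(vals) > 1:
        mid = len(vals) // 2
        bisect(vals[:mid], counter)
        bisect(vals[mid:], counter)

def sim(p, mu, rounds, seed=0):
    # PRs arrive at rate `mu` per build, so the next top-level batch is
    # Poisson-sized with mean mu * (builds spent on the previous batch).
    rng = np.random.default_rng(seed)
    sizes = []
    builds_last = 1
    for _ in range(rounds):
        n = int(rng.poisson(mu * builds_last))
        sizes.append(n)
        counter = {"builds": 0, "merged": 0}
        bisect(list(rng.random(n) <= p), counter)
        builds_last = counter["builds"]
    return sizes

if __name__ == "__main__":
    # Compare lightly and heavily loaded queues at the same per-PR success rate:
    # past a threshold arrival rate, the batch sizes grow without bound.
    for mu in [2, 5, 10, 20]:
        print(mu, sim(1 - 1/27, mu, 8))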
Mario Carneiro (Feb 14 2024 at 11:08):
I suppose at some level this is expected, this is what happens if the people outrun bors
Mario Carneiro (Feb 14 2024 at 11:09):
IIRC we had a situation like this with the benchmark bot, where the queue was plodding along a few months behind real time
Mario Carneiro (Feb 14 2024 at 13:09):
Given fixed probability p, we can actually work out a recursive formula for the cost of bisection:
cost[n_] := cost[n] = 1 + (1 - p^n) failcost[n]
failcost[n_] :=
If[n <= 1, 0, With[{m = Floor[n/2]}, cost[m] + cost[n - m]]]
maxcost[n_, m_] :=
If[n > 2 m, cost[m] + maxcost[n - m, m],
If[n > m, cost[Floor[n/2]] + cost[n - Floor[n/2]], cost[n]]]
That is, cost(n) = 1 + (1 − p^n)·failcost(n) and failcost(n) = cost(⌊n/2⌋) + cost(⌈n/2⌉), where failcost(n) = 0 for n ≤ 1. cost(n) is a polynomial in p for each n, which is 1 near p = 1 and 2n − 1 near p = 0.
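The same recursion in Python, for anyone who wants to play with it without Mathematica (my transcription; p is fixed numerically here rather than kept symbolic):
from functools import lru_cache

P = 1 - 1/27  # per-PR success probability, as in the plots below

@lru_cache(maxsize=None)
def cost(n):
    # Expected number of builds to fully process a batch of n PRs.
    return 1 + (1 - P**n) * failcost(n)

def failcost(n):
    # Expected extra builds once a batch of n is known to fail: bisect it.
    if n <= 1:
        return 0
    m = n // 2
    return cost(m) + cost(n - m)

def maxcost(n, m):
    # Same, but a queue of n PRs is fed to bors in batches of at most m.
    if n > 2 * m:
        return cost(m) + maxcost(n - m, m)
    if n > m:
        return cost(n // 2) + cost(n - n // 2)
    return cost(n)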
Mario Carneiro (Feb 14 2024 at 13:09):
Here's a plot of the relative benefit of a batch limit compared to no limit, when n items are in the queue, assuming p = 1 − 1/27 and for m ∈ {16, 27, 40, 56}:
ListPlot[
Transpose[
Table[cost[n] - maxcost[n, m] /. p -> 1 - 1/27., {n, 1,
200}, {m, {16, 27, 40, 56}}]], Joined -> True]
The bottom value is m = 16, which is clearly too low. Playing around with this shows that the benefit becomes strictly positive when m is more than twice the failure rate, i.e. m = 56 (the red line). Also shown are the intermediate values m = 27 (yellow, the same as the failure rate), which gets off to a weak start, and m = 40 (in green), which seems to have the best long term behavior (for large n, i.e. when there is a large backlog).
Yaël Dillies (Feb 14 2024 at 13:10):
From experience, a failure probability of 1/27 sounds too low. Can we gather more bors data?
Ruben Van de Velde (Feb 14 2024 at 13:12):
:up: mathematicians using CI
Mario Carneiro (Feb 14 2024 at 13:13):
This is the failure probability for individual PRs, not batches. The failure rate for batches is about 55% in these simulations (although it depends on the batching strategy of course)
Damiano Testa (Feb 14 2024 at 13:13):
... and then we should formalise the results.
Yaël Dillies (Feb 14 2024 at 13:14):
Mario Carneiro said:
This is the failure probability for individual PRs, not batches.
I meant that 1/27 sounds too low for the failure of an individual PR.
Mario Carneiro (Feb 14 2024 at 13:15):
I don't think bors actually runs batches of size 1 which it thinks will fail by process of elimination, which might skew the data a little
Yaël Dillies (Feb 14 2024 at 13:16):
It does seem to run the second batch of size m/2 even when the first batch of size m/2 has already hit master though.
Mario Carneiro (Feb 14 2024 at 13:17):
yes, that's a necessary part of the bisection
Yaël Dillies (Feb 14 2024 at 13:18):
Why? We know the second batch will fail, so why not split further into two batches of size m/4 ?
Mario Carneiro (Feb 14 2024 at 13:21):
oh I see. That would require more state I guess, it would need to share information between siblings in the bisection tree
Mario Carneiro (Feb 14 2024 at 13:22):
after the initial bisection you would mark the second batch as "mark failing if the first batch succeeds"
Yaël Dillies (Feb 14 2024 at 13:22):
Yes, exactly
Mario Carneiro (Feb 14 2024 at 13:22):
my cost function doesn't take that into account either
Yaël Dillies (Feb 14 2024 at 13:23):
I suspect those "doomed to fail" batches account for quite a lot of CI time overall.
Josha Dekker (Feb 14 2024 at 13:26):
Just a quick thought: are failure probabilities for a PR roughly constant or do we have some property (e.g. total diff etc) that is predictive for failure probabilities? This way, you could make the bisection split "balanced" in the sense that each split has about equal estimated probability of failing rather than equal number of PRs.
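One way such a predictor could be used, sketched under the assumption that some per-PR failure estimate exists (the estimator `estimate_failure` is purely hypothetical; building it is the hard part): greedily fill the two halves so their expected numbers of failures are roughly balanced.
def balanced_split(batch, estimate_failure):
    # Hypothetical: `estimate_failure(pr)` would predict the probability that
    # `pr` breaks the build (e.g. from total diff size); nothing like it
    # exists in the current setup.
    halves = ([], [])
    risk = [0.0, 0.0]  # rough expected number of failing PRs per half
    # Assign the riskiest PRs first, each to the currently safer half.
    for pr in sorted(batch, key=estimate_failure, reverse=True):
        i = 0 if risk[0] <= risk[1] else 1
        halves[i].append(pr)
        risk[i] += estimate_failure(pr)
    return [halves[0], halves[1]]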
Yaël Dillies (Feb 14 2024 at 13:27):
Currently, if you have 2^n PRs, all but one of which are good (and mutually compatible) and one is bad, you will run one batch with 2^n PRs, two batches with 2^(n-1) PRs, ..., two batches with 1 PR, namely 2n + 1 batches. If you do my optimisation, you will run one batch with 2^n PRs, one batch with 2^(n-1) PRs, ..., one batch with 2 PRs, two batches with 1 PR, namely n + 2 batches. That's a 50% improvement!
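A quick sanity check of that count, with the single failing PR placed at the end of the batch (which, as Mario points out below, is the best case for the optimisation); the helper names are mine, and the eventual failing singleton is still charged one run, matching the count above.
def runs_plain(batch):
    # Every (sub-)batch gets built; bisect on failure.
    if all(batch) or len(batch) == 1:
        return 1
    mid = len(batch) // 2
    return 1 + runs_plain(batch[:mid]) + runs_plain(batch[mid:])

def runs_optimised(batch, doomed=False):
    if len(batch) == 1 or (all(batch) and not doomed):
        return 1
    runs = 0 if doomed else 1   # a doomed multi-PR batch is split, not built
    mid = len(batch) // 2
    first, second = batch[:mid], batch[mid:]
    runs += runs_optimised(first)
    # If the first half came back clean, the second half is doomed.
    return runs + runs_optimised(second, doomed=all(first))

for k in range(1, 7):
    batch = [True] * (2**k - 1) + [False]
    # Prints 2^k, 2k + 1, k + 2 respectively.
    print(2**k, runs_plain(batch), runs_optimised(batch))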
Anne Baanen (Feb 14 2024 at 13:29):
Here are raw throughput numbers where we ignore "doomed to failure" batches: afbeelding.png
Here where we drop single PRs that are "doomed to failure":
afbeelding.png
And here where we bisect batches that are "doomed to fail" (singletons are dropped, like above):
afbeelding.png
Anne Baanen (Feb 14 2024 at 13:31):
(Not entirely sure why the third table has lower throughput than the second one. It might be that I implemented the logic incorrectly...)
Yaël Dillies (Feb 14 2024 at 13:33):
Josha Dekker said:
Just a quick thought: are failure probabilities for a PR roughly constant or do we have some property (e.g. total diff etc) that is predictive for failure probabilities? This way, you could make the bisection split "balanced" in the sense that each split has about equal estimated probability of failing rather than equal number of PRs.
Here are my thoughts from the reviewer stream a few days ago:
Yaël Dillies said:
This is probably day-dreaming, but is it possible in any way to use the logs of a failed bors batch (or mergify batch, when that becomes a thing) in order to influence how it's going to be rebatched? Put another way: A human can reasonably quickly read a log file and tell Johan what he should take off the queue; Is it possible to replace both the human and Johan by a script?
It's pretty easy to list which files a PR touches and also pretty easy to calculate the import graph on master, so a somewhat naive criterion for rebatching is something like "Put all the PRs that don't touch anything upstream of the failing file in a batch. Split the other ones into two equal size batches.". A more refined version could be looking at the distance along the import graph (either in terms of file, or using a more complicated metric like the number of declarations used) and make sure the "didn't touch the file" batch does not contain more than 1/3 of the batch's PRs.
Because it's the trend, one could even think of training an AI to recognise what PR is causing the build error and single it out...
Yaël Dillies (Feb 14 2024 at 13:36):
Yaël Dillies said:
Why? We know the second batch will fail, so why not split further into two batches of size m/4 ?
This is not always true. It can happen that a bad PR becomes good because a PR improved performance on master or whatever. But this is rare enough that I think it's fair to reject the bad-turned-good PR until the author merges master and checks everything is alright indeed.
Anne Baanen (Feb 14 2024 at 13:37):
Anne Baanen said:
(Not entirely sure why the third table has lower throughput than the second one. It might be that I implemented the logic incorrectly...)
I added logic to drop predicted-to-fail batches of size 1 outright. Now the raw numbers look like:
afbeelding.png
Yaël Dillies (Feb 14 2024 at 13:38):
Those numbers are very encouraging!
Anne Baanen (Feb 14 2024 at 13:42):
One conclusion we might draw is that if Bors is smart and drops batches that are predicted to fail, we _don't_ want a batch size of a power of two. Multiples of 3 like 12 or 24 seem to be slightly better!
Anne Baanen (Feb 14 2024 at 13:45):
Should we rename and/or merge the two active bors threads by the way? Maybe this can become "Optimizing bors batch size" and the other "Bors failure investigation".
Yaël Dillies (Feb 14 2024 at 13:46):
Actually, could you move my "Fine-tuning bors batching" thread out of #mathlib reviewers and then move the relevant messages from here to it?
Mario Carneiro (Feb 14 2024 at 13:46):
The formula only changes a little bit accounting for the "doomed to fail" optimization:
cost2[n_] := cost2[n] = 1 + (1 - p^n) failcost2[n]
failcost2[n_] :=
If[n <= 1, 0, With[{m = Floor[n/2]}, cost2[m] + cost2[n - m] - p^m]]
Which has an effect of increasing throughput by about 1/25, here it is compared to the batching optimizations:
image.png
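In the hypothetical Python transcription given earlier, the corrected recursion would read (again my own translation, with P = 1 − 1/27 as before):
from functools import lru_cache

P = 1 - 1/27

@lru_cache(maxsize=None)
def cost2(n):
    return 1 + (1 - P**n) * failcost2(n)

def failcost2(n):
    # As failcost(n), minus the run saved when the first half (size m)
    # succeeds with probability P**m and the doomed second half is not built.
    if n <= 1:
        return 0
    m = n // 2
    return cost2(m) + cost2(n - m) - P**m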
Mario Carneiro (Feb 14 2024 at 13:47):
and of course the batching optimizations still apply on top of this
Yaël Dillies (Feb 14 2024 at 13:47):
Anne Baanen said:
Multiples of 3 like 12 or 24 seem to be slightly better!
Does that mean we should try trisecting? Try a batch of size m. It fails. Try a batch of size m/3. It works. Try another batch of size m/3. It works. Trisect the last m/3 batch.
Anne Baanen (Feb 14 2024 at 13:48):
Yaël Dillies said:
Actually, could you move my "Fine-tuning bors batching" thread out of #mathlib reviewers and then move the relevant messages from here to it?
How about the "Bors batch size" thread in the reviewers stream?
Yaël Dillies (Feb 14 2024 at 13:48):
You could merge it too, I guess. It might become a little confusing.
Notification Bot (Feb 14 2024 at 13:49):
This topic was moved here from #mathlib reviewers > Fine-tuning bors batching by Anne Baanen.
Notification Bot (Feb 14 2024 at 13:50):
This topic was moved here from #mathlib reviewers > Bors batch size by Anne Baanen.
Notification Bot (Feb 14 2024 at 13:50):
52 messages were moved here from #mathlib4 > bors failing by Anne Baanen.
Mario Carneiro (Feb 14 2024 at 14:28):
More plots. The question of where to bisect is the same as asking what is the minimum of the following function:
image.png
Here it is for more values of n:
ListPlot[
Table[{m/n,
Evaluate[(cost[m] + cost[n - m] - p^m) - (cost[1] + cost[n - 1] -
p^1) /. p -> 1 - 1/27.]}, {n, {10, 20, 30, 40, 80, 300,
2000}}, {m, 1, n - 1}], Joined -> True, PlotRange -> {-1, 1}]
There are two interesting features of this function compared to the usual. Normally this would be a symmetric function which bows in the middle, hence the minimum is in the middle so that's where you split. However here there is an overall tilt of the function down to the left because of the additional −p^m factor, meaning that we should prefer to split more leftish. Moreover, the case m = 1 gets a little boost because it saves on the cost of a bisection. As a result, for a significant range of values of n the minimum is at m = 1, all the way up to fairly large queues, and then it hangs somewhere to the left of the middle for a while, and then for very large n the bottom starts to turn into a fractal so I'm not so sure.
Mario Carneiro (Feb 14 2024 at 14:32):
And here's a plot of the relative advantages to throughput of applying the splitting strategy m = 1 or the optimal split recursively, over the usual m = n/2:
image.png
Mario Carneiro (Feb 14 2024 at 15:43):
Yaël Dillies said:
Currently, if you have 2^n PRs, all but one of which are good (and mutually compatible) and one is bad, you will run one batch with 2^n PRs, two batches with 2^(n-1) PRs, ..., two batches with 1 PR, namely 2n + 1 batches. If you do my optimisation, you will run one batch with 2^n PRs, one batch with 2^(n-1) PRs, ..., one batch with 2 PRs, two batches with 1 PR, namely n + 2 batches. That's a 50% improvement!
This analysis isn't quite correct. It's true if the failing PR is the last one in the list, but if it is the first one then you get no improvement. In general it's somewhere in between
Mario Carneiro (Feb 14 2024 at 15:44):
because the optimization only applies when the first half is good and the second half is bad
Mario Carneiro (Feb 14 2024 at 15:58):
Just pulled some bors data regarding 1-commit batches: there are 316 successes and 45 failures in the database right now, i.e. one in 7
Sébastien Gouëzel (Feb 14 2024 at 16:02):
Isn't this biased by the fact that correct commits are more likely to be merged in bigger batches, while buggy commits will end up alone and will therefore be part of a 1-commit batch?
Eric Rodriguez (Feb 14 2024 at 16:07):
Can we not run the simulations with historical data? Maybe just try grab the past week and see what would happen
Jireh Loreaux (Feb 14 2024 at 16:07):
I'm wondering if we should take more advantage of priority, especially in common cases where we expect things to break when part of a batch. I'm betting that a small amount of thoughtful human interaction can drastically decrease the failure rate.
I'm thinking of, for instance, PRs renaming very common lemmas (even with deprecation, bors will fail because Mathlib doesn't allow using deprecated lemmas), PRs deleting / renaming a file, etc.
Jireh Loreaux (Feb 14 2024 at 16:08):
I think one source of failures from yesterday was due to a file rename.
Ruben Van de Velde (Feb 14 2024 at 16:54):
One of the issues I think I saw multiple times was PRs not being updated since the FunLike refactor. I wonder if we could have a warning or something for PRs that are very outdated
Johan Commelin (Feb 14 2024 at 16:59):
@Mario Carneiro at some point automatically commented on all open PRs pinging them to update...
Jireh Loreaux (Feb 14 2024 at 17:02):
I think that's because he updated all the toolchains for them in order to avoid data loss from v4.x.x-rc3, whichever one it was.
Jireh Loreaux (Feb 14 2024 at 17:02):
Oh but I guess your point is more that such a pinging of all PRs is possible.
Ruben Van de Velde (Feb 14 2024 at 17:30):
Yeah, that could have helped with this case, though my point was more that old branches are more prone to failing in general
Yaël Dillies (Feb 14 2024 at 17:35):
Do we have data about probability of a PR failing plotted according to its age (as measured by the latest master merge)?
Eric Rodriguez (Feb 14 2024 at 18:29):
I really think we just need to find some number that seems sensible and stick with it. Fiddling with very specific numbers is not going to fix all issues. One thing that mergify does which could be helpful is that it can do batches speculatively, i.e. in parallel. I think that's much more likely to give us a boost than trying to dig deeper and deeper into this
Mario Carneiro (Feb 14 2024 at 20:08):
Jireh Loreaux said:
I'm wondering if we should take more advantage of priority, especially in common cases where we expect things to break when part of a batch.
There is the bors single on command, which says "don't put this PR in a batch, compile it on its own". I recommend using this for PRs that have lots of merge conflicts.
Jireh Loreaux (Feb 14 2024 at 21:07):
Can a bot find the most recent common ancestor of a PR branch and master, and then use the age of that commit to add a label?
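The merge-base part is straightforward with plain git; a sketch follows, where the branch name, the 30-day threshold, and the label name are all made up, and actually applying the label (e.g. via `gh pr edit --add-label`) is only hinted at in a comment.
import subprocess, time

def merge_base_age_days(branch, base="origin/master"):
    # Most recent common ancestor of the PR branch and master ...
    sha = subprocess.run(["git", "merge-base", base, branch],
                         capture_output=True, text=True, check=True).stdout.strip()
    # ... and the age of that commit, in days, from its committer timestamp.
    ts = subprocess.run(["git", "show", "-s", "--format=%ct", sha],
                        capture_output=True, text=True, check=True).stdout.strip()
    return (time.time() - int(ts)) / 86400

if __name__ == "__main__":
    age = merge_base_age_days("my-feature-branch")  # hypothetical branch name
    if age > 30:
        # A bot could now add a label, e.g. with `gh pr edit <num> --add-label <label>`.
        print(f"merge base is {age:.0f} days old; suggest labelling the PR as stale")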