Zulip Chat Archive

Stream: general

Topic: Travis


Sebastien Gouezel (Apr 07 2019 at 19:35):

Travis is complaining about my PRs #883 and #897, but I am not sure this is my fault: what's failing are the shell tests "check install scripts" and "check dev scripts". Is there anything I can do about it?

Chris Hughes (Apr 07 2019 at 19:59):

I conjecture it's something to do with pushing from your own fork instead of a branch of leanprover-community. @Simon Hudon is working on fixing it. I guess #883 can probably safely be merged, and it will build once it's on master.

Simon Hudon (Apr 07 2019 at 20:42):

@Sebastien Gouezel can you push your branches to leanprover-community/mathlib? It makes it easier to test the installation scripts that way because GitHub rejects requests when you make more than a few without authentication.

Floris van Doorn (Sep 10 2019 at 17:49):

Travis is timing out on completely innocent pull requests, like #1423.
Does anyone know which commit caused this regression?

I saw that other GitHub repos automatically got a reply with compilation performance. Maybe we can make a bot that automatically replies to a PR if the compilation time increases by more than 5% or something?
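The core of a bot check like that is easy to sketch (the threshold, function name, and numbers below are illustrative, not an existing tool):

```python
# Sketch of the proposed bot's core check: flag a PR whose build time
# exceeds the baseline by more than a chosen threshold (5% here).
def build_time_regressed(baseline_secs, new_secs, threshold=0.05):
    return new_secs > baseline_secs * (1 + threshold)

# ~8% slower than a 100-minute baseline: would trigger a reply
print(build_time_regressed(6000, 6480))  # → True
```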

Floris van Doorn (Sep 10 2019 at 17:55):

I assumed in my previous message that there was a merged commit that caused performance issues, but they all seem fine. Maybe this was a cache-related issue. (I merged the PR into master; maybe that fixes it.)

Rob Lewis (Sep 10 2019 at 18:11):

Regardless of what caused it here, regression testing for compilation time would be a good idea.

Rob Lewis (Sep 10 2019 at 18:11):

I don't know if Travis is reliable enough to do this.

Rob Lewis (Sep 10 2019 at 18:11):

But maybe we could run something once a day on Scott's server?

Rob Lewis (Sep 10 2019 at 18:12):

I guess that would more reliably throw the same processing power at it.

Reid Barton (Sep 10 2019 at 18:12):

Last three pushes to master built in around 107 (most recent), 100, 110 minutes. I didn't look at the changes, but I assume they were not responsible for this variance.

Reid Barton (Sep 10 2019 at 18:14):

Interestingly, this build of PR #1428 timed out, but leanpkg build actually finished; it just took so much longer than usual that there wasn't enough time left to run leanchecker.

Reid Barton (Sep 10 2019 at 18:17):

I'm sure a dedicated machine would be more consistent, I'm not sure how much more consistent.
Doesn't Scott already build everything anyways? Or did I imagine that?

Rob Lewis (Sep 10 2019 at 18:19):

I think the mathlib nightlies get pushed to Scott from Travis. Would he be building for another reason?

Rob Lewis (Sep 10 2019 at 18:20):

I could be wrong about that though.

Scott Morrison (Sep 10 2019 at 21:35):

No, at present I'm still building nothing. That project fell through, for lack of anyone who had the time/expertise combination to script things. I think the idea had been to push from Travis to my account on S3, and then for my computer to poll that S3 account, but the pushing-to-S3 step never happened.

Scott Morrison (Sep 10 2019 at 21:36):

I would love to have the infrastructure in place so that every commit received an olean cache. I can provide the CPU time, I think.

Rob Lewis (Sep 10 2019 at 21:38):

I think that's @Simon Hudon 's territory. Checking build times once a day would be a lot simpler. Do you think your server has a consistent enough load to do this with enough precision?

Scott Morrison (Sep 10 2019 at 21:45):

Probably. It has 18 physical cores. I routinely leave some single-threaded calculations running, but it's never more than 6. It's been a while since I've had anything multithreaded that I want to leave running. So if I limit the build to, say, 6 or 8 cores, and run it in the middle of my night, it shouldn't have any competition.

Scott Morrison (Sep 10 2019 at 21:46):

The idea is that I would just pull a fresh copy of master, build, and have a script post the time here?

Rob Lewis (Sep 10 2019 at 21:48):

Yeah, something like that. We just want to see an alert if there's an unexpected jump in compile time. Posting here would be enough.

Rob Lewis (Sep 10 2019 at 21:50):

Ideally, if it noticed a spike it would try to find the responsible commit.
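Finding the responsible commit is essentially a bisection over the commit range, with "check out, build, and time this commit" as the expensive test. A sketch, with `is_slow` standing in for that test:

```python
# Sketch: once a spike is detected, bisect the commit range between the
# last known-fast build and the first known-slow one. `is_slow` stands in
# for "check out, build, and time this commit" (the expensive step).
def find_culprit(commits, is_slow):
    lo, hi = 0, len(commits) - 1  # commits[lo] assumed fast, commits[hi] slow
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_slow(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]  # the first commit that builds slowly

# With 16 commits this needs ~4 builds instead of 16.
print(find_culprit(list(range(16)), lambda c: c >= 11))  # → 11
```

This is what `git bisect` does; the cycles go into the builds at each probe, not the search itself.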

Rob Lewis (Sep 10 2019 at 21:50):

That could use a bunch more cycles.

Scott Morrison (Sep 10 2019 at 21:52):

I think our limiting factor is scripting and testing, not available cycles.

Scott Morrison (Sep 10 2019 at 21:54):

I am just capable enough at writing bash scripts to do a bad job of this, so help would definitely be appreciated. It might require jumping through some hoops, but I could probably work out how to give someone ssh access.

Simon Hudon (Sep 11 2019 at 05:36):

Feel free to put that code up on mathlib-tools and I'll see if I can contribute something

Scott Morrison (Sep 11 2019 at 06:05):

@Simon Hudon, I'm helpless at this, it seems. I downloaded Zulip's sample Python file for posting a message from https://leanprover.zulipchat.com/api/send-message, as scripts/zulip.py. I ran pip install zulip, which seemed to work. However, running python zulip.py just says:

idaeus:scripts scott$ python zulip.py
Traceback (most recent call last):
  File "zulip.py", line 1, in <module>
    import zulip
  File "/Users/scott/projects/lean/mathlib-tools/scripts/zulip.py", line 4, in <module>
    client = zulip.Client(config_file="~/.zuliprc")
AttributeError: 'module' object has no attribute 'Client'

which suggests to me that python is not correctly finding the zulip library.
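(For what it's worth, the traceback suggests the opposite: `import zulip` is resolving to scripts/zulip.py itself, i.e. the script's own filename shadows the installed library, so the imported module has no `Client` attribute. The effect can be reproduced in isolation; `mylib` below is a made-up name:)

```python
import os
import sys
import tempfile

# Reproduce the shadowing: a local file named after a library takes
# precedence over any installed package, because its directory is put
# first on sys.path (as the script's own directory would be).
d = tempfile.mkdtemp()
with open(os.path.join(d, "mylib.py"), "w") as f:
    f.write("x = 1\n")  # stand-in for scripts/zulip.py
sys.path.insert(0, d)
import mylib
print(hasattr(mylib, "Client"))  # → False: the local file wins
```

Renaming the script (to anything other than zulip.py) would presumably let `import zulip` find the real package.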

Scott Morrison (Sep 11 2019 at 06:05):

(I've committed the sample script to mathlib-tools, so you can see what I'm trying to do.)

Scott Morrison (Sep 11 2019 at 06:05):

I tried some variations, e.g. pip3 install zulip, and it reports success too. :-)

Scott Morrison (Sep 11 2019 at 06:12):

(If my computer can't even work out how to run a little shell script, how on earth is it going to get a gold medal at the IMO...?!)

Johan Commelin (Sep 11 2019 at 06:15):

I'm sure there are a lot of IMO gold medalists that don't know what a bash script is...

Scott Morrison (Sep 11 2019 at 07:53):

Cancel this request for help, I found zulip-send, which just works from the command line.

Reid Barton (Sep 11 2019 at 12:36):

I can also put together some shell scripts, especially if you can figure out a way to grant me access to your machine

Bryan Gin-ge Chen (Sep 11 2019 at 16:56):

I was thinking it might be nice to run lean --profile with these regular builds of mathlib to get more detailed diagnostics. However, I tried running lean --make --profile src/ > profile.log in my mathlib directory and it keeps segfaulting after a while. Has anyone tried anything like this?

Reid Barton (Sep 11 2019 at 16:57):

I also meant to ask whether there are statistics (lean --profile or anything else like it) which would be good to record and don't affect compile time too much

Bryan Gin-ge Chen (Sep 12 2019 at 01:06):

I managed to profile mathlib with lean -j1 --make --profile src/ > profile.log. It took maybe 2 hours on my machine, so it's probably not something we could run more than once a day:

cumulative profiling times:
    compilation 12.8s
    decl post-processing 2.56e+03s
    elaboration 3.87e+03s
    elaboration: tactic compilation 70.9s
    elaboration: tactic execution 2.67e+03s
    parsing 122s
    type checking 17.2s

I've attached the (compressed) profile.log if anyone else wants to have a go at extracting useful statistics from it: profile_140a606.log.zip (3.1MB)

Reid Barton (Sep 12 2019 at 01:11):

Couldn't we run it 12*N times per day, if we have N processors?

Reid Barton (Sep 12 2019 at 01:11):

(Assuming we believe in isolation between processors which is definitely false, but hey)

Bryan Gin-ge Chen (Sep 12 2019 at 01:12):

It also used a fairly large amount of memory (~2 GB by the end).

Scott Morrison (Sep 12 2019 at 03:46):

I have 128 GB on this machine, so memory shouldn't be a problem.

Scott Morrison (Sep 12 2019 at 03:46):

Unfortunately my computer has rebooted itself several times in the last few days, and I can't work out why.

Scott Morrison (Sep 12 2019 at 03:47):

A student has been running calculations in GAP that use lots of RAM, but simultaneously I started experimenting with building Lean regularly via cron, so I don't really know which is at fault.

Scott Morrison (Sep 12 2019 at 03:49):

I've also been having trouble getting my build script to run multithreaded.... Usually Lean uses lots of cores, but somehow when it runs in the cron job, it uses 2 cores for a few minutes, then drops to 1, and I'm completely bamboozled as to why.

Scott Morrison (Sep 12 2019 at 03:52):

@Reid Barton, in terms of giving you (or @Simon Hudon) access to this computer, I think the options are:

1. We fill out the "campus visitor" form, which would give you an ANU id number. I then get our IT people to add that id number to our VPN configuration. To ssh in, you'd then need to connect to the relevant VPN, then ssh.
2. I make a persistent outbound ssh tunnel to some other server (e.g. the one hosting my website), that forwards traffic back to port 22 here. Dodgy as anything, but that's what they invented ssh tunnelling for, I guess.
It's a bit of a hassle either way.

Simon Hudon (Sep 12 2019 at 03:53):

Option 1 looks simpler to me. It also looks less likely that you'll get reprimanded for it.

Scott Morrison (Sep 12 2019 at 04:00):

Okay, I have just initiated our process to give both you and Reid university IDs here. You'll receive an email with a link to a form, but feel free to fill out the barest minimum of information it will let you submit with. :-)

Simon Hudon (Sep 12 2019 at 04:01):

Cool :)

Simon Hudon (Sep 12 2019 at 04:01):

Sorry I'm not too verbose today. I'm exhausted

Reid Barton (Sep 12 2019 at 12:46):

Is "suburb" what you call a city?

Scott Morrison (Sep 12 2019 at 12:54):

I guess. :-)

Scott Morrison (Sep 12 2019 at 12:55):

By Australian standards, clearly Berkeley is part of San Francisco, and thus we have very few "cities", so apparently "suburb" is the next division down.

Wojciech Nawrocki (Sep 12 2019 at 20:34):

Couldn't we run it 12*N times per day, if we have N processors?
(Assuming we believe in isolation between processors which is definitely false, but hey)

I tried this once, and the conclusion was that, sadly, if you want reliable benchmark results, you need to run your program with pretty much exclusive access to the machine (and on real hardware, as opposed to a cloud vCPU). Running one benchmark per physical CPU core skews the timings to the point of uselessness, probably due to contention on memory, on I/O, and on whatever else is shared between CPUs. Anyway, in the case of mathlib, because the build times are so long, maybe random noise is not that much of an issue - unless it's possible for a single run of lean --make to somehow get into a failure mode which makes the entire build proceed more slowly than usual, but that seems unlikely.

Scott Morrison (Sep 12 2019 at 22:06):

Yeah, I'm pretty disappointed by the results from overnight.

113m51.476s
113m18.228s
114m49.561s
110m48.239s
105m30.344s
101m58.073s

I would guess that's too much random variability to be useful.
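(For reference, parsing the `<minutes>m<seconds>s` format above, those six runs have a standard deviation of roughly 4-5 minutes on a ~110-minute mean, i.e. a spread of around 4%:)

```python
import re
import statistics

# Summarize the six run times quoted above. Parses "113m51.476s"-style
# strings into fractional minutes.
def to_minutes(s):
    m = re.fullmatch(r"(\d+)m([\d.]+)s", s)
    return int(m.group(1)) + float(m.group(2)) / 60

runs = ["113m51.476s", "113m18.228s", "114m49.561s",
        "110m48.239s", "105m30.344s", "101m58.073s"]
minutes = [to_minutes(r) for r in runs]
print(round(statistics.mean(minutes), 1), round(statistics.pstdev(minutes), 1))
```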

Reid Barton (Sep 12 2019 at 22:10):

What conditions is that under?

Reid Barton (Sep 12 2019 at 22:11):

Travis or your own machine?

Reid Barton (Sep 12 2019 at 22:11):

1 core?

Scott Morrison (Sep 12 2019 at 22:17):

This is on my own machine, running with -j1.

Scott Morrison (Sep 12 2019 at 22:21):

I realise now that I left it running every hour, so builds were overlapping. I've slowed down cron and will see if that improves things.

Bryan Gin-ge Chen (Sep 12 2019 at 22:25):

If no one else is already working on it I'll take a stab at writing a Python script to parse the profile output into JSON (or some other format?) later tonight.

Scott Morrison (Sep 15 2019 at 00:07):

Now that the builds are not overlapping, the timing is quite uniform ---

98m31.200s
98m52.938s
99m3.707s
99m7.776s
98m57.582s
98m42.635s
98m45.471s

Scott Morrison (Sep 15 2019 at 00:10):

However this has been running over the weekend, without anything else running on this machine. We'll have to see if it stays uniform tomorrow once I'm in the office.

Scott Morrison (Sep 15 2019 at 00:12):

@Bryan Gin-ge Chen if you feel like parsing the profile output into something useful that might be fun. :-)

Scott Morrison (Sep 15 2019 at 00:12):

My script keeps uploading the logs.

Kenny Lau (Sep 15 2019 at 00:30):

top 100 items:

62.4 decl post-processing of lattice.semilattice_sup_top took 62.4s
61.2 decl post-processing of subtype.comm_group took 61.2s
59.3 decl post-processing of is_subgroup.of_div took 59.3s
46.6 decl post-processing of to_monoid_hom took 46.6s
34.8 decl post-processing of image_subgroup took 34.8s
27.8 decl post-processing of list_prod_apply took 27.8s
25.7 decl post-processing of subset_closure took 25.7s
25.4 decl post-processing of closure_subset took 25.4s
24.6 elaboration of monoidal_of_has_finite_coproducts took 24.6s
23.3 decl post-processing of group took 23.3s
23.2 decl post-processing of subtype.group took 23.2s
22.4 decl post-processing of open_subgroup.has_coe took 22.4s
22.1 decl post-processing of continuous_multiset_prod took 22.1s
21.9 decl post-processing of exists_list_of_mem_closure took 21.9s
21.8 elaboration of monoidal_of_has_finite_products took 21.8s
18.9 decl post-processing of image_pointwise_mul took 18.9s
18.8 decl post-processing of image_closure took 18.8s
18.6 decl post-processing of pointwise_mul_monoid took 18.6s
18.1 decl post-processing of inv_ker_one took 18.1s
16.8 decl post-processing of subtype.monoid took 16.8s
16.7 elaboration of functoriality_is_right_adjoint took 16.7s
16.6 decl post-processing of ker_lift_mk' took 16.6s
16.3 elaboration of functoriality_is_left_adjoint took 16.3s
16.1 decl post-processing of ker_lift took 16.1s
15.9 decl post-processing of map took 15.9s
15.5 decl post-processing of is_subgroup.coe_inv took 15.5s
15.5 decl post-processing of preimage_pointwise_mul_preimage_subset took 15.5s
15.4 decl post-processing of closure_subgroup took 15.4s
15.2 elaboration of prod_filter_range_p_mul_q_div_two_eq_prod_product took 15.2s
15.1 decl post-processing of mem_pointwise_one took 15.1s
14.8 elaboration of quadratic_reciprocity took 14.8s
14.6 decl post-processing of ext took 14.6s
14.4 decl post-processing of multiset_prod_apply took 14.4s
14.3 elaboration of simple_func_sequence_tendsto took 14.3s
14.2 decl post-processing of normal_in_normalizer took 14.2s
14.1 elaboration of has_sum_sigma took 14.1s
13.9 decl post-processing of continuous_on.mul took 13.9s
13.9 decl post-processing of prod took 13.9s
13.7 decl post-processing of tendsto_multiset_prod took 13.7s
13.7 elaboration of exists_preimage_norm_le took 13.7s
13.7 decl post-processing of exists_list_of_mem_closure took 13.7s
13.6 decl post-processing of lift_mk took 13.6s
13.6 decl post-processing of closure.is_submonoid took 13.6s
13.6 decl post-processing of pointwise_mul_eq_Union_mul_left took 13.6s
13.6 decl post-processing of refl took 13.6s
13.6 decl post-processing of inj_of_trivial_ker took 13.6s
13.4 elaboration of CommRing_yoneda took 13.4s
13.4 decl post-processing of normal_iff_eq_cosets took 13.4s
13.3 decl post-processing of subtype_val.is_group_hom took 13.3s
13.3 decl post-processing of finset_prod_apply took 13.3s
13.2 decl post-processing of closure_singleton took 13.2s
12.9 elaboration of cocones_iso took 12.9s
12.8 decl post-processing of pointwise_mul_semigroup took 12.8s
12.7 decl post-processing of closure took 12.7s
12.2 decl post-processing of coe.is_group_hom took 12.2s
11.9 decl post-processing of coe_inf took 11.9s
11.8 decl post-processing of mem_closure took 11.8s
11.8 decl post-processing of le_iff took 11.8s
11.7 decl post-processing of pointwise_mul_eq_Union_mul_right took 11.7s
11.5 decl post-processing of mem_closure_union_iff took 11.5s
11.4 elaboration: tactic execution took 11.4s
11.3 decl post-processing of empty_pointwise_mul took 11.3s
11.3 decl post-processing of symm took 11.3s
11.3 elaboration: tactic execution took 11.3s
11.2 decl post-processing of pointwise_mul_empty took 11.2s
11.2 decl post-processing of inhabited took 11.2s
11.1 decl post-processing of eq_trivial_iff took 11.1s
11.1 decl post-processing of continuous_subgroup took 11.1s
11.1 elaboration of map_integral took 11.1s
11.1 decl post-processing of mul_left took 11.1s
11.1 decl post-processing of pointwise_mul took 11.1s
11.1 decl post-processing of ext' took 11.1s
10.9 decl post-processing of mem_norm_comm took 10.9s
10.8 decl post-processing of comm_monoid took 10.8s
10.8 decl post-processing of map_one took 10.8s
10.5 elaboration of cones_iso took 10.5s
10.2 elaboration of exists_subgroup_card_pow_prime took 10.2s
10.2 elaboration of lintegral_eq_nnreal took 10.2s
10.1 decl post-processing of inv_ker_one' took 10.1s
10.1 decl post-processing of singleton.is_mul_hom took 10.1s
10.1 decl post-processing of closure_mono took 10.1s
9.91 decl post-processing of injective_ker_lift took 9.91s
9.86 decl post-processing of continuous_monoid took 9.86s
9.74 decl post-processing of prod_map_range_index took 9.74s
9.71 decl post-processing of tendsto_list_prod took 9.71s
9.58 decl post-processing of one_ker_inv' took 9.58s
9.56 elaboration of unique_diff_within_at.eq took 9.56s
9.5 decl post-processing of open_subgroup took 9.5s
9.49 decl post-processing of inv_iff_ker' took 9.49s
9.47 decl post-processing of inj_iff_trivial_ker took 9.47s
9.44 decl post-processing of prod_zero_index took 9.44s
9.38 decl post-processing of image_closure took 9.38s
9.38 elaboration: tactic execution took 9.38s
9.37 decl post-processing of subtype_mk.is_monoid_hom took 9.37s
9.2 elaboration: tactic execution took 9.2s
9.15 decl post-processing of lift took 9.15s
9.07 elaboration of glue_dist_triangle took 9.07s
9.06 decl post-processing of continuous_group took 9.06s
9.06 elaboration: tactic execution took 9.06s
9.03 decl post-processing of closure_subset_iff took 9.03s

Kenny Lau (Sep 15 2019 at 00:31):

but there is something a bit fishy going on; I suspect the times there do not actually reflect the real elapsed time

Kenny Lau (Sep 15 2019 at 00:31):

On Lines 267140-267142 we have:

elaboration of subtype.comm_group took 670ms
type checking of subtype.comm_group took 0.786ms
compilation of subtype.comm_group took 2.67ms

and then suddenly on Line 271385:

decl post-processing of subtype.comm_group took 61.7s

Scott Morrison (Sep 15 2019 at 02:01):

How long did everything in the lines in between take? Is the decl post-processing for some reason being deferred, and the report time is wall time, including all the work on other things that happened in between?

Wojciech Nawrocki (Sep 15 2019 at 12:27):

@Kenny Lau I looked into this a few weeks ago, and I'm almost certain that "type checking .. took" is incorrect - see #58. I'm not sure about the other timings (elaboration, compilation) though.

Bryan Gin-ge Chen (Sep 15 2019 at 23:16):

Here's a python script that does some minimal parsing of these log files into JSON format: https://gist.github.com/bryangingechen/005e51814e698f889a76b59da0e58852

The JSON files consist of an array of objects, which mostly correspond to individual lines of the log files (except for elaboration / tactics, which I tried to group together). As a simple example, the following lines:

parsing took 0.117ms
elaboration of integral took 63.5ms
type checking of integral took 0.289ms
compilation of measure_theory.integral took 0.00211ms
decl post-processing of integral took 0.932ms

become:

[{
  "type": "pars",
  "t": 0.117
},
{
  "type": "elab",
  "lines": [],
  "t": 63.5,
  "decl": "integral"
},
{
  "type": "tc",
  "t": 0.289,
  "decl": "integral"
},
{
  "type": "comp",
  "t": 0.00211,
  "decl": "measure_theory.integral"
},
{
  "type": "decl",
  "t": 0.932,
  "decl": "integral"
}]

I didn't try too hard to group things together further – if there are other things I should do please let me know.

The JSON files turn out to be ~35MB compared to the ~25MB of the logs. Here's one corresponding to this log in the Travis stream. profile_81a31ca.json.zip (3.1MB)
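A minimal version of such a line parser might look as follows (a sketch, not the gist's actual code; it maps line kinds to the short type tags shown above and omits the grouping of nested elaboration/tactic lines):

```python
import json
import re

# Sketch of parsing `lean --profile` lines like
#   "elaboration of integral took 63.5ms"
# into the JSON shape shown above (all times normalized to ms).
KINDS = {
    "parsing": "pars",
    "elaboration": "elab",
    "type checking": "tc",
    "compilation": "comp",
    "decl post-processing": "decl",
}
LINE = re.compile(r"^(.*?)(?: of (\S+))? took ([\d.e+]+)(ms|s)$")

def parse_line(line):
    m = LINE.match(line.strip())
    if not m:
        return None
    kind, decl, t, unit = m.groups()
    if kind not in KINDS:
        return None
    entry = {"type": KINDS[kind], "t": float(t) * (1000 if unit == "s" else 1)}
    if decl:
        entry["decl"] = decl
    return entry

print(json.dumps(parse_line("elaboration of integral took 63.5ms")))
# → {"type": "elab", "t": 63.5, "decl": "integral"}
```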

Rob Lewis (Oct 24 2019 at 15:57):

Based on https://travis-ci.org/leanprover-community/mathlib/jobs/602292786?utm_medium=notification&utm_source=github_status, I'm guessing the Travis config in #1606 isn't right. I wanted to reuse the olean files from the build to run a final test. That test takes ~45 seconds on my laptop and has been going for 20 min on Travis, so I'm guessing it's recompiling everything. I tried to copy the setup for archive and test, which do use the oleans -- any ideas?

Bryan Gin-ge Chen (Oct 24 2019 at 16:09):

My guess is that this is because the master branch was updated after the travis build started. My understanding is that each build stage does a merge with master when it starts, so this can lead to later stages having to do rebuilds.

Rob Lewis (Oct 24 2019 at 16:11):

What? Every build stage does a merge with master? That can't be right. It's testing the PR, not the PR's compatibility with master.

Bryan Gin-ge Chen (Oct 24 2019 at 16:12):

Yes, this surprised me at first too. But see https://travis-ci.org/leanprover-community/mathlib/jobs/602292786#L164 and https://docs.travis-ci.com/user/pull-requests/#my-pull-request-isnt-being-built

Rob Lewis (Oct 24 2019 at 16:12):

If that's what it's doing, I'm very confused about our CI, heh.

Rob Lewis (Oct 24 2019 at 16:22):

But the archive task ran the same command, and finished in a minute. I don't think https://github.com/leanprover-community/mathlib/commit/3f8a492bfa4ba2ad0ff26c3bff298e3afb05f861 landed in between the two.

Bryan Gin-ge Chen (Oct 24 2019 at 16:35):

You're right that the commit happened earlier, so I'm not sure why the docs and archive stages didn't require a rebuild.

Rob Lewis (Oct 24 2019 at 16:51):

@Simon Hudon Do you know what's up here?

Rob Lewis (Oct 24 2019 at 16:53):

I do see wildly different build times for the archive in the PR history, which is consistent with oleans from the first step being invalidated by a merge.

Rob Lewis (Oct 24 2019 at 16:54):

If this is what's happening, it sounds both inefficient and unreliable.

Reid Barton (Oct 24 2019 at 16:54):

My understanding is that each build stage does a merge with master when it starts, so this can lead to later stages having to do rebuilds.

Where's my upside-down light bulb emoji

Rob Lewis (Oct 24 2019 at 16:56):

Ah. Perhaps one relevant line is this? https://github.com/leanprover-community/mathlib/blob/master/.travis.yml#L14
Edit - never mind, everything in the last step is happening in src.

Simon Hudon (Oct 24 2019 at 16:59):

Sometimes, it's enough to just restart the travis build

Simon Hudon (Oct 24 2019 at 16:59):

we might also want to try building test and archive after touching all the olean files in src

Simon Hudon (Oct 24 2019 at 17:00):

Because each stage is a whole new build: you restore the cache and clone the git repo, and there the timestamps can be unreliable.

Rob Lewis (Oct 24 2019 at 17:04):

Okay, I restarted the build. #1565 should hit master while this is going, so let's see if the same issue comes up.

Rob Lewis (Oct 24 2019 at 17:05):

Should we just add the command to touch all oleans in src? It can't hurt.
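(In .travis.yml that would presumably be a one-liner, something like `find src -name '*.olean' -exec touch {} +`. The intended effect, sketched in Python for illustration:)

```python
import os
import pathlib

# Sketch: bump every .olean's mtime to "now" so a subsequent
# `lean --make` sees them as newer than the freshly cloned .lean
# sources and doesn't rebuild. Returns how many files were touched.
def touch_oleans(root):
    count = 0
    for p in pathlib.Path(root).rglob("*.olean"):
        os.utime(p)  # sets atime and mtime to the current time
        count += 1
    return count
```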

Rob Lewis (Oct 24 2019 at 17:05):

It also wouldn't hurt to remove --make when we build the archive and tests.

Rob Lewis (Oct 24 2019 at 17:08):

If each stage does a fresh merge with master, how do the first stages even work? What if a commit lands during pre-build 2? Doesn't that mean all the work has to be redone in the test stage?

Bryan Gin-ge Chen (Oct 24 2019 at 17:24):

Doesn't that mean all the work has to be redone in the test stage?

I think so; doesn't that explain the spread in run times for the test stages?

Rob Lewis (Oct 24 2019 at 17:28):

I thought the reason we had the pre-build stages was because we were hitting time limits when we built the whole thing in one stage. If we were doing that in the test stage, shouldn't we hit the same limits?

Bryan Gin-ge Chen (Oct 24 2019 at 17:35):

I think we do sometimes, but people often restart the jobs when they fail.

Rob Lewis (Oct 24 2019 at 17:50):

It feels like that should be way more common. I guess commits near the leaves of the import hierarchy would have less of an effect; maybe that limits it.

Rob Lewis (Oct 24 2019 at 17:51):

My mind is blown, and not in a good way, heh.

Rob Lewis (Oct 24 2019 at 17:52):

We could move the linting test to the test stage. That would create less potential duplication of work.

Rob Lewis (Oct 24 2019 at 17:54):

But then maybe we should move archive and docs too? Neither is slow, unless the oleans are outdated.

Simon Hudon (Oct 24 2019 at 18:03):

What do you mean by move docs and archive?

Rob Lewis (Oct 24 2019 at 18:07):

The docs and archive stages of the Travis build. If we build those directories as part of the test stage, instead of as separate stages, there's no chance that updates to master will invalidate the oleans.

Simon Hudon (Oct 24 2019 at 18:07):

Updates to master have no effect on the build. That's not where the issues are

Simon Hudon (Oct 24 2019 at 18:08):

The state of the cache and the timestamps on the files are where you should look for the issues.

Reid Barton (Oct 24 2019 at 18:10):

This does sound more plausible, though I'm having trouble constructing a mechanism under which recompilation sometimes happens, but usually not.
Could it have to do with the cache being shared between concurrent builds?

Bryan Gin-ge Chen (Oct 24 2019 at 18:11):

Updates to master have no effect on the build.

Are you sure about this? I thought this line would mean that a merge happens at the start of each stage, but maybe my git / GitHub understanding is wrong.

Simon Hudon (Oct 24 2019 at 18:12):

Which line do you mean?

Reid Barton (Oct 24 2019 at 18:13):

$ git fetch origin +refs/pull/1606/merge:

Reid Barton (Oct 24 2019 at 18:13):

line 164, it should be highlighted

Bryan Gin-ge Chen (Oct 24 2019 at 18:14):

In particular, I thought the /merge suffix retrieves a merge commit of the PR branch into master.

Bryan Gin-ge Chen (Oct 24 2019 at 18:19):

I can't find any official documentation about the /merge ref, but I did find this forum post in which someone quotes Github support as saying:

The /merge refs that are being used here are an undocumented feature and you shouldn’t be relying on them. That feature was built to power Pull Requests in the Web UI; it was not built to be used by 3rd party services. Because it’s undocumented, you should not have any expectations about behavior: the behavior might change at any time and those refs might completely go away without warning.

Rob Lewis (Oct 24 2019 at 18:20):

After some testing on my own fork... maybe the +refs/pull/*/merge branches are fixed to merge with whatever master was at the time of the last commit on the PR branch?

Reid Barton (Oct 24 2019 at 18:20):

It makes a merge with something, but it doesn't seem to be current master, yeah.

Reid Barton (Oct 24 2019 at 18:22):

I tried with #1540 and it gave me a merge with the commit containing #1584. I don't see how these are related yet.

Bryan Gin-ge Chen (Oct 24 2019 at 18:22):

If I'm reading the forum post correctly, it seems that the /merge ref gets updated only when needed by the web UI? Disturbing.

Instead, the mergeability is triggered when it’s needed: when a user visits the page for the pull request via the UI or requests the pull request via the API. In order to determine the mergeability of a pull request, GitHub actually performs a merge between the head and the base ref, and the result of that merge is stored in the /merge ref (the ref you’re using).

Reid Barton (Oct 24 2019 at 18:23):

git tells me the latter commit is also sgouezel/master but I'm assuming that's not relevant :)

Reid Barton (Oct 24 2019 at 18:23):

Oh hey, it gave me a merge with current master this time.

Bryan Gin-ge Chen (Oct 24 2019 at 18:24):

Possibly because you or I clicked on the PR page for #1540 and that triggered an update?

Reid Barton (Oct 24 2019 at 18:24):

Presumably

Reid Barton (Oct 24 2019 at 18:25):

And it seems plausible that the last time anyone looked at that page was 2 days ago

Reid Barton (Oct 24 2019 at 18:25):

So, that's pretty terrible

Rob Lewis (Oct 24 2019 at 18:25):

So basically, Travis is choosing at random what to build every time.

Reid Barton (Oct 24 2019 at 18:25):

Is this /merge stuff built into Travis or something we configured?

Rob Lewis (Oct 24 2019 at 18:25):

Including on the final builds for Mergify.

Reid Barton (Oct 24 2019 at 18:26):

mergify doesn't build the same commit it will push? I thought that was part of the point

Bryan Gin-ge Chen (Oct 24 2019 at 18:27):

I think it's built into Travis. Maybe we will have to look into other CI services?

Rob Lewis (Oct 24 2019 at 18:27):

It's not in the .travis.yml file. Maybe it's configured somewhere I don't have access to?

Reid Barton (Oct 24 2019 at 18:28):

^ @Mario Carneiro (?)

Rob Lewis (Oct 24 2019 at 18:29):

Hmm, actually, I'm not sure about Mergify. It merges the PR into master and lets Travis build that. Travis is... maybe... checking the merge of that with something else? Does Mergify's merge commit force the /merge branch to update?

Rob Lewis (Oct 24 2019 at 18:30):

I'd hope that adding a commit to the PR would update /merge to merge with the current master.

Reid Barton (Oct 24 2019 at 18:30):

Hopefully the commit mergify builds will be a descendant of current master anyways, right? Otherwise what would happen when it tries to push?

Reid Barton (Oct 24 2019 at 18:31):

So the merge from /merge ought to be a no-op

Rob Lewis (Oct 24 2019 at 18:31):

Mergify doesn't build the squashed commit.

Simon Hudon (Oct 24 2019 at 18:31):

We could test it. Start the build with git rev-parse HEAD

Reid Barton (Oct 24 2019 at 18:32):

Mergify doesn't build the squashed commit.

I'm going to go cry in the corner

Rob Lewis (Oct 24 2019 at 18:34):

The file state in the commit that Mergify pushes is exactly the same as the PR branch that Mergify checks. But the git history is different.

Rob Lewis (Oct 24 2019 at 18:34):

And now I don't trust that Travis is actually checking what's in the PR branch.

Rob Lewis (Oct 24 2019 at 18:35):

Travis runs again once a commit is added to master. THAT one should be checked correctly.

Johan Commelin (Oct 24 2019 at 18:43):

Sounds like it's time to look at alternatives

Johan Commelin (Oct 24 2019 at 19:00):

I can potentially look into setting up a Jenkins server. As far as I can see, Jenkins is capable of distributing builds to different servers, so we could share the load if others also have some resources

Johan Commelin (Oct 24 2019 at 19:43):

I guess the problem with moving away from Travis is that this also creates Mergify headaches

Rob Lewis (Oct 24 2019 at 19:47):

So it reached the lint stage a little while ago and started recompiling again. There's no way anything could have changed between the archive and lint stages. The archive took a minute, no commits to master, etc.

Rob Lewis (Oct 24 2019 at 19:48):

I added a line to touch the olean files before the lint stage. Not sure if it will help, but maybe.

Rob Lewis (Oct 24 2019 at 19:50):

If our hypothesis is true and it is getting unpredictable info from the /merge branch: my impression is this is bizarre, and inefficient, but not fatal.

Rob Lewis (Oct 24 2019 at 19:51):

There's no real connection between Mergify and Travis, any CI with proper GitHub support should work with Mergify.

Johan Commelin (Oct 24 2019 at 19:52):

Yeah... but I don't know how hard it is to get GitHub support working for self-hosted CIs

Rob Lewis (Oct 24 2019 at 19:55):

The costs of the Travis craziness are slower and occasionally broken PR builds, and maybe it's possible a broken PR could make it to master. The costs of switching to a CI with bad GitHub support are higher, I think.

Simon Hudon (Oct 24 2019 at 19:57):

GitLab has a nice CI infrastructure and it supports GitHub. If Travis is too much of a pain, it can be worth investigating. And the maximum timeout is 3 hours, which goes most of the way to building mathlib completely

Tim Daly (Oct 24 2019 at 19:59):

@Johan Commelin Did you mean running a GitHub server on a local machine? It is a pretty straightforward effort.

Johan Commelin (Oct 24 2019 at 20:04):

A Jenkins server

Johan Commelin (Oct 24 2019 at 20:04):

And it should integrate with the github repo

Johan Commelin (Oct 24 2019 at 20:04):

You can't run GitHub locally, can you? I thought it was closed source. Or did they open source it?

Rob Lewis (Oct 24 2019 at 20:06):

I think you can buy a license. Even if it were free, it would definitely not be worth it for us.

Bryan Gin-ge Chen (Oct 24 2019 at 20:08):

There's also "GitHub Actions", which should definitely integrate well with GitHub, although it's still in limited public beta. https://help.github.com/en/github/automating-your-workflow-with-github-actions/about-github-actions

Bryan Gin-ge Chen (Oct 24 2019 at 20:10):

Each build is allowed to upload some "artifacts", so we might be able to set up olean caches for each PR?

Rob Lewis (Oct 24 2019 at 20:47):

Argh. I'll let it go a bit longer to be sure, but it looks like #1606 is still rebuilding everything. https://travis-ci.org/leanprover-community/mathlib/builds/602507227?utm_source=github_status&utm_medium=notification

Rob Lewis (Oct 24 2019 at 20:47):

I'd expect max 5 min if all the oleans are there already.

Rob Lewis (Oct 24 2019 at 20:48):

How is the linting step relevantly different from the archive step?

Rob Lewis (Oct 24 2019 at 20:48):

https://github.com/leanprover-community/mathlib/blob/3f1125af2a425123fc2535e59714c6a991b1cf7f/.travis.yml

Tim Daly (Oct 24 2019 at 21:31):

You can run a Git server. GitHub is just a collection of servers with a user API.

Tim Daly (Oct 24 2019 at 21:33):

I ran a private git server on my local net when I was doing security research

Simon Hudon (Oct 24 2019 at 21:36):

I would advise against jumping to that solution. Once we set up our own servers, we have to keep them running, fix the configuration, etc. There are other CI services that we can investigate before we decide to run our own

Rob Lewis (Oct 24 2019 at 21:37):

There is precisely 0% chance we are going to self-host the mathlib git repo.

Tim Daly (Oct 24 2019 at 21:38):

Github does what lean needs and seems to do it best. I'm not suggesting that you use local servers. I do, but for public work like Axiom, I use github.

Tim Daly (Oct 24 2019 at 21:40):

For my security work I needed jenkins and git but hosted on a secure subnet. I was just responding to the question of whether it was possible to run git-like services locally.

Tim Daly (Oct 24 2019 at 21:42):

Nobody wants the extra burden of admin work on a server. I do it, but it is annoying and I don't recommend it unless you have the skills, patience, and nothing else worth doing. :-)

Simon Hudon (Oct 24 2019 at 21:46):

Thanks for clarifying. I'm especially worried about the last one. I think there are plenty of skilled and dedicated people around but there's plenty of other stuff they can look into

Mario Carneiro (Oct 25 2019 at 00:08):

AFAICT I don't see any magic options in Travis that adjust how it builds PRs. This page mentions the */merge ref

Mario Carneiro (Oct 25 2019 at 00:10):

the only thing I can do is turn different kinds of builds on or off (they are all on), or adjust the .travis.yml

Rob Lewis (Oct 25 2019 at 09:12):

Oh my god, this is so frustrating. I got the linting test to work (i.e., Travis failed due to linting errors) by moving it to the test stage. Then I changed the linting test to a weaker one that should pass. Now it's running the linting test in the pre-build. It's reusing cached oleans from the previous run, which is fine and expected. But the auto-generated lint_mathlib.lean shouldn't show up until the test stage. Is it somehow retrieving the file state from the previous run?
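For context, the linting step works by generating a throwaway Lean file that imports everything and invokes the linter, so the file should only exist during that stage. A rough sketch of the generation step (the exact contents of the real lint_mathlib.lean may differ):

```shell
# Sketch: the lint stage writes a short driver file that imports all of
# mathlib and runs the mathlib-wide linter command, then compiles it.
# The file is generated at build time and is not part of the git repo.
printf 'import all\n#lint_mathlib\n' > lint_mathlib.lean
```

Since the file is autogenerated, any copy that survives into the build cache (as an orphaned .lean or .olean) can confuse later stages.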

Rob Lewis (Oct 25 2019 at 09:12):

lint_mathlib.lean isn't part of the git repo. It failed to build in the last run, so there shouldn't be an olean, and even if there were, purge_olean.sh should delete it.
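For reference, the purge step amounts to deleting any .olean whose .lean source is missing; a minimal sketch of that logic (file names here are hypothetical, and the actual purge_olean.sh in the repo may differ in detail):

```shell
# Sketch: remove orphaned .olean files, i.e. compiled artifacts whose
# corresponding .lean source no longer exists in the tree.
mkdir -p demo2/src
touch demo2/src/keep.lean demo2/src/keep.olean   # source present: olean kept
touch demo2/src/orphan.olean                     # no source: should be purged
find demo2/src -name '*.olean' | while read -r f; do
  [ -f "${f%.olean}.lean" ] || rm "$f"
done
```

If this runs before the lint stage, a stale lint_mathlib.olean left in the cache should be removed, since lint_mathlib.lean is generated later.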

Rob Lewis (Oct 25 2019 at 09:13):

Is there a way I can see the contents of the directory Travis is working in?

Rob Lewis (Oct 25 2019 at 09:14):

I still have no idea why the separate linting stage was refusing to use the cached oleans.

Rob Lewis (Oct 25 2019 at 09:21):

Hmm. I guess this is the meaning of this line: https://github.com/leanprover-community/mathlib/blob/master/.travis.yml#L17

Rob Lewis (Oct 25 2019 at 09:23):

But shouldn't https://github.com/leanprover-community/mathlib/blob/master/.travis.yml#L38 delete lint_mathlib.lean?

Simon Hudon (Oct 25 2019 at 12:42):

Which part runs the linting step?

Simon Hudon (Oct 25 2019 at 12:56):

What I would suggest is that you delete lint_mathlib.lean after you're done using it. This way, it won't make its way into the build cache

Rob Lewis (Oct 25 2019 at 12:56):

Right now, https://github.com/leanprover-community/mathlib/blob/d5b4664462e70132be1e6f0ef6cfbf2669b2ef77/.travis.yml#L89

Rob Lewis (Oct 25 2019 at 12:57):

I'm solving issues locally. It turns out the file name data/rat/meta.lean causes issues.

Rob Lewis (Oct 25 2019 at 12:57):

The autogenerated data/rat/all.lean imports .meta, which parses wrong.
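For illustration: `meta` is a reserved keyword in Lean 3 (as in `meta def`), so a relative import of a file literally named meta.lean trips up the parser. A hypothetical sketch (the replacement name is made up):

```lean
-- `meta` is a keyword in Lean 3, so the parser cannot read it as a
-- module name in a relative import:
import .meta        -- parse error: `meta` is read as the keyword
-- renaming the file sidesteps the clash:
import .meta_defs   -- hypothetical replacement file name
```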

Simon Hudon (Oct 25 2019 at 12:57):

ah! What if you quote it?

Rob Lewis (Oct 25 2019 at 12:57):

I'm going to push a commit renaming that in a sec, which should make the current setup work.

Rob Lewis (Oct 25 2019 at 12:58):

But it still doesn't explain why Travis wanted to build mathlib from scratch when I had the linting step in its own stage.

Simon Hudon (Oct 25 2019 at 12:59):

Did using touch help?

Rob Lewis (Oct 25 2019 at 12:59):

Nope.

Rob Lewis (Oct 25 2019 at 13:02):

The whole thing is muddled by the fact that Travis actually builds merges with master. data.rat.meta didn't even exist on the PR branch, so I couldn't figure out locally what was happening.

Rob Lewis (Oct 25 2019 at 13:04):

Is it okay if I merge #1612 directly? I built locally with these changes.

Simon Hudon (Oct 25 2019 at 13:04):

That seems so nonsensical to me

Rob Lewis (Oct 25 2019 at 13:04):

It's absolutely insane.

Simon Hudon (Oct 25 2019 at 13:04):

go for it

Simon Hudon (Oct 25 2019 at 13:05):

I'm going to look into setting up a gitlab pipeline. Maybe that can make things better

Rob Lewis (Oct 25 2019 at 13:05):

Oh, actually, never mind. I don't need to merge that yet, we can let the check run just to be sure.

Rob Lewis (Oct 25 2019 at 13:08):

https://travis-ci.org/leanprover-community/mathlib/builds/602804142?utm_source=github_status&utm_medium=notification

Rob Lewis (Oct 25 2019 at 13:08):

This has to pass.

Bryan Gin-ge Chen (Oct 25 2019 at 15:37):

Unfortunately, it will fail soon because #1612 missed a data.rat.meta in test/rat.lean. I've fixed it in that PR, but not in the lint one yet.

Bryan Gin-ge Chen (Oct 25 2019 at 15:39):

@Rob Lewis Do you want me to rebase travis_lint on rename_rat_meta?

Rob Lewis (Oct 25 2019 at 15:40):

Gah.

Rob Lewis (Oct 25 2019 at 15:40):

I just did, but thanks.

Rob Lewis (Oct 25 2019 at 16:00):

It passed. Proof of concept, we can run a lint test in Travis.

Rob Lewis (Oct 25 2019 at 16:00):

Of course, the script in that PR is a mess now from all my attempts to make it work. And I'm done with this for the weekend.

Simon Hudon (Oct 25 2019 at 19:07):

I just finished an experiment with gitlab. Have a look: https://gitlab.com/simon.hudon/mathlib/commit/5b32a70e3f0006f3f6cc93f836aac533ce89c6ed/pipelines?ref=master

Simon Hudon (Nov 08 2019 at 20:10):

@Rob Lewis I'm trying Azure for lean-depot. So far, it seems so much faster than Travis or GitLab. I think we might want to consider adopting it for mathlib

Rob Lewis (Nov 09 2019 at 09:41):

Sounds good to me! The free or paid version?

Rob Lewis (Nov 13 2019 at 21:03):

@Simon Hudon GitHub just popped a bunch of balloons in my face about this: https://github.com/features/actions

Rob Lewis (Nov 13 2019 at 21:04):

I usually don't click annoying ads like that out of principle, but it's free for open projects and (presumably) well integrated with GitHub. At a glance, it looks like it could do what we want.

Rob Lewis (Nov 13 2019 at 21:04):

Maybe worth investigating too.

Simon Hudon (Nov 13 2019 at 21:07):

That looks like an exciting tool. Are you thinking of a specific use?

Rob Lewis (Nov 13 2019 at 21:10):

Uh, as far as I can tell it can do the same thing as Travis or Azure.

Rob Lewis (Nov 13 2019 at 21:11):

I couldn't find any info about how much time or speed they give you.

Bryan Gin-ge Chen (Nov 13 2019 at 21:12):

Here's a page with some more details on limits: https://help.github.com/en/github/automating-your-workflow-with-github-actions/about-github-actions

Sebastian Ullrich (Nov 14 2019 at 09:40):

GitHub Actions in fact run on Azure, and they are quite similar to Azure Pipelines in most other aspects, but not completely. It is quite confusing.

Simon Hudon (Nov 14 2019 at 15:22):

Are there things that GitHub Actions can do but Azure can't?

Sebastian Ullrich (Nov 14 2019 at 15:25):

I don't know, I have only looked at Pipelines in detail


Last updated: Dec 20 2023 at 11:08 UTC