Zulip Chat Archive

Stream: general

Topic: mathlib growth


Johan Commelin (Dec 14 2021 at 12:52):

Over at https://leanprover.zulipchat.com/#narrow/stream/113488-general/topic/lines.20per.20commit/near/264120072 I asked about the trend for lines per commit to mathlib master. I received some helpful replies.

I'm trying to understand mathlib growth better. The graphs at https://leanprover-community.github.io/mathlib_stats.html suggest a pretty stable linear growth rate, with a singular increase in derivate around the start of the covid pandemic (correlation?).

The number of contributors is growing, but the number of lines added per month isn't. One explanation is that the number of contributors per month is quite stable. For every new contributor, an old one stops contributing.

  • Does the data support that hypothesis?
  • What is the average number of commits per contributor? How is this distributed?
  • Other interesting metrics?

I have no skills at all in the data science department. If someone can help me out, that would be great.

Riccardo Brasca (Dec 14 2021 at 13:00):

It would be interesting to know how much work is devoted to refactoring and how much to prove new theorem. But I don't know how to measure it.

Yaël Dillies (Dec 14 2021 at 13:00):

Number of lines added - number of lines removed?

Riccardo Brasca (Dec 14 2021 at 13:02):

This is an interesting idea!

Yaël Dillies (Dec 14 2021 at 13:03):

My net total usually goes down when I refactor, for example.

Johan Commelin (Dec 14 2021 at 13:04):

Right. All the golfing attempts fight against the growth trend. If mathlib grows 25K lines in a month, it actually grew a bit more, probably. But I have no idea by how much.

Riccardo Brasca (Dec 14 2021 at 13:04):

I am thinking about stuff like changing the definition of a group (I don't find the PR). Does someone has the PR number?

Johan Commelin (Dec 14 2021 at 13:05):

#7255

Arthur Paulino (Dec 14 2021 at 13:14):

@Johan Commelin Another hypothesis is that space for contribution saturates as the lib grows. And people do get involved in refactoring and/or simplification. That is, the lib can also evolve with line removals

Riccardo Brasca (Dec 14 2021 at 14:17):

#7255 had 634 lines added and 465 removed. Added - removed gives 169, that is surely not a lot considering the work done, so this seems a reasonable thing to consider if we want to ignore refactoring

Floris van Doorn (Dec 14 2021 at 16:20):

If I look at the commits per month graph, the growth definitely seems to be increasing (even when ignoring all data before March 2020). The lowest months of 2021 (May and July) are about or above average when compared to the period 2020-03 till 2020-12.

Floris van Doorn (Dec 14 2021 at 16:22):

So I'm not sure if "linear growth rate" is accurate. But it's possible that the number of commits has increased, but the size of commits has decreased.

Rob Lewis (Dec 14 2021 at 16:40):

If you do want a growth metric that includes refactoring, lines added + lines removed might be interesting. On our stats page, an increase here would only show up in the commit graph, where it could be obscured by other things.

Arthur Paulino (Dec 14 2021 at 17:12):

Backwards moving average (20-sized windows) of the metric Rob mentioned (using the data mentioned here):
image.png

Eric Wieser (Dec 14 2021 at 17:14):

Can you make the window time-based rather than commit based?

Arthur Paulino (Dec 14 2021 at 17:17):

The data doesn't have that info :/

Arthur Paulino (Dec 14 2021 at 17:24):

I also wanted to check the "average journey of a mathlib contributor", motivated by the hypothesis that Johan raised. I thought about plotting contribution metrics by tenure (after the first contribution), normalized by the average size of the contributions (attempting to mitigate the bias that person A contributes more than person B in volume). I'd expect to see the contribution volume decreasing after the first contribution

Johan Commelin (Dec 14 2021 at 17:31):

I think to get useful graphs, it will be very helpful to have data that contains commit timestamps. Is that hard to get?

Rob Lewis (Dec 14 2021 at 17:33):

There's a git API in Python that you could use to extract the data pretty easily

Arthur Paulino (Dec 14 2021 at 17:36):

That would make things easier. I glanced through the GitHub API and I couldn't find it. It's a big API

Rob Lewis (Dec 14 2021 at 17:39):

The GitHub API is different, although if you want to associate contributors to GitHub user accounts you'd have to use it too. I was talking about https://gitpython.readthedocs.io/en/stable/

Arthur Paulino (Dec 14 2021 at 17:46):

Nice! I might take a loot at it later

Arthur Paulino (Dec 15 2021 at 02:57):

Okay, I finally had some time to take a look at it. Turns out GitHub blocked me from using the API because I made too many requests (mathlib has too many commits).

So I created a Python script that parses the returns from terminal calls and processes everything locally. Here's the gist if anyone wants to use it.

And here is the extracted data :+1:

Arthur Paulino (Dec 15 2021 at 03:13):

And here's the graph with line changes per day (smoothed) across time:
image.png

Arthur Paulino (Dec 15 2021 at 03:21):

Looks like a good moment to buy :stuck_out_tongue:

Johan Commelin (Dec 15 2021 at 05:14):

@Arthur Paulino Great! Thanks for doing this

Patrick Stevens (Dec 15 2021 at 08:10):

Your get_commit_hashes is secretly git log --pretty='format:%H', by the way

Patrick Stevens (Dec 15 2021 at 08:16):

Probably simpler to parse what you want from git log --pretty='format:%H%x00%an%x00%ai' --numstat

Patrick Stevens (Dec 15 2021 at 08:17):

which outputs the complete list of commits in the form:
long-hash null-byte author name null-byte author date in ISO format newline
followed by a list of files changed in the form lines-added whitespace lines-deleted whitespace filename

Patrick Stevens (Dec 15 2021 at 08:18):

(just saves the multiple calls to git really)

Stuart Presnell (Dec 15 2021 at 08:37):

  • Is it possible to count the number of characters rather than just lines?

  • Is there any usable information in the content of commit or PR descriptions — for example, what fraction of commits contain "golf"?

  • Can we count the growth rate of the number of lemmas, theorems, defs, classes etc?

Arthur Paulino (Dec 15 2021 at 09:25):

Thanks @Patrick Stevens !
Any other format option to extract the info Stuart asked for?

Arthur Paulino (Dec 15 2021 at 09:29):

Actually I will check the documentation for --pretty later :+1:
This is very useful knowledge

Eric Wieser (Dec 15 2021 at 09:29):

Have you considered using gitpython as Rob suggested?

Arthur Paulino (Dec 15 2021 at 09:30):

Yes but I got blocked. Unless I did something wrong (totally possible)

Eric Wieser (Dec 15 2021 at 09:31):

By what?

Eric Wieser (Dec 15 2021 at 09:31):

Gitpython works offline

Arthur Paulino (Dec 15 2021 at 09:31):

By GitHub

Eric Wieser (Dec 15 2021 at 09:31):

You should be able to use gitpython on a local clone

Arthur Paulino (Dec 15 2021 at 09:32):

There. I misused it :smiley:

Eric Wieser (Dec 15 2021 at 09:32):

I didn't even know that it supported any part of the github api

Arthur Paulino (Dec 15 2021 at 09:36):

I still want to learn about --pretty tho, so I might as well do both

Johan Commelin (Dec 15 2021 at 10:51):

@Arthur Paulino Thanks for doing this. I think it's great to get some insight into this data.

Arthur Paulino (Dec 15 2021 at 11:43):

Peeking at commit diffs content is a bit more expensive, but should be feasible

Arthur Paulino (Dec 15 2021 at 13:49):

Gist updated with Patrick's tips. It's waaaay shorter and faster :pray:

Julian Berman (Dec 15 2021 at 13:49):

I also got slightly curious, so here's a notebook you may also find very slightly useful if you didn't poke at gitpython too much yet: https://gist.github.com/Julian/3cb202d9461e73b6c06dd564683f5118

I was on my way to trying to train some simple model but got tied up grading finals womp womp so probably no more for me till later.

Arthur Paulino (Dec 15 2021 at 13:52):

I'm gonna use gitpython to get more details about the commit and the diffs (I suppose it's possible to look at diffs in gitpython)

Arthur Paulino (Dec 15 2021 at 13:53):

About "area": file.split("/")[1] if "/" in file else None,: I thought about doing something similar, but a commit might tweak more than 1 area so I left it for later

Julian Berman (Dec 15 2021 at 13:53):

yes you can look at diffs in gitpython

Julian Berman (Dec 15 2021 at 13:54):

right in that dataset above one row is not a commit

Julian Berman (Dec 15 2021 at 13:54):

it's a single file in a single commit

Arthur Paulino (Dec 15 2021 at 13:54):

Got it

Arthur Paulino (Dec 15 2021 at 13:55):

Right, I'm blind :sweat_smile:

for commit in repo.iter_commits()
for file, stats in commit.stats.files.items()

Rob Lewis (Dec 15 2021 at 14:01):

Stuart Presnell said:

  • Is it possible to count the number of characters rather than just lines?

  • Is there any usable information in the content of commit or PR descriptions — for example, what fraction of commits contain "golf"?

  • Can we count the growth rate of the number of lemmas, theorems, defs, classes etc?

Counting declarations (properly) goes beyond git. You can't get an accurate number without a compiled mathlib, which even if you wanted to get Azure caches, we don't have for every commit. So this data would take weeks to collect. You'd have to approximate it by counting occurences of lemma/theorem/inductive etc in the text.

Rob Lewis (Dec 15 2021 at 14:02):

Arthur Paulino said:

And here's the graph with line changes per day (smoothed) across time:
image.png

Sorry if I'm missing something, what are MA_25/50/100?

Arthur Paulino (Dec 15 2021 at 14:03):

@Rob Lewis those are backward moving averages of 25/50/100 days (ago)

Arthur Paulino (Dec 15 2021 at 14:07):

And yeah, counting declarations is a bit noisy because if a declaration is renamed or moved around, there will be a "removed" declaration and a "created" one

Arthur Paulino (Dec 16 2021 at 00:08):

Found some extra time to play around with mathlib3 git data. This time using part of @Julian Berman's logic to extract it.

Here's the notebook!

Arthur Paulino (Dec 16 2021 at 00:31):

I was able to learn some interesting details about mathlib.

  • Before humans existed, mathlib didn't have a src folder and every field of study had a folder in the root directory of the repository
  • Mathlib has a reasonably high rate of line removals per commit, which reflects the community's concern about modularity/organization (actually, let me add this one graph)
  • Not only the number of cumulative contributors is growing across time, which is obviously expected, but also the number of distinct contributors by month. I believe this is a very healthy sign of success
  • I was rather surprised by the fact that the ratio of k-stayers is roughly stable since the beginnings of time

Johan Commelin (Dec 16 2021 at 06:19):

@Arthur Paulino Great! These are exactly the kind of insights that I was hoping for. (I distinctly remember the moment in time when humans appeared, and all top-level folder were moved into src/.)


Last updated: Dec 20 2023 at 11:08 UTC