Zulip Chat Archive

Stream: general

Topic: Number of contributors to mathlib?


Jake Levinson (Jan 24 2024 at 22:20):

Hi all, I'm moderating a panel for a conference this weekend, which will touch a bit on Lean and formalization tools. I'd like to be able to cite some statistics about Lean and mathlib.

This page [https://leanprover-community.github.io/mathlib_stats.html] says there are around 300 contributors to mathlib. That seems low to me, maybe it's only for mathlib4? I'm also pretty sure there used to be an actual list somewhere, so e.g. if I wanted to look at the number of users who've contributed ≥X lines of code or ≥Y PRs (I remember going and finding my own name on it!).

Can anyone confirm that this is the correct number? Or point me to some other statistics?

Yaël Dillies (Jan 24 2024 at 22:24):

We hit 400 on mathlib before the port, so I think your guess is correct (it only looks at mathlib4)

Yaël Dillies (Jan 24 2024 at 22:25):

The stats table disappeared because stats were hugely skewed due to porting. I don't know whether there's plans to bring it back in some form.

Moritz Firsching (Jan 24 2024 at 22:48):

~> git --git-dir=./mathlib4/.git log --format='%aN' | sort -u > mathlib4_authors.txt
~> git --git-dir=./mathlib/.git log --format='%aN' | sort -u > mathlib3_authors.txt
~> cat mathlib4_authors.txt | wc -l
210
~> cat mathlib3_authors.txt | wc -l
330
~> cat mathlib3_authors.txt mathlib4_authors.txt | sort -u | wc -l
425

This is not perfect, because a few names appear twice in slighly different versions (e.g. with and without space, or in capital letters and without), but it should be a good rough estimate?!

Moritz Firsching (Jan 24 2024 at 22:49):

This is at 65a1391a for mathlib and at a686ba82 for mathlib4

Adam Topaz (Jan 24 2024 at 22:50):

shouldn't you pipe the last command into uniq to avoid duplicates?

Adam Topaz (Jan 24 2024 at 22:50):

or does the -u flag do this?

Adam Topaz (Jan 24 2024 at 22:50):

TIL!

Adam Topaz (Jan 24 2024 at 22:52):

There's probably some API provided by github that would give the data from https://github.com/leanprover-community/mathlib4/graphs/contributors which will then include number of commits, loc, etc. per contributor

Jake Levinson (Jan 24 2024 at 23:45):

Thank you all, this is great!

Oliver Nash (Jan 25 2024 at 09:46):

A metric I'd be interested in seeing is the following. Pick a time period, say 6 months and calculate the net number of lines added. Then take all contributors, sort them by net lines added (unfortunately this unfairly punishes golfers) in this period and count many are required till the sum of their net lines added is some critical percentage, say 95%. Then plot this count over time.

Rob Lewis (Jan 25 2024 at 14:52):

FYI, the count on the stats page was never updated with the port. It's fixed at the mathlib 3 number. PRs to the website repo to compute the updated number here would be welcome.

Jake Levinson (Jan 25 2024 at 20:14):

Oliver Nash said:

A metric I'd be interested in seeing is the following. Pick a time period, say 6 months and calculate the net number of lines added. Then take all contributors, sort them by net lines added (unfortunately this unfairly punishes golfers) in this period and count many are required till the sum of their net lines added is some critical percentage, say 95%. Then plot this count over time.

Maybe you could add up the absolute number of lines added or removed per PR, instead of lines added? If my PR reduces a 50-line proof to 20-lines, I feel like I've contributed 30 lines (of negative space). Of course if someone repeatedly adds and removes the same material, that would be overcounted, but I imagine that doesn't actually happen much.

Jireh Loreaux (Jan 26 2024 at 03:47):

What does happen though are cleanup or renaming, which may touch a lot of lines, but are perhaps not very hard.

In any case, I think the spirit of Oliver's suggestion is probably more along the lines of: what does the Lorenz curve of Mathlib contributions look like as a function of time?

Yury G. Kudryashov (Jan 26 2024 at 04:35):

Jireh Loreaux said:

What does happen though are cleanup or renaming, which may touch a lot of lines, but are perhaps not very hard.

If you look at |linesAdded - linesRemoved| for each PR, then renaming PRs will have ~zero impact.

Jireh Loreaux (Jan 26 2024 at 05:45):

Aha, that's different than what I was thinking. Yes, indeed, that could be a nice metric.


Last updated: May 02 2025 at 03:31 UTC