Zulip Chat Archive

Stream: general

Topic: mathlib statistics


Rob Lewis (Jun 27 2020 at 12:16):

June is now a record setting month for mathlib! 70% more commits this month than our pre-corona high. Our 2020 total is already very close to our 2019 total. And there's a very good chance we'll have 300k lines of code in the repository before the end of the month.

Kevin Buzzard (Jun 28 2020 at 00:18):

My blog has also had has many hits this year as in the whole of last year

Kevin Buzzard (Jun 28 2020 at 00:23):

43000 views and 20000 visitors this year. My post on mathematics in type theory has already had as many views as my viraly fashionable mathematics post. These posts are both just basically adverts for the lean and mathlib community. I'm really hoping that as the API gets better we'll be able to really start attracting mathematicians

Rob Lewis (Jul 01 2020 at 08:34):

Yesterday was a big day! We closed out June by hitting a bunch of milestones. For one, we broke 300k loc. For another, the number of commits in June was more than double the pre-corona high. And hafway into 2020, we have more commits for the year than we did in all of 2019.

Patrick Massot (Jul 01 2020 at 08:41):

Let's hope we'll have another lockdown to migrate mathlib to Lean 4. We'll need that kind of productivity.

Johan Commelin (Jul 01 2020 at 09:00):

And we didn't even count the commits to the community fork.

Heather Macbeth (Oct 27 2020 at 18:05):

On the mathlib statistics page:
https://leanprover-community.github.io/mathlib_stats.html
there's a running total of the number of definitions and theorems and meta declarations in mathlib. Could there also be a running total of the number of people who have contributed to mathlib?

Yury G. Kudryashov (Oct 27 2020 at 18:06):

https://github.com/leanprover-community/mathlib/graphs/contributors

Eric Wieser (Oct 27 2020 at 18:07):

Pretty sure those never show more than 99 contributors

Yury G. Kudryashov (Oct 27 2020 at 18:07):

Sorry, this page doesn't show total

Heather Macbeth (Oct 27 2020 at 18:07):

Yes (there's also a list like this further down the website page
https://leanprover-community.github.io/mathlib_stats.html
). But I'm suggesting adding the absolute number to the top-line total. Just for marketing!

Yury G. Kudryashov (Oct 27 2020 at 18:07):

https://github.com/leanprover-community/mathlib/ says 119

Rob Lewis (Oct 27 2020 at 18:09):

It's slightly trickier than you'd expect. The numbers at the top come from doc-gen, at the time the website is generated. The stats below come from a JS file, which is generated once a day by gitstats, and you get the most recent numbers whenever you load the website.

Rob Lewis (Oct 27 2020 at 18:09):

Presumably gitstats can give us the contributor count.

Rob Lewis (Oct 27 2020 at 18:10):

But it won't be static at the time the website is generated, we'd have to update the count with JS or something.

Kevin Buzzard (Oct 27 2020 at 18:10):

There are certainly Imperial undergrads who contributed in 2018 and are mentioned in the first few lines of Lean files but who never made a PR (e.g. they were working in a small group together, and one PR'ed).

Rob Lewis (Oct 27 2020 at 18:11):

Kevin Buzzard said:

There are certainly Imperial undergrads who contributed in 2018 and are mentioned in the first few lines of Lean files but who never made a PR (e.g. they were working in a small group together, and one PR'ed).

You beat me to it, this is the second reason it's tricky. The "real" count is something like 15-20 more than what GitHub says.

Heather Macbeth (Oct 27 2020 at 18:11):

But this can be addressed, right? Just add 15 :)

Rob Lewis (Oct 27 2020 at 18:11):

And there are all kinds of ambiguities in how you count this.

Heather Macbeth (Oct 27 2020 at 18:12):

Frankly, I don't think it matters if the count is exactly right. What would be nice would be to have a count which is reasonable, approximately correct, and keeps going up.

Gabriel Ebner (Oct 27 2020 at 18:12):

I wouldn't use the git statistics, is sgouezel <urkud@urkud.name> a contributor? Counting the number of github usernames who made a PR seems more robust.

Yury G. Kudryashov (Oct 27 2020 at 18:13):

We have a mailmap file

Eric Wieser (Oct 27 2020 at 18:13):

You can use a .mailmap to merge together emails from the same github user

Rob Lewis (Oct 27 2020 at 18:13):

Gabriel Ebner said:

I wouldn't use the git statistics, is sgouezel <urkud@urkud.name> a contributor? Counting the number of github usernames who made a PR seems more robust.

I think this is what the GitHub stats do.

Bryan Gin-ge Chen (Oct 27 2020 at 18:13):

I've been maintaining a mailmap which takes care of sgouezel <urkud@urkud.name>: https://github.com/leanprover-community/mathlib-mailmap (maintainers only so that spam crawlers won't immediately get all our email addresses)

Rob Lewis (Oct 27 2020 at 18:13):

Oh, wait, maybe we can use the PyGithub API to get a contributor count as part of the website build?

Eric Wieser (Oct 27 2020 at 18:14):

There are certainly Imperial undergrads who contributed in 2018 and are mentioned in the first few lines of Lean files but who never made a PR

As a hack, you could add their names to the .mailmap, and stick co-authored-by in the commit message with their names

Heather Macbeth (Oct 27 2020 at 18:14):

Or just use the algorithm, "count the number of entries in the Authors table".

Heather Macbeth (Oct 27 2020 at 18:15):

Sine that is maintained by hand by @Bryan Gin-ge Chen to be approximately duplication-free.

Eric Wieser (Oct 27 2020 at 18:15):

(maintainers only so that spam crawlers won't immediately get all our email addresses)

arguably this is a lost cause, any crawler clever enough to crawl github for email addresses probably knows they can get the emails from downloading the repo. I can get the email for any given PR by appending .patch to the URl

Bryan Gin-ge Chen (Oct 27 2020 at 18:15):

Yeah, that's true.

Yury G. Kudryashov (Oct 27 2020 at 18:17):

Some crawlers can browse the Internet randomly and sometime come to a link to .mailmap

Rob Lewis (Oct 27 2020 at 18:26):

Is it better to put this number at the top of https://leanprover-community.github.io/mathlib_stats.html (and change the "Declaration Counts" header) or down by the authorship table/graphs?

Heather Macbeth (Oct 27 2020 at 18:27):

I suggest at the top (possibly even replacing "Meta Declarations")

Heather Macbeth (Oct 27 2020 at 18:28):

With no disrespect intended to meta declarations ... :)

Heather Macbeth (Oct 27 2020 at 18:28):

Or next to it, if there's space for 4 entries.

Eric Wieser (Oct 27 2020 at 18:29):

Or fold meta declarations under definitions as 12345 (+ 678 meta)

Heather Macbeth (Oct 27 2020 at 18:29):

New heading: "mathlib by the numbers" (I like this less on second thought), or "at a glance", or ...

Eric Wieser (Oct 27 2020 at 18:29):

Since lots of non-meta definitions are still useful in meta code anyway

Rob Lewis (Oct 27 2020 at 18:30):

https://github.com/leanprover-community/leanprover-community.github.io/pull/149

Heather Macbeth (Oct 27 2020 at 18:31):

Awesome!

Eric Wieser (Oct 27 2020 at 18:31):

Some other interesting subsets of definitions worth counting might be:

  • Types
  • type classes
  • noncomputable vs computable
  • instances

Rob Lewis (Oct 27 2020 at 18:31):

image.png

Heather Macbeth (Oct 27 2020 at 18:46):

The stats count makes for interesting reading. Mathlib has doubled in size (from 200k to 400k lines of code) in the last year?

Johan Commelin (Oct 27 2020 at 18:46):

That's what a lockdown does for you

Yury G. Kudryashov (Oct 27 2020 at 18:47):

Note that we moved some code from stdlib

Heather Macbeth (Oct 27 2020 at 18:48):

I was going to say, and 40% of those new lines of code are from @Yury G. Kudryashov -- but in fact since he refactors so much, "only" 20% :)

Yury G. Kudryashov (Oct 27 2020 at 18:49):

3.5K lines were moved from stdlib in #2697

Heather Macbeth (Oct 27 2020 at 18:50):

Compared to the other growth, that's a rounding error!

Kevin Lacker (Oct 27 2020 at 19:04):

hopefully the work required to port to Lean 4 is not a linear function of the lines of code in mathlib....

Mario Carneiro (Oct 27 2020 at 19:05):

it is, although the constant is probably small

Bryan Gin-ge Chen (Oct 27 2020 at 19:08):

A lot will depend on how quickly Lean 3 users can get trained in Lean 4. So those who get proficient quickly should get ready to write lots of documentation and tutorials.

Yury G. Kudryashov (Oct 27 2020 at 19:11):

I expect that the main issue is the tactic code (probably because I don't feel proficient in writing tactics).

Yury G. Kudryashov (Oct 27 2020 at 19:12):

I mean, the fact that I'm not proficient in writing meta code affects my impression about relative difficulty of migrating meta vs non-meta code.

Yury G. Kudryashov (Oct 27 2020 at 19:13):

I wrote a sentence, then realized that it was ambiguous.


Last updated: Dec 20 2023 at 11:08 UTC