Zulip Chat Archive

Stream: general

Topic: Periodic reports


Damiano Testa (May 15 2024 at 16:48):

It was suggested in this thread that there might be interest to have a periodic report on some generic global information about mathlib. I personally find this very interesting!

This would be a bot that periodically posts a message on Zulip with some mathlib statistics.

I think that there are two main decisions to take:

  • what should be the frequency (possible ranges: never...daily -- more is probably too much?);
  • what statistics would the community like to see!

I am going to start two polls, one for frequency and one for possible topics: please vote your favourite or suggest more!

PS I am happy to try to implement the automation and the statistics, but if you ask for something too wild, I may not be able to do it! :smile:

Damiano Testa (May 15 2024 at 16:48):

/poll Frequency
Daily
Weekly
Biweekly
Monthly

Damiano Testa (May 15 2024 at 16:48):

/poll Mathlib statistic (could be more than one)
Top five popular tactics since last post
Net change in number of lines of Mathlib
Number of new declarations
Number of invisible (autogenerated) declarations
Number of docstrings
Longest proof (expression)
Longest proof (syntax)
Most popular starting tactic
Most popular finishing tactic

Matthew Ballard (May 15 2024 at 17:10):

Zulip is good but a bit incestuous. I was only half joking with the LinkedIn suggestion. Where does one disseminate this information to broader public these days?

Patrick Massot (May 15 2024 at 17:13):

We already have a statistics webpage. But we could indeed post something on mastodon for instance.

Damiano Testa (May 15 2024 at 17:14):

Ok, I guess that maybe I should also start a poll on where this should be posted! Maybe I will wait to see what information surfaces above, though.

Yaël Dillies (May 15 2024 at 17:42):

I can't seem to add options to the pole? I meant to add Slowest proofs and Slowest files

Damiano Testa (May 15 2024 at 19:32):

I had forgotten about the statistics page that Patrick mentioned! The number of theorems and the number of lines are already mentioned there, and these were the two most prominent features that were requested. This was easy!

Damiano Testa (May 15 2024 at 19:33):

The next one up seems to be number of new modules with docstrings (I guess that this means a file that did not exist the previous week and that contains at least one doc-string?).

Damiano Testa (May 15 2024 at 19:33):

I can look into either incorporating this information on the webpage or prepare a bot that posts on Zulip.

Damiano Testa (May 15 2024 at 19:34):

Even if we decide to post somewhere else, probably testing it in-house is a good idea.

Bolton Bailey (May 15 2024 at 20:02):

In particular, I think that longest pole/longest file(s)/longest proof(s) would be a good candidate for addition to the stats page rather than a feed, since those ones are less of a "here are new updates you might want to be made aware of" and more of a "long-running performance metrics we would like to track and optimize"

Yaël Dillies (May 15 2024 at 20:04):

Yaël Dillies said:

I can't seem to add options to the pole? I meant to add Slowest proofs and Slowest files

Now I can! Please move your reacts from my message to the poll :smile:

Yaël Dillies (May 15 2024 at 20:05):

I believe we should differentiate between:

  • sexy statistics to boast mathlib
  • gory statistics for people like Matthew or I trying to get mathlib faster/more organised

Bolton Bailey (May 15 2024 at 20:08):

Another option that I think has been suggested for longest pole is to include it in the speed center report somehow

Damiano Testa (May 15 2024 at 20:12):

In terms of differentiating, my initial idea was to have a (weekly according to the poll) feed that would report on one or two statistics on Zulip. I like the idea of advertising elsewhere, and I think that we should do it. I also agree that I would likely post different statistics on the two places.

Damiano Testa (May 15 2024 at 20:13):

I hope that this is how people voted above -- those are statistics that a bot would post weekly in a dedicated stream on Zulip. Feel free to change your answer, if this is not what you had intended!

Damiano Testa (May 15 2024 at 20:13):

I think that a little bit of testing of this model before we make another poll for a "public" posting is probably advisable.

Jon Eugster (May 15 2024 at 21:04):

You could also add links from the statistics page above to the speed center and to wherever stats are posted, so that it's easier to randomly stumble across them

Bolton Bailey (May 16 2024 at 00:45):

Also, while I take the point that the longest pole is not exactly great at being outward-facing, there is something sexy about knowing that "The ring of integers of a number field generated by a 5th-root of unity cyclotomic extension of the rationals is a principal ideal ring" is the deepest theorem we have been able to prove in the library.

Kim Morrison (May 16 2024 at 01:07):

Except that people who like theorem X should be trying to make sure that X is not the longest pole. :-)

Kim Morrison (May 16 2024 at 01:08):

And where "deepest" here could instead mean "least effort has been put into streamlining the dependencies"!

Eric Wieser (May 16 2024 at 02:58):

Or it could mean "streamlining is at the expense of this theorem because the fewest people care about it". I think a fair bit of streamlining is subjective and optimizes the bits current contributors care about at the expense of bits they don't (which is good, but not necessarily stable)

Yaël Dillies (May 16 2024 at 04:56):

Eric Wieser said:

I think a fair bit of streamlining is subjective and optimizes the bits current contributors care about at the expense of bits they don't

I doubt that's actually true in practice. The library is currently so underoptimised that any streamlining is good. Maybe in a few months once I've reasonably straightened out the imports the situation will be different, but right now I can't see any streamlining being bad to some files.

Kevin Buzzard (May 16 2024 at 07:15):

The fact that the integers of Q(ζ5)\mathbb{Q}(\zeta_5) are a PID is a nontrivial result, needing inputs from both algebra and analysis, so maybe it deserves to be the longest pole

Riccardo Brasca (May 16 2024 at 08:31):

Note that the proof is very close to the math proof (it is indeed quite short, and fast). To be honest I don't see anything wrong with it (for example we can avoid the Minkowski bound and make the proof more direct, but I don't see the point in doing it).

Ruben Van de Velde (May 16 2024 at 08:33):

I think the point is that we're seeing that this proof depends on a lot of prior work, and the question is whether this complexity is inherent or accidental - and it seems like it's inherent

Ruben Van de Velde (May 16 2024 at 08:33):

But it's worth asking the question

Kim Morrison (May 16 2024 at 08:33):

I was referring more to the 4/5th of the longest pole leading up to the result that you wouldn't immediately notice have anything to do with it

Kim Morrison (May 16 2024 at 08:34):

The proof itself can be inherently complex, and still have way too much stuff transitively imported into it.

Ruben Van de Velde (May 16 2024 at 08:42):

(Speaking of splitting, maybe someone should split this thread)

Damiano Testa (May 16 2024 at 12:59):

I have started looking into preparing a summary of the weekly increase. Below is a first draft: please, feel free to comment on what you like, dislike, would like to add!

Damiano Testa (May 16 2024 at 12:59):


Weekly stats (2024-05-09 2024-05-16)

615 files changed, 13859 insertions(+), 5864 deletions(-), 7995 total(insertions-deletions)


% Folder
4.3% Mathlib/Algebra/CharP/
4.3% Mathlib/Algebra/Group/
11.8% Mathlib/Algebra/
3.1% Mathlib/Analysis/
11.1% Mathlib/CategoryTheory/
5.3% Mathlib/Data/
3.0% Mathlib/MeasureTheory/Measure/
3.6% Mathlib/NumberTheory/LSeries/
3.4% Mathlib/NumberTheory/
8.7% Mathlib/RingTheory/Ideal/
6.8% Mathlib/RingTheory/
10.1% Mathlib/Tactic/CC/
3.5% Mathlib/Topology/
16.9% Mathlib/

commits: old fa33d89fc6, current 71f93293db.

Take also a look at the Mathlib stats page.

Rémy Degenne (May 16 2024 at 13:03):

What does the number in the % column mean?

Damiano Testa (May 16 2024 at 13:05):

I'm not sure, but I think that it is the percentage of change, according to git? The "% graph" is the output of git diff --dirstat "${commit}"...HEAD.

Damiano Testa (May 16 2024 at 13:06):

I found out about this by scavenging the git invocations, looking for "insertions and deletions". Someone on the internet mentioned --dirstat and it looked cool! :smile:

Damiano Testa (May 16 2024 at 13:08):

From the git manual:
--dirstat[=<param1,param2,…​>]

Output the distribution of relative amount of changes for each sub-directory. The behavior of --dirstat can be customized...

Damiano Testa (May 16 2024 at 13:08):

I guess that it is saying that 1617% of mathlib changed in the last week? :shrug:

Rémy Degenne (May 16 2024 at 13:11):

So most likely those 17% reflect some large refactors of low level stuff, or changes of the names of lemmas used everywhere, rather than new things added to mathlib?
I am trying to see if that stat is actually meaningful.

Luigi Massacci (May 16 2024 at 13:12):

Damiano Testa said:

Take also a look at the Mathlib stats page.

Is the recent part of the mathlib3 graph still relevant? Presumably the points after July 16, 2023 could be dropped?

Damiano Testa (May 16 2024 at 13:12):

Yeah, I am also trying to figure out what it means. Mathlib has 1,477,829 lines...

Damiano Testa (May 16 2024 at 13:13):

...and 13,859 is not really close to 17%.

Damiano Testa (May 16 2024 at 13:14):

Luigi Massacci said:

Damiano Testa said:

Take also a look at the Mathlib stats page.

Is the recent part of the mathlib3 graph still relevant? Presumably the stuff after July 16, 2023 could be dropped?

Yes, as the graph shows, mathlib3 became essentially obsolete as soon as the port was finished. I do not maintain that web-page though.

Damiano Testa (May 16 2024 at 13:16):

Back to the percentages: since it is a default option that git gives, I imagine that it has its reason of being, however I do not really know what it is. Unless someone is in favour of keeping it, I may as well drop it.

Damiano Testa (May 16 2024 at 13:16):

(Note that I am also planning to add the number of declarations to the first actual implementation, I just began with what did not require interacting simultaneously with the shell and lean in CI as a first attempt.)

Damiano Testa (May 16 2024 at 13:20):

The sum of all the percentages is 95.9, which may mean that this is explaining how the 13859 addition are spread across folders, where Mathlib presumably means "everything that was not accounted before".

Yaël Dillies (May 16 2024 at 14:31):

Damiano Testa said:

...and 13,859 is not really close to 17%.

Is it not 17% of the files?

Yaël Dillies (May 16 2024 at 14:32):

Damiano Testa said:

commits: old fa33d89fc6, current 71f93293db.

Can you make the commit hashes be links to the commits on github?

Damiano Testa (May 16 2024 at 14:34):

Yaël Dillies said:

Damiano Testa said:

...and 13,859 is not really close to 17%.

Is it not 17% of the files?

This could be.

Yaël Dillies said:

Damiano Testa said:

commits: old fa33d89fc6, current 71f93293db.

Can you make the commit hashes be links to the commits on github?

This should be easy, yes!

Edward van de Meent (May 16 2024 at 19:21):

i'm curious if a "top n merged PRs by diff size" might be informative?

Edward van de Meent (May 16 2024 at 19:22):

i imagine this might highlight important changes/additions

Ruben Van de Velde (May 16 2024 at 21:27):

Depending on how you count, they might all be Yaël moving things

Damiano Testa (May 18 2024 at 06:25):

Here is another proposal: what do you think of this report?

Damiano Testa (May 18 2024 at 06:25):


Weekly stats (2024-05-11 2024-05-18)

Reference commits: old 3d4137543f, new 712ef83de5.

987 files changed, 19810 insertions(+), 11144 deletions(-), 8666 total(insertions-deletions)

Declarations:

Type New Old Change
Definitions 73624 73161 463
Theorems 231138 230101 1037
Inductives 2451 2430 21
Other 6148 6080 68

Take also a look at the Mathlib stats page.

Johan Commelin (May 18 2024 at 07:42):

For an internal report, I think that's nice. For an external report, I guess many people will be confused about why "Inductives" is an entry.

Damiano Testa (May 18 2024 at 11:05):

Yes, I was aiming to produce a report for Zulip, test it a little and then upgrade it to an "external" report.

Eric Wieser (May 18 2024 at 11:09):

I'm not sure counting inductives is interesting; def MulOpposite X := X is not really any different to structure MulOpposite where val : X

Eric Wieser (May 18 2024 at 11:10):

I think the interesting counts are maybe:

  • Things in Sort _ (types / predicates)
  • Things in P where P : Prop (theorems)
  • Everything else (data)

Damiano Testa (May 18 2024 at 12:11):

@Eric Wieser, is this the distinction that you had in mind? The ConstantInfo:

  • does not have a value;
  • has a value whose type is Prop;
  • has a value whose type is Type _.

Eric Wieser (May 18 2024 at 12:19):

I don't think the "has a value" is interesting, even things without a value (opaque, axiom) still have a type?

Damiano Testa (May 18 2024 at 12:21):

defnInfo and thmInfo have a value, all other ConstantInfos do not.

Damiano Testa (May 18 2024 at 12:21):

(Though they still have a type.)

Damiano Testa (May 18 2024 at 12:22):

To be precise, I was thinking of docs#Lean.ConstantInfo.value?

Eric Wieser (May 18 2024 at 12:27):

I think just look at type?

Matthew Ballard (May 18 2024 at 12:33):

New contributors, total?

Damiano Testa (May 18 2024 at 12:37):

So,

  • inferType type = Prop => Theorem
  • inferType type = Type _ => types/predicates
  • everything else => data?

Eric Wieser (May 18 2024 at 13:02):

To detect types/ predicates you need to first introduce all the binders

Eric Wieser (May 18 2024 at 13:03):

Your first bullet is correct

Damiano Testa (May 18 2024 at 13:05):

Oh, let me try once more!

  • inferType type = Prop => Theorem
  • inferType (whatever MetaTelescope type gives) = Sort _ => types/predicates
  • everything else => data?

Eric Wieser (May 18 2024 at 13:12):

I think bullet two should not have an infertype

Damiano Testa (May 18 2024 at 13:13):

updated the above. I'll try to implement it!

Eric Wieser (May 18 2024 at 13:13):

Should be = Sort _ too, but otherwise looks good!

Damiano Testa (May 18 2024 at 13:15):

Ok, updated the above!

Damiano Testa (May 18 2024 at 13:54):

Here is the current iteration, with the tally as suggested by Eric:

Damiano Testa (May 18 2024 at 13:54):


Weekly stats (2024-05-11 2024-05-18)

Reference commits: old 82af17d1e5, new 964b14278e.

843 files changed, 19601 insertions(+), 11029 deletions(-), 8572 total(insertions-deletions)

Declarations:

Type New Old Change
Theorems 246865 245737 1128
Types/predicates 7649 7589 60
Other 58879 58486 393

Take also a look at the Mathlib stats page.

Yaël Dillies (May 18 2024 at 13:56):

So what does "Other" actually contain?

Damiano Testa (May 18 2024 at 13:56):

Do you want to see some example? If I did not mess up, I implemented the suggestion that Eric proposed.

Damiano Testa (May 18 2024 at 13:59):

It seems mostly garbage.
This is a random sample from the middle of other in Mathlib.Data.Nat.Defs or below:

Set.«term{_|_}_1»
Lean.Expr.isExplicitNumber
Preorder.toLT
Mathlib.getMathlibDir
Simps.zsmulArgs
ToAdditive.proceedFields
Simps.elabSimpsRule
Lean.Parser.Command.simpsRule
Mathlib.Util.«commandCompile_def%_»
instTransGeGt_mathlib
Function.«term_»
Simps.ParsedProjectionData.isPrefix
ToAdditive.addToAdditiveAttr
Function.extend
Nat.leRecOn
Lean.Meta.simpType
to_additive_relevant_arg
Mathlib.Prelude.Rename.align.precheck
Classical.decRel
ToAdditive.relevantArgAttr
Lean.MVarId.gcongr
Mathlib.Tactic.PushNeg.elabPushNegConv
Parser.Attr.qify_simps_proc
Trans.simple
Mathlib.Notation.unexpandExistsUnique
Nat.decidableLoHiLe
Lean.Parser.Attr.attrSimps?_
Mathlib.Tactic.tacticTransitivity___
Set.instSDiff
Classical.decPred
Mathlib.Tactic.tacticAssumption'
One.toOfNat1
ToAdditive.warnParametricAttr
Simps.instInhabitedAutomaticProjectionData
Simps.findAutomaticProjections
Mathlib.Tactic.variables
Simps.ProjectionRule.rename
Nat.strongRec'
Mathlib.Prelude.Rename.RenameMap.insert
Mathlib.ProjectionNotation.mkExtendedFieldNotationUnexpander
ToAdditive.applyReplacementFun.aux

Yaël Dillies (May 18 2024 at 14:01):

They are all definitions, right?

Damiano Testa (May 18 2024 at 14:04):

It seems like it. I wonder whether my code is not discriminating correctly.

Damiano Testa (May 18 2024 at 14:05):

This is what I am using, with hopefully clear names:

  if ( isProp type) then
    return {s with thms := s.thms + 1}
  else
    if  forallTelescopeReducing type fun _ e => return e.isSort then
      return {s with typesPreds := s.typesPreds + 1}
    else
      return {s with other := s.other + 1}

Damiano Testa (May 18 2024 at 14:12):

I wonder whether the other field

  • should be dropped, as it is junk;
  • has interesting first difference, in that there is a lot of junk at the beginning, but (almost) everything that gets added week by week is interesting.

Eric Wieser (May 18 2024 at 14:12):

Yes, I think "other" is "data"

Eric Wieser (May 18 2024 at 14:13):

And "predicate" should probably have been "propositions"

Damiano Testa (May 18 2024 at 14:13):

So, should I simply label it by data and be happy with it?

Eric Wieser (May 18 2024 at 14:13):

Or "definitions"

Damiano Testa (May 18 2024 at 14:13):

propositions has a possibly ambiguous meaning for mathematicians, as in "lemma, theorem, proposition".

Yaël Dillies (May 18 2024 at 14:14):

Data is junk! You've heard it here first :stuck_out_tongue_closed_eyes:

Damiano Testa (May 18 2024 at 14:14):

So, it may be good for this Zulip to say proposition, but for the external tally, I would not use propositions where theorems is also used.

Eric Wieser (May 18 2024 at 14:15):

You could also split propositions (Sort 0) and types (everything else), though maybe those counts are too small

Damiano Testa (May 18 2024 at 14:16):

Ok, I'll produce the finer splitting.

Eric Wieser (May 18 2024 at 14:16):

Some possible reasons that the stats page disagrees about theorems:

  • It might be omitting internal theorems, like equation lemmas
  • It might be counting all instances as definitions

Damiano Testa (May 18 2024 at 14:17):

The commits are possibly also different.

Eric Wieser (May 18 2024 at 14:17):

Not 100k theorems different I would hope

Damiano Testa (May 18 2024 at 14:17):

I also filter out non-mathlib files and I screen for

  if c.isUnsafe || ( n.isBlackListed) then return s else

Eric Wieser (May 18 2024 at 14:18):

I think including core and std is probably interesting; most mathlib users don't care whether a lemma is in mathlib/batteries/core, just that it's available with import Mathlib

Damiano Testa (May 18 2024 at 14:19):

I think that a reasonable count should contain results that have either an actual syntax or "should" (such as to_additive generated stuff).

Eric Wieser (May 18 2024 at 14:20):

Yes, to_additive and simps lemmas are worth counting

Damiano Testa (May 18 2024 at 14:20):

Ok, should I then include all declarations, not just the ones in mathlib? (The difference will hardly notice the difference, I imagine).

Damiano Testa (May 18 2024 at 14:26):

The refined count with Mathlib only:

Theorems 157759
Types 1492
Predicates 3731
Other 37490

and with all declarations in the environment:

Theorems 167953
Types 3325
Predicates 4148
Other 60099

Damiano Testa (May 18 2024 at 14:27):

Oh, I forgot to rename Other to Data!

Damiano Testa (May 18 2024 at 14:29):

... and finally, screening out only unsafe, but keeping all blacklisted declarations, in std, core or mathlib:

Theorems 262537
Types 3431
Predicates 8053
Data 106992

Damiano Testa (May 18 2024 at 14:47):

Here is what it seems that the discussion is converging to:

Damiano Testa (May 18 2024 at 14:47):


Weekly stats (2024-05-11 2024-05-18)

Reference commits: old 311c57e763, new 53364710b4.

844 files changed, 19661 insertions(+), 11032 deletions(-), 8629 total(insertions-deletions)

Declarations:

Type New Old Change
Theorems 168051 167286 765
Types 3325 3317 8
Predicates 4150 4117 33
Data 60106 59820 286

Take also a look at the Mathlib stats page.

Damiano Testa (May 18 2024 at 14:49):

Any further comments/additions? Would having a total for each column of the table be better? Worse?

Patrick Massot (May 18 2024 at 14:49):

All those numbers are rather difficult to read because there is no separation between groups of three digits.

Damiano Testa (May 18 2024 at 14:50):

Patrick Massot said:

All those numbers are rather difficult to read because there is no separation between groups of three digits.

Ah, let me take care of that!

Yaël Dillies (May 18 2024 at 14:51):

Also I think you should emphasize that theorems are to predicates what data is to types. What about ordering the rows as "Theorems, Data, Predicates, Types"?

Yaël Dillies (May 18 2024 at 14:52):

Damiano Testa said:

Would having a total for each column of the table be better?

I think it does not really matter since the different quantities are not really commensurable

Richard Osborn (May 18 2024 at 14:57):

Perhaps it would be better to have Type | New | Change | Percentage. It's not particularly meaningful to compare a number like 168051 to 167286.

Damiano Testa (May 18 2024 at 14:58):

With thousands separator and Yaël reordered:

Damiano Testa (May 18 2024 at 14:58):


Weekly stats (2024-05-11 2024-05-18)

Reference commits: old 311c57e763, new 53364710b4.

844 files changed, 19661 insertions(+), 11032 deletions(-), 8629 total(insertions-deletions)

Declarations:

Type New Old Change
Theorems 168,051 167,286 765
Data 60,106 59,820 286
Predicates 4,150 4,117 33
Types 3,325 3,317 8

Take also a look at the Mathlib stats page.

Yaël Dillies (May 18 2024 at 14:59):

That's very pretty as the bigger numbers are now on the top

Damiano Testa (May 18 2024 at 15:00):

Ok, the latest request is to have a percentage instead of/in addition to the Change column: any strong opinions?

Eric Wieser (May 18 2024 at 15:00):

Regarding the first line, I think you should link to https://github.com/leanprover-community/mathlib4/compare/311c57e763368176a19af895c8a4a6ec1bcab3cc...53364710b45b6e223ccb0a720e59c422a05dff1a

Damiano Testa (May 18 2024 at 15:00):

(If not, I will simply add another column with the percentage.)

Yaël Dillies (May 18 2024 at 15:01):

I think Richard wanted to remove the Old column instead. Personally, I think you should just add a new column

Damiano Testa (May 18 2024 at 15:02):

Oh, I misunderstood. I'll definitely add a percentage column. As to removing the Old one... don't particularly care, but maybe would keep it and try this format for a bit?

Richard Osborn (May 18 2024 at 15:02):

I don't mind keeping the old column either.

Eric Wieser (May 18 2024 at 15:02):

It would be cool to represent the change column as +X -Y; even if the matching is just by name

Patrick Massot (May 18 2024 at 15:03):

I also think the Old column is not very useful.

Richard Osborn (May 18 2024 at 15:05):

Does this track when declarations move files? Or are those considered new declarations?

Damiano Testa (May 18 2024 at 15:05):

Eric Wieser said:

It would be cool to represent the change column as +X -Y; even if the matching is just by name

For this, I would have to restructure the code a bit, since currently I am just tallying the numbers, not recording the declaration names.

Damiano Testa (May 18 2024 at 15:07):

Richard Osborn said:

Does this track when declarations move files? Or are those considered new declarations?

No, this is just tracking numbers of declarations, it is not doing anything with file names, nor even with declaration names. I loop over the declarations, classify them and maintain a tally of the types.

Richard Osborn (May 18 2024 at 15:11):

This could still be useful to track the "largest" PRs that aren't primarily organizational in nature (which touch many lines of code, but do not change the total number of declarations).

Damiano Testa (May 18 2024 at 15:14):

There is already some automation for that with the move-decls label.

Damiano Testa (May 18 2024 at 15:15):

Once the comments settle, I can look into storing HashSets of names of declarations, so that the +X -Y column could exist.

Damiano Testa (May 18 2024 at 15:15):

In the meantime:

Damiano Testa (May 18 2024 at 15:15):


Weekly stats (2024-05-11 2024-05-18)

Reference commits: old 4dffa6dc6e, new f9e2c51760.

842 files changed, 19359 insertions(+), 11033 deletions(-), 8326 total(insertions-deletions)

Declarations:

Type New Change %
Theorems 168,051 730 0.43%
Data 60,106 283 0.47%
Predicates 4,150 32 0.77%
Types 3,325 8 0.24%

Take also a look at the Mathlib stats page.

Mario Carneiro (May 18 2024 at 16:12):

I think you could collapse the new and change columns to just "168 051 (+730)"

Ruben Van de Velde (May 18 2024 at 17:20):

And then also the % column?

Damiano Testa (May 18 2024 at 17:58):

I got this to work

Damiano Testa (May 18 2024 at 17:58):


Weekly stats (2024-05-11...2024-05-18)

Reference commits: old 0e70892cf2, new 57e4f13b3a.

882 files changed, 19159 insertions(+), 11425 deletions(-), 7734 total(insertions-deletions)

Declarations:

Type New +- Change %
Theorems 168,040 +745 -80 167,375 0.40%
Data 60,303 +307 -20 60,016 0.48%
Predicates 4,150 +31 -2 4,121 0.70%
Types 6,652 +16 -0 6,636 0.24%

Take also a look at the Mathlib stats page.

Michael Stoll (May 18 2024 at 18:04):

Change = Old?

Damiano Testa (May 18 2024 at 18:05):

Yes, sorry, I had added Change back for testing, this is closer taking into account some of the previous comments:

Damiano Testa (May 18 2024 at 18:05):


Weekly stats (2024-05-11...2024-05-18)

Reference commits: old 0e70892cf2, new 57e4f13b3a.

882 files changed, 19159 insertions(+), 11425 deletions(-), 7734 total(insertions-deletions)

Declarations:

Type New +- %
Theorems 168,040 +745 -80 0.40%
Data 60,303 +307 -20 0.48%
Predicates 4,150 +31 -2 0.70%
Types 6,652 +16 -0 0.24%

Take also a look at the Mathlib stats page.

Damiano Testa (May 18 2024 at 18:06):

As suggested by Mario and Ruben, there could just be two or three columns. I do not have a preference.

Damiano Testa (May 18 2024 at 18:07):

Btw, the cache for 0e70892cf2 is corrupted.

Patrick Massot (May 18 2024 at 23:20):

The name new without old is hard to understand

Patrick Massot (May 18 2024 at 23:21):

Maybe total is better

Eric Wieser (May 19 2024 at 01:04):

Eric Wieser said:

Regarding the first line, I think you should link to https://github.com/leanprover-community/mathlib4/compare/311c57e763368176a19af895c8a4a6ec1bcab3cc...53364710b45b6e223ccb0a720e59c422a05dff1a

Damiano, it looks like you might have missed this comment

Damiano Testa (May 19 2024 at 03:28):

Patrick: I renamed Change to Total-- I agree that it is much better!
Mario: I merged Total and +-.
Eric: I totally missed your suggestion to link to compare!

Damiano Testa (May 19 2024 at 03:28):


Weekly stats (2024-05-12...2024-05-19)

Type Total %
Theorems 168,072 (+772 -86) 0.41%
Data 60,129 (+320 -19) 0.50%
Predicates 4,150 (+31 -2) 0.70%
Types 3,325 (+8 -0) 0.24%

892 files changed, 20516 insertions(+), 12696 deletions(-), 7820 total(insertions-deletions)

Reference commits: old 11471621f4, new 3015868dac.

Take also a look at the Mathlib stats page.

Enrico Borba (May 19 2024 at 07:05):

Excuse me for messaging here but what is a "Longest pole"? I've seen it mentioned a handful of times.

Bolton Bailey (May 19 2024 at 07:22):

Enrico Borba said:

Excuse me for messaging here but what is a "Longest pole"? I've seen it mentioned a handful of times.

See this topic. The longest pole is the sequence of files, each of which imports the previous, which has the longest total compile time, and which is therefore the bottleneck for project build time in the limit as the building computer becomes more parallel.

Utensil Song (May 19 2024 at 11:28):

Maybe

- 892 files changed, 20516 insertions(+), 12696 deletions(-), 7820 total(insertions-deletions)
+ 892 files changed, 7820 lines changed (+20516 -12696)

is less verbose and still clear?

Yaël Dillies (May 19 2024 at 11:29):

+ 892 files changed (+892 -0), 7820 lines changed (+20516 -12696) even?

Damiano Testa (May 19 2024 at 13:30):

Yaël, what does your (+892 -0) represent?

Yaël Dillies (May 19 2024 at 13:45):

The number of added files and the number of removed files

Damiano Testa (May 19 2024 at 13:45):

Oh, I see, so a piece of information that was not actually present in the output that I wrote, right?

Mario Carneiro (May 19 2024 at 13:48):

I think the usual git algorithm for the +- numbers would make the + number be added+changed and - is removed+changed

Damiano Testa (May 19 2024 at 13:49):

Ok, I can work with that!

Damiano Testa (May 19 2024 at 13:49):

Although, does git diff <some annotation> give me that directly?

Damiano Testa (May 19 2024 at 13:50):

(git diff --name-status gives the information with filenames, but not the tally already)

Mario Carneiro (May 19 2024 at 13:53):

I found --compact-summary but --name-status indeed looks easier to parse

Mario Carneiro (May 19 2024 at 13:54):

renames are interesting:

R087    Mathlib/Algebra/BigOperators/List/Basic.lean    Mathlib/Algebra/BigOperators/Group/List.lean

I'm not sure what the 087 means

Damiano Testa (May 19 2024 at 13:54):

Ok, for simplicity, I am going to produce counts of A+M (added and modified) and D+M (deleted and modified) and will not worry about renames and copies.

Damiano Testa (May 19 2024 at 13:55):

So, ultimately, report A-D files changed (+(A+M) -(D+M)).

Mario Carneiro (May 19 2024 at 13:57):

maybe A+D+M files changed?

Mario Carneiro (May 19 2024 at 13:57):

it's a bit weird for that number to be negative because it doesn't sound like a net value

Damiano Testa (May 19 2024 at 13:59):

Actually, git already has its own heuristic for which numbers of files changed to give: maybe I'll just use that and report (+(A+M) -(D+M)) as +-?

Damiano Testa (May 19 2024 at 14:02):

(I have a feeling that + will be very often equal to what git considers to be the number of files changed, but - may be somewhat different.)

Damiano Testa (May 19 2024 at 14:03):

E.g. for the two commits that I was using as a test, I get these numbers:

92 files changed, 1711 insertions(+), 544 deletions(-)
(+92 -85)

I suspect that the coincidence of 92 is very stable, but we'll see!

Mario Carneiro (May 19 2024 at 14:06):

hm, it seems hard to see when a deletion happens though, it would be nice to see that... sorry for waffling but maybe (+A -D ~M) would be better

Mario Carneiro (May 19 2024 at 14:06):

where a rename counts as A+D

Damiano Testa (May 19 2024 at 14:07):

Sure, this is very easy to do.

Damiano Testa (May 19 2024 at 14:09):

In the previous example, I get these numbers:

(+7 -0 ~85)

which indeed is probably a more faithful description of what happened.

Mario Carneiro (May 19 2024 at 14:13):

Is it possible to collect up this information to get a graph over time, or is the mathlib stats page already doing this?

Damiano Testa (May 19 2024 at 14:13):

The mathlib stats page has graphs plotted against time, but the information that I am extracting now is not in that page.

Mario Carneiro (May 19 2024 at 14:14):

I guess one cheap thing you could do is make sure that the weekly reports are distinctive enough that they can be harvested later via some zulip API

Damiano Testa (May 19 2024 at 14:14):

How does this look like?

Damiano Testa (May 19 2024 at 14:14):


Weekly stats (2024-05-12...2024-05-19)

Type Total %
Theorems 168,073 (+772 -90) 0.41%
Data 60,131 (+320 -19) 0.50%
Predicates 4,150 (+31 -2) 0.70%
Types 3,325 (+8 -0) 0.24%

895 files changed (+42 -11 ~839), 7725 lines changed (+20466 -12741)

Reference commits: old 3f8be61e1c, new d874bdffd1.

Take also a look at the Mathlib stats page.

Damiano Testa (May 19 2024 at 14:15):

Now that we have the + - ~, I wish git had something similar for the lines as well...

Mario Carneiro (May 19 2024 at 14:16):

I suppose you could just interpret the 7725 as ~

Mario Carneiro (May 19 2024 at 14:17):

although it seems to be just 20466-12741

Kevin Buzzard (May 19 2024 at 14:17):

I think "Data" is confusing? Well, it's confusing me right now. What's the difference between "Data" and "Types"? Is Data = terms?

Damiano Testa (May 19 2024 at 14:18):

The 7725 was computed by me and it is indeed just the difference of the (absolute values of the) + and -.

Mario Carneiro (May 19 2024 at 14:18):

Maybe they should be categorized by definition kind: theorem, def, opaque, axiom

Mario Carneiro (May 19 2024 at 14:18):

and then separate between data, predicates, types for the relevant definition kind

Damiano Testa (May 19 2024 at 14:19):

I had the names of the constructors initially, but it was suggested that this other splitting was better... :smile:

Kevin Buzzard (May 19 2024 at 14:19):

I just don't know what people are going to think about the statement "mathlib has 168073 theorems and 60131 datas"

Mario Carneiro (May 19 2024 at 14:19):

I think someone said above this is not for external consumption

Damiano Testa (May 19 2024 at 14:20):

For the moment, I was thinking that this report would be just in Zulip.

Mario Carneiro (May 19 2024 at 14:20):

Other stats which would be useful:

  • number of instances
  • number of simp lemmas
  • number of classes

Damiano Testa (May 19 2024 at 14:20):

It is certainly good to think about what to make public, but I was hoping to get a prototype up and running.

Damiano Testa (May 19 2024 at 14:21):

Yes, these are good suggestions. I focused on producing 1 git-based report and 1 lean-based report.

Mario Carneiro (May 19 2024 at 14:21):

Are there any other metrics people can think of which correlate with overall performance?

Damiano Testa (May 19 2024 at 14:21):

Once this works, I can definitely add more statistics.

Mario Carneiro (May 19 2024 at 14:23):

  • number of linters
  • number of environment extensions

Damiano Testa (May 19 2024 at 14:24):

Ok, I'll look into how to make a bot post on Zulip the statistics above once per week and I'll make a PR.

Damiano Testa (May 19 2024 at 14:24):

After that, we can see what else to add and what would be good "external" statistics.

Damiano Testa (May 19 2024 at 14:28):

Should the bot post on this thread/topic?

Damiano Testa (May 19 2024 at 14:29):

For testing, I may make it post a few times, just to get a feeling for how these things work.

Mario Carneiro (May 19 2024 at 14:30):

It should be in #mathlib4 stream, with a title like "Mathlib weekly change report"

Yaël Dillies (May 19 2024 at 14:41):

Kevin Buzzard said:

I think "Data" is confusing? Well, it's confusing me right now. What's the difference between "Data" and "Types"? Is Data = terms?

"Definitions", maybe?

Eric Wieser (May 19 2024 at 15:16):

"types" is things like Nat, "data" is things like Nat.factorial.

Mario Carneiro (May 19 2024 at 15:19):

another possible subdivision of the "data def" category is noncomputable/computable

Mario Carneiro (May 19 2024 at 15:20):

lemma/theorem would also be a great split, although I think we don't have the metadata needed to differentiate

Mario Carneiro (May 19 2024 at 15:21):

maybe documented/undocumented theorem

Mario Carneiro (May 19 2024 at 15:21):

actually % documented would be useful on every category

Damiano Testa (May 19 2024 at 15:22):

Mario Carneiro said:

lemma/theorem would also be a great split, although I think we don't have the metadata needed to differentiate

I was planning on doing that later. My plan was to run lake build on a prepped version of mathlib that would log things like this.

Damiano Testa (May 19 2024 at 15:22):

Mario Carneiro said:

actually % documented would be useful on every category

This was one of the suggestions above, I think.

Eric Wieser (May 19 2024 at 15:22):

The reason I suggested we avoid counting theorem is that the choice of what's a theorem and what's part of the definition (ie a projection) is arbitrary, and not interesting mathematically

Mario Carneiro (May 19 2024 at 15:22):

I don't think we should just restrict to mathematically interesting distinctions

Damiano Testa (May 19 2024 at 15:22):

(Note that in the doc-strings for the report, I am also putting in all the suggestions that came up.)

Mario Carneiro (May 19 2024 at 15:24):

I think the main things we want to get out of this relate to proof engineering/maintenance aspects, which is why my suggestions relate to either performance or other code quality indicators

Damiano Testa (May 19 2024 at 15:30):

#13038

Damiano Testa (May 19 2024 at 15:32):

Mario Carneiro said:

I think the main things we want to get out of this relate to proof engineering/maintenance aspects, which is why my suggestions relate to either performance or other code quality indicators

I really like your suggestions!

I am hoping that this current version will help setting up some infrastructure with some easy statistics, before we expand and actually use it for structural reports.

Utensil Song (May 20 2024 at 04:05):

Maybe "Data" is better renamed to "Definition"? It's easier to comprehend, and would it be inaccurate?

Damiano Testa (May 20 2024 at 04:40):

I wonder whether calling it Definition would suggest that it is a count of defs, rather than what it is actually counting.

Mario Carneiro (May 20 2024 at 04:52):

I think it should be a count of defs, subdivided accordingly

Mario Carneiro (May 20 2024 at 04:52):

theorems can't be Data anyway

Damiano Testa (May 20 2024 at 05:11):

Mario Carneiro said:

I think it should be a count of defs, subdivided accordingly

"subdivided accordingly" means that if the ConstantInfo is a defnInfo, then you want so split further into Def Type, Def Predicate and Def Data? What should be the labelling for defs that are in Prop? With the current scheme, everything in Prop is a Theorem.

Damiano Testa (May 20 2024 at 05:13):

(The initial split was precisely by classifying declarations based on their ConstantInfo constructor, so it seems that we are going back to that.)

Damiano Testa (May 20 2024 at 05:37):

What about

  • thmInfo counts as Theorem
  • defnInfo whose type is Prop counts as Definition (Prop)
  • defnInfo whose type-after-binders is Type counts as Definition (Type)
  • defnInfo whose type-after-binders is Sort counts as Definition (Predicate)
  • defnInfo whose type-after-binders is something else counts as Definition (Data)

I suspect that this is close to what Mario was suggesting. This would ignore other types of ConstantInfo, but maybe this is ok?

Yaël Dillies (May 20 2024 at 06:27):

Damiano Testa said:

With the current scheme, everything in Prop is a Theorem.

That's wrong, right? Everything whose type has type Prop is a Theorem.

Damiano Testa (May 20 2024 at 06:29):

Yaël, yes, this is the usual isProp vs is*A*Prop slip! :man_facepalming:

Damiano Testa (May 20 2024 at 06:29):

(However, I think that the code is doing it right, since I access the type of a declaration, not the value.)

Damiano Testa (May 20 2024 at 10:26):

I am testing the split suggested above: there may be a post in the other stream soon with the new format... hopefully!

Mario Carneiro (May 20 2024 at 18:23):

I think you can throw all other constantinfo into "other"

Mario Carneiro (May 20 2024 at 18:24):

well, inductives should be a category

Eric Wieser (May 20 2024 at 21:01):

Mario, what's your argument for why we care about the type of Constant info? My argument for why we don't is that those are counting implementation details, and what we actually want to measure is API surface. The API surface of a one-field structure is the same as a type synonym, but the constantinfos are very different.

Eric Wieser (May 20 2024 at 21:03):

Mario Carneiro said:

theorems can't be Data anyway

Regarding the converse: are Prop instances still defs, or is that fixed in Lean 4.8.0?

Mario Carneiro (May 20 2024 at 21:11):

The API surface of a one-field structure is the same as a type synonym

I disagree about this, a one-field structure adds to/from functions and several generated declarations

Mario Carneiro (May 20 2024 at 21:12):

but also, in line with my point about gathering better data up-front, I think it's better to have the information than not, if we can easily collect it, and if the more coarse metrics are also there then you can ignore the fine-grained metrics

Mario Carneiro (May 20 2024 at 21:13):

It's difficult to guess in advance which metrics are going to be important until we start collecting the data

Mario Carneiro (May 20 2024 at 21:14):

Eric Wieser said:

Mario Carneiro said:

theorems can't be Data anyway

Regarding the converse: are Prop instances still defs, or is that fixed in Lean 4.8.0?

No, proofs can be def but data can't be theorem


Last updated: May 02 2025 at 03:31 UTC