Zulip Chat Archive

Stream: condensed mathematics

Topic: how much mathlib in LTE?

Kevin Buzzard (Sep 09 2022 at 23:14):

What percentage of mathlib ends up imported in a proof of the main theorem of LTE? Is there an easy way to answer this? One measure of "how big is LTE" is "about 90K lines of code" and another measure is "about 90K lines of code and a one million line maths library on top". Knowing how much mathlib is imported will give a third answer which in some sense is more representative.

Adam Topaz (Sep 10 2022 at 00:41):

That's a good question. I think with the Radon stuff in the examples file we end up importing a big chunk of mathlib's functional analysis,, although it doesn't get imported for the actual challenge

Adam Topaz (Sep 10 2022 at 00:43):

My guess would be 75% of mathlib

Riccardo Brasca (Sep 10 2022 at 14:47):

I don't know in terms of lines, but using leancrawler we can count the number of theorems/definitions in mathlib

Yaël Dillies (Sep 10 2022 at 15:11):

I assume we could proxy the number of lines by the number of files

Riccardo Brasca (Sep 10 2022 at 15:45):

I don't have time during the weekend, but I can easily enough provide a list of things used in the LTE

Johan Commelin (Sep 12 2022 at 09:25):

@Riccardo Brasca If you want to create some precise stats, that would be nice.

Riccardo Brasca (Sep 12 2022 at 11:33):

I am running leancrawler, then gephi can probably produce all the stats we want

Riccardo Brasca (Sep 12 2022 at 13:28):

The full graph leading to liquid_tensor_experiment has 20172 nodes and 466568 edges. There are 15201 declarations from mathlib and 1010 from core, giving a total of 16211 declarations already in Lean, so 3961 are only in LTE. Here is a csv file with the full list, including other data.

Riccardo Brasca (Sep 12 2022 at 13:32):

Unsurprisingly, the most important node is docs#eq

Riccardo Brasca (Sep 12 2022 at 15:29):

Here is something that really surprises me, and it makes me thinking I've done something wrong.

Doing the same as above without selecting only the connected component of liquid_tensor_experiment gives 99729 nodes and 23991227 edges. 4366 are from core and 86996 are from mathlib, so 8367 are only in the LTE (note that I did leancrawler challenge, so files that are completely useless to that should be ignored). This means that we wrote 4406 declarations that are never used to prove liquid_tensor_experiment?! :dizzy:

Yaël Dillies (Sep 12 2022 at 15:32):

If you counted autogenerated declarations, that wouldn't surprise me. Many of them are never used, LTE or not.

Riccardo Brasca (Sep 12 2022 at 15:32):

data.yaml and LTE_data if someone wants to play with them (the first one is the output of lean --run crawl.lean after leancrawler challenge the second one is by lib.dump).

Riccardo Brasca (Sep 12 2022 at 15:34):

Ah, autogenerated declarations, of course, this makes sense.

Riccardo Brasca (Sep 12 2022 at 15:39):

I don't see any way of filtering out these declarations unfortunately :(

Johan Commelin (Sep 12 2022 at 15:40):

Right, I guess @[simps] might generate for example 4 lemmas of which only 2 are used.

Riccardo Brasca (Sep 12 2022 at 15:46):

I can produce the list of the useless declarations. Then a stupid workaround to find those we really wrote would be to grep the names in the LTE folder. Do you think it is an interesting information to have?

Patrick Massot (Sep 12 2022 at 15:54):

Personally I would be interested to know if there is a huge bug in leancrawler that explains those numbers.

Riccardo Brasca (Sep 12 2022 at 15:55):

It's very possible I did something wrong, this is why I posted the data. In any case I am trying to find an unused lemma and check if I can really remove it.

Johan Commelin (Sep 12 2022 at 16:43):

I'm quite certain you can find ~ 30 lemmas with a -- move me comment that haven't been used.

Johan Commelin (Sep 12 2022 at 16:43):

I kept those, because they are generally good mathlib lemmas. Even if they weren't used in LTE after a refactor.

Riccardo Brasca (Sep 12 2022 at 16:50):

Here is a sample of apparently useless declarations

Ab.is_limit_ext
Ab.limit_cone'_cone
Ab.limit_cone'_is_limit
Ab.ulift_map_apply
Ab.ulift_obj
AddCommGroup.AB5
AddCommGroup.Ab.category_theory.limits.has_colimits_of_size
AddCommGroup.Ab.category_theory.limits.has_limits_of_size
AddCommGroup.add_equiv_of_iso
AddCommGroup.add_subgroup.equiv_top
AddCommGroup.add_subgroup.equiv_top_apply_coe
AddCommGroup.add_subgroup.equiv_top_symm_apply
AddCommGroup.cocone
AddCommGroup.coe_of'
AddCommGroup.colimit_comparison
AddCommGroup.comp_helper
AddCommGroup.coproduct_cocone_of_basis
AddCommGroup.coproduct_cocone_of_basis_is_colimit
AddCommGroup.diagram

Here the full list.

Adam Topaz (Sep 12 2022 at 16:52):

Oh, some of those were things that made it to mathlib but never deleted from LTE.

Adam Topaz (Sep 12 2022 at 16:52):

E.g. Ab.is_limit_ext is a special case of docs#category_theory.limits.concrete.is_limit_ext

Adam Topaz (Sep 12 2022 at 16:56):

It's not quite correct to say that AddCommGroup.diagram isn't used, because I'm quite sure AddCommGorup.is_colimit_cocone is used, and this is a statement about a cocone over AddcommGroup.diagram,.

Riccardo Brasca (Sep 12 2022 at 16:59):

let me see if it is possible to compile the project without AddCommGroup.diagram

Johan Commelin (Sep 12 2022 at 16:59):

But it claims that AddCommGroup.is_colimit_cocone is also not used.

Adam Topaz (Sep 12 2022 at 17:00):

I think they're used in one of the proofs in the same file

Johan Commelin (Sep 12 2022 at 17:01):

It is only used in AddCommGroup.tensor_obj_map_preserves_mono, which is also in the "unused" list.

Riccardo Brasca (Sep 12 2022 at 17:01):

I am not claiming these are never used.

Adam Topaz (Sep 12 2022 at 17:01):

https://github.com/leanprover-community/lean-liquid/blob/17f13b74d4bc045e82465f86f5705a3f6e1994e3/src/for_mathlib/AddCommGroup.lean#L329

Riccardo Brasca (Sep 12 2022 at 17:01):

Exactly, I am saying these are not in the connected component of liquid_tensor_experiment

Adam Topaz (Sep 12 2022 at 17:02):

But that lemma is definitely used in the proof of the challenge

Adam Topaz (Sep 12 2022 at 17:02):

I'm confused.

Riccardo Brasca (Sep 12 2022 at 17:03):

It's maybe my fault, so you claim that tensor_obj_map_preserves_mono is used to prove liquid_tensor_experiment?

Johan Commelin (Sep 12 2022 at 17:03):

Yes. At least the math version of that statement.

Johan Commelin (Sep 12 2022 at 17:03):

And I don't know of other lean-similar statements that might have been used instead.

Riccardo Brasca (Sep 12 2022 at 17:03):

OK, let's see.

Adam Topaz (Sep 12 2022 at 17:04):

I'm claiming that is_iso_of_preserves_of_is_tensor_unit is used in liquid_tensor_experiment, which uses is_iso_of_preserves which uses AddCommGroup.is_colimit_cocone which needs AddCommGroup.diagram, etc.

Johan Commelin (Sep 12 2022 at 17:10):

Something is off. There are a lot of names on the list that I'm pretty sure are used.

Johan Commelin (Sep 12 2022 at 17:11):

A whole bunch of thm95 stuff, starting at: https://gist.github.com/riccardobrasca/df3a90f2326b2020633ad0fd6c1c7ae8#file-useless-txt-L4165

Johan Commelin (Sep 12 2022 at 17:11):

Certainly thm95.col_exact is used.

Adam Topaz (Sep 12 2022 at 17:14):

Does this count check whether a declaration is used in the proof of a prop?

Riccardo Brasca (Sep 12 2022 at 17:15):

I guess this is the whole point of leancrawler

Riccardo Brasca (Sep 12 2022 at 17:15):

But thm95.col_exact is very close to the end, let me experiment with that

Andrew Yang (Sep 12 2022 at 17:15):

Can it see instance arguments?

Johan Commelin (Sep 12 2022 at 17:16):

Very close to the end of the proof of thm95. But the whole 2nd part of the project is still built on top of that.

Johan Commelin (Sep 12 2022 at 17:16):

I guess it is still "relatively" close to the end though.

Johan Commelin (Sep 12 2022 at 17:16):

But I would expect that you still need to recompile 10 files or so.

Andrew Yang (Sep 12 2022 at 17:19):

Note that homology_bd_eval is definitely used, and not on the list, but its construction is (as_iso (tensor_to_homology BD M i)).symm, and the instance is_iso (tensor_to_homology BD M i) is on the list.

Riccardo Brasca (Sep 12 2022 at 17:55):

@Patrick Massot is it possible that leancrawler makes a mistake if a lemma is used in a definition? In particular, I see that NSC appears in the connected component leading to liquid_tensor_experiment, but thm95.col_exact, that it is used to give the definition, does not.

Riccardo Brasca (Sep 12 2022 at 18:51):

Indeed, if I do lib['NSC'], I see col_exact := _, instead of

col_exact :=
  begin
 ...
    refine thm95.col_exact BD.data κ r r' V Λ M N j (lem98.d Λ N) (k₁_sqrt κ' m) m _ _
      (k₁ κ' m) (K₁ r r' BD κ' m) (le_of_eq _) _ _ (c₀ r r' BD κ κ' m Λ) ⟨le_rfl⟩ infer_instance ⟨le_rfl⟩,
...
  end

so the need for thm95.col_exact seems to be forgotten.

Johan Commelin (Sep 13 2022 at 09:57):

Could it have to do something with this flag: https://github.com/leanprover-community/leancrawler/blob/master/leancrawler/crawler.py#L275 ?

Riccardo Brasca (Sep 13 2022 at 10:04):

I am having lunch, then I will try to produce a MWE

Riccardo Brasca (Sep 13 2022 at 10:46):

I've tried with something as simple as

import init.data.nat.lemmas

namespace crawler

structure add (n m : ℕ) := ( add : n + m = m + n)

def test (n m : ℕ) : crawler.add n m :=
{ add :=
  begin
    exact nat.add_comm n m
  end }

end crawler

but then lib['crawler.test'] shows

LeanDecl(name='crawler.test', filename='/home/brasca/lean/test_crawl/src/test_crawl.lean',
line_nb=7, kind='definition', is_inductive=False, is_structure=False, is_structure_field=False,
is_class=False, is_instance=False, is_recursor=False, is_constructor=False, Type='Π (n m : ℕ),
crawler.add n m', type_uses_proofs=set(), type_uses_others={'crawler.add', 'nat'}, type_size=0,
type_dedup_size=0, type_pp_size=0, Value='λ (n m : ℕ), {add := _}', value_uses_proofs={'nat.add_comm'},
value_uses_others={'crawler.add.mk', 'nat'}, value_size=0, value_dedup_size=0, value_pp_size=0,
target_class=None, parent=None, fields=[])

and nat.add appears as expected in the field value_uses_proofs. On the other way, the full output of lib['NSC'] is

...
value_uses_proofs={'NSC._proof_15', 'NSC._proof_18', 'NSC._proof_10', 'NSC._proof_16', 'NSC._proof_12',
'NSC._proof_11', 'NSC._proof_13', 'NSC._proof_14', 'NSC._proof_19', 'NSC._proof_20',
"thm95.universal_constants.one_le_k'", 'NSC_row_exact', 'NSC._proof_17'}
...

Riccardo Brasca (Sep 13 2022 at 10:46):

(deleted)

Riccardo Brasca (Sep 13 2022 at 10:56):

I think col_exact is hidden in one of these NSC._proof_

Riccardo Brasca (Sep 13 2022 at 11:11):

OK, I've a simple example.

import init.data.nat.lemmas

namespace crawler

structure comm :=
( x : ℕ )
( y : ℕ )
( comm : x + y = y + x)

def test : crawler.comm :=
{ x := 2,
  y := 5,
  comm := nat.add_comm _ _ }

lemma stupid : test.1 + test.2 = test.2 + test.1 := test.comm

end crawler

Riccardo Brasca (Sep 13 2022 at 11:12):

lib['crawler.stupid'] gives

LeanDecl(name='crawler.stupid', filename='/home/brasca/lean/test_crawl/src/test_crawl.lean', line_nb=15, kind='lemma', is_inductive=False,
is_structure=False, is_structure_field=False, is_class=False, is_instance=False, is_recursor=False, is_constructor=False,
Type='crawler.test.x + crawler.test.y = crawler.test.y + crawler.test.x', type_uses_proofs=set(),
type_uses_others={'crawler.comm.x', 'eq', 'crawler.comm.y', 'has_add.add', 'crawler.test', 'nat.has_add', 'nat'}, type_size=0,
type_dedup_size=0, type_pp_size=0, Value='crawler.test.comm', value_uses_proofs={'crawler.comm.comm'},
value_uses_others={'crawler.test'}, value_size=0, value_dedup_size=0, value_pp_size=0,
target_class=None, parent=None, fields=[])

Riccardo Brasca (Sep 13 2022 at 11:12):

nat.add_comm is never mentioned.

Riccardo Brasca (Sep 14 2022 at 09:36):

I've the result from the new version of leancrawler. As expected, the full graph has the same number of nodes (well, four more, but I added the statement of the challenge, so maybe it is because of this), 99733.

On the other hand, the connected component of liquid_tensor_experiment has now 24780 nodes instead of 20172.

Riccardo Brasca (Sep 14 2022 at 09:38):

The statement has 13873 nodes.

Johan Commelin (Sep 14 2022 at 09:40):

Wow, that's still a lot!

Johan Commelin (Sep 14 2022 at 09:40):

Can you split that into defs vs lemmas

Riccardo Brasca (Sep 14 2022 at 09:40):

Yes, I have all of that. Give me a moment to list all the stats I have.

Riccardo Brasca (Sep 14 2022 at 09:51):

Here they are:

full graph (everything imported by challenge.lean): 99733 nodes. 16010 definitions, 71286 lemmas and 10454 instances. 86996 in mathlib and 4366 in core.
challenge: 24780 nodes. 5012 definitions, 15390 lemmas and 3571 instances. 18173 in mathlib and 1082 in core.
statement: 13873 nodes. 2512 definitions, 8502 lemmas and 2180 instances. 11765 in mathlib and 988 in core.

Riccardo Brasca (Sep 14 2022 at 10:06):

first_target : 15223 nodes. 2274 definitions, 10125 lemmas and 2217 instances. 12693 in mathlib and 10128 in core.

Johan Commelin (Sep 14 2022 at 10:30):

Thanks! 2512 definitions going into the statement!! That's a lot.

I'm still confused about the 99733 vs 24780. That delta seems very large to me.

Riccardo Brasca (Sep 14 2022 at 10:41):

I am going to post the list of unused stuff

Riccardo Brasca (Sep 14 2022 at 10:57):

There are 2846 declarations in the full graph that are not in mathlib or core, and are not used in the challenge graph (before we had more than 4000 of them). Here is the list.

Riccardo Brasca (Sep 14 2022 at 11:02):

All the examples above are now used, as expected. Note that first_target is in the useless list, but if we want to also have this literal statement it is enough to include

first_target
thm95'
thm95''

so the difference is negligible.

Johan Commelin (Sep 14 2022 at 11:03):

Right, it's actually expected that first_target is not a dependency of liquid_tensor_experiment. I ended up using a "homotopic" variant called thm94_explicit or something like that.

Johan Commelin (Sep 14 2022 at 11:05):

Ok, I see a lot of bounded_derived_category on that list. Makes sense, because we ended up only working with the homotopy category.

Riccardo Brasca (Sep 14 2022 at 11:23):

I've removed here all the namespaces. It would be nice to remove the lines that never appear in src/, because these declarations are autogenerated. I am quite bad in writing scripts, does someone want to do it?

Riccardo Brasca (Sep 14 2022 at 11:44):

I've done it. The list is now down to 1722. Of those 490 appear exactly once. Unless these are instances, I think it can be removed and the project should compile without problems.

Johan Commelin (Sep 14 2022 at 11:58):

Ok, 1722 sounds like a more reasonable number to me.

Riccardo Brasca (Sep 14 2022 at 12:03):

Grepping the whole word it's down to 1666

Riccardo Brasca (Sep 14 2022 at 12:08):

Here is the list. It is the list of declarations that appear in the LTE but are not needed to get to liquid_tensor_experiment, with the number of times the name appears in src/*. Note that I searched for the name, so for example cochain_complex.hom.cons.f appears more than 10000 times because f does so, in particular it is possible that some of these declarations are still autogenerated. Also it's possible that some useless instance can be missing, since the autogenerated name never appears, not even once.

Riccardo Brasca (Sep 14 2022 at 12:11):

If I am not wrong all these can be removed safely, meaning that the project should compile, without requiring any further modification.

Johan Commelin (Sep 14 2022 at 12:20):

We could try it out on a branch.

Johan Commelin (Sep 14 2022 at 12:20):

A cheaper optimization is checking which files are not in the import graph closure of challenge.lean and examples/*.lean.

Johan Commelin (Sep 14 2022 at 12:20):

There might be several files that can be removed entirely.

Riccardo Brasca (Sep 14 2022 at 12:24):

Let me try removing some of these declarations.

Riccardo Brasca (Sep 14 2022 at 12:26):

ouf, it considered (add_subgroup.equiv_top _) not the same as add_subgroup.equiv_top because of the (, so it is already wrong.

Riccardo Brasca (Sep 14 2022 at 12:27):

Let's ignore the whole word, there will be just a bunch of false positive.

Riccardo Brasca (Sep 14 2022 at 12:33):

In any case I am considering all the examples useless :(

Riccardo Brasca (Sep 14 2022 at 12:47):

I just realized that what I am doing is stupid, all the information we want are in the graph, it doesn't make sense to use grep or whatever. A never used lemma is just a node with out-degree (maybe in-degree) equal to 0.

Riccardo Brasca (Sep 16 2022 at 15:37):

Johan Commelin said:

I'm still confused about the 99733 vs 24780. That delta seems very large to me.

This is something I don't understand. The number of useless declaration in the project seems reasonable, but I doubt very much they need more than 70000 mathlib declarations. Here is the list.

Andrew Yang (Sep 16 2022 at 15:39):

Does the 99733 count the definitions declared in some file included by challenge.lean but never used in any (unused or not) definition in LTE?

Riccardo Brasca (Sep 16 2022 at 15:41):

Ah this can be the explanation! I think leancrawler list all declarations in all files (recursively) imported. So if we only use one declaration from a large file we get all of them.

Riccardo Brasca (Sep 16 2022 at 15:43):

So 99733 is a totally meaningless number, it depends on the files structure of mathlib.

Riccardo Brasca (Sep 19 2022 at 13:05):

final.png I've redone the graph using the fixed leancrawler. It's more dense, and liquid_tensor_experiment is now a theorem in analysis, even if it's more in the pseudonormed group region.

Adam Topaz (Sep 19 2022 at 16:54):

The discrete stuff makes this look like a :unicorn:

Adam Topaz (Sep 19 2022 at 16:54):

Sounds about right!

Adam Topaz (Sep 19 2022 at 16:55):

Especially since category theory and homological algebra is the mouth which will eat all of mathematics

Kevin Buzzard (Sep 19 2022 at 17:00):

One thing which really excited me about this graph is that it is finally something we can show @Mario Carneiro and say "there, it's taken years but you can finally see category theory interacting with other mathematics". For many years category theory was essentially completely isolated from the rest of mathlib, and Mario raised the issue about whether it was actually ever useful; I assured him that it was just a matter of time. I don't think there was any category theory at all in the perfectoid project and I strongly doubt it's used in sphere eversion either, but sometimes it plays a crucial role.

Johan Commelin (Sep 19 2022 at 17:02):

If you squint, it looks a bit like https://en.wikipedia.org/wiki/Antarctica#/media/File:Antarctica_6400px_from_Blue_Marble.jpg

Adam Topaz (Sep 19 2022 at 17:10):

Should we make a page similar to https://leanprover-community.github.io/lean-perfectoid-spaces/ and include this graph?

Adam Topaz (Sep 19 2022 at 17:11):

We could of course add it to README.md as well

Riccardo Brasca (Sep 19 2022 at 17:17):

Let me post a version with the y of category theory in the image. I am with the phone, I will do it later

Adam Topaz (Sep 19 2022 at 17:19):

Maybe I should mention that the perfectoid page linked above was one of the first things that got me personally interested in lean. It's a bit friendlier than a github readme to an outsider.

Johan Commelin (Sep 19 2022 at 17:24):

I'm completely fine with having a dedicated landing page that links to blueprint/github/press (and papers, once we have those). Sounds like a good idea to me.

Kevin Buzzard (Sep 19 2022 at 17:25):

It says something about a person if a picture of a graph with tens of thousands of nodes/edges is "a bit friendler" than a text file ;-)

Kevin Buzzard (Sep 19 2022 at 17:25):

Oh sorry I misunderstood! You mean the page in general -- yes I agree!

Kevin Buzzard (Sep 19 2022 at 17:27):

@Riccardo Brasca if you want to tweet the image (once you're happy with it) I'll retweet.

Adam Topaz (Sep 19 2022 at 18:04):

Yeah I was talking about the webpage vs. the readme file

Riccardo Brasca (Sep 20 2022 at 07:37):

final.png Here is the fixed image. @Kevin Buzzard feel free to tweet it!

Kevin Buzzard (Sep 20 2022 at 07:42):

Can you make a bigger version?

Riccardo Brasca (Sep 20 2022 at 07:51):

I am not sure how to do it, but if someone knows how to tell gephi to produce an higher quality image here is the file

Last updated: May 02 2025 at 03:31 UTC