Zulip Chat Archive

Stream: mathlib4

Topic: Metamathematics: Theorems by Domain Areas


Jack Heseltine (Jul 10 2023 at 02:12):

Hi, I am relatively new to Mathlib and trying to get a sense of all the theorems in the library, by domain. I know it was done in 2020 as part of
https://writings.stephenwolfram.com/2018/03/roaring-into-2018-with-another-big-release-launching-version-11-3-of-the-wolfram-language-mathematica/#proofs
where these cloud links give the exact data
https://www.wolframcloud.com/obj/sw-blog/EuclidBlog/Programs/Programs-04.wl
Also has a wonderful dependency graph of theorems!
https://www.wolframcloud.com/obj/sw-blog/EuclidBlog/Programs/Programs-05.wl
Would anyone know how to go about this? I would like to see how Mathlib has developed, i.e. what theorems by category are there (and ideally, how do the interrelate). Would web/directory scraping be the way to do it?

Scott Morrison (Jul 10 2023 at 02:15):

I doubt looking at those mathematica notebooks is particularly helpful. There are much better metaprogramming tools available in Lean 4. Perhaps you could explain in more detail what you are looking to see?

Scott Morrison (Jul 10 2023 at 02:16):

One useful tool (still only in a PR, so you'd need to check out a copy of #5513) is lake exe graph, which draws visualisation of the import graph.

Scott Morrison (Jul 10 2023 at 02:17):

For the whole library there is just too much information there (let alone any visualisation at theorem level of granularity), but the --to and --from flags let you zoom in to parts of the import hierarchy.

Scott Morrison (Jul 10 2023 at 02:18):

Also, remember that that the directory structure in Mathlib is fairly ad-hoc and to a large extent based on historical accident, so be very skeptical of any "overview" based on analyzing file names or namespaces!

Scott Morrison (Jul 10 2023 at 02:21):

Overall, I would say Mathlib is already too big to "get a sense of all the theorems" in any meaningful! I'm sure every expert here is regularly surprised to find that something is, or is not, in the library. (Within specific areas this is not true.)

Jack Heseltine (Jul 10 2023 at 03:27):

Hi, looking into the tool! Thanks. The context is, I would for example like to situate the theorem "nat.prime.dvd_of_dvd_pow" (how do I find the name of this? I posted in the new members forum as well) in the way it is done in that kind of import graph: Alternatively I would like to look at that area of Mathlib (nat.prime), which would also be helpful.

Jack Heseltine (Jul 10 2023 at 03:29):

I am comparing to a similar visualization done using this tool: https://github.com/minchaowu/mm-lean So this has a bit of a Mathematica slant, what I am trying to do, that is why I reference the notebooks.

Yury G. Kudryashov (Jul 10 2023 at 03:34):

how do I find the name of this?

We try to give lemmas very predictable names based on their types. It would be nice to have something like https://hoogle.haskell.org/ in the future though.

Yury G. Kudryashov (Jul 10 2023 at 03:36):

With ~100K theorems (slightly less if you grep for ^theorem, more if you count autogenerated theorems; we have 130K #aligns) it's hard to find something using graph visualization.

Jack Heseltine (Jul 10 2023 at 20:23):

Yury G. Kudryashov said:

With ~100K theorems (slightly less if you grep for ^theorem, more if you count autogenerated theorems; we have 130K #aligns) it's hard to find something using graph visualization.

Can I follow up on this? The 100K, how might I get a number like that by grepping or similar? I am running a regex python script on my download of https://github.com/leanprover-community/mathlib and particularly the src folder but getting a much lower number, 2-3K for "lemma", 1-2K for "theorem".

Jack Heseltine (Jul 10 2023 at 20:24):

My script is:

def count_files(path, regex):
    regObj = re.compile(regex)
    ct = 0
    for root, dirs, fnames in os.walk(path):
        for fname in fnames:
            file_path = os.path.join(root, fname)
            with open(file_path, 'r', errors='ignore') as f:
                content = f.read()
                if regObj.search(content):
                    ct += 1
    return ct

print(count_files('...\\mathlib\\src\\.', r'lemma'))

Patrick Massot (Jul 10 2023 at 20:24):

Lean has very strong introspection capabilities that allow a much more precise count.

Jack Heseltine (Jul 10 2023 at 20:24):

Patrick Massot said:

Lean has very strong introspection capabilities that allow a much more precise count.

Oh, could you tell me more/refer me to the docs? Sorry, still learning.

Reid Barton (Jul 10 2023 at 20:25):

You seem to be counting files that contain lemmas, not lemmas themselves.

Reid Barton (Jul 10 2023 at 20:26):

A little hard to tell: see #backticks for formatting tips

Jack Heseltine (Jul 10 2023 at 20:28):

Reid Barton said:

A little hard to tell: see #backticks for formatting tips

fixed

Damiano Testa (Jul 10 2023 at 20:29):

For a very naïve count

git ls-files "*.lean" | xargs grep "^theorem" | wc -l

yields 91728 hits.

Yury G. Kudryashov (Jul 10 2023 at 20:35):

Or

$ git grep '^\(@\[[^]]*\]\)\? *\(theorem\|lemma\)' | wc -l
93731
$ git grep '^#align' '*.lean' | wc -l
130828

Patrick Massot (Jul 10 2023 at 20:38):

('...\\mathlib\\src\\.', r'lemma')

I suspect you don't fully understand what r'' means in python.

Patrick Massot (Jul 10 2023 at 20:38):

You're using it where it does nothing and don't use it where it would be useful.

James Gallicchio (Jul 10 2023 at 20:39):

Jack Heseltine said:

Oh, could you tell me more/refer me to the docs? Sorry, still learning.

A good place to start is the Lean4 metaprogramming book: https://github.com/arthurpaulino/lean4-metaprogramming-book

Patrick Massot (Jul 10 2023 at 20:39):

You may want to google python raw string

Jack Heseltine (Jul 10 2023 at 21:10):

Yury G. Kudryashov said:

Or

$ git grep '^\(@\[[^]]*\]\)\? *\(theorem\|lemma\)' | wc -l
93731
$ git grep '^#align' '*.lean' | wc -l
130828

Thanks Yury, Damiano and Patrick, still trying to reproduce in Python, taking all the feedback into account. I am getting close (slightly over), in case anyone has a Python suggestion (comparing to Yury's "$ git grep '^\(@\[[^]]*\]\)\? *\(theorem\|lemma\)' | wc -l" which gives 93731):

def count_occurrences(path, regex):
    regObj = re.compile(regex)
    count = 0
    for root, dirs, fnames in os.walk(path):
        for fname in fnames:
            file_path = os.path.join(root, fname)
            with open(file_path, 'r', errors='ignore') as f:
                content = f.read()
                matches = regObj.findall(content)
                count += len(matches)
    return count

print(count_occurrences('C:\\..\\mathlib\\src', '(theorem|lemma)'))

I assume it is the missing caret ^ to mark the occurrence at the beginning but if I do "print(count_occurrences('C:\\..\\mathlib\\src', '(^theorem|^lemma)'))" I actually get 0 results, which has me confused at this point.

Using it the way as printed gives me 102132 at the moment.

Eric Wieser (Jul 10 2023 at 21:26):

I would strongly discourage spending any more time trying to do this counting with regex; there are lots of theorems that you will never find without implementing a full lean parser. Handily, we already have one of those!

Jack Heseltine (Jul 10 2023 at 21:30):

Eric Wieser said:

I would strongly discourage spending any more time trying to do this counting with regex; there are lots of theorems that you will never find without implementing a full lean parser. Handily, we already have one of those!

Ok, got it. I am reading the Lean programming book but would you want to go into how you would do it, Lean-natively?

Eric Wieser (Jul 10 2023 at 21:42):

Ultimately you look for the function that says "give me everything in the environment", and then filter it for things that are theorems (as opposed to definitions or axioms). In the same way as in python, you'd use dir(some_module) or globals() instead of loading in the source code.

Eric Wieser (Jul 10 2023 at 21:42):

In lean 3 it was docs3#environment.get_decls

Kyle Miller (Jul 10 2023 at 21:48):

import Mathlib

open Lean Elab

structure State where
  axioms : Nat := 0
  defs : Nat := 0
  thms : Nat := 0
  inductives : Nat := 0
  ctors : Nat := 0
  other : Nat := 0
  deriving Repr

elab "count_decls" : command => do
  let consts := ( getEnv).constants
  let update (s : State) (_ : Name) (c : ConstantInfo) : State :=
    if c.isUnsafe then s else
    match c with
    | .axiomInfo .. => {s with axioms := s.axioms + 1}
    | .defnInfo .. => {s with defs := s.defs + 1}
    | .thmInfo .. => {s with thms := s.thms + 1}
    | .inductInfo .. => {s with inductives := s.inductives + 1}
    | .ctorInfo .. => {s with ctors := s.ctors + 1}
    | _ => {s with other := s.other + 1}
  let s : State := consts.fold update {}
  logInfo s!"{repr s}"

count_decls

Kyle Miller (Jul 10 2023 at 21:51):

This gives

{ axioms := 6,
  defs := 156316,
  thms := 155485,
  inductives := 3413,
  ctors := 5010,
  other := 5455 }

(Edit: I changed it to count only declarations not marked "unsafe", which excludes only things not having to do with proving theorems.)

Eric Wieser (Jul 10 2023 at 21:53):

docs#Lean.Environment.constants

Kyle Miller (Jul 10 2023 at 22:01):

If you want to see the six defined axioms, an easy way is to change line 19 to | .axiomInfo .. => dbg_trace "axiom: {n}"; {s with axioms := s.axioms + 1}. Here they are:

axiom: sorryAx
axiom: Quot.sound
axiom: Lean.ofReduceNat
axiom: propext
axiom: Lean.ofReduceBool
axiom: Classical.choice

Eric Wieser (Jul 10 2023 at 22:11):

I'm curious how many of those defs are morally theorems (from what I remember things in Prop like docs#Subtype.property end up as defs)

Kyle Miller (Jul 10 2023 at 22:24):

Yeah, also there are probably lots of metaprogramming-generated def theorems. Here's a count that splits up the definition count into defs that are strictlyProp-valued ("defThms") and the rest of the defs. (Edit: and theorems that aren't strictly Prop-valued, "thmDefs")

code

/-
{ axioms := 6,
  defs := 89384,
  defThms := 66936,
  thms := 155474,
  thmDefs := 11,
  inductives := 3413,
  ctors := 5010,
  other := 5455 }
-/

Eric Wieser (Jul 10 2023 at 22:35):

Wow, that's almost an order of magnitude larger than I expected

Eric Wieser (Jul 10 2023 at 22:35):

I assume there are almost no thmDefs

Kyle Miller (Jul 10 2023 at 22:40):

There are 11:

thmDef: Nat.caseStrongInductionOn
thmDef: Nat.div.inductionOn
thmDef: HEq.ndrec
thmDef: WellFounded.recursion
thmDef: HEq.elim
thmDef: Nat.strongInductionOn
thmDef: Lean.Data.AC.List.two_step_induction
thmDef: Nat.div.inductionOn._unary
thmDef: HEq.ndrecOn
thmDef: Lean.Data.AC.Context.unwrap_isNeutral
thmDef: Nat.mod.inductionOn

Mario Carneiro (Jul 11 2023 at 03:45):

what are the 6 axioms? I'm guessing the 3 usual ones plus the other constants involved in defining Quot?

Yury G. Kudryashov (Jul 11 2023 at 03:58):

Kyle Miller said:

If you want to see the six defined axioms, an easy way is to change line 19 to | .axiomInfo .. => dbg_trace "axiom: {n}"; {s with axioms := s.axioms + 1}. Here they are:

axiom: sorryAx
axiom: Quot.sound
axiom: Lean.ofReduceNat
axiom: propext
axiom: Lean.ofReduceBool
axiom: Classical.choice

Mario Carneiro (Jul 11 2023 at 04:01):

oh, that makes sense


Last updated: Dec 20 2023 at 11:08 UTC