Zulip Chat Archive

Stream: general

Topic: Extracting proof dependencies

Riyaz Ahuja (Aug 28 2024 at 22:40):

Hello! I'm fairly new to lean metaprogramming and I'm looking to make a program that, when given an input theorem in a lean file will extract all dependencies (lemmas, definitions, structures, etc defined in the current file or other packages/files) and output some sort of pointer to said theorem/definition/etc (not an actual pointer, maybe just a file path + start/end coordinate in row/column form, or perhaps the file path and declaration).

Currently I'm extracting the infotree using something like how this works: https://github.com/semorrison/lean-training-data. But I'm not sure if any of the tactic_invocations have the data inside them to find where the references are.

Any advice on what datastructures I should be looking for or how lean stores and handles these references would be really helpful to help me get started on this. Thanks!

Kim Morrison (Aug 28 2024 at 23:02):

If you're happy with just knowing about the constants referenced in the proof term generated by the tactic (but not looking at actual tactic invocations), you will find life much easier using functions such as src#Lean.ConstantInfo.getUsedConstantsAsSet. It may also be helpful to read https://github.com/leanprover-community/import-graph/blob/main/ImportGraph/RequiredModules.lean

Riyaz Ahuja (Aug 28 2024 at 23:44):

As far as I can see, the outputted NameSet doesn't contain any information about which file the reference is from, or what kind of reference it is (type, lemma, etc). For example, with the function:

theorem dsimp_test (x y z : Nat) (p : Nat → Prop) (h : p (x * y))
    : p ((x + 0) * (0 + y * 1 + z * 0)) := by
  let h₁ := Nat.mul_zero
  simp_all

Outputting the getUsedConstantsAsSet we get something like:

[["OfNat","ofNat"],
["instAddNat"],
["instHAdd"],
["HAdd","hAdd"],
["True"],
["Nat","mul_zero"],
["Eq"],
["instHMul"],
["Nat","zero_add"],
["eq_true"],
["of_eq_true"],...

(truncated...)

Which matches up mostly with the proof term:

theorem dsimp_test : ∀ (x y z : Nat) (p : Nat → Prop), p (x * y) → p ((x + 0) * (0 + y * 1 + z * 0)) :=
fun x y z p h =>
  let h₁ := Nat.mul_zero;
  of_eq_true
    (Eq.trans
      (congrArg (fun x_1 => p (x * x_1))
        (congr (congrArg HAdd.hAdd (Eq.trans (congrArg (HAdd.hAdd 0) (Nat.mul_one y)) (Nat.zero_add y)))
          ((fun n => h₁ n) z)))
      (eq_true h))

However, I would like to get some more info about these references, such as what file/module Nat.mul_zero is inside. Is there any way to do this?

Adam Topaz (Aug 28 2024 at 23:56):

There is a function called getModuleFor (approximately) which gives the module name of a constant

Adam Topaz (Aug 28 2024 at 23:57):

https://leanprover-community.github.io/mathlib4_docs/ImportGraph/RequiredModules.html#Lean.Environment.getModuleFor?

Riyaz Ahuja (Aug 29 2024 at 04:10):

Thanks that's really useful! Additionally, is there anyway to extract a ConstantInfo from an InfoTree? I'm trying to figure out how to call the getUsedConstantsAsSet.

Riyaz Ahuja (Aug 29 2024 at 04:10):

Or if not from an infotree, then where do I get ConstantInfo from?

Riyaz Ahuja (Aug 29 2024 at 05:40):

It seems that it should come from an Environment, but since I'm given a ModuleName, when I compile it and get a List[CompilationStep], which I can map to a List[Environment], If I use these environments to get my ConstantInfo's, there ends up being millions of items in the associated ConstantMap, which is not reasonable. So I believe I'm using the wrong environment. The question is how do I get the correct environment(s)? Is it through the infotrees?

Kim Morrison (Aug 29 2024 at 06:06):

Millions of items seems reasonable? It's all the constants coming from imported files.

Riyaz Ahuja (Aug 29 2024 at 06:09):

Hm, then is there any way to filter to only constants in the current file? By that I mean for a specific theorem in a specific file, I want to extract out the constants inside that theorem only as a ConstantInfo. Then I can run the aforementioned functions to get the files/lemmas that this theorem depends on.

Kim Morrison (Aug 29 2024 at 06:11):

constants.map\2 rather than constants.map\1

Riyaz Ahuja (Aug 29 2024 at 06:21):

That worked, thanks!!
However, it is still very slow (looks like when I'm running env.constants it processes all of the millions of constants through all the imports, and then filtering by map2 just throws out the ones we don't need). Is there any way to speed this up without having to process all the constants?

Riyaz Ahuja (Aug 29 2024 at 06:23):

I'm looking at Jixia and modified the Symbol.lean to output from the same functions as you mentioned earlier. However, instead of compiling the module name and converting each of the compilation steps into environments directly, it uses a function:

def getResult : CommandElabM (Array SymbolInfo) := do
  let env ← getEnv
  let f a name info := do
    if env.getModuleIdxFor? name |>.isSome then return a
    let si ← liftTermElabM <| getSymbolInfo name info env
    return a.push si
  let a ← env.constants.map₁.foldM f #[]
  env.constants.map₂.foldlM f a

Which gives the same output, but way way quicker.

Riyaz Ahuja (Aug 29 2024 at 06:26):

The Jixia code is hard to understand imo and harder to port into my code, as I am unsure how they are accessing the module or getting the environment. For reference, here is my modified getSymbolInfo:

def getSymbolInfo (name : Name) (info : ConstantInfo) (env : Environment): TermElabM SymbolInfo := do
  let temp : NameSet := info.getUsedConstantsAsSet
  let temp2 := temp.toList
  let get_mod := temp2.map (fun x=> (x,env.getModuleFor? x))
  let refs := HashSet.empty.insertMany get_mod

  return { refs }

(With SymbolInfo updated accordingly, of course)

Riyaz Ahuja (Aug 29 2024 at 16:28):

Note that the output ends up being essentially the same in either case, but the computation time is significantly less for the modified Jixia version

Thomas Zhu (Aug 29 2024 at 23:39):

Hi Riyaz, you may want to check our code in cmu-l3/ntp-toolkit. It is based on Kim's lean-training-data. Specifically using the --premises flag seems to do exactly what you want (code in scripts/premises.lean). Extracting mathlib still takes time on the order of ~1h. It is designed to extract everything in each module in a repository. In fact it should be a lot faster if you have one specific module/theorem in mind, or if our engineering pipeline was changed, but that increases engineering difficulty for our other functions.

Thomas Zhu (Aug 29 2024 at 23:46):

I also don't understand why you would get a list of environments from one module name? You can look at how we got the environment in our code; basically

CoreM.withImportModules #[... module name ...] do
  let env ← getEnv

Adam Topaz (Aug 29 2024 at 23:49):

After every command the environment changes.

Thomas Zhu (Aug 29 2024 at 23:50):

Yes, but for extracting premise data, you only need the final environment

Adam Topaz (Aug 29 2024 at 23:51):

Well, it depends what you mean by “premise data”

Thomas Zhu (Aug 29 2024 at 23:56):

I mean a mapping from constant name to names of its dependencies, like what was asked for above

Adam Topaz (Aug 29 2024 at 23:58):

For that you don’t even need to process things file by file. Just import all of mathlib, get the env, loop through the constants, and look at all used constants for each one. You can filter using isBlacklisted or some variant

Adam Topaz (Aug 29 2024 at 23:58):

I did this a little while ago, and it only takes a few minutes to process

Thomas Zhu (Aug 30 2024 at 00:01):

Yes, that's what I wanted to mean by "if our engineering pipeline was changed", but thanks for point that out.

Adam Topaz (Aug 30 2024 at 00:03):

import Mathlib

open Lean

#eval show CoreM Unit from do
  let env ← getEnv
  let mut data : Array (Name × Array Name) := #[]
  for (n,c) in env.constants do
    if ← n.isBlackListed then continue
    let cs ← c.getUsedConstantsAsSet.toList.filterM fun t => do
      let b ← t.isBlackListed
      return !b
    data := data.push (n, cs.toArray)
  IO.FS.withFile "data" .write fun handle =>
    for (n,cs) in data do
      handle.putStrLn <| toString n ++ " " ++ toString cs

Adam Topaz (Aug 30 2024 at 00:04):

The output is around 145M

Adam Topaz (Aug 30 2024 at 00:05):

(too big to upload to zulip unfortunately)

Thomas Zhu (Aug 30 2024 at 00:18):

Thanks, this is great. Just a minor point, I assume in this way the output might contain constants like ._auxLemma.4 produced by simp which need additional processing (depending on the needs), and it is dealt with in our code (from Kim Morrison's code)

Adam Topaz (Aug 30 2024 at 00:26):

Yeah the filter should be a bit better, I agree.

Siddhartha Gadgil (Aug 30 2024 at 03:00):

The function Lean.Name.isInternal can be used to filer out constants like ._auxLemma.4

Riyaz Ahuja (Sep 08 2024 at 04:11):

I'm circling back on this thread as although everyone's help as now let me be able to extract all the important constants/dependencies from a given module, organized by the Name object that has the dependency inside (Thm, def, etc).

@Adam Topaz I ended up using a modified version of the ntp-toolkit premises function as shown in this github.

However, if you look at how my code outputs, it has outputs in the format of

[{name:Name,
dependencies:[{
    name:Name,
    location:String}
    ]
}]

Where the outside Name is the key part of the Env.constants.map₁, and in practice ends up being the name of the theorem we're looking at.

However, what I'm trying to do is given this Name (and Environment and ConstantInfo as needed) Is there any way to extract the start and end positions (in row,column form) of said theorem (or i suppose it would work equivalently for defs, etc.)? And same thing with each dependency, given the location (module name as a string), can I get the start and end position of the dependency?

Thomas Zhu (Sep 11 2024 at 19:56):

docs#Lean.findDeclarationRanges? should do what you want

Frederick Pu (Dec 27 2024 at 00:02):

how does https://github.com/semorrison/lean-training-data handle pattern matching tactics like cases or induction?

Jason Rute (Dec 27 2024 at 00:33):

Maybe this should be its own thread or connected to a previous thread in lean-training-data?

Jason Rute (Dec 27 2024 at 00:42):

It is not that hard to run, so you could try it and see. There is also an example here: https://github.com/kim-em/lean-training-data/tree/master/Examples. Search for “goalsbefore” to see all the recorded tactics. (The tactic is the “pp” field.) You see it records at many levels of depth from the whole proof to the atomic tactics, basically everywhere you can see a goal when hovering over that spot.

Last updated: May 02 2025 at 03:31 UTC