Zulip Chat Archive

Stream: lean4

Topic: Extracting the internal representation


Andrej Bauer (May 29 2023 at 12:18):

How would I go about extracting the internal representation, as represented in the kernel? This is for the purposes of machine learning. The idea is to take a large library, munch all of its files, and obtain the internal representations in a format that can be easily parsed by other tools. Has this been done?

For example, I was able to write a plugin for Agda that works a bit like its documentation plugin, but produces files with s-expressions instead of HTML. It took me about four days to figure it out. If there is a similar plugin for Lean, one that processes files and outputs them in some other format, that might be a good starting point.

Sebastian Ullrich (May 29 2023 at 12:25):

I have written a translator from Lean 4 .olean to a plain-text format: https://github.com/Kha/lean4export

Jason Rute (May 29 2023 at 23:08):

Also in Lean 3 it was easy to write a lean script to loop over the expressions of all the declarations in the library. I assume it is equally easy in Lean 4, but I don’t know the magic innovation. Also, you may be interested in the stream #Machine Learning for Theorem Proving

Andrej Bauer (May 30 2023 at 06:43):

BTW, what's the licensing status of your extension? I am thinking of adapting it so that it outputs s-expressions.

Andrej Bauer (May 30 2023 at 06:47):

One more question if I may: how do I extract the entire content without listing the constants on the command line?

Sebastian Ullrich (May 30 2023 at 08:31):

Andrej Bauer said:

BTW, what's the licensing status of your extension? I am thinking of adapting it so that it outputs s-expressions.

It's Apache now

Sebastian Ullrich (May 30 2023 at 08:32):

Andrej Bauer said:

One more question if I may: how do I extract the entire content without listing the constants on the command line?

That's what should happen by default if you don't use the --

Sebastian Ullrich (May 30 2023 at 08:32):

https://github.com/Kha/lean4export/blob/f1d507ff8c8ba2df2316e78c0186a84c6791305f/Main.lean#L11

Andrej Bauer (May 30 2023 at 14:07):

Thanks for everything. BTW, you should probably fill in the Copyright info at the end of LICENSE (year and your name).

Sebastian Ullrich (May 30 2023 at 15:29):

I think that is meant as a template for source-level copyright comments. The GitHub copyright wizard fills in this information for other licenses where it's needed, but not for Apache

Andrej Bauer (May 31 2023 at 21:29):

Sorry for going on with this, but I understand your code correctly (and based on running it), if no constants are given then it will dump all constants of all imports, recursively. What would be a good way of dumping only the constants that are defined in a given module?

Mario Carneiro (May 31 2023 at 22:02):

you can ask what module a constant is in and skip it if it's not the one you want

Mario Carneiro (May 31 2023 at 22:02):

unfortunately there is no better way to do it since the constants are in a hashmap so they are not meaningfully indexed by anything other than their name

Mario Carneiro (May 31 2023 at 22:08):

I think it would look something like this:

  let some target := env.getModuleIdx? `foo | throw (.userError "module not found")
  M.run env do
    env.constants.forM fun c _ => do
      if !c.isInternal && env.getModuleIdxFor? c == target then
        let _  dumpConstant c

Andrej Bauer (Jun 01 2023 at 13:22):

Is there a way to load a file, rather than to instruct Lean to import a module by name? That would be more useful to me. (And find out which constants were defined in the file.)

Andrej Bauer (Jun 01 2023 at 13:25):

Even better, can I just load .olean files and extract stuff from them? (Looking at Environment.lean now.)

Andrej Bauer (Jun 01 2023 at 13:27):

So Environment.ModuleData seems to perfectly fit my bill. Now just need to find out how to load an .olean file...

Andrej Bauer (Jun 01 2023 at 13:28):

Is Environment.readModuleData the one I want? Let me try.

Andrej Bauer (Jun 01 2023 at 14:00):

The .olean file contains entries like this:

IO.println._at.main._spec_1
IO.println._at.main._spec_1._cstage2
Nat.foo._cstage1
Nat.foo
main
main._cstage1
Nat.foo._cstage2
Array.forInUnsafe.loop._at.main._spec_2
List.forIn.loop._at.main._spec_3
main.match_1
main._cstage2
main.match_1._cstage1
Array.forInUnsafe.loop._at.main._spec_2._cstage2
List.forIn.loop._at.main._spec_3._cstage2

Is there a reliable way to tell which ones are "generated"? These would be _cstage and _spec stuff. BTW, what are they?

Andrej Bauer (Jun 01 2023 at 14:00):

Perhaps they have an attribute I can test.

Henrik Böving (Jun 01 2023 at 14:01):

They are both entries auto generated while doing code (old) generation iirc.

Adam Topaz (Jun 01 2023 at 14:11):

there's docs4#Lean.Name.isInternal

Kyle Miller (Jun 01 2023 at 14:14):

Mathlib has a function that uses that along with a few other checks: https://github.com/leanprover-community/mathlib4/blob/01da6a3c22053b87a02f5a4a7d78c10d6cb25160/Mathlib/Lean/Name.lean#L19

Andrej Bauer (Jun 02 2023 at 07:51):

Is it possible that an .olean file would contain a cyclic datastructure, so that naive traversal will cycle forever/

Mario Carneiro (Jun 02 2023 at 08:16):

no, all lean objects are cycle free (this is important for the RC collector to work)

Mario Carneiro (Jun 02 2023 at 08:16):

although I'm not exactly sure how IO.Refs are garbage collected

Mario Carneiro (Jun 02 2023 at 08:17):

you would need an unsafe inductive to create a structure that can have references to itself, I think

Jannis Limperg (Jun 02 2023 at 08:23):

I use a cyclic data structure in Aesop (involving unsafe and IO.Ref), but it never ends up in any .olean. I'd assume that the serialisation procedure for .oleans doesn't support cyclic data in the first place.

Andrej Bauer (Jun 02 2023 at 08:25):

For some reason Mathlib/Order/Copy.olean causes https://github.com/andrejbauer/lean2sexp to allocate 25GB of RAM and just sit around wasting 100% of CPU. So I thought maybe there are cycles, because lean2sexp simply traverses the constants and outputs ConstantInfo as S-expressions. I suppose I'll just have to debug a bit.

Andrej Bauer (Jun 02 2023 at 12:59):

If I suspect that something is wrong with the internal encoding of proof_8 (or something close to that name) in build/lib/Mathlib/Order/Copy.olean, how would I double-check?

Mario Carneiro (Jun 02 2023 at 13:01):

What kind of problem? if it is something you can see with lean code you are best off using loadModuleData to get at the olean and start hacking around in it with an eval block

Mario Carneiro (Jun 02 2023 at 13:04):

if it is something below that (i.e. in the actual byte representation and encoding) you could try using https://github.com/gebner/oleanparser, which is written like a more traditional parser (it does not use anything from Lean directly and could equally well be written in python) and produces some fairly low level output which you could use to debug the binary format

Mario Carneiro (Jun 02 2023 at 13:05):

(BTW, proof_8 is not enough to find the theorem in question, as a lot of things have a proof_8. Hopefully that is the last name component in some theorem's full name, which might help to debug this from the perspective of the source that generated it)

Andrej Bauer (Jun 02 2023 at 13:23):

I am getting somewhere. I am converting the content of an olean file to s-expressions (see https://github.com/andrejbauer/lean2sexp). The file build/lib/Mathlib/Order/Copy.olean is 2.1MB in size, but somehow in s-expression form that has everything except applications and lambdas pruned, it becomes 227MB. This seems like an amazingly efficient way of comperssing information in olean. Do olean files contain something other than abstract syntax trees? Are they optimized in some way?

Andrej Bauer (Jun 02 2023 at 13:28):

And build/lib/Mathlib/Topology/Sheaves/SheafCondition/OpensLeCover.olean is even worse, it outputs 1.3 GB of stuff.

Mario Carneiro (Jun 02 2023 at 13:28):

Are you dag-sharing?

Mario Carneiro (Jun 02 2023 at 13:28):

because if not this will have massive blowup

Mario Carneiro (Jun 02 2023 at 13:29):

lean expressions are maximally shared in olean files IIRC

Andrej Bauer (Jun 02 2023 at 13:33):

Aha, that would do it. It's just amazing that there would be so much sharing. Where is it coming from? Surely no human produces such dags with bare hands.

Mario Carneiro (Jun 02 2023 at 13:34):

One common source of sharing is the names

Mario Carneiro (Jun 02 2023 at 13:34):

stuff like namespaces and particular theorem names are repeated many thousands of times in the average lean file

Mario Carneiro (Jun 02 2023 at 13:36):

I think actually interesting term sharing in proofs isn't that common, lean itself sometimes has trouble with it so it's not a particularly good thing to rely on

Andrej Bauer (Jun 02 2023 at 13:40):

I measured the size of the proof trees, counting applications and lambdas only. In build/lib/Mathlib/Order/Copy.olean there is a constant called CompleteLattice.copy.proof_8 whose size is 1342037. This is not caused by name sharing, since it contains that many applications and lambdas. And looking around, it looks like the really large ones are always called Foo.bar.proof_X.

Mario Carneiro (Jun 02 2023 at 13:43):

well the bulk of most proofs will be in definitions named something like that

Mario Carneiro (Jun 02 2023 at 13:43):

so that's not a huge indicator of anything

Andrej Bauer (Jun 02 2023 at 13:44):

But such huge proofs must have been generated by tactics, and moreover, are very likely proofs in Prop, no?
I mean, who would care about them?

Mario Carneiro (Jun 02 2023 at 13:44):

I think I can guess why CompleteLattice.copy is big: it's probably instantiating a structure with a deep inheritance hierarchy in one go using a very large structure instance

Andrej Bauer (Jun 02 2023 at 13:45):

So the type-class mechanism might be doing it?

Mario Carneiro (Jun 02 2023 at 13:45):

not quite

Andrej Bauer (Jun 02 2023 at 13:45):

If there is an identifiable way for me to detect that there are very large proofs that will not be very informative, I am more than happy to throw them away.

Mario Carneiro (Jun 02 2023 at 13:50):

To demonstrate why this leads to exponential growth or (more accurately) highly shared subterms, let's consider a smaller example:

class A (X : Type) where
  x : Nat
  y : Nat

def foo (X) [A X] : Nat := A.x X

class B (X : Type) extends A X where
  z : Nat
  h : foo X = z


def bla : B Int := {
  x := Nat.zero -- for a smaller pp.all
  y := Nat.succ Nat.zero
  z := Nat.zero
  h := rfl
}
-- set_option pp.explicit true
#print bla
-- def bla : B Int :=
-- @B.mk Int (@A.mk Int Nat.zero (Nat.succ Nat.zero)) Nat.zero bla.proof_1
#print bla.proof_1
-- def bla.proof_1 : @Eq Nat (@foo Int (@A.mk Int Nat.zero (Nat.succ Nat.zero)))
--   (@foo Int (@A.mk Int Nat.zero (Nat.succ Nat.zero))) :=
-- @rfl Nat (@foo Int (@A.mk Int Nat.zero (Nat.succ Nat.zero)))

Mario Carneiro (Jun 02 2023 at 13:51):

What is interesting here is that the type of h, which is filled in by bla.proof_1, is

@foo Int { x := Nat.zero, y := Nat.succ Nat.zero } = @foo Int { x := Nat.zero, y := Nat.succ Nat.zero }

where { x := Nat.zero, y := Nat.succ Nat.zero } is the first half of the structure just constructed

Mario Carneiro (Jun 02 2023 at 13:55):

in other words, each occurrence of foo (or more generally, any typeclass function) contains the complete record constructed by the structure instance. In practice this happens a lot, because for example groups extend monoids which extend ... which extend Add, which means that when you construct a group it uses the instance it derives for Add from the monoid, and then your instance which builds a group will have a subgoal saying something like (x + y) + z = x + (y + z) where every single + there is actually HAdd.hAdd .... (addOfMonoid { add := bla, add_zero := by simp, zero_add := by horrible proof term })

Mario Carneiro (Jun 02 2023 at 13:56):

so now you have 6 occurrences of the monoid structure in the group structure, and then to build the ring structure you will reference the group structure 6 times...

Mario Carneiro (Jun 02 2023 at 13:56):

rinse and repeat and by the time you get to CompleteLattice which probably has something like 12 parent structures in it things are kind of ridiculous

Andrej Bauer (Jun 02 2023 at 13:57):

Is there any way to detect the sharing present in ModuleData? Maybe some sort of physical equality?

Mario Carneiro (Jun 02 2023 at 13:57):

yes, you can use pointer equality if you need to do this from lean

Mario Carneiro (Jun 02 2023 at 13:57):

there is an unsafe function for it

Mario Carneiro (Jun 02 2023 at 13:58):

you can also compare hashes and make a cache, this is what most expr traversals in lean do

Andrej Bauer (Jun 02 2023 at 13:58):

Although, it seems that a saner option would be to elide all this stuff from the output. The output is meant to represent syntax trees to be used for machine learning. I am not sure machines will learn much from implicit arguments that were not given by the user directly.

Mario Carneiro (Jun 02 2023 at 13:59):

yeah, you can just pretty print everything

Mario Carneiro (Jun 02 2023 at 13:59):

or leave out implicit arguments from the sexprs which is a kind of basic version of pretty printing

Andrej Bauer (Jun 02 2023 at 13:59):

Even if I keep the shared structures, I am just postponing the problem to the next person (who is a machine learner), because they will hit the same problem. Just traversing these structures takes forever.

Andrej Bauer (Jun 02 2023 at 14:00):

Can I tell from olean which implicit arguments were filled by the machine and which ones by the human?

Mario Carneiro (Jun 02 2023 at 14:00):

that information does not exist

Mario Carneiro (Jun 02 2023 at 14:01):

you could look at the input syntax I suppose, but it's not really possible to truly know what was machine written

Andrej Bauer (Jun 02 2023 at 14:01):

Let me try omitting just instImplicit first. Perhaps that'll help.

Andrej Bauer (Jun 02 2023 at 14:02):

"Machine written" is not a well-defined notion, because a user could write a tactic which computes a term T Did the user "write T"?

Mario Carneiro (Jun 02 2023 at 14:03):

or even if text is in the lean file, was it created by a code action? (or by mathport, which probably has authorship on 70+% of mathlib4 right now by a strict reading)

Mario Carneiro (Jun 02 2023 at 14:03):

the line seems pretty impossible to draw

Mario Carneiro (Jun 02 2023 at 14:03):

but then again, isn't that the point of ITP?

Andrej Bauer (Jun 02 2023 at 14:04):

In all of Agda that we extracted to s-expressions, there is just one file that is unreasonably large (30MB). There are no, or close to none, type classes in Agda. I wonder what would happen with Coq.

Mario Carneiro (Jun 02 2023 at 14:05):

I think these issues are mostly deficiencies in lean, places where it is generating something dumb and when it shows up on a profile sometimes we try to fix it

Andrej Bauer (Jun 02 2023 at 14:05):

But what is one to think of the fact that the "trivial" information reconstructed by the type class mechanism in some cases is 99% of the proof?

Mario Carneiro (Jun 02 2023 at 14:06):

type class inference is a hugely expensive part of lean processing (that and defeq checking are the majority of compile time)

Andrej Bauer (Jun 02 2023 at 14:07):

Ugh. When staring at Expr.app e1 e2 it's actually quite difficult to tell whether e2 is an implicit argument.

Mario Carneiro (Jun 02 2023 at 14:08):

from a detached standpoint it all looks pretty crazy, but when you get into the weeds there aren't any obvious fixes - almost all of the typeclass system is actually being used in some way in mathlib in practice

Mario Carneiro (Jun 02 2023 at 14:08):

the way to find implicits is to get the type of the function symbol and traverse the pis

Mario Carneiro (Jun 02 2023 at 14:09):

are you working in a context where inferType does something useful?

Andrej Bauer (Jun 02 2023 at 14:09):

But the type of the function symbol need not be in normal form.

Andrej Bauer (Jun 02 2023 at 14:09):

I am just loading .olean files, so whatever is in them.

Mario Carneiro (Jun 02 2023 at 14:10):

I mean can you load them into an environment such that the usual lean functions work

Mario Carneiro (Jun 02 2023 at 14:11):

there is a function by the name forallTelescopeReducing which does the normalization

Andrej Bauer (Jun 02 2023 at 14:11):

You tell me :-) It's Lean4 and I am loading them in with Lean.readModuleData (and throwing away the compacted region, cause I don't know what that's for).

Mario Carneiro (Jun 02 2023 at 14:11):

that's for freeing the data, if you discard it then it will live until the end of the program, which is usually fine

Andrej Bauer (Jun 02 2023 at 14:11):

In particular, I am not loading the dependencies, so I can't really process anything.

Mario Carneiro (Jun 02 2023 at 14:12):

yeah that's what I was afraid of

Mario Carneiro (Jun 02 2023 at 14:12):

you should probably do that, it is not that expensive

Andrej Bauer (Jun 02 2023 at 14:12):

This is machine learning, we should look for a practical solution.

Mario Carneiro (Jun 02 2023 at 14:12):

I think there is a function which automatically handles all this

Mario Carneiro (Jun 02 2023 at 14:13):

withImportModules

Mario Carneiro (Jun 02 2023 at 14:14):

or I guess importModules

Andrej Bauer (Jun 02 2023 at 14:16):

I could store dag's if I wanted to, my output format allows it.

Andrej Bauer (Jun 07 2023 at 16:54):

Is it just me, or do .olean files reference entities by their short names, i.e., one has to rely on import statements to figure out what a given name refers to? Is there some way to get fully qualified names?

Mario Carneiro (Jun 07 2023 at 17:11):

I would be very surprised for that to be the case

Mario Carneiro (Jun 07 2023 at 17:12):

(also you mean namespace / open, not import. import changes the set of names in the environment but not the names you can use to refer to them)

Andrej Bauer (Jun 07 2023 at 17:18):

Sorry, false alarm.

Andrej Bauer (Jun 07 2023 at 17:19):

Next weirdness: is it possible that the order of entries in .olean is not the same as in .lean, i.e., the definitions get mixed up? (Either that, or I have a bug.)

Mario Carneiro (Jun 07 2023 at 17:20):

I'm not even sure they are stored in any order

Mario Carneiro (Jun 07 2023 at 17:20):

isn't it a hashmap?

Andrej Bauer (Jun 07 2023 at 17:20):

According to ModuleData.constants, they are stored in an Array, which sounds sorted to me.

Andrej Bauer (Jun 07 2023 at 17:21):

But perhaps the complier mixes them up.

Mario Carneiro (Jun 07 2023 at 17:22):

yeah

  let constants  := env.constants.foldStage2 (fun cs _ c => cs.push c) #[]

where env.constants is a hashmap

Mario Carneiro (Jun 07 2023 at 17:22):

that's how the constants : Array ConstantInfo is populated

Mario Carneiro (Jun 07 2023 at 17:23):

agreed that this is a bit odd


Last updated: Dec 20 2023 at 11:08 UTC