Zulip Chat Archive

Stream: lean4

Topic: Export format


John Burnham (Sep 21 2021 at 21:42):

I'm wondering what the plans are for adapting the lean3 export format (https://github.com/leanprover-community/lean/blob/master/doc/export_format.md) to lean4 to facilitate independent checkers? I'm guessing the way to do this would be to write a library in Lean that takes either the Environment or a list of Declaration and writes it to a file?

I'd be happy to help contribute on this if that would be useful. I've written a parser for the lean3 format in Rust: https://github.com/yatima-inc/radiya/tree/main/src/parse, as well as a way to embed the export format as a hash-linked graph in IPFS's https://ipld.io/ format.

Mario Carneiro (Sep 22 2021 at 00:40):

@John Burnham I believe that nanoda_lib has a parser for the lean 3 export format at https://github.com/ammkrn/nanoda/blob/master/src/parser.rs which has been adapted to include lean 4 kernel features, so you could consider it a de-facto specification of the lean 4 export format, but to my knowledge an actual exporter has not been written (cc: @Chris B, is this true?). I believe this to be a fairly simple project that can be done in user space (which is part of why a standard exporter doesn't ship with the lean 4 distribution, unlike lean 3).

In particular, it might be advantageous to not standardize the format and instead have an exporter written in lean 4 using standard API functions that can be adapted to suit the needs of the importer (for example, logging different information or using a different interchange format).

John Burnham (Sep 22 2021 at 01:16):

I referred to nanoda_lib extensively while writing my parser in radiya (https://github.com/yatima-inc/radiya/tree/main/src/parse), but nanoda parses directly into it's particular Env type as opposed to an export syntax representation, and does not include error handling if the export format is malformed. Those are reasonable design decisions for nanoda, but I wanted to something a little safer/more general for my case, so I rewrote it using the nom parser combinator library. Likely that's slower than nanoda's approach, but doesn't seem too bad in practice, since radiya still parses the lean3 standard library in 1.2 seconds in testing.

Also, there are some declarations in lean4 which are missing from the export syntax, like theorems, opaques and mutual definitions. It should be fairly easy to add the first two as #THEO and #OPAQ and we could do mutuals as #MUT <number-of-definitions> <def-1> <def-2>. There are also new Expr constructors in Lean4 like MData and Proj, but I'm not entirely sure what those are and if they're kernel relevant.

But I also am thinking about other possibilities, like replacing the numerical name, universe and expression indices with hashes, to allow AST fragments to be shared over IPFS, like in Yatima. That probably is mainly relevant to my use case though, and shouldn't be the standard, so I'll take a look at implementing that in Lean4 userspace.

Is there documentation somewhere on Lean's C FFI? For my content-addressing concept I've been using https://github.com/BLAKE3-team/BLAKE3, and it'd be cool to use that here

John Burnham (Sep 22 2021 at 01:21):

To clarify, I know we can do

@[extern "lean_nat_add"]
def add : Nat  Nat  Nat
  | a, Nat.zero   => a
  | a, Nat.succ b => Nat.succ (add a b)

as described here: https://leanprover.github.io/lean4/doc/lean3changes.html, but what i'm specifically wondering about is how to provide the C definition in the build. Presumably there's a way to use lake for this?

Mario Carneiro (Sep 22 2021 at 01:21):

I don't know of any documentation for the C FFI (cc: @Mac ?), but my understanding is that you just include lean.h and link with the generated C files. If you are using rust then you probably want to use bindgen in there but otherwise it should be the same process.

I could have sworn that nanoda supported opaque definitions and mutuals, and it almost certainly supports theorems since those are present in the lean 3 format as well

Mario Carneiro (Sep 22 2021 at 01:24):

Personally, I think we should move away from the lean 3 export format style. It requires a custom parser implementation, and it is not very well hardened against weird inputs. Something standard like JSON seems like a better choice to me

John Burnham (Sep 22 2021 at 01:26):

I could be wrong, but nanoda appears to have theorem and opaque in their internal representation (https://github.com/ammkrn/nanoda_lib/blob/998ef68742d2ccc9f82442d70ae16802f44e33f2/src/env.rs#L65), but not in the parser https://github.com/ammkrn/nanoda_lib/blob/998ef68742d2ccc9f82442d70ae16802f44e33f2/src/parser.rs#L56 I don't see mutuals, but maybe I'm not looking at the right thing.

I agree that exporting via JSON would be better than a bespoke format though. Actually the IPFS/IPLD stuff I was talking about is essentially just JSON (plus hash-links encoded as singleton maps,{ "/": <hash-string> }).

Mac (Sep 22 2021 at 01:28):

There is no documentation of the C FFI as far as know. However, the entire C API is now contained within lean.h (outside a few GMP dependent functions in lean_gmp.h) and the file is not that big, so I think just quickly skimming through the signatures should be informative enough in many cases.

Also, @Mario Carneiro , isn't a theorem just a def without codegen?

Mario Carneiro (Sep 22 2021 at 01:29):

I think the main difference post-elaboration is that theorems are irreducible by default

Mario Carneiro (Sep 22 2021 at 01:29):

but the distinction is valuable to downstream consumers regardless of the direct effect of the label

John Burnham (Sep 22 2021 at 01:39):

Does the distinction between def/theorem/opaque or the presence of ReduciblityHints matter in the kernel semantics? The comment at https://github.com/leanprover/lean4/blob/0bea52d1b535559e3f77e257d4afad5af87be54e/stage0/src/Lean/Declaration.lean#L27 says they only affect performance, but perhaps that's still useful to export to third-party checkers?

Mario Carneiro (Sep 22 2021 at 01:39):

You are probably right about the parser being less comprehensive than the internal representation in nanoda. I have the distinct memory of working out a syntax for mutual inductives along your proposed lines, but I can't remember enough of the context to find it now. In any case, the lean 4 data structures to export are fairly clear about their grammar (see Declaration), so I would just do a straightforward serialization of that in your syntax of choice. (But keep in mind that you have to do exporting of expr, level and possibly name in a way that supports backreferences, or else you can get exponential blowup.)

Mario Carneiro (Sep 22 2021 at 01:40):

Yes, both def/theorem/opaque and ReducibilityHints are directly relevant to external verification

Mario Carneiro (Sep 22 2021 at 01:41):

well, it's possible that def/theorem is equivalent to some different ReducibilityHints setting

Mario Carneiro (Sep 22 2021 at 01:42):

opaque is different from both def and theorem because opaque defs never unfold, they are more like axiomatic constants with a proof of inhabitedness of the type to justify the addition

Mario Carneiro (Sep 22 2021 at 01:44):

The comment says they only affect performance, but don't underestimate by just how much. If you leave the garden path of unfolding the way the original proof did, you can hit an absolutely terrible worst case. In principle, all unfolding paths are equivalent because you will eventually hit a normal form; in practice the difference between those can be more than the age of the universe

Mario Carneiro (Sep 22 2021 at 01:46):

That probably won't happen most of the time, but on any corpus as large as mathlib you will hit enough edge cases that it's likely to make the difference between checking in a reasonable time frame and timing out after spending more than 100x more time than lean

John Burnham (Sep 22 2021 at 01:59):

That makes sense, haha, that in practice performance is actually semantics :smiling_face:. Adding all the top level declarations from the Lean source seems pretty straightforward though.

What I'm thinking about for the JSON/IPLD design is not that different from the existing format, actually, I think, but I want to replace the indices with hashes like,

{ <hash-1> : ["#NS", null, "u"],
  <hash-2> : ["#UP", <hash-1>],
  ...
}

The neat thing about this is that it would give every Lean environment a unique content-id. One thing I'm wondering though is whether the hashing could operate over Constants rather Declarations. That way each ConstantInfo could get it's own hash (which could be cached across typechecking sessions, potentially). I saw a comment in nanoda somewhere that it might be interesting to have the export format do more of the work for representing each of the QuotKinds rather than having to re-compile them when we import the export format.

Mac (Sep 22 2021 at 02:18):

@John Burnham, what kind of hashes are we talking about here? If you are talking about an ordinary 32-bit/64-bit hash, I think collisions in a large library like mathlib would be a big concern.

Mac (Sep 22 2021 at 02:18):

If you are talking about a cryptographically secure hash, speed could potentially be an issue.

Mac (Sep 22 2021 at 02:22):

Also, even with cryptographically secure hashes, you are now basing the soundness of your proof on the soundness of the hashing function. As soon as a collision for the hashing function has been found, all proofs checked by your checker could become unsound (depending on how much you are trusting the hash).

John Burnham (Sep 22 2021 at 02:29):

What we've been doing in Yatima is hashing expressions and definitions with a 256-bit Blake3 hash, so negligible collision risk. Speed might be an issue, but Blake3 is very fast, and can be computed in parallel. I agree that the hash function then becomes part of the trusted code base, which certainly wouldn't be appropriate for every use case. But my specific use-case is verifying theorems inside of a blockchain's state-transition function, and this already assumes a sound cryptographic hash function.

John Burnham (Sep 22 2021 at 02:46):

I'm not 100% sure about this idea though, so I will try it both ways as I write an exporter in Lean and see what I learn

Chris B (Sep 22 2021 at 03:42):

John Burnham said:

I could be wrong, but nanoda appears to have theorem and opaque in their internal representation (https://github.com/ammkrn/nanoda_lib/blob/998ef68742d2ccc9f82442d70ae16802f44e33f2/src/env.rs#L65), but not in the parser https://github.com/ammkrn/nanoda_lib/blob/998ef68742d2ccc9f82442d70ae16802f44e33f2/src/parser.rs#L56 I don't see mutuals, but maybe I'm not looking at the right thing.

I haven't read the whole thread so this may have been addressed to some extent, but those aren't in the parser because (at least when I wrote it) I didn't know what those were supposed to look like, and then we figured out we'd have to write the export format ourselves. I would assume theorems and opaque/abbrev will look exactly like a definition but with #OPQ/#THM prefixes, and mutual inductives will be like inductives with one more integer indicating the number of inductive specs in the block.

Chris B (Sep 22 2021 at 03:51):

Mario Carneiro said:

In particular, it might be advantageous to not standardize the format and instead have an exporter written in lean 4 using standard API functions that can be adapted to suit the needs of the importer (for example, logging different information or using a different interchange format).

Verifiers will need to get different outputs depending on whether (and how) they want to support the native reduction stuff for bool, nat, and string, so I think at least some degree of configurability would be required for a standard format.

Chris B (Sep 22 2021 at 03:53):

But I like the mmz idea of letting scripts include data that the verifier can ignore in a principled way (input/output in mmz). There might be a good path to code re-use by following that kind of pattern.

John Burnham (Sep 22 2021 at 22:15):

@Chris B I've started implementing an IPLD serialization library in Lean: https://github.com/yatima-inc/radiya/blob/jcb/kernel/Radiya/Ipld.lean, which just a JSON superset:

inductive Ipld where
| null
| bool (b : Bool)
| num (n : UInt64)
| str (s : String)
| byt (b : ByteArray)
| arr (elems : Array Ipld)
| obj (kvPairs : RBNode String (fun _ => Ipld))
| link (cid: Cid)
deriving Inhabited

It think being able to persist and share the export format over IPFS will be really interesting, particularly because of the growing Nix <-> IPFS integration.

One question I have though is about how the map of Constants in the Environment is built from the list Declaration sent to the kernel. Since the ConstMap is keyed on Names, I'm assuming that mutual definitions and inductives are flattened, so that the graph of Expr.Const references forms cycles? It looks like the mutual inductives are no longer compiled to a single large inductive as in lean3?

There was a comment on the old nanoda which said:

/// This module is pretty much just "by hand" assembly of Quot. I'm not sure
/// why the export file doesn't lend more help in putting this together.

So what I'm wondering is whether we can directly export the constants instead of the declarations, so that instead of a single #QUOT, we'd have in the export file an already compiled constant for Quot, Quot.mk, Quot.lift, Quot.ind. And for the inductives could w have the InductiveVals, ConstructorVals, RecursorVals directly, rather than having to compile the specs?

On the other hand, maybe the exporting is based on Declaration rather Constant so that each object in the export format is self-contained, rather than potentially having cyclical reference spaghetti.

Chris B (Sep 23 2021 at 03:58):

John Burnham said:

Since the ConstMap is keyed on Names, I'm assuming that mutual definitions and inductives are flattened, so that the graph of Expr.Const references forms cycles? It looks like the mutual inductives are no longer compiled to a single large inductive as in lean3?

You're right that they're no longer flattened, but everything should still be acyclic because mutual declarations are only going to refer to each others components by name (using an Expr.Const, which is more primitive than the actual declaration), where the name is able to unfold/reduce to the mutual item by fishing it out of the environment. In a meta sense you could say that it has cycles in that A is allowed to refer to B by name and vice versa, but this shouldn't result in there being "actual" cycles from the program's perspective.

John Burnham said:

So what I'm wondering is whether we can directly export the constants instead of the declarations, so that instead of a single #QUOT, we'd have in the export file an already compiled constant for Quot, Quot.mk, Quot.lift, Quot.ind. And for the inductives could w have the InductiveVals, ConstructorVals, RecursorVals directly, rather than having to compile the specs?

On the other hand, maybe the exporting is based on Declaration rather Constant so that each object in the export format is self-contained, rather than potentially having cyclical reference spaghetti.

The problem with doing this for the recursors is that they're supposed to be derived from the inductive spec and the constructors according to a very specific procedure laid out in the kernel. If you were to be given completed recursors by the proof script, you would either need a separate procedure to check that they're the "proper" recursors for that inductive spec, or you would need to be okay with the risk that the script gives you fugazi recursors.

To some extent this is probably the motivation for wanting to assemble quot manually. Especially in Lean 4 where the quot recursors are hard-coded in the kernel, you don't necessarily want to have to trust the definition of quot that gets passed by the script since the kernel is actually making assumptions about how it behaves. There's a little bit of asymmetry in that the Lean 4 kernel receives Eq as components, but then it does a check to make sure that it matches expectations. I think this is justified in that Eq is declared in core, whereas Quot is primitive. There's a branch of nanod_lib with this check and some other additions, but I ran into weird rustc issues that freaked me out and never got addressed so I haven't pushed it. For Quot I think this is mostly a measure of how paranoid you want to be since you can just verify it by eye if you want to, but that's definitely not the case for recursors in the large. For inductive declarations and constructors, there's very little that happens during compilation, it's mostly just checks that they play by the rules, so there's not much to be gained there.

The approach you mentioned would work with metamath0's Lean verification, where the kernel verifier's steps are recorded alongside the actual data. There are a lot of attractive aspects to this approach, but the downsides are that a) it's not done, and b) the proof scripts become extremely long, probably way too long to put on a distributed system.

John Burnham (Oct 02 2021 at 22:12):

@Chris B Sorry, I didn't see your message earlier (still figuring out Zulip). This is really helpful info, thank you! I think I have a pretty good understanding of how to do an IPLD/JSON based export format now, so will report back when I have something more substantial implemented

Jannis Limperg (Apr 18 2023 at 10:36):

From this old topic, I understand that there's no export format for Lean 4, neither in core nor in any external library. Is this still the state of things? Also, is there any comprehensive description of the Lean 4 type theory apart from the kernel source code?

Henrik Böving (Apr 18 2023 at 10:49):

THe rule of thumb for the Lean 4 type theory I always hear is "Oh its basically marios thesis + a little bit"

Jannis Limperg (Apr 18 2023 at 10:54):

Yeah, but this is not something I can easily give to a student. I was thinking about offering some BSc/MSc theses to build Lean 4 verifiers. But it looks like I need two other projects first: (a) write an exporter and (b) reverse-engineer the kernel to find out what precisely the type theory is.

Henrik Böving (Apr 18 2023 at 10:55):

https://github.com/Kha/lean4export this is a thing

Jannis Limperg (Apr 18 2023 at 10:57):

Aha, beautiful! @Sebastian Ullrich is this current?

Mario Carneiro (Apr 18 2023 at 11:03):

There is also https://github.com/ammkrn/nanoda_lib which IIUC is lean 4 compatible and works on an extension of the export format

Jannis Limperg (Apr 18 2023 at 13:30):

Nice, thanks! @Chris Bailey could you comment on nanoda_lib's support for Lean 4? Does it work with current Lean 4? Does it use the extended export format of Sebastian's tool or something else?

Chris Bailey (Apr 19 2023 at 16:18):

Jannis Limperg said:

Nice, thanks! Chris Bailey could you comment on nanoda_lib's support for Lean 4? Does it work with current Lean 4? Does it use the extended export format of Sebastian's tool or something else?

The breakdown is roughly as follows...

These are implemented:
support for mutual inductives
support for mutual definitions
removal of reduction rules
direct inductive reduction
direct quotient reduction
theorem/opaque declaration types
abbrev/opaque transparency levels
native "let" (#EZ)

These are not:
Support for nested inductive reduction
Native projections
Native Nat and String

I hadn't seen Sebastian's repo before. Looking over the README it seems like projections should be easy if they work like definitions. The native nat/string will be more work, I'll have to check again and see how the primitive operations are supposed to work in the kernel. Reading them in from the export file seems simple, but it's not immediately clear how the native operations appear/are encoded within expressions, particularly the native nat binary expressions. There doesn't seem to be a new encoding for mutual inductives, I can check to see how mutual blocks show up in the export later.

Jannis Limperg (Apr 20 2023 at 09:04):

Great, thank you very much for the detailed answer!


Last updated: Dec 20 2023 at 11:08 UTC