Zulip Chat Archive

Stream: new members

Topic: mathlib3 exports as knowledge artifacts

Stu B22 (Jun 05 2021 at 20:57):

I now have desktop lean3 tutorials working on MS-Win 10 with VSCode, so now I have better frame of reference.
I also did some initial analysis on the export.json.zip which Bryan kindly provided.
Good news : ~90MB unpacked export.json for mathlib parses and loads promptly into objects using an ordinary java setup.
We see 5 kinds of top level entities: decls, instances, mod_docs, notes, tactic_docs
We have 79515 decls, each of which contains these fields,
structure_fields, name , line, equations, constructors, attributes, filename, args, doc_string, kind, type, is_meta
Example:

 "args":  OMITTED
"attributes" : [ ],
  "constructors" : [ ],
  "doc_string" : "",
  "equations" : [ ],
  "filename" : "/Users/chb/Documents/lean/doc-gen/_target/deps/mathlib/src/analysis/calculus/deriv.lean",
  "is_meta" : false,
  "kind" : "theorem",
  "line" : 1856,
  "name" : "has_deriv_within_at.liminf_right_slope_le",
  "structure_fields" : [ ],
  "type" :  OMITTED

I omitted args and type because they contain the meat: long text/structure including much spicy unicode.

A practical sticking point arises from the breadth of unicode in these exported decls. Different text environments (browser/editor/IDE, font) can give visually different results, and may show missing-char codes. (I hope that font-codepage is the ONLY challenge I hit here).

I noticed that even VSCode may fail to display all of the text from the export, so I wanted to be sure I had a real lean desktop setup to compare with. That was Friday's accomplishment, and now I am excited to probe further.

Stu B22 (Jun 06 2021 at 00:51):

It SEEMS that the efmt strings written by export_json.lean contain some unicode chars that do not commonly appear in .lean source files.
In the source code here
https://github.com/leanprover-community/doc-gen/blob/master/src/export_json.lean
we see that the output format is summarized as:

interface DeclInfo {
  name: string;
  args: efmt[];
  type: efmt;
  doc_string: string;
  filename: string;
  line: int;
  attributes: string[];
  equations: efmt[];
  kind: string;
  structure_fields: [string, efmt][];
  constructors: [string, efmt][];
}

Where efmt is defined as follows ('c' is a concatenation, 'n' is nesting):

type efmt = ['c', efmt, efmt] | ['n', efmt] | string;

...and we see the decls are actually marshalled by this function

meta def decl_info.to_json : decl_info → json
| ⟨name, is_meta, args, type, doc_string, filename, line, attributes, equations, kind, structure_fields, constructors⟩ :=
json.object [
  ("name", to_string name),
  ("is_meta", is_meta),
  ("args", json.array $ args.map $ λ ⟨b, s⟩, json.object [("arg", s.to_json), ("implicit", b)]),
  ("type", type.to_json),
  ("doc_string", doc_string.get_or_else ""),
  ("filename", filename),
  ("line", line),
  ("attributes", json.of_string_list attributes),
  ("equations", equations.map efmt.to_json),
  ("kind", kind),
  ("structure_fields", json.array $
    structure_fields.map (λ ⟨n, t⟩, json.array [to_string n, t.to_json])),
  ("constructors", json.array $
    constructors.map (λ ⟨n, t⟩, json.array [to_string n, t.to_json]))]

Stu B22 (Jun 06 2021 at 01:13):

It appears that the output text contains a lot of 3-byte unicode characters beginning with hex EE.
The hex sequence EE 80 80 occurs a lot, similarly EE 80 81 and EE 80 82.
It seems these chars are from the Private Use Area range of unicode. https://en.wikipedia.org/wiki/Private_Use_Areas

Stu B22 (Jun 06 2021 at 01:18):

I get the impression that these special chars are being used by the doc-gen export_json.lean to mark-up the efmt text output, in a way that is useful to the downstream doc-gen python scripts.

Stu B22 (Jun 06 2021 at 01:31):

It looks like most of these occurrences are delimiting the occurrences of (linkable?) symbols.

Eric Wieser (Jun 06 2021 at 06:59):

Does this help? https://github.com/leanprover-community/doc-gen/blob/7e76fe3e93fa96aa6332d3e6cbcfacd168f59d4b/print_docs.py#L354

Eric Wieser (Jun 06 2021 at 07:00):

There is no such thing as a "3 byte unicode character"; what you're describing is a unicode codepoint whose UTF-8 encoding has 3 bytes.

Stu B22 (Jun 06 2021 at 19:25):

Yes, that link helps, thanks, and good point about the unicode definitions.

Last updated: May 02 2025 at 03:31 UTC