Zulip Chat Archive
Stream: new members
Topic: mathlib3 exports as knowledge artifacts
Stu B22 (Jun 05 2021 at 20:57):
I now have desktop lean3 tutorials working on MS-Win 10 with VSCode, so now I have better frame of reference.
I also did some initial analysis on the export.json.zip which Bryan kindly provided.
Good news : ~90MB unpacked export.json for mathlib parses and loads promptly into objects using an ordinary java setup.
We see 5 kinds of top level entities: decls, instances, mod_docs, notes, tactic_docs
We have 79515 decls, each of which contains these fields,
structure_fields, name , line, equations, constructors, attributes, filename, args, doc_string, kind, type, is_meta
Example:
"args": OMITTED
"attributes" : [ ],
"constructors" : [ ],
"doc_string" : "",
"equations" : [ ],
"filename" : "/Users/chb/Documents/lean/doc-gen/_target/deps/mathlib/src/analysis/calculus/deriv.lean",
"is_meta" : false,
"kind" : "theorem",
"line" : 1856,
"name" : "has_deriv_within_at.liminf_right_slope_le",
"structure_fields" : [ ],
"type" : OMITTED
I omitted args
and type
because they contain the meat: long text/structure including much spicy unicode.
A practical sticking point arises from the breadth of unicode in these exported decls. Different text environments (browser/editor/IDE, font) can give visually different results, and may show missing-char codes. (I hope that font-codepage is the ONLY challenge I hit here).
I noticed that even VSCode may fail to display all of the text from the export, so I wanted to be sure I had a real lean desktop setup to compare with. That was Friday's accomplishment, and now I am excited to probe further.
Stu B22 (Jun 06 2021 at 00:51):
It SEEMS that the efmt
strings written by export_json.lean contain some unicode chars that do not commonly appear in .lean source files.
In the source code here
https://github.com/leanprover-community/doc-gen/blob/master/src/export_json.lean
we see that the output format is summarized as:
interface DeclInfo {
name: string;
args: efmt[];
type: efmt;
doc_string: string;
filename: string;
line: int;
attributes: string[];
equations: efmt[];
kind: string;
structure_fields: [string, efmt][];
constructors: [string, efmt][];
}
Where efmt is defined as follows ('c' is a concatenation, 'n' is nesting):
type efmt = ['c', efmt, efmt] | ['n', efmt] | string;
...and we see the decls are actually marshalled by this function
meta def decl_info.to_json : decl_info → json
| ⟨name, is_meta, args, type, doc_string, filename, line, attributes, equations, kind, structure_fields, constructors⟩ :=
json.object [
("name", to_string name),
("is_meta", is_meta),
("args", json.array $ args.map $ λ ⟨b, s⟩, json.object [("arg", s.to_json), ("implicit", b)]),
("type", type.to_json),
("doc_string", doc_string.get_or_else ""),
("filename", filename),
("line", line),
("attributes", json.of_string_list attributes),
("equations", equations.map efmt.to_json),
("kind", kind),
("structure_fields", json.array $
structure_fields.map (λ ⟨n, t⟩, json.array [to_string n, t.to_json])),
("constructors", json.array $
constructors.map (λ ⟨n, t⟩, json.array [to_string n, t.to_json]))]
Stu B22 (Jun 06 2021 at 01:13):
It appears that the output text contains a lot of 3-byte unicode characters beginning with hex EE.
The hex sequence EE 80 80
occurs a lot, similarly EE 80 81
and EE 80 82
.
It seems these chars are from the Private Use Area range of unicode. https://en.wikipedia.org/wiki/Private_Use_Areas
Stu B22 (Jun 06 2021 at 01:18):
I get the impression that these special chars are being used by the doc-gen export_json.lean
to mark-up the efmt
text output, in a way that is useful to the downstream doc-gen python scripts.
Stu B22 (Jun 06 2021 at 01:31):
It looks like most of these occurrences are delimiting the occurrences of (linkable?) symbols.
Eric Wieser (Jun 06 2021 at 06:59):
Does this help? https://github.com/leanprover-community/doc-gen/blob/7e76fe3e93fa96aa6332d3e6cbcfacd168f59d4b/print_docs.py#L354
Eric Wieser (Jun 06 2021 at 07:00):
There is no such thing as a "3 byte unicode character"; what you're describing is a unicode codepoint whose UTF-8 encoding has 3 bytes.
Stu B22 (Jun 06 2021 at 19:25):
Yes, that link helps, thanks, and good point about the unicode definitions.
Last updated: Dec 20 2023 at 11:08 UTC