Zulip Chat Archive

Stream: general

Topic: Dataset of “info” command over all of mathlib

Jason Rute (Feb 14 2020 at 22:59):

I’m interested in running the lean server “info” command for all text positions in all mathlib files. Before I go about doing this, has anyone else attempted this? Does the dataset exist somewhere? How about the code? Such information would be useful as a machine learning training data set.

Mario Carneiro (Feb 15 2020 at 02:33):

I think @Patrick Massot did something similar for the tactic tex-ifier

Jason Rute (Feb 15 2020 at 03:23):

Do you mean https://github.com/leanprover-community/format_lean? That is sort of where I got the idea (through @Stanislas Polu). However, if all I want to do right now is grab the raw data, it might be easier for me to just do it from scratch.

Jason Rute (Feb 15 2020 at 03:35):

I would be curious for any tips however. It seems that the Lean server is extraordinarily finicky.

Mario Carneiro (Feb 15 2020 at 03:37):

is there more to the problem than you've stated? It seems pretty simple (if expensive) to me

Jason Rute (Feb 15 2020 at 03:55):

It is pretty much as I said. I think in the end I will try to clean up the data, and I probably want to play with the pretty printer (like using pp.all set to true). I may even want to go back and try to inject my own tracing tactic into the proofs to pull out more information, but for now I just want to see how well this idea works. It should give me a lot of data about Lean proofs.

Mario Carneiro (Feb 15 2020 at 04:05):

I'm almost certain that pp.all on every character of every file is more than you can store, because you are likely to hit one of the printing worst cases somewhere.

Mario Carneiro (Feb 15 2020 at 04:06):

Note that asking only for "info" on every character is going to miss a lot of proofs, since you get no info about term proofs, and these make up a significant fraction of mathlib proofs

Jason Rute (Feb 15 2020 at 15:55):

@Mario Carneiro :

I shouldn't need to run on all characters, just all "words". I could also do other things to compress the dataset if needed.
pp.all maxes out at a certain (smallish) number of characters and replaces parts with .... That itself is not ideal, but I can at least get a sense for how often this happens.
I'm mostly interested right now in tactic proofs. I do agree, however, I'm going to miss all the term proofs.
More upsetting, I'm not going to get separate results for all the tactic steps combined by ; since "info" doesn't print the goals separately.
Unless this is a massively expensive computation, I can always iterate to improve this data collection method. (The next step would be to insert a custom tracing tactic into the code at certain spots, but I'm afraid this will take a lot longer to run. I could also hack the lean server code, but that involves learning C++...)

Last updated: May 02 2025 at 03:31 UTC