Zulip Chat Archive

Stream: metaprogramming / tactics

Topic: writing large files


Patrick Massot (May 31 2021 at 08:05):

Once again I'm back to trying to get a reliable leancrawler, to explore the liquid tensor experiment code. But as usual I'm hitting completely random failure to handle large datasets. I tried using tactic.traceand redirecting output to a file and I tried using the io monads. But in all cases some data simply disappear. It's very clear with io where the file simply stop in the middle a line. With trace there are missing lines. I'd be really grateful if someone competent could have a look. I put a compiled LTE at https://www.imo.universite-paris-saclay.fr/~pmassot/liquid.tar.bz2, together with stats.lean in the project root folder to see what I mean. That version tried the io with list split version that is used in dc-gen.

Patrick Massot (May 31 2021 at 08:06):

The challenge is to get a well-formed yaml file containing 73954 lines starting withName: ".

Jason Rute (May 31 2021 at 13:26):

I often use the --json flag for this sort of thing. Does that help?

Jason Rute (May 31 2021 at 13:32):

Although if both tracing and IO both loose data, that probably won't help.

Jason Rute (May 31 2021 at 13:34):

Another thing I've run across is that compiling Lean takes a lot of memory. (Are you tracing during compiling?) If your machine doesn't have enough memory, Lean might crash halfway through and you will be missing data. Have you checked that it is exiting properly? (If running from a python process you can look at the return code.)

Jason Rute (May 31 2021 at 13:43):

Forget about my compiling comment. I see you are just looping over the environment. (I got so caught up in the hacks I did to make the lean proof recording project work, that I forget there are other more standard ways to get information out of Lean.) I guess my comment about Lean exiting halfway through could still be true though.

Jason Rute (May 31 2021 at 14:17):

You may want to look at https://github.com/jesse-michael-han/lean-step-public. It does something similar, extracting a similarly large amount of data from the lean environment. I'm not aware of similar issues with missing data. It is a big project, so you may want to ask @Jesse Michael Han for pointers.

Jesse Michael Han (May 31 2021 at 14:38):

iirc that repository writes huge amounts of JSON-formatted data (up to thousands of datapoints per declaration for all the tens of thousands of declarations in mathlib) so you could probably copy-paste that code and then instantly convert it to yaml afterwards

Patrick Massot (May 31 2021 at 16:15):

Thanks, I'll have a look.

Eric Wieser (May 31 2021 at 19:54):

I remember I proposed a change to doc-gen to have it write the json to a file instead of stdout, but it was rejected as being too slow.

Eric Wieser (May 31 2021 at 19:56):

Perhaps that's not as relevant as I think it is. Gabriel alluded to a problem in this comment: https://github.com/leanprover-community/doc-gen/pull/99#issuecomment-730216929


Last updated: Dec 20 2023 at 11:08 UTC