Zulip Chat Archive

Stream: general

Topic: Blueprint for Lean 4


Terence Tao (Nov 13 2023 at 19:15):

Hi all,

I am thinking of starting a new project to formalize a recent paper of myself and coauthors https://arxiv.org/abs/2311.05762 (I already have started a bare-bones github for this at https://github.com/teorth/pfr ), with aim to showcase more the collaborative aspect of formalization than I did in my previous project (which was primarily solo author). On talking with Bhavik Mehta (who has kindly agreed to join the project), he suggested first developing a blueprint for the proof, but he mentioned that existing blueprint software might be geared more towards Lean3 rather than Lean4. What is the current status of blueprint for Lean4, and what recommendations would people have for setting this up?

Jireh Loreaux (Nov 13 2023 at 19:18):

This thread is relevant.

Terence Tao (Nov 13 2023 at 19:25):

A related question - would the Lean Zulip be an appropriate location to host a stream to publicly coordinate such a project? I can also use e.g., my personal blog or the github comments for this, but there are some features on this zulip that make it convenient to discuss Lean code in particular.

Kevin Buzzard (Nov 13 2023 at 19:25):

My understanding is that the FLT-regular project here is now using Lean 4's version of the blueprint software.

Kevin Buzzard (Nov 13 2023 at 19:27):

Re a stream: I should think that this Zulip would be a great place to host a stream for this project. We did this for the liquid tensor experiment and it worked very well. Another advantage of a stream is that you can have multiple threads to have many different conversations at once.

Riccardo Brasca (Nov 13 2023 at 19:28):

Yes, the flt-regular blueprint has been set up by @Chris Birkbeck and @Utensil Song

Bhavik Mehta (Nov 13 2023 at 19:45):

Terence Tao said:

A related question - would the Lean Zulip be an appropriate location to host a stream to publicly coordinate such a project?

The sphere eversion project, liquid tensor experiment and FLT-regular project all have dedicated streams for them which in each case worked very well, so I think this would be a very useful thing to do.

Johan Commelin (Nov 13 2023 at 20:03):

@Terence Tao If you give us a name for the stream, a one-line description (optional), and indicate whether you want the stream to be private or public, then we can easily create the stream for you.

Terence Tao (Nov 13 2023 at 20:13):

OK. If it doesn't disrupt the user experience for other Zulip users then I don't see any harm in creating a stream now, even if it might start off rather low volume as the project is only just now getting started. For a title, one can write "Polynomial Freiman-Ruzsa conjecture", and a short description would be "Project to formalize the Polynomial Freiman-Ruzsa conjecture (PFR) in Lean4".

Chris Birkbeck (Nov 13 2023 at 20:13):

Riccardo Brasca said:

Yes, the flt-regular blueprint has been set up by Chris Birkbeck and Utensil Song

I'm also happy to answer any questions about setting up the blueprint (although it really was @Utensil Song that did the heavy lifting)

Johan Commelin (Nov 13 2023 at 20:17):

The new stream is: #Polynomial Freiman-Ruzsa conjecture

Terence Tao (Nov 13 2023 at 20:39):

OK, I guess any discussion about that project can move to that stream now. Thanks!

Utensil Song (Nov 14 2023 at 02:12):

Chris Birkbeck said:

Riccardo Brasca said:

Yes, the flt-regular blueprint has been set up by Chris Birkbeck and Utensil Song

I'm also happy to answer any questions about setting up the blueprint (although it really was Utensil Song that did the heavy lifting)

Me too. The heavy lifting is actually done by doc-gen4 and the original blueprint, the Lean 4 version of blueprint is a very lightweight solution.

Some related info:

Patrick Massot (Nov 14 2023 at 02:41):

What is it so slow?

Utensil Song (Nov 14 2023 at 03:10):

The most time consuming parts are installing TeXLive full (5 min) and building doc-gen4 for Mathlib and its deps

Utensil Song (Nov 14 2023 at 03:13):

See https://github.com/YaelDillies/LeanAPAP/actions/runs/6839921356/job/18598636544 for a breakdown

Patrick Massot (Nov 14 2023 at 03:37):

Oh, that's doc-gen4 timing! I thought you were talking about plaTeX.

Henrik Böving (Nov 14 2023 at 07:31):

Also with respect to doc-gen speed, if this speed is too slow we have a little issue. After the last round of improvements i did a bit of profiling again and it spends the vast majority of its time in compiler APIs now, then a bit of time initializing the Unicode library and lastly a small bit of time on its own.

There is already some work happening on the Unicode one but unless we do some funky stuff in the compiler doc-gen is probably not going to make as large leaps in performance anymore as it used to. (We were in the multi hours back in the day, now its 15 min on the hoskinson machines)

Yaël Dillies (Nov 14 2023 at 08:32):

I am going to set up a blueprint for the project this afternoon.

Sky Wilshaw (Nov 15 2023 at 12:54):

By the way, I plan to port @Utensil Song's branch of leanblueprint to Lean itself, which would completely remove the only remaining python dependency in most Lean 4 projects. My goal is to just write my own TeX parser (for some limited subset of TeX that we actually use in blueprints), and do basically the same processing as in the python code.

Sky Wilshaw (Nov 15 2023 at 12:55):

I'll properly start writing the code after term's finished at the end of the month.

Patrick Massot (Nov 15 2023 at 13:11):

I think you are very much underestimating the difficulty of writing a TeX compiler.

Sky Wilshaw (Nov 15 2023 at 13:12):

I don't intend to support all of TeX, of course - just the idioms that we commonly use in blueprints. I'm going to use MathJax to display things like equations.

Johan Commelin (Nov 15 2023 at 13:17):

I am strongly in favour of keeping plasTeX as a dependency.

Johan Commelin (Nov 15 2023 at 13:18):

"just the idioms that we commonly use in blueprints" is not very well-defined.

Henrik Böving (Nov 15 2023 at 13:25):

Sky Wilshaw said:

By the way, I plan to port Utensil Song's branch of leanblueprint to Lean itself, which would completely remove the only remaining python dependency in most Lean 4 projects. My goal is to just write my own TeX parser (for some limited subset of TeX that we actually use in blueprints), and do basically the same processing as in the python code.

I think you should just leave it be as is right now. @David Thrane Christiansen has some stuff regarding documentation generation etc. coming up in Lean that will hopefully take care of this type of stuff.

David Thrane Christiansen (Nov 15 2023 at 13:37):

I will be reaching out when the system is a bit more complete to find out how to make sure we can serve these use cases - thanks for the ping!

Patrick Massot (Nov 15 2023 at 14:11):

Sky Wilshaw said:

I don't intend to support all of TeX, of course - just the idioms that we commonly use in blueprints. I'm going to use MathJax to display things like equations.

Ok, let me reformulate: I think you are very very much underestimating the difficulty of writing a TeX compiler that is good enough for the idioms we use in blueprints.

Shreyas Srinivas (Nov 15 2023 at 14:23):

Would it be useful to write this on top of pandoc?

Jireh Loreaux (Nov 15 2023 at 14:26):

What's the point? Why not just call TeX (or whatever variant is currently being used)?

Shreyas Srinivas (Nov 15 2023 at 14:28):

they give you an AST in a suitable algebraic data type?

Patrick Massot (Nov 15 2023 at 14:30):

plasTeX already gives you something like an AST (although it's closer to a parsed XML document).

Patrick Massot (Nov 15 2023 at 14:32):

The main issue with plasTeX is that it was started almost 25 years ago and the original author stopped maintaining it long ago. So a large part of the codebase is very obscure and very far away from modern python coding standards. But, to my knowledge, it is still the best option you have if you want to process TeX code and do something else than producing a pdf-like output.

Shreyas Srinivas (Nov 15 2023 at 14:34):

If you want it in python see : https://boisgera.github.io/pandoc/

Patrick Massot (Nov 15 2023 at 14:34):

Rewriting it in fully modern python or a fancier programming language would be a huge task. And frankly I think it is not worth the trouble. I think it would be much better to work on a better input language and keep only the math mode of TeX.

Shreyas Srinivas (Nov 15 2023 at 14:35):

you mean markdown with math?

Patrick Massot (Nov 15 2023 at 14:35):

Something like markdown with math, but markdown is clearly not powerful enough.

Patrick Massot (Nov 15 2023 at 14:35):

As far as I know pandoc is very far away from being able to parse a LaTeX file.

Sky Wilshaw (Nov 15 2023 at 14:35):

There's also https://typst.app/.

Sky Wilshaw (Nov 15 2023 at 14:36):

Although I don't think it really supports anything other than PDF export at the moment.

Patrick Massot (Nov 15 2023 at 14:37):

There various experiments like this. The thing is that LaTeX code should be at most an intermediate language between a better input language and the pdf output.

Patrick Massot (Nov 15 2023 at 14:37):

So writing a LaTeX parser is not a good idea in my opinion.

Shreyas Srinivas (Nov 15 2023 at 14:39):

Patrick Massot said:

There various experiments like this. The thing is that LaTeX code should be at most an intermediate language between a better input language and the pdf output.

latex is not a very good IR. It is too complex.

Patrick Massot (Nov 15 2023 at 14:39):

We should have a good input language with a parser to a nice AST and then backends that turn this AST into either a LaTeX file to be compiled to pdf or a web page (or epub). The AST could also be used to manipulate the input, collect statistics, create graphs and all those things that plasTeX allows on top of rendering to whatever language you want (html is not baked into plasTeX, you can write renderers to other targets).

Patrick Massot (Nov 15 2023 at 14:40):

Sure, it is an awful IR, that's why we need some nice AST.

Patrick Massot (Nov 15 2023 at 14:40):

In my dream LaTeX would be an implementation detail of the AST to pdf pipeline.

Sky Wilshaw (Nov 15 2023 at 14:41):

As far as I can tell, pandoc already does this (or something very similar).

Patrick Massot (Nov 15 2023 at 14:41):

Just like assembly code can be an intermediate between nice code and machine code.

Shreyas Srinivas (Nov 15 2023 at 14:41):

Yeah I agree @Sky Wilshaw . Pandoc does a very good job with latex files. What features are you missing in pandoc @Patrick?

Patrick Massot (Nov 15 2023 at 14:42):

My knowledge of pandoc may be outdated. I know I tried it before plasTeX but that was in 2010.

Shreyas Srinivas (Nov 15 2023 at 14:42):

Pandoc is drastically better than it used to be.

Sky Wilshaw (Nov 15 2023 at 14:42):

Pandoc's first commit was only in 2006, so I think there's quite a lot you were missing out on!

Patrick Massot (Nov 15 2023 at 14:43):

And now I feel old.

Shreyas Srinivas (Nov 15 2023 at 14:43):

MathML is not a bad language for web output output as far as I know. Arxiv even has an experiment with this. Further if you really just want to convert latex to web output, using KaTeX as a server is an option these days, so you can skip the whole parsing bit and just run a katex server in the background.

Sky Wilshaw (Nov 15 2023 at 14:44):

For what it's worth, all of the various standards we've been discussing are intercompilable with modern pandoc.

Patrick Massot (Nov 15 2023 at 14:44):

Wait, I miscalculated. I tried pandoc in 2014.

Patrick Massot (Nov 15 2023 at 14:44):

MathML is definitely not an human-input language.

Patrick Massot (Nov 15 2023 at 14:45):

And I don't think KaTeX or MathJax can parse a full TeX file, they only do the math fragment. I'm not complaining about the math fragment of TeX.

Shreyas Srinivas (Nov 15 2023 at 14:45):

Then pandoc is your best bet

Patrick Massot (Nov 15 2023 at 14:46):

I don't have time for this but I would be interested if someone can run pandoc on the LTE or Sphere eversion blueprints.

Patrick Massot (Nov 15 2023 at 14:50):

Before dreaming too much about getting rid of LaTeX, you should know that mathematicians have been badly trained at micromanaging their pdf output for 40 years, and you would have a very hard time convincing them to switch to a higher level input language where they can't control precisely the pdf output. That's the real obstacle.

Sky Wilshaw (Nov 15 2023 at 14:52):

I'm not hoping to get rid of LaTeX by any means - but blueprints are to be displayed on the web, so it makes sense to me to design blueprints to be rendered in a web-first manner.

Patrick Massot (Nov 15 2023 at 14:52):

Blueprints are also rendered in pdf!

Sky Wilshaw (Nov 15 2023 at 14:53):

At least in my experience, the fact that blueprints also compile to TeX is a nice bonus, but not where I actually spend time.

Sky Wilshaw (Nov 15 2023 at 14:53):

You lose all interactivity when you do the compilation so it makes it less convenient to work with.

Patrick Massot (Nov 15 2023 at 14:53):

If you work on a serious project then the pdf output is also useful.

Patrick Massot (Nov 15 2023 at 14:53):

By serious project I mean that you need to spend time to understand the mathematical content.

Shreyas Srinivas (Nov 15 2023 at 15:44):

If you want to get into specific details of the latex code at a high level, there's this : https://pypi.org/project/TexSoup/
But like webscraping, I can't imagine latex scraping is truly reliable and not brittle to small changes.

Utensil Song (Nov 16 2023 at 06:18):

With the coming upgrade of Lean documentation system using meta-programming envisioned by @David Thrane Christiansen , my understanding of the situation is that the investment into tools like blueprint is better kept to as lightweight as possible, in the hope that they will be integrated into the new Lean documentation system, as an interactive and drill-down-able super set of what's available in Isabelle (e.g. the proof outline and proof document generated for formal proofs on AFP that mixes math and lean proof naturally).

Typst is great on its own, but it's a completely separated ecosystem, and provides no native support for transition from LaTeX. As for an effort to rewrite the core engine of LaTeX parsing but reuse most other tools and packages in the LaTeX ecosystem, see Tectonic. Although I use and like them both, I'm still skeptical about their long-term survival.

David Thrane Christiansen (Nov 16 2023 at 06:26):

Typst looks like a really nice typesetting language!

The system that I'm working on (literally, as I take a break to answer this comment) is not a typesetting language like LaTeX, but rather a higher-level DSL built on top of Lean. The syntax is very close to Markdown, but not 100% compatible for various technical reasons (existing Markdown languages are also not 100% compatible with each other, so I think this is likely to be within the realm of the acceptable). There is an extension mechanism, similar to roles or directives in Sphinx, that allows the documentation language to get new features for specific applications like Blueprint - unlike LaTeX, where macros produce new source code strings, these extensions will work on the level of document ASTs.

This means that Lean libraries can provide documentation features, with full access to the Lean metaprogramming API. This is a fairly general framework, and a key feature will be the ability to specify a "genre" for a document. A genre is a collection of documentation features (index, figures, equations, embedded Lean, etc) along with a way to render it to some output formats of interest, like HTML or TeX or EPUB or Typst or serialized Pandoc ASTs or whatever else the genre's author needs.

After reading up on what people do with Lean and documentation today, and getting some great guidance from @Patrick Massot, I believe that this system will work for our needs here - I'll certainly revise it until it does!

My goal right now is to get something useable in your hands ASAP - so I'll head back to that now :-)

Siddhartha Gadgil (Nov 16 2023 at 08:11):

May not be relevant, but a reasonable enhanced markdown is https://github.com/Mathpix/mathpix-markdown-it.
This is the output format of Nougat which is OCR for PDF files. Compiling LaTeX to pdf and then extracting may be roundabout but is an option.


Last updated: Dec 20 2023 at 11:08 UTC