Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Advice on using PDFs with LLMs (e.g. math papers)?

Daniel Windham (Nov 26 2024 at 20:46):

I'm looking to extract formal specifications from PDFs that specify cryptography algorithms in natural language, math, and pseudocode. I don't care about diagrams or other images. I assume that the LaTeX files that generated the PDFs are the right starting point if at all possible, but in some cases I only have access to the PDFs.

Do folks here have experience working with PDF inputs in machine learning? Any advice on how to approach this and how much work to expect?

Thanks!

Kaiyu Yang (Nov 26 2024 at 20:49):

mathpix works well for me.

Daniel Windham (Nov 26 2024 at 20:53):

Great. Any sense of how reliably mathpix creates correct LaTeX, and what edge cases it fails on?

E.g. when you use mathpix, do you end up manually double-checking the conversions, or is it good enough that you trust it out of the gate?

Siddhartha Gadgil (Nov 27 2024 at 09:06):

There is also Nougat.

Daniel Windham (Dec 03 2024 at 15:03):

Here's my early experience with mathpix. @Kaiyu Yang, @Siddhartha Gadgil, and others, I'd love to know how this matches your experience or compares to other tools like nougat.

✔ Text, math, and special-character translation are very good. Minor mistakes happen, but they’re rare. Both inline and block math expressions work well.
- ? Complicated math notations (summations, long division, etc.) weren't in my test set, so I haven’t tested this.
✔ Image translation is pretty good.
✘ Tabular data translation is very buggy. It’s common to lose and/or make up data, and the generated table layout is usually wrong.
✘ Code-formatted data translation is poor. This seems like a special case of struggling with formatted text. The generated LaTeX routinely has invalid format strings that get rendered as text.
✘ Full-PDF conversion led to a few pages being entirely dropped. It seems like this was triggered by large images on preceding pages.
✘ Page layout and text formatting are poorly translated. For example, hyperlinks are always lost, and the generated LaTeX doesn’t describe document hierarchy well.
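
Some of the failure modes above (broken table layout, invalid formatting commands) tend to leave structural damage in the generated LaTeX, so part of the double-checking could be automated before reading anything by hand. Below is a minimal sketch of that idea, not something mathpix or nougat provides: the function names `check_environments` and `braces_balanced` are hypothetical, and the checks are simplified (they do not skip `%` comments or verbatim blocks, and they cannot detect made-up data or dropped pages).

```python
import re

# Matches \begin{name} and \end{name}; the environment name is group 2.
ENV_RE = re.compile(r"\\(begin|end)\{([^}]*)\}")


def check_environments(latex: str) -> list[str]:
    """Return human-readable problems with \\begin/\\end pairing (hypothetical helper)."""
    problems = []
    stack = []  # names of environments that are currently open
    for match in ENV_RE.finditer(latex):
        kind, name = match.group(1), match.group(2)
        if kind == "begin":
            stack.append(name)
        elif not stack:
            problems.append(f"\\end{{{name}}} without matching \\begin")
        elif stack[-1] != name:
            problems.append(f"\\end{{{name}}} closes \\begin{{{stack[-1]}}}")
            stack.pop()
        else:
            stack.pop()
    # Anything still on the stack was opened but never closed.
    problems.extend(f"\\begin{{{name}}} never closed" for name in stack)
    return problems


def braces_balanced(latex: str) -> bool:
    """Check that unescaped { and } are balanced (hypothetical helper)."""
    depth = 0
    i = 0
    while i < len(latex):
        ch = latex[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False  # a closing brace appeared before its opener
        i += 1
    return depth == 0
```

Running both checks on each converted page or section would flag output worth a manual look; a clean result only means the LaTeX is structurally plausible, not that the content matches the PDF.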

I got mathpix's PDF-to-LaTeX conversion working, but I haven't been able to get image-to-LaTeX (e.g. with a Snipped screenshot) to work at all. I sent mathpix a support ticket yesterday and haven't heard back yet.

Kaiyu Yang (Dec 03 2024 at 15:29):

I have only tried text, math, and special characters. It works pretty well for me on that content.

Siddhartha Gadgil (Dec 03 2024 at 15:31):

I have never used it at scale. I used Nougat for larger documents.

Last updated: May 02 2025 at 03:31 UTC