Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Nougat, academic OCR model from Meta AI

Junyan Xu (Aug 28 2023 at 15:14):

Model: https://github.com/facebookresearch/nougat
Paper: https://arxiv.org/abs/2308.13418

Nougat: Neural Optical Understanding for Academic Documents
Scientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (Neural Optical Understanding for Academic Documents), a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human-readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition.

h/t https://twitter.com/_akhaliq/status/1696107275570843667

Discussion about images of academic papers/books as data source for AI started here.

Utensil Song (Aug 28 2023 at 16:59):

Tested it with 28ch.pdf and smshort.pdf; it produces results comparable to MathPix (I'm a yearly paying user), which is quite impressive. It's able to handle something like


but it is slightly less robust than MathPix, as it would produce invalid stuff like \. and \end{smallmatrix}, and it is confused by the following matrices:


(Nougat produces "misplaced &", MathPix handles it correctly)


(Nougat produces misaligned matrices; MathPix gave up and produced an image for it instead, which suggests that it has some kind of process to verify the result and, when the result is invalid or not close enough, falls back to emitting an image.)
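
To illustrate what a verification step of that kind could look like, here is a minimal sketch of one cheap check: confirming that every \begin{...} in the generated LaTeX has a matching \end{...}. This is only an assumption about what such a check might involve, not MathPix's or Nougat's actual method, and the function name `unbalanced_environments` is hypothetical.

```python
import re

# Matches \begin{name} or \end{name}, capturing the kind and the environment name.
ENV_TOKEN = re.compile(r"\\(begin|end)\{([^}]*)\}")


def unbalanced_environments(latex: str) -> list[str]:
    """Return a list of problems found; an empty list means the environments balance.

    Hypothetical helper for illustration only.
    """
    problems = []
    stack = []  # names of environments opened but not yet closed
    for match in ENV_TOKEN.finditer(latex):
        kind, name = match.group(1), match.group(2)
        if kind == "begin":
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            # An \end with no matching \begin on top of the stack.
            problems.append(f"unexpected \\end{{{name}}}")
    # Anything left on the stack was never closed (innermost first).
    problems.extend(f"unclosed \\begin{{{name}}}" for name in reversed(stack))
    return problems
```

A converter could run a check like this on each formula and fall back to emitting an image whenever the returned list is non-empty. It would catch a stray \end{smallmatrix}, but not a "misplaced &" or a misaligned matrix, which need an actual LaTeX compile or a visual comparison.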

Jason Rute (Aug 28 2023 at 20:30):

@Junyan Xu interesting but shouldn’t this be a new thread?

Notification Bot (Aug 29 2023 at 00:05):

3 messages were moved here from #Machine Learning for Theorem Proving > LLM+ITP(+AZ)? by Junyan Xu.

Min-Hsien Weng (Aug 31 2023 at 00:11):

A language model that can recognize mathematical texts is very useful for reasoning. I could not help but notice the repetition problem described in Section 5.4: the model tends to repeat its previous sentences.

We notice that the model degenerates into repeating the same sentence over and over again. The model can not recover from this state by itself. In its simplest form, the last sentence or paragraph is repeated over and over again.
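
A minimal sketch of how this simplest form of the failure could be flagged after generation: count how many times the final sentence is repeated at the end of the output. This is a hypothetical heuristic for illustration, not the detection method used in the Nougat paper; the function name and the naive sentence splitting are assumptions.

```python
import re


def trailing_repeat_count(text: str) -> int:
    """Count how many times the last sentence repeats consecutively at the end.

    Hypothetical helper: sentences are split naively on whitespace that
    follows '.', '!' or '?', which is crude for text containing formulas.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [s.strip() for s in parts if s.strip()]
    if not sentences:
        return 0
    last = sentences[-1]
    count = 0
    # Walk backwards until a sentence differs from the last one.
    for sentence in reversed(sentences):
        if sentence != last:
            break
        count += 1
    return count
```

A page whose count exceeds some small threshold (say 3, an arbitrary choice) could be flagged for re-processing or manual review.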

Another study, "Learning to Break the Loop: Analyzing and Mitigating Repetitions for Neural Text Generation" (https://arxiv.org/pdf/2206.02369.pdf), discusses why sentences get repeated. Words that have already been repeated seem to have a higher probability of appearing in the next generated text....

There were similar threads on Reddit about ChatGPT repeating itself. A common suggestion is to use the OpenAI API's frequency_penalty parameter to help control the number of repeated words in the generated text. Interesting!
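
As a rough illustration of the idea behind a frequency penalty: before sampling the next token, each candidate's score is lowered in proportion to how many times that token has already been generated. The sketch below is an assumption-laden toy, not OpenAI's implementation; the function name and the dict-of-scores representation are hypothetical.

```python
from collections import Counter


def apply_frequency_penalty(
    logits: dict[str, float], generated: list[str], penalty: float
) -> dict[str, float]:
    """Lower each token's score by penalty * (times the token was already generated).

    Hypothetical helper: real decoders work on tensors of token ids,
    not dicts keyed by token strings.
    """
    counts = Counter(generated)  # tokens never generated count as 0
    return {token: score - penalty * counts[token] for token, score in logits.items()}
```

With a positive penalty, a token that has already appeared many times becomes steadily less likely, which makes it harder for the decoder to stay stuck in a loop; a penalty of 0 leaves the scores unchanged.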

Patrick Nicodemus (Sep 02 2023 at 16:24):

This looks really cool. Great paper!
Regarding the markup backend, in general the concerns you point out in your introduction about machine processing of information (for semantic analysis, search and retrieval, machine learning, etc.) have been studied by the KWARC group.
In terms of making connections with existing work, I think that it could be a really good opportunity if you were to look into their stuff and see if there is anything you can connect with them on.
Michael Kohlhase has been working for many years on an XML extension called OpenMath/OMDoc, which is intended to store additional semantic content of mathematical symbols, equations, etc. for better corpus analysis, searching, machine processing and so on. There is also MathML, which is an industry standard adopted by the W3C.

Junyan Xu (Dec 01 2023 at 20:39):


Marker: a PDF-to-Markdown converter that is 10x faster than Nougat, more accurate outside arXiv, and has low hallucination risk. Marker is optimized for throughput, for tasks like converting LLM pretraining data. https://github.com/VikParuchuri/marker

Last updated: Dec 20 2023 at 11:08 UTC