Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: tokenization of latex for use in sequence models

Stan111 (May 22 2023 at 15:12):

I have been experimenting with GPTs trained on latex texts. The tokenization does not respect the semantics of the text. While we know that the paper structure is driven by the logic of the proofs that underlie the theorems. It seems to be reasonable to seek a version of attention mechanism based on the math symbols rather than statistics of bytes ( as it is done now, using tools like SentencePiece). I.e., I am looking for something like combination of attention and Syntax Tree ( in proof theory sense), as applied to math texts as usually found on arxiv. Was there work in this direction, and if so what are the references?

Notification Bot (May 22 2023 at 15:48):

This topic was moved here from #maths > tokenization of latex for use in sequence models by Kyle Miller.

Kyle Miller (May 22 2023 at 15:51):

@Stan111 I moved your message to this stream since it seems like it's a better match (#maths is generally for discussions about formalizing pure math)

Fabian Glöckle (May 22 2023 at 16:20):

I guess you are thinking of something like this: https://arxiv.org/pdf/1905.10006.pdf
But as it turns out transformers with dense attention (to all other tokens) generalize this and are very good on syntactical tasks, so there is no need to spend too much effort on this part.
Clearly, tokenizers based on the statistics of mathematical text would be more economic for mathematical texts, but the statistics should be optimized on the pretraining distribution where >99% of the computation is spent.

Last updated: May 02 2025 at 03:31 UTC