Zulip Chat Archive

Stream: Equational

Topic: Machine learning, first results


Jose Brox (Oct 18 2024 at 21:50):

Today we managed to produce our first deep learning model for the implications. This is joint work with Fernando Vaquerizo (who for the time being prefers to leave the Zulip interaction, etc., to me).

Our first ideas were along the lines of applying natural language processing (NLP) to the problem, which would mean tokenization + vectorization before training some network, since we think that using a network pretrained on natural language would give poor results.

But to calibrate the difficulty we have started with just tokenization (no vectorization) and a small convolutional neural network trained for only a few epochs.

The results have been surprisingly good! With 20% of the data held out for testing, the accuracy is 99.44% (meaning that the model misclassifies only around 123 thousand of the 22 million implications).
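For reference, the kind of setup we mean is roughly the following (a minimal PyTorch sketch; the vocabulary size, sequence length, and layer sizes are illustrative, not our exact model):

```python
import torch.nn as nn

VOCAB_SIZE = 32   # operator, '=', variable symbols, padding, ... (illustrative)
SEQ_LEN    = 64   # padded length of a tokenized (LHS, RHS) pair (illustrative)

class SmallCNN(nn.Module):
    """Small 1D CNN that classifies a tokenized pair as implication / non-implication."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, 16)
        self.conv = nn.Sequential(
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.head = nn.Linear(32, 2)

    def forward(self, tokens):                   # tokens: (batch, SEQ_LEN) integer ids
        x = self.embed(tokens).transpose(1, 2)   # (batch, channels, SEQ_LEN)
        x = self.conv(x).squeeze(-1)             # (batch, 32)
        return self.head(x)                      # (batch, 2) logits
```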

This has even made us doubt the training data (perhaps the file I generated from the GitHub output was wrong?), but as far as I can tell it seems correct.

We will not be able to keep working on this until Monday. Since there is still some room for improvement in this model, at least we have an optimization running in the meantime.

Next steps to improve the accuracy even further include using a deeper CNN, adding vectorization, and tweaking some parameters such as the dimensionality reduction. We will try them next week. I'm also curious to try decision trees and see how they fare against CNNs.

After we find the best possible model, we will use explainable AI (XAI) to try to see what patterns the model is picking up on, and hopefully extract some metatheorems from them. This may take some time.

In case you are wondering: we haven't yet asked the model how it classifies our most interesting questions (Astérix/Obélix, Dupont/Dupond, 5105 is Austin...); we will do so on Monday, hopefully also with a better model.

(I end with an anecdote. This morning, a department colleague asked me:
-- Machine learning, what do you want that for?
-- With it I will get my last paycheck. And the rest of you will never get paid again!
(Half joking? :scream:))

Adam Topaz (Oct 18 2024 at 22:02):

Nice! I should mention that I also did something similar over the last couple of days, except that I built and trained a transformer, not a CNN. My model was trained on 70% of the available implication data, with the other 30% for testing. No training was done on conjectures. The model itself is still training, and I plan to do some more experimentation; I'll share more once that's done.

The current version does predict all of the conjectures from the repo to be nonimplications (as expected). I'm still evaluating the accuracy on the test set, but it seems to be around 97% accurate on that. (Of course, it's probably the case that if one "knows" some random collection of 70% of the implications, then 97% or more of the remaining implications are "obvious".)

As mentioned above, I don't expect much for the unknown implications, but I am attaching the current predictions (see below) for those to this message, just in case. I would be interested to see what percentage of these happen to be correct! The data came from a slightly outdated copy of the repo, and since I extracted the data, two conjectures have been resolved (as non-implications, and hence correctly predicted by the model).

unknown_predictions

Johan Commelin (Oct 19 2024 at 04:08):

So, are these models only predicting whether an implication is true/false, or do they also cook up a proof/counterexample?

Jose Brox (Oct 19 2024 at 05:59):

Johan Commelin said:

So, are these models only predicting whether an implication is true/false, or do they also cook up a proof/counterexample?

In our case, only classifying the implication status. To produce a proof we would presumably need an LLM of high dimension, hence a lot of space and time resources.

Adam Topaz (Oct 19 2024 at 12:12):

Johan, what I did was actually train an encoder transformer to give vector representations associated to any equation -- I did this with about 2000000 randomly generated equations. Those vector embeddings were then fine-tuned to classify whether one implies the other using the implication data, but in principle those embeddings could be used for other tasks as well (whether or not they end up being useful is another question altogether).
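Schematically, the two-stage idea is something like this (purely an illustration, not the actual code; the dimensions are made up):

```python
import torch
import torch.nn as nn

class EquationEncoder(nn.Module):
    """Maps a tokenized equation to a fixed-size vector."""
    def __init__(self, vocab=32, dim=64, nhead=4, layers=2, max_len=64):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead, dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, tokens):                    # tokens: (batch, seq_len) integer ids
        pos = torch.arange(tokens.size(1), device=tokens.device)
        x = self.tok(tokens) + self.pos(pos)
        return self.encoder(x).mean(dim=1)        # (batch, dim) embedding

class ImplicationHead(nn.Module):
    """Fine-tuning head: given the embeddings of two equations, predict implication / non-implication."""
    def __init__(self, dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, emb_a, emb_b):
        return self.mlp(torch.cat([emb_a, emb_b], dim=-1))
```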

Terence Tao (Oct 19 2024 at 14:22):

Given the slogan that machine learning is a form of compression, I am curious as to how large your model is compared to the 22 million bit size of the raw implication graph.

Adam Topaz (Oct 19 2024 at 14:33):

How many bits of information are needed to encode the equations themselves? Do we know that?

Terence Tao (Oct 19 2024 at 14:40):

https://github.com/teorth/equational_theories/blob/main/data/equations.txt is 158KB. The Python script https://github.com/teorth/equational_theories/blob/main/scripts/generate_eqs_list.py to generate it is 4.46KB.

Daniel Weber (Oct 19 2024 at 14:41):

We could post a question at https://codegolf.stackexchange.com/ to see how many bits are really needed

Eric Taucher (Oct 19 2024 at 14:54):

Daniel Weber said:

We could post a question at https://codegolf.stackexchange.com/ to see how many bits are really needed

Out of curiosity, which tag would you use? I did not find one that seemed appropriate, but then again I do not do golfing regularly.

Daniel Weber (Oct 19 2024 at 15:01):

[code-golf] [kolmogorov-complexity] [string] [binary-tree] [abstract-algebra] seem good

Amir Livne Bar-on (Oct 19 2024 at 15:25):

generate_eqs_list.py

Here's a somewhat smaller version that generates exactly equations.txt, with dead code removed. (No change to the algorithm to remove wasteful comparisons.)

Amir Livne Bar-on (Oct 19 2024 at 15:29):

The complexity of the implications matrix is much more interesting, though.

Terence Tao (Oct 19 2024 at 15:30):

Lol I'd love to see how the codegolf community would handle the entire EquationalTheories codebase needed to generate the final implication graph

Eric Taucher (Oct 19 2024 at 15:56):

Daniel Weber said:

[code-golf] [kolmogorov-complexity] [string] [binary-tree] [abstract-algebra] seem good

Ahh

Correct me if this is wrong: your question would be to generate the txt file with as little code as possible.

I was thinking of compressing the data into another format that could be converted back into the txt file.

For example, a long time ago I asked this Code Golf question

https://codegolf.stackexchange.com/questions/112874/enumerate-binary-trees

and was happy to see the idea by @xnor, which used only various brackets to encode the data.

BNF             xnor       Christian   Ben
b(t, b(t, t))   [{}{{}{}}] (0(00))     (1, -1, 1, -1)
b(t, u(u(t)))   [{}{(())}] (0((0)))    (1, -1, 0, 0)
b(u(t), u(t))   [{()}{()}] ((0)(0))    (1, 0, -1, 0)
b(b(t, t), t)   [{{}{}}{}] ((00)0)     (1, 1, -1, -1)
b(u(u(t)), t)   [{(())}{}] (((0))0)    (1, 0, 0, -1)
u(b(t, u(t)))   [({}{()})] ((0(0)))    (0, 1, -1, 0)
u(b(u(t), t))   [({()}{})] (((0)0))    (0, 1, 0, -1)
u(u(b(t, t)))   [(({}{}))] (((00)))    (0, 0, 1, -1)
u(u(u(u(t))))   [(((())))] ((((0))))   (0, 0, 0, 0)

Terence Tao (Oct 19 2024 at 16:05):

Technically, "code to generate the text file" is a format to compress the data.

Eric Taucher (Oct 19 2024 at 16:14):

Terence Tao said:

Technically, "code to generate the text file" is a format to compress the data.

Yes, understood.

The reason I noted it is that code golfers go to amazing lengths to save a byte or character when counting the size of their code, and allowing them to change the representation, as @xnor did, can get them a lower number.

To put it another way, would asking this as a Code Golf question allow for other representations that can be shown to be equivalent, or should all results generate the txt file? Also, should it be posed as two separate questions, one for txt-only output and one that allows another representation? These questions are for @Daniel Weber, as he suggested the idea.


EDIT

The Code Golf task:

Take a single positive integer n as input and output all of the distinct binary trees with n nodes.

The current shortest posted code is 19 characters/bytes, using Pyth:

f!|sTf<sY0._T^}1_1t

Adam Topaz (Oct 19 2024 at 16:29):

I guess if you really want some ML method that would compress this, one could just train some embedding for the ~5000 equations. It would take C * 5000 + D bits to store this for some constants C and D. Such a model wouldn’t be able to say anything about equations that are not part of the list of 5000 though.
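Concretely, something like this (an illustrative sketch; the sizes are made up):

```python
import torch.nn as nn

NUM_EQS = 5000   # roughly the number of equations in the list
DIM     = 32     # per-equation embedding size: the "C * 5000" part, up to bits per weight

class LookupClassifier(nn.Module):
    """Pure lookup-table model: one learned vector per equation plus a small fixed-size head (the "D" part)."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(NUM_EQS, DIM)
        self.head = nn.Bilinear(DIM, DIM, 2)

    def forward(self, eq_a, eq_b):    # integer equation indices
        return self.head(self.emb(eq_a), self.emb(eq_b))
```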

Daniel Weber (Oct 19 2024 at 16:47):

Eric Taucher said:

Terence Tao said:

Technically, "code to generate the text file" is a format to compress the data.

Yes, understood.

The reason I noted it is that code golfers go to amazing lengths to save a byte or character when counting the size of their code, and allowing them to change the representation, as @xnor did, can get them a lower number.

To put it another way, would asking this as a Code Golf question allow for other representations that can be shown to be equivalent, or should all results generate the txt file? Also, should it be posed as two separate questions, one for txt-only output and one that allows another representation? These questions are for Daniel Weber, as he suggested the idea.

I think it would be best to ask for exactly one equation from each equivalence class (where the relation consists of renaming and swapping LHS and RHS), and allowing any reasonable format (which could be strings like a * (b * c) = (a * b) * c, prefix/postfix, any kind of built-in binary tree representation with labelled leaves, etc.)
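For illustration, one way to pick a representative of each class, with expressions written as nested tuples of variable names (a sketch, not a reference implementation):

```python
def relabel(expr, mapping):
    """Rename variables in order of first occurrence; expr is a variable name or a pair (l, r)."""
    if isinstance(expr, tuple):
        return (relabel(expr[0], mapping), relabel(expr[1], mapping))
    if expr not in mapping:
        mapping[expr] = f"x{len(mapping)}"
    return mapping[expr]

def canonical(lhs, rhs):
    """Canonical representative under variable renaming and LHS/RHS swapping."""
    candidates = []
    for a, b in ((lhs, rhs), (rhs, lhs)):
        m = {}
        candidates.append((relabel(a, m), relabel(b, m)))
    return min(candidates, key=repr)

# y * z = y and b = b * a land on the same representative:
# canonical(("y", "z"), "y") == canonical("b", ("b", "a"))
```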

Terence Tao (Oct 19 2024 at 16:49):

It occurs to me that the neural network may largely be reacting to the number of variables present in each law, see the discussion at equational#278 and associated links. As a general rule of thumb, we do not expect equations with fewer variables to imply equations with more variables, for instance.

Jared green (Oct 19 2024 at 16:55):

(except when the new variables are part of a function application to each side of the antecedent)

Amir Livne Bar-on (Oct 19 2024 at 17:32):

If we want to learn new math from the models, it may be worth over-training them until they reach the "grokking" regime. There's evidence that grokking works by finding approximate group structures in the training set.

Adam Topaz (Oct 19 2024 at 19:01):

Amir Livne Bar-on said:

If we want to learn new math from the models, it may be worth over-training them until they reach the "grokking" regime. There's evidence that grokking works by finding approximate group structures in the training set.

Do you have a reference for this? Sounds interesting!

Michael Bucko (Oct 19 2024 at 19:02):

Adam Topaz said:

Amir Livne Bar-on said:

If we want to learn new math from the models, it may be worth over-training them until they reach the "grokking" regime. There's evidence that grokking works by finding approximate group structures in the training set.

Do you have a reference for this? Sounds interesting!

https://arxiv.org/pdf/2201.02177

Amir Livne Bar-on (Oct 19 2024 at 19:04):

Yes, that's the paper where they found this phenomenon. There was a blog post after that which dug into the matrices in the network and found that the embedding uses characters of the group:
https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mechanistic-interpretability-analysis-of-grokking

EDIT: also on arxiv
https://arxiv.org/abs/2301.05217

Adam Topaz (Oct 19 2024 at 19:07):

Thanks. So in this case they explicitly learned a group structure, right? I interpreted your message as saying that there is evidence that grokking in general works by (somehow) finding (approx) group structures.

Amir Livne Bar-on (Oct 19 2024 at 19:13):

They only investigated in detail what happens with modular addition. But the original paper showed that the same phenomenon happens with not-as-nice operations. We can guess that modular division also uses characters. And operations such as x^2+y^2 can conceivably use a similar approach to arrive at sparse matrices. So I'm extrapolating that this happens. I don't know how it handles multiplication in S5, maybe using some higher-dimensional representation? It's all guesswork. But I believe in it, since such an embedding has very low regularization loss.

Jose Brox (Oct 20 2024 at 12:56):

Terence Tao said:

Given the slogan that machine learning is a form of compression, I am curious as to how large your model is compared to the 22 million bit size of the raw implication graph.

Nice question! Our CNN file without any compression weighs 1.43 MB (compare with the 22-million-bit table, 2.75 MB). It is just a 4-layer CNN.

As a first approximation, and when the accuracy is high, it is sound to say that a model is a form of data compression. But the comparison is not completely fair, on three fronts:
1) In a way we are mistaking the output for the machine generating it: the model can generate the data, not just store it (compare the factors of a number with the algorithm that finds them). In this sense it would be more accurate to compare with the Kolmogorov complexity, as already mentioned above. As a rough measure, the zipped file (standard compression) for our CNN weighs 737 kB.
2) The model does not provide lossless compression: some implications are incorrectly derived, so we can think of those as corrupted data in our table. So we would need to compare with machines giving approximate outputs up to some confidence degree.
3) A priori, the model is useful for extrapolation (e.g. for n=5 it could provide guidance on whether, given an implication, one should first run Prover9 or Mace4 on the input, in order to reduce computing time).

Jose Brox (Oct 20 2024 at 13:03):

Terence Tao said:

It occurs to me that the neural network may largely be reacting to the number of variables present in each law, see the discussion at equational#278 and associated links. As a general rule of thumb, we do not expect equations with fewer variables to imply equations with more variables, for instance.

This kind of question is what XAI (explainable AI) should be able to settle. With it we should be able to extract the reasons why the CNN is returning YES or NO on a given pair with a local method, and on clusters of pairs with a global method. Then we may be able to decide whether each pattern established by the model is an actual theorem, or just a heuristic that is better than a random guess for some reason (which should be some kind of probabilistic theorem). But I think there should be several kinds of patterns: anything that the human eye can catch is catchable by a NN, plus subtler things (a priori it is true that our CNN has only 4 layers, but it is also true that it makes only 120 thousand mistakes).
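As a toy example of a local method, one could do something in the spirit of occlusion: mask each token of a pair in turn and see how much the prediction moves (an illustrative sketch, assuming a classifier that maps one padded token sequence to two logits, with padding token id 0):

```python
import torch

def occlusion_scores(model, tokens, pad_id=0):
    """tokens: (seq_len,) integer ids for one (LHS, RHS) pair. Returns, per position,
    how much masking that token changes the predicted probability of 'implication'."""
    model.eval()
    with torch.no_grad():
        base = torch.softmax(model(tokens.unsqueeze(0)), dim=-1)[0, 1]
        scores = []
        for i in range(tokens.size(0)):
            masked = tokens.clone()
            masked[i] = pad_id
            prob = torch.softmax(model(masked.unsqueeze(0)), dim=-1)[0, 1]
            scores.append((base - prob).item())
    return scores
```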

Jose Brox (Oct 20 2024 at 13:03):

In fact, we have now optimized the CNN (run it for more epochs), and the accuracy has gone up to 99.70%, so only 66 thousand wrong implications.

Vlad Tsyrklevich (Oct 20 2024 at 17:05):

Adam Topaz said:

My model was trained on 70% of the available implication data, with the other 30% for testing.

How was the 70%/30% split done? Is it totally randomized, or is there some other method? The implication graph is transitive, so if in the training data the model can see that equation A implies equations 1-100 and equation B implies 200-300, then when asked whether equation A implies B in the test set, it's a pretty good bet that it doesn't. Conversely, the implication is likely to be true if all or most of equation B's implications are a subset of equation A's. I think the way to structure this so that the two sets are independent of each other is to take a cut of the graph, put all nodes on one side of the cut into the training set, and put the edges in the cut and on the other side into the testing set, making them totally independent. Otherwise the model may just be inferring what transitivity means.
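One simple version of this, splitting by equations rather than by individual edges (an illustrative sketch):

```python
import random

def split_by_equations(equations, implications, train_frac=0.7, seed=0):
    """implications: iterable of (eq_a, eq_b, label) triples. Train only on pairs whose
    endpoints both lie in a fixed random subset of equations; everything else goes to test."""
    rng = random.Random(seed)
    train_eqs = set(rng.sample(sorted(equations), int(train_frac * len(equations))))
    train, test = [], []
    for a, b, label in implications:
        (train if a in train_eqs and b in train_eqs else test).append((a, b, label))
    return train, test
```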

Vlad Tsyrklevich (Oct 20 2024 at 17:10):

Also, most of the implication data is extremely redundant. The largest equivalence class (Equation 2) has ~1500 nodes, so since each one implies every other, the edges in the closure of that equivalence class number ~1500^2 ≈ 2.2M, or about a quarter of the total implications. Furthermore, that equivalence class implies all others, so that's another ~1500*(4694-1500) ≈ 4.8M edges. That's 85% of all implications right there. I'm not super familiar with AI/ML and how to structure experiments, so I don't know whether that's an issue or not.

Terence Tao (Oct 20 2024 at 19:11):

I guess one could try to generate a random poset of the same size and equivalence class structure as a comparison (but I actually don't have a reasonable random poset model to propose here).

Vlad Tsyrklevich (Oct 20 2024 at 19:14):

Not sure I follow, what is that comparison used for?

Vlad Tsyrklevich (Oct 20 2024 at 19:17):

Also had an idea that is probably a bit of a silly oversimplification of how NNs really work, but given that an NN can probably easily learn the basic rewriting rules, I wonder how well the performance of an x-level NN matches the reachability of implications using x rewrites.

Adam Topaz (Oct 20 2024 at 19:19):

I’m doing another training run now training on implications only involving a fixed randomly chosen set of 75% of the equations (also with more aggressive regularization).

Vlad Tsyrklevich (Oct 20 2024 at 19:21):

Curious to see if it performs just as well! Would also be interesting to measure the success rate of guessing all 'implication'/'non-implication' for the remaining 25% as a baseline measure for the most naive strategy.

Adam Topaz (Oct 20 2024 at 19:22):

Note that the testing implication set will now include about 50% of all implications.

Terence Tao (Oct 20 2024 at 19:24):

It's to see whether the neural network is really learning specific features of the equational poset, or whether it is giving the same performance it would give on a random poset.

Another option is just to randomly permute the vertices and equation labels of the equational poset and see whether the accuracy of a neural net trained on this shuffled model is noticeably worse.

Jose Brox (Oct 21 2024 at 10:40):

(EDITED because of wrong answer)

For now we have generated a random label (yes/no) for each pair of equations, then trained the CNN with 60/20/20 percentages (training, validation, test) just as before. After 80 epochs (early stopping), the result is 49.99% accuracy.

Daniel Weber (Oct 21 2024 at 10:43):

It should be a poset, not completely random

Jose Brox (Oct 21 2024 at 10:51):

Daniel Weber said:

It should be a poset, not completely random

You are completely right, will fix soon.

Jose Brox (Oct 21 2024 at 11:11):

Related to transitivity, etc.: since we have only 600k explicit implications (2.7% of the total), if the CNN is learning transitivity it should perform well with a really small training dataset. Accordingly, we have taken 5/5/90 percentages (training, validation, test), with a resulting 99.6% accuracy on the test set.

Next we will try 1/1/98 (so fewer implications than we needed to build the whole poset).

Jose Brox (Oct 21 2024 at 11:25):

With 1% of the implications for the training set (1/1/98), the accuracy of the CNN with 4 layers is 99.3%.

Jose Brox (Oct 21 2024 at 11:51):

@Terence Tao
Daniel Weber said:

It should be a poset, not completely random

What if we just use numbers for the labels? We maintain exactly the same poset, just with random numbers as labels (not the equation numbers, which may carry some information because of the order in which the equations are generated). This is easy to generate from the project files. This would be enough, right?

Alex Meiburg (Oct 21 2024 at 14:20):

For graphs, there's a wide variety of models used in practice for random graphs that capture real-world scenarios better than the Erdős–Rényi random graph, like "sample vertices as points in R^n from a mixture of k Gaussians, then add edges with a probability based on their distance". A popular question then is finding a model that seems to fit a given dataset well. (Ideally this would mean maximizing the likelihood of the dataset under the model, but in practice it means looking for something that reproduces a few select statistics well.)

I doubt there are established models like this for posets. (Of course we can talk about taking just the minimal edges and forgetting directedness, but this also loses a lot of information.) But we could probably come up with something.

Alex Meiburg (Oct 21 2024 at 14:21):

For example: how large of an "n" do we need to embed the poset in R^n, equipped with the elementwise ≥ relation?

Alex Meiburg (Oct 21 2024 at 14:22):

And, if we built a random poset by sampling points randomly in [0,1]^n, does our poset look "typical"?
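In sketch form (illustrative; 4694 is roughly the number of equations):

```python
import numpy as np

def random_dominance_poset(num_points=4694, n=3, seed=0):
    """Sample points uniformly in [0,1]^n and order them by elementwise >=.
    Returns the boolean relation matrix R with R[i, j] = (point i >= point j in every coordinate)."""
    rng = np.random.default_rng(seed)
    pts = rng.random((num_points, n))
    return np.all(pts[:, None, :] >= pts[None, :, :], axis=-1)
```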

Alex Meiburg (Oct 21 2024 at 14:23):

Claim: n is at least 3, otherwise Graphiti could give a planar presentation :smile:

Vlad Tsyrklevich (Oct 21 2024 at 14:33):

Seems easier to just segment the implication data set and see how well it performs, unless there's some additional insight we get by sampling random graphs that we don't from looking at the one we have.

Alex Meiburg (Oct 21 2024 at 15:41):

Continuing the earlier thoughts, there are (distinct from the above) notions of Order Dimension, k-Dimension (same article), and Interval Dimension, which are also good measures of how 'complicated' a poset is. It would be great to try to calculate such an embedding. (Maybe start with the sublattice for just the 3-application equations first.)

Daniel Weber (Oct 21 2024 at 15:42):

Alex Meiburg said:

For example: how large of an "n" do we need to embed the poset in R^n, equipped with the elementwise ≥ relation?

The order dimension is this, right?

Alex Meiburg (Oct 21 2024 at 15:49):

(I've been trying to read more about it since I sent the link.) I think so; I think I initially misunderstood it. It's kind of surprising that it's NP-complete to test for order dimension 3, since the "left-right algorithm" sounds so much like it should generalize immediately to higher dimensions.

Adam Topaz (Oct 21 2024 at 18:14):

Vlad Tsyrklevich said:

Curious to see if it performs just as well! Would also be interesting to measure the success rate of guessing all 'implication'/'non-implication' for the remaining 25% as a baseline measure for the most naive strategy.

This seems to again have about 97% accuracy on the testing set, after training for only one epoch. In this case I chose a random set of 75% of the equations, and trained only on the implications whose lhs and rhs are both contained in that fixed set of equations. I'm surprised that the accuracy is that high, to be honest. The model seems to be learning something. I'm going to try again with only 25% of the equations to see what happens.

Vlad Tsyrklevich (Oct 21 2024 at 18:53):

Interesting

Adam Topaz (Oct 21 2024 at 18:59):

To clarify, it ended up being ~97% accuracy both on the implications which involve some equation from the remaining 25%, and also on the implications which only involve equations from the remaining 25%.

Alex Meiburg (Oct 21 2024 at 23:53):

The fact that it scored a similar amount on each is impressive! That makes it sound like it really didn't overfit much at all.
So this is a transformer that you trained to give meaningful embeddings (= embeddings that could give good predictions), and it produced meaningful embeddings on the remaining equations too?

How big is the embedding dimension? Can you give us an example of some "maximally distinct" equations (by some metric, e.g. far-apart embeddings)?

Adam Topaz (Oct 25 2024 at 22:23):

Hi all. I uploaded the data I have been using to huggingface. It can be found here: https://huggingface.co/datasets/adamtopaz/equational_dataset

The README.md has a description of how this data was generated, and how the training/validation/testing split was created.

Jose Brox (Oct 25 2024 at 22:51):

Adam Topaz said:

In this case I chose a random set of 75% of the equations, and trained only on the implications whose lhs and rhs are both contained in that fixed set of equations.

How do you compare your results with ours? With just 1% of the pairs one can get 99.3% accuracy, without embedding.

Adam Topaz (Oct 25 2024 at 22:52):

Can you say how you chose that 1%? Is it just a random 1% across all implications?

Adam Topaz (Oct 25 2024 at 22:53):

If so, then the data seems to be fundamentally different

Adam Topaz (Oct 25 2024 at 22:54):

I would be curious to see how your CNN model does on the split in terms of equations, as opposed to randomly choosing implications

Jose Brox (Oct 25 2024 at 22:54):

Adam Topaz said:

If so, then the data seems to be fundamentally different

Yes. Perhaps you had already thought of a way to compare them?

Adam Topaz (Oct 25 2024 at 22:55):

No, I haven’t. I don’t see any reasonable way to compare the two models

Jose Brox (Oct 25 2024 at 23:02):

Adam Topaz said:

I would be curious to see how your CNN model does on the split in terms of equations, as opposed to randomly choosing implications

We will try to follow your recipe in our model (I will convert your files to our format). Thank you for sharing your data!

Adam Topaz (Oct 25 2024 at 23:04):

I’m also working on cleaning up my model code and making it more user friendly. I’ll share that as well when it’s ready.

Michael Bucko (Oct 26 2024 at 06:28):

Adam Topaz said:

I’m also working on cleaning up my model code and making it more user friendly. I’ll share that as well when it’s ready.

Are you fine-tuning an existing one (if yes, which one), or training from scratch?

Adam Topaz (Oct 26 2024 at 16:01):

No, I didn’t fine-tune anything. I just made a transformer in PyTorch. It’s quite small (only about 1 MB to store the weights).

Adam Topaz (Oct 26 2024 at 17:51):

(And I trained it on my laptop, which has an underpowered 4 GB discrete GPU.)

Michael Bucko (Oct 26 2024 at 19:30):

Adam Topaz said:

(And I trained it on my laptop, which has an underpowered 4 GB discrete GPU.)

I know this pain. Looking forward to testing it.
In the meantime, I have a prompt-completion implication dataset, but I was looking into fine-tuning o1-mini, which requires the chat format. Well, perhaps I'll just generate a synthetic dataset for that.

Siddhartha Gadgil (Oct 27 2024 at 06:40):

Adam Topaz said:

No, I didn’t fine-tune anything. I just made a transformer in PyTorch. It’s quite small (only about 1 MB to store the weights).

How did you encode the Magma?

Vlad Tsyrklevich (Oct 27 2024 at 06:46):

Does it matter if the test/validation overlap? Also, this is just the data set, there's no way to run inference on the model ourselves, right?

Michael Bucko (Oct 27 2024 at 09:53):

In my case, if I fine-tune something, I'll make the dataset open. Will just share it here.
Btw. we could also use something like W&B for managing the ML lifecycle (or use tools like Vertex or SageMaker). We'd then have all such experiments very well documented and structured.

Adam Topaz (Oct 27 2024 at 12:38):

Siddhartha Gadgil said:

Adam Topaz said:

No, I didn’t fine-tune anything. I just made a transformer in PyTorch. It’s quite small (only about 1 MB to store the weights).

How did you encode the Magma?

This is roughly explained in the dataset readme. Here are some more details: Magma expressions are serialized using prefix notation. The LHS and RHS of an equation are tokenized, appended, and padded as needed; the model gets the padded tokens and computes (learned, absolute) positional embeddings. It also gets information about which tokens correspond to the LHS and which to the RHS (and which to padding) -- I call these "kinds".
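Roughly, the preprocessing is along these lines (an illustrative sketch of the idea only, not the code in the repo):

```python
PAD, OP = "<pad>", "*"                 # illustrative token names
KIND_LHS, KIND_RHS, KIND_PAD = 0, 1, 2

def prefix_tokens(expr):
    """expr is a variable name (str) or a pair (l, r); serialize in prefix notation."""
    if isinstance(expr, tuple):
        return [OP] + prefix_tokens(expr[0]) + prefix_tokens(expr[1])
    return [expr]

def encode(lhs, rhs, max_len=32):
    """Concatenate the prefix tokens of LHS and RHS, pad, and record each token's 'kind'."""
    lhs_toks, rhs_toks = prefix_tokens(lhs), prefix_tokens(rhs)
    toks  = lhs_toks + rhs_toks
    kinds = [KIND_LHS] * len(lhs_toks) + [KIND_RHS] * len(rhs_toks)
    pad = max_len - len(toks)
    return toks + [PAD] * pad, kinds + [KIND_PAD] * pad

# e.g. x * (y * z) = (x * y) * z  ->  encode(("x", ("y", "z")), (("x", "y"), "z"))
```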

Adam Topaz (Oct 27 2024 at 12:40):

Also, for training, there are some transformations being done: variable names are randomly shuffled, and the LHS/RHS of an equation are flipped randomly with 0.5 probability.
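In sketch form (again only illustrative, with expressions as nested tuples of variable names):

```python
import random

def augment(lhs, rhs, variables=("x", "y", "z", "w", "u", "v")):
    """Randomly permute variable names, then flip LHS and RHS with probability 0.5."""
    perm = dict(zip(variables, random.sample(variables, len(variables))))

    def rename(expr):
        return (rename(expr[0]), rename(expr[1])) if isinstance(expr, tuple) else perm[expr]

    lhs, rhs = rename(lhs), rename(rhs)
    return (rhs, lhs) if random.random() < 0.5 else (lhs, rhs)
```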

Adam Topaz (Oct 27 2024 at 12:43):

Anyway, I tried to make the code as usable as possible and put it here: https://github.com/adamtopaz/equational_transformer
Note that this has still only ever been tested on my own hardware, so if you try to run it and run into issues, please do let me know (by DM, so that this channel doesn't get too noisy).

Michael Bucko (Oct 27 2024 at 12:49):

@Adam Topaz will try it out. Thank you for sharing the repo.

Adam Topaz (Oct 27 2024 at 12:49):

If you just want to use the model, I suggest looking at the play.ipynb notebook first.

Michael Bucko (Oct 27 2024 at 12:50):

Adam Topaz said:

If you just want to use the model, I suggest looking at the play.ipynb notebook first.

Yeah, will skip the pretraining for now. Great stuff!

Michael Bucko (Oct 27 2024 at 19:22):

Didn't know about lake exe tokenized_data generate -h. That's pretty cool.
Checked out the notebook, the dataset on HF, etc. I was thinking of training a transformer too, but I guess there's no need to duplicate the work.
Are you planning to upload the model to HF, too? We could have a public Colab.

Adam Topaz (Oct 27 2024 at 19:29):

The tokenized_data exe is only in the fork of the repo. (I could PR it to the main repo, but it’s not clear there is enough interest.)

Michael Bucko (Oct 27 2024 at 19:43):

Adam Topaz said:

The tokenized_data exe is only in the fork of the repo. (I could PR it to the main repo, but it’s not clear there is enough interest.)

Since you already wrote the code, and pre-training works, I think it's not necessary. We can use your pretrained model.
But it'd be really cool if you could upload the model to HF.
That'd make many things easier (we could deploy it to Spaces, SageMaker or Azure ML), and have a very simple, no-installation-required Colab for testers.

Adam Topaz (Oct 27 2024 at 19:50):

Ok. I’ll try to do that at some point soon.

Adam Topaz (Oct 30 2024 at 16:46):

Alex Meiburg said:

The fact that it scored a similar amount on each is impressive! That makes it sound like it really didn't overfit much at all.
So this is a transformer that you trained to give meaningful embeddings (= embeddings that could give good predictions), and it produced meaningful embeddings on the remaining equations too?

How big is the embedding dimension? Can you give us an example of some "maximally distinct" equations (by some metric, e.g. far-apart embeddings)?

To follow up on this, I computed the top three principal components of the embeddings, and plotted them. It seems that there are four distinct "chunks" but some additional structure within two of the chunks. If you want to see the visualization, here it is:
equation_embeddings.html
And if you want to play with the embeddings on your own, here they are:
embeddings.pickle (this is a dictionary with keys of the form "EquatuionN" and value a numpy vector of the 64-dimensional embedding)
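If you want to reproduce the projection from the pickle, a minimal sketch (assuming scikit-learn) is:

```python
import pickle
import numpy as np
from sklearn.decomposition import PCA

with open("embeddings.pickle", "rb") as f:
    embeddings = pickle.load(f)                  # {equation name: 64-dimensional numpy vector}

names = sorted(embeddings)
X = np.stack([embeddings[n] for n in names])
coords = PCA(n_components=3).fit_transform(X)    # top three principal components, one row per equation
```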

Adam Topaz (Oct 30 2024 at 16:49):

(A direct URL if you don't want to download the html file directly: https://sites.ualberta.ca/~topaz/equation_embeddings.html )

Michael Bucko (Oct 30 2024 at 16:56):

Adam Topaz said:

Alex Meiburg said:

The fact that it scored a similar amount on each is impressive! That makes it sound like it really didn't overfit much at all.
So this is a transformer that you trained to give meaningful embeddings (= embeddings that could give good predictions), and it produced meaningful embeddings on the remaining equations too?

How big is the embedding dimension? Can you give us an example of what are some "maximally distinct" equations (by some metric of, far-apart embeddings)?

To follow up on this, I computed the top three principal components of the embeddings, and plotted them. It seems that there are four distinct "chunks" but some additional structure within two of the chunks. If you want to see the visualization, here it is:
equation_embeddings.html
And if you want to play with the embeddings on your own, here they are:
embeddings.pickle (this is a dictionary with keys of the form "EquatuionN" and value a numpy vector of the 64-dimensional embedding)

Awesome! Based on your pickle file:

embeddings

Adam Topaz (Oct 30 2024 at 16:56):

Yeah, that's the projection onto the x-y plane of the 3d plot above.

Michael Bucko (Oct 30 2024 at 17:00):

Btw. the old umap also in 3d: https://leanprover.zulipchat.com/#narrow/channel/458659-Equational/topic/3.20categories.20in.20UMAP.20after.20tokenization.20of.20theorems/near/476903869

Perhaps 2d looks clearer?

Michael Bucko (Oct 30 2024 at 17:01):

And here's one more.
pickle2.png

Pietro Monticone (Nov 10 2024 at 18:44):

Might be of interest here too: #Equational > Graph ML: Directed link prediction on the implication graph


Last updated: May 02 2025 at 03:31 UTC