Zulip Chat Archive
Stream: Equational
Topic: 3 categories in UMAP after tokenization of theorems
Michael Bucko (Oct 15 2024 at 06:00):
I tokenized the theorems (as of Oct 14th; sentences truncated to 512 tokens, using bert-base-uncased) and then used UMAP to visualize the result.
I see three categories (points: 245, dimension: 768).
(attached screenshot: Bildschirmfoto 2024-10-15 um 07.46.07.png)
Do we know how to interpret these categories? I'd be happy to create a PR for this.
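The UMAP step itself is roughly this (a minimal sketch; umap-learn and matplotlib assumed, the random matrix is only a stand-in for the real 245 x 768 embedding matrix, and the UMAP parameters here are just the defaults, not necessarily the ones I used):
import numpy as np
import umap
import matplotlib.pyplot as plt

X = np.random.rand(245, 768)  # stand-in for the real token embeddings

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric='cosine', random_state=42)
X_2d = reducer.fit_transform(X)  # shape: (245, 2)

plt.scatter(X_2d[:, 0], X_2d[:, 1], s=8)
plt.title('UMAP of theorem token embeddings')
plt.show()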
Vlad Tsyrklevich (Oct 15 2024 at 06:07):
I think you're talking about looking at the actual theorem bodies in text? Without a mapping of the points to particular theorems there is no way to tell whether it's interesting or just an artifact. Most theorems are auto-generated by only a few methods, so it would make sense that they cluster.
Michael Bucko (Oct 15 2024 at 06:17):
Yes, the actual theorem bodies. 245 points in 768 dimensions.
I am just wondering why there are 3 clusters and not 20 (especially given their sizes). I was trying to explore unexpected patterns. Perhaps there are some hidden patterns, or higher-order concepts, in the way we're proving things.
I once did a similar analysis based on k-prime numbers and got this (only computed for the first 10,000 numbers; the compute wouldn't allow more back in the day).
(attached screenshot: Bildschirmfoto 2024-10-15 um 08.13.47.png)
Fan Zheng (Oct 15 2024 at 11:28):
Why is there a 0 in the label?
Michael Bucko (Oct 15 2024 at 11:45):
Fan Zheng wrote:
Why is there a 0 in the label?
I'm getting all the token embeddings from the last hidden layer. If the input text contains the digit 0, it will be included as a token. Then I align the BERT and spaCy tokens. The script currently treats 0 as one of the words for which embeddings are calculated.
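For example (a quick check, assuming the same bert-base-uncased tokenizer; the sample sentence is just an illustration):
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.tokenize('Equation 0 implies Equation 2'))
# the digit '0' shows up as its own wordpiece token, e.g. ['equation', '0', 'implies', ...]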
Zoltan A. Kocsis (Z.A.K.) (Oct 15 2024 at 11:57):
Instead of theorems and their text, one could use a similarity metric on the finite magmas and plot those. I think that might make for a picture with interesting features.
A potential similarity metric: cosine similarity between the ordered sets of satisfied equations. I guess one could also come up with some difference metric on the operation tables that accounts for isomorphism.
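A minimal sketch of that cosine similarity (assuming each magma is encoded as a 0/1 vector over the ordered equation list; the toy vectors below are made up):
import numpy as np

def satisfaction_cosine(sat_a, sat_b):
    # sat_a, sat_b: 0/1 vectors over the same ordered list of equations
    a = np.asarray(sat_a, dtype=float)
    b = np.asarray(sat_b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# toy example over 5 equations: the two magmas share equations 1 and 4
magma_a = [1, 1, 0, 1, 0]
magma_b = [1, 0, 0, 1, 1]
print(satisfaction_cosine(magma_a, magma_b))  # 2 / sqrt(3 * 3) = 0.666...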
Michael Bucko (Oct 15 2024 at 12:38):
Zoltan A. Kocsis (Z.A.K.) wrote:
Instead of theorems and their text, one could use a similarity metric on the finite magmas and plot those. I think that might make for a picture with interesting features.
A potential similarity metric: cosine similarity between the ordered sets of satisfied equations. I guess one could also come up with some difference metric on the operation tables that accounts for isomorphism.
I am already trying to run the script on the zipped implications from Vlad, but I'm not sure my compute will be enough.
Michael Bucko (Oct 15 2024 at 15:30):
Fan Zheng wrote:
Why is there a 0 in the label?
Here's the (very short) code for this:
!pip install spacy transformers torch
!pip install chardet

import sys
import spacy
from transformers import BertTokenizer, BertModel
import torch
import numpy as np
import chardet


def main():
    # Spacy and BERT
    nlp = spacy.load('en_core_web_sm')
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
    model.eval()

    # I set it up in Colab
    from google.colab import files
    uploaded = files.upload()
    input_file = next(iter(uploaded))

    with open(input_file, 'rb') as f:
        raw_data = f.read()
    result = chardet.detect(raw_data)
    encoding = result['encoding']
    confidence = result['confidence']
    if confidence >= 0.5 and encoding is not None:
        text = raw_data.decode(encoding)
    else:
        print("Defaulting to 'utf-8' with errors handled.")
        text = raw_data.decode('utf-8', errors='replace')

    doc = nlp(text)
    sentences = [sent.text for sent in doc.sents]

    word_embeddings = {}
    word_counts = {}

    for sentence in sentences:
        doc_sent = nlp(sentence)
        words_in_sent = [token.text for token in doc_sent if not token.is_punct and not token.is_space]

        # Tokenize the sentence with BERT tokenizer
        inputs = tokenizer(
            sentence,
            return_tensors='pt',
            add_special_tokens=True,
            max_length=512,  # Set max length to 512 tokens
            truncation=True  # Truncate
        )

        # Check if the sentence was truncated
        if tokenizer.model_max_length and len(inputs['input_ids'][0]) >= tokenizer.model_max_length:
            print(f"Warning: Sentence truncated to {tokenizer.model_max_length} tokens.")

        with torch.no_grad():
            outputs = model(**inputs)
        hidden_states = outputs.hidden_states  # Tuple of 13 layers
        token_embeddings = hidden_states[-1].squeeze(0)  # Shape: [seq_len, hidden_size]
        tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'].squeeze())

        # Reassemble BERT WordPiece tokens ('##' pieces) and align them with the spaCy words
        spacy_idx = 0
        bert_word = ''
        for idx, token in enumerate(tokens):
            if token == '[CLS]' or token == '[SEP]':
                continue
            if token.startswith('##'):
                bert_word += token[2:]
            else:
                if bert_word:
                    spacy_idx += 1
                bert_word = token
                if spacy_idx < len(words_in_sent):
                    word = words_in_sent[spacy_idx]
                    # Remove special characters and lowercase for comparison
                    cleaned_bert_word = bert_word.lower().strip()
                    cleaned_word = word.lower().strip()
                    if cleaned_bert_word == cleaned_word:
                        vector = token_embeddings[idx].numpy()
                        if word in word_embeddings:
                            word_embeddings[word] += vector
                            word_counts[word] += 1
                        else:
                            word_embeddings[word] = vector
                            word_counts[word] = 1

    # Average the embeddings for each word
    for word in word_embeddings:
        word_embeddings[word] /= word_counts[word]

    # Convert the embeddings to a TSV file for word2vec processing
    with open('output.tsv', 'w', encoding='utf-8') as tsv_file:
        for word, vector in word_embeddings.items():
            vector_str = '\t'.join(map(str, vector))
            tsv_file.write(f'{word}\t{vector_str}\n')

    print("Done")


if __name__ == "__main__":
    main()
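To visualize, output.tsv can then be read back into a matrix, e.g. (a sketch, assuming the file produced above):
import numpy as np

words, vectors = [], []
with open('output.tsv', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip('\n').split('\t')
        words.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])

X = np.array(vectors)  # shape: (number of words, 768)
# X is what goes into UMAP (as in the snippet earlier in the thread)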
Michael Bucko (Oct 15 2024 at 15:38):
@Fan Zheng btw, here's the Colab: https://colab.research.google.com/drive/1GToDglCGggXKnFuIchz3SlASAVN2zmkA?usp=sharing
Michael Bucko (Oct 17 2024 at 15:31):
Can't really run it in the Colab env; it turned out to be too heavy. But I could re-run it (and optimize it) with more compute.