Zulip Chat Archive

Stream: Equational

Topic: 3 categories in UMAP after tokenization of theorems


Michael Bucko (Oct 15 2024 at 06:00):

I tokenized the theorems (as of Oct 14th) with bert-base-uncased (sentences truncated to 512 tokens) and then used UMAP to visualize the result.

I see three categories (points: 245, dimension: 768).
Bildschirmfoto 2024-10-15 um 07.46.07.png

Do we know how to interpret these categories? I'd be happy to create a PR for this.
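
For reference, a minimal sketch of the UMAP step (using the umap-learn package), assuming the per-theorem embeddings are already stacked into a (245, 768) NumPy array; the file name and the UMAP parameters below are just placeholders:

import numpy as np
import umap
import matplotlib.pyplot as plt

embeddings = np.load("theorem_embeddings.npy")   # hypothetical file, shape (245, 768)

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
coords = reducer.fit_transform(embeddings)       # shape (245, 2)

plt.scatter(coords[:, 0], coords[:, 1], s=10)
plt.title("UMAP of theorem embeddings (bert-base-uncased)")
plt.show()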

Vlad Tsyrklevich (Oct 15 2024 at 06:07):

I think you're talking about looking at the actual theorem bodies as text? Without a mapping from the points to particular theorems, there is no way to tell whether this is interesting or just an artifact. Most theorems are auto-generated by only a few methods, so it would make sense that they cluster.
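
For example, keeping the theorem identifiers alongside the points makes the clusters inspectable; a rough sketch, where the coordinates and theorem names below are placeholders for the real UMAP output:

import numpy as np
from sklearn.cluster import KMeans

# Placeholders: in practice `coords` are the real 2D UMAP coordinates and
# `theorem_ids` the matching theorem names, in the same order.
coords = np.random.rand(245, 2)
theorem_ids = [f"Theorem_{i}" for i in range(245)]

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(coords)
for cluster in range(3):
    members = [tid for tid, lab in zip(theorem_ids, labels) if lab == cluster]
    print(f"Cluster {cluster}: {len(members)} theorems, e.g. {members[:5]}")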

Michael Bucko (Oct 15 2024 at 06:17):

Yes, the actual theorem bodies. 245 points in 768 dimensions.

I am just wondering why there are 3 clusters and not 20 (especially given their sizes). I was trying to explore some unexpected patterns. Perhaps there are some hidden regularities, or higher-order concepts, in the way we're proving things.
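
One quick way to sanity-check the cluster count would be to score a few candidate values of k on the 768-dimensional embeddings; a sketch with placeholder data and scikit-learn's silhouette score:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

embeddings = np.random.rand(245, 768)   # placeholder for the real BERT vectors

for k in (2, 3, 5, 10, 20):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    print(k, round(silhouette_score(embeddings, labels), 3))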

I once did a similar analysis based on k-prime numbers and got this (only computed for the first 10,000 numbers; the available compute wouldn't allow more back then).
Bildschirmfoto 2024-10-15 um 08.13.47.png

Fan Zheng (Oct 15 2024 at 11:28):

Why is there a 0 in the label?

Michael Bucko (Oct 15 2024 at 11:45):

Fan Zheng wrote:

Why is there a 0 in the label?

I'm taking all the token embeddings from the last hidden layer. If the input text contains the digit 0, it is included as a token. I then align the BERT and spaCy tokens, and the script currently treats 0 as one of the words for which embeddings are calculated.
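
For illustration, a minimal sketch of that step: bert-base-uncased keeps a standalone 0 as its own token, and its vector comes from the last hidden layer (the sample sentence below is made up):

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()

inputs = tokenizer("Equation 0 holds.", return_tensors='pt')
print(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0]))
# e.g. ['[CLS]', 'equation', '0', 'holds', '.', '[SEP]']

with torch.no_grad():
    last_hidden = model(**inputs).hidden_states[-1]   # shape [1, seq_len, 768]
print(last_hidden.shape)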

Zoltan A. Kocsis (Z.A.K.) (Oct 15 2024 at 11:57):

Instead of theorems and their text, one could use a similarity metric on the finite magmas and plot those. I think that might make for a picture with interesting features.
A potential similarity metric: cosine similarity between the ordered sets of satisfied equations. I guess one could also come up with some difference metric on the operation tables that accounts for isomorphism.
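
For the first metric, a rough sketch, assuming each finite magma has already been summarized as a 0/1 vector over a fixed ordering of the equations (the vectors below are made up):

import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# 1 = the magma satisfies that equation, under some fixed ordering of equations
magma_a = np.array([1, 0, 1, 1, 0])
magma_b = np.array([1, 1, 1, 0, 0])

print(cosine_similarity(magma_a, magma_b))   # 2 / 3 ≈ 0.67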

Michael Bucko (Oct 15 2024 at 12:38):

Zoltan A. Kocsis (Z.A.K.) wrote:

Instead of theorems and their text, one could use a similarity metric on the finite magmas and plot those. I think that might make for a picture with interesting features.
A potential similarity metric: cosine similarity between the ordered sets of satisfied equations. I guess one could also come up with some difference metric on the operation tables that accounts for isomorphism.

I am already trying to run the script on the zipped implications from Vlad, but I'm not sure whether my compute will be enough.

Michael Bucko (Oct 15 2024 at 15:30):

Fan Zheng wrote:

Why is there a 0 in the label?

Here's the (very short) code for this:

!pip install spacy transformers torch chardet
# If the spaCy model is missing: !python -m spacy download en_core_web_sm

import sys
import spacy
from transformers import BertTokenizer, BertModel
import torch
import numpy as np
import chardet

def main():

    # Load spaCy and the BERT tokenizer/model
    nlp = spacy.load('en_core_web_sm')

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
    model.eval()

    # I set it up in Colab
    from google.colab import files
    uploaded = files.upload()

    input_file = next(iter(uploaded))

    with open(input_file, 'rb') as f:
        raw_data = f.read()

    result = chardet.detect(raw_data)
    encoding = result['encoding']
    confidence = result['confidence']

    if confidence >= 0.5 and encoding is not None:
        text = raw_data.decode(encoding)
    else:
        print("Defaulting to 'utf-8' with errors handled.")
        text = raw_data.decode('utf-8', errors='replace')

    doc = nlp(text)

    sentences = [sent.text for sent in doc.sents]

    word_embeddings = {}
    word_counts = {}

    for sentence in sentences:

        doc_sent = nlp(sentence)
        words_in_sent = [token.text for token in doc_sent if not token.is_punct and not token.is_space]

        # Tokenize the sentence with BERT tokenizer
        inputs = tokenizer(
            sentence,
            return_tensors='pt',
            add_special_tokens=True,
            max_length=512,        # Set max length to 512 tokens
            truncation=True        # Truncate
        )

        # Check if the sentence was truncated
        if tokenizer.model_max_length and len(inputs['input_ids'][0]) >= tokenizer.model_max_length:
            print(f"Warning: Sentence truncated to {tokenizer.model_max_length} tokens.")

        with torch.no_grad():
            outputs = model(**inputs)
            hidden_states = outputs.hidden_states  # 13 tensors: embedding layer + 12 transformer layers

        token_embeddings = hidden_states[-1].squeeze(0)  # Shape: [seq_len, hidden_size]
        tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'].squeeze())

        # Greedily align BERT wordpieces back to spaCy tokens; this is approximate
        # and silently skips words where the two tokenizations disagree.
        spacy_idx = 0
        bert_word = ''
        for idx, token in enumerate(tokens):
            if token == '[CLS]' or token == '[SEP]':
                continue
            if token.startswith('##'):
                bert_word += token[2:]
            else:
                if bert_word:
                    spacy_idx += 1
                bert_word = token
            if spacy_idx < len(words_in_sent):
                word = words_in_sent[spacy_idx]
                # Remove special characters and lowercase for comparison
                cleaned_bert_word = bert_word.lower().strip()
                cleaned_word = word.lower().strip()
                if cleaned_bert_word == cleaned_word:
                    vector = token_embeddings[idx].numpy()
                    if word in word_embeddings:
                        word_embeddings[word] += vector
                        word_counts[word] += 1
                    else:
                        word_embeddings[word] = vector
                        word_counts[word] = 1

    # Average the embeddings for each word
    for word in word_embeddings:
        word_embeddings[word] /= word_counts[word]

    # Write the averaged embeddings to a TSV file for the downstream visualization step
    with open('output.tsv', 'w', encoding='utf-8') as tsv_file:
        for word, vector in word_embeddings.items():
            vector_str = '\t'.join(map(str, vector))
            tsv_file.write(f'{word}\t{vector_str}\n')

    print("Done")

if __name__ == "__main__":
    main()

Michael Bucko (Oct 15 2024 at 15:38):

@Fan Zheng By the way, here's the Colab: https://colab.research.google.com/drive/1GToDglCGggXKnFuIchz3SlASAVN2zmkA?usp=sharing

Michael Bucko (Oct 17 2024 at 15:31):

I can't really run it in the Colab env; it turned out to be too heavy. But I could re-run it (and optimize it) with more compute.


Last updated: May 02 2025 at 03:31 UTC