Zulip Chat Archive
Stream: Equational
Topic: Implications - dimensionality reduction
Michael Bucko (Oct 25 2024 at 13:53):
I ran two dimensionality reduction experiments:
- umap
- t-sne
with the new dataset from Oct 24th.
Got this:
Bildschirmfoto 2024-10-25 um 15.45.58.png
Bildschirmfoto 2024-10-25 um 15.48.05.png
The UMAP plot is especially interesting, because it suggests that there are certain pathways that we can perhaps rediscover.
Eric Taucher (Oct 25 2024 at 13:58):
As the two terms were new to me, I asked ChatGPT.
Prompt
Please explain this statement related to math magma and implications.
two dimensionality reduction experiments: umap t-sne
Reply
Dimensionality reduction techniques like UMAP (Uniform Manifold Approximation and Projection) and t-SNE (t-distributed Stochastic Neighbor Embedding) are used for visualizing and simplifying high-dimensional data in a lower-dimensional (typically 2D) space. They both have particular strengths and are often used to explore complex datasets.
Here's how each works and the main implications of using them:
UMAP (Uniform Manifold Approximation and Projection)
- Goal: UMAP is focused on maintaining the global structure of data while preserving local neighbor relations, aiming to capture the general shape of the data manifold.
- Mechanism: UMAP constructs a high-dimensional graph representation of data, preserving both local and global structures, and then approximates that structure in a lower dimension.
- Speed: UMAP is generally faster than t-SNE, especially for large datasets, making it favorable in practical applications with vast amounts of data.
- Interpretation: UMAP tends to cluster data points in ways that can be slightly more interpretable in terms of neighborhood relationships.
- Hyperparameters: It has fewer tunable parameters compared to t-SNE, with the primary ones being the number of neighbors (which controls local connectivity) and minimum distance (which adjusts the compactness of clusters).
t-SNE (t-distributed Stochastic Neighbor Embedding)
- Goal: t-SNE focuses heavily on preserving local similarities (i.e., small distances between points) rather than global structure.
- Mechanism: It minimizes the difference between probability distributions of pairwise distances in high-dimensional space versus lower-dimensional space, emphasizing nearby data points and creating clusters or separation based on neighborhood similarities.
- Speed: t-SNE is computationally intensive, especially for larger datasets, which can make it slow for certain applications.
- Interpretation: The clusters in t-SNE are often visually separated but should not be interpreted as clear-cut "clusters." Instead, they represent local density regions.
- Hyperparameters: Key parameters include perplexity (a measure of the neighborhood size) and learning rate, which influence how points are clustered or spread out in the lower-dimensional space.
Key Differences and Implications
- Interpretability: UMAP’s embedding often aligns better with the data’s global structure, making it more informative for recognizing larger patterns. t-SNE, however, can provide clearer local clusters but sacrifices global patterns.
- Speed: UMAP is generally faster, making it a practical choice for large datasets.
- Use Case: If the objective is to identify distinct clusters based on local similarities, t-SNE is advantageous. For general structure, data overview, or exploratory data analysis with large datasets, UMAP is often preferred.
Using both methods together can sometimes provide complementary insights into the data structure, making them popular for exploratory data analysis in fields like bioinformatics, NLP, and image processing.
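For reference, the objective that the t-SNE "Mechanism" bullet alludes to is usually written as below. This is the standard textbook formulation, not something derived in this thread. Here $x_i$ are the $n$ high-dimensional points, $y_i$ their low-dimensional images, and each $\sigma_i$ is chosen per point to match the perplexity setting.

```latex
\[
p_{j\mid i} = \frac{\exp\!\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}
                   {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j\mid i} + p_{i\mid j}}{2n}
\]
\[
q_{ij} = \frac{\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l\rVert^2\right)^{-1}},
\qquad
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
\]
```

The cost $C$ is minimized over the positions $y_i$ by gradient descent. Because $p_{ij}$ is negligible for distant pairs, the objective mostly rewards keeping near neighbors close, which is why global distances in a t-SNE plot are hard to interpret.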
Daniel Weber (Oct 25 2024 at 14:04):
What was the reduction on? The implication graph?
Michael Bucko (Oct 25 2024 at 14:17):
Daniel Weber said:
What was the reduction on? The implication graph?
Yes, exactly.
Michael Bucko (Oct 25 2024 at 14:28):
And here's the projection to 2D. It clearly indicates certain paths.
perfect.png
A. (Oct 25 2024 at 16:55):
Probably stupid question: how are you running those things against a graph?
Michael Bucko (Oct 25 2024 at 16:57):
A. said:
Probably stupid question: how are you running those things against a graph?
Not stupid at all. You can generate implication datasets using tools in this project. There's the extract_implications tool, then a tool that allows you to generate edges and so on.
A. (Oct 25 2024 at 17:00):
I mean, don't they take sets of vectors as input?
Michael Bucko (Oct 25 2024 at 17:02):
A. said:
I mean, don't they take sets of vectors as input?
Yes, they do. You need to convert the data into a TSV, or a similar format.
A. (Oct 25 2024 at 17:03):
How are the vectors constructed from the graph?
Michael Bucko (Oct 25 2024 at 17:07):
In my case, I used the equation numbers: each implication Eq1 -> Eq2 is encoded by the numbers of its two equations.
Edward van de Meent (Oct 25 2024 at 17:08):
So that example would give (1,2) as vector?
Michael Bucko (Oct 25 2024 at 17:10):
Yes. Check this - that's the dataset I'm using for this.
implications.tsv
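For readers following along, here is a minimal sketch of the encoding described above. It assumes one implication per row with two tab-separated integer columns; the actual layout of implications.tsv is not shown in this thread, and `load_implication_pairs` is a hypothetical helper name.

```python
import csv
import io


def load_implication_pairs(tsv_text):
    """Parse rows like '1<TAB>2' into (premise, conclusion) integer pairs."""
    pairs = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        # Skip blank lines and non-numeric rows such as a header (assumed format).
        if len(row) < 2:
            continue
        left, right = row[0].strip(), row[1].strip()
        if not (left.isdigit() and right.isdigit()):
            continue
        pairs.append((int(left), int(right)))
    return pairs


# Made-up sample data, not taken from the real dataset.
sample = "source\ttarget\n1\t2\n2\t3\n1\t3\n"
pairs = load_implication_pairs(sample)
print(pairs)
```

Each pair is one input vector, so with this encoding every data point has exactly two coordinates.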
Will Sawin (Oct 25 2024 at 17:31):
But then wouldn't they all be vectors in $\mathbb{R}^2$, not needing dimensionality reduction?
Michael Bucko (Oct 25 2024 at 17:34):
Will Sawin said:
But then wouldn't they all be vectors in $\mathbb{R}^2$, not needing dimensionality reduction?
They are in $\mathbb{R}^2$. But you can still extract insights. You notice certain structures that you otherwise would not be able to see.
For instance, that's what I did some years ago for numbers. There are some structures one can uncover (that one sometimes does not immediately see).
Will Sawin (Oct 25 2024 at 17:38):
But why not just consider a matrix whose i,j entry is 1 if equation i implies equation j and 0 otherwise and do dimensionality reduction on the columns of the matrix?
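A minimal sketch of that matrix construction, assuming the implications are available as (i, j) pairs of 1-based equation numbers. The helper names and the three-equation sample are made up for illustration.

```python
def implication_matrix(pairs, n):
    """Return an n x n 0/1 matrix M with M[i-1][j-1] = 1 iff (i, j) is in pairs."""
    matrix = [[0] * n for _ in range(n)]
    for i, j in pairs:
        matrix[i - 1][j - 1] = 1  # equation numbers are assumed to start at 1
    return matrix


def columns(matrix):
    """Return the columns of the matrix as a list of vectors."""
    return [list(col) for col in zip(*matrix)]


# Made-up sample: Eq1 -> Eq2, Eq2 -> Eq3, Eq1 -> Eq3.
pairs = [(1, 2), (2, 3), (1, 3)]
m = implication_matrix(pairs, 3)
cols = columns(m)
print(cols)
```

Column j records which equations imply equation j, so each equation becomes one n-dimensional vector. Those vectors, rather than the raw pairs, would be the input to UMAP or t-SNE.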
Michael Bucko (Oct 25 2024 at 17:39):
It could be done in many ways. Essentially, in my case, I needed a TSV.
Michael Bucko (Oct 25 2024 at 17:40):
I am using different approaches in the "Database and new graph.." ticket -- with a graph database and network analysis.
Michael Bucko (Oct 25 2024 at 17:53):
Btw. that's the kind of graph I got for the Equational Theories project through network analysis.
Last updated: May 02 2025 at 03:31 UTC