Zulip Chat Archive

Stream: Equational

Topic: Database and new graph visualizations


Michael Bucko (Oct 21 2024 at 19:38):

I imported the graph into neo4j, and can now query it with Cypher. It'd be great if we could make this kind of db available to all of us.

Here are some of the visualizations:

Bildschirmfoto 2024-10-21 um 21.28.18.png

Bildschirmfoto 2024-10-21 um 21.29.07.png

Btw. some nodes seem disconnected from the main graph:

Bildschirmfoto 2024-10-21 um 21.31.44.png

Eric Taucher (Oct 21 2024 at 19:59):

Nice to see some other ways of visualizing the graph. I also like Neo4j because of Cypher. FYI, my graphs with Cytoscape.js using an Euler layout also put the nodes on top of each other and look much the same as yours; misery loves company.

Did you publish your code for others to try?

Michael Bucko (Oct 21 2024 at 20:03):

The code is already available. I used extract_implications, converted the output into a CSV, and then simply used their UI (not a script) to separately import nodes with the last column containing _true (starting from _implicit_true).
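
As a rough illustration of the filtering step described above, here is a minimal Python sketch. It assumes a three-column CSV (source, target, status) in which status values such as implicit_proof_true end in _true; the column layout, the sample rows, and the function name true_edges are assumptions for illustration, not the repo's actual format.

```python
import csv
import io


def true_edges(csv_text):
    """Return (source, target) pairs whose last column ends in '_true'."""
    edges = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        source, target, status = row[0], row[1], row[-1]
        if status.strip().endswith("_true"):
            edges.append((source.strip(), target.strip()))
    return edges


# Hypothetical sample rows; the real file layout may differ.
sample = """Equation2,Equation5,implicit_proof_true
Equation5,Equation2,explicit_proof_false
Equation2,Equation3,explicit_proof_true
"""

kept = true_edges(sample)
```

The resulting pairs could then be written back out as a node list and an edge list for a database import.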

Michael Bucko (Oct 21 2024 at 20:06):

But I guess as soon as we decide on a db (that can be shared), there should be a simple script in the CI.
In this case, it was just a tiny experiment. I reused the scripts that we already have in the repo.

Michael Bucko (Oct 21 2024 at 20:13):

With source (red) and target nodes (green)
Bildschirmfoto 2024-10-21 um 22.10.11.png

When pulling 511 from a pretty big graph:

Bildschirmfoto 2024-10-21 um 22.11.06.png

Michael Bucko (Oct 21 2024 at 20:18):

Then for every relationship one can visualize those subgraphs (in this case MATCH p=()-[:implies]->() RETURN p LIMIT 100;)

Bildschirmfoto 2024-10-21 um 22.16.05.png

Michael Bucko (Oct 21 2024 at 21:15):

And that's the 2nd biggest visualization I've been able to do. This one is truly beautiful, and reveals some structures. One could theoretically run this as a Cypher script in the CI and gain new insights with more data.

Bildschirmfoto 2024-10-21 um 23.08.27.png

And this is the biggest one, and I find it truly fascinating.

Bildschirmfoto 2024-10-21 um 23.14.24.png

Amir Livne Bar-on (Oct 22 2024 at 04:14):

Are the "clusters" large anti-chains?

Michael Bucko (Oct 22 2024 at 06:22):

Are the "clusters" large anti-chains?

I think so. In more detail:

  • Magmas within clusters are potentially equally powerful or valid in terms of implications.
  • Clusters may represent distinct classes or families of magmas.
  • They must have some deeper properties due to this shape. But I don't understand this shape at the moment.
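
One way to test the anti-chain guess on a candidate cluster is sketched below. It assumes the implications are available as a list of (source, target) pairs and treats a set of laws as an anti-chain when no member can reach another member by following implications. The function names and the toy edges are hypothetical.

```python
from collections import defaultdict


def reachable(adj, start):
    """All nodes reachable from start by following one or more edges."""
    seen, stack = set(), list(adj.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adj.get(node, ()))
    return seen


def is_antichain(edges, nodes):
    """True if no node in `nodes` implies (reaches) another node in `nodes`."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
    nodes = set(nodes)
    for n in nodes:
        if reachable(adj, n) & (nodes - {n}):
            return False
    return True


# Toy implication edges, for illustration only.
toy_edges = [("A", "B"), ("A", "C"), ("B", "D")]
```

This does one traversal per member, so it is only practical for small clusters; equivalent laws reach each other and therefore never form an anti-chain under this check.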

Michael Bucko (Oct 22 2024 at 07:23):

One more, with slightly more data & reorganized.

visualisation.png

Michael Bucko (Oct 22 2024 at 15:07):

What equations 2 and 5 imply (calculating for up to 10,000 nodes) -
Bildschirmfoto 2024-10-22 um 17.05.43.png
Bildschirmfoto 2024-10-22 um 17.06.20.png

Harald Husum (Oct 22 2024 at 15:13):

What information do you think can be extracted from these plots? Have you gained any insights you think could be valuable?

Michael Bucko (Oct 22 2024 at 15:24):

We could:

  • cluster different paths to certain equations and identify new properties / equation types (that could potentially help us develop new methods and tactics)
  • paint a bigger graph with more data (eventually), and have a db that many researchers could use for their own research
  • identify anomalies and glitches in the results
  • develop checks (think TDD) at the graph level

Some other ideas include:

  • improving visualization capabilities
  • having Cypher and GraphQL capability
  • perhaps developing ATP techniques on top of graphs
  • somehow connecting those graphs to egg?

Michael Bucko (Oct 23 2024 at 07:00):

Also, what I am presenting here are clusters.

neo4j also supports community detection, but my neo4j instance does not allow it.

It'd look like this or something similar.

CALL gds.louvain.write({
  nodeProjection: '*',
  relationshipProjection: '*',
  writeProperty: 'communityId'
})
YIELD communityCount, modularity, modularities;

Eric Taucher (Oct 23 2024 at 09:43):

FYI
Neo4j has plugins, some of which can apply different layout algorithms.

https://neo4j.com/developer-blog/15-tools-for-visualizing-your-neo4j-graph-database/

While I have not reviewed the steps in the following link, from previous experience with Neo4j they look correct (or at least worked at some point) and should help others try such layout algorithms.

Explore New Worlds — Adding Plugins to Neo4j


@Michael Bucko
Really enjoy seeing where you are going with the use of Neo4j, please keep posting.

Eric Taucher (Oct 23 2024 at 10:07):

Michael Bucko said:

community detection

Had to look that up :smile:

https://neo4j.com/docs/graph-data-science/current/algorithms/community/

Community detection algorithms are used to evaluate how groups of nodes are clustered or partitioned, as well as their tendency to strengthen or break apart.

Michael Bucko (Oct 23 2024 at 10:15):

Yes, I am basically approaching this problem from the network analysis perspective. Check this out: https://towardsdatascience.com/community-detection-algorithms-9bd8951e7dae

Amir Livne Bar-on (Oct 23 2024 at 14:09):

I had a thought about visualizations - what if the distance between laws indicated the degree of their similarity? That is, the fraction of magmas for which they have the same truth value. We'd need to weigh by the order of the magmas. And Austin pairs should still be some distance apart. So I'm not sure how to define it exactly. But maybe it would be easier to visualize a metric than a poset.

Michael Bucko (Oct 23 2024 at 14:28):

Amir Livne Bar-on said:

I had a thought about visualizations - what if the distance between laws indicated the degree of their similarity? That is, the fraction of magmas for which they have the same truth value. We'd need to weigh by the order of the magmas. And Austin pairs should still be some distance apart. So I'm not sure how to define it exactly. But maybe it would be easier to visualize a metric than a poset.

Weighted fraction of magmas could be done directly in the cypher query (or say BigQuery) and checked. Certain values would have to be precomputed, though. Another thing is that we could normalize those networks, use activation functions and treat them a bit like nns.

Otherwise, things like Jaccard index could be used (can be computed more locally, as compared to WF).
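
A minimal sketch of the Jaccard idea, assuming each law is represented by the set of laws it implies (that choice of set, and the names below, are assumptions; the thread does not fix them):

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def implied_sets(edges):
    """Map each source law to the set of laws it directly implies."""
    out = defaultdict(set)
    for source, target in edges:
        out[source].add(target)
    return out


# Toy edges, for illustration only.
toy = implied_sets([("L1", "X"), ("L1", "Y"), ("L1", "Z"),
                    ("L2", "Y"), ("L2", "Z"), ("L2", "W")])
score = jaccard(toy["L1"], toy["L2"])
```

Because it only needs the two neighbor sets, this can be computed pair by pair without a global pass over the magmas.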

Michael Bucko (Oct 23 2024 at 14:35):

Normalizing and treating them a bit like a nn could give us a chance to interpret different patterns in terms of probabilities.
We'd calculate gradients of the loss function wrt the embeddings of the laws.

Zoltan A. Kocsis (Z.A.K.) (Oct 23 2024 at 16:06):

Amir Livne Bar-on said:

I had a thought about visualizations - what if the distance between laws indicated the degree of their similarity? That is, the fraction of magmas for which they have the same truth value. We'd need to weigh by the order of the magmas. And Austin pairs should still be some distance apart. So I'm not sure how to define it exactly. But maybe it would be easier to visualize a metric than a poset.

@Michael Bucko I believe I made a nigh-identical (dual?) suggestion here in the context of clustering the magmas themselves and you responded that you were running it? Did you get that visualization in the end? The information that could be gleaned from there would be very similar.

Daniel Weber (Oct 23 2024 at 16:10):

That could likely be extracted from the AI models predicting implications, using e.g. cosine similarity

Michael Bucko (Oct 23 2024 at 16:44):

@Zoltan A. Kocsis (Z.A.K.) I was doing UMAP and dimensionality reduction back then. That "blue" chart was mostly about reducing dimensions. But I didn't have enough compute to run it. I'll try to continue soon.

Michael Bucko (Oct 23 2024 at 16:45):

@Zoltan A. Kocsis (Z.A.K.) Here I am using mostly a graph rep in neo4j. I just want to figure out higher-order pathways.

Michael Bucko (Oct 23 2024 at 16:47):

When it comes to the neo4j graph, the best result (close to 10k nodes, some limitations of the free instance too) is currently this--
a.png

Michael Bucko (Oct 23 2024 at 16:48):

@Zoltan A. Kocsis (Z.A.K.) Now I am looking into community detection and network analysis. The problem is my neo4j instance does not support it.

Michael Bucko (Oct 23 2024 at 16:49):

@Zoltan A. Kocsis (Z.A.K.) (I also wrote the weighting algorithm and the related cypher query, but it's not yet fully working)

Michael Bucko (Oct 23 2024 at 17:13):

Just for reference, here's my current similarity code. Something is broken: it runs but returns no rows. If you spot the error, please let me know.

MATCH (l1:Source)-[e1]->(t:Target)<-[e2]-(l2:Source)
WHERE l1.name < l2.name
WITH
  l1,
  l2,
  t,
  e1.truthValue AS truthValue1,
  e2.truthValue AS truthValue2,
  t.order AS magmaOrder
WITH
  l1,
  l2,
  CASE
    WHEN truthValue1 = truthValue2 THEN magmaOrder
    ELSE 0
  END AS weightedAgreement,
  magmaOrder AS totalMagmaOrder
WITH
  l1,
  l2,
  SUM(weightedAgreement) AS sumWeightedAgreements,
  SUM(totalMagmaOrder) AS sumTotalWeights
WITH
  l1,
  l2,
  CASE
    WHEN sumTotalWeights = 0 THEN 0
    ELSE toFloat(sumWeightedAgreements) / toFloat(sumTotalWeights)
  END AS similarityScore

MERGE (l1)-[s:SIMILAR_TO]-(l2)
SET s.similarity = similarityScore
RETURN
  l1.name AS Source1,
  l2.name AS Source2,
  s.similarity AS SimilarityScore
ORDER BY
  SimilarityScore DESC
LIMIT 100
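
For cross-checking the query on a small sample outside the database, the same weighted-agreement score can be sketched in plain Python. This assumes each law comes with a mapping from magma id to truth value and each magma with an order used as its weight, mirroring truthValue and order in the query above; the data structures and names are hypothetical.

```python
def weighted_agreement(truth_a, truth_b, order):
    """Order-weighted fraction of shared magmas on which two laws agree.

    truth_a, truth_b: dict magma_id -> bool (does the law hold in that magma?)
    order: dict magma_id -> int (magma order, used as the weight)
    """
    shared = truth_a.keys() & truth_b.keys()
    total = sum(order[m] for m in shared)
    if total == 0:
        return 0.0
    agree = sum(order[m] for m in shared if truth_a[m] == truth_b[m])
    return agree / total


# Toy data, for illustration only.
order = {"m1": 2, "m2": 3, "m3": 5}
law_a = {"m1": True, "m2": False, "m3": True}
law_b = {"m1": True, "m2": True, "m3": True}
similarity = weighted_agreement(law_a, law_b, order)  # (2 + 5) / 10
```

If the Cypher version returns no rows, one thing worth checking is whether the initial MATCH pattern matches anything at all, since every later step only transforms the rows it produces.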

Michael Bucko (Oct 23 2024 at 18:24):

@Zoltan A. Kocsis (Z.A.K.) When it comes to colab, I'm always running out of compute. So I guess we'll need a real server with some compute to run bigger experiments.

Bildschirmfoto 2024-10-23 um 20.22.28.png

Michael Bucko (Oct 23 2024 at 21:39):

I got the SIMILAR_TO relationship in the graph now as well as the similarity metric property, but the similarity data is only a placeholder.
I'd need at least the actual equations (not only their labels) directly in outcomes.csv (well, I could update the graph myself too based on equations.txt). Currently, it's only the labels.
Then we could use cosine_similarity (Daniel mentioned this), levenshtein_distance, or their implementation of jaro_winkler_similarity.
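
For reference, a string-level distance between equations can also be computed outside the database. The sketch below is a standard dynamic-programming Levenshtein distance in plain Python; the equation strings are only illustrative and the textual form used in equations.txt may differ.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Illustrative equation strings; two swapped variables cost two substitutions.
distance = levenshtein("x = x ◇ y", "x = y ◇ x")
```

A purely textual distance ignores variable renaming, so two equivalent laws written with different variable names could look far apart.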

Douglas McNeil (Oct 24 2024 at 04:37):

@Michael Bucko : these are very pretty!

Vlad Tsyrklevich (Oct 24 2024 at 12:20):

Are these of the full implication graph? It doesn't look like it based on the number of edges, but I do see outcomes.json so thought to confirm. The condensed graph is _much_ smaller, and you only lose information you may not be interested in (e.g. within an equivalence class.)

Michael Bucko (Oct 24 2024 at 12:24):

Vlad Tsyrklevich said:

Are these of the full implication graph? It doesn't look like it based on the number of edges, but I just figured I'd confirm because the condensed graph is _much_ smaller, and you only lose information you may not be interested in (e.g. within an equivalence class.)

It was 22,000 relationships, 9,388 nodes (currently, 114,283 relationships -- due to similarity metric layers).
The free instance (and the browser) had some limitations. Currently working on a new instance, and hope to be able to get an even better one.

Michael Bucko (Oct 24 2024 at 12:27):

This is the most complete so far --
v3.png

I'll share a new one soon.

Vlad Tsyrklevich (Oct 24 2024 at 12:33):

I still don't follow, what is the underlying data? Is it implications with edges as implications and vertices as equations? If so, how are there ~9k vertices? I'm missing something.

Vlad Tsyrklevich (Oct 24 2024 at 12:37):

Ah wait, 9388=4694*2, so this is doing the dual map of implications and non-implications then I assume

Vlad Tsyrklevich (Oct 24 2024 at 12:38):

In that case, you'd be far better off on the condensed graph representation, it has ~1450 vertices, and 2 orders of magnitude fewer edges than the full implication graph
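
To make the condensed representation concrete: the sketch below collapses laws that imply each other into one class (strongly connected components, here via Kosaraju's two-pass algorithm) and keeps only edges between different classes. It assumes the implications are given as (source, target) pairs; the names are hypothetical and this is not the project's own tooling.

```python
from collections import defaultdict


def condense(edges):
    """Collapse mutually-implying nodes; return (class_of, dag_edges)."""
    adj, radj, nodes = defaultdict(list), defaultdict(list), set()
    for a, b in edges:
        adj[a].append(b)
        radj[b].append(a)
        nodes.update((a, b))

    # Pass 1: record nodes in order of DFS completion (iterative DFS).
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(adj[root]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(adj[nxt])))
                    break
            else:  # all successors explored
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in reverse completion order.
    class_of = {}
    for root in reversed(order):
        if root in class_of:
            continue
        class_of[root] = root  # the root serves as class representative
        stack = [root]
        while stack:
            node = stack.pop()
            for prev in radj[node]:
                if prev not in class_of:
                    class_of[prev] = root
                    stack.append(prev)

    dag_edges = {(class_of[a], class_of[b]) for a, b in edges
                 if class_of[a] != class_of[b]}
    return class_of, dag_edges


# Toy edges: A and B are equivalent, C and D are equivalent.
class_of, dag_edges = condense(
    [("A", "B"), ("B", "A"), ("B", "C"), ("C", "D"), ("D", "C")])
```

This only merges equivalence classes; it does not perform the transitive reduction that a fully reduced form would also need.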

Michael Bucko (Oct 24 2024 at 12:39):

Vlad Tsyrklevich said:

I still don't follow, what is the underlying data? Is it implications with edges as implications and vertices as equations? If so, how are there ~9k vertices? I'm missing something.

I used the extract_implications script to extract the implications, then the generate_edgelist script, then kept integrating the implications one by one using the info from the 3rd column (is it implicit? explicit and so on).

Michael Bucko (Oct 24 2024 at 12:40):

Check this out.
Bildschirmfoto 2024-10-24 um 14.39.21.png

Vlad Tsyrklevich (Oct 24 2024 at 12:45):

OK so the nodes are every pair of equations, and then the relationships are just based on whether they are implicit/explicit proof/disproof or unknown? So in that cluster graph above, the blue nodes are showing 1 of explicit_conjecture_false,explicit_proof_false,explicit_proof_true,implicit_conjecture_false,implicit_proof_false,implicit_proof_true? (with unknown so few I presume it's not shown at all given the limited set displayed?)

Michael Bucko (Oct 24 2024 at 12:46):

Think of GROUP BY Source, Target here. It's not divided -- i.e. anything that has _true is considered true.

Vlad Tsyrklevich (Oct 24 2024 at 12:49):

my naive understanding of what 'GROUP BY Source, Target' would do lines up with the idea that pink vertices are pairs of equations and that there are 6 more blue vertices for the types of implication status with the edges being from implication status to the pair "source,target", so I'm still not sure how to interpret the graph otherwise

Michael Bucko (Oct 24 2024 at 12:51):

It's essentially all the sources and the targets, and their implications (from target to source), from a day minus (the limitations of free neo4j + limitations of the browser).

I generated multiple graphs and the biggest one so far is the light purple one.

Vlad Tsyrklevich (Oct 24 2024 at 12:54):

Ah ok, here's a different idea then: That is the graph of all relationships between equation ~1-6 and all the other equations, e.g. the first ~20k loaded implications from the full list

Vlad Tsyrklevich (Oct 24 2024 at 12:56):

Just an FYI the condensed graph is ~1.4k vertices and ~4.8k edges in the reduced form, so that may be much easier to handle and load

Michael Bucko (Oct 24 2024 at 13:18):

I just used the data from October 24th.

Got this. Sharing just for the reference --

Neo.ClientError.Transaction.TransactionHookFailed
You have exceeded the logical size limit of 400000 relationships in your database (attempt to add 17117 relationships would reach 413963 relationships). Please consider upgrading to the next tier.

(the same applies to transformer training, that UMAP experiment, and so on)

Michael Bucko (Oct 24 2024 at 14:20):

Browser-crashing results. 396,846 (implicit only as an approx) based on the data from Oct 24th.

10k rels - we know it

visualisation-10k.png

30k rels - same, but clearer and more symmetrical

visualisation-30k.png

60k - neo4j re-connecting multiple times, browser struggling (my favorite)

visualisation-60k.png

..and 120k - tough to capture, max is ca. 400k, but that takes ages to load and browser shuts down atm

visualisation-120k.png

I'll try to look at 120k from a different perspective and share something that is more informative.

Michael Bucko (Oct 24 2024 at 14:36):

400k (current max) - I could only take a screenshot before the browser shut down.

Bildschirmfoto 2024-10-24 um 16.23.57.png
Bildschirmfoto 2024-10-24 um 16.34.26.png

Michael Bucko (Oct 24 2024 at 14:52):

Longest implication chain - can be computed per node.

Example:
visualisation-longest.png
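
A minimal sketch of the per-node computation, assuming the graph is acyclic (for example after merging equivalent laws) and given as (source, target) pairs. It uses a topological order and counts edges on the longest path starting at each node; the function name is hypothetical.

```python
from collections import defaultdict


def longest_chain_lengths(edges):
    """Map each node to the number of edges on the longest path starting there."""
    adj, indeg, nodes = defaultdict(list), defaultdict(int), set()
    for a, b in edges:
        adj[a].append(b)
        indeg[b] += 1
        nodes.update((a, b))

    # Kahn's algorithm: repeatedly remove nodes with no remaining predecessors.
    queue = [n for n in nodes if indeg[n] == 0]
    topo = []
    while queue:
        n = queue.pop()
        topo.append(n)
        for m in adj[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    if len(topo) != len(nodes):
        raise ValueError("graph has a cycle; condense it first")

    # Process sinks first so every successor's length is final when used.
    length = {n: 0 for n in nodes}
    for n in reversed(topo):
        for m in adj[n]:
            length[n] = max(length[n], 1 + length[m])
    return length


# Toy DAG: A -> B -> C, A -> C, D -> C.
lengths = longest_chain_lengths(
    [("A", "B"), ("B", "C"), ("A", "C"), ("D", "C")])
```

On a graph with cycles the longest path is not well defined this way, which is why the sketch raises instead of guessing.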

Michael Bucko (Oct 24 2024 at 15:14):

I tried to find isolated subgraphs and got:

Neo.TransientError.General.MemoryPoolOutOfMemoryError

`The allocation of an extra 2.0 MiB would use more than the limit 250.0 MiB. Currently using 248.4 MiB. dbms.memory.transaction.total.max threshold reached`
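
If the database runs out of memory, isolated subgraphs could instead be found offline from an exported node and edge list. The sketch below groups nodes into weakly connected components with a union-find structure; the export format and the function name are assumptions.

```python
def components(nodes, edges):
    """Group nodes into weakly connected components, largest first."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent.setdefault(a, a)  # tolerate nodes that only appear in edges
        parent.setdefault(b, b)
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    groups = {}
    for n in parent:
        groups.setdefault(find(n), set()).add(n)
    return sorted(groups.values(), key=len, reverse=True)


# Toy data: D and E have no edges and end up as singleton components.
comps = components(["A", "B", "C", "D", "E"], [("A", "B"), ("B", "C")])
```

Edge direction is ignored here, so a node counts as connected to the main graph if it has any implication to or from it.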

A pipeline with Vertex could get features from something like this

https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/master/notebooks/community/neo4j/graph_paysim.ipynb

We could create new features (ML features), have them on top of the graph db, and then train & deploy in no time. A similar thing is of course possible with SageMaker too.

Michael Bucko (Oct 24 2024 at 18:51):

This one's interesting. For 50k rels. When you zoom out, it feels like a biological cell structure, or a star map.

Bildschirmfoto 2024-10-24 um 20.46.54.png

Michael Bucko (Nov 10 2024 at 19:11):

btw. @Pietro Monticone have a look at this implication graph visualization

beauty.png


Last updated: May 02 2025 at 03:31 UTC