Zulip Chat Archive

Stream: Equational

Topic: Graph ML: Directed link prediction on the implication graph


Pietro Monticone (Nov 10 2024 at 18:11):

A few days ago, @Claudio Moroni and I decided to explore Directed Link Prediction (DLP) on the implication graph, where nodes represent equational laws and edges indicate implications. We experimented with various Graph Neural Network (GNN) autoencoders to predict missing edges, providing a way to estimate the truth value of unproven implications. To assess these models, we defined three test sets focusing on edge existence, directionality, and bidirectionality. The results give insight into how these models handle dense, directed graphs like ours. A more detailed report follows below.

Pietro Monticone (Nov 10 2024 at 18:11):

Motivation

Directed Link Prediction [1] is a method enabling machine / deep learning models to predict missing edges in a directed graph. For our implication graph, this translates to predicting the truth values of unproven implications. This task serves as a necessary first step for advancing in the following directions:

  1. Reasoning over mathematical knowledge graphs: Recent advancements allow language models to integrate information from multiple modalities. For example, [2] and [3] share information between corresponding layers of Language Models (LMs) and Graph Neural Networks (GNNs), enabling simultaneous learning from text corpora and graph-structured data within the same expert domain. By leveraging both modalities, the language model can better structure its knowledge and respond to complex queries. This dual learning process combines masked language modelling for text with link prediction on the graph, highlighting the importance of link prediction for robust reasoning.
  2. Higher-order implication graphs: Our implication graph currently represents only implications of the form $p \implies q$, not more complex ones like $(p \land r) \implies q$. Extending to such higher-order edges would likely involve connecting sets of nodes, thereby requiring hypergraph representations. For a systematic overview, see [4]. While specific hypergraph neural architectures exist [5], we believe it is still conceptually important to apply Directed Link Prediction to simpler implication graphs first, providing insights and guiding principles that can anticipate challenges in higher-order graph representation learning.

Pietro Monticone (Nov 10 2024 at 18:12):

Data

We used this edge list, generated on October 20, 2024, with the following commands:

lake exe extract_implications outcomes > data/tmp/outcomes.json
scripts/generate_edgelist_csv.py

The structure of the implication graph is summarised below:

graph_summary = {
    "total_nodes": 4694,
    "total_directed_edges": 8178279,
    "edge_density_percentage": 37,  # Percentage of possible edges that exist
    "bidirectional_edges": 2475610,
    "bidirectional_percentage": 30  # Percentage of all edges that are bidirectional
}
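
A minimal sketch of how these summary statistics can be recomputed, assuming the true implications have been loaded as a set of ordered (source, target) pairs; the file path and column names below are hypothetical, not the actual ones used by the scripts.

import pandas as pd

# Hypothetical path and column names; "edges" are the ordered pairs
# corresponding to implications that hold.
df = pd.read_csv("data/edgelist.csv")
edges = set(zip(df["source"], df["target"]))
nodes = set(df["source"]) | set(df["target"])

n, m = len(nodes), len(edges)
density_pct = 100 * m / (n * (n - 1))               # share of possible ordered pairs
bidir = sum((v, u) in edges for (u, v) in edges)    # edges whose reverse also exists

print({"total_nodes": n, "total_directed_edges": m,
       "edge_density_percentage": round(density_pct),
       "bidirectional_edges": bidir,
       "bidirectional_percentage": round(100 * bidir / m)})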

Below is a summary of the edge types in the graph:

edge_counts = {
    "explicit_conjecture_false": 92,
    "explicit_proof_false": 582316,
    "explicit_proof_true": 10657,
    "implicit_conjecture_false": 142,
    "implicit_proof_false": 13272681,
    "implicit_proof_true": 8167622,
    "unknown": 126
}

Edges are labeled according to the following scheme:

edge_labels = {
    "implicit_proof_true": 1,
    "explicit_proof_false": 0,
    "implicit_proof_false": 0,
    "explicit_proof_true": 1,
    "explicit_conjecture_false": 0,
    "implicit_conjecture_false": 0,
    "unknown": 0
}

The unknown class contains a very small number of edges (126), so their impact on the training phase is expected to be negligible.
Future approaches might address this class by excluding unknown edges from the training set.
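
As an illustration, here is a minimal pandas sketch of this labelling scheme; the file path and the edge_type column name are hypothetical, and the last line shows the alternative handling of unknown edges mentioned above.

import pandas as pd

edge_labels = {
    "implicit_proof_true": 1,
    "explicit_proof_false": 0,
    "implicit_proof_false": 0,
    "explicit_proof_true": 1,
    "explicit_conjecture_false": 0,
    "implicit_conjecture_false": 0,
    "unknown": 0,
}

df = pd.read_csv("data/edgelist.csv")             # hypothetical path
df["label"] = df["edge_type"].map(edge_labels)    # hypothetical column name

# Alternative discussed above: drop the 126 unknown edges before training.
df = df[df["edge_type"] != "unknown"]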

Pietro Monticone (Nov 10 2024 at 18:12):

Methods

Setup

Consider a directed graph $G = (V, E)$, where $E \subseteq \{(u,v) \mid u, v \in V\}$ is the edge set and $|V| = n$. We assume that each node is associated with a feature vector, resulting in a feature matrix $X \in \mathbb{R}^{n \times f}$.

We define the existing edges $(a, b) \in E$ as positives and the non-existing edges $(c, d) \notin E$ as negatives.

Intuitively, performing DLP on $G$ involves randomly splitting $E$ into three disjoint sets: $E_{\text{train}}$, $E_{\text{val}}$, and $E_{\text{test}}$, such that:

  • $E_{\text{train}}$ is the training set,
  • $E_{\text{val}}$ is the validation set,
  • $E_{\text{test}}$ is the test set,
  • $E = E_{\text{train}} \,\dot{\cup}\, E_{\text{val}} \,\dot{\cup}\, E_{\text{test}}$.

The model then learns from $G_{\text{train}} = (V, E_{\text{train}})$ to map the topological and feature-related information of two nodes $u$ and $v$ to a probability $p_{uv}$ that $(u, v) \in E_{\text{test}}$.

However, this setup presents two key issues among others:

  1. The model learns only from positives, so it cannot recognise negatives.
  2. The model is evaluated only on positives, preventing us from measuring its ability to identify negatives.

To address these limitations, we adopted the setup proposed in [6]. Specifically, we redefined $E_{\text{train}}$ as $E_{\text{train}}^p$ (positives) and introduced:

$E_{\text{train}} = E_{\text{train}}^p \,\dot{\cup}\, E_{\text{train}}^n$

where $E_{\text{train}}^n$ includes all possible negatives in $G_{\text{train}} = (V, E_{\text{train}}^p)$. The model is now required to predict the non-existence of the edges in $E_{\text{train}}^n$.

Similarly, if we redefine $E_{\text{test}}$ as follows:

$E_{\text{test}} = E_{\text{test}}^p \,\dot{\cup}\, E_{\text{test}}^n$

where $E_{\text{test}}^n$ is a random sample of negatives, the model's evaluation would fail to capture two crucial aspects:

  1. The model's ability to distinguish $(u,v)$ from $(v,u)$ for all $(u,v) \in E_{\text{test}}^p$.
  2. The model’s ability to identify bi-implications.

These limitations arise from the random selection of negative edges in $E_{\text{test}}^n$. To address this, we define three distinct test sets, $E_{\text{test}}^G$, $E_{\text{test}}^D$, and $E_{\text{test}}^B$, to evaluate different facets of the model's performance:

  • General Test Set ($E_{\text{test}}^G$): Here, $E_{\text{test}} = E_{\text{test}}^p \,\dot{\cup}\, E_{\text{test}}^n$, where $E_{\text{test}}^n$ is a random sample of non-existent edges with the same cardinality as $E_{\text{test}}^p$. This set assesses the model's ability to detect the presence of edges, regardless of direction. A model that cannot distinguish edge direction may still perform well on this set, highlighting the need for the following two additional test sets.
  • Directional Test Set ($E_{\text{test}}^D$): Defined as $E_{\text{test}}^{\text{up}} \,\dot{\cup}\, \tilde{E}_{\text{test}}^{\text{up}}$, where $E_{\text{test}}^{\text{up}}$ consists of the unidirectional edges in $E_{\text{test}}^p$, and $\tilde{E}_{\text{test}}^{\text{up}}$ contains their reverses (negatives by construction). This set evaluates the model's ability to recognise edge direction, making it suitable for assessing direction-sensitive models.
  • Bidirectional Test Set ($E_{\text{test}}^B$): Defined as $E_{\text{test}}^{\text{bp}} \,\dot{\cup}\, E_{\text{B}}^{\text{n}}$, where $E_{\text{test}}^{\text{bp}}$ contains one direction of all bidirectional edges in $E_{\text{test}}^p$, and $E_{\text{B}}^{\text{n}} \subset \tilde{E}$ includes a subset of their reverses. This set evaluates the model's ability to identify bi-implications, as each edge in $E_{\text{test}}^B$ has a reverse that is positive, but only half are bidirectional in practice. A construction sketch for all three test sets is given below.
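
To make the three definitions concrete, here is a minimal plain-Python sketch of how such test sets could be assembled from $E_{\text{test}}^p$ and the full edge set. It is an illustration under assumptions (in particular, the bidirectional negatives are taken here to be reverses of unidirectional edges), not the exact code used for the experiments.

import random

def build_test_sets(test_pos, all_edges, num_nodes):
    """Sketch: assemble the General, Directional and Bidirectional test sets.

    test_pos  -- set of positive directed test edges (E_test^p)
    all_edges -- set of all positive directed edges in the graph
    Returns three lists of ((u, v), label) pairs.
    """
    # General: positives plus an equally sized random sample of non-edges.
    negs = set()
    while len(negs) < len(test_pos):
        u, v = random.randrange(num_nodes), random.randrange(num_nodes)
        if u != v and (u, v) not in all_edges:
            negs.add((u, v))
    general = [(e, 1) for e in test_pos] + [(e, 0) for e in negs]

    # Directional: unidirectional positives and their reverses (negatives).
    uni = [(u, v) for (u, v) in test_pos if (v, u) not in all_edges]
    directional = [((u, v), 1) for (u, v) in uni] + [((v, u), 0) for (u, v) in uni]

    # Bidirectional: one direction of each bidirectional positive, plus an
    # equally sized sample of reversed edges that are negatives by construction
    # (taken here as reverses of unidirectional edges -- an assumption).
    seen, bidir = set(), []
    for (u, v) in test_pos:
        if (v, u) in all_edges and frozenset((u, v)) not in seen:
            bidir.append((u, v))
            seen.add(frozenset((u, v)))
    neg_pool = [(v, u) for (u, v) in uni]
    sample = random.sample(neg_pool, min(len(bidir), len(neg_pool)))
    bidirectional = [(e, 1) for e in bidir] + [(e, 0) for e in sample]

    return general, directional, bidirectional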

Pietro Monticone (Nov 10 2024 at 18:14):

Models

We tested the following six models, named here as in the results table below: gae, gravity_gae, sourcetarget_gae, mlp_gae, digae, and magnet.

All these models are graph-based autoencoders with distinct encoder-decoder architectures. Notably, GAE is the only model structurally unable to differentiate edge directions. Each model outputs the probability that an ordered pair of nodes has a directed edge between them, with nodes represented using one-hot encodings as features.
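
For intuition, here is a minimal sketch of a source/target-style graph autoencoder in PyTorch Geometric, roughly in the spirit of the models above; the class name, layer sizes, and architectural details are illustrative assumptions, not the configurations actually used in these experiments.

import torch
from torch_geometric.nn import GCNConv

class SourceTargetGAE(torch.nn.Module):
    """Illustrative source/target graph autoencoder (not the exact models used above)."""

    def __init__(self, num_nodes, hidden_dim=64, embed_dim=32):
        super().__init__()
        self.conv1 = GCNConv(num_nodes, hidden_dim)
        # Separate "source" and "target" embeddings per node let the decoder
        # assign different probabilities to (u, v) and (v, u).
        self.conv_src = GCNConv(hidden_dim, embed_dim)
        self.conv_dst = GCNConv(hidden_dim, embed_dim)

    def encode(self, x, edge_index):
        h = torch.relu(self.conv1(x, edge_index))
        return self.conv_src(h, edge_index), self.conv_dst(h, edge_index)

    def decode(self, s, t, pairs):
        # Probability of a directed edge u -> v from the source embedding of u
        # and the target embedding of v.
        u, v = pairs
        return torch.sigmoid((s[u] * t[v]).sum(dim=-1))

# One-hot node features, as in the experiments described in this thread:
# x = torch.eye(num_nodes)

By contrast, a plain GAE uses a single embedding per node with a symmetric inner-product decoder, which is why it cannot tell $(u, v)$ apart from $(v, u)$.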

We trained the models using Binary Cross Entropy as the loss function, balancing the contribution of positive and negative edges through re-weighting. On the General test set, we evaluated the following metrics:

  • AUC (Area Under the ROC Curve): Measures the probability that the model ranks a random positive edge higher than a random negative edge. Higher values indicate better discrimination between positive and negative edges.
  • AUPRC (Area Under Precision-Recall Curve): Assesses model performance, particularly in cases of class imbalance. AUPRC is more robust to imbalanced data than AUC.
  • Hits@K: Evaluates the fraction of times a positive edge is ranked within the top $K$ positions among personalised negative samples [9]. Briefly, given a positive edge, its $M$ personalised negative samples are $M$ negative edges with the same head but different tails. We calculate Hits@K for $K = 1, 3, 10$ to assess ranking quality, and set $M = 100$. Therefore, Hits@K $= 0.1$ means that, on average, the model ranks a positive edge within the top $K$ of its personalised negatives $10\%$ of the time.
  • MRR (Mean Reciprocal Rank): Computes the average reciprocal rank of positive edges among their personalised negative samples [9] (the same as those used for Hits@K), providing an overall measure of ranking accuracy. For instance, MRR $= 0.1$ means that, on average, the model ranks a positive edge $10^{\text{th}}$ among its $M$ personalised negative samples.

Each metric ranges from $0$ to $1$, with higher values reflecting better performance. Based on prior work, we expect AUC and AUPRC scores to approach $1$, while MRR and Hits@K often yield values around 0.15 for similar undirected tasks [9]. Directional and Bidirectional performances were evaluated using only AUC and AUPRC, since Hits@K and MRR are not readily defined in those scenarios. The number of training epochs was optimised through Early Stopping on the General validation set performance (given by the sum of AUC and AUPRC).
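
For clarity, here is a minimal sketch of how Hits@K and MRR with personalised negative samples can be computed. It assumes a score(u, v) function returning the model's predicted probability for the directed edge (u, v); the function name and sampling details are illustrative, not our exact implementation.

import random

def ranking_metrics(test_pos, all_edges, num_nodes, score, M=100, ks=(1, 3, 10)):
    """Sketch of Hits@K and MRR with personalised negative samples.

    For each positive edge (u, v), sample M negatives (u, w) with the same head
    u and a tail w such that (u, w) is not an edge, then rank the positive
    among them by model score.
    """
    hits = {k: 0 for k in ks}
    rr_sum = 0.0
    for (u, v) in test_pos:
        negs = set()
        while len(negs) < M:
            w = random.randrange(num_nodes)
            if w != u and (u, w) not in all_edges:
                negs.add((u, w))
        pos_score = score(u, v)
        # Rank = 1 + number of personalised negatives scored above the positive.
        rank = 1 + sum(score(a, b) > pos_score for (a, b) in negs)
        rr_sum += 1.0 / rank
        for k in ks:
            hits[k] += rank <= k
    n = len(test_pos)
    out = {f"hits@{k}": hits[k] / n for k in ks}
    out["mrr"] = rr_sum / n
    return out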

Pietro Monticone (Nov 10 2024 at 18:15):

Results

The results below represent average performance over six random splits of $E_{\text{train}}$, $E_{\text{val}}$, and $E_{\text{test}}$, while keeping the model's seed fixed for fair comparison. The training / validation / test split proportions are:

  • 85 / 5 / 10 for unidirectional edges
  • 65 / 15 / 30 for bidirectional edges

| Model | G_ROC_AUC | G_AUPRC | G_Hits@1 | G_Hits@3 | G_Hits@10 | G_MRR | D_ROC_AUC | D_AUPRC | B_ROC_AUC | B_AUPRC |
|---|---|---|---|---|---|---|---|---|---|---|
| gae | 0.8484 ± 9e-04 | 0.8558 ± 6e-04 | 6e-05 ± 4e-05 | 6e-05 ± 4e-05 | 0.0001 ± 4e-05 | 0.0165 ± 2e-04 | 0.5 ± 0 | 0.5 ± 0 | 0.941 ± 5e-03 | 0.964 ± 3e-03 |
| gravity_gae | 0.9806 ± 3e-04 | 0.9753 ± 4e-04 | 0.069 ± 6e-03 | 0.101 ± 5e-03 | 0.17 ± 3e-03 | 0.112 ± 5e-03 | 0.9958 ± 1e-04 | 0.9874 ± 2e-04 | 0.99717 ± 4e-05 | 0.99431 ± 7e-05 |
| sourcetarget_gae | 0.99976 ± 1e-05 | 0.999736 ± 8e-06 | 0.077 ± 4e-03 | 0.147 ± 7e-03 | 0.279 ± 9e-03 | 0.152 ± 5e-03 | 0.999982 ± 1e-06 | 0.999983 ± 1e-06 | 0.999989 ± 2e-06 | 0.999987 ± 3e-06 |
| mlp_gae | 0.99315 ± 1e-05 | 0.99409 ± 1e-05 | 0.181 ± 7e-03 | 0.299 ± 7e-03 | 0.53 ± 2e-03 | 0.289 ± 6e-03 | 0.99671 ± 2e-05 | 0.9973 ± 1e-05 | 0.99692 ± 2e-05 | 0.99736 ± 2e-05 |
| digae | 0.9978 ± 3e-04 | 0.998 ± 3e-04 | 0.035 ± 6e-03 | 0.068 ± 1e-02 | 0.18 ± 2e-02 | 0.091 ± 1e-02 | 0.9991 ± 2e-04 | 0.9993 ± 2e-04 | 0.9994 ± 1e-04 | 0.9995 ± 1e-04 |
| magnet | 0.989 ± 1e-04 | 0.99076 ± 3e-05 | 0.151 ± 1e-02 | 0.26 ± 2e-02 | 0.38 ± 2e-02 | 0.24 ± 1e-02 | 0.9962 ± 1e-03 | 0.9969 ± 6e-04 | 0.9976 ± 4e-04 | 0.9979 ± 2e-04 |

We observe consistently high General AUC and AUPRC scores, close to 1. These high values are expected, as similar neural architectures have demonstrated strong performance on graphs of comparable size [1]. The high ratio of existing to non-existing edges in the implication graph (approximately 37%) likely contributes to the near-perfect General AUC and AUPRC scores. For context, benchmark datasets such as Cora and Citeseer (in both their directed and undirected variants) contain fewer than 1% of all possible edges.

Interestingly, the GAE model, though structurally unable to distinguish edge direction, performs well on the General task (if we look at AUC and AUPRC only). This experimentally confirms the need to include the Directional and Bidirectional test sets mentioned above, which allow a comprehensive evaluation across all facets of DLP.

All other models demonstrate high AUC and AUPRC scores across the General, Directional, and Bidirectional test sets, indicating strong predictive capabilities even when accounting for directionality and bidirectionality.

Notably, the mlp_gae and magnet models also achieve competitive scores in MRR and Hits@K metrics.

Pietro Monticone (Nov 10 2024 at 18:15):

Conclusions

We evaluated the performance of six different graph autoencoders on a DLP task. By adopting a multi-task evaluation framework, we assessed the models comprehensively across various aspects of DLP. All models demonstrated strong performance on AUC and AUPRC metrics, with some also achieving high scores on MRR and Hits@K.

Node features were represented using one-hot encodings, meaning that the models received no explicit information about the equations represented by the nodes. Instead, they relied entirely on the topological structure encoded during training. This approach aligns with previous research suggesting that one-hot encodings can promote asymmetric embeddings [6]. However, future experiments could explore alternative representations, such as encoding the equations with text-based embeddings like Word2Vec, to potentially enhance the models' understanding of the nodes’ semantic content.

In summary, our findings highlight the adaptability and robustness of graph autoencoders for DLP tasks in dense, directed graphs. This approach not only demonstrates robustness in predicting directed links but also suggests promising potential for future improvements through enhanced feature representations, thereby advancing the capabilities of link prediction in complex mathematical knowledge graphs.

Pietro Monticone (Nov 10 2024 at 18:16):

References

Terence Tao (Nov 10 2024 at 18:25):

I wonder if it is possible to use the github versioning features to recover an early stage of the implication graph - in particular, before the big Vampire run that obtained almost all known positive implications - and see how that could have been used to predict the actual graph, and in particular how these methods perform on the last 1000 or so implications that were resistant to these techniques. Or to try to randomly generate a testing/training set using "cheap" techniques such as finite magma counterexamples and ATPs with short run time and see what their prediction rate is on various sets of implications (either the graph as a whole, or subsets based on which implications were open at which dates).

Terence Tao (Nov 10 2024 at 18:28):

In principle, in a future project of this type, one might be able to use these ML methods to produce some potential "inpaintings" or "completions" of incomplete implication graphs at an intermediate stage of a project, which would have absolutely no guarantee of being close to the ground truth (which would not yet be known at that stage), but might serve as a tool for making conjectures for instance.


Last updated: May 02 2025 at 03:31 UTC