1. Introduction
Graph similarity computation aims to calculate the similarity between graphs, which is essential to a number of downstream applications such as biological molecular similarity search [1], malware detection [2] and knowledge graph fusion [3,4]. Graph edit distance (GED) [5] and maximum common subgraph (MCS) [6] are frequently used metrics for evaluating graph similarity (these two metrics are inter-related [5], and we mainly discuss GED in this paper). Specifically, GED is defined as the minimum cost taken to transform one graph to the other via a sequence of graph edit operations on nodes or edges [7].
Figure 1a depicts a simple example of GED computation, where one graph is transformed into the other via edge deletion, node addition, and edge addition, resulting in a GED value of 3. Usually, a lower GED value between two graphs implies a higher similarity [7].
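For intuition, GED can be computed exactly on tiny graphs. The snippet below is our own illustration (not part of the original paper's pipeline) using networkx's exact, exponential-time solver to recover the single edge deletion separating a triangle from a 3-node path:

```python
# Hypothetical illustration of GED on tiny graphs using networkx's
# exact (exponential-time) solver; real benchmarks use far larger graphs.
import networkx as nx

g1 = nx.cycle_graph(3)  # triangle: 3 nodes, 3 edges
g2 = nx.path_graph(3)   # path: 3 nodes, 2 edges

# Deleting one edge turns the triangle into the path, so the GED is 1.
ged = nx.graph_edit_distance(g1, g2)
print(ged)  # 1.0
```

Because exact computation is intractable in general, this approach only works for very small graphs, which is precisely what motivates the neural approximations discussed next.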
Traditional methods design either exact or approximate algorithms to calculate the GED values [8,9,10]. Nevertheless, computing GED is widely known to be an NP-complete problem [9]. Hence, these algorithms usually consume substantial computational resources and scale poorly. To mitigate these issues, graph neural networks (GNNs) [11,12] are utilized to calculate approximate graph similarity scores [7,13]. As shown in Figure 1b, the main idea is to project the symbolic graph representations into an embedding space via GNNs, where similar graphs (i.e., graphs with low GED values) are placed close to each other. To this end, state-of-the-art solutions first embed the graphs, and then aggregate the node-level and graph-level interactions between the embeddings to generate the final similarity scores [7,14].
Nonetheless, we still observe several issues in existing schemes: (1) existing neural methods require excessive running time and memory space due to their intricate neural network designs, which might fail to work in real-life settings; (2) they heavily rely on labeled data for training an effective model and neglect the information hidden in the vast unlabeled data, while in practice labeled data are difficult to obtain.
To address these issues, we put forward a contrastive neural graph matching network, i.e., Conga, to compute graph similarity scores in an effective and efficient manner. The backbone of Conga is a vanilla multilayer graph convolutional network (GCN), followed by attention-based pooling layers, which generate the representations of the two graphs, respectively. The graph representations generated by each layer are concatenated and fed into a multilayer perceptron to produce the similarity score between the two graphs. To further mine useful signals from the graphs, we use an unsupervised contrastive loss to discriminate the graph embeddings and guide the training process towards more expressive representations. Our proposed model is lightweight, as the complicated multilevel interaction modeling process is removed. That being said, it achieves state-of-the-art performance on existing graph similarity evaluation benchmarks.
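The attention-based pooling step can be sketched as follows. This is a simplified numpy illustration under our own assumptions (a single weight matrix `w` and a tanh-activated mean context), not the paper's exact formulation:

```python
import numpy as np

def attention_pooling(h, w):
    """Pool node embeddings h (n x d) into a single graph embedding (d,).

    A global context is derived from the transformed node embeddings;
    each node then receives a sigmoid attention weight based on its
    similarity to that context, and the weighted sum is returned.
    """
    c = np.tanh((h @ w).mean(axis=0))   # global context vector (d,)
    a = 1.0 / (1.0 + np.exp(-(h @ c)))  # sigmoid attention weights (n,)
    return a @ h                        # attention-weighted sum (d,)

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 4))  # 5 nodes with 4-dim embeddings
w = rng.normal(size=(4, 4))  # hypothetical learnable weight matrix
g = attention_pooling(h, w)
print(g.shape)  # (4,)
```

Pooling each GCN layer's output this way and concatenating the results yields a fixed-size graph representation that can be fed to the similarity perceptron.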
Contribution. The contributions of this work can be summarized in three aspects:
To the best of our knowledge, this is among the first attempts to utilize contrastive learning to facilitate the neural graph similarity computation process.
We propose to simplify the graph similarity prediction network, and the resultant model attains higher efficiency and maintains competitive performance.
We compare our proposed model, Conga, against state-of-the-art methods on existing benchmarks, and the results demonstrate its superior performance.
Organization. Section 2 overviews related works. Section 3 introduces the graph similarity prediction network and the contrastive learning module. Section 4 presents the experimental results, followed by the conclusion in Section 5.
2. Related Works
In this section, we first present existing solutions to graph similarity computation. Next, we briefly mention contrastive learning and its applications. Finally, we introduce the graph matching task, which is closely related to graph similarity computation.
Graph similarity computation. Computing the similarity between graphs is a long-standing and challenging problem with many real-world applications [15,16,17,18]. Graph similarity is usually defined based on structural similarity measures such as GED or MCS [19]. Traditional exact GED calculation is known to be NP-complete and cannot scale to graphs with more than tens of nodes. Thus, classic approximation algorithms have been proposed to mitigate this issue. They essentially estimate the minimum cost (i.e., a set of edit operations) of transforming one graph into another, and the resultant edit operations can be used to explain the similarity score [20]. Nevertheless, the computation and storage costs of these algorithms are still very high [10,15,21,22].
To fill this gap, an increasing number of recent studies employ end-to-end GNNs to calculate graph similarities in the embedding space [23,24,25]. The goal is to learn parameters that can model graph similarity from empirical data, which are then used to predict similarity scores for new graphs. Specifically, SimGNN models node-level and graph-level cross-graph interactions using histogram features and neural tensor networks [7], while GraphSim instead utilizes convolutional neural networks to capture the node-level interactions [13]. GMN computes the similarity score through a cross-graph attention mechanism that associates nodes across graphs [26]. MGMN devises a multilevel graph matching network for computing graph similarity, including global-level graph–graph interactions, local-level node–node interactions, and cross-level interactions [14]. The model of [27] captures substructure similarities across graphs by introducing the concept of hypergraphs. Different from these studies, in this work we remove the node-level and substructure-level interactions (whose complexity can grow sharply on larger graphs) and merely model the graph-level interactions. The resultant framework achieves both high effectiveness and efficiency, as demonstrated in the experiment section.
Additionally, a few works aim to increase the interpretability of the graph similarity learning process. GOTSim formulates graph similarity as the minimal transformation from one graph to another in the node embedding space, where an efficient differentiable model training strategy is offered [20]. Wang et al. present a hybrid approach that combines the interpretability of traditional search-based techniques for generating the edit path with the efficiency and adaptivity of neural models to reduce the cost of the GED solver [28]. Yang and Zhou propose to combine the A* search algorithm and graph neural networks to compute approximate GED [29]; specifically, the estimated cost function and the elastic beam size are learned through the neural networks. A summary of existing methods can be found in Table 1.
Contrastive learning. Contrastive learning is a machine learning paradigm in which unlabeled data points are juxtaposed against each other to teach a model which points are similar and which are different. The key is to create positive and negative samples: positive pairs are pulled close together in the embedding space, while negative pairs are pushed apart. By learning to encode the similarities or dissimilarities among a set of unlabeled examples, more expressive data representations can be learned [30]. Following a recent trend that uses contrastive learning to facilitate graph-related tasks [31,32], in this work we also devise an unsupervised contrastive loss to guide graph similarity learning.
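As a concrete, generic instantiation of this idea, the widely used NT-Xent objective treats two augmented views of the same sample as a positive pair and all other samples in the batch as negatives. The numpy sketch below is illustrative and not taken from the paper:

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent contrastive loss for paired view embeddings z1, z2 (n x d).

    Row i of z1 and row i of z2 are embeddings of two augmented views of
    the same graph (a positive pair); all other rows act as negatives.
    """
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-normalize
    sim = (z @ z.T) / tau                             # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                    # drop self-similarity
    n = z1.shape[0]
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # Cross-entropy of each row against its positive partner.
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 16))
z2 = z1 + 0.01 * rng.normal(size=(8, 16))  # a slightly perturbed "view"
loss = nt_xent(z1, z2)
print(float(loss))
```

Minimizing this loss pulls each positive pair together and pushes it away from the other graphs in the batch, exactly the push–pull behavior described above.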
Graph matching. Discovering the equivalent nodes among graphs, i.e., graph matching, has been intensively studied in different domains, such as social networks [33,34], computer vision [35,36] and knowledge graphs [37,38,39]. Since the graphs are usually large-scale, state-of-the-art solutions utilize GNN embeddings to approximate the matching results. To learn the parameters of these neural models, either node-level or graph-level matching labels are used, and graph matching approximates GED when GED is used as the graph-level annotation [20]. However, it should be noted that the goal of neural graph matching is to generate fine-grained node matching results, whereas graph similarity learning merely needs to predict the similarity score.
4. Experiments and Results
In this section, we first introduce the experimental settings. Next, we provide the empirical results and discuss the performance. Finally, we conduct further experiments.
4.1. Experimental Settings
In this subsection, we introduce the experimental settings, including the datasets, evaluation metrics, parameter settings and the baseline models.
4.1.1. Datasets
Following the state-of-the-art works [27], we adopt three real-world datasets for evaluation:
AIDS [49], a dataset consisting of molecular compounds from the antiviral screen database of active compounds. The compounds are converted to graphs, where atoms are represented as nodes and covalent bonds as edges. The dataset comprises 42,687 chemical compound structures, of which 700 graphs were selected for evaluating graph similarity search;
LINUX [50], a dataset consisting of program dependence graphs from the Linux kernel. Each graph represents a function, where statements are represented as nodes and the dependencies between two statements as edges. A total of 1000 graphs were selected for evaluation;
IMDB [51], a dataset consisting of ego-networks of film stars, where nodes are people and edges indicate that two people appear in the same movie. It consists of 1500 networks.
More statistics of the datasets can be found in Table 2 (the datasets can be obtained from https://github.com/yunshengb/SimGNN, accessed on 1 September 2021). For each dataset, we create the training, validation and testing sets following a 60%:20%:20% split. For AIDS and LINUX, the ground-truth GED is obtained using the A* algorithm [8]. Since the IMDB dataset contains larger graphs on which exact algorithms fail to work, approximate algorithms, i.e., HED [10], Hungarian [21] and VJ [22], are used to calculate the GEDs, and the minimum value is regarded as the ground-truth label [7].
4.1.2. Evaluation Metrics
We follow previous works and use mean squared error (mse), Spearman’s rank correlation coefficient (ρ) and precision@10 (p@10) as the evaluation metrics. Among them, mse measures the average squared difference between the ground-truth and the predicted graph similarity scores. The remaining two evaluate the ranking results, where database graphs are ranked according to their similarity to the query graph, with more similar graphs assigned lower (better) rank values. ρ characterizes the correlation between the ground-truth and predicted rankings, and p@10 is computed by dividing the size of the intersection of the ground-truth and predicted top-10 results by 10.
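The three metrics can be reproduced on toy data as follows (our own sketch: scipy is used for Spearman's rank correlation, and the similarity scores are synthetic):

```python
import numpy as np
from scipy.stats import spearmanr

def p_at_k(true_sim, pred_sim, k=10):
    """Size of the intersection of true and predicted top-k results over k."""
    true_top = set(np.argsort(-true_sim)[:k])
    pred_top = set(np.argsort(-pred_sim)[:k])
    return len(true_top & pred_top) / k

rng = np.random.default_rng(0)
true_sim = rng.random(100)                         # ground-truth similarities
pred_sim = true_sim + 0.05 * rng.normal(size=100)  # noisy predictions

mse = np.mean((true_sim - pred_sim) ** 2)  # average squared error
rho, _ = spearmanr(true_sim, pred_sim)     # rank correlation coefficient
print(mse, rho, p_at_k(true_sim, pred_sim))
```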
4.1.3. Parameter Settings
Regarding the graph similarity prediction network, we use three convolutional layers with ReLU as the activation function. The dimension of the node embeddings generated by all layers is set to 100. For attention-based pooling, the sigmoid function is used, and the dimension of the graph-level embedding is set to 100. The prediction module is a multilayer perceptron of four fully connected layers with ReLU activations, which reduce the dimension from 600 to 200, 200 to 100, 100 to 50, and 50 to 1; the output is then processed by a sigmoid function to predict the similarity score. As for the contrastive learning module, we set the hyperparameters of the three augmentation methods to 0.1 and randomly choose one of the three methods when augmenting the original graph. The projection network in this module consists of two fully connected layers with ReLU activation, where the embedding dimension is kept at 300. The loss hyperparameter of the contrastive module is set to 0.5. We set the batch size to 1024 and the learning rate to 0.001. All the hyperparameters are tuned on the validation set. The experiments are conducted on a personal computer running Ubuntu, with an Intel Core i7-4790 CPU, an NVIDIA GeForce GTX TITAN X GPU and 32 GB of memory.
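The four-layer prediction perceptron described above can be sketched in numpy as follows (an illustration with arbitrary random weights; the real layers are trained end to end):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_similarity(pair_embedding, weights):
    """Map a 600-dim concatenated graph-pair embedding to a score in (0, 1)
    via fully connected layers of sizes 600-200-100-50-1."""
    x = pair_embedding
    for w in weights[:-1]:
        x = relu(x @ w)              # hidden layers use ReLU
    return sigmoid(x @ weights[-1])  # final score squashed by sigmoid

rng = np.random.default_rng(0)
dims = [600, 200, 100, 50, 1]
weights = [rng.normal(scale=0.05, size=(a, b)) for a, b in zip(dims, dims[1:])]
score = predict_similarity(rng.normal(size=600), weights)
print(score.shape)  # (1,)
```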
It is noteworthy that, since the nodes in AIDS are labeled, we use one-hot embeddings as the initial node embeddings. For LINUX and IMDB, where nodes are unlabeled, we use a constant encoding scheme [7].
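The two initialization schemes can be illustrated as follows (a minimal sketch; the function and argument names are ours):

```python
import numpy as np

def init_features(node_labels, num_label_types=None):
    """One-hot encode node labels when available (as for AIDS); otherwise
    fall back to a constant all-ones encoding (as for LINUX and IMDB)."""
    n = len(node_labels)
    if num_label_types is None:             # unlabeled nodes: constant scheme
        return np.ones((n, 1))
    feats = np.zeros((n, num_label_types))  # labeled nodes: one-hot rows
    feats[np.arange(n), node_labels] = 1.0
    return feats

print(init_features([2, 0, 1], num_label_types=3))  # 3 x 3 one-hot matrix
print(init_features([None, None, None]))            # 3 x 1 constant matrix
```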
4.1.4. Baseline Models
We compare with three categories of methods. The first category consists of classical methods, including the A* algorithm [8], which computes the exact GED, and Beam [15], Hungarian [21], VJ [22] and HED [10], which calculate approximate GED values.
The second category comprises state-of-the-art GNN-based methods, including:
SimGNN [7], which uses a histogram function to model the node–node interactions, supplemented with graph-level interactions, to predict the graph similarity;
GMN [26], which computes the similarity score through a cross-graph attention mechanism that associates nodes across graphs;
GraphSim [13], which estimates the graph similarity by applying a convolutional neural network (CNN) to the node–node similarity matrix of the two graphs;
GOTSim [20], which formulates graph similarity as the minimal transformation from one graph to another in the node embedding space, with an efficient differentiable model training strategy;
MGMN [14], which models global-level graph–graph interactions, local-level node–node interactions, and cross-level interactions, constituting a multilevel graph matching network for computing graph similarity. It has two variants, i.e., SGNN, a siamese graph neural network that learns global-level interactions, and NGMN, a node–graph matching network that effectively learns cross-level interactions;
the hypergraph-based model of [27], which learns the graph representations from the perspective of hypergraphs and thus captures rich substructure similarities across the graphs. It has two variants: one uses random walks to construct the hypergraphs, and the other uses node neighborhood information to build them.
The third category combines traditional and neural models for computing the graph similarity:
GENN-A* [28], which uses dynamic node embeddings to improve the branching process of the traditional A* algorithm;
Noah [29], which combines the A* search algorithm and graph neural networks to compute approximate GED, where the estimated cost function and the elastic beam size are learned through the neural networks.
4.2. Experimental Results
In this subsection, we first report the main results. Then, we conduct the ablation study. Finally, we compare the efficiency of algorithms.
4.2.1. Main Results
The main experimental results are presented in Table 3. As the classical A* algorithm can calculate the exact GED on the smaller datasets, i.e., AIDS and LINUX, it attains perfect results on these two datasets. However, it cannot scale to the IMDB dataset. The classical algorithms that compute approximate GED values, i.e., Beam, Hungarian, VJ and HED, attain relatively poorer results compared with the methods involving neural networks. Nevertheless, it should be noted that these classical methods can explicitly produce the edit path underlying the predicted distance, while the pure neural methods cannot. Additionally, the solutions generated by these methods are upper bounds of the optimal GED values, while neural models may predict values on either side of the optimum.
As for the GNN-based regression methods, i.e., SimGNN, GMN, GraphSim, GOTSim, SGNN, NGMN, MGMN and the hypergraph-based model of [27], they achieve better results than the traditional ones. Among them, the hypergraph-based model performs best, validating the usefulness of modeling substructure interactions. After combining the classical and neural models, the ranking performance of Noah and GENN-A* increases, while their mse performance drops slightly. Additionally, neither Noah nor GENN-A* can effectively handle the IMDB dataset.
Our proposed method, Conga, attains the best overall performance, especially on the larger IMDB dataset. On the smaller datasets, it also attains leading results. Its ranking results on AIDS are lower than those of GENN-A* and GOTSim, which can be partially attributed to the fact that it focuses more on estimating the correct GED distance. Additionally, its mse result is slightly inferior to that of the hypergraph-based model [27] on the LINUX dataset. That being said, its performance is more robust and consistent across all datasets and metrics.
4.2.2. Ablation Results
We further conduct an ablation study to validate the effectiveness of our proposed modules. Specifically, we remove the contrastive learning module and report the results of the bare graph similarity prediction network, i.e., Conga w/o CL, in Table 3. It is evident that without contrastive learning, the results on most metrics drop across all datasets, especially on the larger IMDB dataset. This can be ascribed to the more expressive representations induced by the unsupervised contrastive loss.
Furthermore, we can observe that even without the contrastive learning module, the vanilla graph similarity prediction network attains state-of-the-art performance compared with existing methods. This shows that merely modeling the graph-level interactions can already achieve superior prediction results, even though the intricate node-level and substructure-level information is discarded. Additionally, we reveal in the following section that our design is much more efficient than present methods that aim to capture such fine-grained information. It is noteworthy that we do not claim that node–node and substructure–substructure interactions are useless; rather, we aim to demonstrate that modeling only the graph-level interactions can already achieve both high effectiveness and efficiency.
4.2.3. Efficiency Comparison
We further compare the efficiency of these approaches. It can be seen from Table 4 that our proposed framework, Conga, requires the least running time: it is over 30× faster than the classic methods and 3× faster than the neural methods. This is due to its simple architecture, which merely comprises convolutional layers and pooling functions and abandons the resource-consuming fine-grained interactions between graphs. In addition, the table shows that the contrastive learning module brings extra running time.
From the table, we can also observe that, on average, the neural methods (on the left) are faster than the traditional ones or the ones that build upon traditional frameworks (on the right).
4.3. Further Analysis
We conduct further experiments to validate the effectiveness of our proposed components.
4.3.1. On GNN Layers
First, we test the influence of the number of GNN layers on the results. We vary the number of GNN layers from 1 to 4 and report the mse values generated by our proposed graph similarity prediction network. As shown in Figure 3a, the number of layers does affect the overall performance. Specifically, too few layers cannot extract sufficient information from the graphs, leading to larger mse values on AIDS and LINUX, while too many layers cause overfitting, as pointed out by previous studies, which in turn results in a higher mse on IMDB. Hence, the number of GNN layers should be set to an appropriate value.
4.3.2. On Graph Augmentation in Contrastive Learning
Next, we examine the influence of different data augmentation strategies in contrastive learning on the results. Specifically, we compare with two variants, i.e., CL1, which combines two augmentation strategies to augment the graph each time, and CL2, which additionally introduces an augmentation strategy that samples a subgraph from the original graph using random walks [47].
It is observed from Figure 3b that our proposed augmentation strategy generates the lowest mse value. This demonstrates that, for the graph similarity computation task, the augmented graph should not deviate too much from the original graph (Ours vs. CL1), and that the subgraph sampling strategy does not align with the goal of GED computation (Ours vs. CL2), since a sampled subgraph might significantly change the original graph’s GED to another graph.
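For reference, augmentation operators of the kinds discussed above can be sketched as follows (a generic illustration with our own function names; the paper's exact operators are not reproduced):

```python
import random

def drop_edges(edges, p=0.1):
    """Mild augmentation: randomly remove a fraction p of the edges."""
    return [e for e in edges if random.random() >= p]

def random_walk_subgraph(adj, start, length):
    """Subgraph sampling via random walk (the strategy added in CL2);
    adj maps each node to its list of neighbors."""
    nodes, cur = {start}, start
    for _ in range(length):
        if not adj[cur]:
            break
        cur = random.choice(adj[cur])
        nodes.add(cur)
    return nodes

random.seed(0)
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(random_walk_subgraph(adj, 0, 5))  # a connected node subset containing 0
```

Aggressive operators such as subgraph sampling can change a graph's GED to other graphs substantially, which is consistent with the observation above.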
4.3.3. On Loss Function
Furthermore, we analyze the hyperparameter in Equation (5) that controls the contributions of the supervised similarity prediction loss and the unsupervised contrastive loss. As shown in Figure 3c, generally speaking, a lower value is preferred, since the contrastive learning module mainly helps to learn more expressive embeddings and does not directly influence the similarity computation; hence, importance should still be placed on similarity prediction. Nevertheless, setting the value to 0.25 brings a higher mse than 0.5 on IMDB, showing that contrastive learning still plays a significant role in the overall loss function.
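Assuming Equation (5) is a weighted sum of the two terms (our assumption for illustration; see the paper for the exact form), the trade-off can be expressed as:

```python
def total_loss(pred_loss, cl_loss, weight):
    """Assumed combined objective: the supervised similarity prediction
    loss plus a weighted unsupervised contrastive term (illustrative only)."""
    return pred_loss + weight * cl_loss

# A smaller weight keeps the emphasis on similarity prediction while the
# contrastive term still regularizes the representations.
print(total_loss(0.8, 0.4, 0.5))  # 1.0
```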
4.3.4. On Learning Rate
Finally, we investigate the influence of the learning rate on the training process. It is observed from Figure 3d that the learning rate is critical to the overall performance. Thus, in our experiments, we carefully tune this hyperparameter on the validation set and set the learning rate to 0.001.
5. Conclusions and Future Works
In this work, we put forward a contrastive neural graph similarity computation framework, i.e., Conga, to mitigate the reliance on labeled data while enhancing computation efficiency. Experimental results on three public datasets validate that Conga attains the best overall performance across different evaluation metrics. Meanwhile, Conga largely speeds up the graph similarity computation process. This reveals that our proposed strategy can be a competitive candidate for real-world applications involving graph similarity computation.
For future works, we plan to explore more useful augmentation strategies for contrastive learning. Additionally, how to make the prediction results explainable is also worthy of investigation.