Article

Interpreting Deep Graph Convolutional Networks with Spectrum Perspective

School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China
* Author to whom correspondence should be addressed.
Submission received: 6 April 2023 / Revised: 1 May 2023 / Accepted: 8 May 2023 / Published: 11 May 2023

Abstract

Graph convolutional network (GCN) architecture is the basis of many neural networks and has been widely used in processing graph-structured data. When dealing with large and sparse data, deeper GCN models are often required. However, these models suffer from performance degradation as the number of layers increases. Current research mainly attributes this degradation to over-smoothing, while gradient vanishing, training difficulty, and other explanations have also been put forward, so no consensus has been reached. In this paper, we theoretically analyze the degradation problem by adopting spectral graph theory to consider, from a global perspective, both the propagation and transformation components of the GCN architecture, and conclude that the over-smoothing caused by the propagation matrices is not the key factor in performance degradation. Then, in addition to using conventional experimental methods, we propose an experimental analysis strategy guided by random matrix theory to analyze the singular value distribution of the model weight matrices. We conclude that the key factor leading to the degradation of model performance is the transformation component. Given the lack of consensus on the problem of model performance degradation, this paper offers a systematic analysis strategy together with theoretical and empirical evidence.

1. Introduction

In recent years, graph neural networks (GNNs) [1] have exhibited outstanding performance in processing graph-structured data, and graph convolutional networks (GCNs) [2] have emerged as the most popular and widely used variant due to their efficiency, scalability, and ease of application in domains such as social networks [3] and chemical molecules [4]. Furthermore, the GCN serves as the fundamental building block of numerous complex models [5,6,7]. Although GCN has been highly successful, the increasing need to process large, sparse, and non-linearly separable graphs, together with higher demands from various applications, calls for models that can aggregate multi-hop neighbors at a greater distance and improve representation power with a larger number of parameters. Specifically, when processing large and sparse graph-structured data, the limitations of shallow GCN models become more apparent, primarily manifested as an inability to access and aggregate information from more distant neighbors and as insufficient parameters, which restricts their feature extraction ability. Deep GCN models have potential advantages in perceiving more graph-structured information and obtaining higher expressive power, and they are an important current research topic. Many real-world applications involve large-scale graph data, such as social networks, recommendation systems [8], bioinformatics, and transportation networks. In these applications, deep GCN models can better capture the complex relationships between nodes, thereby improving the effectiveness of graph data analysis. In addition, deep GCN models can also be applied to fields that require the processing of time-series data, such as traffic prediction and financial risk prediction. Therefore, the motivation for studying deep GCN models is to improve the efficiency and accuracy of large-scale graph data analysis and to enable more extensive applications. It is worth noting that deep residual networks for image classification tasks can have up to 152 layers [9], while a Transformer model for natural language processing can have up to 1000 layers [10]. These studies provide evidence that deeper architectures can enhance the representation power of models. This motivates us to investigate deeper GCN models to address these requirements.
However, current research indicates that deep GCN models often suffer from performance degradation as depth increases [2]. In practical applications, shallow GCN models with only 2 to 4 layers typically achieve the best generalization performance. By comparing the node classification accuracy of the two most representative GCN architectures, the vanilla GCN and its variant ResGCN [11], the degradation phenomenon of deep models can be observed in detail. As shown by the solid curves in Figure 1, on different datasets, the testing set accuracy declines continuously from the two-layer model as the depth of the GCN models increases, while ResGCN maintains a stable and much slower decline. The results indicate that, although ResGCN exhibits better generalization performance than the vanilla GCN, it cannot consistently improve generalization performance as the model depth increases. While recent studies have proposed more powerful and deeper models that can overcome the degradation problem, achieving effectiveness often requires strong assumptions or tuning many parameters specifically for a given dataset [6,7]. Therefore, identifying the key factors that cause model performance degradation is crucial for guiding the development of models that can enhance generalization performance, and it remains a significant challenge.
The degradation of the model’s performance as it goes deeper is widely attributed to “over-smoothing” [12,13,14,15,16,17], which manifests in the model output as the indistinguishability of node representations. Current theoretical studies on the over-smoothing problem focus on the low-pass filter effect of the graph, which refers to the phenomenon that, after multiple graph convolution operations (i.e., propagation operations), the augmented normalized Laplacian matrix of the model converges mathematically in the spectral space [5,18,19]. Specifically, the node representations approach an invariant subspace determined by node degree information alone. This is consistent with the experimental observation that the outputs of the model become indistinguishable. Based on this understanding, these works have proposed strategies to mitigate or address the over-smoothing problem. Meanwhile, other studies attribute the degradation problem to aspects such as gradient vanishing and training difficulty [11,20,21] and have proposed corresponding optimization strategies. Despite the effectiveness of these strategies in enhancing model representation power, there is still no consensus on the understanding of the degradation problem. Furthermore, current experimental studies on the degradation problem of deep GCN models tend to focus on shallow models with fewer than 10 layers, which is insufficient for studying the problem comprehensively. As a result, in the absence of consensus, we conducted both theoretical and experimental investigations into the degradation of model performance with an increasing number of layers.
In this paper, we conduct research from the following aspects: 1. The GCN architecture includes not only the graph convolution aggregation operation (i.e., the propagation operation) governed by the augmented normalized Laplacian matrix, but also the transformation operation governed by the weight matrix. However, the weight matrix is often ignored in investigations of the over-smoothing problem. We conduct a theoretical analysis of both operations of the GCN from a global perspective. 2. As there is still no consensus on the key factors leading to model performance degradation, we continue to investigate this problem by conducting reasonably designed comparative experiments between deep GCN models. Motivated by the aforementioned considerations, the contributions of this paper are as follows.
Firstly, we integrate the aforementioned studies on graph signals and extend the findings on the over-smoothing problem under reasonable assumptions. Rather than adopting the customary decoupled perspective of theoretical analysis, we analyze the changing trend of graph signals in the spectral space from a unified, global perspective that incorporates the transformation operations. Our analysis shows that the GCN architecture naturally avoids the over-smoothing problem and does not undergo the process of converging to the invariant subspace. Additionally, we show that the random noise in the graph signals has a decreasing impact on the model after passing through the low-pass filter as the number of layers increases.
Secondly, in addition to using conventional experimental methods such as comparing node classification accuracy, we propose an experimental analysis strategy that analyzes the singular value distribution of the model weight matrices under the guidance of random matrix theory (RMT). The experimental results explicitly show how multiple transformation operations lead deep models to gradually capture less information. This enables us to better understand how the representational power of the model degrades as the number of layers increases. The results indicate that the transformation component in the GCN architecture is a key factor leading to the degradation in model performance. This lays the foundation for our subsequent research.
Overall, our analytical theory is well-suited to explaining the degradation of deep GCNs in multi-layer architectures. To further support our theoretical analysis, we employ experimental analysis strategies from multiple angles to confirm the theoretical conclusions in deeper models.

2. Related Work

Research suggests that deeper models with increasing complexity can improve accuracy in computer vision and natural language processing tasks [22,23,24,25]. In the graph domain, the Weisfeiler–Lehman (WL) graph isomorphism test shows that deep models have a better capacity to distinguish subgraphs than shallow models [21]. Additionally, ref. [26] conducts an image classification experiment and finds that the model with the largest number of parameters, reaching 550 M, achieves the highest top-1 classification accuracy. In our research, we extend the question of depth to the graph domain, which involves operating on structured data represented as graphs and learning messages through the iterative structure of the graph. We encounter several difficulties in this process. Firstly, graphs carry two types of information, topological information and node information, which differ from the information handled in computer vision or natural language processing. Secondly, the performance degradation phenomenon exhibited by deep graph neural networks is unique: there is no consensus on the attribution of the degradation phenomenon in deep models, and it remains an area for exploration. We provide specific examples below. Finally, many studies have addressed deep problems in frameworks that correspond to shallow models [27,28]; however, analyzing the degradation of model performance in shallow models cannot solve the problems of deep models.
It is known that a single-layer GCN can be decomposed into three operations (components): a propagation operation (i.e., graph convolution aggregation) with the augmented normalized Laplacian matrix, a transformation operation with the weight matrix, and a non-linear operation with the ReLU activation function. We extend this setting to a multi-layer GCN for a semi-supervised node classification task [2], which requires the model to learn a hypothesis that extracts node features and topology information from the graph and predicts the labels of nodes.
Firstly, the lack of consensus in research on the degradation phenomenon is mainly manifested in the controversial and even contradictory attribution of its cause. For example, ref. [13] proposed the GPR-GNN model with Generalized PageRank techniques to trade off node features and topological features of the graph, thus preventing the over-smoothing of node representations. Ref. [29] shows that anti-over-smoothing processes can occur in GCN models during transformation operations, and suggests that overfitting is a major contributor to model deterioration. Ref. [30] further refutes explanations such as overfitting and gradient vanishing, and asserts that the cause is the weight matrix multiplication. Secondly, the optimization strategies are diverse and their effectiveness is limited. If the problem is attributed to a specific type of propagation, transformation, or non-linear operation, it is reasonable to propose a decoupled structure to solve it, such as increasing propagation operations and reducing transformation operations. However, such analyses remain insufficiently thorough and are governed by the setting of various hyper-parameters (i.e., the learning rate, the number of training iterations, etc.), the distribution of each layer’s output, the gradient distribution, and other settings. For instance, ref. [19] shows that the representation capacity of the model cannot be improved by increasing the number of layers and nonlinear operations. Ref. [5] proposed the SGC model, which consists of a fixed low-pass filter followed by a linear classifier, to remove the excessive complexity caused by the nonlinear activation function and excessive weight matrix multiplication; its experimental evaluation shows that this variant does not hurt classification accuracy. In fact, these research ideas run counter to the fundamental need for continuously deepening neural networks. The architectures impose strict requirements on the datasets, which must be linearly separable, and also impose strict restrictions through the assumptions on the model itself; real sparse big data generally does not meet these conditions. As the above summary makes clear, research on the phenomenon that a model’s performance degrades as its depth increases is still very controversial. A consensus on the underlying factors of this issue would be helpful. We need newer perspectives, more detailed experiments, and a more systematic inductive inspection of this problem.

3. Preliminary

3.1. Notation

We consider an undirected graph $G = (V, E, X)$, where $V$ is the node set with $|V| = N$. Each node $v_i \in V$ has a $d$-dimensional feature vector $x_i \in \mathbb{R}^d$, and $X = \{x_1, x_2, \ldots, x_N\}$. $E$ is the edge set with $|E| = M$ and $(v_i, v_j) \in E$. Let $A \in \mathbb{R}^{N \times N}$ denote the adjacency matrix of $G$, where $A_{ij} = 1$ if $(v_i, v_j) \in E$ and $A_{ij} = 0$ otherwise, and let $D \in \mathbb{R}^{N \times N}$ denote the diagonal degree matrix of $G$, where $D_{ii} = \sum_{j \in V} A_{ij}$.

3.2. Single-Layer Graph Convolutional Network

We first present a number of settings and inherent properties of the single-layer GCN architecture, and then extend them to the multi-layer GCN architecture. As originally developed in [2,31], the vanilla GCN is a first-order Chebyshev polynomial approximation of the spectral GCN. The Laplacian matrix of graph $G$ is defined as $L = D - A \in \mathbb{R}^{N \times N}$, and the normalized Laplacian matrix is $L_{\mathrm{sym}} := I_N - D^{-1/2} A D^{-1/2}$. We set the augmented propagation matrix as $P = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ (the renormalized form of $I_N + D^{-1/2} A D^{-1/2}$), where the adjacency matrix augmented with self-loops is $\tilde{A} = A + I_N$ and $\tilde{D} = D + I_N$. We define the augmented normalized Laplacian as $\tilde{L}_{\mathrm{sym}} = I_N - \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$; thus,
$$P = I_N - \tilde{L}_{\mathrm{sym}}. \qquad (1)$$
The single-layer GCN architecture is formulated as follows:
$$f(X) = \sigma(P X W), \qquad (2)$$
where $\sigma(\cdot)$ is the ReLU activation function, i.e., $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$, and $W$ is the weight matrix of the transformation operation, which can be thought of as the learnable parameters of a standard multi-layer perceptron (MLP).
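To make the three components concrete, the following is a minimal NumPy sketch of Equations (1) and (2). The toy graph, dimensions, and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def augmented_propagation(A):
    """P = D̃^{-1/2} Ã D̃^{-1/2}, i.e., Equation (1) with self-loops added."""
    N = A.shape[0]
    A_tilde = A + np.eye(N)                    # Ã = A + I_N
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_layer(P, X, W):
    """Single-layer GCN f(X) = ReLU(P X W), Equation (2)."""
    return np.maximum(P @ X @ W, 0.0)          # propagation, transformation, non-linearity

# toy example: a 4-node path graph, 3-dimensional features, 2 hidden units
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.randn(4, 3)
W = np.random.randn(3, 2)
H = gcn_layer(augmented_propagation(A), X, W)
print(H.shape)  # (4, 2)
```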

3.3. Multi-Layer Graph Convolutional Networks

Next, we examine an architecture with multiple GCN hidden layers. Let $[K] := \{1, 2, \ldots, K\}$. The multi-layer GCN architecture is defined as
$$H = \mathrm{softmax}\left(\sigma\left(P \cdots \sigma\left(P\,\sigma\left(P X W^{(1)}\right) W^{(2)}\right) \cdots W^{(K)}\right)\right), \qquad (3)$$
where $W^{(k)} \in \mathbb{R}^{d_k \times d_{k-1}}$ is the $k$-th layer weight matrix and $k \in [K]$. Figure 2 is a schematic diagram of the multi-layer graph convolutional network architecture.
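A compact PyTorch sketch of Equation (3) is given below; it assumes a dense propagation matrix $P$ and hypothetical layer dimensions, whereas a practical implementation would use sparse operations (e.g., the GCNConv layer in PyTorch Geometric).

```python
import torch
import torch.nn as nn

class MultiLayerGCN(nn.Module):
    """K-layer GCN of Equation (3) with a dense propagation matrix P."""

    def __init__(self, dims):
        # dims = [d_0, d_1, ..., d_K]; layer k maps d_{k-1} to d_k features
        super().__init__()
        self.weights = nn.ModuleList(
            [nn.Linear(dims[k - 1], dims[k], bias=False) for k in range(1, len(dims))]
        )

    def forward(self, P, X):
        H = X
        for lin in self.weights:
            H = torch.relu(P @ lin(H))       # σ(P H W^(k)), as in Equation (3)
        return torch.softmax(H, dim=-1)      # row-wise softmax over class scores

# usage: model = MultiLayerGCN([1433, 64, 64, 7]); out = model(P, X)
# (many implementations omit the ReLU on the final layer before the softmax)
```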
We make the following assumptions about the properties of graph $G$.
Assumption 1.
Graph $G$ is connected; that is, $G$ has one connected component. $U$ is the space of the graph Fourier transform with non-negative eigenvectors, which has orthonormal bases $(e_m)_{m \in [M]}$.
Assumption 2.
The observed graph signals $(x_i)_{i \in V}$ are composed of true signals and noise signals. The random noise follows a Gaussian distribution $\mathcal{N}(0, \sigma^2)$.
Intuitively, the stability of the multi-layer GCN architecture depends on the largest absolute eigenvalue of its propagation matrix. The following theorem holds under Assumption 1.
Theorem 1
(Laplacian Spectral Property [5,19]). Let $\mu_t \le \mu_{t-1} \le \cdots \le \mu_1$ be the eigenvalues of $P$, sorted in ascending order. Suppose the multiplicity of the largest eigenvalue $\mu_1$ is $M$; then we have $-1 < \mu_t$, $\mu_{M+1} < 1$, and $\mu_M = \cdots = \mu_1 = 1$. Further, $e_m := \tilde{D}^{1/2} u_m$ is the eigenvector associated with eigenvalue 1. Under Assumption 1, we have $M = 1$; $-1 < \mu_t$, $\mu_2 < 1$; $\mu_1 = 1$; and $e_1 = \tilde{D}^{1/2} u_1$.
Under Theorem 1, the graph Laplacian generates eigenpairs (i.e., eigenvalue and eigenvector pairs) through the generalized eigenproblem $\tilde{L} u = \lambda \tilde{D} u$, and the corresponding eigenpair of the augmented normalized Laplacian $\tilde{L}_{\mathrm{sym}}$ is $(\lambda, \tilde{D}^{1/2} u)$. Further, $\tilde{L}_{\mathrm{sym}}$ has the eigenvalue 0 with eigenvector $\tilde{D}^{1/2} \mathbf{1}$. Ref. [18] shows that the convolution operation performed by iteratively multiplying the augmented propagation matrix corresponds to a low-pass filter. Combining Equation (1), we see that the graph filter $\mu = h(\lambda) = 1 - \lambda$ is the first-order Taylor approximation of a Laplacian regularized least squares [32]. Hence, the graph filter after $K$ propagation operations is $\mu = h(\lambda) = (1 - \lambda)^K$.
Let $\Lambda = \mathrm{diag}(\mu_1, \mu_2, \ldots, \mu_t)$. According to spectral graph theory [29], the Laplace spectral decomposition of the propagation matrix can be written as
$$P = U \Lambda U^{T} = \sum_{i=1}^{t} \mu_i u_i u_i^{T}, \qquad x = \sum_{i=1}^{t} \alpha_i u_i, \qquad (4)$$
so the output after multiple propagation operations is
$$y = P^{K} x = \left(\sum_{i=1}^{t} \mu_i u_i u_i^{T}\right)^{K} \sum_{i=1}^{t} \alpha_i u_i = \sum_{i=1}^{t} \alpha_i \mu_i^{K} u_i, \qquad (5)$$
$$\lim_{K \to +\infty} P^{K} x = \alpha_1 \mu_1^{K} u_1 \propto \tilde{D}^{1/2} \mathbf{1}. \qquad (6)$$
According to the foregoing deduction, multiple propagation operations cause the input graph-structured data to converge to an invariant subspace. This makes node representations more similar and harder to distinguish, which is the over-smoothing issue. However, does over-smoothing really lead to a gradual degradation in model performance as the number of layers increases? We provide our own views through theoretical and empirical investigation.
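The convergence described by Equation (6) can be checked numerically on a small connected graph. The sketch below, an illustrative assumption rather than the paper's code, measures the cosine similarity between $P^K x$ and the dominant eigenvector direction $\tilde{D}^{1/2}\mathbf{1}$.

```python
import numpy as np

def augmented_propagation(A):
    """P = D̃^{-1/2} Ã D̃^{-1/2} together with the augmented degrees D̃_ii."""
    A_tilde = A + np.eye(A.shape[0])
    d_tilde = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt, d_tilde

rng = np.random.default_rng(0)
N = 20
A = np.zeros((N, N))
for i in range(N):                                   # a cycle keeps the graph connected
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0
for _ in range(15):                                  # a few random chords
    i, j = rng.integers(0, N, size=2)
    if i != j:
        A[i, j] = A[j, i] = 1.0

P, d_tilde = augmented_propagation(A)
x = rng.standard_normal(N)                           # a random graph signal
e1 = np.sqrt(d_tilde)                                # dominant eigenvector, proportional to D̃^{1/2} 1

for K in (1, 2, 4, 8, 16, 32, 64):
    y = np.linalg.matrix_power(P, K) @ x
    cos = abs(y @ e1) / (np.linalg.norm(y) * np.linalg.norm(e1))
    print(f"K={K:3d}  cosine with the invariant direction = {cos:.4f}")  # approaches 1
```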

3.4. Supplementary Baseline Architecture

ResGCN is a variant of GCN [11] that adds residual connections to address the gradient vanishing problem. It guarantees that training can still be completed when the depth of the model exceeds 32 layers, although this alone says nothing about the model’s performance. It is defined as
$$H^{(k+1)} = \sigma\left(P H^{(k)} W^{(k)}\right) + H^{(k)}. \qquad (7)$$
The ResGCN architecture is specifically described as combining the output of the current layer with the output of the previous layer to create the input of the next layer.
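A one-block PyTorch sketch of Equation (7) follows; the class name and fixed hidden dimension are assumptions made so that the residual addition is well defined, and the official DeepGCN implementation differs in engineering details.

```python
import torch
import torch.nn as nn

class ResGCNLayer(nn.Module):
    """One residual block of Equation (7): H^(k+1) = σ(P H^(k) W^(k)) + H^(k)."""

    def __init__(self, dim):
        super().__init__()
        # the residual addition requires the same dimension at every layer
        self.lin = nn.Linear(dim, dim, bias=False)

    def forward(self, P, H):
        return torch.relu(P @ self.lin(H)) + H   # GCN update plus identity shortcut

# usage: blocks = nn.ModuleList([ResGCNLayer(64) for _ in range(32)])
```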

4. Analysis

4.1. Empirical Analysis of Over-Smoothing

In the previous section, we provided theoretical evidence that multiple convolution operations can cause over-smoothing. However, it is currently unclear whether the performance on real graph data is consistent with the theoretical analysis. To address this question, we compare and analyze the relative norms of the output of each layer in shallow and deep GCN models, as well as ResGCN models, to explicitly observe smoothness on the Cora dataset.
In the following, we explicitly show the indistinguishability of the model output (i.e., over-smoothing) and indicate problems. Specifically, the empirical results show that the outputs of the model at each layer have a low-rank structure, indicating that deep GCN models exhibit over-smoothing in practice. However, by observing the effectiveness of deep ResGCN in addressing the over-smoothing problem, we can conclude that convergence is not only related to propagation operations, but also to transformation operations. In previous analyses, transformation operations were often not considered.
We evaluate the model’s degree of convergence during the forward pass using
$$\mathrm{con}(X) = X - \mathbf{1}x^{T}, \quad \text{where } x = \operatorname*{arg\,min}_{x} \left\|X - \mathbf{1}x^{T}\right\|, \qquad (8)$$
and $\mathbf{1}x^{T}$ is a rank-1 matrix [33]. The closer the output of each layer is to this rank-1 matrix, the greater the degree of convergence measured by $\mathrm{con}(X)$. As shown in Figure 3 and Figure 4, the curves depict the variation of the relative norm, reflecting the degree of convergence of the output of each layer in different models (shallow and deep), with the horizontal axis representing the layer index of the models. The vertical axis represents $\|\mathrm{con}(X^{(k)})\|_{1,\infty} / \|X^{(k)}\|_{1,\infty}$, where $\|X\|_{1,\infty} = \sqrt{\|X\|_{1}\,\|X\|_{\infty}}$.
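A sketch of how this per-layer convergence measure can be computed is given below. For simplicity, the minimizing vector $x$ is taken as the column-wise mean, which is exact under the Frobenius norm; ref. [33] minimizes the composite $(1,\infty)$ norm instead, so the resulting values are an approximation.

```python
import numpy as np

def norm_1_inf(X):
    """Composite norm ||X||_{1,∞} = sqrt(||X||_1 ||X||_∞) used for the vertical axis."""
    return np.sqrt(np.abs(X).sum(axis=0).max() * np.abs(X).sum(axis=1).max())

def relative_convergence(X):
    """Relative norm of con(X) = X - 1 x^T, the residual of the best rank-1 fit.

    x is taken as the column-wise mean, the exact minimizer under the Frobenius
    norm; ref. [33] minimizes the composite (1,∞) norm instead.
    """
    x = X.mean(axis=0, keepdims=True)                # x ≈ argmin_x ||X - 1 x^T||
    con = X - np.ones((X.shape[0], 1)) @ x           # con(X) = X - 1 x^T
    return norm_1_inf(con) / norm_1_inf(X)

# usage: record relative_convergence(H_k) for every hidden representation H_k collected
# during a forward pass; values near 0 indicate a nearly rank-1 (over-smoothed) output.
```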
Firstly, Figure 3a shows that the shallow GCN model behaves normally, with its curve staying at a high level, and it achieves good node classification results. Comparing Figure 3a and Figure 4a, the curve changes from a stable high level to a stable low level, indicating that the deep GCN model does suffer from over-smoothing. This seems to confirm the conclusions of the theoretical derivation in Section 3. Figure 4b demonstrates the effectiveness of the deep ResGCN model, which uses residual connections: relative to the deep GCN, its relative norm curve rises from a low level back to a stable high level, indicating its effectiveness in addressing the over-smoothing problem during the forward pass. As a supplement, Figure 3a,b primarily show that the performance of the shallow ResGCN model, which uses residual connections, is more stable. Based on the experimental observations above, one natural idea is to input the original graph information into each layer of the model instead of only the first layer, which is considered a practical and traditional solution for addressing the convergence problem [34,35]. However, as shown in Equation (7), the ResGCN architecture combines the output of the current layer with the output of the previous layer to create the input for the following layer. This approach actually differs from the traditional view and, by acting on the weight matrix of each layer, has the effect of reducing convergence, even though it was originally proposed as a strategy for alleviating gradient vanishing.
This makes clear that merely considering the property of the propagation operation converging to the invariant subspace is not enough to address the over-smoothing problem. A holistic perspective that takes into account both the propagation and transformation operations, with particular attention to the weight matrices, is required.

4.2. Graph Noise Signal Analysis

As the propagation operation acts as a low-pass filter obtained through Graph Fourier Transform, it is necessary to investigate the impact of random noise signals on the model before proceeding with further derivations to ensure that the subsequent results are not affected by noise. The conclusion is presented in Theorem 2.
Theorem 2
(Informal Noise Signal Bound). Let $\delta \in (0, 1)$, and define $Q$ as the random noise of the signal. Then, with probability at least $1 - \delta$, as the model depth $K$ increases we have
$$\left\|P^{K} Q\right\| \le O\!\left(1/\deg^{K/2}\right)\left(2\sqrt{\log(1/\delta)} + 1\right)\mathbb{E}\!\left[\|Q\|^{2}\right]. \qquad (9)$$
It is known that the observed signal is composed of true signal features and noise signals. Theorem 2 holds under Assumption 2 and guarantees that the observed signals are probably approximately correct (PAC) [36,37] estimates of the true features, excluding the noise signals.
Figure 3. Reflection of how each layer of the shallow GCN (8-layer) and ResGCN models’ output converges on the Cora dataset. The vertical axis represents the relative norm of the degree of convergence. (a) The curve of the shallow GCN remains stable at high points. (b) The curve of the shallow ResGCN remains stable at high points for comparison.
Figure 4. Reflection of how each layer of the deep GCN and ResGCN models’ output converges on the Cora dataset. The vertical axis represents the relative norm of the degree of convergence. (a) The curve of the deep GCN remains stable at low points. (b) The curve of the deep ResGCN remains stable at high points for comparison.
Armed with Lemma 5 in [18], the argument mainly demonstrates that, with probability at least $1 - \delta$ over the choice of $\lambda$, the filtered noise is small enough, via an exponential inequality for chi-square distributions [38]. Thus, we theoretically guarantee the probability of the distribution of signals that contain graph topology information. This provides the premise for the subsequent perceptron analysis and constitutes a rigorous analysis process.
The inspiration for the derivation comes from [18], but it should be noted that we only exclude the interference of random noise because we argue that there is node feature information in deterministic noise. Specifically, we argue that the node expression after the filter more closely represents the topology features of the graph. If the distance between the true features and the topology features is directly regarded as noise, then the information in the noise will also contain node features, resulting in an excessive amount of noise being observed. In reality, however, the random noise continuously decreases as the model depth increases (by Theorem 2). As a result, the model will incorrectly believe that it has been overfit to the noise, and will attempt to avoid overfitting in the following operations.
Proof of Theorem 2.
We assume the noise signals to be i.i.d. Gaussian variables with zero mean and the same diagonal variance $\sigma^2$, i.e., $Q \sim \mathcal{N}(0, \sigma^2)$. Then, the filtered noise satisfies
$$\left\|P^{k} Q\right\|^{2} = \sum_{\lambda} (1 - \lambda)^{2k} \left\|q(\lambda)\right\|_{2}^{2}. \qquad (10)$$
According to Lemma 1 in [38], we adopt the exponential inequality for chi-square distributions for any positive $c$. Through the logarithm of the Laplace transform of $\|q(\lambda)\|_{2}^{2}/\sigma^{2} - 1$, we obtain
$$\mathbb{P}\left[\sum_{\lambda} (1 - \lambda)^{2k}\left(\|q(\lambda)\|_{2}^{2}/\sigma^{2} - 1\right) \ge 2\sqrt{c \sum_{\lambda} (1 - \lambda)^{4k}} + 2c\right] \le e^{-c}. \qquad (11)$$
Then, for any $\delta > 0$, by substituting $c = \log(1/\delta)$, we have
$$\mathbb{P}\left[\sum_{\lambda} (1 - \lambda)^{2k}\|q(\lambda)\|_{2}^{2} \le \left(\sum_{\lambda} (1 - \lambda)^{2k}\left(2\sqrt{\log(1/\delta)} + 1\right) + 2\log(1/\delta)\right) d\sigma^{2}\right] \ge 1 - \delta. \qquad (12)$$
To be concrete, we note that $\mathbb{E}[\|Q\|_{2}^{2}] = \sigma^{2} d n$. The well-known Dirichlet energy [39] property for the normalized Laplacian $L_{\mathrm{sym}}$ is calculated through
$$\frac{1}{2}\sum_{i,j}^{N} A_{ij}\left\|\frac{x_i}{\sqrt{d_i + 1}} - \frac{x_j}{\sqrt{d_j + 1}}\right\|^{2} = \frac{1}{2}\,\mathrm{tr}\!\left(x^{T} L_{\mathrm{sym}} x\right). \qquad (13)$$
We have $P[i, j] = \frac{1}{\sqrt{\widetilde{\deg}(i)\,\widetilde{\deg}(j)}}$, and $\sum_{\lambda} (1 - \lambda)^{2k} = \mathrm{tr}(P^{2k})$ decreases in $O(1/\deg^{k})$, so
$$\left\|P^{k} Q\right\| \le O\!\left(1/\deg^{k/2}\right)\left(2\sqrt{\log(1/\delta)} + 1\right)\mathbb{E}\!\left[\|Q\|^{2}\right]. \qquad (14)$$
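The qualitative prediction of Theorem 2, namely that filtered random noise shrinks as the depth grows, can be checked with a short simulation. The random graph and dimensions below are illustrative assumptions, not part of the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 200, 16

# a sparse Erdős–Rényi graph; the self-loops added below keep isolated nodes well defined
A = (rng.random((N, N)) < 0.05).astype(float)
A = np.triu(A, 1)
A = A + A.T

A_tilde = A + np.eye(N)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
P = D_inv_sqrt @ A_tilde @ D_inv_sqrt

Q = rng.normal(0.0, 1.0, size=(N, d))          # i.i.d. Gaussian noise signals

for k in (1, 2, 4, 8, 16):
    filtered = np.linalg.matrix_power(P, k) @ Q
    print(f"k={k:2d}  ||P^k Q||_F = {np.linalg.norm(filtered):.3f}")  # decreases as k grows
```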

4.3. Spectrum Analysis and Rethinking Over-Smoothing

In the previous subsections, we suggested analyzing the GCN architecture from a holistic perspective. In this subsection, we rethink the over-smoothing problem by examining the trend of the distance between the node embedding and the invariant subspace after propagation and transformation operations. If the distance does not decrease, it means that the over-smoothing caused by propagation operations has not actually occurred. To address this issue, we introduce Theorem 3.
Recall that $\mu = \sup_{i=1,\ldots,t} |\mu_i|$ is the largest absolute eigenvalue of $P$, and $s = \sup_{j=1,\ldots,K} |s_j|$ is the maximum singular value of the weight matrices. The distance discussed in Section 3 differs from the distance used here in that it only takes the propagation operation into account and ignores the transformation operation.
Theorem 3.
The distance between the node representations of the $K$-th layer, $H^{(K)}$, and the invariant subspace $P^{k}X$ exhibits an exponential variation that depends on $\left((s\mu)^{k} - 1\right)$ as the number of layers increases; that is, $\left\|H^{(K)} - P^{k}X\right\|_F - \left\|X - P^{k}X\right\|_F \le \left((s\mu)^{k} - 1\right)\|X\|_F$. Here, $\|X - P^{k}X\|_F$ denotes the distance between the original inputs of the model and the invariant subspace.
Proof of Theorem 3.
We denote the distance between $PX$ and $P^{k}X$ by $\|PX - P^{k}X\|_F$, and the distance between $H^{(K)}$ and $P^{k}X$ by $\|H^{(K)} - P^{k}X\|_F$. By Lemma 1 of [19], we have
$$\left\|PX - P^{k}X\right\|_F \le \mu \left\|X - P^{k}X\right\|_F, \qquad (15)$$
where $\|\cdot\|_F$ denotes the Frobenius norm and $PX$ refers to the observed node features after one propagation operation.
Note that $\sigma$ is a ReLU activation function and acts as a contraction mapping. We have the output of the $K$-th layer $H^{(K)}$, which satisfies
$$\left\|H^{(K)}\right\|_F = \left\|\sigma\!\left(\cdots\sigma\!\left(P\,\sigma(PXW^{(1)})W^{(2)}\right)\cdots W^{(k)}\right)\right\|_F \le \left\|\sigma\!\left(\cdots\sigma\!\left(\sigma(P^{k}XW^{(1)})W^{(2)}\right)\cdots W^{(k)}\right)\right\|_F. \qquad (16)$$
Combining Equation (16), we have
$$\begin{aligned}
\left\|H^{(K)} - P^{k}X\right\|_F &= \left\|\sigma\!\left(\cdots\sigma\!\left(P\,\sigma(PXW^{(1)})W^{(2)}\right)\cdots W^{(k)}\right) - P^{k}X\right\|_F \\
&\le \left\|\sigma\!\left(\cdots\sigma\!\left(\sigma(P^{k}XW^{(1)})W^{(2)}\right)\cdots W^{(k)}\right) - P^{k}X\right\|_F \\
&\le s_k \left\|\sigma\!\left(\cdots\sigma\!\left(\sigma(P^{k}XW^{(1)})W^{(2)}\right)\cdots W^{(k-1)}\right) - P^{k}X\right\|_F \\
&\le \cdots \le \prod_{i=1}^{K} s_i \left\|PX - P^{k}X\right\|_F \le s^{k}\left\|PX - P^{k}X\right\|_F. \qquad (17)
\end{aligned}$$
Combining Equations (15)–(17), we have
$$\left\|H^{(K)} - P^{k}X\right\|_F - \left\|X - P^{k}X\right\|_F \le s^{k}\left\|PX - P^{k}X\right\|_F - \left\|X - P^{k}X\right\|_F \le s^{k}\mu^{k}\left\|X - P^{k}X\right\|_F - \left\|X - P^{k}X\right\|_F \le \left((s\mu)^{k} - 1\right)\|X\|_F. \qquad (18)$$
We know that the maximum frequency of the propagation matrix $\mu$ is always below and close to 1 [21], and, according to Gordon's theorem for Gaussian matrices in [40], the maximum singular value of the weight matrix $s$ is usually greater than 1. Therefore, in general, $s\mu > 1$. Some studies impose a strong assumption on the model to make $s\mu < 1$ [19].
From Equation (18), we know that if $s\mu < 1$, the distance between the model’s output $H^{(K)}$ and the invariant subspace $P^{k}X$ is less than the distance between the original input $X$ and the invariant subspace $P^{k}X$, and their relative size decreases exponentially with respect to the number of layers. If $s\mu > 1$, the upper bound of the relative distance increases exponentially with respect to the number of layers. As the number of layers increases, the constraints are continuously relaxed, which contradicts the absolute convergence derived in Section 3.3.
Theorem 3 indicates that GCNs working with real-world data do not suffer from the over-smoothing problem brought on by multiple propagation operations, as this problem can be resolved by combining multiple transformation operations, while the degradation of model performance is most likely due to the multiple transformation operations. In fact, in the following section we investigate and determine the statistical distribution of $s$ using random matrix theory, and then empirically analyze the transformation operation.
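Whether $s\mu$ exceeds 1 for a trained model can be inspected directly. The sketch below assumes a trained PyTorch model whose learnable weights are 2-D tensors (for example, the MultiLayerGCN sketch from Section 3.3) and a dense propagation matrix $P$; the function names are hypothetical.

```python
import numpy as np
import torch

def max_weight_singular_value(model):
    """s = sup_j |s_j|: the largest singular value over all 2-D weight matrices."""
    s = 0.0
    for param in model.parameters():
        if param.dim() == 2:
            s = max(s, torch.linalg.svdvals(param.detach()).max().item())
    return s

def max_abs_eigenvalue(P):
    """μ = sup_i |μ_i| for the symmetric propagation matrix P (equal to 1 in theory)."""
    return float(np.abs(np.linalg.eigvalsh(P)).max())

# usage (hypothetical names):
# model = MultiLayerGCN([1433, 64, 64, 7]); ...training...
# s, mu = max_weight_singular_value(model), max_abs_eigenvalue(P)
# print(s * mu)  # typically > 1, so the bound in Equation (18) grows with depth
```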

5. Experiments

In this section, we investigate the relationship between the problem of model performance degradation and multiple transformation operations. The problem of multiple transformation operations arises from the multiplication of the weight matrices of each GCN layer, which poses a challenge to the GCN architecture and is strongly related to the problem of model performance degradation.

5.1. Experiment Setup

We conducted experiments on three widely used real-world benchmark citation network datasets [41], namely Cora, Citeseer, and PubMed, to verify the findings and ideas proposed in our research. Table 1 provides a statistical summary of the datasets.
We conduct semi-supervised node classification experiments on different datasets and analyze the classification accuracy of models with different depths. The goal is to classify each document into one class. For each dataset, we randomly sample 20 instances for each class as the training set. For example, in the Cora dataset, the size of the training set is 140 (20 × classes), and the size of the testing set is 1000 instances. Each layer of the model’s hidden dimension is set to 64, the dropout rate to 0.1, and we use the Adam optimizer [42] with a learning rate of 0.01 or 0.001.
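This setup can be approximated with the PyTorch Geometric library mentioned in the Data Availability Statement. The exact training script is not published with the paper, so the snippet below is a hedged sketch that only mirrors the stated settings (public split, hidden dimension 64, dropout 0.1, Adam with learning rate 0.01, 600 epochs).

```python
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

# public Planetoid split: 20 training nodes per class and 1000 test nodes
dataset = Planetoid(root="data", name="Cora")
data = dataset[0]

class GCN(torch.nn.Module):
    def __init__(self, depth, hidden=64):
        super().__init__()
        dims = [dataset.num_features] + [hidden] * (depth - 1) + [dataset.num_classes]
        self.convs = torch.nn.ModuleList(
            [GCNConv(dims[i], dims[i + 1]) for i in range(depth)]
        )

    def forward(self, x, edge_index):
        for i, conv in enumerate(self.convs):
            x = conv(x, edge_index)
            if i < len(self.convs) - 1:
                x = F.dropout(F.relu(x), p=0.1, training=self.training)
        return x

model = GCN(depth=4)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # the paper uses 0.01 or 0.001

model.train()
for epoch in range(600):                                   # 600 iterations as in Table 2
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
```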
As the ResGCN architecture guarantees, to some extent, the gradient updates of the weight matrices, it is more conducive to analyzing the impact of transformation operations on the model. Therefore, the ResGCN model is the main experimental object. As a comparative model, we incorporate LayerNorm (LN) technology into the ResGCN model to modify the distribution of the output of each layer. LN [43] has been shown to enhance model performance by optimizing the landscape of gradient updates.
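As a sketch of the comparative model, LN can be inserted into the residual block of Equation (7) as follows. The exact placement of LN is not specified in the paper, so applying it to the transformed output before the residual addition is an assumption.

```python
import torch
import torch.nn as nn

class ResGCNLNLayer(nn.Module):
    """Residual GCN block with LayerNorm applied to the transformed output.

    The placement of LN (before or after the residual addition) is an assumption;
    the paper only states that LN is incorporated into the ResGCN model.
    """
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim, bias=False)
        self.norm = nn.LayerNorm(dim)

    def forward(self, P, H):
        return torch.relu(self.norm(P @ self.lin(H))) + H
```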

5.2. Analyzing the Singular Value Distribution of the Weight Matrices

We propose an experimental analysis strategy to understand the impact of multiple transformation operations on model performance degradation by analyzing the changes in the singular value distribution of weight matrices as the model becomes deeper.
According to random matrix theory, we analyze the singular value distribution of the weight matrices from a statistical perspective. Theorem 1 in [44] provides theoretical support for the fact that singular value distributions are heavy-tailed. The goal of optimization is to improve this heavy-tailed distribution as much as possible.
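Concretely, the strategy amounts to collecting the singular values of every layer's weight matrix after training and inspecting their histogram and condition number. The helper functions below are hypothetical names for this procedure and assume a trained PyTorch model with 2-D weight tensors.

```python
import numpy as np
import torch

def weight_singular_values(model):
    """Collect the singular values of every 2-D weight matrix of a trained model."""
    values = []
    for param in model.parameters():
        if param.dim() == 2:                     # skip biases and 1-D parameters
            values.append(torch.linalg.svdvals(param.detach()).cpu().numpy())
    return np.concatenate(values)

def condition_number(singular_values):
    """κ = σ_max / σ_min; a larger κ indicates a more ill-conditioned transformation."""
    return float(singular_values.max() / singular_values.min())

# usage: sv = weight_singular_values(trained_model)
# A histogram of sv shows whether the distribution is heavy-tailed, and
# condition_number(sv) can be tracked as the model depth increases.
```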
As shown in Figure 5 and Figure 6, we observe heavy-tailed distributions on different datasets as the model depth increases. The four subgraphs in Figure 5, for instance, represent the statistical distribution of the singular values of the weight matrices of ResGCN models with increasing depths. As can be observed, the singular value distribution gradually becomes more heavy-tailed as the model becomes deeper and multiplies more matrices. This implies that the maximum singular value tends to become much larger than the others and gradually becomes dominant, and that the condition number of the model gradually increases, where the condition number $\kappa$ is the ratio of the maximum singular value to the minimum singular value. Factors such as the singular value distribution and the condition number are indicative of a model’s complexity. A relatively flat singular value distribution and a smaller condition number are desirable, as they promote stronger representation capabilities and deeper layers.
In general, we observe heavy-tailed singular value distributions as the model becomes deeper across different datasets. This often indicates that the model output is only sensitive to a few directions of the input, which is not conducive to enhancing the model’s representation power. Moreover, the condition number tends to increase after multiple transformation operations, which is not conducive to improving performance. This actually explains the degradation problem of ResGCN.
As shown in Figure 7 and Figure 8, we have observed that, on different datasets, the heavy-tailed phenomenon of the singular value distribution in the LN-optimized ResGCN models at different depths is alleviated (i.e., the orange part), and the condition numbers decrease accordingly, indicating an improvement in the model’s representation power. As shown in Table 2, the longitudinal data comparison reveals that the LN-optimized ResGCN model exhibits the optimal performance (bold value), while the horizontal data comparison reveals that the degradation phenomenon of the LN-optimized ResGCN model, although not completely solved, is alleviated as the depth increases. Figure 9 provides a more detailed view of the improved representation power exhibited by the ResGCN model with LN optimization, particularly on larger datasets (i.e., the right subfigure, PubMed dataset), compared to the ResGCN model without LN optimization.
In general, the ResGCN model with LayerNorm (LN) optimization improves the learning process of weight matrices, reduces the model’s condition number, further enhances the model performance and alleviates the degradation problem. This demonstrates that the transformation operation is indeed a key factor for the degradation problem. We propose to shift attention from the propagation operation in the GCN architecture to the transformation operation when studying the degradation problem.

6. Conclusions

This study mainly analyzes the issue that the generalization performance of Deep GCN models continues to decrease as the number of layers increases.
Firstly, we integrate the existing studies on graph signals under appropriate assumptions. This is manifested as the smoothing behavior of the input data converging to the invariant subspace after multiple propagation operations. These results serve as the theoretical basis for the later study of the degradation issue.
Afterwards, by comparing the node classification accuracy of GCN and ResGCN models with different numbers of layers, as well as the relative norms reflecting the convergence of the outputs at each layer, we observe that the ResGCN model with residual connections does not suffer from the problem of convergence to the invariant subspace (i.e., over-smoothing). Based on the above research, considering the transformation and propagation operations of the GCN architecture from a global perspective in the spectral space, we conclude that the GCN does not suffer from over-smoothing due to multiple propagation operations, as this can be resolved through multiple transformation operations. The degradation of the model is more likely to be caused by the transformation operation.
Finally, the ResGCN model only alleviates the degradation phenomenon and fails to realize the ideal situation in which model performance improves with increasing depth. We proposed an experimental analysis strategy to investigate the change in the distribution of the singular values of the weight matrices of deep GCN models with increasing depth, with the goal of directly observing the impact of multiple transformation operations on model performance. The experimental results confirmed our theoretical conjecture that multiple transformation operations are indeed a key factor leading to the degradation of model performance. Therefore, we suggest shifting attention from the propagation operation to the transformation operation in the GCN architecture when studying the degradation problem.

Author Contributions

Writing—original draft, S.Z.; Writing—review & editing, F.L., T.Z. and G.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (62272093, 62137001).

Data Availability Statement

The research data presented in this paper are available in the PyTorch Geometric (PyG) library, which is built upon PyTorch.

Acknowledgments

The authors are grateful to Zhilin Yang, William W. Cohen and Ruslan Salakhutdinov for providing the datasets used in the experiments.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhang, Z.; Cui, P.; Zhu, W. Deep learning on graphs: A survey. IEEE Trans. Knowl. Data Eng. 2020, 34, 249–270. [Google Scholar] [CrossRef]
  2. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2017, arXiv:1609.02907. [Google Scholar]
  3. Deng, S.; Rangwala, H.; Ning, Y. Learning dynamic context graphs for predicting social events. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 1007–1016. [Google Scholar]
  4. Do, K.; Tran, T.; Venkatesh, S. Graph transformation policy network for chemical reaction prediction. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 750–760. [Google Scholar]
  5. Wu, F.; Souza, A.; Zhang, T.; Fifty, C.; Yu, T.; Weinberger, K. Simplifying graph convolutional networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6861–6871. [Google Scholar]
  6. Klicpera, J.; Bojchevski, A.; Günnemann, S. Predict then propagate: Graph neural networks meet personalized pagerank. arXiv 2018, arXiv:1810.05997. [Google Scholar]
  7. Chen, M.; Wei, Z.; Huang, Z.; Ding, B.; Li, Y. Simple and deep graph convolutional networks. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 1725–1735. [Google Scholar]
  8. Jiang, Y.; Cheng, Y.; Zhao, H.; Zhang, W.; Miao, X.; He, Y.; Wang, L.; Yang, Z.; Cui, B. Zoomer: Boosting retrieval on web-scale graphs by regions of interest. In Proceedings of the 2022 IEEE 38th International Conference on Data Engineering (ICDE), Kuala Lumpur, Malaysia, 9–12 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 2224–2236. [Google Scholar]
  9. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 July–1 June 2016; pp. 770–778. [Google Scholar]
  10. Wang, H.; Ma, S.; Dong, L.; Huang, S.; Zhang, D.; Wei, F. Deepnet: Scaling transformers to 1000 layers. arXiv 2022, arXiv:2203.00555. [Google Scholar]
  11. Li, G.; Muller, M.; Thabet, A.; Ghanem, B. Deepgcns: Can gcns go as deep as cnns? In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9267–9276. [Google Scholar]
  12. Chen, D.; Lin, Y.; Li, W.; Li, P.; Zhou, J.; Sun, X. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 34, pp. 3438–3445. [Google Scholar]
  13. Chien, E.; Peng, J.; Li, P.; Milenkovic, O. Adaptive Universal Generalized PageRank Graph Neural Network. In Proceedings of the International Conference on Learning Representations, Virtual Event, Austria, 3–7 May 2021. [Google Scholar]
  14. Feng, W.; Zhang, J.; Dong, Y.; Han, Y.; Luan, H.; Xu, Q.; Yang, Q.; Kharlamov, E.; Tang, J. Graph random neural networks for semi-supervised learning on graphs. Adv. Neural Inf. Process. Syst. 2020, 33, 22092–22103. [Google Scholar]
  15. Godwin, J.; Schaarschmidt, M.; Gaunt, A.L.; Sanchez-Gonzalez, A.; Rubanova, Y.; Veličković, P.; Kirkpatrick, J.; Battaglia, P. Simple gnn regularisation for 3d molecular property prediction and beyond. In Proceedings of the International Conference on Learning Representations, Virtual Event, Austria, 3–7 May 2021. [Google Scholar]
  16. Rong, Y.; Huang, W.; Xu, T.; Huang, J. DropEdge: Towards Deep Graph Convolutional Networks on Node Classification. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  17. Zhao, L.; Akoglu, L. PairNorm: Tackling Oversmoothing in GNNs. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  18. Nt, H.; Maehara, T. Revisiting graph neural networks: All we have is low-pass filters. arXiv 2019, arXiv:1905.09550. [Google Scholar]
  19. Oono, K.; Suzuki, T. Graph Neural Networks Exponentially Lose Expressive Power for Node Classification. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  20. Zhou, K.; Dong, Y.; Lee, W.S.; Hooi, B.; Feng, J. Effective Training Strategies for Deep Graph Neural Networks. arXiv 2020, arXiv:2006.07107. [Google Scholar]
  21. Cong, W.; Ramezani, M.; Mahdavi, M. On provable benefits of depth in training graph convolutional networks. Adv. Neural Inf. Process. Syst. 2021, 34, 9936–9949. [Google Scholar]
  22. Zhou, D.X. Universality of deep convolutional neural networks. Appl. Comput. Harmon. Anal. 2020, 48, 787–794. [Google Scholar] [CrossRef]
  23. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  24. Susnjak, T. ChatGPT: The End of Online Exam Integrity? arXiv 2022, arXiv:2212.09292. [Google Scholar]
  25. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  26. Huang, Y.; Cheng, Y.; Bapna, A.; Firat, O.; Chen, D.; Chen, M.; Lee, H.; Ngiam, J.; Le, Q.V.; Wu, Y.; et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Adv. Neural Inf. Process. Syst. 2019, 32, 103–112. [Google Scholar]
  27. Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Philip, S.Y. A comprehensive survey on graph neural networks. IEEE Trans. Neural Networks Learn. Syst. 2020, 32, 4–24. [Google Scholar] [CrossRef]
  28. Zhang, W.; Shen, Y.; Lin, Z.; Li, Y.; Li, X.; Ouyang, W.; Tao, Y.; Yang, Z.; Cui, B. Pasca: A graph neural architecture search system under the scalable paradigm. In Proceedings of the ACM Web Conference 2022, Lyon, France, 25–29 April 2022; pp. 1817–1828. [Google Scholar]
  29. Yang, C.; Wang, R.; Yao, S.; Liu, S.; Abdelzaher, T. Revisiting Over-smoothing in Deep GCNs. arXiv 2020, arXiv:2003.13663. [Google Scholar]
  30. Zhang, W.; Sheng, Z.; Yin, Z.; Jiang, Y.; Xia, Y.; Gao, J.; Yang, Z.; Cui, B. Model Degradation Hinders Deep Graph Neural Networks. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Washington, DC, USA, 14–18 August 2022; pp. 2493–2503. [Google Scholar]
  31. Li, Q.; Han, Z.; Wu, X.M. Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  32. Belkin, M.; Niyogi, P. Semi-supervised learning on Riemannian manifolds. Mach. Learn. 2004, 56, 209–239. [Google Scholar] [CrossRef]
  33. Dong, Y.; Cordonnier, J.B.; Loukas, A. Attention is not all you need: Pure attention loses rank doubly exponentially with depth. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 2793–2803. [Google Scholar]
  34. Neal, R.M. Bayesian Learning for Neural Networks; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012; Volume 118. [Google Scholar]
  35. Duvenaud, D.; Rippel, O.; Adams, R.; Ghahramani, Z. Avoiding pathologies in very deep networks. In Proceedings of the Artificial Intelligence and Statistics, Reykjavik, Iceland, 22–25 April 2014; pp. 202–210. [Google Scholar]
  36. Liao, R.; Urtasun, R.; Zemel, R. A PAC-Bayesian Approach to Generalization Bounds for Graph Neural Networks. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  37. Valiant, L.G. A theory of the learnable. Commun. ACM 1984, 27, 1134–1142. [Google Scholar] [CrossRef]
  38. Laurent, B.; Massart, P. Adaptive estimation of a quadratic functional by model selection. Ann. Stat. 2000, 28, 1302–1338. [Google Scholar] [CrossRef]
  39. Belkin, M.; Niyogi, P. Laplacian eigenmaps and spectral techniques for embedding and clustering. Adv. Neural Inf. Process. Syst. 2001, 14, 585–591. [Google Scholar]
  40. Davidson, K.R.; Szarek, S.J. Local operator theory, random matrices and Banach spaces. Handb. Geom. Banach Spaces 2001, 1, 131. [Google Scholar]
  41. Sen, P.; Namata, G.; Bilgic, M.; Getoor, L.; Galligher, B.; Eliassi-Rad, T. Collective classification in network data. AI Mag. 2008, 29, 93. [Google Scholar] [CrossRef]
  42. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  43. Xu, J.; Sun, X.; Zhang, Z.; Zhao, G.; Lin, J. Understanding and improving layer normalization. Adv. Neural Inf. Process. Syst. 2019, 32, 4383–4393. [Google Scholar]
  44. Bjorck, N.; Gomes, C.P.; Selman, B.; Weinberger, K.Q. Understanding batch normalization. Adv. Neural Inf. Process. Syst. 2018, 31, 7705–7716. [Google Scholar]
Figure 1. (a) Demonstration of training accuracy and test accuracy with increasing depth on Cora (left) dataset. (b) Demonstration of training accuracy and test accuracy with increasing depth on Citeseer (right) dataset. They indicate that ResGCN can be deeper than GCN, while both GCN and ResGCN exhibit a continuous degradation in performance.
Figure 2. Diagram of Multi-layer Graph Convolutional Networks Architecture.
Figure 5. Comparison of singular value distributions of weight matrices for Vanilla GCN models with different depths on Cora dataset. (a) Cora-4 layers. (b) Cora-8 layers. (c) Cora-16 layers. (d) Cora-32 layers.
Figure 6. Comparison of singular value distributions of weight matrices for Vanilla GCN models with different depths on Citeseer dataset. (a) Citeseer-4 layers. (b) Citeseer-8 layers. (c) Citeseer-16 layers. (d) Citeseer-32 layers.
Figure 7. Comparison of singular value distributions of weight matrices for ResGCN models with different depths with LN (orange histogram) and without LN (blue histogram) on Cora. (a) Cora-4 layers-LN. (b) Cora-8 layers-LN. (c) Cora-16 layers-LN. (d) Cora-32 layers-LN.
Figure 8. Comparison of singular value distributions of weight matrices for ResGCN models with different depths with LN (orange histogram) and without LN (blue histogram) on Citeseer. (a) Citeseer-4 layers-LN. (b) Citeseer-8 layers-LN. (c) Citeseer-16 layers-LN. (d) Citeseer-32 layers-LN.
Figure 9. (a) Demonstration of training accuracy and test accuracy with increasing depth on Citeseer (left) dataset. (b) Demonstration of training accuracy and test accuracy with increasing depth on PubMed (right) dataset. They indicate that ResGCN+LN can have better generalization performance than ResGCN, while both of them exhibit a continuous degradation in performance.
Table 1. Real-world benchmark datasets for node classification task.
Datasets | Type | Nodes | Edges | Classes | Features
CORA | Citation network 1 | 2708 | 10,556 | 7 | 1433
CITESEER | Citation network | 3327 | 9104 | 6 | 3703
PUBMED | Citation network | 19,717 | 88,648 | 3 | 500
1 A citation network is a collection of citing and cited relationships between documents.
Table 2. Statistics on the accuracy of node classification for models with different depths.
Datasets | Models | 2 Layers (epoch) 1 | 4 Layers | 8 Layers | 16 Layers | 32 Layers | 64 Layers
Cora | GCN | 0.8010 (600) | 0.8011 (600) 3 | 0.7471 (600) | FTT 2 | |
Cora | ResGCN | 0.8106 (600) | 0.7993 (600) | 0.7735 (600) | 0.7885 (600) | 0.7793 (600) | FTT
Cora | ResGCN+LN | 0.7928 (600) | 0.8010 (600) | 0.7968 (600) | 0.7956 (600) | 0.7959 (600) | 0.5908 (600)
Citeseer | GCN | 0.6298 (600) | 0.6010 (600) | 0.5745 (600) | FTT | |
Citeseer | ResGCN | 0.6927 (600) | 0.6801 (600) | 0.6994 (600) | 0.6889 (600) | 0.6353 (600) | FTT
Citeseer | ResGCN+LN | 0.6947 (600) | 0.6837 (600) | 0.6792 (600) | 0.6826 (600) | 0.6854 (600) | 0.4421 (600)
PubMed | GCN | 0.7301 (600) | 0.7292 (600) | 0.6748 (600) | FTT | |
PubMed | ResGCN | 0.7642 (600) | 0.7677 (600) | 0.7581 (600) | 0.7585 (600) | 0.7564 (600) | FTT
PubMed | ResGCN+LN | 0.7730 (600) | 0.7949 (600) | 0.7709 (600) | 0.7624 (600) | 0.7760 (600) | 0.5824 (600)
1 epoch: number of iterations required for training. 2 FTT: Failure to Train, i.e., the model cannot complete training within an acceptable number of iterations. 3 The bold numbers in the table represent the optimal values for each row.
