Next Article in Journal
Thermo-Economic Evaluation of Aqua-Ammonia Solar Absorption Air Conditioning System Integrated with Various Collector Types
Next Article in Special Issue
Computation in Complex Networks
Previous Article in Journal
Emergence of Organisms
Previous Article in Special Issue
Spreading Control in Two-Layer Multiplex Networks
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Active Learning for Node Classification: An Evaluation

Department of Computer Science, Tokyo Institute of Technology, Tokyo 152-8552, Japan
*
Author to whom correspondence should be addressed.
Submission received: 30 August 2020 / Revised: 6 October 2020 / Accepted: 12 October 2020 / Published: 16 October 2020
(This article belongs to the Special Issue Computation in Complex Networks)

Abstract

:
Current breakthroughs in the field of machine learning are fueled by the deployment of deep neural network models. Deep neural networks models are notorious for their dependence on large amounts of labeled data for training them. Active learning is being used as a solution to train classification models with less labeled instances by selecting only the most informative instances for labeling. This is especially important when the labeled data are scarce or the labeling process is expensive. In this paper, we study the application of active learning on attributed graphs. In this setting, the data instances are represented as nodes of an attributed graph. Graph neural networks achieve the current state-of-the-art classification performance on attributed graphs. The performance of graph neural networks relies on the careful tuning of their hyperparameters, usually performed using a validation set, an additional set of labeled instances. In label scarce problems, it is realistic to use all labeled instances for training the model. In this setting, we perform a fair comparison of the existing active learning algorithms proposed for graph neural networks as well as other data types such as images and text. With empirical results, we demonstrate that state-of-the-art active learning algorithms designed for other data types do not perform well on graph-structured data. We study the problem within the framework of the exploration-vs.-exploitation trade-off and propose a new count-based exploration term. With empirical evidence on multiple benchmark graphs, we highlight the importance of complementing uncertainty-based active learning models with an exploration term.

1. Introduction

Supervised learning is an important technique used to train machine learning models that are deployed in multiple real-world applications [1]. In a supervised classification problem, data instances with ground truth labels are used for training a model that can predict the labels of unseen data instances. Therefore, the performance of a supervised learning model depends on the quality and quantity of training data, often requiring a huge labeling effort. Usually, the labeling of data instances is done by humans. Labeling large amounts of data leads to a huge cost in both time and money. The labeling cost is significantly high when the labeling task needs to be done by domain experts. For example, potential tumors in medical images can be labeled only by qualified doctors [2,3].
With ever-increasing amounts of data, active learning (AL) is gaining the attention of researchers as well as practitioners as a way to reduce the effort spent on labeling data instances. Usually, a fraction of data instances are selected randomly and their labels are queried from an oracle (e.g., human labelers). This set of labeled instances are used for training the classifier. This process is known as passive learning [4] as the training data is selected at the beginning of the training process and it is assumed to stay fixed. An alternative approach is to iteratively select a small set of training instances, retrieve their labels, and update the training set. Then, the classification model is retrained using the acquired labeled instances and this process is repeated until a good level of performance (e.g., accuracy) is achieved. This process is known as active learning [5]. The objective of AL can be expressed as acquiring instances that maximize model performance. An acquisition function evaluates the informativeness of each unlabeled instance and selects the most informative ones. As quantifying the informativeness of an instance is not straightforward, a multitude of approaches have been proposed in AL literature [5]. For example, selecting the instance the model is most uncertain about is a commonly used acquisition function [6].
In this paper, we study the problem of applying AL for classifying nodes of an attributed graph (The term “network” is used as an alternative term in the literature. We use the term graph since the usage of the term network can be confused with neural networks in this paper.). This task is known as node classification. Reducing the number of labeled nodes required in node classification can benefit a variety of practical applications such as in recommender systems [7,8] and text classification [9] by selecting only the most informative nodes for labeling. Parisot et al. [3] demonstrated the importance of representing the associations between brain scan images of different subjects as a graph for the task of predicting if a subject has Alzheimer’s disease. Features extracted from images are represented as node attributes. This is an example for a node classification problem where labeling is expensive as labeling a brain scan image is time-consuming and it can only be done by medical experts.
Node classification is an important task in learning from relational data. The objective of this problem is to predict the labels of unlabeled nodes given a partially labeled graph. Different approaches have been used for node classification including iterative classification algorithm (ICA) [10], label propagation [11], and Gaussian random fields (GRF) [12]. Approaching node classification as a semisupervised problem has contributed to state-of-the-art in classification performance [13,14,15]. In a semisupervised learning problem, the learning algorithm can utilize the features of all data instances including the unlabeled ones. Only the labels of unlabeled instances are not known. Semisupervised learning is a technique that utilizes unlabeled data to improve the label efficiency. Combining AL with semisupervised learning can increase the label efficiency further [16]. Graph neural network (GNN) models have achieved state-of-the-art performance in node classification [17].
Similar to other neural network-based models, GNN models are sensitive to the choice of hyperparameters. The common hyperparameters of a GNN model are learning rate, number of hidden layers, and the size of hidden units of each hidden layer. Unlike model parameters, the hyperparameters are not directly optimized to improve the model performance. Finding the most suitable set of values for hyperparameters is known as hyperparameter tuning. It is usually performed based on the performance of the model on a separate held-out labeled set known as the validation set. It is possible to leave a fraction of labeled data as the validation set when labeled data is abundant. However, in a label scarce setting, it is realistic to use all the available labeled instances for training the model. Therefore, we further reduce the size of the labeled set by not using a validation set and using fixed standard values for hyperparameters.
With the recent popularity of GNNs, several surveys on GNNs have been done [17,18,19]. These works provide a comprehensive overview of recent developments in graph representation learning and its applications. Surveys on AL research have been done separately [20,21]. However, as far as the authors know, a survey and a systematic comparison of existing AL approaches for the task of node classification have not been done yet. Moreover, only a handful of graph datasets are used for benchmarking such models. Most of the benchmark graphs are similar as they come from the same domain. In this paper, we study commonly used AL acquisition functions on the problem of node classification using a multitude of graph datasets belonging to different domains. As shown in previous work [22], the performance of AL algorithms is not consistent across different datasets.
Our main contributions are
  • we discuss the importance of performing AL experiments in a more realistic setting where an additional labeled dataset is not used for hyperparameter tuning;
  • we perform a thorough evaluation of existing AL algorithms on the task of node classification of attributed graphs in a more realistic setting; and 
  • with empirical evidence on an extensive set of graphs with different characteristics, we highlight that graph properties should be considered in selecting an AL approach.

2. Background

2.1. Node Classification

Node classification plays an important part in learning problems when the data is represented as a graph. A graph G consists of V nodes and E edges connecting pairs of nodes. Edges of a graph can be directional as well. However, we limit our study to undirected graphs. Node classification is widely used in practical applications such as recommender systems [8,23], applied chemistry [24], and social network analysis [25]. In a node classification problem, an attributed graph G = ( V , E ) with N nodes is given as an adjacency matrix A R N × N and a node attribute matrix X R N × F . Here, F is the number of attributes. An element a i j A represents the edge weight between two nodes v i and v j . If there is no edge connecting v i and v j , a i j = 0 . If the graph is undirected, the adjacency matrix A is symmetric. The degree matrix D is a diagonal matrix defined as D = { d 1 , , d N } , where each diagonal element d i is the row-sum of the adjacency matrix such that d i = j = 1 N a i j . Each node v i has a real-valued feature vector x i R N × F and v i belongs to one of the C class labels.
The objective of this problem is to predict the labels of unlabeled nodes V U given a small set of labels V L . Earlier attempts for solving this problem relied on classifiers based on the assumption that nodes connected by an edge are likely to share the same label [26,27]. A major weakness of such classifiers is that this assumption restricts the modeling capacity and the node attributes are not used in the learning process. The use of node attributes of an attributed graph significantly improves the classification performance.

2.2. Graph Neural Networks (GNNs)

A GNN is a neural network architecture specifically designed for learning with attributed graphs. GNN models [14,28,29] achieve state-of-the-art performance on the node classification problem providing a significant improvement over previously used embedding algorithms [30,31]. What sets GNNs apart from previous models is their ability to jointly model both structural information and node attributes. In principle, all GNN models consist of a message passing scheme that propagates feature information of a node to its neighbors. Most GNN architectures use a learnable parameter matrix for projecting features to a different feature space. Usually, two or more of such layers are used along with a nonlinear function (e.g., ReLU). Let G be an undirected attributed graph represented by an adjacency matrix A and a feature matrix X. By adding self-loops to the adjacency matrix we get A ˜ = A + I and its degree matrix D ˜ = D + I . Using this notation, the graph convolutional network (GCN) model [14] can be expressed as
H ˜ ( k ) = D ˜ 1 / 2 A ˜ D ˜ 1 / 2 H ( k 1 ) ,
where D ˜ 1 / 2 A ˜ D ˜ 1 / 2 is the normalized adjacency matrix. Then, the hidden representation of a layer H ( k ) is obtained by multiplying the feature matrix H ˜ ( k ) with a parameter matrix θ and applying an activation function σ as
H ( k ) = σ ( H ˜ ( k ) θ ( k ) ) .
With normalized adjacency matrix A ^ = D ˜ 1 / 2 A ˜ D ˜ 1 / 2 a two-layer GCN model [14] can be expressed as
Y GCN = softmax A ^ ReLU A ^ X θ ( 0 ) θ ( 1 ) ,
where X is the node attribute matrix and θ ( 0 ) and θ ( 1 ) are the parameter matrices of two neural layers. The softmax function defined as softmax ( x ) = e x p ( x ) / c = 1 C e x p ( x c ) normalizes the output of the classifier across all classes. Rectified linear unit (ReLU) is a commonly used activation function where ReLU ( x ) = m a x ( 0 , x ) .
Wu et al. [29] showed that a simplified GNN model named SGC can achieve competitive performance on most attributed graphs at a significantly lower computational cost. They obtained this model by removing hidden layers and nonlinear activation functions in the GCN model. This model can be written as
Y SGC = softmax A ^ k X θ ,
where A k is the kth power of the adjacency matrix A. The parameter k determines the number of hops the feature vectors are propagated to. This approach is similar to propagating node attributes over the k-hop neighborhood and then performing logistic regression. Using a 2-hop neighborhood ( k = 2 ) often results in good performance.

2.3. Active Learning

In this paper, we consider the pool-based AL setting [5]. In a pool-based AL problem, the labeled dataset L is much smaller compared to a large pool of unlabeled items U . We can acquire the label of any unlabeled item by querying an oracle (e.g., a human annotator) at a uniform cost per item. Suppose we are given a query budget K, such that we are allowed to query labels of a maximum of K unlabeled items. We use the notation f θ to denote a classification model with trainable parameters θ . The probability of an instance q belonging to class c predicted by this model is written as P θ ( y ^ q = c | x , D L ) . We calculate this likelihood as
P θ ( y ^ q = c | x , D L ) = softmax f θ ( x q ) [ q = c ] .
AL research has contributed to a multitude of approaches for training supervised learning models with less labeled data. We recommend the work in [5] as a detailed review of existing AL research. The objective of AL approaches is to select the most informative instance for labeling. This task is performed with the use of an acquisition function, where the acquisition function decides which unlabeled example should be labeled next. Existing acquisition functions can be grouped into a few general frameworks based on how they are formulated. In this section, we describe a few commonly used AL frameworks.

2.3.1. Uncertainty Sampling

Uncertainty sampling [32] is one of the most widely used AL approaches. The active learner selects the instance for which the classifier predicts a label with the least certainty. The information entropy of the label predictions is usually used to quantify the uncertainty of the model for a given instance x q such that
H ( y q | x , D L ) = c = 1 C P θ ( y ^ q = c | x q , D L ) l o g P θ ( y ^ q = c | x q , D L ) .
The instance corresponding to the maximum entropy is selected for querying
q * = arg max q H ( y q | x q , D L ) .
The entropy computed over model predictions of a neural network does not correctly represent the model uncertainty for unseen instances. Even though Bayesian models are good at estimating the model uncertainty, Bayesian inference can be prohibitively time-consuming. Gal and Ghahramani [33] demonstrated that using dropout [34] at evaluation time is an approximation to a Bayesian neural network and this can be used to calculate the model uncertainty. Gal et al. [35] used this Bayesian approach to perform uncertainty sampling for active learning on image data with convolutional neural networks (CNN). Additionally, Gal et al. [35] performed a comparison of various acquisition functions proposed for quantifying the model uncertainty of CNN models. It is shown that uncertainty sampling is prone to select outliers [20].
Bayesian Active Learning by Disagreement (BALD) [6] is another uncertainty-based acquisition function used with Bayesian models. BALD algorithm selects the instance that maximizes the mutual information between the predictions and the model posterior. This can be written as
q * = arg max q H ( y q | x q , D L ) E θ p ( θ | D L ) H ( y q | x q , θ , D L ) .
The left side term of the Equation (8) is the entropy of the model prediction and the right side term is the expectation of the model prediction over the posterior of the model parameters. If the model is certain of its predictions for each draw of parameter values, the right side term becomes smaller. In this case the active learner selects the examples x q for which the model is most uncertain of its predictions (high H ( y q | x q , D L ) ), but the model is confident for individual parameter settings (low  E θ p ( θ | D L ) H ( y q | x q , θ , D L ) ).

2.3.2. Query by Committee (QBC)

Query by committee (QBC) [36] is a simple method that outperforms uncertainty sampling in many practical settings. This method maintains a committee of models trained on the same labeled dataset. Each model in the committee predicts the label of an unlabeled instance. The instance for which label predictions of the most number of committee members (models) disagrees is selected as the most informative instance. However, QBC is not a popular choice when AL is used with deep neural network (DNN) models since training a committee of DNN models is time-consuming.

2.3.3. Expected Error Reduction (EER)

Expected Error Reduction (EER) [37] is an AL approach that directly calculates the expected generalization error of a model trained on labeled instances including unlabeled instances L ( x q , y q ) . Then, the active learner queries the instance which minimizes the future generalization error. However, this approach involves the retraining of a model for each unlabeled instance x q with each label c C , making it one of the most time-consuming AL approaches. Therefore, the EER approach has been limited to simple classification algorithms such as Gaussian random fields (GRF) for which faster online retraining is possible.

3. Active Learning for Graph Classification Problems

Compared to application of AL on other types of data such as image and text data, only a limited number of AL models has been developed for graph data. Previous work on applying AL on graph data [38,39,40] is tightly coupled with earlier classification models such as Gaussian random fields, in which the features of nodes are not being used. Therefore, selecting query nodes uniformly in random coupled with a recent GNN model can easily outperform such AL models. AL models which utilize recent GNN architectures [41,42] are limited. Moreover, a comprehensive comparison of AL algorithms proposed for other domains of data has not been done yet.
In Table 1, we provide an extensive comparison of the literature on AL approaches proposed for node classification. We compare each work with respect to the following attributes.
  • AL approach
  • Classifier: Classification model used for predicting the label of a node
  • Attributes: Whether the node classifier uses node attributes
  • Adaptive: Whether the active learner is updated based on the newly labeled instances
  • Labels: Whether the active learner uses node labels in making a decision
In addition to generic approaches proposed for AL, there have been a few works that are specifically designed for graph-structured data. These algorithms use graph-specific metrics for selecting nodes for labeling. In addition to the attributes of data instances, graph topology provides useful information. For example, the degree centrality of a node represents how a particular data instance is connected with others. Table 1 demonstrates that most of the previous AL approaches proposed for node classification do not use the node attribute information. Moreover, some works [40,43] ignore the label information as well.

3.1. Active Learning Framework

In this problem, we start with an extremely small set of labeled instances. We are given a query budget K such that we are allowed to query K number of nodes to retrieve their labels. In each acquisition step, we select a node and retrieve its label from an oracle (e.g., a human labeler). The GNN model is retrained using the training set including the newly labeled instance. We repeat this process K times. The basic framework is shown in Algorithm 1. Here, f θ is any node classification algorithm with parameters θ and we can use different acquisition functions (e.g., uncertainty sampling or QBC) as g.
Algorithm 1 Active learning for node classification.
  • Input: Graph G = ( A , X ) , Query budget K, Initial labels Y L
  • Output: An improved model f θ
  • for i 1 to n q = K do
  •  Select the best unlabeled instance q * with an acquisition function g
  •  Retrieve its label Y q *
  •  Update label set Y L Y L Y q *
  •  Retrain the model θ arg min θ l ( f θ ( G ) , Y L )
  • end for
  • Return θ

3.2. The Importance of Exploration

After each acquisition step, the classifier is trained on a limited number of labeled instances, which in turn are selected by the active learner. Therefore, the selected labeled instances tend to bias towards instances evaluated to be “informative” by the active learner. Therefore, the distribution of labeled instances is often different from the true underlying distribution. The active learner cannot observe the consequences of selecting an instance which has lower “informativeness”. This leads the active learner to converge to policies that are not able to generalize for unlabeled data. This problem is amplified by the lack of hyperparameter tuning. A simple approach to overcome this limitation is to query a few instances in addition to the ones maximizing our selection criteria. This step is known as “exploration” while selecting the instance maximizing the criteria is “exploitation”. For example, if our criterion is model entropy, the exploration step involves acquiring labels of a few instances which do not have the maximum entropy. Intuitively, an active learner should perform more exploration initially, so it can have a better view of the true distribution of data.
This problem is known as the exploration vs. exploitation trade-off in sequential decision-making problems. Solving this trade-off requires the learner to acquire potentially suboptimal instances (i.e., exploration) in addition to the optimal ones. This problem is studied under the framework of multi-armed bandits (MAB) problems [46]. In a MAB problem, a set of actions are given and selecting an action results in observing a reward drawn from a distribution that is unknown to the learner. The problem is to select a sequence of actions that maximize the cumulative reward. A multitude of approaches is used in solving online learning problems modeled as MAB problems. ϵ -greedy, upper confidence bounds (UCB) [47], and Thompson sampling [48] are a few of the frequently used techniques.
We compare the performance of each active learner using two different exploration techniques: ϵ -greedy and count-based exploration.

3.2.1. ϵ -Greedy

ϵ -greedy is used as the simplest method of introducing exploration into an MAB algorithm. In the case of AL, with probability ϵ the active learner randomly selects an unlabeled instance for querying its label. The most informative instance is selected by an acquisition function with probability ( 1 ϵ ) . A key problem with this approach is that, as each unlabeled instance is selected with uniform probability, some of the labeled instances can be wasteful. This phenomena is known as undirected exploration [49].

3.2.2. Count-Based Exploration

In MAB problems, count-based exploration addresses the problem of undirected exploration by assigning a larger probability to actions that have been selected fewer times compared to the remaining actions. Based on the principle of optimism in the face of uncertainty, a count-based exploration algorithm computes an upper confidence bound (UCB) [47] and selects the action corresponding to the maximum UCB. We adopt the notion of count-based exploration as the number of labeled nodes in the neighborhood of an unlabeled node. We define the exploration term of an instance i as the logarithm of the number of unlabeled neighboring nodes of i. This term encourages the learner to sample nodes from neighborhoods with less number of labeled nodes. As this term and the informative metric used in the acquisition function (e.g., entropy) are on different scales, we normalize both of these quantities into [ 0 , 1 ] range and get ϕ exp ( i ) and ϕ inf ( i ) , respectively. We linearly combine these normalized quantities to get the criterion for acquiring nodes as
ϕ ( i ) = ( 1 γ t ) · ϕ inf ( i ) + γ t · ϕ exp ( i ) ,
where the exploration coefficient γ t is a hyperparameter that balances exploration and exploitation. Setting γ t to 0 corresponds to pure exploration disregarding the feedback of the classifier. On the other hand, γ t = 1 is equivalent to pure exploitation selecting a node based only on the uncertainty sampling (e.g., entropy).

4. Experiments

4.1. Data

We evaluate the performance of all algorithms on 11 real-world datasets belonging to different domains. as shown in Table 2. In Table 2, we list the datasets used in experiments with several graph properties. These datasets belong to different domains such as citation networks, product networks, co-author networks, biological networks, and social networks.
CiteSeer, PubMed, and CORA [50] are commonly used citation graphs. Each of these undirected graphs is made of documents as nodes and citations as edges between them. If one document cites another, they are linked by an edge. The bag-of-words features of the text content of a document correspond to the attributes of a node.
Co-author CS and Co-author Physics are co-authorship graphs constructed from Microsoft Academic Graph [51]. Authors are represented as nodes and two authors are linked by an edge if they have co-authored a paper. Node features correspond to the keywords of the papers authored by a particular author. An author’s most active field of study is used as the node label.
Amazon Computers is a subgraph of the Amazon co-purchase graph [52]. Products are represented as nodes, and two nodes are connected by an edge of those two products that are frequently bought together. Node attributes correspond to product reviews encoded as bag-of-words features. The product category is used as the node label.
The disease dataset [53] simulates the SIR disease propagation model [54] on a graph. The label of a node indicates whether a node is infected or not and the features indicate the susceptibility to the disease.
The Wiki-CS dataset [55] is a graph constructed from Wikipedia articles corresponding to computer science. A Wikipedia article is a node of this graph and two nodes are connected by an edge if one article has a hyperlink to the other. GloVe word embeddings [56] obtained from the text content of an article is used as the feature vector of the node corresponding to that article.
Each protein–protein interaction (PPI) graph represents physical contacts between proteins in a human tissue (brain, blood, and kidney) [57,58]. Unlike other datasets, in PPI graphs a protein (node) can have multiple functions as its label, making this a multi-label classification problem. Learning the protein function (cellular function from gene ontology) involves learning about node roles. Several properties of a protein such as positional gene sets, motif gene sets and immunological signatures are used as node attributes in a PPI graph.
Github is a social network dataset constructed from developer profiles on Github who have at least 10 public repositories [59]. Details of a developer such as location, employee, and starred repositories are represented as node attributes. Two nodes are linked by an edge if those two developers mutually follow each other on Github. The label of a node indicates whether a developer is primarily working on machine learning or web development projects.
From each dataset, we randomly select two nodes belonging to each label as the initial labeled set V L . We use 5% of the rest of the unlabeled nodes as the test set. The set of remaining unlabeled nodes V U qualify to be queried. The size of the initial labeled set and its size as a fraction of the total nodes (labeling rate) are shown in Table 2.

Graph Properties

In some real-world graphs, such as social and communication networks, nodes tend to cluster together creating tightly knit groups of nodes. This phenomenon is known as clustering and the clustering coefficient [60] quantifies the amount of clustering present in a graph. The local clustering coefficient of a node i is calculated as
C i = number of triangles connected to node i number of triples centered around node i .
Average clustering coefficient is calculated as the average of local clustering coefficients of all nodes of a graph.
In real-world graphs, nodes tend to connect with other nodes with similar properties. In network science literature this phenomenon is known as “assortative mixing” [61]. Assortativity coefficient quantifies the amount of assortative mixing present in a graph. Assortativity coefficient can be calculated with respect to any node attribute. We calculate the label assortativity ( r L ) with
r L = i e i i i a i b i 1 i a i b i ,
where e i j denotes the fraction of edges connecting a node with label i with a node with label j. For multi-label graphs, we calculate label assortativity for each label separately and take the average. A higher label associativity indicates that a node tends to connect with another node with the same label. As shown in Table 2, citation and co-author graphs exhibit high assortativity. It is easier to predict labels in a graph exhibiting high assortativity since neighbors of a node tend to have the same label as the node. Many node classification models are based on this assumption. However, the PPI graphs show low assortativity indicating that nodes with the same label are not necessarily in the same neighborhood. This is due to the fact that the function of a protein (i.e., node) depends on the role of a node in that graph rather than its neighboring proteins (i.e., nodes). Using degree centrality as a node attribute degree assortativity r D of each node can be computed in a similar manner. Average degree assortativity of a graph indicates whether a high degree node prefers to connect with other high degree nodes.

4.2. Experimental Setup

4.2.1. Node Classification Model

Recent studies demonstrated that GNN-based classifiers significantly outperform previous classifier algorithms such as GRFs. Therefore, we restrict our study of AL to GNN-based learning models. In our experiments, we consider two types of graph neural network architectures: GCN [14] and SGC [29]. SGC is a simplified GNN architecture that does not include a hidden layer and nonlinear activation functions. As the goal of AL is to reduce the number of labeled instances used for training, we do not use a separate validation set for fine-tuning the hyperparameters of a GNN model. In addition, it is shown that tuning hyperparameters while training a model with AL can lead to label inefficiency [62].
For all datasets, we use the default hyperparameters used in GNN literature (e.g., learning rate = 0.01). We use the following algorithms in our experiments.
  • Random: Select an unlabeled node randomly,
  • PageRank: Select the unlabeled node with the largest PageRank centrality,
  • Degree: Select the unlabeled node with the largest degree centrality,
  • Clustering coefficient: Select the unlabeled node with the largest clustering coefficient,
  • Entropy: Calculate the entropy of predictions of the current model over unlabeled nodes and select the node corresponding to the largest entropy.,
  • BALD [6,35]: Select the node which has the the largest mutual information value between predictions and model posterior, and
  • AGE [41]: Select the node which maximizes a linear combination of three metrics: PageRank centrality, model entropy and information density.
Here, PageRank, degree, and clustering coefficient-based sampling do not use node attributes or the feedback from the classification model. On the other hand, entropy BALD are uncertainty-based acquisition functions that calculate an uncertainty metric using the performance of the classifier trained using the current training set. We acquire the label of an unlabeled node and retrain the GNN model by performing 50 steps of adam optimizer [63]. We perform 40 acquisition steps (query budget = 40) and repeat this process on 30 different randomly initialized training and test splits for each dataset. Test dataset is often unbalanced. Therefore, accuracy is not suitable to be used as the performance metric. We report the average F1 score (macro-averaged) over the test set in each experiment. F1-score is the harmonic mean of the precision and recall metrics. Macro-F1 score is calculated by first calculating F1-scores for each class separately and then taking the average of class-wise F1-scores.

4.2.2. Packages and Hardware

We use the NetworkX library [64] for representing and processing graphs. We use the Pytorch [65] implementations of GCN [14] and SGC [29] node classification models. All experiments are run on a computer running Ubuntu 18.04 OS on an Intel(R) Core i9-7900X CPU @ 3.30GHz processor with 64GB memory and a NVIDIA GTX 1080-Ti GPU.

5. Results and Discussion

5.1. Performance Comparison of AL Approaches

In this section, we compare the performance of acquisition functions which use only a single type of approach. Figure 1 and Figure 2 show how the performance of the node classification model varies with the number of acquisitions.
As shown in previous works, AGE [41], the current state-of-the-art AL algorithm, performs well on citation networks (CiteSeer, CORA, and PubMed). However, the performance of this algorithm is suboptimal on other datasets such as Wiki-CS. The citation datasets possess similar characteristics. For example, average degree centrality of them is in the same range as shown in Table 2. Therefore, selecting AL algorithms based on their performance on a handful of graphs from the same domain may result in suboptimal algorithms.

5.2. Comparison of Exploration Strategies

In this experiment, we run uncertainty sampling algorithms: BALD and entropy with ϵ -greedy and count-based exploration terms. In the count-based exploration policy, we set the exploration coefficient β to 0.5 . In Table 3 and Table 4, we present the performance of GCN and SGC classifiers when 40 nodes are acquired using each of the acquisition functions. Entropy-Count and BALD-Count correspond to max entropy sampling and BALD policy combined with count-based exploration term. The values in bold indicate that the performance of an algorithm is significantly better (at 5% significance level) than the rest of the algorithms on that dataset. We calculate the statistical significance between the performance of two algorithms using paired t-test. If no single algorithm is significantly better than the rest, all statistically significant values are marked in bold. We summarize the results in Table 5 and show the best performing AL algorithm along with the classifier. Uncertainty-based acquisition functions, when combined with the count-based exploration term (Entropy-Count and BALD-Count), achieve the best performance on average on four datasets. It highlights that encouraging the active learner to select nodes in less explored neighborhoods is effective than selecting a node in random as the exploration step ( ϵ -greedy).

5.3. Running Time

Table 6 shows the execution time each algorithm spends to acquire a set of 40 unlabeled instances on average. AGE, the current state-of-the-art, is several magnitudes slower compared to the rest of the algorithms. The clustering step performed to compute the information gain is responsible for the additional time. The time complexity of this step grows O ( n 2 ) with the number of vertices n of a graph making AGE not suitable for large attributed graphs. For example, the AGE algorithm is 80 times slower than random sampling for the Amazon Computers graph but achieves inferior performance. Additionally, the SGC model can be trained in a relatively less time compared to the GCN model and this difference is significant for larger graphs such as Wiki-CS and co-authorship graphs. However, in AL problems, the time spent for selecting an unlabeled example is a minor concern since the labeling time is more valued.

5.4. Discussion

As shown in Table 5, the performance of acquisition functions is diverse such that no single approach can be considered the best for all datasets. Sampling nodes based on graph properties leads to good performance depending on the graph structure. We make several key observations on how average clustering coefficient and label assortativity of a graph impact the performance of AL acquisition functions as following.
Graphs with high level of clustering. Amazon computers, co-authorship graphs, and Wiki-CS graphs have larger average clustering coefficients. For these datasets, sampling the node with the largest clustering coefficient outperforms sampling with other node centrality measures.
Graphs with medium level of clustering. CiteSeer, CORA, Github, and PPI graphs possess a medium level of average clustering in the range of 0.1 to 0.2. On CORA, CiteSeer, and Github datasets uncertainty-based acquisition functions and their variants obtain the best performance. However, the performance of PPI graphs is quite different since their label assortativity values are significantly low compared to all other datasets.
Graphs with low level of clustering. Pubmed and the disease graphs have the lowest average clustering coefficients. In most cases, the use of clustering coefficient to select the nodes for querying lead to suboptimal results. However, sampling with clustering coefficient on PubMed dataset obtained good performance when the GCN model was used as the node classifier.
Graphs with low label assortativity. Out of all graph datasets, PPI graphs exhibit the lowest label assortativity. As most of the graphs used in node classification literature exhibit high label assortativity, commonly used node classification models are build on the assumption that neighbors of a node may have the same label. Therefore, such models are not confident in predicting the labels of unlabeled nodes, specially when the training data is scarce. On PPI graphs, we observe that performing AL by sampling the query nodes based on PageRank and degree centrality contributes to the best performing models. However, one limitation in calculating the label assortativity is that node labels need to be known beforehand. When we are given an unlabeled graph, one way to overcome this problem is we can use similar labeled graphs belonging to the same domain to approximate the label assortativity.

6. Conclusions

In this paper, we studied the application of the active learning framework as a method to make node classification on attributed graphs label efficient. We have performed an empirical evaluation of state-of-the-art active learning algorithms on the node classification task using twelve real-world attributed graphs belonging to different domains. In our experiments, we initiate the active learner with an extremely small number of labeled instances. Additionally, we assumed a more realistic setting in which the learner does not use a separate validation set. Our results highlight that no single acquisition function can be performs consistently well on all datasets and the performance of acquisition functions depend on graph properties. We further show that selecting an acquisition function based on the performance on a handful of attributed graphs with similar characteristics result in suboptimal algorithms. Notably, our results point that SGC, a simpler variant of GNN performs better and efficiently on most datasets compared to more complex GNN models.
A key takeaway of this research is that AL is beneficial in reducing the labeling cost of semisupervised node classification models and the choice of an AL acquisition function depends on the properties of the graph data at hand. Using an extensive set of graph datasets with a wide variety of characteristics, we showed that there is no single algorithm that works across different graph datasets possessing different graph properties. We further made the observation that using node PageRank and degree centrality of nodes achieve the best performance on graphs with low label assortativity.
Moreover, the current state-of-the-art active learning algorithm (AGE) [41] uses a combination of multiple acquisition functions and it is several magnitudes slower than all other acquisition functions that were used in this paper. Therefore, it is not suitable for large real-world attributed graphs. Lack of hyperparameter tuning and a minuscule number of training instances lead to classifiers that cannot generalize well for unlabeled data. We expressed this problem as balancing the exploration-vs.-exploitation trade-off and propose introducing an exploration term into acquisition functions. We evaluated the performance of two exploration terms using multiple real-world graph datasets. The introduction of this exploration term into existing uncertainty-based acquisition functions make their performance competitive with the current state-of-the-art AL algorithm for node classification on some datasets. As future work, we would like to explore how AL can be utilized for other graph-related learning tasks.

Author Contributions

Conceptualization, K.M.; methodology, K.M.; software, K.M.; validation, K.M.; formal analysis, K.M.; investigation, K.M.; writing—original draft preparation, K.M.; writing—review and editing, T.M.; visualization, K.M.; supervision, T.M.; funding acquisition, T.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by JSPS Grant-in-Aid for Scientific Research(B) (Grant Number 17H01785) and JST CREST (Grant Number JPMJCR1687).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript.
DNNDeep neural network
GCNGraph convolutional network
GNNGraph neural network
SGCSimplified graph convolution
ALActive learning
CNNConvolutional neural network
BALDBayesian Active Learning by Disagreement
QBCQuery by committee
EERExpected error reduction
GRFGaussian random fields
AGEActive graph embedding
PRPageRank
UCBUpper confidence bound

References

  1. Mohri, M.; Rostamizadeh, A.; Talwalkar, A. Foundations of Machine Learning; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  2. Hoi, S.C.; Jin, R.; Zhu, J.; Lyu, M.R. Batch mode active learning and its application to medical image classification. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 417–424. [Google Scholar]
  3. Parisot, S.; Ktena, S.I.; Ferrante, E.; Lee, M.; Guerrero, R.; Glocker, B.; Rueckert, D. Disease prediction using graph convolutional networks: Application to Autism Spectrum Disorder and Alzheimer’s disease. Med. Image Anal. 2018, 48, 117–130. [Google Scholar] [CrossRef] [Green Version]
  4. Shalev-Shwartz, S.; Ben-David, S. Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press: Cambridge, UK, 2014. [Google Scholar]
  5. Settles, B. Active Learning Literature Survey; Technical Report; University of Wisconsin-Madison Department of Computer Sciences: Madison, WI, USA, 2009. [Google Scholar]
  6. Houlsby, N.; Huszár, F.; Ghahramani, Z.; Lengyel, M. Bayesian Active Learning for Classification and Preference Learning. arXiv 2011, arXiv:1112.5745. [Google Scholar]
  7. Rubens, N.; Elahi, M.; Sugiyama, M.; Kaplan, D. Active Learning in Recommender Systems. In Recommender Systems Handbook; Springer: Berlin/Heidelberg, Germany, 2015; pp. 809–846. [Google Scholar]
  8. Ying, R.; He, R.; Chen, K.; Eksombatchai, P.; Hamilton, W.L.; Leskovec, J. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 974–983. [Google Scholar]
  9. Yao, L.; Mao, C.; Luo, Y. Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 7370–7377. [Google Scholar]
  10. Neville, J.; Jensen, D. Iterative classification in relational data. In Proceedings of the AAAI-2000 Workshop on Learning Statistical Models From Relational Data, Austin, TX, USA, 31 July 2000; pp. 13–20. [Google Scholar]
  11. Zhu, X.; Ghahramani, Z. Learning from Labeled and Unlabeled Data with Label Propagation; Technical Report; Carnegie Mellon University: Cambridge, MA, USA, 2002. [Google Scholar]
  12. Zhu, X.; Ghahramani, Z.; Lafferty, J.D. Semi-supervised learning using gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), Washington, DC, USA, 21–24 August 2003; pp. 912–919. [Google Scholar]
  13. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In Proceedings of the 2016 Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 1–14. [Google Scholar]
  14. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
  15. Veličković, P.; Fedus, W.; Hamilton, W.L.; Liò, P.; Bengio, Y.; Hjelm, R.D. Deep Graph Infomax. arXiv 2018, arXiv:1809.10341. [Google Scholar]
  16. Fazakis, N.; Kanas, V.G.; Aridas, C.K.; Karlos, S.; Kotsiantis, S. Combination of Active Learning and Semi-Supervised Learning under a Self-Training Scheme. Entropy 2019, 21, 988. [Google Scholar] [CrossRef] [Green Version]
  17. Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Philip, S.Y. A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2020. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  18. Zhou, J.; Cui, G.; Zhang, Z.; Yang, C.; Liu, Z.; Wang, L.; Li, C.; Sun, M. Graph neural networks: A review of methods and applications. arXiv 2018, arXiv:1812.08434. [Google Scholar]
  19. Zhang, Z.; Cui, P.; Zhu, W. Deep learning on graphs: A survey. IEEE Trans. Knowl. Data Eng. 2020. [Google Scholar] [CrossRef] [Green Version]
  20. Settles, B.; Craven, M. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, Honolulu, HI, USA, 25–27 October 2008; pp. 1070–1079. [Google Scholar]
  21. Fu, Y.; Zhu, X.; Li, B. A survey on instance selection for active learning. Knowl. Inf. Syst. 2013, 35, 249–283. [Google Scholar] [CrossRef]
  22. Baram, Y.; Yaniv, R.E.; Luz, K. Online choice of active learning algorithms. J. Mach. Learn. Res. 2004, 5, 255–291. [Google Scholar]
  23. Huang, Z.; Chung, W.; Ong, T.H.; Chen, H. A graph-based recommender system for digital library. In Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries, Portland, OR, USA, 14–18 July 2002; pp. 65–73. [Google Scholar]
  24. Gilmer, J.; Schoenholz, S.S.; Riley, P.F.; Vinyals, O.; Dahl, G.E. Neural message passing for quantum chemistry. arXiv 2017, arXiv:1704.01212. [Google Scholar]
  25. Bhagat, S.; Cormode, G.; Muthukrishnan, S. Node classification in social networks. In Social Network Data Analytics; Springer: Berlin/Heidelberg, Germany, 2011; pp. 115–148. [Google Scholar]
  26. Zhu, X.; Lafferty, J.; Ghahramani, Z. Combining Active Learning and Semi-supervised Learning using Gaussian Fields and Harmonic Functions. In Proceedings of the ICML 2003 Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data mining, Washington, DC, USA, 21–24 August 2003; Volume 3. [Google Scholar]
  27. Zhou, D.; Bousquet, O.; Lal, T.N.; Weston, J.; Schölkopf, B. Learning with local and global consistency. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 13–18 December 2004; pp. 321–328. [Google Scholar]
  28. Li, Y.; Tarlow, D.; Brockschmidt, M.; Zemel, R. Gated Graph Sequence Neural Networks. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  29. Wu, F.; Souza, A.; Zhang, T.; Fifty, C.; Yu, T.; Weinberger, K. Simplifying Graph Convolutional Networks. In Proceedings of the 36th International Conference on Machine Learning, PMLR, Beach, CA, USA, 10–15 June 2019; pp. 6861–6871. [Google Scholar]
  30. Perozzi, B.; Al-Rfou, R.; Skiena, S. Deepwalk: Online Learning of Social Representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; pp. 701–710. [Google Scholar]
  31. Yang, Z.; Cohen, W.; Salakhudinov, R. Revisiting Semi-Supervised Learning with Graph Embeddings. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 14–19 June 2016; pp. 40–48. [Google Scholar]
  32. Lewis, D.D.; Catlett, J. Heterogeneous uncertainty sampling for supervised learning. In Machine Learning Proceedings 1994; Elsevier: Amsterdam, The Netherlands, 1994; pp. 148–156. [Google Scholar]
  33. Gal, Y.; Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; Volume 48, pp. 1050–1059. [Google Scholar]
  34. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  35. Gal, Y.; Islam, R.; Ghahramani, Z. Deep Bayesian Active Learning with Image Data. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 1183–1192. [Google Scholar]
  36. Seung, H.S.; Opper, M.; Sompolinsky, H. Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, PA, USA, 27–29 July 1992; pp. 287–294. [Google Scholar]
  37. Roy, N.; McCallum, A. Toward Optimal Active Learning through Monte Carlo Estimation of Error Reduction. In Proceedings of the 18th International Conference on Machine Learning, Williamstown, MA, USA, 28 June–1 July 2001; pp. 441–448. [Google Scholar]
  38. Gu, Q.; Han, J. Towards Active Learning on Graphs: An Error Bound Minimization Approach. In Proceedings of the 2012 IEEE 12th International Conference on Data Mining, Brussels, Belgium, 10 December 2012; pp. 882–887. [Google Scholar]
  39. Bilgic, M.; Mihalkova, L.; Getoor, L. Active Learning for Networked Data. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010; pp. 79–86. [Google Scholar]
  40. Ji, M.; Han, J. A Variance Minimization Criterion to Active Learning on Graphs. In Proceedings of the Artificial Intelligence and Statistics, La Palma, Canary Islands, 21–23 April 2012; pp. 556–564. [Google Scholar]
  41. Cai, H.; Zheng, V.W.; Chang, K.C.C. Active Learning for Graph Embedding. arXiv 2017, arXiv:1705.05085. [Google Scholar]
  42. Gao, L.; Yang, H.; Zhou, C.; Wu, J.; Pan, S.; Hu, Y. Active Discriminative Network Representation Learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; AAAI Press: Menlo Park, CA, USA, 2018; pp. 2142–2148. [Google Scholar]
  43. Gadde, A.; Anis, A.; Ortega, A. Active semisupervised learning using sampling theory for graph signals. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; pp. 492–501. [Google Scholar]
  44. Macskassy, S.A. Using graph-based metrics with empirical risk minimization to speed up active learning on networked data. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 28 June–July 2009; pp. 597–606. [Google Scholar]
  45. Ma, Y.; Garnett, R.; Schneider, J. σ-Optimality for active learning on gaussian random fields. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–8 December 2013; pp. 2751–2759. [Google Scholar]
  46. Lattimore, T.; Szepesvári, C. Bandit Algorithms; Cambridge University Press: Cambridge, UK, 2020. [Google Scholar]
  47. Auer, P. Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res. 2002, 3, 397–422. [Google Scholar]
  48. Thompson, W.R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 1933, 25, 285–294. [Google Scholar] [CrossRef]
  49. Thrun, S.B. The role of exploration in learning control. In Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Approaches; Van Nostrand Reinhold: New York, NY, USA, 1992; pp. 1–27. [Google Scholar]
  50. Sen, P.; Namata, G.; Bilgic, M.; Getoor, L.; Galligher, B.; Eliassi-Rad, T. Collective Classification in Network Data. AI Mag. 2008, 29, 93. [Google Scholar] [CrossRef] [Green Version]
  51. Shchur, O.; Mumme, M.; Bojchevski, A.; Günnemann, S. Pitfalls of Graph Neural Network Evaluation. In Proceedings of the Relational Representation Learning Workshop (NeurIPS 2018), Montréal, QC, Canada, 3–8 December 2018. [Google Scholar]
  52. McAuley, J.; Targett, C.; Shi, Q.; Van Den Hengel, A. Image-based Recommendations on Styles and Substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, 9–13 August 2015; pp. 43–52. [Google Scholar]
  53. Chami, I.; Ying, Z.; Ré, C.; Leskovec, J. Hyperbolic Graph Convolutional Neural Networks. In Advances in Neural Information Processing Systems 32; Wallach, H., Larochelle, H., Beygelzimer, A., Alché-Buc, F.D., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 4868–4879. [Google Scholar]
  54. Anderson, R.M.; Anderson, B.; May, R.M. Infectious Diseases of Humans: Dynamics and Control; Oxford University Press: Oxford, UK, 1992. [Google Scholar]
  55. Mernyei, P.; Cangea, C. Wiki-CS: A Wikipedia-Based Benchmark for Graph Neural Networks. arXiv 2020, arXiv:2007.02901. [Google Scholar]
  56. Pennington, J.; Socher, R.; Manning, C. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; pp. 1532–1543. [Google Scholar] [CrossRef]
  57. Zitnik, M.; Leskovec, J. Predicting multicellular function through multi-layer tissue networks. Bioinformatics 2017, 33, 190–198. [Google Scholar] [CrossRef] [Green Version]
  58. Oughtred, R.; Stark, C.; Breitkreutz, B.J.; Rust, J.; Boucher, L.; Chang, C.; Kolas, N.; O’Donnell, L.; Leung, G.; McAdam, R.; et al. The BioGRID interaction database: 2019 update. Nucleic Acids Res. 2019, 47, D529–D541. [Google Scholar] [CrossRef] [Green Version]
  59. Rozemberczki, B.; Allen, C.; Sarkar, R. Multi-scale Attributed Node Embedding. arXiv 2019, arXiv:1909.13021. [Google Scholar]
  60. Watts, D.J.; Strogatz, S.H. Collective dynamics of ‘small-world’networks. Nature 1998, 393, 440–442. [Google Scholar] [CrossRef]
  61. Newman, M.E. Mixing patterns in networks. Phys. Rev. E 2003, 67, 026126. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  62. Ash, J.T.; Zhang, C.; Krishnamurthy, A.; Langford, J.; Agarwal, A. Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  63. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  64. Hagberg, A.; Swart, P.; S Chult, D. Exploring Network Structure, Dynamics, and Function Using NetworkX; Technical Report; Los Alamos National Lab.(LANL): Los Alamos, NM, USA, 2008. [Google Scholar]
  65. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 8026–8037. [Google Scholar]
Figure 1. Macro-F1 score (test) of active learning algorithms with number of acquisitions. A two-layer graph convolutional network (GCN) is used as the graph neural network (GNN) model. (a) CiteSeer. (b) PubMed. (c) CORA. (d) Amazon Computers. (e) Co-author CS. (f) Co-author Physics. (g) Disease. (h) Wiki-CS. (i) PPI-Brain. (j) PPI-Blood. (k) PPI-Kidney. (l) Github.
Figure 1. Macro-F1 score (test) of active learning algorithms with number of acquisitions. A two-layer graph convolutional network (GCN) is used as the graph neural network (GNN) model. (a) CiteSeer. (b) PubMed. (c) CORA. (d) Amazon Computers. (e) Co-author CS. (f) Co-author Physics. (g) Disease. (h) Wiki-CS. (i) PPI-Brain. (j) PPI-Blood. (k) PPI-Kidney. (l) Github.
Entropy 22 01164 g001
Figure 2. Macro-F1 score (test) of active learning algorithms with number of acquisitions. SGC model is used as the GNN model. (a) CiteSeer. (b) PubMed. (c) CORA. (d) Amazon Computers. (e) Co-author CS. (f) Co-author Physics. (g) Disease. (h) Wiki-CS. (i) PPI-Brain. (j) PPI-Blood. (k) PPI-Kidney. (l) Github.
Figure 2. Macro-F1 score (test) of active learning algorithms with number of acquisitions. SGC model is used as the GNN model. (a) CiteSeer. (b) PubMed. (c) CORA. (d) Amazon Computers. (e) Co-author CS. (f) Co-author Physics. (g) Disease. (h) Wiki-CS. (i) PPI-Brain. (j) PPI-Blood. (k) PPI-Kidney. (l) Github.
Entropy 22 01164 g002
Table 1. Summary of existing work for active node classification on attributed graphs. The work by Gadde et al. [43] does not use the labels of the nodes. Therefore, this method does not use a classifier. We use the following abbreviations in the table. LR—Logistic Regression, GRF—Gaussian Random Fields, LP—Label Propagation, SC—Spectral Clustering, NA—Not Applicable.
Table 1. Summary of existing work for active node classification on attributed graphs. The work by Gadde et al. [43] does not use the labels of the nodes. Therefore, this method does not use a classifier. We use the following abbreviations in the table. LR—Logistic Regression, GRF—Gaussian Random Fields, LP—Label Propagation, SC—Spectral Clustering, NA—Not Applicable.
WorkAL ApproachClassifierAttributesAdaptiveLabelsYear
Zhu et al. [26]EERGRFNoNoYes2003
Macskassy [44]EER + HeuristicsGRFNoYesYes2009
Bilgic et al. [39]QBCLRNoYesYes2010
Gu and Han [38]EERLPNoNoYes2012
Ji and Han [40]Variation MinimizationGRFNoNoNo2012
Ma et al. [45]UncertaintyGRFNoNoYes2013
Gadde et al. [43]SCNANoNoNo2015
Cai et al. [41]Uncertainty + HeuristicsGCNYesYesYes2017
Table 2. Dataset statistics. Labeling rate as a percentage of total nodes is shown within brackets. Avg. deg.: Average degree, Avg. CC: Average clustering coefficient, r D : Degree assortativity, r L : Label assortativity.
Table 2. Dataset statistics. Labeling rate as a percentage of total nodes is shown within brackets. Avg. deg.: Average degree, Avg. CC: Average clustering coefficient, r D : Degree assortativity, r L : Label assortativity.
DatasetNodesClasses Avg. Deg. Avg. CC r D r L FeaturesLabels (%)
CiteSeer211062.840.170.0070.67370312 (0.56)
PubMed19,71736.340.06−0.0440.695006 (0.03)
n CORA248574.000.24−0.0710.76143314 (0.56)
Amazon Comp.13,7521036.740.35−0.0570.6876720 (0.14)
Co-author Phy34,493514.380.380.2010.87841510 (0.03)
Co-author CS18,333158.930.340.1130.79680530 (0.16)
Disease104422.000.0−0.5440.6810004 (0.38)
Wiki-CS11,7011036.940.47−0.0650.5830020 (0.17)
PPI-Brain348012131.940.17−0.0640.095035 (1.0)
PPI-Blood331212132.910.18−0.0610.095033 (1.0)
PPI-Kidney328412131.700.18−0.0670.095033 (1.0)
Github37,700215.330.17−0.0750.3840054 (0.01)
Table 3. Average F1-score of different acquisition functions. Forty query instances are selected (average of 30 runs). Standard deviation is shown underneath the macro-averaged F1-score. Classifier: GCN. Rand—Random, Ent—Entropy, PR—PageRank, Deg—Degree, CC: Clustering coefficient.
Table 3. Average F1-score of different acquisition functions. Forty query instances are selected (average of 30 runs). Standard deviation is shown underneath the macro-averaged F1-score. Classifier: GCN. Rand—Random, Ent—Entropy, PR—PageRank, Deg—Degree, CC: Clustering coefficient.
DatasetRandEntBALDPRDeg CC Ent ϵ -GreedyBALD ϵ -GreedyEnt CountBALD CountAGE
CiteSeer58.4 ± 6.960.1 ± 8.158.6 ± 5.154.4 ± 3.453.8 ± 4.353.6 ± 7.059.4 ± 4.252.0 ± 5.760.3 ± 4.259.1 ± 4.761.5 ± 3.7
CORA74.1 ± 5.173.9 ± 6.671.4 ± 7.471.8 ± 5.170.2 ± 4.974.5 ± 5.575.1 ± 4.170.4 ± 6.473.3 ± 5.472.9 ± 3.674.5 ± 7.7
PubMed76.4 ± 4.074.1 ± 3.075.7 ± 3.875.3 ± 3.371.8 ± 3.776.8 ± 1.474.2 ± 4.274.1 ± 4.177.3 ± 1.775.7 ± 4.474.5 ± 2.2
Coauthor CS82.4 ± 2.378.9 ± 3.480.6 ± 3.679.7 ± 3.680.7 ± 2.581.9 ± 3.778.2 ± 4.481.2 ± 2.380.3 ± 4.982.1 ± 2.583.9 ± 2.2
Coauthor Phy85.2 ± 3.384.1 ± 3.283.8 ± 2.885.2 ± 1.877.1 ± 2.186.4 ± 2.985.5 ± 2.783.4 ± 3.586.9 ± 2.983.7 ± 2.887.9 ± 2.6
Amazon Comp.76.1 ± 5.474.2 ± 3.466.8 ± 7.665.2 ± 8.160.2 ± 15.676.7 ± 4.173.1 ± 6.070.8 ± 8.175.4 ± 3.873.3 ± 7.574.2 ± 5.9
Disease57.1 ± 7.167.1 ± 8.767.2 ± 8.759.4 ± 8.853.2 ± 9.120.8 ± 5.161.0 ± 10.766.5 ± 9.465.8 ± 9.267.2 ± 7.263.3 ± 8.0
Wiki-CS57.1 ± 7.155.0 ± 5.162.4 ± 2.559.4 ± 3.158.2 ± 2.260.5 ± 3.761.0 ± 10.763.3 ± 2.957.0 ± 3.362.1 ± 3.757.7 ± 4.9
PPI Brain25.6 ± 6.521.4 ± 6.331.6 ± 6.141.1 ± 2.141.0 ± 2.419.3 ± 6.622.3 ± 6.030.0 ± 8.922.2 ± 5.035.3 ± 6.122.3 ± 6.2
PPI Blood27.7 ± 3.222.9 ± 5.731.0 ± 6.542.4 ± 1.641.4 ± 1.921.0 ± 5.826.5 ± 4.936.9 ± 4.623.6 ± 5.437.4 ± 4.323.3 ± 5.6
PPI Kidney25.7 ± 2.918.7 ± 6.827.9 ± 9.642.1 ± 1.641.1 ± 2.216.3 ± 5.918.8 ± 7.033.5 ± 3.329.2 ± 1.737.6 ± 3.419.4 ± 4.8
Github74.0 ± 8.077.1 ± 1.974.5 ± 2.471.1 ± 2.962.3 ± 4.875.4 ± 1.877.3 ± 1.674.4 ± 2.276.4 ± 2.273.8 ± 2.373.9 ± 2.1
Table 4. Average F1-score of different acquisition functions. Forty query instances are selected (average of 30 runs). Standard deviation is shown underneath the macro-averaged F1-score. Classifier: SGC. Rand - Random, Ent—Entropy, PR—PageRank, Deg—Degree, CC: Clustering coefficient.
Table 4. Average F1-score of different acquisition functions. Forty query instances are selected (average of 30 runs). Standard deviation is shown underneath the macro-averaged F1-score. Classifier: SGC. Rand - Random, Ent—Entropy, PR—PageRank, Deg—Degree, CC: Clustering coefficient.
DatasetRandEntBALDPRDegCCEnt ϵ -GreedyBALD ϵ -GreedyEnt CountBALD CountAGE
CiteSeer55.5 ± 4.659.9 ± 4.758.0 ± 4.055.0 ± 3.453.4 ± 5.353.4 ± 7.456.3 ± 5.656.0 ± 4.660.0 ± 6.356.6 ± 4.860.4 ± 5.6
CORA76.1 ± 3.775.4 ± 4.071.4 ± 2.371.4 ± 5.169.3 ± 3.474.9 ± 6.173.9 ± 6.173.8 ± 4.476.7 ± 6.174.2 ± 3.174.7 ± 6.4
PubMed75.8 ± 3.674.8 ± 2.377.5 ± 2.676.7 ± 2.572.3 ± 6.460.7 ± 7.975.3 ± 3.777.2 ± 2.476.6 ± 2.878.0 ± 1.777.7 ± 3.4
Coauthor CS81.7 ± 2.976.8 ± 3.481.9 ± 3.981.3 ± 4.181.4 ± 4.081.9 ± 3.776.9 ± 4.182.6 ± 3.777.2 ± 4.782.7 ± 4.883.2 ± 2.9
Coauthor Phy86.5 ± 3.384.1 ± 2.490.2 ± 0.986.7 ± 2.979.3 ± 3.788.1 ± 2.784.1 ± 3.189.6 ± 2.687.5 ± 3.690.4 ± 1.488.9 ± 2.1
Amazon Comp.77.3 ± 4.173.4 ± 4.274.2 ± 5.371.9 ± 3.573.5 ± 6.178.3 ± 3.675.8 ± 5.474.5 ± 6.774.3 ± 3.274.9 ± 5.375.6 ± 3.8
Disease55.4 ± 8.768.2 ± 6.167.2 ± 7.159.7 9.558.5 ± 8.917.8 ± 4.563.4 ± 7.567.4 ± 8.567.1 ± 9.766.2 ± 8.466.4 ± 11.1
Wiki-CS59.8 ± 6.355.5 ± 3.664.7 ± 4.062.9 ± 3.661.3 ± 3.155.4 ± 6.657.5 ± 5.363.8 ± 2.456.5 ± 5.865.6 ± 3.150.4 ± 5.7
PPI Brain36.9 ± 2.238.4 ± 2.440.0 ± 1.441.0 ± 1.441.8 ± 1.234.6 ± 3.638.2 ± 2.040.3 ± 1.440.6 ± 1.041.6 ± 1.233.2 ± 2.7
PPI Blood34.6 ± 2.237.2 ± 3.739.5 ± 2.642.3± 2.341.7 ± 2.135.7 ± 1.737.0 ± 3.639.0 ± 2.939.8 ± 2.040.8 ± 2.139.4 ± 2.2
PPI Kidney39.1 ± 1.838.8 ± 2.639.9 ± 1.442.3 ± 1.841.7 ± 2.037.0 ± 1.440.0 ± 1.739.9 ± 1.440.4 ± 2.141.0 ± 1.841.0 ± 1.7
Github76.4 ± 2.577.4 ± 2.171.4 ± 2.569.7 ± 2.858.0 ± 5.676.8 ± 1.477.4 ± 2.272.8 ± 1.575.8 ± 2.772.9 ± 1.573.3 ± 4.0
Table 5. The best performing model according to Table 3 and Table 4.
Table 5. The best performing model according to Table 3 and Table 4.
DataWithout ExplorationWith Exploration
Macro-F1ModelClassifierMacro-F1ModelClassifier
CiteSeer61.5AGEGCN61.5AGEGCN
CORA76.1RandomSGC76.7Entropy CountSGC
PubMed77.7AGESGC78.0BALD CountSGC
Coauthor CS83.9AGEGCN83.9AGEGCN
Coauthor Phy90.2BALDSGC90.4BALD CountSGC
Amazon Comp.78.3ClusteringSGC78.3ClusteringSGC
Disease68.2EntropySGC68.2EntropySGC
Wiki-CS64.7BALDSGC65.6BALD CountSGC
PPI Brain41.8DegreeSGC41.8DegreeSGC
PPI Blood42.4PageRankGCN42.4PageRankGCN
PPI Kidney42.3PageRankSGC42.3PageRankSGC
Github77.4EntropySGC77.4EntropySGC
Table 6. Running time (seconds): average execution time to acquire 40 unlabeled instances. We run all experiments on a single NVIDIA GTX 1080-Ti GPU. PR: PageRank, CC: Clustering coefficient.
Table 6. Running time (seconds): average execution time to acquire 40 unlabeled instances. We run all experiments on a single NVIDIA GTX 1080-Ti GPU. PR: PageRank, CC: Clustering coefficient.
Clf.DatasetRandEntPRDegCCAGEBALD ϵ -GreedyCount
EntBALDEntBALD
GCNCiteSeer4.24.84.84.74.94.84.84.84.84.84.8
PubMed6.97.625.47.3321125.97.97.57.87.67.9
CORA4.24.54.64.414.526.84.54.54.54.54.5
Coauthor CS20.422.340.821.939.32154.223.722.323.622.423.6
Coauthor Phy46.150.5116.448.598.62436.950.850.450.750.550.8
Amazon Comp.17.519.131.818.833.81688.919.219.119.119.119.2
Disease4.14.34.24.14.28.74.34.34.34.34.3
Wiki-CS15.316.630.028.333.0410.816.716.616.616.716.7
PPI Brain8.38.911.510.210.9133.39.08.48.68.48.7
PPI Blood7.98.210.49.49.9130.28.48.28.48.38.5
PPI Kidney7.37.89.88.08.8129.47.77.77.77.87.9
Github57.169.2211.8102.9121.46810.072.169.671.170.573.2
SGCCiteSeer1.71.95.61.82.718.31.91.91.91.91.9
PubMed2.02.23.92.221.11229.22.22.22.22.22.2
CORA3.84.85.84.72.323.74.94.84.84.84.9
Coauthor CS16.819.833.219.337.92098.219.819.819.819.819.8
Coauthor Phy35.640.790.439.888.72232.340.840.440.540.740.7
Amazon Comp.12.214.717.216.917.11134.614.814.614.714.814.8
Disease1.41.41.51.41.46.01.41.41.41.41.4
Wiki-CS1.92.013.68.218.3400.52.12.02.02.12.1
PPI Brain4.44.55.14.84.9142.24.64.44.64.54.7
PPI Blood4.14.34.94.74.8139.44.44.34.34.44.5
PPI Kidney3.94.14.44.34.5135.64.14.14.14.14.2
Github22.324.516678.3106.24905.125.824.425.424.626.0
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Madhawa, K.; Murata, T. Active Learning for Node Classification: An Evaluation. Entropy 2020, 22, 1164. https://0-doi-org.brum.beds.ac.uk/10.3390/e22101164

AMA Style

Madhawa K, Murata T. Active Learning for Node Classification: An Evaluation. Entropy. 2020; 22(10):1164. https://0-doi-org.brum.beds.ac.uk/10.3390/e22101164

Chicago/Turabian Style

Madhawa, Kaushalya, and Tsuyoshi Murata. 2020. "Active Learning for Node Classification: An Evaluation" Entropy 22, no. 10: 1164. https://0-doi-org.brum.beds.ac.uk/10.3390/e22101164

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop