Next Article in Journal
Does Information on Automated Driving Functions and the Way of Presenting It before Activation Influence Users’ Behavior and Perception of the System?
Previous Article in Journal
Blockchain-Based Coordination: Assessing the Expressive Power of Smart Contracts
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Community Detection Based on a Preferential Decision Model

School of Computer Science and Engineering, Central South University, Changsha 410083, China
*
Author to whom correspondence should be addressed.
Submission received: 11 December 2019 / Revised: 7 January 2020 / Accepted: 14 January 2020 / Published: 18 January 2020

Abstract

:
The research on complex networks is a hot topic in many fields, among which community detection is a complex and meaningful process, which plays an important role in researching the characteristics of complex networks. Community structure is a common feature in the network. Given a graph, the process of uncovering its community structure is called community detection. Many community detection algorithms from different perspectives have been proposed. Achieving stable and accurate community division is still a non-trivial task due to the difficulty of setting specific parameters, high randomness and lack of ground-truth information. In this paper, we explore a new decision-making method through real-life communication and propose a preferential decision model based on dynamic relationships applied to dynamic systems. We apply this model to the label propagation algorithm and present a Community Detection based on Preferential Decision Model, called CDPD. This model intuitively aims to reveal the topological structure and the hierarchical structure between networks. By analyzing the structural characteristics of complex networks and mining the tightness between nodes, the priority of neighbor nodes is chosen to perform the required preferential decision, and finally the information in the system reaches a stable state. In the experiments, through the comparison of eight comparison algorithms, we verified the performance of CDPD in real-world networks and synthetic networks. The results show that CDPD not only has better performance than most recent algorithms on most datasets, but it is also more suitable for many community networks with ambiguous structure, especially sparse networks.

1. Introduction

Today, great progress has been made in the research of complex networks. Complex network theory is widely used in social, economic, biological, transportation, and other fields. The nodes and edges in the network can truly reflect the existing relationships and the invisible relationships that may exist in the future. For example, the Theory of Six Degrees tells us that two seemingly unrelated people need only six people to get connected [1]. With the structure of complex network becoming more and more complicated, some characteristics of the network have aroused a widespread concern of related scholars [2]. A very important part of a complex network is the community structure of complex network. The connections between nodes are different [3]. Some nodes are closely connected, others are sparse, and even one node is independent. These can all be defined as associations. The general concept of a community is that the connections between members of the community are closer than those outside the community [4]. Therefore, a common community detection method is mainly to calculate the closeness of these members through a certain algorithm to realize the division of the community networks.
Thus far, exploring community structure in complex networks has attracted great attention. Community structures that are also called groups, clusters, cohesive subgroups or modules in different contexts are an important research direction in complex networks. As an important character of complex networks, the division of community structure can not only deeply understand the function of each community in the network and the relationship with the whole network function, but also explore the relationship between different communities in the network, which is helpful in understanding the network topology [5]. A large number of community detection algorithms based on various theories have emerged in the past years. All algorithms have their own advantages and shortcomings. For example, some algorithms have short time complexity but may require parameters. Some algorithms are stable, but their accuracy is poor. Hence, with the increasing complexity of the network, accurate and efficient partitioning in complex networks is still a hot research direction, which attracts numerous people to confront the challenge.

1.1. Basic Idea

The research of complex networks is a hot issue in many fields. Among them, community detection is a complex and meaningful process that plays an important role in studying the characteristics of complex networks. We regard the topological network as a dynamic circle of life that is constantly communicated in real life [6], and nodes communicate with each other through learning behaviors to make decisions.
In the process of information exchange in real life, learning strategies and learning behaviors of people are easily affected by the surrounding environment. When we make decisions, we are often affected by the environment at the time. It is not difficult to understand that we tend to choose to exchange information with people who are more influential, closer to us and more similar to our knowledge structure. Thus, we are influenced by the people around us when we make a decision. It helps us make decisions by learning from the experiences of others. There are many types of learning in different circles. Like those around us who have great resources and influence, and those who have a close relationship with ourselves, we will be greatly influenced by them when making decisions. This is also applicable to community discovery in complex networks. In the continuous iteration of local communities, nodes constantly make decisions, and eventually stabilize.
A preferential decision-making mechanism based on dynamic relationship mainly includes decision-making behaviors, learning strategies, and relationships. Information is exchanged through links between these three factors. The clustering process of the information dynamic system is described as follows: The first step is the node initialization process, giving each node in the network a unique label. The second step is the information iterative process. During the interaction of the nodes in the dynamic system [7], they will comprehensively consider the decision based on the information of the surrounding nodes. The last step is to uncover the community structure. Each node tends to a stable state, and nodes with the same information category will cluster into the same community.
A simple network with eight nodes can be represented in Figure 1. The color of the nodes represents the information carried by the nodes. The same color represents clustering into a community and the thicker the edges between the nodes, the better the priority between the nodes. Figure 1a displays information initialization. Each node is assigned its unique label information. At this stage, the nodes have the same degree of preference for surrounding nodes. Figure 1b shows the information dynamic interaction. Nodes continuously exchange information during the iterative process of the dynamic system. The tightness of the connection between the nodes has been changing during the dynamic iteration, and, through the introduction of the preferential decision mechanism, the system finally stabilized. Figure 1c displays uncover communities. It can be known that the preference relationship between nodes in the same community will be much larger, and the number of contact interactions will be too high. This is consistent with the definition of community in complex networks. Finally, the system achieves stable community division.

1.2. Contributions

For traditional label propagation algorithms, the higher randomness and poor accuracy are common disadvantages. There are two main differences between our method and the existing ones in algorithm design. First, the existing algorithm learns the labels of surrounding neighbors in the process of label propagation randomly. Such random learning behavior is the main thing that leads to the large randomness of the algorithm results and the lower accuracy. In our method, a dynamic relationship-based preferential decision mechanism is introduced that can make the label learning behavior have a certain basis instead of random selection, which improves the stability and accuracy to some extent. Secondly, we know that the results of algorithms with parameters often depend on the setting of parameters. Our algorithm is non-parametric. Compared with other algorithms with parameters, our algorithm can automatically uncover communities without prior knowledge and parameter settings. Based on the improvement of such two-point algorithm design and a large number of experiments, we can give four advantages of our algorithm:
  • Higher accuracy. By comparing classic and latest algorithm experiments in the complex network, the CDPD algorithm has an accurate community partition on the real-world network and the synthetic network. In addition, the discovered communities are relatively close to the real communities.
  • Stability. The randomness has been greatly reduced by introducing the preferential decision mechanism and the past memory similarity.
  • Scalability. Although the iterative process based on dynamic relationships is complicated, CDPD only needs to consider the characteristics of neighboring nodes, which results in low time complexity of O ( t × k 2 × n ) . The values of t and k are usually small. Therefore, the CDPD algorithm can be applied to large-scale networks.
  • Free parameters. Instead of relying on prior knowledge and parameter setting methods, the CDPD method does not require parameter settings, and the communities are automatically discovered based on the local topology of the network.

2. Related Work

Community detection problems in complex networks have received extensive attention from academia since they were proposed. A large number of excellent algorithms have been published in recent years. These algorithms try to solve community detection problems from various angles. This section gives a brief summary of the current state of development regarding this issue from several unique perspectives. More detailed explanations and summaries can be referred from the excellent reviews [8,9,10].
Parameter Optimization Based Methods. Some algorithms discover the hidden community structure in a complex network by maximizing network characteristics or evaluating indicators. Usually, the features or parameters that can be selected are density [11], closeness [12] and modularity, and especially algorithms based on modularity optimization are emerging endlessly. The modularity (Q) [13] is widely used as evaluating the quality of community detection. The higher the modularity value obtained by an algorithm is, the better the segmentation performance is. However, the modularity-based optimization problem has been proved to be an NP-complete problem [14], but there are still many algorithms that can achieve similar results through certain processing methods. Among them, the more representative ones are the FN algorithm [15] and the EO algorithm [16]. The common problem of this kind of algorithm is that the performance of clustering is not good, especially in some large real-world networks.
Game Theoretic Model Based Methods. The research of game theory provides a novel perspective for the problems of community detection. The game theory models mainly used in this problem include non-cooperative games and cooperative games [17,18,19]. In the non-cooperative game, nodes are considered to be selfish individuals. Each node guarantees that its benefit is maximized through the implementation of the strategy. When all nodes can’t get more benefits by changing the strategy, we think that the system has reached a stable state, which is Nash equilibrium. The strategy here is usually to join, leave, and maintain existing community relationships. Common algorithm implementations include NASHDeCo algorithm [20], CaoGame [21], and DGT [22]. What is different from the non-cooperative game is that the nodes of the cooperative game model need to maximize the overall benefit by changing the strategy and dividing the community when the system reaches the Nash equilibrium. Some good algorithms are implemented along these lines, such as SNS-CD [2]. Nevertheless, challenges that such game theory-based algorithms cannot avoid are how to determine the utility function and how to define the equilibrium conditions, which are often difficult.
Dynamic System Based Methods. The dynamic-based community detection algorithm is another significant research route. This kind of method regards complex networks as a dynamic system and achieves the purpose of community detection through dynamic interaction. One of the angles is based on random walks. The algorithm represented WalkTrap(WT) [23]. The main idea of the WT is to construct a similarity evaluation method between nodes. It must be mentioned that WT’s high time complexity ( O ( m n 2 ) ) makes it unsuitable for large-scale networks. Another important research route is the label propagation algorithm (LPA) [24]. Although LPA has linear time complexity, one drawback that cannot be ignored is randomness. This comes from two aspects. On one hand, the node update order is random, on the other hand, when the neighbors have more than one maximum number of the labels, the node randomly selects one of the labels. Although some scholars have made improvements to LPA from various aspects [25,26,27], there is no algorithm that considers label number, label history, and other information comprehensively when selecting labels. There are also some superior dynamic system-based algorithms such as synchronization [28], distance dynamics [29], and so on.

3. Community Detection Based on Preferential Decision

3.1. Basic Formula Principle and Concepts

Before getting into the detail of our proposed algorithm, we first introduce some basic definitions that will be used in the following sections. All key notations are summarized in Table 1. Let G = ( V , E ) be a given network, where V is the set of nodes and E is the set of edges.
Definition 1
(Jaccard similarity coefficient).It is used to compare similarities and differences between finite sample sets. The larger the Jaccard coefficient is, the higher the similarity of samples is. Given an undirected network, G = ( V , E ) , the Jaccard coefficient of node u and node v is defined as:
J ( u , v ) = Γ ( u ) Γ ( v ) Γ ( u ) Γ ( v ) ,
where Γ ( u ) means a set of neighborhoods of node u that includes its adjacent nodes N ( u ) . Γ ( u ) is defined as:
Γ ( u ) = { v V { u , v } E } { u } .
Definition 2
(Contact strength).There are different levels of connection strength between two nodes in any networks, which can be seen as good or bad in interpersonal relationships. This concept plays an important role in information exchange. We use the degree of the node as its initial information, and introduce the connection strength to indicate the connection tightness between nodes. Given an undirected network G = ( V , E ) , the contact strength of node v on node u is defined as:
C S ( u , v ) = N ( u ) N ( v ) T u ,
where T u means the number of triangles owned by node u. The number of triangles shared by node u and node v is N ( x ) N ( y ) .

3.2. Preferential Decision Model Based on Dynamic Iteration

Based on the formulas and definitions mentioned above, a new model has been proposed, which is a preferential decision model based on dynamic iteration that involves many aspects: the information propagation model, the memory similarity, preferential decision degree, the loss of information, and so on.
In real life, we make progress through continuous learning. No matter what age, we will be influenced by the surrounding environment to make decisions, which imperceptibly drives us to learn from the people or things around us. We compare the community network to a social circle. When we make decisions, we will be influenced by the people around us. Learning from the experience of others to help us make decisions is the core idea presented in this article.
There are many types of learning in different circles. Like those around us who have resources and influence and those who have a close relationship with us, we will be greatly influenced by them when making decisions. This is also true for community detection in complex networks. In the constant iterations of local communities, nodes are constantly making decisions and finally approaching stability. In our daily life, we can understand that parents have a stronger connection with us. For example, their suggestions will have a great impact on the decision of which we will settle in the future. This is one of the reasons why we want to introduce connection strength. Compared with our own past behavioral cycle, it shows that our hobbies and life rhythms are very similar. Because different people have different decision-making styles, people with identical similarities will be more likely to get together. We are closer to making this decision, which is close to the life we are exposed to. Thus, the past similarity between the two nodes is introduced to reflect the true tightness of the two nodes. In each iteration, for neighbor nodes, each node will make its own decision according to the surrounding comprehensive factors. Thus, in the end, we lead to the probability that one node chooses another node, and there is a certain choice for any node around. Some of the probabilities may approach zero, in which case the node will basically not make its own choice. In the information propagation based on dynamic iteration, the neighboring nodes are de-spreaded with a certain probability. For example, in our real life, we will pass on information, ideas, attitudes or affections to other individuals or groups in order to make resultant changes. In information dissemination, there is a possibility of information loss, which is directed by the topological features and transmitted information. Because there are always some useless or uninteresting information in real life to interfere with us, we consider the information propagation model, the past memory similarity, preferential decision degree, and the loss of information together.
The past memory similarity. Given a list of memory tags to any social network node, the higher the similarity of the past is, the closer the past behavior of the node is. This shows that the probability of interaction between two nodes is greater. In the following formula, M u [ i ] means the i-th memory label list array of the node u. In particular, is a constant, usually 0.001 , and the purpose of introducing is to prevent M L ( u , v ) from falling to zero. We have done numerous experiments to change the value of but found that the performance is the best in this case. We can define M L ( u , v ) , which is the past memory similarity between node u and node v, as follows:
M L ( u , v ) = + M u [ i ] = M v [ i ] 1 2 i .
We can identify memory labels as something that a person was interested in, and the length of the memory area represents the range of personal acceptance. The more similar the memory tags are, the greater the value of M L ( u , v ) is, which means the more similar common experience between the two nodes. In real life, people with the same experience share common interests and hobbies. Because different people have different decision-making styles, people with similar experiences are more likely to get together.
Preferential decision degree. In each iteration of information propagation, each decision of the node is the result of preferential selection after comprehensive consideration. Thus, we introduce the concept of preferential decision-making degree to better represent the stability of the algorithm. When choosing community division, it is analogous to choosing to communicate with people around us in our daily life and make their own decisions. In real life, we prefer to go to communicate with people who are similar to past cycle behaviors. People who have similar interests and interests will hope to learn what they want from people who have close relationships with them. Generally, let N ( u ) be the set of neighbors of node u, C S ( u , v ) and J ( u , v ) indicate the connection strength and the Jaccard similarity between node u and node v, respectively. M L ( u , v ) denotes the past memory similarity coefficient between node u and node v. Let P D ( u , v ) be the preferential decision degree, which is defined as follows:
P D ( u , v ) = | N ( u ) | × | N ( u ) | × J ( u , v ) × e C S ( u , v ) + M L ( u , v ) .
The propagation of information between nodes is analogous to the exchange of information among people. Communication between people can be influenced by many factors. Knowledge and resources of a person can be regarded as the Preferential decision degree of a node. The association between the two persons can be considered as the connection strength and the Jaccard coefficient. In the process of iteration, the node will always choose the best result to make a reasonable decision.
The probability of propagation. From the above formulas, we can conclude that the node has a preferential decision degree for each neighboring node, and the higher it is, the greater the decision-making probability will be. Let N ( u ) be the set of neighboring nodes of node u, and node v k belongs to N ( u ) . Let P ( u , v k ) be the probability of selecting node v k as deciding objects for node u, which is defined as follows:
P ( u , v k ) = P D ( u , v k ) v k N ( u ) P D ( u , v k ) .
For the node u, all its neighbors are likely to be selected. Low Preferential decision degree indicates that the probability of being selected is small, but this does not mean that it is impossible. It can also be analogous to our real life, and sometimes we need to learn what we need from strangers.
Information loss. In our daily lives, there is always some information that will be discarded in the process of information exchange. As the number of iterations increases, the probability of information loss increases. To prevent the information exchange failure caused by the preferential decision degree between nodes, it is extremely important to introduce a threshold. The copra algorithm divides the community by introducing a label dependent coefficient and an adjustable parameter v. In this paper, a threshold is used to control nodes to make better decisions. Its threshold has a significant impact on the experimental results, which not only reduces the reference of parameters but also divides the community accurately in a better way according to its node characteristics.

3.3. Community Detection Based on the Preferential Decision Algorithm

In this section, based on the above-mentioned content, we introduce the preferential decision algorithm. The proposed algorithm is similar to the traditional tag propagation algorithm. Whether it is a simple network or a complex network, it has its specific topological structure. Initially, the node does not have any interactive information behavior. We initialize each node’s information and give each node a unique label to ensure its atomicity. In addition, the connection strength and the Jaccard coefficient between nodes are calculated, and the close relationship between nodes is analyzed. After the initialization of the previously mentioned information, it enters the dynamic iteration stage. Information is transmitted in the network, and the nodes interact with each other continuously. According to its decision-making probability and goal, it standardizes its own tag list. In each step, each node constantly updates its own memory tag list. Finally, with continuous iteration, due to the influence of topology-driven, all nodes in the network will reach a stable state, and the memory labels of nodes will also reach a convergent state. When the information in the network converges, the labels of the same community will naturally be the same, and the labels of different communities will be different. Therefore, after considering the comprehensive factors, we can naturally uncover the community, so as to achieve the performance of community division.
From the above, we can propose that the algorithm flow is similar to the traditional label propagation algorithm. It can be roughly divided into three categories: (a) information initialization, (b) information dynamic interaction, and (c) community detection. The CDPD algorithm is given in Algorithm 1 as follows:
Algorithm 1 CDPD
Input:   G = ( V , E )
1:
/ / Information initialization.
2:
for each node u in V do
3:
  Initialize label(u) = u;
4:
  Get the N ( u ) ;
5:
  for each node v in N ( u ) do
6:
    Set the memory label;
7:
    Compute the Jaccard similarity coefficient using Equation (1);
8:
    Compute the connection strength using Equation (3);
9:
  end for
10:
end for
11:
/ / Information dynamic interaction.
12:
S y m b o l = T R U E
13:
while S y m b o l do
14:
  Integrate order(V);
15:
  for each node u in V do
16:
    for each node v in N ( u ) do
17:
      Compute the past memory similarity (cf. Equation (4)) between u and v;
18:
      Compute the preferential decision degree (cf. Equation (5)) of u for v;
19:
      Learn and normalize labels;
20:
      Decide the target of u on the basis of probability of propagation;
21:
      if satisfy the stop iteration condition of the algorithm then
22:
         S y m b o l = F A L S E ;
23:
      end if
24:
    end for
25:
  end for
26:
end while
27:
/ / Community detection.
28:
for each node u in V do
29:
  for each label k in label list of node u do
30:
    if the label list of node u is equivalent then
31:
      Set the label of node u;
32:
    else
33:
       u > C k ;
34:
    end if
35:
  end for
36:
end for
37:
/ / Return the resulting components C(communities)
38:
Set S e t c = C 1 , C 2 , C 3 C n and n is number of community.
Output:   C.

3.4. Complexity Analysis

In the CDPD algorithm, we mainly analyze the time complexity from three parts. It is summarized as follows:
Information initialization. The initial state is labeled for each node, and a loop is only needed. Thus, its time complexity is O ( n ) . At the same time, our algorithm needs to calculate some additional necessary information, such as memory similarity, connection strength, Jaccard coefficient, and so on. The time complexity of this process is O ( ( k + 1 ) × n ) , where k is the average degree of the whole network.
Information dynamic interaction. In each iteration process, each node only chooses one node at one time as the target, and its time complexity is O ( L × m a x ) , where L is the length of its memory tag, and m a x is the maximum length of the tag in the network. In addition, the time complexity of deciding the target is O ( m a x ) . The overall number of iterations is denoted as t, which is typically between 30 and 100. Thus, the time complexity of this process is O ( t × n × ( L × m a x + m a x ) ) .
Community detection. The time complexity of this process is O ( n u m × n ) , where n u m stands for the number of communities in the whole network, usually n u m < < n .
In summary, the time complexity of the CDPD algorithm is the sum of the three parts: O ( ( k + 1 ) × n + t × n × ( L × m a x + m a x ) + n u m × n ) . We note that the values of t and m a x are usually small. Therefore, for small and sparse networks, our algorithm can be considered to be nearly linear.

4. Experiment

In this section, we first briefly introduce several common representative algorithms as the proposed comparison algorithm. Afterward, the datasets used in this experiment are introduced, which contain the real-world datasets and the synthetic datasets.Then, we select the appropriate evaluation metrics based on the characteristics of the datasets. Finally, our algorithm and comparison algorithm are applied to each dataset to observe and analyze the results of the experiment. To eliminate the randomness of the algorithm, each experiment was repeated 100 times to take the average as the experimental result. The experimental environment is i5-4590 3.2 GHz CPU, 8 GB RAM, Win8 OS PC.

4.1. Comparing Algorithms

To evaluate the performance of CDPD, we compare it with several representatives of community detection algorithms which belong to different types and different periods to verify the universality and the accuracy of CDPD. The basic principle and time complexity of all algorithms are shown in Table 2.
Fastgreedy. FG greedily maximizes the modularity of the graph by dividing nodes into communities [30]. The algorithm stops running when the modularity value of the graph is no longer increasing.
Spinglass. SG maps the community structure into the spin configuration and takes the spin state as the community indices to realize the community detection by minimizing the energy of the spin glass [31].
Infomap. In this algorithm, the probability flow in the graph theory is used to represent the information flow in the real-world network and then the community structure in the network is discovered through probability flow processing [32].
LPA. LPA is one of the fastest community detection algorithm based on label propagation implementation, each node will choose one of its surrounding tags to update itself [24]. Although the time complexity of this method of selecting labels through many iterations is low, the randomness is high. After the termination condition is reached, nodes holding the same label are divided into the same community.
Louvain. Louvain divides every node into separate communities, and each node is moved between communities to achieve the purpose of community division. The algorithm will be stopped when moving any one node can improve overall modularity degree scores [33].
Leading eigenvector. LE is a community detection algorithm using the leading eigenvector method, which divides the network structure by continuously maximizing the original network modularity degree [34].
FluidC. FluidC is probably the first community detection algorithm based on the idea of fluid dyeing [35]. This algorithm needs to determine the number of communities in advance.
EDCD. EDCD is a new algorithm for iteratively deleting constrained edges [36]. The original network is divided into several vertices of strongly connected communities and the community structure is optimized by an improved edge deletion process optimization module. Finally, the isolated vertex initial communities are reconnected to optimize the community structure.

4.2. Data Description

The datasets used in the experiments include synthetic network datasets and real-world network datasets. Synthetic network datasets generated by an LFR model represent the virtual network in the network. Real-world network datasets come from real life.

4.2.1. Synthetic Networks

We use the well-known LFR model [37], which can fit the real-world network characteristics by controlling several parameters of the LFR model, such as number of nodes ( | V | ), average degree (k), maximum degree ( m a x k ), minimum community size ( m i n c ), and maximum community size ( m a x c ). Among these, what is particularly noteworthy here is the mixing parameter ( m u ), which indicates whether the community structure in the network is obvious. It can be specifically defined as:
m u = k o u t k ,
where k represents the number of connections for all nodes and k o u t represents the number of connections between nodes in different communities. The higher the value of m u is, the more blurred the community structure of the network is. We want to observe the performance of several algorithms on different mu. We use the LFR model to generate a series of networks where the number of nodes | V | = 1000 , the average degree k = 15 , the maximum degree m a x k = 38 , the minimum community size m i n c = 10 , and the maximum community size m a x c = 50 . We set m u increase from 0.1 to 0.8 . At the same time, the average degree k may also be an important factor influencing the results of community detection. We set m u = 0.1 and the average degree k changes from 3 to 20 to generate several networks. Based on this experimental idea, two groups of synthetic networks have been generated as Table 3:

4.2.2. Real-World Networks

We also select some representative real networks to evaluate the above algorithms. All data of networks can be downloaded from SNAP (http://snap.stanford.edu/data/) and UCI (https://networkdata.ics.uci.edu/index.php). Some relevant information about real-world networks have be briefly described in the following Table 4.
Zacharys Karate network. The Karate network is a social network constructed by observing an American University Karate club. The network consists of 34 nodes and 78 edges, in which the node represents the member of the club and the edge represents the friendship between the members.
Dolphin social network. A group of researchers from the Harbor Branch Oceanographic Institution ( H B O I ) carefully studied the relationship between Dolphins in the Indian lagoon. The network consists of 62 nodes and 159 edges. To better understand how dolphin populations treat and use their environment, scientists can determine how social networks affect information transmission and potentially affect reproductive behavior and disease transmission.
Politics books network. This is a book network of American politics that are sold by the online bookseller on Amazon.com, consisting of 105 nodes and 441 edges. The nodes represent US politics-related books sold on the online bookstore and the edges represent some readers who purchased both books at the same time.
Football network. This network is a real network based on the American College Football League. The network consists of 115 nodes and 616 edges. The node represents the team and the edge represents the game between the two teams. These teams are divided into 12 leagues, which can be analogous to our real online community. Similarly, this network is a classic datasets in community detection.
Citeseer network. This network is derived from a scientific paper citation dataset. The node represents the paper and the edge represents the mutual reference relationship between the two papers. The network is mainly divided into six categories, which contains 2110 nodes and 4732 edges.

4.3. Evaluation Metrics

Evaluating the result of community detection has raised a fierce discussion in the data mining area, and some of the classic evaluation indicators have been put forward in recent years, which provides a reference for our research. However, some scholars have confirmed that two different evaluation indicators may have the opposite conclusions on the same experimental results [8]. This experiment selects multiple commonly used evaluation indexes to evaluate the above nine algorithms community detection areas, such as Normalized Mutual Information ( N M I ) [38], Adjusted Rand Index ( A R I ) [39] and Cluster Purity ( p u r i t y ) [40].
Normalized Mutual Information. N M I is a formula that represents the similarity of two sets, which stems from the information theory. The specific definition is as follows:
N M I ( X , Y ) = 2 i = 1 C X j = 1 C Y N i j l o g ( N i j N N i N j ) i = 1 C X N i l o g ( N i N ) + j = 1 C Y N j l o g ( N j N ) ,
where X represents the real partition and Y is the partition found by the algorithm; C X is the number of real communities and C Y is the number of found communities; N is the number of nodes in the network; N i j is the number of nodes shared by the real community i in partition X and the found community j in partition Y; N i denotes the sum over row i of matrix N i j ; and N j denotes the sum over column j [38].
Adjusted Rand Index. A R I is another commonly used clustering evaluation index based on similarity measure, which is defined as:
A R I = i j ( 2 n i j ) [ i ( 2 a i ) j ( 2 b j ) ] / ( 2 n ) 1 2 [ i ( 2 a i ) + j ( 2 b j ) ] [ i ( 2 a i ) j ( 2 b j ) ] / ( 2 n ) .
if you want to know more details, please refer to [41,42].
Cluster Purity. P u r i t y is a evaluation indicator that can measure the accuracy of classification, which is defined as:
P u r i t y ( Ω , C ) = 1 N i = 1 k m a x j | ω i c j | ,
where N is the number of community, Ω = { ω 1 , ω 2 , , ω k } represents the set of predicted community, and C = { c 1 , c 2 , , c j } is the set of ground truth.

4.4. Performance Evaluation

In all experiments, the parameters of the comparison algorithm are set as the default parameters suggested by the author, putting into mind if the parameters are even needed. For a dataset with known community partition results, we enter the number of communities as a parameter for FluidC. For a dataset with an unknown number of communities, we use the number of communities found by the CDPD as a parameter input. Most of the comparison algorithms are based on the igraph package (http://igraph.org/) except for FluidC which can be downloaded on Github (github.com/FerranPares/Fluid-Communities).
Evaluation of Synthetic Networks. We apply the above algorithms to the LFR synthetic network datasets, where the m u of the datasets varies from 0.1 to 0.8 . We evaluate the performance of the algorithm by using various aspects of the evaluation indicators ( N M I , A R I , P u r i t y ) and the recorded experiment results have been shown in Figure 2.
We can see that all three evaluation indicators decrease with increasing m u because the growth of m u means that the community structure is increasingly blurred. In Figure 2, when m u is less than or equal to 0.4 ( m u 0.4 ) , CDPD always maintains optimal performance when compared with other algorithms. When m u is equal to 0.5 ( m u = 0.5 ) , the performance of CDPD is second only to Infomap. We can see that Infomap always performs well with m u ranging from 0.1 to 0.6 . However, when m u equals 0.7 ( u = 0.7 ) , the performance of Infomap decays rapidly; although the performance of CDPD has also weakened, it is still relatively superior.
Then, to verify the performance of above algorithms to network on the different average degree k, we run these algorithms on the LFR synthetic networks when k is varied from 3 to 20. The results obtained are shown in Figure 3. From Figure 3, we can see that the performance of LE is always poor compared to other algorithms, but other algorithms perform better with increasing k. When k is greater than or equal to 8 ( k 8 ) , CDPD has the best results as other algorithms. When k is equal to 5 ( k = 5 ) , the performance of CDPD on N M I and P u r i t y is still optimal, and the performance of A R I is second only to EDCD and LPA. When k is equal to 3 ( k = 3 ) , CDPD is superior to other algorithms on all indicators and has obvious advantages.
Evaluation of Real-World Networks. To more widely verify the performance of the above algorithms, we apply these algorithms to real-world networks with ground truth. The results obtained are shown in Table 5. The best performance on each of the datasets is bolded for observation. In Karate, CDPD has a good performance on all metrics ( N M I = 0.897 , A R I = 0.902 , P u r i t y = 1 ) , which is the best among all algorithms. Infomap has performed well in synthetic network datasets but performs poorly here. The detection results are visualized as shown in Figure 4a. We can see that the neighbor nodes of node 10, node 3, and 34, belonging to two communities respectively. According to the label update rules of LPA, when the neighbor node has multiple maximum numbers of labels, one of the nodes will be randomly selected by node 10. Node 10 in the LPA has a high probability of being divided into the wrong community. The CDPD prefers to learn the labels of neighbors with a higher degree. This ensures that node 10 and node 34 has a greater probability is divided into the same community because node 34 has a greater degree. In Dolphins, CDPD has the best N M I and A R I ( N M I = 0.72 , A R I = 0.69 ) . However, the P u r i t y of CDPD ( P u r i t y = 0.984 ) is inferior to P u r i t y of SG ( P u r i t y = 0.999 ) . However, the N M I ( N M I = 0.638 ) and A R I ( A R I = 0.493 ) of SG are smaller than CDPD. Therefore, the overall performance of CDPD in Dolphins is good. In addition, the community detection results of Dolphins are shown in Figure 4b. In Polbooks, the performance of CDPD is still outstanding. The N M I of CDPD ( N M I = 0.586 ) , A R I ( A R I = 0.69 ) and P u r i t y ( P u r i t y = 0.924 ) are all the best. In addition, the community division results of Polbooks are shown in Figure 4c. In Football, CDPD, EDCD, and Infomap have the best N M I ( N M I = 0.92 ) , and CDPD and Infomap also have the best P u r i t y ( P u r i t y = 0.92 ) . However, the A R I of CDPD ( A R I = 0.85 ) is lower than A R I of Infomap ( A R I = 0.897 ) . In addition, the community division results of Football are shown in Figure 4d. Finally, in Citeseer, the A R I of CDPD and FG ( A R I = 0.198 ) is the best. However, the N M I of the CDPD ( N M I = 0.398 ) is only lower than the N M I of the EDCD ( N M I = 0.435 ) , which performs better than other algorithms. The P u r i t y of the CDPD ( P u r i t y = 0.492 ) is slightly lower than the P u r i t y of Infomap ( P u r i t y = 0.777 ) .
In summary, compared to other comparison algorithms, CDPD has a good performance in both the synthetic networks and the real-world networks. Especially in networks with real-world networks and super sparse networks, CDPD can still achieve good partitioning results when other algorithms struggle (cf. Figure 4 and Table 5).

5. Conclusions

In this paper, we introduced a community detection algorithm, called CDPD, which is based on the preferential decision model. CDPD does not need any parameters to uncover community structure, and it has respectable accuracy in most cases. It especially performs well in small networks and sparse networks. At the same time, another advantage of CDPD is that it is stable. To put it in a nutshell, CDPD has the advantage of higher accuracy, free parameter, and stability. In the future, we intend to apply our algorithm to overlapping community detection and detection based on dynamic networks and large-scale network applications. Further improving the efficiency of the algorithm is still a challenging and meaningful topic.

Author Contributions

Conceptualization, J.S. and B.W.; Software, B.L., J.H. and Q.D.; Supervision, J.S. and B.W.; Validation, B.L., J.H., K.W. and D.A.; Writing, J.S., B.L. and X.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science and Technology Major Project of China (2017ZX06002005).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Oger, R.; Yerson, B.M. Game theory: Analysis of conflict. Long Range Plan. 1992, 25, 130. [Google Scholar]
  2. Basu, S.; Maulik, U. Community detection based on strong Nash stable graph partition. Soc. Netw. Anal. Min. 2015, 5, 1–15. [Google Scholar] [CrossRef]
  3. Fortunato, S.; Barthelemy, M. Resolution limit in community detection. Mob. Netw. Appl. 2007, 104, 36–41. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  4. IEEE Transactions on Emerging Topics in Computational Intelligence; Publishing House: New York, NY, USA, 2018; Volume 2, pp. 214–223.
  5. Li, H.J.; Daniels, J.J. Social significance of community structure: Statistical view. Phys. Rev. E 2015, 91, 012801. [Google Scholar] [CrossRef] [PubMed]
  6. Bu, Z.; Li, H.J.; Zhang, C.; Cao, J.; Li, A.; Shi, Y. Graph K-means based on Leader Identification, Dynamic Game and Opinion Dynamics. IEEE Trans. Knowl. Data Eng. 2019. [Google Scholar] [CrossRef]
  7. Li, H.J.; Bu, Z.; Wang, Z.; Cao, J. Dynamical clustering in electronic commerce systems via optimization and leadership expansion. IEEE Trans. Ind. Inform. 2019. [Google Scholar] [CrossRef]
  8. Fortunato, S.; Hric, D. Community detection in networks: A user guide. Phys. Rep. 2016, 659, 1–44. [Google Scholar] [CrossRef] [Green Version]
  9. Khan, B.S.; Niazi, M.A. Network Community Detection: A Review and Visual Survey. arXiv 2017, arXiv:1708.00977. [Google Scholar]
  10. Jonnalagadda, A.; Kuppusamy, L. A survey on game theoretic models for community detection in social networks. Soc. Netw. Anal. Min. 2016, 6, 83. [Google Scholar] [CrossRef]
  11. Qi, X.; Tang, W.; Wu, Y.; Guo, G.; Fuller, E.; Zhang, C.Q. Optimal local community detection in social networks based on density drop of subgraphs. Pattern Recognit. Lett. 2014, 36, 46–53. [Google Scholar] [CrossRef]
  12. Badie, R.; Aleahmad, A.; Asadpour, M.; Rahgozar, M. An efficient agent-based algorithm for overlapping community detection using nodes closeness. Phys. Stat. Mech. Appl. 2013, 392, 129–140. [Google Scholar] [CrossRef]
  13. Newman, M.E.; Girvan, M. Finding and evaluating community structure in networks. Phys. Rev. Stat. Nonlinear Soft Matter Phys. 2004, 69, 026113. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  14. Fortunato, S. Community detection in graphs. Phys. Rep. 2010, 486, 75–174. [Google Scholar] [CrossRef] [Green Version]
  15. Guimera, R.; SalesPardo, M.; Amaral, L.A. Modularity from fluctuations in random graphs and complex networks. Phys. Rev. Stat. Nonlinear Soft Matter Phys. 2004, 70, 025101. [Google Scholar] [CrossRef] [Green Version]
  16. Jordi, D.; Alex, A. Community Detection in Complex Networks Using Extremal Optimization. Phys. Rev. Stat. Nonlinear Soft Matter Phys. 2005, 72, 027104. [Google Scholar]
  17. Osborne, M.J.; Rubinstein, A. A Course in Game Theory; MIT Press: Cambridge, MA, USA, 1994. [Google Scholar]
  18. Li, H.-J.; Wang, Q.; Liu, S.; Hu, J. Exploring the trust management mechanism in self-organizing complex network based on game theory. Phys. Stat. Mech. Appl. 2019, 123514. [Google Scholar] [CrossRef]
  19. Cao, J.; Bu, Z.; Wang, Y.; Yang, H.; Jiang, J.; Li, H.-J. Detecting Prosumer-Community Groups in Smart Grids From the Multiagent Perspective. IEEE Trans. Syst. Man Cybern. Syst. 2019, 1–13. [Google Scholar] [CrossRef]
  20. Narayanam, R.; Narahari, Y. A game theory inspired, decentralized, local information based algorithm for community detection in social graphs. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), Tsukuba, Japan, 11–15 November 2012; pp. 1072–1075. [Google Scholar]
  21. Cao, L.; Li, X.; Han, L. Detecting community structure of networks using evolutionary coordination games. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS2013), Beijing, China, 19–23 May 2013; pp. 2533–2536. [Google Scholar]
  22. Alvari, H.; Hajibagheri, A.; Sukthankar, G. Community detection in dynamic social networks: A game-theoretic approach. In Proceedings of the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, Beijing China, 17–20 August 2014; pp. 101–107. [Google Scholar]
  23. Pons, P.; Latapy, M. Computing Communities in Large Networks Using Random Walks. J. Graph Algorithms Appl. 2006, 10, 191–218. [Google Scholar] [CrossRef] [Green Version]
  24. Raghavan, U.N.; Albert, R.; Kumara, S. Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. Stat. Nonlinear Soft Matter Phys. 2007, 76, 036106. [Google Scholar] [CrossRef] [Green Version]
  25. Hosseini, R.; Azmi, R. Memory-based label propagation algorithm for community detection in social networks. In Proceedings of the International Symposium on Artificial Intelligence and Signal Processing, Mashhad, Iran, 3–5 March 2015; pp. 256–260. [Google Scholar]
  26. Cordasco, G.; Gargano, L. Community Detection via Semi-Synchronous Label Propagation Algorithms. In Proceedings of the 2010 IEEE International Workshop on: Business Applications of Social Network Analysis (BASNA), Bangalore, India, 5 December 2010; pp. 1–8. [Google Scholar]
  27. Zhang, X.; Kun, R.; Jing, S.; Chen, J.; Zhang, Q. Label propagation algorithm for community detection based on node importance and label influence. Mob. Phys. Lett. 2017, 381, 2691–2698. [Google Scholar] [CrossRef]
  28. Khadivi, A.; Rad, A.A.; Hasler, M. Community detection enhancement in networks using proper weighting and partial synchronization. In Proceedings of the IEEE International Symposium on Circuits and System, Florence, Italy, 27–30 May 2018; pp. 3777–3780. [Google Scholar]
  29. Shao, J.; Han, Z.; Yang, Q.; Zhou, T. Community Detection based on Distance Dynamics. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 8–9 August 2015; pp. 1075–1084. [Google Scholar]
  30. Clauset, A.; Newman, M.E.; Moore, C. Finding community structure in very large networks. Phys. Rev. Stat. Nonlinear Soft Matter Phys. 2004, 70, 066111. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  31. Reichardt, J.; Bornholdt, S. Statistical mechanics of community detection. Phys. Rev. Stat. Nonlinear Soft Matter Phys. 2006, 74, 016110. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  32. Rosvall, M.; Bergstrom, C.T. Maps of random walks on complex networks reveal community structure. Proc. Natl. Acad. Sci. USA 2008, 105, 1118–1123. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  33. Blondel, V.D.; Guillaume, J.L.; Lambiotte, R.; Lefebvre, E. Fast unfolding of community hierarchies in large networks. J. Stat. Mech. Theory Exp. 2008, 2008, 10008. [Google Scholar] [CrossRef] [Green Version]
  34. Newman, M.E.J. Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. Stat. Nonlinear Soft Matter Phys. 2006, 74, 036104. [Google Scholar] [CrossRef] [Green Version]
  35. Parés, F.; Gasulla, D.G.; Vilalta, A.; Moreno, J.; Ayguadé, E.; Labarta, J.; Cortés, U.; Suzumura, T. Fluid Communities: A Competitive, Scalable and Diverse Community Detection Algorithm. In Proceedings of the Complex Networks and Their Applications VI, Lyon, France, 29 November–1 December 2017; pp. 229–240. [Google Scholar]
  36. Chen, X.; Li, J. Community detection in complex networks using edge-deleting with restrictions. Phys. Stat. Mech. Appl. 2019, 519, 181–194. [Google Scholar] [CrossRef]
  37. Lancichinetti, A.; Fortunato, S.; Radicchi, F. Benchmark graphs for testing community detection algorithms. Phys. Rev. Stat. Nonlinear Soft Matter Phys. 2008, 78, 046110. [Google Scholar] [CrossRef] [Green Version]
  38. Strehl, A.; Ghosh, J. Cluster Ensembles—A Knowledge Reuse Framework for Combining Multiple Partitions. J. Mach. Learn. Res. 2002, 3, 583–617. [Google Scholar]
  39. Rand, W.M. Objective Criteria for the Evaluation of Clustering Methods. J. Am. Stat. Assoc. 1971, 66, 846–850. [Google Scholar] [CrossRef]
  40. Ying, Z.; Karypis, G. Criterion Functions for Document Clustering: Experiments and Analysis. 2002. Available online: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.6872 (accessed on 16 January 2020).
  41. Vinh, N.X.; Epps, J.; Bailey, J. Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance. J. Mach. Learn. Res. 2010, 11, 2837–2854. [Google Scholar]
  42. Lawrence, H.; Phipp, A. Comparing partitions. J. Classif. 1985, 2, 193–218. [Google Scholar]
Figure 1. A sample on a simple network. (a) information initialization, (b) information dynamic interaction, (c) uncover communities.
Figure 1. A sample on a simple network. (a) information initialization, (b) information dynamic interaction, (c) uncover communities.
Information 11 00053 g001
Figure 2. Performances of the above algorithms on the synthetic networks generated by LFR by varying the mixing parameter from mu 0.1 to 0.8.
Figure 2. Performances of the above algorithms on the synthetic networks generated by LFR by varying the mixing parameter from mu 0.1 to 0.8.
Information 11 00053 g002
Figure 3. Performances of the above algorithms on the synthetic networks generated by LFR by varying the average degree k from 3 to 20.
Figure 3. Performances of the above algorithms on the synthetic networks generated by LFR by varying the average degree k from 3 to 20.
Information 11 00053 g003
Figure 4. The communities obtained in different networks by CDPD. (a) community structure obtained by CDPD on Karate, (b) community structure obtained by CDPD on Dolphins, (c) community structure obtained by CDPD on Polbooks, (d) community structure obtained by CDPD on Football.
Figure 4. The communities obtained in different networks by CDPD. (a) community structure obtained by CDPD on Karate, (b) community structure obtained by CDPD on Dolphins, (c) community structure obtained by CDPD on Polbooks, (d) community structure obtained by CDPD on Football.
Information 11 00053 g004
Table 1. A summary of the notations.
Table 1. A summary of the notations.
SymbolDefinition
| V | The number of nodes
| E | The number of edges
M u [ i ] The i-th memory tag of the node u
N ( u ) The neighborhood of node u
J ( u , v ) The Jaccard similarity coefficient of node u and node v
C S ( u , v ) The contact strength of node u and node v
M S ( u , v ) The past memory similarity of node u and node v
P D ( u , v ) The preferential decision degree of node u for node v
P ( u , v k ) Node u determines the probability of selecting node v
Table 2. The basic principle and time complexity of algorithms used for comparison.
Table 2. The basic principle and time complexity of algorithms used for comparison.
AlgorithmBasic PrincipleTime Complexity
FGModularity optimization O ( e ( e + n ) )
SGSpin configuration O ( ( k + e ) n )
InfomapRandom walk O ( e )
LPALabel propagation O ( n )
LouvainSeparate division O ( k n )
LEEigenvectors of matrices O ( n 2 )
FluidCFluids interacting in an environment O ( e )
EDCDEdge deletion O ( c ( k + 1 ) e + k n )
CDPDPreferential decision O ( t × k 2 × n )
Table 3. Characteristics of synthetic networks.
Table 3. Characteristics of synthetic networks.
Dataset | V | kmaxkmumincmaxc
1st100015380.1–0.81050
2nd10003–20 6 k 0.11050
Table 4. Characteristics of real-world networks.
Table 4. Characteristics of real-world networks.
Datasets V E k C Clustering CoefficientAverage Path Length
Karate34784.58820.5882.408
Dolphins621595.12920.3033.357
Football11561310.611120.4032.508
Polbooks1054418.430.483.097
Citeseer211473963.5260.249.32
Table 5. Performances of the above algorithms in real-world networks with ground truth.
Table 5. Performances of the above algorithms in real-world networks with ground truth.
ZacharyDolphinsPolbooksFootballCiteseer
NMIARIPur.NMIARIPur.NMIARIPur.NMIARIPur.NMIARIPur.
FG0.7070.680.9710.6520.4930.9840.5310.6380.8380.7570.4740.5830.3620.1980.705
SG0.7030.5260.9940.6380.3690.9990.5110.510.8510.9060.8460.8970.280.160.62
Infomap0.7110.7020.9710.6110.3520.9910.5030.5360.8480.920.8970.9220.3970.0330.777
LPA0.6640.6420.9130.6790.5560.9780.5560.6520.8450.8780.7560.8460.3880.0710.776
Louvain0.6180.4620.9710.5680.3430.9680.5130.5490.8480.8910.8070.870.3520.1590.698
LE0.7150.51210.4970.2830.9520.5250.5470.8480.7030.4640.6260.3410.1760.666
FluidC0.6420.5940.760.6310.6280.8860.4960.5320.7510.8850.790.8660.2390.1970.482
EDCD0.8960.880.9680.370.130.3870.520.560.7900.920.8960.9210.4350.130.243
CDPD0.8970.90210.720.690.9840.5860.690.9240.920.850.9220.3980.1980.492

Share and Cite

MDPI and ACS Style

Sheng, J.; Lu, B.; Wang, B.; Hu, J.; Wang, K.; Pan, X.; Dong, Q.; Aklilu, D. Community Detection Based on a Preferential Decision Model. Information 2020, 11, 53. https://0-doi-org.brum.beds.ac.uk/10.3390/info11010053

AMA Style

Sheng J, Lu B, Wang B, Hu J, Wang K, Pan X, Dong Q, Aklilu D. Community Detection Based on a Preferential Decision Model. Information. 2020; 11(1):53. https://0-doi-org.brum.beds.ac.uk/10.3390/info11010053

Chicago/Turabian Style

Sheng, Jinfang, Ben Lu, Bin Wang, Jie Hu, Kai Wang, Xiaoxia Pan, Qiangqiang Dong, and Dawit Aklilu. 2020. "Community Detection Based on a Preferential Decision Model" Information 11, no. 1: 53. https://0-doi-org.brum.beds.ac.uk/10.3390/info11010053

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop