Next Article in Journal
Optimized Variational Mode Decomposition and Permutation Entropy with Their Application in Feature Extraction of Ship-Radiated Noise
Next Article in Special Issue
The Vegetation–Climate System Complexity through Recurrence Analysis
Previous Article in Journal
Limit Theorems as Blessing of Dimensionality: Neural-Oriented Overview
Previous Article in Special Issue
Subgraphs of Interest Social Networks for Diffusion Dynamics Prediction
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

A Mixture Model of Truncated Zeta Distributions with Applications to Scientific Collaboration Networks †

by
Hohyun Jung
1,2 and
Frederick Kin Hing Phoa
1,*
1
Institute of Statistical Science, Academia Sinica, Taipei City 11529, Taiwan
2
Department of Statistics, Sungshin Women’s University, Seoul 02844, Korea
*
Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in Jung, H.; Phoa, F.K.H. Analysis of a Finite Mixture of Truncated Zeta Distributions for Degree Distribution. In Proceedings of the International Conference on Complex Networks and Their Applications, Madrid, Spain, 1–3 December 2020; pp. 497–507.
Submission received: 25 March 2021 / Revised: 10 April 2021 / Accepted: 20 April 2021 / Published: 22 April 2021

Abstract

:
The degree distribution has attracted considerable attention from network scientists in the last few decades to have knowledge of the topological structure of networks. It is widely acknowledged that many real networks have power-law degree distributions. However, the deviation from such a behavior often appears when the range of degrees is small. Even worse, the conventional employment of the continuous power-law distribution usually causes an inaccurate inference as the degree should be discrete-valued. To remedy these obstacles, we propose a finite mixture model of truncated zeta distributions for a broad range of degrees that disobeys a power-law behavior in the range of small degrees while maintaining the scale-free behavior. The maximum likelihood algorithm alongside the model selection method is presented to estimate model parameters and the number of mixture components. The validity of the suggested algorithm is evidenced by Monte Carlo simulations. We apply our method to five disciplines of scientific collaboration networks with remarkable interpretations. The proposed model outperforms the other alternatives in terms of the goodness-of-fit.

1. Introduction

Network science focuses on the study of complex networks such as telecommunication, computer, biological, cognitive, and social networks. A network consists of nodes and links. The topological structure, which explores how nodes are connected in the system, has been investigated with great interest [1,2,3]. Network researchers have examined foundational network topologies using various network-related quantities such as the degree distribution, clustering coefficient, and average path length. Most networks are dynamic, so accordingly, network-related quantities also change over time. Studying the evolution of these network quantities could provide insight into the behavior of individuals expressed by nodes and the change of topological properties of the network. The evolution of a network has been studied from various perspectives, e.g., the community [4,5], rich-get-richer [1], node heterogeneity [2,6], and link persistence [3].
The degree of a node is the number of links connected to a node. The degree distribution of a network, P ( k ) , tells us the probability of degree k that a randomly chosen node will have. One challenge in studying network science is to develop simplified measures that capture the network structure. The degree distribution is one of such measures that help to find influential nodes in the system. Many attempts have been made to study the degree distribution using Poisson, exponential, and power-law distributions. In particular, the analysis of the power-law degree distribution has been considered as one of the basic steps, and networks that have power-law degree distributions are often referred to as scale-free networks. We can frequently observe power-law degree distributions in collaboration, World Wide Web, protein–protein interactions, and semantic networks [7,8,9,10]. The emergence of hubs (highly connected nodes) is a consequence of a scale-free property of networks. A rich-get-richer mechanism, also called a popularity effect [11], has been known to produce power-law degree distributions and hubs [1,6].
Many dynamic network models have been developed to explain the power-law degree distribution in real networks. The most widely known dynamic model for the power-law degree distribution is the rich-get-richer generative model in Barabasi and Albert (1999) [1], called the BA model. They employ the preferential attachment mechanism in which nodes with more neighbors tend to receive more links from other nodes. Specifically, the algorithm of the BA model has the following parts with the parameter m that controls the number of links over time:
  • Growth: At each time point, a node enters the network. Then the node tries to connect with m nodes in the network.
  • Preferential attachment: The newly entered node connects with node i with probability proportional to the degree of node i.
The BA model yields a power-law degree distribution with exponent 3, i.e., P ( k ) k 3 . The exact form, given by Bollobas et al. in [12], is:
P ( k | m ) = 2 m ( m + 1 ) k ( k + 1 ) ( k + 2 ) , k = m , m + 1 , .
There are many variant models to cover a broad range of power-law exponents [13,14,15], degree correlations [16], accelerating growth [17,18,19], and node heterogeneity [6,20,21]. Another model for generating the power-law degree distribution is a copying network model presented in Kumar et al. (2000) [22]. In this model, newly entered nodes randomly select some existing nodes and copy some of the links.
The power-law distribution has the form of P ( k ) k α , which can be expressed as ln P ( k ) = α ln k + c o n s t a n t . Therefore, a straight line of ln P ( k ) plot on ln k (log-log plot) can be an indication of a power-law relationship, and its slope is a power-law exponent. In real networks, however, the degree distribution does not have a shape of a straight line in the entire range. There are many empirical distributions where the power-law behavior is not observed in the range of small degrees. Many variants of the power-law distribution are developed to address this issue, such as generalized power-law distributions [23,24], composite distributions with threshold [25,26], and power-law distributions with an exponential cutoff [27,28]. However, these methods do not consider the essential foundation of the power-law. According to the BA model and its variants, the power-law nature is an inherent property exhibited from the preferential attachment rule. The model presented in this paper preserves the power-law nature to avoid manual modifications of the power-law distribution function, given by P ( k ) k α .
Note that the BA model has a parameter m. Jordan (2006) [29] relieved the constant m condition that, at each time, the number M of connections can change over time according to the distribution of M. The degree distribution turned out to be
P ( k ) = 2 E [ M ( M + 1 ) 1 ( M k ) ] k ( k + 1 ) ( k + 2 ) , k = 1 , 2 , .
We note the following statistical property.
Theorem 1.
Suppose that M has a finite support, M = 1 , , M m a x . Then the degree distribution of Jordan’s model can be expressed as a mixture distribution, given by
P ( k ) = m = 1 M m a x w m P ( k | m ) , k = 1 , 2 , ,
where w m = P ( M = m ) is a mixture weight corresponding to the mth mixture component P ( k | m ) in Equation (1), m = 1 , , M m a x .
Proof. 
Note that we can rewrite P ( k | m ) and P ( k ) as
P ( k | m ) = 2 m ( m + 1 ) k ( k + 1 ) ( k + 2 ) 1 ( k m ) P ( k ) = m = 1 M m a x w m 2 m ( m + 1 ) k ( k + 1 ) ( k + 2 ) 1 ( m k )
from Equations (1) and (2). Then, we have
P ( k ) = m = 1 M m a x w m 2 m ( m + 1 ) k ( k + 1 ) ( k + 2 ) 1 ( m k ) = m = 1 M m a x w m P ( k | m )
by 𝟙 ( k m ) = 𝟙 ( m k ) . □
Theorem 1 suggests that the degree distribution might be expressed as a mixture distribution. For example, we consider a network where a new node connects to one or two existing nodes, with probabilities P ( M = 1 ) = 2 / 3 and P ( M = 2 ) = 1 / 3 . Then we have the following degree distribution in a mixture form of the two BA model’s distributions with different m,
P ( k ) = 2 3 P ( k | m = 1 ) + 1 3 P ( k | m = 2 ) .
Inspired by this property, we consider a mixture model as an explanation of the deviation from the power-law in the range of small degrees.
Moreover, many studies have considered the degree as a continuous variable using continuity assumptions [30,31]. This approach may mislead researchers since the degree is discrete-valued. Therefore, a discrete power-law distribution, called a truncated zeta distribution, is used in this paper.
In this study, we propose a mixture model of truncated zeta distributions for the analysis of degree distributions. The proposed model covers the entire range of the degree distribution through a mixture of truncated zeta distributions while maintaining the scale-free nature of a network. We can characterize the degree distribution more accurately through the discrete version of the power-law distribution. In addition, we present the maximum likelihood estimation algorithm along with a model selection method. A simulation study examines the validity of the proposed estimation procedure. In addition, real collaboration networks are investigated with the proposed model to describe the characteristics of the degree distribution.
We focus on analyzing actual scientific collaboration networks and have made significant advancements compared to the previous work in Jung and Phoa (2020) [32]. The major improvements are as follows:
  • We detected some inconsistency in the scientific collaboration data. For example, “Smith, James,” “Smith, John,” and “Smith, Jacob” are all stated as “Smith, J” before 2007. Hence, we change the period of data to avoid the author name inconsistency for accurate inferences.
  • Moreover, a more elaborate analysis of the real data is conducted with noteworthy interpretations.
  • The validity of the presented algorithm is addressed with Monte Carlo simulations.
  • Extensive comparison studies are performed to show the superiority of the proposed model. We compared the proposed model with generalized Pareto models as well as base models that lack discreteness or mixture nature.
  • We provide more detailed explanations throughout the paper.
The rest of the paper is organized as follows. The continuous and discrete power-law distributions are defined in Section 2, and the proposed mixture model of truncated zeta distributions is defined in Section 3. Section 4 presents the estimation method of the mixture model, and the validity of the estimation procedure is demonstrated in Section 5. In Section 6, we analyze the scientific collaboration network by applying the proposed mixture model with interpretations. Section 7 concludes the paper.

2. The Power-Law Distribution

2.1. Continuous Power-Law Distribution

The probability density function (pdf) of the continuous power-law distribution parameterized by the power-law exponent α > 1 and the minimum value l > 0 , denoted by P L ( α , l ) , is given by [7]
f ( x | α , l ) = ( α 1 ) l α 1 x α , x l .
Note that the support is real values greater than or equal to l.
The complementary cumulative distribution function is useful to describe the tail of the power-law distribution, expressed as
P ( X x ) = x l α + 1 , x l .
Its moments are given by
E [ X m ] = α 1 α 1 m l m ,
provided α > m + 1 . We can deduce that P L ( α , l ) has the mean and variance given by
E [ X ] = α 1 α 2 l , V a r [ X ] = α 1 ( α 3 ) ( α 2 ) 2 l 2 ,
where the mean and variance are well-defined for α > 2 and α > 3 , respectively.
The continuous power-law distribution has been widely used in the analysis of the degree distribution. To analyze the discrete-valued variable degree, we need an approximation method. Setting a constant c, 0 c 1 , for the correction of continuity, the degree distribution can be approximated by
f ( k | α , l ) k c k + 1 c f ( x | α , l ) d x
for an integer value k.
One of the most common approaches is to round values to the nearest integer, which corresponds to c = 0.5 [7,33]. This rounding approach is acceptable when considering the tail part of the power-law distribution. However, if k is small, the constant c that satisfies the exact equation of (5) may be considerably less than 0.5 , and it should be avoided. According to Clauset et al. (2009) in [7], the rounding approach is reasonable for k > 6 .
Since the node degree is usually small, the approximation may lead to misunderstanding when performing statistical analysis on the degree distribution, such as generating node degrees and fitting the distribution to real data. We consider the exact version of the discrete power-law in the next subsection.

2.2. Truncated Zeta Distribution

The truncated zeta distribution, denoted by T Z ( α , l ) , is a discrete form of the power-law distribution. Parameters are the same as the continuous power-law distribution in which α > 1 is the power-law exponent and l > 0 is the minimum value. The probability mass function (pmf) of T Z ( α , l ) is given by [7]
g ( k | α , l ) = 1 ζ ( α , l ) k α , k = l , l + 1 , ,
where ζ ( · , · ) is the Hurwitz zeta function
ζ ( α , l ) = k = l k α ,
which can be regarded as the normalizing constant of the distribution. The Hurwitz zeta function in Equation (6) corresponds to the continuous counterpart 1 / ( α 1 ) l α 1 in Equation (4) via the upper Riemann sum approximation, expressed as
ζ ( α , l ) l x α d x = 1 / ( α 1 ) l α 1 .
The complementary cumulative distribution function is given by
P ( K k ) = ζ ( α , k ) ζ ( α , l ) , k l .
Next, the moments of the truncated zeta distribution are expressed as
E [ K m ] = ζ ( α m , l ) ζ ( α , l ) , α > m + 1 .
We can straightforwardly derive the mean and variance as
E [ X ] = ζ ( α 1 , l ) ζ ( α , l ) , V a r [ X ] = ζ ( α , l ) ζ ( α 2 , l ) ζ 2 ( α 1 , l ) ζ 2 ( α , l ) ,
provided α > 2 for the mean and α > 3 for the variance.
Figure 1 shows some pmfs of T Z ( α , l ) when α = 2.50 . We can check that the straight line of a log-log plot can be strong evidence of a power-law behavior.

3. Truncated Zeta Mixture Model

We consider the finite mixture model of truncated zeta distributions by fixing the power-law exponent α for mixture components while varying minimum values to produce a mixture of truncated zeta distributions. The probability mass function is represented as
p ( k | α , L , w ) = l = 1 L w l g ( k | α , l ) , k = 1 , 2 , ,
where g ( k | α , l ) is the pmf of T Z ( α , l ) , and L is the number of mixture components. Mixture weights w = ( w 1 , w 2 , , w L )
w l 0 , l = 1 , 2 , , L 1 , w L > 0 , a n d l = 1 L w l = 1 .
In this paper, we assume that the minimum value l is equal to 1, but it can be modified according to the data. The tail of most real networks follows the power-law distribution, and Equation (7) has the exact power-law behavior for sufficiently large degrees.
Theorem 2.
For k larger than or equal to L, the truncated zeta mixture distribution in Equation (7) has the exact power-law relationship, given by
p ( k | α , L , w ) k α .
Proof. 
By using the pmf of T Z ( α , l ) , we can write
p ( k | α , L , w ) = l = 1 L w l ζ ( α , l ) k α .
Since the term inside the bracket is independent with k, the pmf of the mixture is proportional to k α . □
Mixture models may suffer from the non-identifiability issue even for finite mixtures. The following theorem proves that the proposed truncated zeta mixture model is identifiable.
Theorem 3.
The mixture distribution Equation (7) is identifiable with respect to α, L, and w.
Proof. 
Let p ( k | α 1 , L 1 , w 1 ) and p ( k | α 2 , L 2 , w 2 ) be the mixture distributions of α 1 , L 1 , w 1 = ( w 11 , w 12 , , w 1 L ) and α 2 , L 2 , w 2 = ( w 21 , w 22 , , w 2 L ) , respectively. Suppose that p ( k | α 1 , L 1 , w 1 ) and p ( k | α 2 , L 2 , w 2 ) are identical, i.e., p ( k | α 1 , L 1 , w 1 ) = p ( k | α 2 , L 2 , w 2 ) for all k = 1 , 2 , . Further, we define the slope function s ( k | α , L , w ) of the log-log degree distribution, given by
s ( k | α , L , w ) = ln p ( k + 1 | α , L , w ) ln p ( k | α , L , w ) ln ( k + 1 ) ln ( k ) , k = 1 , 2 , .
The equality of the two mixture distributions give the identical slope function, s ( k | α 1 , L 1 , w 1 ) = s ( k | α 2 , L 2 , w 2 ) for all k = 1 , 2 , . We then have α = s ( k | α , L , w ) for the sufficiently large k ( max { L 1 , L 2 } ) . Therefore, we obtain α 1 = α 2 . The number of mixture components L is the largest integer k such that s ( k 1 | α , L , w ) α , and we also have L 1 = L 2 . Let L ( = L 1 = L 2 ) and α ( = α 1 = α 2 ) be the common number of mixture components and the power-law exponent.
Using p ( k | α 1 , L 1 , w 1 ) = p ( k | α 2 , L 2 , w 2 ) for k = 1 , 2 , , L , we have the following L equations:
w 11 g ( 1 | α , 1 ) = w 21 g ( 1 | α , 1 ) , w 11 g ( 2 | α , 1 ) + w 12 g ( 2 | α , 2 ) = w 21 g ( 2 | α , 1 ) + w 22 g ( 2 | α , 2 ) , w 11 g ( L | α , 1 ) + + w 1 L g ( L | α , L ) = w 21 g ( L | α , 1 ) + + w 2 L g ( L | α , L ) .
By solving these equations, we get w 1 l = w 2 l for l = 1 , 2 , , L . Thus, the mixture of truncated zeta distributions is identifiable. □
In Figure 2, we depict some log-log plots of mixture distributions. We can see that this model can handle empirical degree distributions that do not follow the power-law distribution at small degrees.

4. Estimation Algorithm

We use the Expectation-Maximization (EM) algorithm to estimate the exponent parameter α and mixture weights w for a given number of mixture components L. Let k = ( k 1 , k 2 , , k N ) be the observed data, and z n be the membership of k n , where the membership z n is assigned as z n = l if k n is from the lth mixture component T Z ( α , l ) . We consider the membership variable z = ( z 1 , z 2 , , z N ) as missing. Let θ = ( α , w ) = ( α , w 1 , , w L ) be the parameters of the mixture model.
The complete-data likelihood function is given by
L ( θ | k , z ) = n = 1 N p ( z n | w ) g ( k n | α , z n ) = n = 1 N w z n 1 ζ ( α , z n ) k n α , k n = z n , z n + 1 , .
We proceed by taking the logarithms of both sides to have the complete-data log-likelihood function:
ln L ( θ | k , z ) = n = 1 N ln w z n ln ζ ( α , z n ) α ln k n .
We define Q ( θ | θ ) as the expected value of the log-likelihood given the observed data k and the current parameter estimate θ = ( α , w ) , which can be expressed as
Q ( θ | θ ) = E ln L ( θ | k , Z ) | k , θ = n = 1 N l = 1 L γ ( l , n , θ ) ln w l ln ζ ( α , l ) α ln k n ,
where γ ( l , n , θ ) is the membership responsibility of the nth observation k n corresponding to the lth mixture component T Z ( α , l ) . They are defined by the posterior probabilities of mixture component memberships for each observation,
γ ( l , n , θ ) = P ( z n = l | k n , θ ) = w l g ( k n | α , l ) l = 1 L w l g ( k n | α , l ) , l = 1 , 2 , , L , n = 1 , 2 , , N .
The E-step computes membership responsibilities.
In the M-step, a new parameter estimate θ ˇ = argmax θ Q ( θ | θ ) is computed using the quantity previously computed in the E-step. To find w ˇ , we need to solve the following optimization problem:
M a x i m i z e n = 1 N l = 1 L γ ( l , n , θ ) ln w l , S u b j e c t t o l = 1 L w l = 1 , a n d w l 0 , l = 1 , 2 , , L .
The Lagrange multiplier method yields
w ˇ l = 1 N n = 1 N γ ( l , n , θ ) , l = 1 , 2 , , L .
Next, α ˇ can be found in the partial derivative of Q with respect to α , given by
Q ( θ | θ ) α = n = 1 N l = 1 L γ ( l , n , θ ) ζ α ( α , l ) ζ ( α , l ) + ln k n .
Here, ζ α ( α , l ) is the partial derivative of the Hurwitz zeta function with respect to α , given by
ζ α ( α , l ) = ζ ( α , l ) α = k = l ( ln k ) k α .
Then, the equation
Q ( θ | θ ) α = 0
gives the desired α ˇ . Unfortunately, a closed-form solution does not exist in general. In this paper, we employ Brent’s method [34] to solve Equation (10) with respect to α .
The two steps are necessarily repeated until the convergence obtains the final parameter estimate θ ^ . The process is summarized in Algorithm 1.
In order to select the number of mixture components L, we employ the Bayesian information criterion (BIC) considering the trade-off between the goodness-of-fit and the complexity of the model. BIC is given by the following formula:
B I C = L ln N 2 n = 1 N ln p ( k n | α , L , w ) ,
where α and w are the obtained estimated parameters given the number of mixture components L. We choose L giving the smallest BIC, where the candidates of L are the integers from 1 to the minimum of the two values: 100 and the nearest integer to 0.90 × ( m a x i m u m d e g r e e ) .
Algorithm 1: EM Algorithm
Entropy 23 00502 i001

5. Monte Carlo Simulation

We study the validity of the presented estimation methods using the synthetic data. We consider three values of true power-law exponents α = 2.50 , 3.00 , 3.50 and also three values of the number of mixture components L = 2 , 4 , 6 . For each pair of α and L, we generate 1000 samples from the finite mixture of truncated zeta distributions in Equation (7). Each process is repeated 30 times. Mixture weights w l , l = 1 , 2 , , L are set to 1 / L for each dataset.
We assume that the true number of components is given. Algorithm 1 is applied to each dataset to obtain the parameter estimate θ ^ = ( α ^ , w ^ ) . Table 1 presents the parameter estimation result. We find that the average values of estimated parameters are very close to the true parameters, implying that the EM algorithm works well in estimating the exponent parameter α and weights w.
Next, we assume that the number L of mixture components is not given. We will check whether the presented algorithm can find the appropriate number of mixture components. Note that BIC is used as a model selection criterion for the generated datasets. Table 2 shows the result of determining L. The result indicates that the maximum likelihood method yields a reasonable estimation result in finding L.

6. Application: Collaboration Networks

6.1. The Data

We study the scientific collaboration network obtained from the Web of Science, where a large-scale database collects the information of all published scientific articles in the world. Among all 275 disciplines, we randomly choose five, which are Biotechnology and Applied Microbiology, Computer Science, Environmental Science, Materials Science, and Physical Chemistry, for demonstration. We reorganize the Web of Science database into a network data structure such that the nodes of networks are authors, and two authors are connected by an undirected link if there is at least one paper co-authored by them. We employ the data from 2007 to 2016 as the inconsistency of author’s names is observed before the year 2007. The data from 2007 to 2009 are used to accumulate data so that the dependence of node degrees can be ignorable. We analyze the degree distribution of the data from 2010 to 2016. Table 3 shows the summary statistics in the year 2016.

6.2. Application of the Truncated Zeta Mixture Model

We apply the presented estimation method to degree distributions from 2010 to 2016.
In Figure 3, we plot degree distributions and the estimated fitting lines of the proposed model in the years 2001 and 2016. The plots suggest that the truncated zeta mixture shows a good performance in fitting degree distributions. The beginning points (see the vertical lines in Figure 3) of power-law behaviors are reasonably estimated via L ^ . This result shows the usefulness of the proposed model for the degree distribution that deviates from the power-law distribution on a small degree part.
Figure 4 shows the temporal changes of L ^ and α ^ . The estimated number L of mixture components exhibits a rising tendency, and α ^ shows an opposite trend. They constantly move towards large L and small α regions for all fields, indicating that the stabilization of the degree distribution has not yet been achieved. Many network scientists [1,13,14] have concentrated on the stationary or the converged degree distribution. The result implies that, however, non-stationary network models are of great importance, such as acceleration models [15]. Moreover, many works in real data analysis have focused on the power-law exponent of a snapshot of a network. However, the temporal variation of α ^ tells us that the significance of temporal models can account for the temporal change of the power-law exponent.
It should be noted that L ^ tends to approach different values across fields. According to Equation (3) and the relevant interpretation in Section 1, the number of mixture components L and mixture weights w are closely related to the distribution of the number of links in the system. Therefore, L ^ could be a measure to the distribution of the number of links. On the other hand, α ^ seems to converge a value near 2.80 for all fields, suggesting that α could be a network-specific quantity instead of a field-specific quantity.
We now focus on the field of Computer Science. Table 4 presents the result of applying the model to the field of Computer Science in more detail. We can observe an interesting pattern in the estimated mixture weights, where w ^ 1 and w ^ 2 show decreasing trends whilst w ^ 3 , w ^ 4 , and w ^ 5 show the opposite. According to Jordan’s model [29] and Equation (3), mixture weights have much to do with the number of newly made links. The decreasing trend of w 1 and w 2 and the increasing trend of w 3 , w 4 , ⋯ indicates the increasing number of links over time. As shown in Table 4, the number of created links tends to increase over time. Since there are many new links compared to new nodes (authors), the average degree gradually increased from 4.70 in 2010 to 6.68 in 2016. The increase in the average degree also explains that the state of this network is still evolving. With the rapid advancement in technology and science, many publications have been produced by researchers. In particular, Computer Science is making greater progress due to recently emerging areas: artificial intelligence and big data. The proposed model tries to explain the increasing mean degree in two ways: decreasing power-law exponents α and an appropriate change in the mixture weight, suggesting that the proposed model is helpful to describe the change of the network pattern.

6.3. Comparison to Other Models

Our model is developed to deal with two essential characteristics of the degree distribution: the non-power-law behavior at small degrees and the discreteness. We study the superiority of the proposed model to base models that lack one of these characteristics.
We first consider the standard discrete power-law distribution, T Z ( α , 1 ) . The estimates of the power-law exponent α are obtained by fitting the data into T Z ( α , 1 ) , and the result is presented in Table 5. The α ^ estimates of the standard discrete power-law distribution are smaller than those of the mixture distribution. As we can observe in Figure 5, the small-degree data that deviates from the power-law makes the exponent small. The estimated fitting lines indicate that the standard discrete power-law distribution is inappropriate to describe the degree distribution.
Next, the degree distribution is applied to the mixture of the continuous power-law distribution using the rounding method with the constant c = 0.5 of the continuity correction. The continuous version of the mixture of truncated zeta distributions can be constructed by the substitution of T Z ( α , l ) to P L ( α , l c ) . In addition, we imitate the method in Section 4 for the continuous mixture to compare the performance of the continuous mixture with the discrete mixture. The estimation procedure is identical to the case of the discrete mixture model. Table 5 presents the estimation result of α and mixture weights w. The estimates L ^ and α ^ are much smaller in continuous mixture models due to the non-exactness of the continuous model. In addition, estimated mixture weights w ^ are considerably different from the discrete mixture. According to Equation (3), mixture weights are involved in the distribution of the number of links. Therefore, we should use the discrete version of the power-law to determine mixture weights correctly. Figure 5 shows that the fitting lines of the continuous mixture seem to deviate from empirical distributions.
As we can see in Table 4 and Table 5, the smallest BIC values are achieved in the proposed model as well. Thus, we can conclude that the proposed model outperforms the continuous model as well as the standard discrete power-law model.
Next, we compare the proposed model with generalized Pareto distributions in which all degree ranges are covered. We use both continuous and discrete types of generalized Pareto distributions. The complementary cumulative density function of the continuous generalized Pareto distribution G P D ( μ = 0.5 , σ , ξ ) is given by
P ( X x ) = 1 + ξ ( x μ ) σ 1 / ξ , x μ .
There are few studies concerning the discrete version of the generalized Pareto distribution. We here use the model in Prieto et al. (2013) [35], expressed as D G P ( α , λ , μ = 1 ) . This distribution has advantages over the continuous distribution since the node degree is discrete. Table 6 shows the results of discrete and continuous generalized Pareto distributions applied to the field of Computer Science. In Figure 6, we plot fitting results in the years 2010 and 2016.
Interestingly, Figure 6 suggests that the continuous version has better fitting results. The discrete version has difficulty in explaining a range of large degrees. We can see that our model performs better than the two generalized Pareto distributions, as well indicated by the BIC values (see Table 4 and Table 6).

7. Concluding Remark

Inspired by Jordan’s model [29], a novel mixture model for the degree distribution is proposed to describe the entire range of degrees while maintaining the power-law or the scale-free property of a network. The truncated zeta distribution enables us to analyze discrete distributions for accuracy purposes. The parameter estimation procedure is presented along with the model selection criterion for determining the number of mixture components. A simulation study shows the validity of the suggested estimation procedure. The practical performance of the model is studied through the comparison analysis with the other techniques.
We perform the real data analysis on five disciplines of the scientific collaboration data obtained from the Web of Science. We observe the increasing tendency in the number of mixture components and the decreasing tendency in the power-law exponent. In addition, mixture weights change over time. It can be suggested from these results that the analyzed networks are still in an evolving state, highlighting the practical importance of non-stationary temporal network models. The non-convergence of the degree distribution might be due to the short-term analysis performed. Determining whether the collaboration network will stabilize the equilibrium remains as future work.
We can observe power-law distributions not only in the degree distribution but also in sandpile avalanches, species extinctions, city sizes, and so on. The proposed model could be useful when (i) the distribution does not follow the power-law only in small values while the power-law is suitable for large values, (ii) the background knowledge does not support the manual modification of the power-law relationship, or (iii) a mixture distribution can be regarded as reasonable for describing data.

Author Contributions

H.J.: methodology, experiment, proof of theorems, writing—original draft preparation. F.K.H.P.: conceptualization, methodology, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially supported by the Academia Sinica grant number AS-TP-109-M07 and the Ministry of Science and Technology (Taiwan) grant numbers 107-2118-M-001-011-MY3 and 109-2321-B-001-013.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Restrictions apply to the availability of these data. Data was obtained from a neo4j database refined from the Web of Science database. They are available from Dr. Frederick Kin Hing Phoa with the permission of the URA team of ISM (Japan) and Clarivate Analytics.

Acknowledgments

The authors would like to thank Clarivate Analytics to provide access to the raw data of the Web of Science database for research investigations, the URA team of ISM for transforming the data into the neo4j database and providing the neo4j database for analysis in this work, and Ula Tzu-Ning Kung to provide English editing service in this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Barabási, A.L.; Albert, R. Emergence of scaling in random networks. Science 1999, 286, 509–512. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  2. Jung, H.; Lee, J.G.; Lee, N.; Kim, S.H. Comparison of fitness and popularity: Fitness-popularity dynamic network model. J. Stat. Mech. Theory Exp. 2018, 2018, 123403. [Google Scholar] [CrossRef] [Green Version]
  3. Mazzarisi, P.; Barucca, P.; Lillo, F.; Tantari, D. A dynamic network model with persistent links and node-specific latent variables, with an application to the interbank market. Eur. J. Oper. Res. 2020, 281, 50–65. [Google Scholar] [CrossRef] [Green Version]
  4. Xu, K.S.; Hero, A.O. Dynamic stochastic blockmodels for time-evolving social networks. IEEE J. Sel. Top. Signal Process. 2014, 8, 552–562. [Google Scholar] [CrossRef] [Green Version]
  5. Peixoto, T.P.; Rosvall, M. Modelling sequences and temporal networks with dynamic community structures. Nat. Commun. 2017, 8, 582. [Google Scholar] [CrossRef] [Green Version]
  6. Bianconi, G.; Barabási, A.L. Competition and multiscaling in evolving networks. Europhys. Lett. 2001, 54, 436. [Google Scholar] [CrossRef] [Green Version]
  7. Clauset, A.; Shalizi, C.R.; Newman, M.E. Power-law distributions in empirical data. SIAM Rev. 2009, 51, 661–703. [Google Scholar] [CrossRef] [Green Version]
  8. Newman, M.E. The structure and function of complex networks. SIAM Rev. 2003, 45, 167–256. [Google Scholar] [CrossRef] [Green Version]
  9. Newman, M.E. Power laws, Pareto distributions and Zipf’s law. Contemp. Phys. 2005, 46, 323–351. [Google Scholar] [CrossRef] [Green Version]
  10. Albert, R.; Jeong, H.; Barabási, A.L. Diameter of the world-wide web. Nature 1999, 401, 130–131. [Google Scholar] [CrossRef] [Green Version]
  11. Jung, H.; Lee, J.G.; Lee, N.; Kim, S.H. PTEM: A popularity-based topical expertise model for community question answering. Ann. Appl. Stat. 2020, 14, 1304–1325. [Google Scholar] [CrossRef]
  12. Bollobás, B.E.; Riordan, O.; Spencer, J.; Tusnády, G. The degree sequence of a scale-free random graph process. Random Struct. Algorithms 2001, 18, 279–290. [Google Scholar] [CrossRef] [Green Version]
  13. Dorogovtsev, S.N.; Mendes, J.F.F.; Samukhin, A.N. Structure of growing networks with preferential linking. Phys. Rev. Lett. 2000, 85, 4633. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  14. Krapivsky, P.L.; Redner, S. Organization of growing random networks. Phys. Rev. E 2001, 63, 066123. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  15. Dorogovtsev, S.N.; Mendes, J.F.F. Effect of the accelerating growth of communications networks on their structure. Phys. Rev. E 2001, 63, 025101. [Google Scholar] [CrossRef] [Green Version]
  16. Fotouhi, B.; Rabbat, M.G. Degree correlation in scale-free graphs. Eur. Phys. J. B 2013, 86, 510. [Google Scholar] [CrossRef] [Green Version]
  17. Yu, X.; Li, Z.; Zhang, D.; Liang, F.; Wang, X.Y.; Wu, X. The topology of an accelerated growth network. J. Phys. A Math. Gen. 2006, 39, 14343. [Google Scholar] [CrossRef]
  18. Jung, S.; Kim, S.; Kahng, B. Geometric fractal growth model for scale-free networks. Phys. Rev. E 2002, 65, 056101. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  19. Dorogovtsev, S.N.; Mendes, J.F. Accelerated growth of networks. arXiv 2002, arXiv:cond-mat/0204102. [Google Scholar]
  20. Pham, T.; Sheridan, P.; Shimodaira, H. Joint estimation of preferential attachment and node fitness in growing complex networks. Sci. Rep. 2016, 6, 32558. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  21. Jung, H.; Lee, J.G.; Kim, S.H. On the analysis of fitness change: Fitness-popularity dynamic network model with varying fitness. J. Stat. Mech. Theory Exp. 2020, 2020, 043407. [Google Scholar] [CrossRef]
  22. Kumar, R.; Raghavan, P.; Rajagopalan, S.; Sivakumar, D.; Tomkins, A.; Upfal, E. Stochastic models for the web graph. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, Redondo Beach, CA, USA, 12–14 November 2000; pp. 57–65. [Google Scholar]
  23. Prieto, F.; Sarabia, J.M. A generalization of the power law distribution with nonlinear exponent. Commun. Nonlinear Sci. Numer. Simul. 2017, 42, 215–228. [Google Scholar] [CrossRef] [Green Version]
  24. Arnold, B.C. Pareto Distributions; Chapman and Hall/CRC: New York, NY USA, 2015. [Google Scholar]
  25. Meibom, A.; Balslev, I. Composite power laws in shock fragmentation. Phys. Rev. Lett. 1996, 76, 2492. [Google Scholar] [CrossRef] [PubMed]
  26. Garcıa, F.; Garcıa, R.; Padrino, J.; Mata, C.; Trallero, J.; Joseph, D. Power law and composite power law friction factor correlations for laminar and turbulent gas–liquid flow in horizontal pipelines. Int. J. Multiph. Flow 2003, 29, 1605–1624. [Google Scholar] [CrossRef]
  27. Fenner, T.; Levene, M.; Loizou, G. A model for collaboration networks giving rise to a power-law distribution with an exponential cutoff. Soc. Netw. 2007, 29, 70–80. [Google Scholar] [CrossRef] [Green Version]
  28. Mossa, S.; Barthelemy, M.; Stanley, H.E.; Amaral, L.A.N. Truncation of power law behavior in “scale-free” network models due to information filtering. Phys. Rev. Lett. 2002, 88, 138701. [Google Scholar] [CrossRef] [Green Version]
  29. Jordan, J. The degree sequences and spectra of scale-free random graphs. Random Struct. Algorithms 2006, 29, 226–242. [Google Scholar] [CrossRef] [Green Version]
  30. Barabási, A.L.; Albert, R.; Jeong, H. Mean-field theory for scale-free random networks. Phys. A 1999, 272, 173–187. [Google Scholar] [CrossRef] [Green Version]
  31. Albert, R.; Barabási, A.L. Statistical mechanics of complex networks. Rev. Mod. Phys. 2002, 74, 47. [Google Scholar] [CrossRef] [Green Version]
  32. Jung, H.; Phoa, F.K.H. Analysis of a Finite Mixture of Truncated Zeta Distributions for Degree Distribution. In Studies in Computational Intelligence, Proceedings of the International Conference on Complex Networks and Their Applications, Madrid, Spain, 1–3 December 2020; Springer: Cham, Switzerland, 2020; pp. 497–507. [Google Scholar]
  33. Gillespie, C. Fitting Heavy Tailed Distributions: The poweRlaw Package. J. Stat. Softw. 2015, 64, 1–16. [Google Scholar] [CrossRef]
  34. Brent, R.P. Algorithms for Minimization without Derivatives; Dover: New York, NY USA, 2013. [Google Scholar]
  35. Prieto, F.; Gómez-Déniz, E.; Sarabia, J.M. Modelling road accident blackspots data with the discrete generalized Pareto distribution. Accid. Anal. Prev. 2014, 71, 38–49. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Figure 1. The pmfs of T Z ( α , l ) on a standard scale (left) and a log-log scale (right) when α = 2.50 and l = 1 , 2 , 3 , 4 .
Figure 1. The pmfs of T Z ( α , l ) on a standard scale (left) and a log-log scale (right) when α = 2.50 and l = 1 , 2 , 3 , 4 .
Entropy 23 00502 g001
Figure 2. The pmfs of the mixture of truncated zeta distributions with L = 4 on a log-log scale. Mixture weights are w = ( 0.4 , 0.3 , 0.2 , 0.1 ) (left) and w = ( 0.1 , 0.4 , 0.4 , 0.1 ) (right). Power-law exponents are α = 2.0 (blue), α = 2.5 (black), and α = 3.0 (red).
Figure 2. The pmfs of the mixture of truncated zeta distributions with L = 4 on a log-log scale. Mixture weights are w = ( 0.4 , 0.3 , 0.2 , 0.1 ) (left) and w = ( 0.1 , 0.4 , 0.4 , 0.1 ) (right). Power-law exponents are α = 2.0 (blue), α = 2.5 (black), and α = 3.0 (red).
Entropy 23 00502 g002
Figure 3. Degree distributions and fitted curves for five fields of scientific collaboration in the years 2010 and 2016. The estimated L values are depicted in blue vertical lines.
Figure 3. Degree distributions and fitted curves for five fields of scientific collaboration in the years 2010 and 2016. The estimated L values are depicted in blue vertical lines.
Entropy 23 00502 g003
Figure 4. The change of the estimated number of mixture components L (top) and power-law exponents α (bottom).
Figure 4. The change of the estimated number of mixture components L (top) and power-law exponents α (bottom).
Entropy 23 00502 g004
Figure 5. The degree distributions of the collaboration network data of the Computer Science field in the years 2010 (left) and 2016 (right). The estimated mixture of truncated zeta distributions (red solid), standard discrete power-law distributions (black dash-dot), and the mixture of continuous power-law distributions (blue dashed) are presented. Vertical lines refer to the estimated number L ^ of mixture components.
Figure 5. The degree distributions of the collaboration network data of the Computer Science field in the years 2010 (left) and 2016 (right). The estimated mixture of truncated zeta distributions (red solid), standard discrete power-law distributions (black dash-dot), and the mixture of continuous power-law distributions (blue dashed) are presented. Vertical lines refer to the estimated number L ^ of mixture components.
Entropy 23 00502 g005
Figure 6. The degree distributions of the collaboration network data of the Computer Science field in the years 2010 (left) and 2016 (right). The estimated fitting lines of continuous (red solid) and discrete (blue dashed) generalized Pareto distributions are presented.
Figure 6. The degree distributions of the collaboration network data of the Computer Science field in the years 2010 (left) and 2016 (right). The estimated fitting lines of continuous (red solid) and discrete (blue dashed) generalized Pareto distributions are presented.
Entropy 23 00502 g006
Table 1. Estimates of θ of the EM algorithm applied to the synthetic data with the true number of mixture components. We present the average (standard deviation in parentheses) values of estimated parameters θ ^ = ( α ^ , w ^ ) of 30 datasets for each case.
Table 1. Estimates of θ of the EM algorithm applied to the synthetic data with the true number of mixture components. We present the average (standard deviation in parentheses) values of estimated parameters θ ^ = ( α ^ , w ^ ) of 30 datasets for each case.
α L w l α ^ w ^
2.5020.50 2 . 51 ( 0.05 ) [ 0 . 49 ( 0.02 ) , 0 . 51 ( 0.02 ) ]
40.25 2 . 51 ( 0.06 ) [ 0 . 25 ( 0.01 ) , 0 . 25 ( 0.02 ) , 0 . 25 ( 0.03 ) , 0 . 25 ( 0.02 ) ]
60.17 2 . 50 ( 0.07 ) [ 0 . 17 ( 0.02 ) , 0 . 16 ( 0.02 ) , 0 . 17 ( 0.03 ) , 0 . 18 ( 0.04 ) , 0 . 16 ( 0.04 ) , 0 . 16 ( 0.04 ) ]
3.0020.50 3 . 01 ( 0.09 ) [ 0 . 49 ( 0.02 ) , 0 . 51 ( 0.02 ) ]
40.25 3 . 03 ( 0.10 ) [ 0 . 25 ( 0.01 ) , 0 . 24 ( 0.02 ) , 0 . 25 ( 0.03 ) , 0 . 26 ( 0.03 ) ]
60.17 3 . 03 ( 0.11 ) [ 0 . 17 ( 0.01 ) , 0 . 17 ( 0.02 ) , 0 . 17 ( 0.03 ) , 0 . 17 ( 0.03 ) , 0 . 16 ( 0.03 ) , 0 . 17 ( 0.03 ) ]
3.5020.50 3 . 50 ( 0.14 ) [ 0 . 50 ( 0.02 ) , 0 . 50 ( 0.02 ) ]
40.25 3 . 51 ( 0.13 ) [ 0 . 25 ( 0.02 ) , 0 . 25 ( 0.02 ) , 0 . 26 ( 0.02 ) , 0 . 25 ( 0.02 ) ]
60.17 3 . 50 ( 0.13 ) [ 0 . 17 ( 0.01 ) , 0 . 17 ( 0.01 ) , 0 . 17 ( 0.02 ) , 0 . 17 ( 0.02 ) , 0 . 15 ( 0.03 ) , 0 . 17 ( 0.03 ) ]
Table 2. The estimation result of L applied to the maximum likelihood method. L ^ is selected with the smallest BIC for each dataset. The average, standard deviation, and count of L ^ over 30 datasets are presented.
Table 2. The estimation result of L applied to the maximum likelihood method. L ^ is selected with the smallest BIC for each dataset. The average, standard deviation, and count of L ^ over 30 datasets are presented.
α LAverage L ^ St.Dev. L ^ Count L ^
2.5022.030.18{2: 29, 3: 1}
44.000.00{4: 30}
65.870.34{6: 26, 5: 4}
3.0022.000.00{2: 30}
44.000.00{4: 30}
66.000.00{6: 30}
3.5022.000.00{2: 30}
44.000.00{4: 30}
65.970.18{6: 29, 5: 1}
Table 3. The summary statistics of collaboration networks in the year 2016. For each field, we present the number of authors (nodes) and links between them. The mean, median, standard deviation, and maximum of degrees are also presented.
Table 3. The summary statistics of collaboration networks in the year 2016. For each field, we present the number of authors (nodes) and links between them. The mean, median, standard deviation, and maximum of degrees are also presented.
FieldNumber ofDegree Distribution
NodesLinksMeanMedianSt.Dev.Max.
Biotechnology & A. M.729,4783,977,91910.917.0019.522250
Computer Science528,2671,765,2836.684.0015.19989
Environmental Science680,9243,291,7809.675.0017.421149
Materials Science1,154,9086,861,18911.886.0028.473801
Physical Chemistry727,2134,150,18311.416.0023.422260
Table 4. The results of L ^ , α ^ , w ^ , and BIC using the proposed model for the discipline of Computer Science. We also present the number of new links and nodes as well as the mean degree over time.
Table 4. The results of L ^ , α ^ , w ^ , and BIC using the proposed model for the discipline of Computer Science. We also present the number of new links and nodes as well as the mean degree over time.
YearDiscrete MixtureNumber of NewMean Deg.
L ^ α ^ w ^ 1 w ^ 2 w ^ 3 w ^ 4 w ^ 5 BICNodesLinks
2010113.070.180.320.250.120.061,010,549--4.70
2011133.060.170.300.250.120.061,241,55543,237132,5604.94
2012143.060.160.300.250.130.071,486,63445,004144,3055.16
2013132.840.160.300.260.130.071,813,10156,392256,6465.78
2014162.860.150.290.260.140.072,137,66656,091215,2366.03
2015162.740.150.290.260.140.072,481,78057,317277,9836.48
2016162.720.140.280.260.140.072,810,56355,800234,7146.68
Table 5. The results of L ^ , α ^ , w ^ , and BIC using the continuous mixture model for the discipline of Computer Science. Estimated α and BIC for the standard discrete power-law distribution T Z ( α , 1 ) are also presented.
Table 5. The results of L ^ , α ^ , w ^ , and BIC using the continuous mixture model for the discipline of Computer Science. Estimated α and BIC for the standard discrete power-law distribution T Z ( α , 1 ) are also presented.
YearContinuous MixtureZeta
L ^ α ^ w ^ 1 w ^ 2 w ^ 3 BIC α ^ BIC
201042.300.200.400.301,053,2891.601,170,929
201142.290.190.390.301,290,0151.591,436,579
201252.290.180.380.301,540,8121.571,719,325
201352.260.170.370.301,871,7831.562,090,369
201452.240.160.360.302,201,8341.552,462,841
201552.220.160.350.312,550,1301.542,852,831
201652.210.150.340.302,884,0641.533,229,651
Table 6. The results of discrete and continuous generalized Pareto distributions applied to the field of Computer Science.
Table 6. The results of discrete and continuous generalized Pareto distributions applied to the field of Computer Science.
YearContinuous Generalized ParetoDiscrete Generalized Pareto
σ ^ ξ ^ BIC α ^ ( 10 4 ) λ ^ ( 10 6 ) BIC
20103.540.141,032,3172.519.531,043,227
20113.680.161,267,6275.424.171,282,398
20123.830.161,517,1455.653.811,535,652
20133.930.211,852,7995.693.341,911,678
20144.090.222,183,3411.4012.922,249,684
20154.200.252,535,7745.632.982,633,601
20164.310.262,871,0263.494.652,980,178
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Jung, H.; Phoa, F.K.H. A Mixture Model of Truncated Zeta Distributions with Applications to Scientific Collaboration Networks. Entropy 2021, 23, 502. https://0-doi-org.brum.beds.ac.uk/10.3390/e23050502

AMA Style

Jung H, Phoa FKH. A Mixture Model of Truncated Zeta Distributions with Applications to Scientific Collaboration Networks. Entropy. 2021; 23(5):502. https://0-doi-org.brum.beds.ac.uk/10.3390/e23050502

Chicago/Turabian Style

Jung, Hohyun, and Frederick Kin Hing Phoa. 2021. "A Mixture Model of Truncated Zeta Distributions with Applications to Scientific Collaboration Networks" Entropy 23, no. 5: 502. https://0-doi-org.brum.beds.ac.uk/10.3390/e23050502

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop