Article

Landmark-Based Domain Adaptation and Selective Pseudo-Labeling for Heterogeneous Defect Prediction

1 National and Local Joint Laboratory of Cyberspace Security Technology, Zhengzhou 450001, China
2 School of Cyber Science and Engineering, Information Engineering University, Zhengzhou 450001, China
* Author to whom correspondence should be addressed.
Submission received: 5 December 2023 / Revised: 14 January 2024 / Accepted: 18 January 2024 / Published: 22 January 2024
(This article belongs to the Special Issue Application of Machine Learning and Intelligent Systems)

Abstract

Cross-project defect prediction (CPDP) is a promising technical means to solve the problem of insufficient training data in software defect prediction. As a special case of CPDP, heterogeneous defect prediction (HDP) has received increasing attention in recent years due to its ability to cope with different metric sets across projects. Existing studies have shown that using mixed-project data is a promising way to improve HDP performance, but several challenges remain, including the negative impact of noise modules and the insufficient utilization of unlabeled modules. To this end, we propose a landmark-based domain adaptation and selective pseudo-labeling (LDASP) approach for mixed-project HDP. Specifically, we propose a novel landmark-based domain adaptation algorithm that considers marginal and conditional distribution alignment and a class-wise locality structure to reduce the heterogeneity between both projects while reweighting modules to alleviate the negative impact of noisy ones. Moreover, we design a progressive pseudo-label selection strategy that explores the underlying discriminative information of unlabeled target data to further improve the prediction effect. Extensive experiments are conducted on 530 heterogeneous prediction combinations built from 27 projects across four datasets. The experimental results show that (1) our approach improves the F1-score and AUC over the baselines by 9.8–20.2% and 4.8–14.4%, respectively, and (2) each component of LDASP (i.e., the landmark weights and the selective pseudo-labeling strategy) effectively promotes HDP performance.

1. Introduction

Software defect prediction technology can identify which modules (e.g., files, classes, and functions) are more likely to be defective in software projects, thereby helping developers/maintainers allocate testing resources reasonably [1,2]. However, this process often suffers from the lack of training data [3]. In order to overcome this, researchers came up with the idea of introducing external and well-labeled projects as the training data, that is, cross-project defect prediction (CPDP) [4,5,6]. Over the past decade, CPDP has attracted widespread attention in the field of software engineering and has made remarkable progress in alleviating the distribution difference between source and target projects’ data. Conventional CPDP methods assume that source and target projects have the same metrics. If this assumption does not hold (i.e., both projects have different metrics), existing CPDP methods cannot be directly applied to defect prediction. In response to this special case of CPDP, researchers have proposed heterogeneous defect prediction (HDP) [7,8] and designed corresponding solutions to deal with the different metrics and distributions of both projects.
HDP focuses on defect prediction across those projects in which metrics are partially or completely different and thus is often viewed as a special case of CPDP. Its greatest challenge is how to overcome the obstacle brought by different metrics while reducing the distribution discrepancy between source and target data. The current related work also focuses on solving the subproblems (e.g., class imbalance and linear inseparability) derived from HDP to further improve the prediction effect without considering the labeled data in the target project.
Actually, in the early stage of software development, a small amount of labeled data (i.e., training target data) and a large amount of unlabeled data (i.e., test target data) often coexist in the target project [9]. Moreover, Turhan et al. proved the effectiveness of the combination of “within” and “cross-project” data (i.e., mixed-project data) for defect prediction [6]. Inspired by this, several mixed-project HDP methods [10,11,12] have been proposed to fully utilize labeled target data and show the potential to improve predictive performance. However, research on the HDP with mixed-project data is still in the initial stage, and a series of issues (e.g., negative impact of noise modules and insufficient utilization of unlabeled data) have not been fully considered and resolved. In this paper, we further explore and address issues relevant to HDP with mixed-project data for improving the prediction performance.

1.1. Motivation

1.1.1. Negative Impact of Noise Modules

The original intention of cross-project defect prediction is to improve the predictive effect of the target project by introducing external projects with sufficient historical data, but this is not the case in reality. Recent research [13] has pointed out that treating all project modules equally in the CPDP process may lead to data redundancy. In other words, equally treated modules commonly play different roles in cross-project defect prediction. Some of them (i.e., noise modules) may even inhibit learning and then weaken prediction performance. As a special case of CPDP, a similar issue also exists in HDP. The left part of Figure 1 displays an ideal distribution of the source and target data where modules can be well separated by category. Noise modules make it hard to decide the classification boundary for source and target data, as shown in the right part of Figure 1.
Table 1 provides statistics on prediction combinations under different situations to illustrate the negative impact of noise modules (for detailed experimental settings, refer to Section 4.3). The WPDP method only uses labeled target data to build the classifier for predicting the remaining unlabeled ones. CLSUP is a mixed-project HDP method that trains the predictor with source data and labeled target data. As shown in this table, WPDP performs better than CLSUP in 21.5% and 20.2% of the prediction combinations regarding the F1-score and AUC, respectively. This means that the source data are not helpful for prediction across projects in these cases (i.e., invalid cross-project prediction combinations). Therefore, we believe that the source project must contain noise modules that negatively affect predictions. To the best of our knowledge, current mixed-project HDP methods usually consider all the modules of the source and target projects equally important, which is not conducive to highlighting the effect of the relevant modules while eliminating the negative impact of the noisy ones. Therefore, it is meaningful to investigate how to focus on the relevant modules while suppressing the noisy ones.

1.1.2. Insufficient Utilization of Unlabeled Modules

In the existing HDP research, the unlabeled modules of the target project are commonly used to match the data distributions of both projects. It can alleviate the heterogeneity between the source and target projects so that the prediction model can be better adapted to the target data. However, this utilization of unlabeled target data does not fully explore the latent discriminant information within it, which makes it difficult to improve the classification ability of prediction models significantly. Guided by a small number of labels, semi-supervised learning can utilize a large number of unlabeled samples to improve learning performance and avoid wasting data resources. It solves the problems of the weak generalization ability of supervised learning methods when there are few labeled samples and the inaccuracy of unsupervised learning methods when there is a lack of labeled samples [14]. As an effective semi-supervised method, pseudo-label learning [15] is beneficial for determining the classification boundary in low-density regions, thereby improving the model’s performance. Considering this, we further investigate how to utilize the pseudo-labels of unlabeled data from the target project to enhance the prediction effect.

1.2. Contribution

In this paper, we propose a landmark-based domain adaptation and selective pseudo-labeling (LDASP) approach for improving HDP with mixed-project data. The detailed contributions are summarized below:
  • The proposal of a novel landmark-based domain adaptation algorithm that considers marginal and conditional distribution alignment and a class-wise locality structure to reduce the heterogeneity between both projects, reweighting modules to alleviate the negative impact brought about by the noise modules.
  • The proposal of a progressive pseudo-label selection strategy exploring the underlying discriminative information of unlabeled target data to further improve the prediction effect.
  • Extensive experiments are conducted on 27 projects from four datasets to demonstrate that our approach performs significantly better than state-of-the-art methods and to verify the effectiveness of each of its components.

2. Related Work

In this section, we introduce the developments of heterogeneous defect prediction and domain adaptation that are relevant to this work. We first review the current HDP studies. Then, we summarize the concept of domain adaptation and detail the landmark-based domain adaptation methods.

2.1. Heterogeneous Defect Prediction

Based on the learning process of predictors, we roughly divide the existing HDP methods into two categories, i.e., conventional and mixed-project HDP methods, which correspond to different application scenarios.

2.1.1. Conventional HDP Methods

Conventional HDP methods are suitable for the scenario where the source project is well labeled and the target project is unlabeled. They train the defect predictor with labeled source data and unlabeled target data. To date, most of the existing related work belongs to this category. Jing et al. [8] were among the first to identify the HDP problem and propose an improved canonical correlation analysis (CCA) method for alleviating the heterogeneity between source and target projects. Meanwhile, Nam and Kim [16] proposed a metric selection and matching method for HDP that can remove redundant metrics for the source project and then match up the metrics of both projects (one-to-one) based on metric similarity. Based on these two methods, a series of related works [11,17,18,19] has emerged, and considerable progress has been made.
In order to deal with the class imbalance and linear inseparability issues, Li et al. successively proposed an ensemble multiple kernel correlation alignment (EMKCA) method [17] and a cost-sensitive transfer kernel canonical correlation analysis (CTKCCA) method [18]. Later, they also extended EMKCA to a two-stage ensemble learning framework by combining multiple predictors [20]. Similar to the idea of [16], Yu et al. [21] proposed a novel feature matching and transfer (FMT) method to select one-to-one feature pairs for heterogeneous source and target projects based on the “distance” between distribution curves. Tong et al. [19] designed a kernel spectral embedding transfer ensemble (KSETE) method to improve the prediction effect. This method first alleviates the class imbalance problem by performing the synthetic minority over-sampling technique (SMOTE) on the source project. Then, it combines kernel spectral embedding and transfer learning to find a latent common metric space for source and target data. Eventually, the prediction results are decided by ensemble learning. In addition, multi-view learning [22,23], multi-source transfer learning [24], and deep generative networks [25] have been introduced to improve the performance of conventional HDP methods.
In summary, the above HDP methods utilize the labeled source data and unlabeled target data in the learning process to reduce the heterogeneity between both projects effectively, showing considerable performance. However, they are not designed for the situation in which a small amount of labeled target data is available, and thus this supervised (label) information cannot be utilized reasonably in prediction. In this paper, we design a novel approach that can use labeled target data to improve the discriminative ability of predictors while eliminating the heterogeneity between source and target projects.

2.1.2. Mixed-Project HDP Methods

Mixed-project HDP methods focus on the scenario of the labeled source project and the target project with a small number of labeled modules. They combine heterogeneous source data and a small number of labeled target data (i.e., training target data) to build the defect predictor. Among the current research on HDP, only a few studies provide solutions for using mixed-project data to improve prediction performance. Li et al. [11] first proposed a cost-sensitive label and structure-consistent unilateral projection (CLSUP) method. This method combines domain adaptation and cost-sensitive learning techniques to learn the projection matrix from source to target data while introducing the misclassification cost to alleviate the impact of class imbalance. In order to enhance the quality of training data, Li et al. [10] proposed a multi-source selection-based manifold discriminant alignment method. Its core component is an improved manifold discriminant alignment (MDA) algorithm that learns transformation matrices for source and target data to make their distributions closer and have a favorable classification ability. The recent work [12] by Niu et al. was an extension of MDA that first applies a sampling technique to handle the class imbalance problem and then uses the kernel manifold discriminant alignment algorithm to overcome the linear inseparability issue. Extensive experiments on 13 projects from three public datasets demonstrate its state-of-the-art prediction performance.
Overall, the current mixed-project HDP methods focus primarily on using limited labeled target data reasonably in the learning process while considering common issues, such as class imbalance and linear inseparability. Although showing promising results, they ignore the negative impact of noise modules on prediction across projects, and the underlying discriminative information in unlabeled target data is not fully excavated. In this paper, we propose a novel landmark-based domain adaptation and selective pseudo-labeling approach for mixed-project HDP to address these limitations.

2.2. Domain Adaptation

Domain adaptation has been an important research direction of transfer learning and gained a lot of success in a wide range of tasks, such as computer vision and natural language processing [26]. The core idea of domain adaptation is to learn a feature extractor that makes the data distributions of both domains aligned so that the knowledge learned from the source domain can be applied to the target one. To this end, researchers attempt to introduce different methods (e.g., marginal, conditional, and joint distributions) to measure and optimize the discrepancy between different domains [27,28,29]. Unfortunately, the effectiveness of transfer learning is not always guaranteed due to the unsatisfying basic assumptions, which causes negative transfer, meaning the source knowledge decreases the learning performance in the target domain [30].
In order to deal with the negative transfer issue, several sample reweighting methods [28,31,32] are proposed to bring the source distribution closer to the target one; hence, this may help eliminate the impact of the negative transfer on learning performance in the target domain. These methods select the most relevant samples (i.e., landmarks) to match the data distributions by reweighting the samples from the source or both domains. Aljundi et al. [31] designed a novel unsupervised domain adaptation approach based on the subspace alignment and selection of landmarks similarly distributed between both domains. For the heterogeneous domain adaptation with the semi-supervised setting, Tsai et al. [32] proposed a cross-domain landmark selection method to project the source data into the feature subspace of the target domain. Specifically, this method considers reducing the marginal and conditional distribution discrepancies simultaneously while selecting landmarks from both domains through the learning weights for the samples. Inspired by this, we introduce the idea of landmark-based domain adaptation to alleviate the negative impact of noise modules in this work.
Unlike the landmark-based domain adaptation methods mentioned above, we consider more comprehensive factors, including marginal and conditional distribution alignment, and class-wise locality structure preservation to further improve the effect of domain adaptation. Furthermore, we also design a progressive selection strategy to raise the quality of used pseudo-labels instead of the direct utilization of all pseudo-labels.

3. Approach

In this section, we first introduce the setting of HDP with mixed-project data; the notations used in this paper are provided in Section 3.1. Section 3.2 and Section 3.3 provide detailed descriptions of our improved landmark-based domain adaptation method and the proposed progressive pseudo-label selection strategy, respectively.

3.1. Problem Statement

The only difference from the conventional setting (i.e., labeled source and unlabeled target projects with different metric sets) is the existence of a small number of labeled target modules in the mixed-project data setting. Specifically, the labeled source project can be defined as $S = \{x_{s_i}, y_{s_i}\}_{i=1}^{n_s} = \{X_S, Y_S\}$, in which $x_{s_i}$ refers to the $i$th module of the source project and $y_{s_i}$ is the corresponding label (i.e., defective or non-defective). Similarly, the unlabeled and labeled parts of the target project can be defined as $T_U = \{x_{u_i}, y_{u_i}\}_{i=1}^{n_u} = \{X_U, Y_U\}$ and $T_L = \{x_{l_i}, y_{l_i}\}_{i=1}^{n_l} = \{X_L, Y_L\}$, respectively. In this task, the source project data ($S$) and training target data ($T_L$) are combined to build the predictor and further identify the labels ($Y_U$) of the test target data ($T_U$).

3.2. Landmark-Based Domain Adaptation

In this section, we first present the basic structure of the objective function that is used to match the distributions between the source and target data. Then, we introduce the modified objective function by considering landmark weights to alleviate the negative impact of the noise modules.

3.2.1. Matching the Distributions of Source and Target Data

Generally, there is a significant difference between the data distributions of heterogeneous source and target projects. This makes a predictor trained on the source data unable to adapt well to the target data. To this end, we learn a transformation matrix $A \in \mathbb{R}^{d_s \times m}$ for the source data to project it into the $m$-dimensional metric subspace of the target data. Specifically, principal component analysis (PCA) is first applied to the target data to obtain this $m$-dimensional metric subspace. The maximum mean discrepancy (MMD) is an effective measure of the marginal distribution difference between two domains that does not assume any parametric form for the distributions. Hence, we employ it to calculate the distance between the distributions of both projects and define the marginal distribution alignment term $E_M(A, S, T_L, T_U)$ as follows:
$$E_M(A, S, T_L, T_U) = \Big\| \frac{1}{n_s} \sum_{i=1}^{n_s} A^\top x_{s_i} - \frac{1}{n_l + n_u} \Big( \sum_{j=1}^{n_l} \hat{x}_{l_j} + \sum_{j=1}^{n_u} \hat{x}_{u_j} \Big) \Big\|^2, \quad (1)$$
where $\hat{x}_{u}$ and $\hat{x}_{l}$ refer to the unlabeled and labeled target modules processed via PCA.
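To make the marginal alignment term concrete, the following NumPy sketch evaluates Equation (1) for a given transformation matrix; it is an illustrative transcription rather than the authors' implementation, and the array names (X_s, X_hat_l, X_hat_u) are assumptions about how the data could be stored.

```python
import numpy as np

def marginal_alignment(A, X_s, X_hat_l, X_hat_u):
    """Squared MMD between projected source modules and all target modules, as in Eq. (1).

    A        : (d_s, m) transformation matrix for the source data
    X_s      : (n_s, d_s) source modules
    X_hat_l  : (n_l, m) labeled target modules after PCA
    X_hat_u  : (n_u, m) unlabeled target modules after PCA
    """
    source_mean = (X_s @ A).mean(axis=0)                      # mean of A^T x_s in the m-dim subspace
    target_mean = np.vstack([X_hat_l, X_hat_u]).mean(axis=0)  # mean over all n_l + n_u target modules
    return float(np.sum((source_mean - target_mean) ** 2))
```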
Ref. [28] proposed a transfer learning algorithm that jointly optimizes marginal and conditional distributions, and the authors verified its effectiveness in reducing the distribution discrepancy through extensive experiments. As the labeled modules in both projects can be used to measure the conditional distribution discrepancy, we also define the conditional distribution alignment term $E_C(A, S, T_L)$ as shown below:
$$E_C(A, S, T_L) = \sum_{c=1}^{C} \bigg[ \Big\| \frac{1}{n_s^c} \sum_{i=1}^{n_s^c} A^\top x_{s_i,c} - \frac{1}{n_l^c} \sum_{j=1}^{n_l^c} \hat{x}_{l_j,c} \Big\|^2 + \frac{1}{n_s^c n_l^c} \sum_{i=1}^{n_s^c} \sum_{j=1}^{n_l^c} \big\| A^\top x_{s_i,c} - \hat{x}_{l_j,c} \big\|^2 \bigg], \quad (2)$$
where the former term is designed to match the conditional distributions between the source and target data approximately, and the latter further aggregates the source and target modules of the same category in the target metric subspace.
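The per-class term in Equation (2) can be transcribed in the same way; the sketch below assumes arrays X_s, y_s for the source modules and X_hat_l, y_l for the PCA-projected labeled target modules, and is only meant to illustrate the two parts of the sum (class-mean matching and pairwise aggregation).

```python
import numpy as np

def conditional_alignment(A, X_s, y_s, X_hat_l, y_l):
    """Class-conditional alignment between projected source and labeled target data, as in Eq. (2)."""
    total = 0.0
    for c in np.unique(y_s):
        S_c = X_s[y_s == c] @ A          # class-c source modules projected into the target subspace
        L_c = X_hat_l[y_l == c]          # class-c labeled target modules
        if len(S_c) == 0 or len(L_c) == 0:
            continue
        # distance between the class-c means (first part of Eq. (2))
        total += float(np.sum((S_c.mean(axis=0) - L_c.mean(axis=0)) ** 2))
        # pairwise aggregation of same-class source/target modules (second part of Eq. (2))
        diff = S_c[:, None, :] - L_c[None, :, :]
        total += float(np.sum(diff ** 2)) / (len(S_c) * len(L_c))
    return total
```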
In order to preserve the structure of the transformed source data, we impose an additional class-wise locality constraint [29] on the projected source data and define the class-wise locality-preserving term $E_S(A, S)$ as shown below:
$$E_S(A, S) = \sum_{i=1}^{n_s} \sum_{j=1}^{n_s} w_{ij} \big\| A^\top x_{s_i} - A^\top x_{s_j} \big\|^2, \quad (3)$$
where $w_{ij}$ is defined as follows:
$$w_{ij} = \begin{cases} \exp\big( -\| x_{s_i} - x_{s_j} \|^2 / \delta^2 \big) & \text{if } \{x_{s_i}, x_{s_j}\} \subset X_S^c \\ 0 & \text{otherwise.} \end{cases} \quad (4)$$
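Equation (4) is a Gaussian affinity restricted to same-class pairs; a direct, if naive, way to build the full weight matrix is sketched below (the bandwidth delta is a free parameter not fixed by the paper).

```python
import numpy as np

def classwise_locality_weights(X_s, y_s, delta):
    """Gaussian affinity w_ij between same-class source modules (Eq. (4)); zero across classes."""
    n_s = X_s.shape[0]
    W = np.zeros((n_s, n_s))
    for i in range(n_s):
        for j in range(n_s):
            if y_s[i] == y_s[j]:
                W[i, j] = np.exp(-np.sum((X_s[i] - X_s[j]) ** 2) / delta ** 2)
    return W
```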
In summary, the final objective function can be integrated based on Equations (1)–(3) as follows:
$$\min_{A} \; E_M(A, S, T_L, T_U) + E_C(A, S, T_L) + E_S(A, S) + \lambda \|A\|^2, \quad (5)$$
where $\|A\|^2$ is the regularization term that controls the complexity of $A$ to avoid over-fitting, and $\lambda$ is the parameter of the regularization term. The transformation matrix $A$ can be optimized by minimizing Equation (5).

3.2.2. Matching the Distributions of Source and Target Data Based on Landmarks

Although the objective function in Equation (5) considers reducing the discrepancy between both projects from different aspects, it still treats all modules as equally important and thus ignores the negative impact brought about by noise modules. As discussed in Section 1.1.1, this may inhibit the learning of the transformation matrix and lead to poor performance. In order to solve this problem, we weight the modules of the source and target projects and regard the modules with nonzero weights as landmarks. Based on this idea, the objective function in Equation (5) can be redefined as shown below:
$$\begin{aligned} \min_{A, \alpha, \beta} \;& E_M(A, S, T_L, T_U, \alpha, \beta) + E_C(A, S, T_L, T_U, \alpha, \beta) + E_S(A, S, \alpha) + \lambda \|A\|^2, \\ \text{s.t.} \;& \alpha_i^c, \beta_i^c \in [0, 1], \quad \alpha_c^\top \mathbf{1}_{n_S^c} = \beta_c^\top \mathbf{1}_{n_T^c} = \theta, \end{aligned} \quad (6)$$
where $\alpha = [\alpha_1; \ldots; \alpha_c; \ldots; \alpha_C] \in \mathbb{R}^{n_s}$ represents the weights of the source modules, and $\alpha_c$ represents the weights of the modules belonging to category $c$ in the source project. Because the small number of labeled target modules can provide discriminative information, all of the modules in $T_L$ are considered landmarks and their weights are fixed to 1, whereas the unlabeled modules in the target project are weighted using $\beta$. Similarly, $\beta = [\beta_1; \ldots; \beta_c; \ldots; \beta_C] \in \mathbb{R}^{n_u}$ represents the weights of the unlabeled modules in the target project, where $\beta_c$ refers to the weights of the target modules that are predicted as category $c$ (pseudo-label). $\theta \in [0, 1]$ controls the ratio of the source data and target test data used for distribution adaptation.
Based on the module weights $\alpha$ and $\beta$ defined above, the extended marginal distribution alignment term $E_M(A, S, T_L, T_U, \alpha, \beta)$ can be computed as follows:
$$E_M(A, S, T_L, T_U, \alpha, \beta) = \Big\| \frac{1}{\theta n_s} \sum_{i=1}^{n_s} \alpha_i A^\top x_{s_i} - \frac{1}{n_l + \theta n_u} \Big( \sum_{j=1}^{n_l} \hat{x}_{l_j} + \sum_{j=1}^{n_u} \beta_j \hat{x}_{u_j} \Big) \Big\|^2. \quad (7)$$
In order to match the conditional distributions of the source and target data, we use the labeled project data ($S$ and $T_L$) to train a classification model and predict each unlabeled module $\hat{x}_{u_i}$ to generate its corresponding pseudo-label $\tilde{y}_{u_i}$. With the assistance of the obtained pseudo-labels, the unlabeled modules in $T_U$ can be assigned specific categories. Therefore, the extended conditional distribution alignment term $E_C(A, S, T_L, T_U, \alpha, \beta)$ can be defined as follows:
$$E_C(A, S, T_L, T_U, \alpha, \beta) = \sum_{c=1}^{C} \Big( E_{cond}^c + \frac{1}{e_c} E_{embed}^c \Big), \quad (8)$$
where $e_c = \theta n_s^c n_l^c + \theta n_l^c n_u^c + \theta^2 n_u^c n_s^c$. $E_{cond}^c$ and $E_{embed}^c$ are separately defined below:
$$E_{cond}^c = \Big\| \frac{1}{\theta n_s^c} \sum_{i=1}^{n_s^c} \alpha_i A^\top x_{s_i,c} - \frac{1}{n_l^c + \theta n_u^c} \Big( \sum_{j=1}^{n_l^c} \hat{x}_{l_j,c} + \sum_{j=1}^{n_u^c} \beta_j \hat{x}_{u_j,c} \Big) \Big\|^2, \quad (9)$$
$$E_{embed}^c = \sum_{i=1}^{n_s^c} \sum_{j=1}^{n_l^c} \big\| \alpha_i A^\top x_{s_i,c} - \hat{x}_{l_j,c} \big\|^2 + \sum_{i=1}^{n_l^c} \sum_{j=1}^{n_u^c} \big\| \hat{x}_{l_i,c} - \beta_j \hat{x}_{u_j,c} \big\|^2 + \sum_{i=1}^{n_u^c} \sum_{j=1}^{n_s^c} \big\| \beta_i \hat{x}_{u_i,c} - \alpha_j A^\top x_{s_j,c} \big\|^2. \quad (10)$$
It can be seen that Equation (8) essentially extends Equation (2) by incorporating the unlabeled data $T_U$ and its corresponding pseudo-labels $\{\tilde{y}_{u_i}\}_{i=1}^{n_u}$. With the module weights $\alpha$ in the source project, the class-wise locality-preserving term $E_S(A, S, \alpha)$ can be defined as follows:
$$E_S(A, S, \alpha) = \frac{1}{\theta^2 n_s^2} \sum_{i=1}^{n_s} \sum_{j=1}^{n_s} w_{ij} \big\| \alpha_i A^\top x_{s_i} - \alpha_j A^\top x_{s_j} \big\|^2, \quad (11)$$
where $w_{ij}$ is defined in Equation (4).
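To see how the landmark weights enter the objective, the sketch below evaluates the weighted marginal term of Equation (7); alpha and beta are the module weights learned later in Section 3.2.3, the labeled target modules keep a fixed weight of 1 as stated above, and the names are illustrative assumptions.

```python
import numpy as np

def weighted_marginal_alignment(A, X_s, X_hat_l, X_hat_u, alpha, beta, theta):
    """Landmark-weighted marginal alignment term, as in Eq. (7)."""
    n_s, n_l, n_u = X_s.shape[0], X_hat_l.shape[0], X_hat_u.shape[0]
    source_mean = (alpha[:, None] * (X_s @ A)).sum(axis=0) / (theta * n_s)
    target_sum = X_hat_l.sum(axis=0) + (beta[:, None] * X_hat_u).sum(axis=0)
    target_mean = target_sum / (n_l + theta * n_u)
    return float(np.sum((source_mean - target_mean) ** 2))
```

Modules whose weights are close to zero effectively drop out of the weighted means, which is how noise modules are suppressed during distribution matching.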

3.2.3. Solution

Because the objective function in Equation (6) is a non-convex joint optimization problem with respect to $A$, $\alpha$, and $\beta$, we employed the iterative optimization method described in Ref. [32] to learn the transformation matrix $A$ and the landmark weights $\alpha$ and $\beta$ alternately.
(1)
Optimizing A
In order to learn the transformation matrix $A$, we first treat the landmark weights $\alpha$ and $\beta$ as constants. Then, we compute the first-order derivative of Equation (6) with respect to $A$ and set it to zero. Finally, the closed-form solution of $A$ can be obtained as follows:
$$A = \big( \lambda I_{d_s} + X_S H_S X_S^\top \big)^{-1} \big( X_S ( H_L \hat{X}_L^\top + H_U \hat{X}_U^\top ) \big), \quad (12)$$
where $I_{d_s}$ is the identity matrix of dimension $d_s$. $X_S \in \mathbb{R}^{d_s \times n_s}$, $\hat{X}_L \in \mathbb{R}^{m \times n_l}$, and $\hat{X}_U \in \mathbb{R}^{m \times n_u}$ represent the source data, training target data, and test target data, respectively. The element $(H_S)_{i,j}$ in $H_S \in \mathbb{R}^{n_s \times n_s}$ refers to the derivative coefficient associated with $x_{s_i}^\top x_{s_j}$. A similar explanation applies to $H_L \in \mathbb{R}^{n_s \times n_l}$ and $H_U \in \mathbb{R}^{n_s \times n_u}$.
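A minimal sketch of the closed-form update in Equation (12) is shown below; it assumes the coefficient matrices H_S, H_L, and H_U have already been assembled from the current landmark weights and pseudo-labels, which the paper does not spell out, so the function signature is an assumption.

```python
import numpy as np

def update_A(X_S, X_hat_L, X_hat_U, H_S, H_L, H_U, lam):
    """Closed-form update of the transformation matrix A, as in Eq. (12).

    X_S     : (d_s, n_s) source data (modules as columns)
    X_hat_L : (m, n_l)   labeled target data in the PCA subspace
    X_hat_U : (m, n_u)   unlabeled target data in the PCA subspace
    H_S, H_L, H_U : coefficient matrices of sizes (n_s, n_s), (n_s, n_l), (n_s, n_u)
    lam     : regularization parameter lambda
    """
    d_s = X_S.shape[0]
    lhs = lam * np.eye(d_s) + X_S @ H_S @ X_S.T         # (d_s, d_s)
    rhs = X_S @ (H_L @ X_hat_L.T + H_U @ X_hat_U.T)     # (d_s, m)
    return np.linalg.solve(lhs, rhs)                    # A in R^{d_s x m}
```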
(2)
Optimizing α and β
Given the transformation matrix A , Equation (6) can be rewritten as:
$$\begin{aligned} \min_{\alpha, \beta} \;& \frac{1}{2} \alpha^\top K_{S,S} \alpha + \frac{1}{2} \beta^\top K_{U,U} \beta - \alpha^\top K_{S,U} \beta + k_{S,L}^\top \alpha + k_{U,L}^\top \beta \\ \text{s.t.} \;& \alpha_i^c, \beta_i^c \in [0, 1], \quad \alpha_c^\top \mathbf{1}_{n_S^c} = \beta_c^\top \mathbf{1}_{n_T^c} = \theta, \end{aligned} \quad (13)$$
where the element $(K_{S,S})_{i,j}$ in $K_{S,S} \in \mathbb{R}^{n_s \times n_s}$ denotes the correlation coefficient associated with $(A^\top x_{s_i})^\top A^\top x_{s_j}$. A similar explanation applies to the element $(K_{U,U})_{i,j}$ in $K_{U,U} \in \mathbb{R}^{n_u \times n_u}$ and the element $(K_{S,U})_{i,j}$ in $K_{S,U} \in \mathbb{R}^{n_s \times n_u}$. The element $(k_{S,L})_i$ in $k_{S,L} \in \mathbb{R}^{n_s}$ denotes the sum of the coefficients of $(A^\top x_{s_i})^\top \hat{x}_{l_1}, (A^\top x_{s_i})^\top \hat{x}_{l_2}, \ldots, (A^\top x_{s_i})^\top \hat{x}_{l_{n_l}}$. A similar explanation applies to the element $(k_{U,L})_i$ in $k_{U,L} \in \mathbb{R}^{n_u}$. Based on the above formulations, Equation (13) is transformed into the following quadratic programming problem, which can be solved using existing tools (i.e., the quadprog function in MATLAB R2023b):
$$\min_{z_i \in [0,1],\; Z^\top V = W} \; \frac{1}{2} Z^\top B Z + b^\top Z, \quad (14)$$
where $Z = [\alpha; \beta]$, $B = \begin{bmatrix} K_{S,S} & -K_{S,U} \\ -K_{S,U}^\top & K_{U,U} \end{bmatrix}$, $b = [k_{S,L}; k_{U,L}]$,
$$W \in \mathbb{R}^{1 \times 2C} \;\text{with}\; (W)_c = \begin{cases} \theta\, n_S^c & \text{if } c \le C \\ \theta\, n_S^{c-C} & \text{if } c > C, \end{cases}$$
$$V = \begin{bmatrix} V_S & \mathbf{0}_{n_s \times C} \\ \mathbf{0}_{n_u \times C} & V_U \end{bmatrix} \in \mathbb{R}^{(n_s + n_u) \times 2C} \;\text{with}$$
$$(V_S)_{ij} = \begin{cases} 1 & \text{if } x_{s_i} \in \text{class } j \\ 0 & \text{otherwise,} \end{cases} \qquad (V_U)_{ij} = \begin{cases} 1 & \text{if } x_{u_i} \text{ is predicted as class } j \\ 0 & \text{otherwise.} \end{cases}$$
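The paper solves Equation (14) with MATLAB's quadprog; an analogous formulation with SciPy's general-purpose solver is sketched below for illustration (the variable names and the choice of SLSQP are assumptions, not the authors' setup).

```python
import numpy as np
from scipy.optimize import minimize

def solve_landmark_weights(B, b, V, W, z0):
    """Solve the QP of Eq. (14): min 1/2 Z'BZ + b'Z  s.t.  z_i in [0, 1], Z'V = W.

    B : (n, n) quadratic term, b : (n,) linear term, V : (n, 2C), W : (2C,),
    z0: (n,) feasible starting point (e.g., theta for every module).
    """
    fun = lambda z: 0.5 * z @ B @ z + b @ z
    jac = lambda z: B @ z + b                        # gradient, assuming B is symmetric
    cons = {"type": "eq", "fun": lambda z: z @ V - W}
    res = minimize(fun, z0, jac=jac, bounds=[(0.0, 1.0)] * len(z0),
                   constraints=[cons], method="SLSQP")
    return res.x                                     # Z = [alpha; beta]
```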
After obtaining the transformation matrix $A$ and the landmark weights $\alpha$ and $\beta$, we can utilize the transformed source data and training target data, combined with the landmark weights, to train the classification model, which is used to predict the pseudo-labels $\{\tilde{y}_{u_i}\}_{i=1}^{n_u}$ of the test target data $\{\hat{x}_{u_i}\}_{i=1}^{n_u}$.

3.3. Progressive Pseudo-Label Selection

In the initial stage of learning, the transformation matrix $A$ and the landmark weights $\alpha$ and $\beta$ cannot be fully optimized, which may degrade the performance of the classification model and further lead to low-quality pseudo-labels. For this reason, we design a progressive pseudo-label selection strategy that selects a subset of pseudo-labels in each iteration for the optimization of the objective function. First, we define $\{(\hat{x}_{u_i}, \tilde{y}_{u_i}, p(\tilde{y}_{u_i} \mid \hat{x}_{u_i}))\}_{i=1}^{n_u}$, where $p(\tilde{y}_{u_i} \mid \hat{x}_{u_i})$ denotes the probability of predicting the label of $\hat{x}_{u_i}$ as $\tilde{y}_{u_i}$. In the $k$th iteration, we then select the top $n_u \cdot k/T$ highest-probability modules to optimize $A$, $\alpha$, and $\beta$, where $T$ is the total number of iterations. In order to avoid the situation in which the selected pseudo-labels all belong to the same category, we separately select the top $n_u^c \cdot k/T$ highest-probability modules for each category $c$, where $n_u^c$ denotes the number of modules predicted as category $c$ in the test target data.
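The selection rule can be written compactly as below; the rounding of $n_u^c \cdot k/T$ is not specified in the text, so the ceiling used here is an assumption.

```python
import numpy as np

def select_pseudo_labels(probs, pseudo_labels, k, T):
    """Select the top n_u^c * k / T highest-probability modules per category (Section 3.3).

    probs         : (n_u,) predicted probability p(y_tilde | x_hat) for each test target module
    pseudo_labels : (n_u,) current pseudo-labels
    k, T          : current iteration and total number of iterations
    """
    selected = []
    for c in np.unique(pseudo_labels):
        idx_c = np.where(pseudo_labels == c)[0]
        n_take = int(np.ceil(len(idx_c) * k / T))        # grows linearly with the iteration k
        order = idx_c[np.argsort(-probs[idx_c])]         # class-c modules, most confident first
        selected.extend(order[:n_take].tolist())
    return np.array(sorted(selected))                    # indices of the selected modules
```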
Algorithm 1 describes the detailed process of our approach. Before the iterations, we first preprocess the source and target data (lines 3–4). Then, we project the target data into the metric subspace with $m$ dimensions using principal component analysis (line 5). The transformation matrix $A$ and the pseudo-labels of the test target data $\{\tilde{y}_{u_i}\}_{i=1}^{n_u}$ are initialized (lines 6–7). In each iteration, we first select the pseudo-labels and their corresponding modules and then use them to update the transformation matrix $A$ and the landmark weights $\alpha$ and $\beta$ sequentially (lines 9–11). Next, given $A$, $\alpha$, and $\beta$, the pseudo-labels of the test target modules are updated (line 12). When the iterations are over, we regard the pseudo-labels of the test target modules as the final prediction results.
Algorithm 1 Pseudo-code of LDASP
1: Input: source data $S = \{x_{s_i}, y_{s_i}\}_{i=1}^{n_s}$, training target data $T_L = \{x_{l_i}, y_{l_i}\}_{i=1}^{n_l}$, test target data $T_U = \{x_{u_i}\}_{i=1}^{n_u}$; dimension of metric subspace $m$; $\theta$; iteration number $T$.
2: Output: predicted labels $\{y_{u_i}\}_{i=1}^{n_u}$ of the test target data $\{x_{u_i}\}_{i=1}^{n_u}$.
3: Remove the duplicated modules and the modules with missing metric values from $S$ and $\{T_L, T_U\}$;
4: Use z-score normalization to scale the data from $S$ and $\{T_L, T_U\}$;
5: Project the target data $\{x_{l_i}\}_{i=1}^{n_l}$ and $\{x_{u_i}\}_{i=1}^{n_u}$ into the metric subspace with $m$ dimensions;
6: Initialize the transformation matrix $A$ via Equation (5);
7: Initialize the pseudo-labels of the test target data $\{\tilde{y}_{u_i}\}_{i=1}^{n_u}$ as described in Section 3.3;
8: for $k = 1$ to $T$ do
9:     For each category $c$, select the $n_u^c \cdot k/T$ pseudo-labels with the highest probability from $\{\tilde{y}_{u_i}\}_{i=1}^{n_u}$ and their corresponding modules;
10:    Update the transformation matrix $A$ according to Equation (12);
11:    Update the landmark weights $\alpha$ and $\beta$ according to Equation (14);
12:    Update the pseudo-labels of the test target data $\{\tilde{y}_{u_i}\}_{i=1}^{n_u}$ as described in Section 3.3;
13: end for

4. Experiment Setup

In this work, experiments are conducted to answer the following research questions and verify the effectiveness of our approach.
  • RQ1: How effective is LDASP in HDP with mixed-project data?
  • RQ2: What is the effect of each component of LDASP on improving HDP performance?

4.1. Dataset and Evaluation Indicator

For a comprehensive evaluation, we selected 27 projects from four widely used public datasets, including NASA [33], AEEEM [34], PROMISE [35], and JIRA [36]. Table 2 displays the dataset details, where the last column lists what object a module specifically refers to in the corresponding project. It can be seen that the four datasets contain different numbers of metrics, which accords with the HDP setting.
In order to evaluate the overall performance of the competing methods, we employed the frequently used indicators, including the F1-score and AUC. The F1-score [8] is the harmonic mean of recall (probability of detection) and precision. According to the confusion matrix in Table 3, the F1-score can be calculated as $\frac{2 \times recall \times precision}{recall + precision}$, where recall is defined as $tp/(tp + fn)$ and precision is defined as $tp/(tp + fp)$. The AUC is the area under the receiver operating characteristic curve, which is plotted in a two-dimensional space with pf (probability of false alarm) as the x-coordinate and pd (probability of detection) as the y-coordinate. The AUC is a well-known indicator for comparing different models [5,16,37,38,39,40] because it is unaffected by class imbalance and is independent of the prediction threshold. The higher the AUC is, the better the performance of the prediction model; an AUC of 0.5 corresponds to the performance of a random predictor [41]. Moreover, Lessmann et al. [38] and Ghotra et al. [37] suggested using the AUC for better cross-dataset comparability. Hence, we select the AUC as one of the indicators.
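Both indicators can be computed from a project's predictions in a few lines; the sketch below uses scikit-learn with made-up labels and probabilities purely for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Hypothetical predictions for one target project: 1 = defective, 0 = non-defective.
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_prob = np.array([0.81, 0.33, 0.12, 0.64, 0.55, 0.72, 0.18, 0.41])  # predicted defect probabilities
y_pred = (y_prob >= 0.5).astype(int)

print("F1-score:", f1_score(y_true, y_pred))       # harmonic mean of recall and precision
print("AUC     :", roc_auc_score(y_true, y_prob))  # threshold-independent ranking quality
```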

4.2. Experimental Design

This section provides the detailed experimental design for each research question.

4.2.1. RQ1: How Effective Is LDASP in HDP with Mixed-Project Data?

For RQ1, we compared LDASP with three types of related methods (described below) to evaluate its performance.
WPDP method: This method uses labeled data to build the prediction model that is then employed to determine whether the remaining unlabeled modules are defective or non-defective. Because the training and test data come from the same project, considerable prediction results can be generally obtained when the labeled modules are sufficient.
Unsupervised method: Chen et al. [42] found through extensive experiments that mainstream unsupervised prediction methods perform better than HDP ones in terms of traditional and effort-aware indicators. Therefore, we also chose the unsupervised methods SC [5] and ManualDown [43], which show considerable performance, as baselines. SC is a spectral clustering-based unsupervised method that achieves a better prediction effect than supervised models. ManualDown is a simple and effective unsupervised method based on the number of lines of code, and extensive experiments have shown that it performs better than most current CPDP methods. Note that the threshold of ManualDown was set to 50%, following the suggestion of the original paper [43].
HDP method: Three mixed-project HDP methods, i.e., CLSUP [11], sMDA [10], and DSKMDA [12], were selected to compare with our approach. CLSUP not only utilizes the labeled data within source and target projects during the metric transformation but also imposes the costs of misclassification for the class imbalance problem. sMDA and DSKMDA are different extension versions of the manifold discriminant alignment algorithm [44]. In addition, we also selected a conventional HDP method, i.e., an ensemble method based on aligned metric representation (EAMR) [23], to verify the effectiveness of our approach.
In order to evaluate the statistical significance of the performance difference between two methods, we employed the nonparametric Wilcoxon signed-rank test [45] at the 95% confidence level and, following previous studies [6,46,47,48], we report the Win/Tie/Lose results of LDASP vs. each baseline. Moreover, we used the nonparametric effect size test Cliff's delta ($\delta$, with values in [−1, 1]) [49] to measure the magnitude of the performance difference between the competitors. Table 4 presents the mappings of the $\delta$ values to the corresponding effectiveness levels.
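For reference, the two statistical tests can be reproduced as follows; the per-combination scores are invented placeholders, and the Cliff's delta implementation is a straightforward pairwise count rather than a library call.

```python
import numpy as np
from scipy.stats import wilcoxon

def cliffs_delta(a, b):
    """Cliff's delta in [-1, 1]: P(a > b) - P(a < b) over all pairs of observations."""
    a, b = np.asarray(a), np.asarray(b)
    greater = sum(np.sum(x > b) for x in a)
    less = sum(np.sum(x < b) for x in a)
    return (greater - less) / (len(a) * len(b))

# Hypothetical paired F1-scores of LDASP and one baseline on the same prediction combinations.
ldasp_f1    = np.array([0.52, 0.61, 0.47, 0.58, 0.66, 0.49])
baseline_f1 = np.array([0.45, 0.57, 0.44, 0.51, 0.60, 0.50])

stat, p = wilcoxon(ldasp_f1, baseline_f1)   # paired nonparametric test at the 95% confidence level
print("p-value:", p, "Cliff's delta:", cliffs_delta(ldasp_f1, baseline_f1))
```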

4.2.2. RQ2: What Is the Effect of Each Component of LDASP on Improving HDP Performance?

In order to investigate RQ2, we constructed a series of variants based on the proposed LDASP, as shown in Table 5. This table lists the compared methods and their corresponding explanations. For example, LDASP_noMarginal denotes the variant of our approach without the marginal distribution alignment term in the objective function, i.e., Equation (6); DASP refers to the variant of our approach that does not consider the landmark weights for the source and target modules, which is equivalent to setting $\alpha = \mathbf{1}$ and $\beta = \mathbf{1}$. In the experiments related to RQ2, we compared LDASP with five variants for all evaluation indicators. Furthermore, the nonparametric Wilcoxon signed-rank test [45] at a confidence level of 95% and the nonparametric effect size test of Cliff's delta were conducted between LDASP and each variant.

4.3. Evaluation Protocol

In the experiments, we constructed heterogeneous prediction combinations based on all the projects and used the predictor trained using the source data to predict the target project. For each prediction combination (i.e., source⇒target), we chose one project from the 27 total projects as the target. Another project from the datasets that did not contain the target project was selected as the source. Supposing that the target project is selected from AEEEM, the source project should come from NASA, PROMISE, or JIRA. In this way, 530 prediction combinations can be constructed based on these 27 projects from four datasets. Considering that there exists a small amount of labeled data in the target project, we randomly split 10% of the target modules for the training target data, and the remaining 90% were regarded as the test target data, as in Refs. [10,11]. Each prediction combination was executed 20 times, and we ultimately report the average indicator results of the competing methods on each target project.
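The splitting protocol amounts to repeatedly drawing a random 10%/90% partition of the target project's modules; a sketch of that step is given below (the project size and seed are arbitrary).

```python
import numpy as np

def split_target(n_modules, train_ratio=0.10, repeats=20, seed=0):
    """Randomly split a target project into 10% training and 90% test modules, repeated 20 times."""
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(repeats):
        idx = rng.permutation(n_modules)
        n_train = int(round(train_ratio * n_modules))
        splits.append((idx[:n_train], idx[n_train:]))   # (training target indices, test target indices)
    return splits

splits = split_target(500)   # e.g., a target project with 500 modules
print(len(splits), len(splits[0][0]), len(splits[0][1]))   # 20 repeats, 50 training, 450 test modules
```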

4.4. Parameter Setting

For the parameters of LDASP, we set the metric subspace dimension $m$, the number of iterations $T$, and $\theta$ to $\min(d_s, d_t)/2$, 5, and 0.5, respectively. For each competing method, we used the logistic regression (LR) classifier to conduct the predictions. This is mainly because LR has been applied in extensive software defect prediction research and has exhibited better classification ability than other classifiers. The LR classifier is implemented with LIBLINEAR [50] and adopts the parameter settings “−s 0” (i.e., logistic regression) and “−b 1” (i.e., probability estimates are output), as suggested in Ref. [51].
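A rough scikit-learn equivalent of this classifier setup is sketched below; its liblinear solver approximates the LIBLINEAR “−s 0” configuration, and feeding the landmark weights in as sample weights is our own assumption about how they could enter the classifier, not a detail taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: transformed source modules plus labeled target modules,
# each with a landmark weight (labeled target weights fixed to 1).
rng = np.random.default_rng(0)
X_train = rng.random((120, 10))
y_train = rng.integers(0, 2, 120)
weights = rng.random(120)

clf = LogisticRegression(solver="liblinear")            # L2-regularized logistic regression
clf.fit(X_train, y_train, sample_weight=weights)        # landmark weights as sample weights (assumption)
probs = clf.predict_proba(rng.random((30, 10)))[:, 1]   # defect probabilities for test target modules
```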

5. Experimental Result

This section reports and analyzes the experimental results for each research question and further summarizes the corresponding conclusions.

5.1. Results for RQ1

Table 6 and Table 7 report the mean results of the competing methods for all the target projects, in which the best result for each project is marked in bold. The “Average” presents the overall mean results across the 27 target projects.
  • Our approach obtains the best results for the F1-score and AUC for more than half (i.e., 15 out of 27 and 14 out of 27) of the target projects. Moreover, it also achieves the best overall performance for each indicator in terms of the “Average” results.
  • When compared with the mixed-project HDP methods (i.e., CLSUP, sMDA, and DSKMDA), our approach separately improves the F1-score and AUC by 9.8–20.2% and 5.3–11.3% in terms of the overall mean results. This is mainly because the learned landmark weights eliminate the negative impact brought about by the noise modules to some extent; meanwhile, selective pseudo-labeling further promotes prediction performance.
  • When compared with the conventional HDP method (i.e., EAMR), our approach makes improvements of 15.8% and 4.8% on the F1-score and AUC, respectively. As a conventional HDP method that does not use training target data, EAMR even shows better performance than the mixed-project HDP baselines (i.e., CLSUP, sMDA, and DSKMDA). However, its over-reliance on the label information of the source data also prevents its performance from improving further.
  • When compared with the unsupervised methods (i.e., SC and ManualDown), our approach improves the average F1-score and AUC by at least 10.6% and 11.6%, respectively. The possible reason is that the lack of labeled modules leaves unsupervised methods unable to obtain reliable discriminative information, which inhibits the improvement in prediction performance.
  • When compared with the WPDP method, our approach separately improves the average F1-score and AUC by 18.7% and 14.4%, which means that the effective utilization of labeled source data can increase the prediction effect on the target project.
“Win/Tie/Lose” and “N/S/M/L” in Table 6 and Table 7 exhibit the results of the nonparametric Wilcoxon signed-rank test and the nonparametric effect size test, respectively. From these tables, we can obtain the following observations:
  • In terms of “Win/Tie/Lose”, our approach obtains significant performance improvements for most projects when compared to the competing methods. Even against EAMR, which had the best performance among the baselines, our approach wins the comparisons with significant advantages in 22 out of 27 and 17 out of 27 projects according to the F1-score and AUC, respectively.
  • In terms of “N/S/M/L” ( δ > 0 ), our approach achieves non-negligible performance improvements for more than 20 projects when compared to the competing methods, except for the best baseline EAMR. Even compared to EAMR, our approach could still obtain non-negligible performance improvements for more than half of the cases (i.e., 22 out of 27 and 16 out of 27 projects).
Answer to RQ1: Our approach shows the best overall performance among all the competitors in terms of the F1-score and AUC. Moreover, its performance improvements are statistically significant in general.

5.2. Results for RQ2

Figure 2 displays the prediction results of each method for the F1-score and AUC, in which the horizontal line and diamond in a box denote the median and mean results, respectively. This figure shows that our approach generally obtains the best median and mean results on both indicators, whereas the degree of improvement varies, which indicates the positive effect of each component of LDASP on defect prediction. As shown in this figure, the improvements of our approach over LDASP_noMarginal, LDASP_noCondition, and DASP are visible, whereas the improvements over LDASP_noLocal and LDAP are slight.
To further illustrate the performance difference, Table 8 provides the statistical significance test and effect size test results of LDASP versus other competitors. Based on the observation of Table 8, we can conclude the following:
  • When compared with LDASP_noMarginal, our approach wins the comparisons on both indicators in almost all projects (i.e., 21 out of 27 and 25 out of 27 projects). In other words, LDASP_noMarginal shows significant performance degradation without the marginal distribution alignment term. Moreover, in terms of N/S/M/L ($\delta$ > 0), our approach obtains non-negligible performance improvements in the F1-score and AUC on 22 out of 27 projects for each indicator. This is because the marginal distribution alignment can reduce the discrepancy between the source and target data. Therefore, a failure to consider marginal distribution alignment may result in an insufficient reduction in the data distribution difference between both projects, which causes the inadaptation of the predictor to the target data.
  • When compared with LDASP_noCondition, our approach maintains consistent performance improvements in general. In terms of Win/Tie/Lose, our approach wins the comparisons in all projects on the F1-score and AUC. Furthermore, the N/S/M/L ($\delta$ > 0) rows show that our approach obtains large performance improvements in both indicators on almost all projects, except for a medium improvement in the AUC on one project. Thus, removing the conditional distribution alignment term weakens the prediction effect significantly. The possible reason is that conditional distribution alignment can alleviate the distribution difference between the modules of the same category in different projects, thereby enhancing the classification ability of predictors on target data.
  • When compared with LDASP_noLocal, our approach wins the comparisons of both indicators in 14 out of 27 and 19 out of 27 projects, respectively. Furthermore, regarding N/S/M/L ($\delta$ > 0), our approach achieves non-negligible performance improvements in the F1-score and AUC in more than half of the projects. This illustrates that although preserving the class-wise locality brings about limited improvements, it still shows a significantly positive effect on defect predictions in general.
  • When compared with DASP, our approach improves both indicators significantly in most projects in terms of Win/Tie/Lose and N/S/M/L ( δ > 0 ). Especially for the AUC, our approach wins the competition against DASP and achieves non-negligible performance improvements in all projects. Based on the above observation, we can learn that the learned landmark weights are conducive to the improvement in predictions by highlighting the relevant modules and mitigating the negative transfer of noise modules.
  • When compared with LDAP, our approach wins the comparisons while achieving non-negligible performance improvements for more than half of the cases, which demonstrates that the proposed progressive pseudo-labeling strategy effectively reduces the introduction of unreliable pseudo-labels that may hinder the learning process.
Answer to RQ2: According to the comparison results for the F1-score and AUC, each component of our approach can promote HDP performance effectively.

6. Discussion

In this section, we further discuss the performance of LDASP. Specifically, Section 6.1 explains its effectiveness. Section 6.2 investigates the impact of different percentages of the training target data on the prediction effect. Section 6.3 analyzes the sensitivity of the parameters, and Section 6.4 provides the threats to the validity of LDASP.

6.1. Effectiveness of LDASP

In order to further illustrate the effectiveness of the proposed approach, we visualized the data distributions processed by LDASP. t-SNE (t-distributed stochastic neighbor embedding) [52] was employed to reduce the dimensions of the source and target data in the metric subspace for easy display. Taking ant-1.7⇒jurby-1.1 as an example, Figure 3 and Figure 4 show the visualization results when the number of iterations equals 1 and 5, respectively. In Figure 3, LDASP initializes the transformation matrix A without considering landmark weights and pseudo-labels. It can be seen that defective modules do not gather together very well, which leads to poor separability in the metric space. After multiple iterations, the modules of different categories have good separability, as shown in Figure 4. We can observe that defective and non-defective modules gather in the upper and lower parts of the coordinate system, respectively, and are easy to separate linearly. In summary, our proposed approach can match the data distributions of the source and target projects effectively while enhancing the classification ability of the transformed data.
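The visualization step itself is standard; a sketch with scikit-learn's t-SNE on placeholder data is shown below (the real input would be the transformed source and target modules in the metric subspace).

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder for the transformed modules and their (pseudo-)labels: 1 = defective, 0 = non-defective.
rng = np.random.default_rng(0)
Z = rng.random((300, 10))
labels = rng.integers(0, 2, 300)

emb = TSNE(n_components=2, random_state=0).fit_transform(Z)   # 2-D embedding for display
plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="coolwarm", s=10)
plt.title("t-SNE view of transformed modules")
plt.show()
```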
Similar to Table 1, we also provide Table 9 to illustrate whether LDASP could reduce the negative impact of noise modules effectively. Given a source project and a target project, when the WPDP method performs better than the mixed-project HDP method, we think this is an invalid cross-project prediction combination because the source data play the role of degrading prediction performance. Therefore, the fewer invalid cross-project prediction combinations there are, the more effective the competing method is in dealing with the negative impact of the noise modules. From this table, we can find that among the competing mixed-project HDP methods, our approach has the fewest invalid cross-project prediction combinations on both indicators. The percentages of the invalid cross-project prediction combinations of our approach (i.e., 4.5% and 4.3%) are far below those of the other three methods, which demonstrates that LDASP improves the ability to eliminate the negative impact of noise modules effectively.

6.2. Effect of Different Percentages of Training Target Data

In our experiments, the percentage of training target data is fixed at 10%, as in Refs. [10,11]. In order to understand the performance of LDASP comprehensively, we explore its prediction effect under different percentages of training target data in this section. The proportion ranges from 10% to 90%, with a step size of 10%. We selected WPDP as the baseline that trains the predictor with training target data and used it to predict the test target data directly. Figure 5 presents the average indicator results of both methods under different percentages. We can see that with the growing percentage, the results of WPDP increase gradually, whereas those of LDASP remain basically stable. On the one hand, the prediction performance of WPDP improves with the increase in the training target data. On the other hand, the change in proportion has a limited impact on the performance of LDASP because the progressively selected pseudo-labels have provided relatively reliable and supervised information for it. Overall, LDASP performs better than WPDP for both indicators under most of the percentages. When the percentage is greater than about 60%, WPDP achieves a comparable prediction performance to LDASP. Hence, using WPDP directly can also obtain considerable prediction results as long as the training data within the target project are sufficient.

6.3. Sensitivity of Parameters

In order to assess the parameter sensitivity of LDASP, we investigate the impact of different hyperparameters, T and θ , on prediction performance. Figure 6 and Figure 7 exhibit the average indicator results of LDASP under different values of T and θ , respectively.
(1)
T is the number of iterations, which affects the degree of convergence of the objective function. Normally, the more iterations there are, the easier it is for the objective function to converge. As shown in Figure 6, the prediction performance of LDASP generally keeps rising with the increase in T. When T is greater than 7, its indicator results tend to be stable. Moreover, the change in prediction performance is actually slight in terms of the difference between the indicator results for various values of T. Based on this, we can believe that the objective function has been fully optimized, even though the setting of T = 5 is not optimal.
(2)
θ is used to restrict the proportion of landmarks that participate in distribution adaptation. LDASP with θ = 1 treats all modules as having the same weight and corresponds to the optimization problem in Equation (5). As shown in Figure 7, LDASP with θ = 0 and LDASP with θ = 1 perform worse than LDASP with other θ values on both indicators. Furthermore, when θ is greater than 0, there is a significant improvement in the performance of LDASP. This indicates the positive impact of introducing unlabeled modules and their pseudo-labels on improving performance. In summary, our parameter setting of θ = 0.5 is reasonable, although it does not lead LDASP to the best prediction performance.

6.4. Threats to Validity

Construct validity relates to the evaluation methods used in this work. In order to simulate the case where there is a small number of labeled data in the target project, we randomly split 10% of the target modules to make the training target data, and the remaining 90% were regarded as the test target data. This operation has been adopted from Refs. [10,11] to evaluate the performance of HDP with mixed-project data, but the changing percentages may lead to different prediction results. Moreover, we chose the F1-score and AUC, which are widely applied in HDP studies, to measure the overall prediction performance of competing methods. Other indicators (e.g., G-measure, MCC, and Popt) will be considered in our future work to improve the evaluation from different aspects.
The potential threat to internal validity may come from the replication of the baselines. We have carefully implemented the compared methods that are not open-source according to the descriptions in their original papers. However, differences between our implementations and the original code may still exist, thus leading to biases in the prediction results.
External validity refers to the generalizability of our experimental results. In the experiments, we chose 27 projects from four datasets to validate the effectiveness of the proposed approach. A total of 530 one-to-one prediction combinations were constructed to be tested. Therefore, the conclusions of this work may not be generalized to other datasets.

7. Conclusions

Combining “within” and “cross-project” data (i.e., mixed-project data) is an effective way of improving prediction performance when there are limited historical data in the target project. In this paper, we propose a novel approach based on landmark-based domain adaptation and selective pseudo-labeling for HDP using mixed-project data. We first construct the objective function that considers marginal and conditional distribution matching and class-wise locality constraints (for projected source data) simultaneously to alleviate the heterogeneity between both projects. In order to reduce the negative impact of noise modules, we introduce the landmark weights to be learned for labeled source and unlabeled target modules. Furthermore, we also design a pseudo-label selection strategy to progressively select the pseudo-labels with high confidence and the corresponding modules for the learning process. Extensive comparisons are conducted for 27 projects from four datasets, including NASA, AEEEM, PROMISE, and JIRA. Two widely used indicators (i.e., the F1-score and AUC) are employed to evaluate the overall performance of each method. The experimental results indicate that LDASP outperforms the compared methods, and this verifies the effectiveness of each component of it.
In future work, we plan to collect more experimental data from open-source projects to test our approach. Moreover, we will extend our approach to the context of effort-aware evaluation, which considers both the prediction performance and the amount of work required to check modules.

Author Contributions

Conceptualization, Y.C. and H.C.; methodology, Y.C.; software, H.C.; validation, Y.C. and H.C.; formal analysis, H.C.; investigation, Y.C.; data curation, H.C.; writing—original draft preparation, Y.C.; writing—review and editing, H.C.; visualization, Y.C. and H.C.; project administration, H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The research data are available on request from the corresponding author. The data are not publicly available due to their close relation to our future work.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Illustration of the negative impact of noise modules.
Figure 2. Comparison results of F1-score and AUC.
Figure 3. Distributions of source and target data processed by LDASP when the iteration number equals 1.
Figure 4. Distributions of source and target data processed by LDASP when the iteration number equals 5.
Figure 5. Comparison results between LDASP and WPDP under different percentages of training target data.
Figure 6. Results of LDASP for the F1-scores and AUC under different iteration numbers, T.
Figure 7. Results of LDASP for the F1-scores and AUC under different θ.
Table 1. Statistics for the comparison between WPDP and CLSUP.

Situation | F1-Score | AUC
WPDP performs better than CLSUP | 114 | 107
WPDP performs worse than CLSUP | 416 | 423
% of invalid cross-project prediction combinations | 21.5% | 20.2%
The first two rows report numbers of prediction combinations.
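The percentages in the last row are consistent with dividing the number of combinations in which WPDP performs better by the total number of prediction combinations for each indicator:

114 / (114 + 416) = 114/530 ≈ 21.5% for the F1-score, and 107 / (107 + 423) = 107/530 ≈ 20.2% for the AUC.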
Table 2. Details of datasets.

Dataset | Project | # of Metrics | # of Total Modules | # of Defective Modules (%) | Prediction Granularity
NASA | CM1 | 37 | 327 | 42 (12.84%) | Function
NASA | MW1 | 37 | 253 | 27 (10.67%) | Function
NASA | PC1 | 37 | 705 | 61 (8.65%) | Function
NASA | PC3 | 37 | 1077 | 134 (12.44%) | Function
NASA | PC4 | 37 | 1287 | 177 (13.75%) | Function
AEEEM | EQ | 61 | 324 | 129 (39.81%) | Class
AEEEM | JDT | 61 | 997 | 206 (20.66%) | Class
AEEEM | LC | 61 | 691 | 64 (9.26%) | Class
AEEEM | ML | 61 | 1862 | 245 (13.16%) | Class
AEEEM | PDE | 61 | 1497 | 209 (13.96%) | Class
PROMISE | ant-1.7 | 20 | 745 | 166 (22.28%) | Class
PROMISE | camel-1.4 | 20 | 872 | 145 (16.6%) | Class
PROMISE | ivy-2.0 | 20 | 352 | 40 (11.36%) | Class
PROMISE | jedit-4.0 | 20 | 306 | 75 (24.51%) | Class
PROMISE | log4j-1.0 | 20 | 135 | 34 (25.19%) | Class
PROMISE | poi-2.0 | 20 | 314 | 37 (11.78%) | Class
PROMISE | tomcat-6.0 | 20 | 858 | 77 (8.97%) | Class
PROMISE | velocity-1.6 | 20 | 229 | 78 (34.06%) | Class
PROMISE | xalan-2.4 | 20 | 723 | 110 (15.21%) | Class
PROMISE | xerces-1.3 | 20 | 453 | 69 (15.23%) | Class
JIRA | activemq-5.0.0 | 65 | 1884 | 293 (15.55%) | File
JIRA | derby-10.5.1.1 | 65 | 2705 | 383 (14.16%) | File
JIRA | groovy-1.6.0.beta1 | 65 | 821 | 70 (8.53%) | File
JIRA | hbase-0.94.0 | 65 | 1059 | 218 (20.59%) | File
JIRA | hive-0.9.0 | 65 | 1416 | 283 (19.99%) | File
JIRA | jruby-1.1 | 65 | 731 | 87 (11.9%) | File
JIRA | wicket-1.3.0.beta2 | 65 | 1763 | 130 (7.37%) | File
Table 3. Four kinds of defect prediction results.

  | Actual Defective | Actual Non-Defective
Predict defective | tp | fp
Predict non-defective | fn | tn
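As a quick illustration of how the F1-score reported in this paper follows from the four outcomes above, the snippet below computes it from tp, fp, and fn; the counts are invented for illustration, and the standard precision/recall-based definition is assumed.

# Illustrative only: F1-score derived from the outcomes in Table 3 using the
# standard precision/recall definitions (the counts below are made up).
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)  # share of predicted-defective modules that are truly defective
    recall = tp / (tp + fn)     # share of truly defective modules that are detected
    return 2 * precision * recall / (precision + recall)

print(f1_score(tp=30, fp=20, fn=40))  # precision = 0.6, recall ≈ 0.429, F1 = 0.5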
Table 4. Mapping Cliff's delta values to the effectiveness levels.

Cliff's Delta (δ) | Effectiveness Levels
δ < 0.147 | Negligible (N)
0.147 ≤ δ < 0.33 | Small (S)
0.33 ≤ δ < 0.474 | Medium (M)
δ ≥ 0.474 | Large (L)
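For reference, the sketch below shows one common way to compute Cliff's delta for two lists of per-combination results and to map it to the levels in Table 4. It is not the authors' code, and applying the thresholds to the magnitude of δ (with the sign tracked separately, as in the δ > 0 and δ < 0 rows of Tables 6 and 7) is an assumption.

# Illustrative sketch (not the authors' code): Cliff's delta for two result
# lists and its mapping to the effectiveness levels of Table 4.
def cliffs_delta(xs, ys):
    # delta = (#{x > y} - #{x < y}) / (|xs| * |ys|), counted over all pairs
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def effectiveness_level(delta):
    d = abs(delta)  # assumption: thresholds applied to |delta|, sign reported separately
    if d < 0.147:
        return "Negligible (N)"
    if d < 0.33:
        return "Small (S)"
    if d < 0.474:
        return "Medium (M)"
    return "Large (L)"

d = cliffs_delta([0.47, 0.55, 0.52], [0.40, 0.42, 0.43])
print(d, effectiveness_level(d))  # 1.0 Large (L)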
Table 5. Competing methods in RQ2.

Method | Description
LDASP_noMarginal | Removing the term of marginal distribution alignment
LDASP_noConditional | Removing the term of conditional distribution alignment
LDASP_noLocality | Removing the term of class-wise locality preserving
DASP | Without considering the landmark weights, i.e., α = 1 and β = 1
LDAP | Using all pseudo-labels without selection
Table 6. Comparison results for F1-score.

Target | WPDP | SC | ManualDown | EAMR | CLSUP | sMDA | DSKMDA | Ours
CM1 | 0.249 | 0.335 | 0.282 | 0.279 | 0.274 | 0.299 | 0.281 | 0.326
MW1 | 0.206 | 0.303 | 0.247 | 0.272 | 0.248 | 0.271 | 0.253 | 0.301
PC1 | 0.217 | 0.245 | 0.236 | 0.252 | 0.228 | 0.250 | 0.222 | 0.268
PC3 | 0.288 | 0.345 | 0.316 | 0.332 | 0.348 | 0.364 | 0.296 | 0.380
PC4 | 0.410 | 0.329 | 0.328 | 0.401 | 0.361 | 0.367 | 0.271 | 0.380
EQ | 0.548 | 0.557 | 0.619 | 0.594 | 0.558 | 0.539 | 0.448 | 0.581
JDT | 0.544 | 0.591 | 0.466 | 0.553 | 0.562 | 0.566 | 0.532 | 0.590
LC | 0.273 | 0.337 | 0.217 | 0.288 | 0.314 | 0.321 | 0.243 | 0.355
ML | 0.333 | 0.318 | 0.311 | 0.344 | 0.364 | 0.344 | 0.347 | 0.392
PDE | 0.325 | 0.369 | 0.333 | 0.372 | 0.366 | 0.374 | 0.358 | 0.385
ant-1.7 | 0.606 | 0.620 | 0.431 | 0.532 | 0.613 | 0.617 | 0.583 | 0.701
camel-1.4 | 0.439 | 0.481 | 0.342 | 0.343 | 0.484 | 0.487 | 0.348 | 0.522
ivy-2.0 | 0.337 | 0.367 | 0.281 | 0.381 | 0.339 | 0.377 | 0.445 | 0.423
jedit-4.0 | 0.544 | 0.508 | 0.457 | 0.518 | 0.550 | 0.551 | 0.545 | 0.628
log4j-1.0 | 0.624 | 0.577 | 0.574 | 0.580 | 0.667 | 0.625 | 0.536 | 0.712
poi-2.0 | 0.673 | 0.753 | 0.573 | 0.299 | 0.693 | 0.721 | 0.278 | 0.751
tomcat | 0.318 | 0.384 | 0.199 | 0.326 | 0.361 | 0.356 | 0.383 | 0.420
velocity-1.6 | 0.487 | 0.529 | 0.532 | 0.498 | 0.508 | 0.537 | 0.504 | 0.549
xalan-2.4 | 0.318 | 0.364 | 0.349 | 0.402 | 0.350 | 0.346 | 0.461 | 0.386
xerces-1.3 | 0.264 | 0.331 | 0.317 | 0.406 | 0.302 | 0.334 | 0.367 | 0.366
activemq-5.0.0 | 0.441 | 0.518 | 0.471 | 0.497 | 0.495 | 0.495 | 0.593 | 0.550
derby-10.5.1.1 | 0.482 | 0.553 | 0.506 | 0.434 | 0.501 | 0.502 | 0.182 | 0.596
groovy-1.6.0.beta1 | 0.260 | 0.113 | 0.249 | 0.306 | 0.301 | 0.308 | 0.301 | 0.338
hbase-0.94.0 | 0.269 | 0.303 | 0.271 | 0.513 | 0.303 | 0.322 | 0.406 | 0.337
hive-0.9.0 | 0.517 | 0.509 | 0.550 | 0.509 | 0.505 | 0.508 | 0.419 | 0.556
jruby-1.1 | 0.339 | 0.380 | 0.416 | 0.462 | 0.389 | 0.390 | 0.596 | 0.420
wicket-1.3.0.beta2 | 0.368 | 0.453 | 0.404 | 0.278 | 0.393 | 0.384 | 0.352 | 0.468
Average | 0.396 | 0.425 | 0.381 | 0.406 | 0.421 | 0.428 | 0.391 | 0.470
Win/Tie/Lose | 26/0/1 | 23/3/1 | 24/2/1 | 22/2/3 | 27/0/0 | 27/0/0 | 22/1/4 | -
N/S/M/L (δ > 0) | 0/0/0/26 | 0/0/1/22 | 0/2/0/24 | 2/5/3/14 | 0/0/0/27 | 0/0/3/24 | 0/0/0/22 | -
N/S/M/L (δ < 0) | 0/0/0/1 | 3/0/1/0 | 0/0/0/1 | 0/3/0/0 | 0/0/0/0 | 0/0/0/0 | 1/0/0/4 | -
The best results for each project and “Average” are marked in bold.
Table 7. Comparison results for AUC.

Target | WPDP | SC | ManualDown | EAMR | CLSUP | sMDA | DSKMDA | Ours
CM1 | 0.590 | 0.642 | 0.688 | 0.601 | 0.620 | 0.655 | 0.624 | 0.687
MW1 | 0.579 | 0.679 | 0.729 | 0.656 | 0.638 | 0.682 | 0.609 | 0.731
PC1 | 0.651 | 0.634 | 0.761 | 0.705 | 0.672 | 0.735 | 0.657 | 0.763
PC3 | 0.642 | 0.672 | 0.726 | 0.702 | 0.733 | 0.762 | 0.700 | 0.782
PC4 | 0.801 | 0.635 | 0.706 | 0.757 | 0.721 | 0.750 | 0.632 | 0.776
EQ | 0.626 | 0.630 | 0.680 | 0.764 | 0.722 | 0.689 | 0.644 | 0.743
JDT | 0.778 | 0.761 | 0.780 | 0.789 | 0.803 | 0.806 | 0.761 | 0.821
LC | 0.656 | 0.716 | 0.634 | 0.718 | 0.771 | 0.764 | 0.640 | 0.801
ML | 0.671 | 0.627 | 0.690 | 0.687 | 0.737 | 0.698 | 0.695 | 0.761
PDE | 0.645 | 0.671 | 0.722 | 0.726 | 0.725 | 0.737 | 0.713 | 0.746
ant-1.7 | 0.810 | 0.746 | 0.596 | 0.783 | 0.828 | 0.828 | 0.820 | 0.891
camel-1.4 | 0.689 | 0.689 | 0.580 | 0.653 | 0.746 | 0.748 | 0.664 | 0.768
ivy-2.0 | 0.694 | 0.678 | 0.648 | 0.790 | 0.736 | 0.765 | 0.795 | 0.821
jedit-4.0 | 0.726 | 0.630 | 0.623 | 0.744 | 0.740 | 0.744 | 0.739 | 0.802
log4j-1.0 | 0.679 | 0.618 | 0.568 | 0.809 | 0.769 | 0.739 | 0.751 | 0.804
poi-2.0 | 0.846 | 0.823 | 0.780 | 0.655 | 0.847 | 0.873 | 0.620 | 0.890
tomcat | 0.692 | 0.706 | 0.482 | 0.786 | 0.778 | 0.770 | 0.789 | 0.837
velocity-1.6 | 0.725 | 0.713 | 0.822 | 0.683 | 0.753 | 0.792 | 0.649 | 0.797
xalan-2.4 | 0.596 | 0.613 | 0.645 | 0.740 | 0.640 | 0.635 | 0.788 | 0.668
xerces-1.3 | 0.632 | 0.701 | 0.812 | 0.731 | 0.701 | 0.760 | 0.646 | 0.792
activemq-5.0.0 | 0.662 | 0.679 | 0.757 | 0.820 | 0.729 | 0.726 | 0.814 | 0.771
derby-10.5.1.1 | 0.706 | 0.708 | 0.777 | 0.783 | 0.724 | 0.719 | 0.663 | 0.809
groovy-1.6.0.beta1 | 0.588 | 0.356 | 0.653 | 0.760 | 0.690 | 0.711 | 0.600 | 0.744
hbase-0.94.0 | 0.688 | 0.715 | 0.797 | 0.792 | 0.745 | 0.782 | 0.635 | 0.791
hive-0.9.0 | 0.662 | 0.577 | 0.649 | 0.776 | 0.637 | 0.633 | 0.604 | 0.706
jruby-1.1 | 0.652 | 0.668 | 0.793 | 0.857 | 0.719 | 0.733 | 0.849 | 0.759
wicket-1.3.0.beta2 | 0.639 | 0.695 | 0.689 | 0.798 | 0.683 | 0.687 | 0.743 | 0.723
Average | 0.679 | 0.666 | 0.696 | 0.743 | 0.726 | 0.738 | 0.698 | 0.777
Win/Tie/Lose | 26/0/1 | 27/0/0 | 20/3/4 | 17/7/3 | 27/0/0 | 26/1/0 | 23/2/2 | -
N/S/M/L (δ > 0) | 0/0/1/25 | 0/0/0/27 | 3/0/0/20 | 3/3/1/12 | 0/0/0/27 | 1/1/4/21 | 0/1/1/22 | -
N/S/M/L (δ < 0) | 0/0/0/1 | 0/0/0/0 | 0/1/0/3 | 4/0/2/2 | 0/0/0/0 | 0/0/0/0 | 0/1/0/2 | -
The best results for each project and “Average” are marked in bold.
Table 8. Comparison results against LDASP.

Test | Indicator | LDASP_noMarginal | LDASP_noConditional | LDASP_noLocality | DASP | LDAP
Win/Tie/Lose | F1-score | 22/1/4 | 27/0/0 | 14/6/7 | 20/6/1 | 19/6/2
Win/Tie/Lose | AUC | 22/2/3 | 27/0/0 | 19/5/3 | 27/0/0 | 21/6/0
N/S/M/L (δ > 0) | F1-score | 0/0/0/22 | 0/0/0/27 | 3/5/3/6 | 4/10/6/4 | 4/14/3/2
N/S/M/L (δ > 0) | AUC | 0/1/0/22 | 0/0/1/26 | 4/10/2/6 | 0/3/8/16 | 7/19/0/0
Table 9. Statistics for the comparison between WPDP and the mixed-project HDP methods.

Situation | F1-Score | AUC
WPDP performs better than CLSUP | 114 (21.5%) | 107 (20.2%)
WPDP performs better than sMDA | 104 (19.6%) | 85 (16.0%)
WPDP performs better than DSKMDA | 228 (43.0%) | 201 (37.9%)
WPDP performs better than ours | 24 (4.5%) | 23 (4.3%)
Each cell reports the number of prediction combinations, with the percentage of invalid cross-project prediction combinations in parentheses.