Knowledge Tracing Model and Student Profile Based on Clustering-Neural-Network

Xia, Jianghua; Wang, Han; Zhuge, Qingfeng; Sha, Edwin Hsing-Mean

doi:10.3390/app13095220

School of Computer Science and Technology, East China Normal University, Shanghai 200063, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Appl. Sci. 2023, 13(9), 5220; https://0-doi-org.brum.beds.ac.uk/10.3390/app13095220

Submission received: 8 March 2023 / Revised: 1 April 2023 / Accepted: 18 April 2023 / Published: 22 April 2023

Abstract

Knowledge tracing (KT) models based on deep neural networks (DNNs) are currently widely studied to enhance personalized learning. However, before DNN-based KT models can be deployed in practice, their prediction accuracy, training efficiency, and interpretability need to be greatly improved. In this paper, we observe that the prediction accuracy of KT models can be improved by clustering the features of both students and questions. Based on this observation, a distributed KT scheme is proposed: (1) it classifies both students and questions with clustering technology to reduce the interference between different feature data and thus improve prediction accuracy; (2) models for different classifications are trained in parallel in a distributed deployment architecture to improve training efficiency; (3) a student knowledge state matrix is combined with an RPa-LLM model to display the knowledge status of students during learning, which can be used to build student portraits and thus improves the interpretability of the model. Real educational data are collected to conduct experiments. The results show that the proposed scheme improves prediction accuracy and training efficiency by 4.08% and 67.28%, respectively, compared to the baseline methods. Furthermore, the proposed method maintains the interpretability of KT models, making it suitable for practical deployment.

Keywords:

clustering; neural network; student profile; interpretability

1. Introduction

In the age of intelligence, online learning techniques have developed rapidly, facilitated by massive educational data. This data-driven approach enables intelligent education services. Using vast educational datasets, teachers can extract and analyze students’ knowledge status, which gives them valuable insight into students’ needs and allows them to offer personalized instruction tailored to each student’s requirements [1]. Knowledge Tracing (KT) is a technology that captures the state of students’ knowledge by learning from their behavior when answering questions. It is widely studied by researchers worldwide to enhance personalized learning.

With massive learning data, KT models tend to become more complex. Recent state-of-the-art (SOTA) models are based on deep learning techniques and show significant improvements over Bayesian Knowledge Tracing (BKT). KT models based on Deep Neural Networks (DNNs) have been widely studied and improved. Some works improve the prediction accuracy of Deep Knowledge Tracing (DKT) through data engineering [2,3,4,5,6]. For example, Sonkar proposes question-level Deep Knowledge Tracing (qDKT), which provides richer information and more accurate prediction by estimating students’ mastery of each question [4]. Sun introduces a Classification and Regression Tree (CART) to filter the features in the dataset and thereby improve the input-layer features [6]. Furthermore, KT models based on the attention mechanism have received wide attention in recent years, and they can clearly improve the prediction performance of DKT [7,8,9,10,11]. Some approaches refine model details by drawing on educational and cognitive theory. For example, Cheng adds “mistake” and “guess” factors to the KT model to improve its prediction accuracy [12]. Liu argues that the relationships between knowledge points affect sequence prediction and uses an attention mechanism to describe these relationships [13]. For DNN-based KT models, interpretability is also important because students and teachers need clear information to guide them in addressing deficiencies. Thus, some works focus on improving the interpretability of KT models [14,15]. For example, Yeung uses a KT model to estimate problem difficulty and student ability and then uses Item Response Theory (IRT) to predict student responses from these two quantities, which improves the interpretability of the model [14]. Su expands each student’s knowledge state vector into a knowledge state matrix that updates over time, capturing the student’s knowledge state and improving the interpretability of the model [15].

Current works rarely consider improving the performance of KT by classifying students and questions. We observe that different students react differently to different kinds of questions. Thus, if all students’ data and all questions are used to train a single KT model, its accuracy is affected by the noise between different groups of students and questions. In this work, we conduct an experiment (Section 1.1) showing that clustering data with different characteristics can improve the prediction accuracy of the overall model. The most closely related work is that of Minn, who argues that students with different learning abilities should be predicted differently; they cluster students by ability with the K-means algorithm and use the resulting ability groups as additional inputs to improve prediction [16]. However, they consider only the characteristics of the students, not those of the questions. In this work, we classify both students and questions to improve the accuracy of the models. Then, based on the proposed classification structure, a distributed deployment training architecture is designed to improve training efficiency. Additionally, a latent linear model (RPa-LLM) is used to improve the interpretability of KT. For the experiments, real data are collected, and the results show that the proposed scheme improves prediction accuracy, training efficiency, and interpretability.

1.1. Motivation

To verify the effect of data characteristics on KT, we designed the following experiment. The dataset is constructed as follows: (1) Type A students answer Type A questions correctly with some probability but always answer Type B questions incorrectly; (2) Type B students answer Type B questions correctly with some probability but always answer Type A questions incorrectly. The training results are shown in Table 1.
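The construction of this synthetic dataset can be sketched as follows. This is an illustration only: the success probability `p_correct`, the group sizes, the 0/1 response coding, and the function name `make_motivation_data` are assumptions, since the text does not state them.

```python
import random


def make_motivation_data(n_students=4, n_questions=6, p_correct=0.8, seed=0):
    """Return (student_type, question_type, response) triples.

    Same-type pairs are answered correctly with probability p_correct
    (an assumed value); cross-type pairs are always answered incorrectly.
    """
    rng = random.Random(seed)
    records = []
    for s in range(n_students):
        s_type = "A" if s % 2 == 0 else "B"
        for q in range(n_questions):
            q_type = "A" if q % 2 == 0 else "B"
            if s_type == q_type:
                r = 1 if rng.random() < p_correct else 0
            else:
                r = 0  # cross-type answers are always wrong
            records.append((s_type, q_type, r))
    return records


data = make_motivation_data()
```

Training one model on all of `data` mixes the two incompatible response patterns, whereas splitting by type gives each model a consistent pattern to learn.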

This paper is organized as follows: Section 1 introduces the research related to KT; Section 2 illustrates the KT framework; Section 3 presents the experimental results; the discussion of the results is in Section 4; and Section 5 concludes this paper.

2. Frame of C-B-KT

This section introduces the preprocessing and training process of the Clustering-Bert-Knowledge Tracing (C-B-KT) model, and the flow chart is shown in Figure 1. In Steps S100–S200, the general service encodes the data; Step S300 clusters the encoded data; Step S400 sends the clustered data to the corresponding proxy server for neural network training; and Step S500 extracts the knowledge state matrix of the corresponding model and constructs and summarizes student portraits.

2.1. Data Encoding and Data Clustering

Question encoding vectorizes the questions to facilitate clustering. One-hot coding has been widely used in KT research [17]. In one-hot coding, each question is usually represented by a manually labeled knowledge concept, so the rich information contained in the question text is not fully mined. To mine the text information, this paper uses the Bert model for encoding. Bert is built on the Transformer, which is more efficient than Recurrent Neural Networks (RNNs) and can capture longer-distance dependencies [18]. The question vector after Bert encoding therefore contains more hidden information.

Student encoding vectorizes the students to facilitate clustering. This paper assumes that a student’s ability does not change over a short period of time. The student vector is obtained by observing the student’s responses to all questions. Suppose the response sequence of student $i$ is $s_{i} = (e_{i_{1}}, r_{i_{1}}), (e_{i_{2}}, r_{i_{2}}), \dots, (e_{i_{t}}, r_{i_{t}})$. Here, $(e_{i_{t}}, r_{i_{t}})$ means that student $i$ answered question $e_{i_{t}}$ with result $r_{i_{t}}$: $r_{i_{t}} = 1$ for a correct answer, $r_{i_{t}} = -1$ for a wrong answer (a negative code, matching the case $r_{t} < 0$ in Equation (3)), and $r_{i_{t}} = 2$ for no answer. From this answer sequence, the student vector is $q_{i} = (r_{i_{1}}, r_{i_{2}}, \dots, r_{i_{t}})$, where $q_{i}$ represents the vector of student $i$.
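A minimal sketch of the student encoding is given below. It assumes that every student is encoded over the same ordered list of questions, so that the resulting vectors have equal length and can be clustered; the function name `student_vector` and the question identifiers are hypothetical. Unanswered questions receive the code 2 stated in the text.

```python
def student_vector(sequence, question_ids, no_answer=2):
    """Map a response sequence [(question_id, r), ...] to a fixed-length vector."""
    latest = {}
    for qid, r in sequence:
        latest[qid] = r  # keep the most recent response per question
    # Questions the student never attempted get the "no answer" code.
    return [latest.get(qid, no_answer) for qid in question_ids]


s_i = [("q1", 1), ("q3", 1)]  # toy response sequence of one student
q_i = student_vector(s_i, ["q1", "q2", "q3"])
```

Keeping only the most recent response per question is a simplification chosen for this sketch; the text does not say how repeated attempts are handled.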

Data clustering. In this paper, the K-means clustering algorithm is used to cluster the data. The algorithm has the following features: (1) it is unsupervised and clusters data features automatically; (2) clustering is controllable, since the number of clusters is set by the value of K; (3) it is fast, taking less time than hierarchical clustering algorithms; and (4) it performed best overall among the clustering algorithms compared on the experimental data (see Table A1). The clustering results are shown in Figure 2.
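For readers unfamiliar with K-means, the following is a plain-Python sketch of Lloyd's algorithm on toy two-dimensional points. It is not the authors' implementation (the text does not say which library was used); the function names, the random initialization, and the toy data are assumptions. The returned SSE is the quantity later used by the elbow method.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels, sse)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    sse = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, sse


points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids, labels, sse = kmeans(points, k=2)
```

On these two well-separated groups the algorithm places the first three points in one cluster and the last three in the other.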

2.2. Model Training

For each combination of clustered student and question data, a Bert-KT (B-KT) model with the structure shown in Figure 3 is trained separately.

Question Embedding. From the text information $e_{t}$, the model automatically learns the semantic representation $x_{t}$ of each question, that is, the question embedding. Each word $w_{t_{j}}$ in the question $e_{t}$ is converted into a $d_{0}$-dimensional word vector $x_{t_{j}}$, and the word vectors are averaged with equal weights to obtain the question vector:

$$x_{t} = \frac{1}{m} \sum_{j = 1}^{m} x_{t_{j}} \tag{1}$$

where $x_{t_{j}} \in R^{d_{0}}$ is the word vector of word $w_{t_{j}}$ obtained after Bert processing, $m$ is the number of words in question $e_{t}$, and $x_{t}$ is the question vector.
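Equation (1) amounts to mean pooling over the word vectors. A minimal sketch follows, with toy two-dimensional vectors standing in for the $d_{0}$-dimensional Bert outputs; the function name `question_vector` is hypothetical.

```python
def question_vector(word_vectors):
    """Average m word vectors (each of length d_0) into one question vector."""
    m = len(word_vectors)
    if m == 0:
        raise ValueError("a question must contain at least one word")
    # zip(*...) iterates over dimensions, so each entry is a per-dimension mean.
    return [sum(dim) / m for dim in zip(*word_vectors)]


x_t = question_vector([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

For the three toy word vectors above, the question vector is `[3.0, 4.0]`.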

Student Embedding. Student embedding models the training process of each student and traces the student’s hidden knowledge state throughout that process. The KT framework relies on two basic assumptions: (1) a student’s knowledge state is related to the question information and the answer results; (2) students both learn and forget over long sequences of consecutive questions. Based on these assumptions, this paper uses an LSTM to embed students. The LSTM gates are computed as follows:

$$\begin{aligned}
f_{t} &= \sigma(W_{f} \cdot [h_{t-1}, x_{t}] + b_{f}) \\
i_{t} &= \sigma(W_{i} \cdot [h_{t-1}, x_{t}] + b_{i}) \\
\tilde{C}_{t} &= \tanh(W_{C} \cdot [h_{t-1}, x_{t}] + b_{C}) \\
o_{t} &= \sigma(W_{o} \cdot [h_{t-1}, x_{t}] + b_{o})
\end{aligned} \tag{2}$$

where $W_{*} \in R^{4 d_{h}}$ are the weight matrices between the input layer and the hidden layer, $b_{*} \in R^{d_{h}}$ are the corresponding bias vectors, and $d_{h}$ is the dimension of the hidden state in the student embedding (the question-embedding dimension is $d_{0}$).
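Equation (2) lists the gates and the candidate state only. In the standard LSTM formulation, which the text appears to follow but does not spell out, these are combined into the cell state $C_{t}$ and the hidden state $h_{t}$ as sketched below; this completion is an assumption based on the usual definition, not a statement from the paper.

```latex
\begin{aligned}
C_{t} &= f_{t} \odot C_{t-1} + i_{t} \odot \tilde{C}_{t} \\
h_{t} &= o_{t} \odot \tanh(C_{t})
\end{aligned}
```

Here $\odot$ denotes element-wise multiplication: the forget gate $f_{t}$ scales the previous cell state, the input gate $i_{t}$ scales the new candidate, and the output gate $o_{t}$ controls how much of the cell state is exposed as $h_{t}$.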

During training, correct and incorrect answers affect a student’s state in different ways, and these effects need to be distinguished for each student. Specifically, when training on the $t$-th question in the sequence, the model input should combine the question embedding $x_{t}$ with the corresponding score $r_{t}$. First, the score $r_{t}$ is extended to a zero feature vector $\mathbf{0} = (0, 0, \dots, 0)$ with the same dimension as the question embedding $x_{t}$. Then the score vector and the question vector are concatenated to obtain the input vector $x_{t}^{+} \in R^{d_{0} + d_{q}}$, as follows:

$$x_{t}^{+} = \begin{cases} [x_{t} \oplus \mathbf{0}] & r_{t} > 0 \\ [\mathbf{0} \oplus x_{t}] & r_{t} < 0 \end{cases} \tag{3}$$

Because half of the input vector is always zero, the neurons of the model are effectively divided into an upper and a lower part, and $W_{*}$ can be split into $W_{*}^{+}, W_{*}^{-} \in R^{2 d_{h}}$. This can be understood as using two models to capture the impact of the question $e_{t}$, depending on the answer, and to reflect it in real time during training.
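The input construction of Equation (3) can be sketched as a simple concatenation. The function name `combine_input` is hypothetical, and because Equation (3) does not define the case $r_{t} = 0$, the sketch raises an error there rather than guessing.

```python
def combine_input(x_t, r_t):
    """Concatenate the question embedding with a zero vector of equal length.

    Positive scores place x_t in the first half, negative scores in the second,
    so each half of the downstream weights sees only one kind of answer.
    """
    zeros = [0.0] * len(x_t)
    if r_t > 0:
        return list(x_t) + zeros
    if r_t < 0:
        return zeros + list(x_t)
    raise ValueError("Equation (3) does not define the case r_t == 0")


x_plus_correct = combine_input([1.0, 2.0], 1)
x_plus_wrong = combine_input([1.0, 2.0], -1)
```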

2.3. Model Output

This section presents the strategy for predicting student performance. Student performance depends on the student’s state and the characteristics of the question. For a typical sequence prediction task, it is assumed that the next state depends only on the current state, not on the earlier sequence [19]. Accordingly, this paper uses a softmax layer to compress the hidden layer into the dynamic matrix $S_{t}$, that is, the knowledge state. Inspired by DKVMN, the answer to question $e_{t + 1}$ can then be predicted from the knowledge points $Q_{t + 1}$ of that question. The equations are as follows:

$$\begin{aligned}
y_{t + 1} &= \text{RPa-LLM}(\mathrm{Softmax}(h_{t}), Q_{t + 1}) \\
\text{RPa-LLM}(x_{i}, y_{i}) &= \frac{\exp(\lambda_{i0} + \sum_{k = 0}^{K} \lambda_{ik} \omega_{ik} x_{ik}^{*})}{1 + \exp(\lambda_{i0} + \sum_{k = 0}^{K} \lambda_{ik} y_{ik}^{*})} \\
\omega_{ik} &= \mathbb{1}\{x_{ik}^{*} \geq y_{ik}^{*} \text{ and } y_{ik}^{*} = 1\}
\end{aligned} \tag{4}$$

2.4. Loss Function

This section introduces the loss function used in the C-B-KT framework, that is, the function that measures the difference between the model’s predicted value and the actual value. We use the cross-entropy loss. Because the loss contains a logarithm, its gradient after differentiation is proportional to $y - \hat{y}$; the size of each parameter update is therefore linear in the error, and the optimal solution can be found more quickly. The equation is as follows:

$$\mathit{loss} = -a \log(p) - (1 - a) \log(1 - p) \tag{5}$$

where $p$ is the predicted value of the model, $a$ is the true value of the data, and $\mathit{loss}$ is the loss value.
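The claim that the gradient is linear in the error can be checked numerically. The sketch below assumes that the prediction $p$ is produced by a sigmoid applied to a logit $z$ (an assumption made for illustration); under that assumption the derivative of Equation (5) with respect to $z$ is $p - a$. The function names and the clipping constant `eps` are hypothetical.

```python
import math


def bce_loss(p, a, eps=1e-12):
    """Cross-entropy loss of Equation (5); p is clipped to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -a * math.log(p) - (1 - a) * math.log(1 - p)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_wrt_logit(z, a, h=1e-6):
    """Central finite difference of the loss with respect to the logit z."""
    return (bce_loss(sigmoid(z + h), a) - bce_loss(sigmoid(z - h), a)) / (2 * h)


z, a = 0.3, 1
numeric = grad_wrt_logit(z, a)
analytic = sigmoid(z) - a  # prediction minus target
```

The numerical and analytic gradients agree closely, so a larger prediction error produces a proportionally larger update.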

3. Experiment

To verify the effectiveness of the proposed method, a series of experiments are conducted.

3.1. Setup

Public datasets do not contain the question texts needed by the proposed scheme; thus, in this paper, real questions are collected and labeled from a university calculus textbook. In this dataset, a total of 715 questions are labeled by n graduate students of computer science over two months. Labels include the question text, knowledge points, the relations between knowledge points, and other information. The student answer data come from our own collection from nearly 1000 real students, with a total of 82,517 behavior records. Then, following the real data distribution, the answering behaviors of 3000 students are simulated (see Figure A1).

To evaluate the effectiveness of the proposed method, several related works are compared. For fairness, all models are tuned for optimal performance. The details are as follows:

DKT: The classical DKT model [17], in which the data embedding is based on the one-hot method.
EERNN: Exercise-Enhanced Sequential Modeling [15]—the improved DKT model, in which the data embedding is based on the bidirectional LSTM method.
R-DKT: Random Deep Knowledge Tracing—the clustered DKT model, in which the clustering algorithm is based on the random method.
B-KT: Bert Knowledge Tracing—the improved DKT model, in which the data embedding is based on the Bert method.
C-DKT, C-EERNN: Clustering Deep Knowledge Tracing and Clustering Exercise-Enhanced Sequential Modeling—the improved DKT models, in which the clustering idea is added to the basic models.
C-B-KT: The proposed scheme, which uses the idea of clustering based on B-KT.

A qualified model for student performance prediction should perform well from both the regression and the classification perspectives. The output is a continuous probability, so regression metrics can be used to measure its error. In this work, the sum of squares due to error (SSE) is used to evaluate the compactness of the clusters. The mean absolute error (MAE), the mean squared error (MSE), the coefficient of determination (R2), and the root mean square error (RMSE) are used to evaluate the error of the proposed methods; they are also used in other related work [15,20]. The Area Under the Curve (AUC), Recall, Precision, and F1-score are used to evaluate the classification accuracy of the models; they are also used in other related works [21,22,23,24,25]. The calculation details are shown in Table 2.
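For reference, the regression and threshold-based classification metrics named above can be computed as in the following sketch. It uses the standard textbook definitions, which are assumed to match Table 2; the function names, the 0.5 decision threshold, and the toy values are illustrative assumptions.

```python
import math


def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE and R2 (requires y_true to contain more than one distinct value)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    ss_res = sum(e * e for e in errors)
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse), "R2": 1.0 - ss_res / ss_tot}


def precision_recall_f1(y_true, y_pred, threshold=0.5):
    """Binarize predicted probabilities at the threshold, then count TP/FP/FN."""
    labels = [1 if p >= threshold else 0 for p in y_pred]
    tp = sum(1 for t, l in zip(y_true, labels) if t == 1 and l == 1)
    fp = sum(1 for t, l in zip(y_true, labels) if t == 0 and l == 1)
    fn = sum(1 for t, l in zip(y_true, labels) if t == 1 and l == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [1, 0, 1, 1]
y_pred = [0.9, 0.2, 0.6, 0.4]
reg = regression_metrics(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```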

The configuration of the proposed models is set as follows:

Question embedding supplements. The mathematical questions contain formulas, which differ from ordinary words. To preserve the mathematical semantics, a formula tool from the Spotlight mechanism is used to convert each formula into its TeX code feature [26].

Frame settings. The dimension of each embedded problem, $d_{0}$, is set to 768; the dimension of the hidden state in the student embedding, $d_{h}$, is set to 100; and the dimension of the dynamic matrix in the output layer, $d_{K}$, is set to $K$, which is equal to the number of knowledge points.

Training settings. All parameters of C-B-KT are initialized in the range $0.5 \pm 0.1$. In addition, the batch size is set to 32 and the dropout rate is set to 0.5 to prevent overfitting.

Determining the optimal number of clusters. For K-means, the number of clusters must be chosen manually. This article uses the elbow method (a heuristic for choosing the number of clusters k) to determine the optimal number of clusters, as shown in Figure 4.

As the number of clusters k increases, the samples are divided more finely and each cluster becomes more tightly aggregated, so the SSE decreases. While k is below the optimal number of clusters, increasing k substantially raises the degree of aggregation of each cluster and therefore lowers the SSE sharply. Once k reaches the optimal number of clusters, the gain in aggregation from increasing k further drops quickly, so the decline of the SSE slows and then levels off as k continues to grow. The curve of SSE against k therefore has the shape of an elbow, and the k value at which its curvature is greatest is taken as the optimal number of clusters for the data. In Figure 4a,b, the curvature is greatest at $k = 3$. Therefore, the student data and the question data are each divided into three classes.
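One simple way to locate the elbow programmatically is to take the k with the largest second difference of the SSE curve, which serves as a discrete stand-in for curvature when the k values are equally spaced. The sketch below uses made-up SSE values, not the values behind Figure 4, and the function name `elbow_k` is hypothetical.

```python
def elbow_k(ks, sse):
    """Return the k with the largest second difference of SSE.

    Assumes ks are equally spaced; endpoints cannot be elbows because the
    second difference needs a neighbour on each side.
    """
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        bend = sse[i - 1] - 2 * sse[i] + sse[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


ks = [1, 2, 3, 4, 5, 6]
sse = [1000.0, 600.0, 250.0, 220.0, 200.0, 185.0]  # illustrative values only
k_opt = elbow_k(ks, sse)
```

With these illustrative values the SSE drops steeply up to k = 3 and flattens afterwards, so the sketch selects k = 3.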

3.2. Results

3.2.1. Prediction Performance

The core of this experiment is to observe the impact of data clustering and data coding on the accuracy of the model prediction.

The prediction results are presented in Table 3 (see Figure A2 for a more intuitive R-P curve), and several conclusions can be drawn. First, the proposed C-B-KT model performs best on all metrics. This result indicates that the classification idea behind C-B-KT does improve the prediction accuracy of KT models. Second, the C-DKT, C-EERNN, and C-B-KT models produce better results than the DKT, EERNN, and B-KT models on all indicators. This indicates that neural networks capture clustered data more accurately, which improves the prediction accuracy of KT. Third, the B-KT model outperforms both the classical DKT model and the EERNN model of the same type. This indicates that the Bert model extracts question features better than the traditional one-hot encoding and the bidirectional LSTM. Adding Bert to the B-KT model further reduces the information loss caused by feature-based or knowledge-specific representations in existing methods.

3.2.2. Training Efficiency

This experiment aims to verify the performance improvement of the clustering-based neural network model.

Based on the results in Table 4, several conclusions can be drawn. First, compared with the traditional B-KT, EERNN, and DKT models, the C-B-KT, C-EERNN, and C-DKT models reduce training time by 67.28% on average. The EERNN model benefits most, with a reduction of up to 75.75%. This indicates that the clustering-based neural network model can improve training efficiency through parallel training. Second, although the R-DKT model has the highest training efficiency, its prediction accuracy does not improve over the DKT model. The training efficiency of the C-DKT model is higher than that of the DKT model but lower than that of the R-DKT model, because random clustering spreads the data more evenly across clusters. Third, the DKT model does not use the text embedding introduced in this article, yet it shows the same trend. This shows that the gain in training efficiency is mainly related to the data volume and not to the text-information embedding.

Applied to all three base models, the clustering idea performs well in terms of both prediction accuracy and training efficiency, which shows good versatility.
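The parallel scheme can be pictured as one independent training job per combination of student cluster and question cluster. The sketch below is schematic: `train_one` is a stub that only counts records, a thread pool stands in for the separate proxy servers described in Section 2, and the relative-reduction formula in `time_reduction` is an assumed definition, since the text does not state how the percentages were computed.

```python
from concurrent.futures import ThreadPoolExecutor


def train_one(job):
    """Placeholder for training one model on one (student cluster, question cluster) pair."""
    key, records = job
    # A real implementation would fit a KT model here; this stub only counts records.
    return key, len(records)


def train_all(partitions, max_workers=3):
    """Run the independent training jobs concurrently and collect results by key."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(train_one, partitions.items()))


def time_reduction(baseline_seconds, clustered_seconds):
    """Relative reduction in training time, in percent (assumed definition)."""
    return 100.0 * (baseline_seconds - clustered_seconds) / baseline_seconds


partitions = {
    ("student_1", "question_1"): [1, 0, 1],
    ("student_1", "question_2"): [0, 0],
    ("student_2", "question_1"): [1],
}
results = train_all(partitions)
```

Because the jobs share no state, the wall-clock time of the whole run is governed by the slowest partition, which is consistent with the observation that more evenly sized clusters train faster.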

3.2.3. Visualization Analysis

The core purpose of this paper is to construct student portraits. The dynamic matrix, $S_{t}$, in Figure 3 is the students’ knowledge state matrix. The data information is extracted and visualized as shown in Figure 5.

3.2.4. Clustering Model Analysis

To verify the idea that clustering can reduce the noise between different groups and improve the prediction accuracy, this section shows the prediction results of each model after clustering as presented in Figure 6.

Considering the classification indicator, AUC, the prediction accuracy of every model except the “31” model improves significantly. This indicates that, after clustering by data features, training on data that share the same features improves prediction accuracy. The overall prediction accuracy after clustering is also higher than that of the “Non-Clustering” model, which indicates that clustering is feasible and effective.

Considering the regression indicator, MAE, the error of each independent model decreases. This indicates that the clustering-based models are closer to the real data than the “Non-Clustering” model, which further supports the effectiveness of clustering.

Considering the time indicator, the training time of every clustering-based model is significantly shorter than that of the “Non-Clustering” model, and the overall training efficiency improves by 57.57%. This shows that training efficiency decreases as the amount of data grows.

In summary, although the amount of data for the model decreased, the accuracy of model prediction did not decrease, and the efficiency of model training was greatly improved.

4. Discussion

In this paper, we study the impact of data features on KT models and use the relationships to propose a KT model based on clustering neural networks.

In this study, we show that the input data features would impact the performance of KT models. Firstly, the performance of the model is significantly improved after clustering. This result shows that data clustering impacts KT models by partitioning data features. Secondly, data clustering provides a solution for distributed deployment and improves training efficiency. Finally, the student’s knowledge state matrix $S_{t}$ is added to optimize the KT model and enhance interpretability.

This paper uses the controlled-variable method to conduct experiments and analyze the data, and finds that random clustering does not affect the accuracy of KT, which further shows that our method is effective and strengthens the credibility of the research.

In this paper, a clustering algorithm is used to train the data separately, which improves the prediction accuracy of the model and provides a distributed training scheme for the KT model. This paper uses the diagnostic classification model RPa-LLM, an educational theory, to display and trace students’ knowledge status, which further improves the interpretability of the model.

5. Conclusions

This paper makes a comprehensive study of the problem of student performance prediction. Firstly, an enhanced neural network framework, B-KT, is proposed to explore the effect of question text on KT. B-KT can effectively predict students’ performance in future questions. Secondly, the B-KT framework is extended to the C-B-KT model. The clustering algorithm can help the KT model to discover the hidden rules and structures in the dataset so that the neural network can better understand the data and improve the model’s prediction accuracy. Thirdly, the addition of the clustering module enables parallel training, which improves training efficiency. Then, to verify the effects of clustering and question embedding, the C-B-KT is compared with C-DKT and C-EERNN models. The C-B-KT can cluster and trace the historical focus state for prediction, which is superior to all other models. Finally, the classification discussion and the addition of the student’s knowledge state matrix $S_{t}$ brought by clustering improve the interpretability of KT.

In this study, we only investigated the effect of K-means clustering on KT models, without considering the effect of other potential factors. In the future, different clustering algorithms can be used to explore the impact of data characteristics on KT from the perspective of data distribution changes. The deep integration of diagnostic classification models and neural networks can be further explored.

Supplementary Materials

The following supporting information can be downloaded at https://0-www-mdpi-com.brum.beds.ac.uk/article/10.3390/app13095220/s1.

Author Contributions

Conceptualization, J.X., H.W., Q.Z. and E.H.-M.S.; methodology, J.X. and H.W.; validation, J.X. and H.W.; formal analysis, J.X. and H.W.; investigation, J.X. and H.W.; data curation, J.X. and H.W.; writing—original draft preparation, J.X.; writing—review and editing, J.X. and H.W.; visualization, J.X.; supervision, Q.Z. and E.H.-M.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially supported by Shanghai Science and Technology Commission Project 20511101600 and NSFC 61972154.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article or Supplementary Material.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Appendix A.1

Table A1 shows the results of the clustering algorithm analysis, which were used to decide which clustering algorithm to use in this paper.

Table A1. Cluster Index Results.

Metric	K-Means	Agglomerative	DBSCAN
Time (s)	44.94	3120.61	81.73
Calinski-Harabasz	2113	723.31	197.32
Adjusted Rand index	0.83	1	1

Table A1 uses three indicators, Time, Calinski-Harabasz, and Adjusted Rand index, to represent the performance, data adaptability, and stability of the clustering algorithm, respectively, in order to determine the best clustering algorithm.

Appendix A.2

By analyzing the distribution of the real data and using data augmentation, this paper simulates the data of 3000 students. Figure A1 compares the data distributions of the real and simulated data.

Figure A1. The abscissa represents the number of knowledge points mastered by the students, and the ordinate represents the number of students corresponding to the number of knowledge points mastered.

Appendix A.3

Figure A2 shows a supplementary experiment on the six models. The R-P curve is used to observe the differences between the models intuitively.

Figure A2. The abscissa represents the Recall, the ordinate represents the Precision, and the area under the curve is the AUC. The C-B-KT model shows the best predictive effect.

References

1. Kuh, G.D.; Kinzie, J.; Buckley, J.A.; Bridges, B.K.; Hayek, J.C. Piecing Together the Student Success Puzzle: Research, Propositions, and Recommendations. ASHE High. Educ. Rep. 2007, 32, 1–182.
2. Yang, H.; Cheung, L.P. Implicit heterogeneous features embedding in deep knowledge tracing. Cogn. Comput. 2018, 10, 3–14.
3. Zhang, L.; Xiong, X.; Zhao, S.; Botelho, A.; Heffernan, N.T. Incorporating Rich Features into Deep Knowledge Tracing; ACM: Cambridge, MA, USA, 2017; pp. 169–172.
4. Sonkar, S.; Waters, A.E.; Lan, A.S.; Grimaldi, P.J.; Baraniuk, R.G. QDKT: Question-centric deep knowledge tracing. arXiv 2020, arXiv:2005.12442.
5. Zhang, N.; Du, Y.; Deng, K.; Li, L.; Shen, J.; Sun, G. Attention-based knowledge tracing with heterogeneous information network embedding. In Proceedings of the Knowledge Science, Engineering and Management: 13th International Conference, KSEM 2020, Hangzhou, China, 28–30 August 2020; pp. 95–103.
6. Sun, X.; Zhao, X.; Ma, Y.; Yuan, X.; He, F.; Feng, J. Muti-behavior features based knowledge tracking using decision tree improved DKVMN. In Proceedings of the ACM Turing Celebration Conference-China, Chengdu, China, 17–19 May 2019; pp. 1–6.
7. Pandey, S.; Karypis, G. A Self-Attentive model for Knowledge Tracing. arXiv 2019, arXiv:1907.06837.
8. Choi, Y.; Lee, Y.; Cho, J.; Baek, J.; Kim, B.; Cha, Y.; Shin, D.; Bae, C.; Heo, J. Towards an appropriate query, key, and value computation for knowledge tracing. In Proceedings of the Seventh ACM Conference on Learning@ Scale, Virtual, 12–14 August 2020; pp. 490–496.
9. Bhatt, S.; Zhao, J.; Thille, C.; Zimmaro, D.; Gattani, N. A Novel Approach for Knowledge State Representation and Prediction. In Proceedings of the Seventh ACM Conference on Learning@ Scale, Virtual, 12–14 August 2020.
10. Zhang, C.; Jiang, Y.; Zhang, W.; Gu, C. MUSE: Multi-Scale Temporal Features Evolution for Knowledge Tracing. arXiv 2021, arXiv:2102.00228.
11. Ghosh, A.; Heffernan, N.; Lan, A.S. Context-Aware Attentive Knowledge Tracing. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual, 6–10 July 2020; pp. 2330–2339.
12. Cheng, S.; Liu, Q.; Chen, E. Domain Adaption for Knowledge Tracing. arXiv 2020, arXiv:2001.04841.
13. Liu, D.; Zhang, Y.; Zhang, J.; Li, Q.; Zhang, C.; Yin, Y. Multiple features fusion attention mechanism enhanced deep knowledge tracing for student performance prediction. IEEE Access 2020, 8, 194894–194903.
14. Yeung, C.K. Deep-IRT: Make Deep Learning Based Knowledge Tracing Explainable Using Item Response Theory. arXiv 2019, arXiv:1904.11738.
15. Su, Y.; Liu, Q.; Liu, Q.; Huang, Z.; Yin, Y.; Chen, E.; Ding, C.; Wei, S.; Hu, G. Exercise-Enhanced Sequential Modeling for Student Performance Prediction. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
16. Minn, S.; Yu, Y.; Desmarais, M.C.; Zhu, F.; Vie, J.J. Deep Knowledge Tracing and Dynamic Student Classification for Knowledge Tracing. In Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM), Singapore, 17–20 November 2018; pp. 1182–1187.
17. Piech, C.; Bassen, J.; Huang, J.; Ganguli, S.; Sahami, M.; Guibas, L.J.; Sohl-Dickstein, J. Deep Knowledge Tracing. Comput. Sci. 2015, 3, 19–23.
18. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805.
19. Wang, S.; Tang, J.; Wang, Y.; Liu, H. Exploring Hierarchical Structures for Recommender Systems. IEEE Trans. Knowl. Data Eng. 2018, 30, 1022–1035.
20. Liu, Q.; Huang, Z.; Yin, Y.; Chen, E.; Xiong, H.; Su, Y.; Hu, G. EKT: Exercise-Aware Knowledge Tracing for Student Performance Prediction. IEEE Trans. Knowl. Data Eng. 2021, 33, 100–115.
21. Fogarty, J.; Baker, R.S.; Hudson, S.E. Case studies in the use of ROC curve analysis for sensor-based estimates in human computer interaction. In Proceedings of the Graphics Interface 2005, Victoria, BC, Canada, 9–11 May 2005; pp. 129–136.
22. Wu, R.; Xu, G.; Chen, E.; Liu, Q.; Ng, W. Knowledge or Gaming? Cognitive Modelling Based on Multiple-Attempt Response. In Proceedings of the 26th International World Wide Web Conference, Perth, Australia, 3–7 April 2017; pp. 321–329.
23. Zhang, T.; Su, G.; Qing, C.; Xu, X.; Cai, B.; Xing, X. Hierarchical lifelong learning by sharing representations and integrating hypothesis. IEEE Trans. 2018, 51, 1004–1014.
24. Kuang, K.; Cui, P.; Athey, S.; Xiong, R.; Li, B. Stable Prediction across Unknown Environments. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 1617–1626.
25. Zhang, L.; Xiao, K.; Zhu, H.; Liu, C.; Yang, J.; Jin, B. CADEN: A Context-Aware Deep Embedding Network for Financial Opinions Mining. In Proceedings of the IEEE International Conference on Data Mining (ICDM), Singapore, 17–20 November 2018; pp. 757–766.
26. Yin, Y.; Huang, Z.; Chen, E.; Liu, Q.; Zhang, F.; Xie, X.; Hu, G. Transcribing Content from Structural Images with Spotlight Mechanism. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 2643–2652.

Figure 1. The KT framework based on the clustering neural network mainly consists of three parts: data encoding and data clustering (S200, S300), model training (S400), and building student portraits (S500).

Figure 2. Because both groups of data are high-dimensional, Principal Component Analysis (PCA) is used to compress them into two dimensions, PC1 and PC2. (a) shows the clustering results of the students. (b) shows the clustering results of the questions. The three colors represent three clusters, and different clusters are combined for training.
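The projection onto PC1 and PC2 mentioned in the Figure 2 caption can be illustrated with a small dependency-free sketch. It is not the authors' code: the toy rows, the function names, and the use of power iteration with deflation are all assumptions made for illustration, and a numerical library would normally be used instead.

```python
import math

def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix (divides by n - 1)."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def matvec(matrix, v):
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]

def top_eigenpair(matrix, iterations=500):
    """Power iteration; assumes the start vector is not orthogonal
    to the dominant eigenvector (true for the toy data below)."""
    d = len(matrix)
    v = [float(i + 1) for i in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = matvec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, matvec(matrix, v)))
    return eigenvalue, v

def pca_2d(rows):
    """Project rows onto the two leading principal components."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(2):
        eigenvalue, vec = top_eigenpair(cov)
        components.append(vec)
        # Deflation: remove the component just found before searching again.
        cov = [[cov[i][j] - eigenvalue * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    return [[sum(a * b for a, b in zip(row, comp)) for comp in components]
            for row in centered]

# Hypothetical 4-dimensional feature rows (the last two columns are constant).
rows = [[1, 0, 5, 5], [2, 1, 5, 5], [3, 0, 5, 5],
        [4, 1, 5, 5], [5, 0, 5, 5], [6, 1, 5, 5]]
projected = pca_2d(rows)  # one [PC1, PC2] pair per input row
```

Each returned pair can be plotted directly as a point, colored by its cluster label.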

Figure 3. There are two main parts to the framework process: the question embedding (gray) and the student embedding (blue). $S_{t}$, which represents the state of students’ knowledge, is an addition to the KT model that increases its interpretability.

Figure 4. The abscissa indicates the number of clusters $k$ (for example, $k = 3$ indicates that the data are divided into three categories); the ordinate is the sum of the squares of the errors (SSE), which decreases as the number of clusters increases.
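The SSE-versus-cluster-count curve described in the Figure 4 caption can be sketched on toy data as follows. This is an illustration only: the points, the function names, and the deterministic farthest-point initialization are assumptions, since the source does not state how K-Means was initialized.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_point_init(points, k):
    """Deterministic initialization (an assumption for this sketch):
    start from the first point, then repeatedly add the point that is
    farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids)))
    return centroids

def kmeans_sse(points, k, iterations=100):
    """Run Lloyd's algorithm and return the final sum of squared errors."""
    centroids = farthest_point_init(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        updated = []
        for c, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                updated.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(dim)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)

# Toy data: three well-separated groups of four points each.
points = [(cx + dx, cy + dy)
          for cx, cy in [(0, 0), (10, 0), (0, 10)]
          for dx in (-1, 1) for dy in (-1, 1)]

sse_by_k = {k: kmeans_sse(points, k) for k in range(1, 6)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 2))
```

On this toy data the SSE falls steeply up to three clusters and only slightly afterwards, which is the kind of bend the elbow reading of such a curve looks for.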

Figure 5. Panels (a–d) are 18-dimensional radar graphs, where each dimension represents a knowledge point and its value represents the degree of mastery, with 0 representing a complete lack of mastery and 1 representing complete mastery. (a–c) correspond to the knowledge states under the three question clusters, respectively. (d) shows the overall knowledge state calculated using the joint average algorithm.

Figure 6. “Clustering” represents the indicators of each independent model after clustering, “Non-Clustering” is the overall indicator, and “Random Clustering” is the indicator of each independent model after random clustering. “Sum of Clustering” and “Sum of Random Clustering” represent the overall results obtained by integrating the predictions of all independent models into one tensor for calculation. This article divides students and questions into three classes each, so the abscissa label “13” denotes the model trained on the dataset consisting of the first class of students and the third class of questions (the other labels are read in the same way). (a–c) show the three indicators AUC, MAE, and time of the independent models.
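The labelling scheme in the Figure 6 caption (one model per student-class and question-class pair, with the per-model predictions merged for the "Sum of Clustering" result) can be sketched as below. The data, the cluster assignments, and the constant-output stand-in predictors are all hypothetical; in the article each group is served by a trained KT model.

```python
from collections import defaultdict

def group_key(student_class, question_class):
    """Build a label such as "13": student class 1, question class 3."""
    return f"{student_class}{question_class}"

def partition_interactions(interactions, student_class, question_class):
    """Split (student, question, correct) records by cluster pair,
    remembering each record's original position."""
    groups = defaultdict(list)
    for index, (student, question, correct) in enumerate(interactions):
        key = group_key(student_class[student], question_class[question])
        groups[key].append((index, student, question, correct))
    return dict(groups)

def merge_predictions(groups, predictor_by_key, total):
    """Run each group's predictor and put results back in original order."""
    merged = [None] * total
    for key, records in groups.items():
        predictions = predictor_by_key[key](records)
        for (index, *_), prediction in zip(records, predictions):
            merged[index] = prediction
    return merged

# Hypothetical interaction log and cluster assignments.
interactions = [("s1", "q1", 1), ("s2", "q3", 0), ("s1", "q3", 1), ("s3", "q2", 0)]
student_class = {"s1": 1, "s2": 1, "s3": 2}
question_class = {"q1": 1, "q2": 2, "q3": 3}

groups = partition_interactions(interactions, student_class, question_class)

# Stand-in predictors that return a constant probability per group.
constants = {"11": 0.9, "13": 0.4, "22": 0.2}
predictor_by_key = {
    key: (lambda records, p=p: [p] * len(records)) for key, p in constants.items()
}

merged = merge_predictions(groups, predictor_by_key, len(interactions))
print(sorted(groups), merged)
```

Because the groups do not overlap, each one can be trained and evaluated independently before the merge step, which is what makes the parallel deployment possible.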

Table 1. Simulation data training results.

Method	AUC (%)	F1 (%)	Recall (%)	Precision (%)
Non-Clustering	51.42	0	0	0
Question Clustering	84.63	86.06	71.92	84.62
Student & Question Clustering	96.82	85.18	79.03	92.37

The results in Table 1 were obtained with a KT model based on a Long Short-Term Memory (LSTM) neural network, which is the same model as the DKT model in the experimental part. They suggest that clustering, by pre-extracting data features, can improve the prediction accuracy of the overall model.

Table 2. Calculation details.

Metric	Formula
SSE	$\sum_{i = 1}^{m} {(x_{i} - y_{i})}^{2}$
F1-Score	$\frac{2 TP}{2 TP + FP + FN}$
Recall	$\frac{TP}{TP + FN}$
Precision	$\frac{TP}{TP + FP}$
MAE	$\frac{1}{m} \sum_{i = 1}^{m} \left| h (x_{i}) - y_{i} \right|$
MSE	$\frac{1}{m} \sum_{i = 1}^{m} {(h (x_{i}) - y_{i})}^{2}$
RMSE	$\sqrt{\frac{1}{m} \sum_{i = 1}^{m} {(h (x_{i}) - y_{i})}^{2}}$
R2	$1 - \frac{\sum_{i = 1}^{m} {(x_{i} - y_{i})}^{2}}{\sum_{i = 1}^{m} {(\bar{x} - y_{i})}^{2}}$

TP: Positive samples predicted as positive by the model; TN: Negative samples predicted as negative by the model; FP: Negative samples predicted as positive by the model; FN: Positive samples predicted as negative by the model. $x_{i}$ represents the predicted value; $y_{i}$ represents the true value; $\bar{x}$ represents the average of the predicted values.
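The formulas in Table 2 translate directly into code. The following is a minimal sketch rather than the authors' evaluation script: the 0.5 decision threshold, the function names, and the sample values are assumptions. The R2 denominator follows Table 2 as written and uses the mean of the predicted values; the more common definition uses the mean of the true values.

```python
import math

def classification_metrics(y_true, y_prob, threshold=0.5):
    """Precision, recall and F1 from TP/FP/FN counts, as in Table 2.
    The 0.5 threshold is an assumption of this sketch."""
    y_label = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_label) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_label) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_label) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

def regression_metrics(y_true, y_prob):
    """SSE, MAE, MSE, RMSE and R2 between predictions x_i and labels y_i."""
    m = len(y_true)
    sse = sum((x - y) ** 2 for x, y in zip(y_prob, y_true))
    mae = sum(abs(x - y) for x, y in zip(y_prob, y_true)) / m
    mse = sse / m
    rmse = math.sqrt(mse)
    # Table 2 defines the denominator with the mean of the predictions.
    x_bar = sum(y_prob) / m
    denominator = sum((x_bar - y) ** 2 for y in y_true)
    r2 = 1 - sse / denominator if denominator else float("nan")
    return {"sse": sse, "mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

# Hypothetical labels (1 = correct answer) and predicted probabilities.
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.6]

cls = classification_metrics(y_true, y_prob)
reg = regression_metrics(y_true, y_prob)
print(cls)
print(reg)
```

When a model never predicts the positive class, TP and FP are both zero, and this sketch reports precision, recall, and F1 as 0 instead of dividing by zero.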

Table 3. Model result analysis.

	Classification Indicators (%)				Regression Indicators
Model	AUC	F1	Recall	Precision	MAE	RMSE	MSE	R2
DKT	81.10	72.89	74.29	71.54	0.36	0.42	0.18	0.29
R-DKT	81.21	72.22	70.69	73.81	0.34	0.42	0.18	0.29
EERNN	85.73	80.27	79.36	81.20	0.25	0.36	0.13	0.49
B-KT	89.94	81.89	81.26	82.52	0.26	0.36	0.13	0.49
C-DKT	84.55	75.70	74.52	76.92	0.33	0.40	0.16	0.36
C-EERNN	88.49	83.41	80.90	86.07	0.24	0.35	0.12	0.50
C-B-KT	91.01	83.15	81.26	85.14	0.24	0.35	0.12	0.52

Table 4. Performance Analysis.

	Knowledge Tracing							K-Means
Model	DKT	C-DKT	R-DKT	EERNN	C-EERNN	B-KT	C-B-KT	MyData
Time (s)	7039	2986	1721	17051	4134	6955	2190	44.94
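As a worked check on the training times in Table 4, the relative time reduction of each clustered model against its non-clustered counterpart can be computed as below. Pairing each "C-" model with its baseline and averaging the three reductions is an assumption about how an overall efficiency figure might be derived; the source does not spell out the calculation.

```python
# Training times in seconds, copied from Table 4.
times = {
    "DKT": 7039, "C-DKT": 2986,
    "EERNN": 17051, "C-EERNN": 4134,
    "B-KT": 6955, "C-B-KT": 2190,
}

# Assumed pairing: each clustered model against its baseline.
pairs = [("DKT", "C-DKT"), ("EERNN", "C-EERNN"), ("B-KT", "C-B-KT")]

def reduction_percent(baseline_seconds, clustered_seconds):
    """Relative reduction in training time, in percent."""
    return 100.0 * (baseline_seconds - clustered_seconds) / baseline_seconds

reductions = {
    baseline: reduction_percent(times[baseline], times[clustered])
    for baseline, clustered in pairs
}
mean_reduction = sum(reductions.values()) / len(reductions)

for baseline, value in reductions.items():
    print(f"{baseline}: {value:.2f}% less training time")
print(f"mean: {mean_reduction:.2f}%")
```

Under this assumption the mean reduction comes to about 67.28%, which agrees with the training-efficiency improvement quoted in the abstract. The 44.94 s K-Means time is not included in this calculation.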

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

