Article

Deep Reinforcement Factorization Machines: A Deep Reinforcement Learning Model with Random Exploration Strategy and High Deployment Efficiency

School of Mechanical and Information Engineering, Shandong University, Weihai 264209, China
*
Author to whom correspondence should be addressed.
Submission received: 31 March 2022 / Revised: 13 May 2022 / Accepted: 19 May 2022 / Published: 24 May 2022

Abstract

In recent years, recommendation systems and robot learning have been two of the most popular application fields, and the core algorithms supporting them are perception-driven deep learning and exploration-driven reinforcement learning, respectively. Combining these two fields to advance the whole of machine learning has been the goal of many researchers. The Deep Reinforcement Network (DRN) model successfully embedded reinforcement learning into a recommendation system and provided a good starting point for subsequent work. Its disadvantage, however, is also obvious: the DRN model is built specifically for news recommendation, so it is not transferable, a defect shared by many current recommendation system models. Meanwhile, the agent learning method adopted by the DRN model is primitive and inefficient. Among the many models and algorithms that have emerged in recent years, we use the newly proposed notion of deployment efficiency to measure their overall quality and find that few models focus on improving both efficiency and performance. To fill the gap of model deployment efficiency neglected by many researchers and to create a reinforcement learning agent with stronger performance, we continued our research on the Gate Attentional Factorization Machines (GAFM) model and successfully integrated it with reinforcement learning. The Deep Reinforcement Factorization Machines (DRFM) model proposed in this paper combines deep learning, with its strong perception ability, and reinforcement learning, with its strong exploration ability, and centers on improving the deployment efficiency and learning performance of the model. The GAFM model is modified and upgraded using multidisciplinary techniques, and a new model-based random exploration strategy is proposed to update and optimize the recommendation list efficiently. Parallel contrast experiments on various datasets show that the DRFM model surpasses traditional recommendation system models in all aspects: it is far superior in performance and robustness and is also significantly improved in deployment efficiency. We also compare it with the latest deep reinforcement learning algorithms and demonstrate its unique advantages.

1. Introduction

Recommendation systems were created to cope with the ever-growing explosion of information. They study the characteristics of users and items and generate better recommendation lists by mining the latent or explicit connections between users and items. The algorithms and models in a recommendation system are trained on massive amounts of user and item feature data. This process often takes a long time and strives to improve the final recommendation accuracy, because even a small improvement generates huge profits in practical applications. Accuracy is therefore central to recommendation systems and has become the first indicator used to measure recommendation algorithms and models.
After decades of development, recommendation system models have matured. From the early collaborative filtering algorithm [1] to the user-based recommendation algorithm [2] and then to the item-based recommendation algorithm [3], these approaches dominated mainstream recommendation for a long time and still perform well today. The proposal of matrix decomposition [4] laid a foundation for the development of factorization machines [5], and logistic regression [6], which grew out of classification problems, became the mainstream algorithm for classification-oriented models. In the era of deep learning, the Wide and Deep model [7] combined generalization and memory for the first time, providing new ideas for subsequent models. Since then, various recommendation system models have emerged endlessly, injecting new vitality into the field.
Recommendation systems are tools that help people process information more conveniently, while reinforcement learning is fundamentally different in concept. It puts forward the concept of the agent, and its core is exploration and learning. That is to say, reinforcement learning lets the agent continuously interact with the environment, repeatedly try, update, and explore, and finally approach or even exceed the target. It can be said that the emergence of reinforcement learning has genuinely changed people's views on the prospects of artificial intelligence, and countless researchers regard it as the real future of the field. It is thanks to the continuous exploration and attempts of pioneers that reinforcement learning has kept developing and evolving.
There is a big gap between deep learning and reinforcement learning. Machine learning, the predecessor of deep learning, has matured after decades of continuous research and development and has made outstanding contributions in many fields. In recent years, the explosive development of reinforcement learning has won the attention of researchers, but it is still in its initial stage, with great plasticity. The perception ability of deep learning and the exploration ability of reinforcement learning have made their respective application fields almost fixed and widely accepted, and few models manage to cross the boundary between the two and draw on both; integrating them into deep reinforcement learning is even more difficult. Based on this, the research in this paper tries to combine a mature and efficient deep learning model with a reinforcement learning algorithm of great potential to create a new high-performance model for the industry.
Last year, we published the principle of the GAFM model [8], which we had studied for more than a year, and we have been working on its subsequent improvement ever since. The GAFM model addresses the problem that traditional recommendation systems pay too much attention to accuracy while ignoring running speed, and it improves both accuracy and speed. We have found in numerous experiments that it is difficult to make progress within the single field of recommendation systems; a better way is to draw on multiple disciplines. Therefore, we turned to reinforcement learning, hoping to apply its exploration and learning mechanism to the GAFM model. Our ultimate goal is a highly transferable model with high recommendation accuracy and low time complexity that does not tie users to a particular recommendation scenario. We are also inspired by the latest published results: in addition to improving and comparing models in terms of efficiency and performance, we add the measure of model deployment efficiency.
In this paper, we propose a deep reinforcement factorization machine (DRFM), which is the successful result of applying reinforcement learning to a recommendation system. The main contributions of this paper are as follows:
(1)
We add the random exploration strategy and learning feedback mechanism of reinforcement learning to the recommendation system model, so that our model can interact with the environment and learn from feedback to improve itself and its overall performance.
(2)
We break the restriction of recommendation system models to a single application scene; the DRFM model proposed in this paper can be applied in a variety of recommendation scenes with excellent results.
(3)
Starting from multi-dimensional consideration of the model, we added deployment efficiency into the evaluation of the model, compared our results with the latest relevant research data, and finally obtained good results.
(4)
Through our parallel comparative experiments on a variety of datasets, the experimental results show that the performance of the DRFM model is better than other traditional models on various data, which proves the rationality and effectiveness of the model.

2. Related Work

The development of deep reinforcement learning is naturally inseparable from the development of deep learning and reinforcement learning. Reinforcement learning, which has reached a peak in recent years, can be traced back to 1954, when Minsky proposed the concept of reinforcement [9], bringing reinforcement learning into the academic field for the first time. In 1957, Bellman proposed the dynamic programming method for solving the optimal control problem and the stochastic discrete version of the Markov decision process [10], and the solution of this method adopted a trial-and-error iterative mechanism similar to reinforcement learning. Although he only used the idea of reinforcement learning to solve the Markov decision process, this in fact led to the Markov decision process becoming the most common form of defining reinforcement learning problems. Coupled with the practical operability of his method, many researchers later believed that reinforcement learning originated from Bellman's dynamic programming. In 1989, the Q-learning proposed by Watkins further expanded the application of reinforcement learning and completed its theory [11]. Q-learning makes it possible to find the optimal action strategy even without knowledge of the immediate reward function and state transition function; in other words, Q-learning makes reinforcement learning no longer depend on a model of the problem. In addition, Watkins also proved that reinforcement learning converges when the system is a deterministic Markov decision process and the returns are bounded; that is, the optimal solution can be obtained. Since then, Q-learning has become the most widely used reinforcement learning method. Figure 1 shows the basic principle of reinforcement learning, from which reinforcement learning gradually builds more complex frameworks:
where $P(S_{t+1} \mid S_t, a)$ represents the probability that the agent transfers to state $S_{t+1}$ given state $S_t$ and action $a$. This is also the basis of Markov decision making.
The development of reinforcement learning then hit a huge block, which coincided with the upsurge of supervised learning. Most researchers turned to traditional machine learning, and the first era of reinforcement learning came to an end. The machine learning era can itself be divided by the emergence of deep learning. Before deep learning, the application of the classic collaborative filtering algorithm can be traced back to the mail filtering system of the Xerox Research Center in 1992 [12], but its real breakthrough came with Amazon [13] in 2003, which made it a well-known classic model. In the 2006 Netflix algorithm competition [14], the recommendation algorithm based on matrix decomposition stood out and kicked off the popularity of matrix decomposition. The core of matrix decomposition is to generate a hidden vector for each user and item and to score candidates according to the distance between vectors. There are many methods for solving the decomposition, among which gradient descent has become the mainstream, which also laid a foundation for the subsequent development of neural networks. In view of the defect that matrix decomposition cannot make comprehensive recommendations, the emergence of logistic regression filled the gap well. Meanwhile, the concept of the perceptron, the basic unit of neural networks, formally appeared, which is significant in many respects. The structure and function of the perceptron are very simple, yet such simple neurons compose today's complex and changeable neural networks; it can be said that the perceptron is the key to machine learning. In order to reduce complexity, Rendle proposed the FM model [15] in 2010, using the inner product of two hidden vectors to replace a single weight coefficient, so as to better handle data sparsity. Specifically, let the feature be $x$ and the weight be $w$; the basic expression of FM is:
$FM(w, x) = \sum_{j_1=1}^{n} \sum_{j_2=j_1+1}^{n} (w_{j_1} \cdot w_{j_2})\, x_{j_1} x_{j_2}$    (1)
By introducing hidden vectors, the number of weight parameters is reduced from the $n^2$ level to $nk$, which can greatly reduce the training cost when using gradient descent.
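As a hedged illustration of Formula (1) and the $n^2$-to-$nk$ reduction, the following NumPy sketch computes the second-order FM term for one sample using the standard reformulation of the pairwise sum; the function and variable names are ours and are not taken from the GAFM or DRFM implementation.

import numpy as np

def fm_second_order(W: np.ndarray, x: np.ndarray) -> float:
    """Second-order FM term of Formula (1) for one sample x, with latent matrix W of shape (n, k)."""
    Wx = W * x[:, None]               # x_j * w_j for every feature j, shape (n, k)
    sum_sq = np.sum(Wx, axis=0) ** 2  # (sum_j x_j w_j)^2 per latent dimension
    sq_sum = np.sum(Wx ** 2, axis=0)  # sum_j (x_j w_j)^2 per latent dimension
    # 0.5 * (sum_sq - sq_sum), summed over latent dimensions, equals the pairwise
    # double sum in Formula (1) and reduces the naive O(n^2 k) cost to O(n k).
    return 0.5 * float(np.sum(sum_sq - sq_sum))

# Tiny example: n = 4 features, k = 3 latent factors
W = np.random.randn(4, 3)
x = np.array([1.0, 0.0, 1.0, 1.0])
print(fm_second_order(W, x))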
Subsequently, the advent of the deep learning era marked the rapid development of various models and algorithms, and deep learning networks emerged endlessly, rapidly promoting the optimization and progress of recommendation systems. In 2016, the Wide and Deep model proposed by Google [16] expounded the concepts of memorization and generalization for the first time, breaking away from traditional model design. Memorization can be understood as the model's ability to directly learn and use the co-occurrence frequency of items or features in historical data, while generalization can be understood as the model's ability to transfer feature correlations and to discover the relationship between rare or even unseen sparse features and the final labels. The Wide part is responsible for the memorization of the model, while the Deep part is responsible for its generalization. This way of combining two network structures exploited the advantages of both sides well and became popular at the time. Among the subsequent improvements of the Wide and Deep model, the DeepFM model [17] in 2017 focused on the Wide part and improved its feature combination ability by using FM, while the Neural Factorization Machines (NFM) model [18] of the same year focused on improving the structure of the Deep part by adding a feature cross pooling layer, further enhancing the data processing capability of the Deep part. Also in 2017, the Attentional Factorization Machines (AFM) model [19] introduced an attention mechanism on the basis of the NFM model, again contributing to the multi-domain integration of recommendation systems.
It is interesting to note that, with the outbreak of the deep learning era, reinforcement learning, which had been dormant for many years, also ushered in a second outbreak, which has continued until now. In 2013, DeepMind published a paper on using reinforcement learning to play Atari games [20], which marked the beginning of a new decade of reinforcement learning. In October 2015, AlphaGo, a program developed by Google DeepMind, made history by beating the professional player Fan Hui and became the first computer Go program to beat a professional Go player on a full 19 x 19 board without handicap. DeepMind, as promised, published the AlphaGo paper [21], which combines Monte Carlo tree search with two deep neural networks: a value network that evaluates a large number of candidate positions and a policy network that selects moves. In this design, the computer can combine the long-term inference of tree search with intuition learned spontaneously, much like the human brain, to improve its playing strength, and then uses reinforcement learning to play against itself, finally achieving the result of beating world-class human Go players. The principle behind it is something every AI enthusiast wants to understand. The strongest and most recent version, AlphaGo Zero, which uses pure reinforcement learning and integrates the value network and policy network into a single architecture, beat the previous version of AlphaGo 100-0 after three days of training. AlphaGo has retired, but the technology lives on. Thus, a new era of reinforcement learning is coming.
In recent years, the fusion of reinforcement learning and deep learning has become a new hot topic in artificial intelligence, and numerous innovative ideas have been proposed. In the field of recommendation systems, how to use the high exploration ability and feedback mechanism of reinforcement learning has undoubtedly become the focus of research. Afsar et al. [22], surveying the development of recommendation systems under the background of deep reinforcement learning, pointed out that the future path of recommendation systems is to add the learning ability of reinforcement learning to accelerate training while improving perception ability to improve the accuracy of recommendation lists. Deployment efficiency is a very new concept. Matsushima et al. [23] described it as the preparation efficiency of a system before it goes online; its proposal aims to remind researchers to pay more attention to, and to measure, the preparation a recommendation system requires before launch. In the latest results from the International Conference on Learning Representations 2022 (ICLR) [24,25], two newly proposed algorithms focus precisely on improving accuracy and deployment efficiency, providing novel and systematic solutions. However, after reading many articles, we found that most research focuses on new and efficient algorithms or mathematical methods, and very little attention is paid to the combination of models and frameworks. More importantly, almost no one has considered the migration capability of models or algorithms, which we think is very important; a model that can effectively handle multiple scenarios is both promising and necessary. Moreover, as the core of the recommendation system, the perception ability of the model should not be ignored or abandoned. Many recent studies prefer to apply reinforcement learning to the recommendation system directly rather than consider how to combine the two to enrich the results of deep reinforcement learning.
As mentioned above, the work of this paper aims to merge two of the most promising fields of AI today, letting them learn from each other to create new, efficient algorithms and models. This work does not start from scratch, because deep reinforcement learning has already been successful and is leading a new wave of research. No matter whether in reinforcement learning or deep reinforcement learning, the model is equivalent to the CPU of a computer, and only a good model can bring qualitative improvement to the whole system. At present, most models in the field of deep reinforcement learning are either popular traditional deep learning models or specialized models built for a specific recommendation scenario, which suffer from lagging performance and a lack of generalization. Therefore, we adopt the GAFM model from our previous research as the core of the DRFM model proposed in this paper and build a reinforcement learning framework around it.

3. Materials and Methods

In this part, we will explain the basic structure and algorithm of a series of models related to our model in Part A; in Part B, we will formally introduce the DRFM model proposed in this paper and give a detailed algorithm explanation; in Part C, we will make a summary of the model.
A.
Related models and algorithms
(1)
GAFM
The GAFM model is the latest research result proposed by us last year. Its basic idea is to improve the Wide and Deep parts, respectively, on the basis of Google’s Wide and Deep model, and add gate structure and drop structure to control the accuracy and speed of the model. As the GAFM model is the core model of DRFM, we will elaborate its basic composition and principle in detail. Figure 2 shows the framework of the GAFM model:
1.
Gate
The gate structure is the key structure that GAFM uses to improve speed, and its main starting point is data complexity; that is, it reduces the model's time consumption from the perspective of the data. Specifically, data complexity mainly comes from two aspects: first, the data itself may be too large; second, the processing in the Embedding layer greatly increases the runtime, which is a major source of the model's time complexity. The main reasons for the slow convergence of the Embedding layer are that the number of parameters is huge and the input vector is too sparse, mainly because the vector dimension becomes very high after one-hot encoding when the input feature vector has too many categories.
Starting from the two aspects mentioned above, we can obtain the two processing formulas of the gate structure, respectively:
(a)
Input data processing
Assuming $s$ is the input data size and $g$ is the gate structure parameter, whose value is between 0 and 1, the new data size $s_{new}$ can be obtained as follows:
$s_{new} = s \cdot g_1$    (2)
The input data can be compressed to a customized degree using Formula (2). Parameter $g$ is a hyperparameter. In particular, $g_1$ represents the parameter of the first gate structure in Formula (2), and $g_2$ represents the parameter of the second gate structure in Formula (3) below. $g_1$ and $g_2$ are essentially the same gate structure parameter and can both be represented by $g$; they are distinguished only to reflect the difference between the two gate structures in the GAFM model. They have the same effect, but their values can differ in application.
(b)
Embedding layer processing
Assuming that the number of input feature categories is $n$ and $g$ is a gate structure parameter with a value between 0 and 1, the Embedding dimension $dim$ can be obtained as follows:
$dim = n \cdot g_2$    (3)
Formula (3) is used to reduce the vector dimension of the Embedding layer, and parameter $g_2$ is a hyperparameter.
Generally speaking, due to the large adjustable range of parameter g , it is not necessary to use the above two methods at the same time in practical application. Therefore, parameter λ is introduced to combine the two formulas:
$\lambda s_{new} + (1 - \lambda)\, dim = \lambda s \cdot g_1 + (1 - \lambda)\, n \cdot g_2$    (4)
$\lambda \to 1$ is applied to compress the data, and $\lambda \to 0$ is applied to compress the Embedding.
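As a minimal sketch, the gate computations of Formulas (2)-(4) can be written as the following Python helpers; the function names and the example sizes are illustrative assumptions, not part of the published GAFM implementation.

def gate_input_size(s: int, g1: float) -> int:
    """Formula (2): compress the input data size s with gate parameter g1 in (0, 1)."""
    return int(s * g1)

def gate_embedding_dim(n: int, g2: float) -> int:
    """Formula (3): shrink the Embedding dimension for n feature categories with g2 in (0, 1)."""
    return int(n * g2)

def gate_combined(s: int, n: int, g1: float, g2: float, lam: float) -> float:
    """Formula (4): lam near 1 favors data compression, lam near 0 favors Embedding compression."""
    return lam * s * g1 + (1.0 - lam) * n * g2

# Example with the gate value used later in the experiments (g = 0.91)
print(gate_input_size(100_000, 0.91))   # 91000
print(gate_embedding_dim(5_000, 0.91))  # 4550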
2.
Drop
The drop structure is another structure we propose to speed up the model. Its starting point is to reduce model complexity, that is, to change the complexity from the perspective of the network structure. In terms of implementation, there are two drop structures in the GAFM model, placed behind the two concatenate layers, which balances the computation when large amounts of data are input. Assume that the concatenated neuron matrix is $concat\_embeds$, the number of input neurons is $n$, and the value of parameter $d$ is between 0 and 1. Then:
$concat\_embeds_{new} = concat\_embeds[: n \cdot d]$    (5)
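A minimal sketch of Formula (5) follows; the batch and unit sizes are illustrative, not GAFM defaults.

import numpy as np

def drop(concat_embeds: np.ndarray, d: float) -> np.ndarray:
    """Formula (5): keep only the first n*d concatenated neurons after a concatenate layer."""
    n = concat_embeds.shape[-1]              # number of concatenated neurons
    return concat_embeds[..., : int(n * d)]

out = drop(np.random.randn(32, 256), d=0.82)
print(out.shape)   # (32, 209), since int(256 * 0.82) = 209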
3.
Attention
The attention mechanism in the model mainly assigns a weight to each input neuron, reflecting the different importance of inputs from different neurons. Specifically, the attention layer in GAFM calculates the weight of each neuron's input through the sigmoid function after the drop layer; this weight is then multiplied by the input vector to obtain the new weighted vector. Suppose the input is $input\_embeds$, the sigmoid function is $\sigma(\cdot)$, and the calculated weight is $score$; then:
$score = \sigma(input\_embeds)$    (6)
The newly weighted output is:
$output = input\_embeds \cdot score$    (7)
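A minimal sketch of Formulas (6) and (7); the shapes are illustrative.

import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def attention_reweight(input_embeds: np.ndarray) -> np.ndarray:
    score = sigmoid(input_embeds)   # Formula (6): a weight in (0, 1) for each neuron
    return input_embeds * score     # Formula (7): element-wise re-weighted output

print(attention_reweight(np.random.randn(32, 128)).shape)   # (32, 128)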
(2)
DQN
The basic principle of reinforcement learning has been described above. The goal of the agent is to select actions through interaction with the environment to maximize future rewards. Here, it is assumed that future rewards will decay with time step, and the decay factor is γ . Then, the sum of future rewards at time t is:
$R_t = \sum_{t'=t}^{T} \gamma^{\,t'-t}\, r_{t'}$    (8)
We define the optimal action-value function $Q^*(s, a)$ as the maximum expected return achievable by any subsequent policy after observing a sequence $s$ and executing action $a$:
$Q^*(s, a) = \max_{\pi} \mathbb{E}\left[ R_t \mid s_t = s,\; a_t = a,\; \pi \right]$    (9)
where π is a policy that maps states to actions.
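As a small illustration of Formula (8), the sketch below sums discounted rewards; a greedy agent with a learned $Q(s, a)$ would then pick the action maximizing $Q$, in the spirit of Formula (9). The function is a generic textbook computation, not code from DQN or DRN.

def discounted_return(rewards, gamma):
    """Formula (8): rewards[k] is r_{t+k}; returns R_t = sum_k gamma**k * rewards[k]."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

print(discounted_return([1.0, 0.0, 2.0, 1.0], gamma=0.9))   # 1 + 0 + 1.62 + 0.729 = 3.349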
(3)
DRN
Another successful and important achievement in the field of deep reinforcement learning is the DRN model launched by Microsoft in 2018 [26], which aims to solve the dynamic learning problem of a news recommendation system. The Q network is adopted as the core network, and the competitive gradient descent algorithm is adopted as the iterative updating method of the network. Figure 3 shows the DRN system flow:
Figure 4 shows the structure of the core network Q network:
(4)
Directional Derivative Projection Policy Optimization (DDPPO) Algorithm
As mentioned above, a paper at ICLR 2022 proposed an efficient mechanism for model-based reinforcement learning that obtains optimal strategies by interacting with a learned environment model. Obviously, this is not an algorithm designed for recommender systems, but it is still relevant to recommender systems under the background of deep reinforcement learning. The researchers investigated the mismatch between model learning and model use. Specifically, in order to obtain the update direction of the current strategy, an effective method is to use the differentiability of the model to calculate its derivative. However, commonly used methods simply regard model learning as a supervised learning task and use the prediction error of the model to guide learning, ignoring the gradient error of the model. In short, model-based reinforcement learning algorithms often require accurate model gradients but only reduce the prediction error during the learning phase, so there is a problem of inconsistent goals.
The researchers designed two different models and assigned them different roles during the learning and application phases of the models. In the model learning phase, the researchers designed a feasible method to calculate the gradient error and used it to guide the gradient model learning. In the model application stage, researchers first use the prediction model to obtain the predicted trajectory, and then use the gradient model to calculate the model gradient. Combined with the above methods, the DDPPO algorithm is proposed. Algorithm 1 is as follows:
Algorithm 1 Directional Derivative Projection Policy Optimization
1: Initialize the policy $\pi$, predictive model $\tilde{M}^p_{\theta_p}$, gradient model $\tilde{M}^g_{\theta_g}$, and value function $Q_\varphi$
2: Initialize the environment replay buffer $D_{env}$ and the model replay buffer $D_{model}$
3: repeat
4:  Take an action in the real environment with policy $\pi$ and add the transition $(s, a, s')$ to $D_{env}$
5:  Train the predictive model $\tilde{M}^p_{\theta_p}$ and update $\theta_p$ by maximizing $J_{\tilde{M}^p}(\theta_p)$
6:  Train the gradient model $\tilde{M}^g_{\theta_g}$ and update $\theta_g$ by maximizing $J_{\tilde{M}^g}(\theta_g)$
7:  for $N_{rollouts}$ steps do
8:    Sample $s_t$ uniformly from $D_{env}$
9:    Perform a $k$-step model rollout on $\tilde{M}^p_{\theta_p}$ starting from $s_t$ with policy $\pi$
10:    Add the samples to $D_{model}$
11:  end for
12:  $D \leftarrow D_{env} \cup D_{model}$
13:  for $N_{updateQ}$ steps do
14:    Update $\varphi$ using data from $D$: $\varphi \leftarrow \varphi - \lambda_Q \nabla_\varphi J_Q(\varphi)$
15:  end for
16:  for $N_{updatePolicy}$ steps do
17:    Sample trajectories of length $H$ with policy $\pi$ and predictive model $\tilde{M}^p_{\theta_p}$
18:    Update the policy parameters using the sampled trajectories and the gradient model $\tilde{M}^g_{\theta_g}$ by ascending $\lambda_\pi \nabla J_\pi$
19:  end for
20: until the policy performs well in the real environment $M$
21: return the optimal policy $\pi^*$
The core of Algorithm 1 is to train two different models, namely the prediction model and the gradient model. The DDPPO algorithm uses different loss functions to train these two models and uses each appropriately in strategy optimization. In general, the DDPPO algorithm takes into account the gradient error ignored by previous studies and effectively improves accuracy through dual-model training. However, the algorithm still does not consider deployment efficiency or time complexity. The specific differences will be compared in Section 4.
(5)
Deployment-Efficient Reinforcement Learning Algorithm
On the other hand, again for recommendation systems, researchers are focusing on deployment efficiency, trying to save time by reducing the number of tests before going online. The learning process of traditional (online) reinforcement learning (RL) can be summarized as a two-part cycle: one is to learn a policy based on collected data; the second is to deploy the policy to the environment for interaction and obtain new data for subsequent learning. The goal of reinforcement learning is to complete the exploration of the environment in such a cycle and improve the strategy until it is optimal.
However, in some practical applications, the process of deploying the policy is very tedious, and the data collection process is relatively fast after deploying the new policy. For example, in the recommendation system, the strategy is the recommendation plan, and a good strategy can accurately recommend the content that users need. Considering user experience, a company usually conducts internal tests for a long time before launching a new recommendation policy to check its performance. Due to the large user base, a large amount of user feedback data can be collected within a short period of time after deployment for subsequent policy learning. In such applications, researchers are more likely to choose algorithms that require only a small number of deployments to learn good strategies.
However, there is still a gap between existing reinforcement learning algorithms and theories and these real needs, and in that work the researchers tried to fill this gap. From a theoretical perspective, they first provided a rigorous definition of deployment-efficient RL. Then, using episodic linear MDPs as a concrete setting, they investigated lower bounds and proposed algorithm designs that achieve optimal deployment complexity.
In the lower bound section, researchers contributed the construction and related proof of the lower bound of the theory. In the Upper Bound section, the researchers proposed a layer-by-layer exploration strategy and contributed a new algorithm framework based on covariance matrix estimation, as well as some technical innovations. The researchers’ conclusions also reveal the significant effect of deploying randomness strategies on reducing deployment complexity, which has often been overlooked in previous work.
The DE-RL algorithm solves the problem of deployment efficiency well, but it does not improve performance correspondingly. How to improve these two complementary aspects together is worth studying in the future.
B.
Deep Reinforcement Factorization Machines
Before introducing the main algorithm and structure of the DRFM model, we should first clarify the purpose of applying deep learning and reinforcement learning to recommendation systems. As mentioned above, the traditional recommendation system model uses different network layers and algorithms to process user and item data and strives to better mine hidden features to improve the final recommendation accuracy of the model. Therefore, the neural network’s perception ability in deep learning is so suitable for recommendation systems. It can mine the information from existing data to the maximum and make effective use of it to give the most appropriate results. So, what are the advantages of reinforcement learning? The most important one is dynamic learning, which means that it has a growing property that is constantly getting stronger. It may start out poorly, but with a lot of trial and feedback, it can quickly improve and adapt to a particular task.
Figuratively speaking, the application of deep learning in recommendation systems is more like a short-term working process, which emphasizes the use of all existing data for detailed calculation and reasoning, and finally gives the results within its ability. Although our GAFM model has many advantages compared with the traditional model, it is still a deep learning model fundamentally, with only a more thorough information mining in the short term. Reinforcement learning, on the other hand, relies on its ability to learn and update online. It doesn’t care how bad the model turns out to be at the beginning, because it can always be corrected. In other words, it is this mechanism that relies on feedback and updates that makes reinforcement learning a long-term learning model. Its learning is more like human learning and is reflected through trial and error, so its upper limit is also higher.
By now, the two modes of learning have become clear, and the pros and cons of each have been sorted out. Deep learning can give a good recommendation in one shot, but after the recommendation, the work is considered over and there is no update at all. The self-renewal ability of reinforcement learning means the agent performs poorly at the beginning, but after repeated updates it can exceed the limit of deep learning and reach a new height. Therefore, we combine the GAFM model with deep reinforcement learning so that the model gains the ability to update in the long term while the update cycle of reinforcement learning is shortened, giving the model high accuracy at an early stage. At the same time, GAFM, as the core of the model, ensures that high perception ability and data mining ability are retained. This is how the model achieves higher deployment efficiency.
Figure 5 shows the system framework of the DRFM. As mentioned above, we still built the framework indexed by the reinforcement learning update process. The whole model is divided into three parts: agent, environment, and state. In the agent, the GAFM model is taken as the computing point, with random exploration strategy, and the model itself is dynamically updated according to the feedback of the environment and the current state. The environment consists of a collection of user and item characteristics, which receive a list of recommendations from agents and give feedback to agents, as well as state updates. The state is composed of user features and context features. According to the current state, the agent can calculate the latest recommendation results. Next, we will go through the rationale for each section in detail:
(1)
Agent
Agents are undoubtedly the most important part of the entire framework, carrying out the key work of the model and determining the number and effect of update iterations. In DRFM, the agent consists of the core model GAFM and the exploration strategy. The GAFM model has already been described in detail, so here we focus on explaining what the exploration strategy is.
The exploration strategy is itself an important part of reinforcement learning. Its purpose is to prevent the agent from sticking to the same strategy throughout the updates and to encourage it to make bold random attempts under certain conditions or probabilities, which sometimes has unexpected benefits. In the DRFM model, the basic idea of the exploration strategy is that, at each recommendation attempt, the GAFM model randomly fine-tunes its parameters to generate another temporary model; the two models then compute their own recommendation lists at the same time, and the feedback given by the environment lets us judge which of the two models is better and determine the direction in which the model parameters evolve. In other words, the exploration strategy provides the basis for updating the model. Figure 6 shows the basic flow of the exploration strategy:
From Figure 6, we can see the updating principle of the agent. Specifically, before each recommendation, the GAFM model will conduct random exploration according to Formula (10):
$W + \Delta W = \tilde{W}$    (10)
Here, $W$ is the set of all parameters of the GAFM model, collectively called the model parameters, and $\Delta W$ is a random perturbation of the existing GAFM parameters. The new model parameters $\tilde{W}$ are created by adding the two, and the model corresponding to the new parameters is GAFM'. The random perturbation is generated by Formula (11):
$\Delta W = \alpha \cdot rand(-1, 1) \cdot W$    (11)
Here, $\alpha$ is the exploration factor, which determines the exploration intensity, and $rand(-1, 1)$ is a random number between -1 and 1. Therefore, we can combine Formula (10) and Formula (11):
$W \cdot (1 + \alpha \cdot rand(-1, 1)) = \tilde{W}$    (12)
At the end of the exploration, we essentially have two models with slightly different parameters. Each model produces its own recommendation list, and what the model does next is merge the two lists while keeping a portion of each. Specifically, we keep half of each recommendation list, merge them into a Final List, and hand it to the environment for feedback.
Based on the feedback given by the environment, we can judge which of the two original recommendation lists is more accurate and thus decide which parameter set to keep as the starting model before the next recommendation. This constitutes one update cycle; after many such updates, the model becomes accurate enough to meet our requirements.
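A hedged sketch of this exploration-and-update cycle (Formulas (10)-(12)) is given below; recommend and environment_feedback are placeholders for the GAFM forward pass and the binary user feedback described in the Environment subsection, and all names are ours rather than the actual DRFM code.

import numpy as np

def explore_and_update(W, alpha, recommend, environment_feedback):
    """One DRFM update cycle: perturb the GAFM parameters, fuse the two
    recommendation lists, and keep the better-performing parameter set."""
    # Formula (12): W_tilde = W * (1 + alpha * rand(-1, 1)), applied element-wise
    W_tilde = [w * (1.0 + alpha * np.random.uniform(-1.0, 1.0, size=w.shape)) for w in W]

    list_old = recommend(W)        # recommendation list of the current GAFM
    list_new = recommend(W_tilde)  # recommendation list of the perturbed model GAFM'

    # Keep half of each list; the merged Final List is what the environment sees
    half = len(list_old) // 2
    final_list = list_old[:half] + list_new[:half]

    # Binary (right/wrong) feedback on each half tells us which parameters were better
    score_old = environment_feedback(list_old[:half])
    score_new = environment_feedback(list_new[:half])
    return (W if score_old >= score_new else W_tilde), final_list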
(2)
Environment
The environment is the hub of the DRFM cycle, and it consists of concrete objects rather than computation-oriented vectors. Users in the environment are the key to continuously providing feedback during model updates; they give feedback on the recommendation list produced by the model each time. The form of this feedback differs across recommendation scenarios, so we adopt a binary form of feedback: the feedback ultimately tells the model whether each specific recommendation result is correct or wrong, which makes it easier for the model to accept the feedback clearly and update its parameters. The objects in the environment are the objects to be recommended in the model's recommendation scenario.
(3)
State
The composition of the state is very simple: it consists of the user features and context features extracted from the input data. It guides the agent, which calculates the latest recommendation list at each step based on the current state and the feedback-updated model, and it acts as the database of the entire framework.
C.
Summary of the DRFM model
In general, DRFM is a successful attempt to apply deep reinforcement learning to recommendation systems. Its external framework is a self-renewing learning framework based on reinforcement learning, which consists of the agent as the computational and iterative core, the environment as the interaction hub, and the state as the guiding database. The key model inside the agent is the efficient GAFM model, which temporarily generates another model according to the random exploration strategy before each update and produces a fused list for the environment from the two models' recommendation lists. The environment hands its feedback over to the state update and the agent update; the agent decides which of the two models to keep according to the feedback and generates the next recommendation list according to the latest state. After many iterations, the model eventually reaches a balance of high accuracy and decent speed.
Therefore, compared with traditional models, DRFM has the following advantages:
(1)
First, the GAFM model at the core of DRFM has higher recommendation accuracy and faster running speed than other traditional recommendation system models. Second, by integrating deep reinforcement learning, the model gains the ability of random exploration and self-learning and dynamically interacts with and updates from users in the environment, giving it a higher upper limit.
(2)
Compared with traditional reinforcement learning, it breaks the barrier of reinforcement learning having no perception ability and gives the learning and evolution of reinforcement learning stronger direction and guidance. At the same time, the typical weakness of reinforcement learning is mitigated, so that the model performs well from the beginning and can reach a high upper limit with fewer updates.
(3)
From the perspective of application, the DRFM model breaks away from the common practice in deep reinforcement learning of building models for a specific recommendation scenario and performs well for different recommended items in a variety of recommendation scenarios.

4. Experiments and Results

A.
Settings
In terms of experimental setup, the DRFM model is a deep reinforcement learning model constructed with the GAFM model as the core, so our experimental steps and experimental comparisons are based on the GAFM model. Specifically, the GAFM model has advantages in both accuracy and speed. Therefore, our experiment also focused on two questions:
Q1. Does our model significantly improve prediction accuracy compared to other existing models?
Q2. Is our model significantly faster than other existing models?
(1)
The dataset
In terms of dataset selection, in order to conduct comparative experiments, we selected the same three datasets used in GAFM model training and built a recommendation system environment around them. The first dataset is the Adult data that Becker extracted from the 1994 census database [27]; each record contains a person's age, occupation, and other attributes, and the dataset can be used to predict income. The other two datasets are from the public MovieLens [28] series; we select the 1M and 10M versions, respectively, to perform multi-class classification of film ratings.
(2)
Evaluation method
Model effectiveness is evaluated on two aspects. The first is prediction accuracy: each dataset is divided into a training set and a test set at a fixed ratio, the model is trained on the training set and evaluated on the test set, and a certain number of iterations is carried out so that the final result reflects the model after it has reached a balance through feedback learning. The specific evaluation method is as follows: we sum the squared deviations between the predicted results and the true values to obtain a total deviation value, and the proportion of the remaining correct values to the total is the accuracy rate. For time efficiency, we record the total time from the start of training to the completion of evaluation and compare the models horizontally.
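For illustration only, the sketch below mirrors this protocol with a plain train/test split, a correct-prediction ratio as the accuracy measure, and wall-clock timing from the start of training to the end of evaluation; the 80/20 split, the fit/predict interface, and the simplified accuracy are our assumptions rather than the exact deviation-based measure described above.

import time
import numpy as np
from sklearn.model_selection import train_test_split

def evaluate(model, X, y, test_size=0.2, seed=42):
    """Return (accuracy, total_time) for one model on one dataset."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed)
    start = time.time()
    model.fit(X_train, y_train)                                  # training phase
    accuracy = float(np.mean(model.predict(X_test) == y_test))   # held-out accuracy
    elapsed = time.time() - start                                # training + evaluation time
    return accuracy, elapsed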
(3)
Baselines
Finally, we compared DRFM with the following models:
(a)
GAFM: As our core model, we first and foremost compared it in many ways.
(b)
DeepCrossing: Although an early model, it works well in a variety of small scenarios with very little time complexity, so we used it primarily as a speed comparison target.
(c)
Wide and Deep: As an original series of models, it first proposed the concepts of memory and generalization; thus, it is regarded as a basic standard of Wide and Deep models.
(d)
DeepFM: As one of the latest models in recommendation systems, it has a good performance in recommendation accuracy, so it is used as a comparison target for accuracy.
(e)
AFM: The AFM model adopts the attention mechanism, which further improves prediction accuracy. We use it as the main comparison target for accuracy.
(4)
Parameter function explanation:
There are many parameters involved in this experiment, and a considerable part of them are related parameters inherited from the GAFM model. Specifically, parameter g is the parameter regulated by gate structure in the GAFM model. Its function is to compress the scale of the input data, and its value is between 0 and 1. Parameter d is the parameter adjusted by drop structure in the GAFM model. Its function is to accelerate the operation process of gradient descent by cutting out corresponding neurons from the perspective of the network structure, and its value is between 0 and 1. In the framework of reinforcement learning, the parameter we need to explain is α , which determines the exploration intensity when we adopt the random exploration strategy. It is figuratively understood as the exploration factor, and its value is between 0 and 1.
(5)
Parameter Settings:
For the GAFM model, the two most important parameters are $d$ and $g$, because the model iteration in DRFM mainly updates the various parameters of the GAFM model, so the initial values of $d$ and $g$ are set to the optimal values obtained in our experiments, that is, $d = 0.82$ and $g = 0.91$. In the network structure of GAFM, we set three Dense layers with 512, 256, and 128 units, respectively. The batch size is 128, and the number of epochs is 15, representing 15 iterations. For DRFM's exploration strategy, we set the exploration factor $\alpha$ to 0.1.
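The snippet below records these settings as a Keras-style configuration for reference; the input dimension, activations, optimizer, and the omission of the gate, drop, and attention wiring are simplifying assumptions, so this is not the full GAFM/DRFM network.

from tensorflow import keras
from tensorflow.keras import layers

# Hyperparameters stated above; the dictionary name is ours
HPARAMS = {"d": 0.82, "g": 0.91, "alpha": 0.1, "batch_size": 128, "epochs": 15}

def build_dense_stack(input_dim: int) -> keras.Model:
    """Dense 512 -> 256 -> 128 tower as configured in the experiments (simplified)."""
    inputs = keras.Input(shape=(input_dim,))
    x = layers.Dense(512, activation="relu")(inputs)
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dense(128, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return keras.Model(inputs, outputs)

# model = build_dense_stack(input_dim=...)
# model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, batch_size=HPARAMS["batch_size"], epochs=HPARAMS["epochs"])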
B.
Accuracy evaluation (Q1)
In order to answer our first question, that is, the accuracy of the model, we show the accuracy effect of DRFM and other models in this part and give corresponding explanations.
(1)
Figure 7 shows the model’s performance on dataset 1:
(2)
Figure 8 shows how the model behaves in dataset 2:
(3)
Figure 9 shows how the model behaves in dataset 3:
(4)
Conclusions:
As can be seen from the performance on the three datasets, the training trend of the DRFM model is on the whole the same as that of the other models, which indicates that the DRFM model successfully overcomes the reinforcement learning defects of requiring a large number of iterations and starting from a low point. With the help of the GAFM model's efficient representation, the DRFM model starts from the same point as the other models, which is a major breakthrough. In addition, the accuracy curves show that the DRFM model improves sharply in the early iterations and stabilizes quickly in the later iterations. Its average accuracy is higher than that of the other models and significantly higher than that of the DeepCrossing model.
C.
Running time evaluation (Q2)
In this part, corresponding to the above experiment, we mainly studied the performance of the running time of each model on the dataset, that is, the evaluation of the running speed.
(1)
Figure 10 shows how the model behaves in dataset 1:
(2)
Figure 11 shows how the model behaves in dataset 2:
(3)
Figure 12 shows how the model behaves in dataset 3:
(4)
Analysis and comparison of deployment efficiency:
In this section, we show the running time of each model on the datasets. From a practical point of view, this reflects the deployment efficiency mentioned at the beginning of this article well, because a recommendation system driven by the model itself does not require additional time- and performance-consuming preparation. From this we can draw a preliminary conclusion: the deployment efficiency of DRFM in the recommendation system is quite good, second only to GAFM, and far exceeds that of the other models. In principle, it is not difficult to understand why the time complexity of DRFM is higher than that of GAFM: DRFM adds the random exploration and self-learning update mechanism of reinforcement learning, and the resulting outer loop on top of the inner loop inevitably increases the time complexity of the model. However, this is necessary for the recommendation system to obtain high exploration ability and performance; otherwise, it could not be called a deep reinforcement learning model. We can clearly see from the experimental results of the first part that DRFM is significantly better than GAFM in performance, which is of inestimable value in practical applications.
On the other hand, we mentioned the two latest research achievements, the DDPPO algorithm and the DE-RL algorithm, in the second part of related work. Because they are oriented toward reinforcement learning rather than recommendation systems and it is difficult to transform the algorithm, we cannot reproduce them and make actual comparisons online in a short time. However, by comparing the public results and data of the two experiments with the results of our DRFM model, we can still find that our model is better than the two algorithms in terms of deployment efficiency or running speed, although the performance is almost equal. We are unable to give a detailed analysis, but we believe that this is due to the excellent performance of the GAFM model and the high efficiency of the random exploration strategy. We will focus on this question in the next phase of our research.
(5)
Conclusions:
The difference in running speed is reflected in the running time results. First, from the perspective of longitudinal comparison, the overall running time increases as the scale of the three datasets increases. From the perspective of horizontal comparison, the DRFM model is not the fastest on each dataset, but it is second only to the GAFM model. This is inevitable, because DRFM adds feedback and loop mechanisms on top of the GAFM model, which increases the time complexity. Nevertheless, DRFM maintains a fairly low running time, even lower than that of the DeepCrossing model, which is another indication that the GAFM model works well.
D.
Data summary:
In this section, we present the results of DRFM and the other models in the form of tables and finally give the accuracy improvement of DRFM over the other models.
(1)
Table 1 shows the performance of Accuracy.
(2)
Table 2 shows the running time results.
(3)
Table 3 shows the performance growth of DRFM model compared with other models.

5. Discussion

As can be seen from the experimental results in the fourth part, the DRFM model inherits the high accuracy and low running time of its core model, the GAFM model, and has many advantages compared with other models. Specifically, we found the following:
(1)
From the perspective of reinforcement learning, using more effective and technical agents can effectively reduce the number of iterations in the experiment and significantly enhance the effect. This is because stronger agents and effective exploration strategies can make the model have higher accuracy in early iterations, thus raising the lower limit of learning and, to some extent, raising the upper limit.
(2)
From the point of view of the recommendation system, the DRFM model makes further progress in the final recommendation accuracy and reaches a new height. This is due to the self-updating process of the integrated reinforcement learning, which enables the model to dynamically adjust parameters according to the feedback it obtains. Although an increase in time complexity is inevitable, the model retains the double advantage of overall accuracy and speed.
(3)
It has been shown that the model performs differently on datasets of different sizes and types, but the overall trend is the same. Specifically, the recommendation environments built from the datasets used in the experiment are user-oriented and item-oriented, respectively, so the members of the environment and state differ. However, the experiments show that this does not affect the effective and fast learning of the DRFM model or its recommendation results. In other words, the DRFM model is not only suitable for specific recommendation scenarios but also has high generality.
The models compared with DRFM in this paper show different performance in the experimental results. Except for the DeepCrossing model, the accuracy of the models does not differ greatly, but only DRFM and GAFM are faster in terms of running speed. The advantage of the DeepCrossing model in running speed can also be seen by comparing it with the other three models. In addition, in terms of model parameters, although we fixed the exploration factor $\alpha = 0.1$ in the experiment, which represents the weight we assign to exploration, we found through many comparative experiments that the magnitude of the exploration has a great impact on the direction and effect of the model iteration. We conclude that the value of parameter $\alpha$ should be between 0.05 and 0.25; beyond this range, the effect decreases greatly. In general, the DRFM model meets or even exceeds our expectations and has good effects and advantages in all aspects.

6. Conclusions and Future Work

Based on our previous research on the GAFM model, this paper is committed to creating an excellent, universal deep reinforcement learning recommendation system from the perspective of integrating reinforcement learning and deep learning in the service of recommendation. Specifically, we apply the reinforcement learning concept of the agent to the recommendation system and nest the GAFM model inside it as the core. Using random exploration, we generate another set of model parameters before each recommendation, fuse the two recommendation lists, obtain feedback on the result, and update the model parameters according to that feedback, achieving dynamic adjustment of the model. Through experiments, we show that the DRFM model performs better in both recommendation accuracy and running speed, and we also show that the desired effect can be achieved with fewer iterations.
Regarding the contribution and influence of this research, first, it is a comprehensive upgrade of the GAFM model proposed in the previous stage. We added the self-renewal cycle of reinforcement learning, which overcomes the defects of the deep learning model in exploration and learning ability; this is a good demonstration of what a deep reinforcement learning model can achieve. Second, the DRFM model has a good leading effect in engineering applications and practical scenarios, which means that the model can be migrated across multiple scenarios rather than being optimized for a specific one. We think our breakthrough can inspire other researchers and promote the further development of recommendation system models and algorithms.
In future work, we will focus on improving the environment and structure of the DRFM model. For example, in the random exploration strategy of the model update, we currently generate one new set of model parameters at random and produce a new recommendation list from it; we will try generating multiple sets of model parameters at random. More importantly, we will investigate whether there is a better strategy for fusing recommendation lists, that is, whether two or more recommendation lists can be merged in a more reasonable or more even way so that the feedback becomes more accurate.

Author Contributions

Conceptualization, H.Y.; methodology, H.Y.; software, H.Y.; validation, H.Y.; formal analysis, H.Y.; writing—original draft preparation, H.Y.; writing—review and editing, J.Y.; supervision, J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant 61971268.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets in this study are available online at UCI Machine Learning Repository: Adult Data Set and MovieLens|GroupLens.

Acknowledgments

We thank the National Natural Science Foundation of China for funding our work, grant number 61971268.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
GAFM    Gate Attentional Factorization Machines
FM      Factorization Machine
DeepFM  Deep Factorization Machine
NFM     Neural Factorization Machine
AFM     Attentional Factorization Machine
DQN     Deep Q Network
DRN     Deep Reinforcement Network

References

1. Storm, A.P.; Content, O.F. An Integrated Approach to TV & VOD Recommendations Archived 6 June 2012 at the Wayback Machine; Red Bee Media: London, UK, 2012.
2. Liu, H.; Kong, X.; Bai, X.; Wang, W.; Bekele, T.M.; Xia, F. Context-Based Collaborative Filtering for Citation Recommendation. IEEE Access 2015, 3, 1695–1703.
3. Wu, Y.; Wei, J.; Yin, J.; Liu, X.; Zhang, J. Deep Collaborative Filtering Based on Outer Product. IEEE Access 2020, 8, 85567–85574.
4. Shan, H.; Banerjee, A. Generalized probabilistic matrix factorizations for collaborative filtering. In Proceedings of the 2010 IEEE International Conference on Data Mining, Sydney, NSW, Australia, 13–17 December 2010; pp. 1025–1030.
5. Luo, X.; Zhou, M.; Xia, Y.; Zhu, Q. An efficient non-negative matrix-factorization-based approach to collaborative filtering for recommender systems. IEEE Trans. Ind. Inform. 2014, 10, 1273–1284.
6. Mendes, Z. Reading and Understanding Multivariate Statistics. Anal. Psicol. 1997, 15, 162.
7. Zheng, Z.; Yang, Y.; Niu, X.; Dai, H.N.; Zhou, Y. Wide and Deep Convolutional Neural Networks for Electricity-Theft Detection to Secure Smart Grids. IEEE Trans. Ind. Inform. 2018, 14, 1606–1615.
8. Li, Y. Gate Attentional Factorization Machines: An Efficient Neural Network Considering Both Accuracy and Speed. Appl. Sci. 2021, 11, 9546.
9. Minsky, M.L. Theory of Neural-Analog Reinforcement Systems and Its Application to the Brain-Model Problem; Princeton University: Princeton, NJ, USA, 1954.
10. Bellman, R. A Markovian Decision Process. Indiana Univ. Math. J. 1957, 6, 15.
11. Watkins, C.J.C.H. Learning from Delayed Rewards. Ph.D. Thesis, King's College, University of Cambridge, Cambridge, UK, 1989.
12. Goldberg, D.; Nichols, D.; Oki, B.M.; Terry, D. Using collaborative filtering to weave an information tapestry. Commun. ACM 1992, 35, 61–70.
13. Linden, G.; Smith, B.; York, J. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Comput. 2003, 7, 76–80.
14. Koren, Y.; Bell, R.; Volinsky, C. Matrix factorization techniques for recommender systems. Computer 2009, 42, 30–37.
15. Rendle, S. Factorization machines. In Proceedings of the 2010 IEEE International Conference on Data Mining, Sydney, NSW, Australia, 13–17 December 2010; pp. 995–1000.
16. Cheng, H.T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, Boston, MA, USA, 15 September 2016; pp. 7–10.
17. Guo, H.; Tang, R.; Ye, Y.; Li, Z.; He, X. DeepFM: A factorization-machine based neural network for CTR prediction. arXiv 2017, arXiv:1703.04247.
18. He, X.; Chua, T.S. Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Tokyo, Japan, 7–11 August 2017; pp. 355–364.
19. Xiao, J.; Ye, H.; He, X.; Zhang, H.; Wu, F.; Chua, T.S. Attentional factorization machines: Learning the weight of feature interactions via attention networks. arXiv 2017, arXiv:1708.04617.
20. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. arXiv 2013, arXiv:1312.5602.
21. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
22. Afsar, M.M.; Crump, T.; Far, B. Reinforcement learning based recommender systems: A survey. arXiv 2021, arXiv:2101.06286.
23. Matsushima, T.; Furuta, H.; Matsuo, Y.; Nachum, O.; Gu, S. Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization. arXiv 2020, arXiv:2006.03647.
24. Li, C.; Wang, Y.; Chen, W.; Liu, Y.; Ma, Z.M.; Liu, T.Y. Gradient Information Matters in Policy Optimization by Back-propagating through Model. In Proceedings of the International Conference on Learning Representations (ICLR 2022), Virtual, 25 April 2022.
25. Huang, J.; Chen, J.; Zhao, L.; Qin, T.; Jiang, N.; Liu, T.Y. Towards Deployment-Efficient Reinforcement Learning: Lower Bound and Optimality. In Proceedings of the International Conference on Learning Representations (ICLR 2022), Virtual, 25 April 2022.
26. Zheng, G.; Zhang, F.; Zheng, Z.; Xiang, Y.; Yuan, N.J.; Xie, X.; Li, Z. DRN: A deep reinforcement learning framework for news recommendation. In Proceedings of the 2018 World Wide Web Conference, Lyon, France, 23–27 April 2018; pp. 167–176.
27. Kohavi, R.; Becker, B. UCI Machine Learning Repository: Adult Data Set. 1 May 1996. Available online: https://archive.ics.uci.edu/ml/datasets/adult (accessed on 30 March 2022).
28. Harper, F.M.; Konstan, J.A. The MovieLens Datasets: History and Context. ACM Trans. Interact. Intell. Syst. 2015, 5, 1–19.
Figure 1. Basic framework of reinforcement learning.
Figure 2. Main framework of the GAFM.
Figure 3. DRN system.
Figure 4. Q network structure.
Figure 5. DRFM system framework.
Figure 6. DRFM exploration strategy basic flow.
Figure 7. Accuracy of each model on dataset 1.
Figure 8. Accuracy of each model on dataset 2.
Figure 9. Accuracy of each model on dataset 3.
Figure 10. Running time of each model on dataset 1.
Figure 11. Running time of each model on dataset 2.
Figure 12. Running time of each model on dataset 3.
Table 1. Accuracy of each model.

                 Dataset 1    Dataset 2    Dataset 3
DRFM             0.8493       0.8389       0.8292
GAFM             0.8438       0.8342       0.8238
DeepCrossing     0.8269       0.8269       0.8033
Wide and Deep    0.8450       0.8317       0.8211
DeepFM           0.8495       0.8335       0.8225
AFM              0.8427       0.8327       0.8266
Table 2. Time of each model.

                 Dataset 1    Dataset 2    Dataset 3
DRFM             16.88        366          19764
GAFM             16.03        339          19189
DeepCrossing     17.56        398          21045
Wide and Deep    17.49        447          22356
DeepFM           19.22        452          23452
AFM              19.84        476          24476
Table 3. Growth of the DRFM compared with other models.

Baselines        Improvement
GAFM             +1.56%
DeepCrossing     +6.03%
Wide and Deep    +1.96%
DeepFM           +1.19%
AFM              +1.54%
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
