Article

Load Balancing of Two-Sided Assembly Line Based on Deep Reinforcement Learning

1 Process Research Institution, China National Heavy Duty Truck Group Co., Ltd., Jinan 250100, China
2 5G Intelligent Manufacturing Research Center, Institute of Marine Equipment, Shanghai Jiao Tong University, Shanghai 200240, China
3 Institute of Intelligent Manufacturing and Information Engineering, School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
* Author to whom correspondence should be addressed.
Submission received: 20 May 2023 / Revised: 12 June 2023 / Accepted: 20 June 2023 / Published: 23 June 2023
(This article belongs to the Special Issue Design and Optimization of Manufacturing Systems)

Abstract

In a complex and ever-changing manufacturing environment, maintaining the long-term, steady, and efficient operation of the assembly line is the ultimate goal pursued by the relevant enterprises, and its foundation is a balanced load. Therefore, this paper studies the two-sided assembly line balance problem (TALBP) for load balancing. First, a mathematical programming model is established with the objectives of optimizing the line efficiency, smoothness index, and completion time smoothness index of the two-sided assembly line (TAL). Second, a deep reinforcement learning algorithm combining distributed proximal policy optimization (DPPO) and a convolutional neural network (CNN) is proposed. Based on the distributed reinforcement learning agent structure assisted by the mask layer, the task assignment states of the TAL and the decisions of selecting tasks are defined. The task assignment logic and reward function are designed according to the optimization objectives to guide task selection and assignment. Finally, the performance of the proposed algorithm is verified on benchmark problems.

1. Introduction

An assembly line (AL) is an arrangement of workstations in a streamlined manner according to the product assembly process sequence for the organization and the arrangement of production. A workstation is an assembly unit that focuses on a specific segment of production, and the workpiece is assembled in all the workstations to form a complete product. Due to the assembly line adopting the flow operation mode, the assembly operation process is standardized, and the assembly workers are generally fixed in a single station or several adjacent stations for repeated operations, which increases the utilization rate of workers, thus greatly improving production efficiency [1].
In the two-sided assembly line (TAL), the left and right sides of the same workstation can independently execute assembly of the same product with different processes in parallel, as shown in Figure 1. The paired left and right stations are called mated stations [2]. Compared with the one-sided AL, the TAL can effectively shorten the length of the AL, improve the utilization rate of auxiliary tools, reduce the time loss caused by the movement of workers between various stations, and reduce the transportation cost of assembly parts [2].
To balance the TAL is to distribute a group of tasks as evenly as possible to each station under certain constraints in order to optimize one or more objectives [3]. However, after the assembly line is put into operation, the original balance parameters, such as cycle time, operation content, operation time, and assembly process, may change with the improvement of assembly workers' skills, product upgrades, changes in customer demand, etc. In such cases, the original assembly line balance may be disrupted, preventing the assembly line from operating stably, efficiently, and with high quality, and thus greatly reducing the economic benefits of the relevant enterprises. Therefore, it is necessary to optimize and improve the original balancing scheme by reassigning tasks to achieve load balance and improve the operation efficiency of the TAL.
TALBP is an NP-hard combinatorial optimization problem [4]. Since it was proposed in 1993 [5], it has been widely studied. The main research methods include exact algorithms, heuristic algorithms, and meta-heuristic algorithms [1,2]. Although the exact algorithm can obtain the optimal solution, its solving speed is slow, so it can only be used for small-sized problems; the heuristic algorithm is fast, simple, and efficient, but its solutions cannot reach the global optimum; and the meta-heuristic algorithm is relatively fast and effective, but the iterative search process is usually time-consuming and must be repeated for each problem case. Moreover, these traditional optimization algorithms rarely make use of historical information to adjust their behavior and cannot effectively learn from historical solving experience; hence, there is great room for improvement in solving large-scale problems.
In addition, different cases of a problem share the same combinatorial optimization structure (the objective function and the coefficients of the constraint conditions) and differ only in specific values; for example, improving assembly workers' skills reduces the operation times of tasks, but the relationships between tasks do not change. As an artificial intelligence method closer to the way humans think, the deep reinforcement learning algorithm has a deep association learning ability compared with traditional optimization algorithms. When solving cases of the same scale, the historical experience of different cases—that is, existing task assignment schemes—can be associated and learned, the essential information of the problem can be mined, and task assignment strategies can be obtained and automatically updated to achieve efficient solutions to similar cases. Moreover, the deep reinforcement learning algorithm is more adaptable to the complex and ever-changing production environment, as it can be easily adjusted to alter the solution. In addition, although deep reinforcement learning algorithms have begun to be applied to such AL balancing problems [6,7], they are currently used to optimize simulation models and resource allocation. Therefore, this paper studies the load balancing of the TAL based on deep reinforcement learning for the first time.
The remaining sections are introduced as follows. Section 2 provides a literature review for the load balancing-oriented TALBP and deep reinforcement learning. Section 3 shows the mathematical model of load balancing-oriented TALBP. Section 4 describes the deep reinforcement learning algorithm combining distributed proximal policy optimization (DPPO) and convolutional neural network (CNN). The experimental verification is carried out in Section 5 and conclusions and future work avenues are presented in Section 6.

2. Literature Review

2.1. Two-Sided Assembly Line Balance Oriented to Load Balancing

According to different production stages and research objectives, TALBPs are mainly divided into two categories [8]. The first type of balancing problem occurs in the design and planning stage; its aim is to explore how to use the minimum number of workstations and/or stations to achieve production under the premise of a given cycle time. For this type of problem and its variants, many scholars have designed exact algorithms [9,10], heuristic algorithms [11,12], and meta-heuristic algorithms [13,14]. After the assembly line is officially put to use, some potential problems (task assignment, resource allocation, etc.) that could not be predicted in the design stage of the assembly line become increasingly obvious. Moreover, assembly workers' skills, product configuration, and customer demands may also change; thus, the task content and/or operation times change and the original balancing plan needs to be further optimized. The second type of balancing problem occurs in this stage and concerns how to obtain a better cycle time by optimizing task assignment under the premise of definite workstations [15,16,17]. However, the above studies did not consider load balancing, which is the initial goal of balance research [5] and the source of maintaining the stable and efficient operation of the assembly line [18,19].
Considering the importance of load balancing, some scholars have specifically studied this kind of problem and defined it as the third type of balancing problem—that is, how to optimize task assignment and balance the load of all stations as much as possible under the premise of a fixed cycle time and line length. Ozcan and Toklu [20] used the assembly line smoothness index to measure load balancing and combined it with assembly line efficiency as optimization objectives to build a mathematical model of TAL load balancing. The minimum deviation method (MDM) was used to combine the multi-objective problem into a single-objective problem, and a tabu search algorithm was proposed. Lan and Yee [21] designed a nonlinear integer programming model to maximize the smoothness of the line for the third-type TALBP and solved it using Lingo. Purnomo and Wee [22] designed a harmony search (HS) algorithm combining non-inferior sorting rules for the TALBP considering zone constraints; the goal was to optimize production efficiency and balance the load of stations. Li et al. [23] used an improved teaching–learning-based optimization (ITLBO) algorithm to balance a multi-constraint, multi-objective TAL; the goals were the optimization of assembly line efficiency, assembly line smoothness, and total associated unit production cost. Li et al. [24] proposed an iterated local search (ILS) algorithm to realize the load balancing of TALs. A heuristic algorithm was applied to obtain a better initial population, and local search and disturbance factors were used iteratively until a local optimal solution was found. This algorithm also applies a priority decoding method based on the combination of orientation selection rules and task selection rules to reduce the load difference between the left and right stations of a workstation as far as possible, so as to ensure that the sequence-related waiting time is reduced to a certain extent. Wu et al. [25] used a hybrid algorithm based on variable neighborhood search and gravitational search for the TALBP considering zone constraints to achieve load balancing among workstations and reduce the sequence-dependent completion time of tasks as much as possible. Buyukozkan et al. [26] studied the lexicographic bottleneck mixed-model TALBP, balancing the load of all workstations by gradually minimizing the weighted load of the workstations with the maximum load and finally achieving the load balancing of all workstations on the assembly line. Yadav and Agrawal [27] measured load balancing by the idle time of the workstations and established the related mathematical model. A branch and bound algorithm was programmed in the Lingo solver for this problem, and load balancing schemes of various benchmark problems were explored. After that, the workload maximization of stations was used as the measure of load balance [28], an exact algorithm was designed, and its effectiveness was verified through an engineering project. Abdullah Make and Ab Rashid [29] carried out a study on the load balancing of TALs in automobile assembly workshops and designed a particle swarm optimization algorithm. To obtain a better task assignment scheme, in addition to cycle time and stations, the influences and constraints of workers' skill levels, tools, and equipment on assembly task assignment were also considered.
To sum up, current research on the load balancing of TALs still mainly focuses on the design and solution of traditional optimization algorithms (i.e., exact, heuristic, and meta-heuristic algorithms), and, as far as we know, there is no research on the load balancing-oriented TALBP based on deep reinforcement learning, which is more suitable for complex and variable manufacturing processes. In addition, for multi-objective optimization, most studies convert the multiple objectives into a single objective by mathematical programming methods involving weighting coefficients or unified dimensions; that is, the search direction is delimited artificially, which reduces the search space and cannot guarantee the optimality of the solution.

2.2. Deep Reinforcement Learning

The reinforcement learning algorithm (RLA) can be used to obtain optimal strategies for sequential decision problems [30]. As shown in Figure 2, the agent of the RLA interacts with the environment continuously: it observes the environment state $s_t$, makes a decision (action) $a_t$, and obtains feedback (reward) $r_t$ from the environment; it then adjusts its strategy according to the feedback information so that subsequent decision actions better meet expectations.
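To make the loop of Figure 2 concrete, the following minimal Python sketch pairs a toy environment with a random policy. The environment, its reward, and all names are purely illustrative and are not part of the paper's implementation.

```python
import random

class ToyEnv:
    """A stand-in environment used only to illustrate the observe/act/reward loop of Figure 2."""
    def __init__(self, horizon=5):
        self.horizon, self.t = horizon, 0

    def reset(self):
        self.t = 0
        return self.t                                  # initial state s_t

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0           # feedback r_t from the environment
        return self.t, reward, self.t >= self.horizon  # next state, reward, episode done

env, total_reward = ToyEnv(), 0.0
state, done = env.reset(), False
while not done:
    action = random.choice([0, 1])                     # the agent's decision a_t (random policy here)
    state, reward, done = env.step(action)
    total_reward += reward                             # a learning agent would use r_t to adjust its strategy
```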
The deep learning algorithm (DLA) is a method by which to gradually acquire the whole picture of things through multi-level learning and abstracting from simple to complex [31]. The deep reinforcement learning algorithm (DRLA) generated by the combination of reinforcement learning (RL) and deep learning (DL) shows strong data processing and environment interaction abilities in terms of self-adaptation and self-learning, has rapid decision-making abilities combined with offline training and online decision-making, and highly versatile generalization ability, which can better solve complex problems [32].
In 2015, DeepMind [33] proposed the first DRLA—the Deep Q-learning Network (DQN). It combined deep neural networks with reinforcement learning and applied them to games, Go (i.e., weiqi), and other fields, achieving impressive results. Since then, research on various deep reinforcement learning methods has been carried out rapidly, and the applications have expanded from games to other fields [34,35,36]. At present, for combinatorial optimization, deep reinforcement learning algorithms have already been applied to the classic traveling salesman problem [37], the path optimization problem [38], the packing problem [39], the maximum cut problem [40], and other operational research problems. At the same time, some scholars, aiming to solve practical production problems, have also begun to introduce deep reinforcement learning algorithms into the research of manufacturing problems [41]. In the area of AL balancing, Li et al. [6] focused on balancing the AL in the digital domain and designed a DRLA supported by the deep deterministic policy gradient (DDPG) to enhance the operation and simulation effect of the assembly line digital twin model. Lv et al. [7] combined the sequencing problem with the assembly line balancing problem and proposed a new version of the DRLA on the basis of DDPG, in which an iterative interaction mechanism between task assembly time and station load was designed to achieve production task sequencing and worker allocation layer by layer; the objective was to minimize work overload.
Although there have been some preliminary achievements in solving combinatorial optimization problems by DRLA, to the best of our knowledge, there are no studies of the deep reinforcement learning algorithm for the TALBP so far.

3. Mathematical Model for Load Balancing-Oriented TALBP

3.1. Assumptions

(1) The assembly line task assignment scheme is available, and the specific task assignment is known;
(2) The product variety is unique, and the processes are determined;
(3) Cycle time is deterministic;
(4) The execution times of tasks and the precedence relationships between tasks are known;
(5) Buffers, parallel stations and tasks are not considered.

3.2. Parameters

$n_s$: Number of workstations.
$n_m$: Number of stations.
$I$: Task set, $I = \{1, 2, \ldots, i, \ldots, m\}$.
$J$: Workstation set, $J = \{1, 2, \ldots, j, \ldots, n\}$.
$(j, k)$: The specific station of workstation $j$, i.e., the operating orientation of the station; $k = 1$ represents the left station, and $k = 2$ represents the right station.
$A_L$: Task set that can only be assembled in the left station, $A_L \subseteq I$.
$A_R$: Task set that can only be assembled in the right station, $A_R \subseteq I$.
$A_E$: Task set that can be assembled in both left and right stations, $A_E \subseteq I$.
$P(i)$: Task set that contains all immediate precedence tasks of task $i$.
$P_a(i)$: Task set that contains all precedence tasks of task $i$.
$S(i)$: Task set that contains all immediate successor tasks of task $i$.
$S_a(i)$: Task set that contains all successor tasks of task $i$.
$P_0$: Set of tasks without precedence tasks.
$C(i)$: Task set opposite to the operating orientation of task $i$: $C(i) = A_L$ if $i \in A_R$; $C(i) = A_R$ if $i \in A_L$; $C(i) = \varnothing$ if $i \in A_E$.
$K(i)$: Set of operating orientation indicators of task $i$: $K(i) = \{1\}$ if $i \in A_L$; $K(i) = \{2\}$ if $i \in A_R$; $K(i) = \{1, 2\}$ if $i \in A_E$.
$t_i^s$: Start time of task $i$.
$t_i^f$: Finish time of task $i$.
$t_i$: Operation time of task $i$, $t_i = t_i^f - t_i^s$.
$ct$: Cycle time.
$\mu$: A sufficiently large constant.
$x_{ijk} \in \{0, 1\}$: Equal to 1 if task $i$ is assigned to station $(j, k)$, and 0 otherwise.
$z_{ip} \in \{0, 1\}$: Equal to 1 if task $i$ and task $p$ are assigned to the same workstation and task $i$ is assigned earlier than task $p$, and 0 otherwise.

3.3. Mathematical Model

Equations (1)–(3) are the objective functions, and their calculation formulae are shown in Equations (4)–(6). $ST_{jk} = \sum_{i \in S_{jk}} t_i$ is the total working time of the tasks assigned to station $(j, k)$, and $ST_{\max} = \max ST_{jk}$ is the maximum among them. $Ct(j, k)$ represents the completion time of station $(j, k)$, and $Ct_{\max} = \max Ct(j, k)$ is the maximum thereof. Equation (7) indicates that any task can only be assigned to one station. Equations (8) and (9) show the cycle time constraints—that is, the completion time of each station must not exceed the cycle time. Equation (10) represents the precedence constraint. Equations (11)–(13) represent the sequence-dependent constraints. Equations (14)–(17) give the definition of each variable.
$\max\ LE \quad (1)$
$\min\ SI \quad (2)$
$\min\ CSI \quad (3)$
$LE = \frac{\sum_{i \in I} t_i}{ct \times n_m} \times 100\% \quad (4)$
$SI = \sqrt{\frac{\sum_{j \in J} \sum_{k=1}^{2} \left( ST_{\max} - ST_{jk} \right)^2}{n_m}} \quad (5)$
$CSI = \sqrt{\frac{\sum_{j \in J} \sum_{k=1}^{2} \left( Ct_{\max} - Ct(j,k) \right)^2}{n_m}} \quad (6)$
$\sum_{j \in J} \sum_{k \in K(i)} x_{ijk} = 1, \quad \forall i \in I \quad (7)$
$t_i^f \le ct, \quad \forall i \in I \quad (8)$
$t_i^f \ge t_i, \quad \forall i \in I \quad (9)$
$\sum_{g \in J} \sum_{k \in K(h)} g \cdot x_{hgk} \le \sum_{j \in J} \sum_{k \in K(i)} j \cdot x_{ijk}, \quad \forall i \in I - P_0,\ h \in P(i) \quad (10)$
$t_i^f - t_h^f + \mu \left( 1 - \sum_{k \in K(h)} x_{hjk} \right) + \mu \left( 1 - \sum_{k \in K(i)} x_{ijk} \right) \ge t_i, \quad \forall i \in I - P_0,\ h \in P(i),\ j \in J \quad (11)$
$t_p^f - t_i^f + \mu \left( 1 - x_{pjk} \right) + \mu \left( 1 - x_{ijk} \right) + \mu \left( 1 - z_{ip} \right) \ge t_p, \quad \forall i \in I,\ p \in \{ r \mid r \in I - (P_a(i) \cup S_a(i) \cup C(i)) \},\ i \ne r,\ j \in J,\ k \in K(i) \cap K(p) \quad (12)$
$t_i^f - t_p^f + \mu \left( 1 - x_{pjk} \right) + \mu \left( 1 - x_{ijk} \right) + \mu z_{ip} \ge t_i, \quad \forall i \in I,\ p \in \{ r \mid r \in I - (P_a(i) \cup S_a(i) \cup C(i)) \},\ i \ne r,\ j \in J,\ k \in K(i) \cap K(p) \quad (13)$
$x_{ij1} \in \{0, 1\}, \quad \forall i \in A_L,\ j \in J \quad (14)$
$x_{ij2} \in \{0, 1\}, \quad \forall i \in A_R,\ j \in J \quad (15)$
$x_{ijk} \in \{0, 1\}, \quad \forall i \in A_E,\ j \in J,\ k \in \{1, 2\} \quad (16)$
$z_{ip} \in \{0, 1\}, \quad \forall i \in I,\ p \in \{ r \mid r \in I - (P_a(i) \cup S_a(i) \cup C(i)) \},\ i \ne r \quad (17)$
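As an illustration of Equations (4)–(6), the sketch below computes LE, SI, and CSI from per-station workloads and completion times; the function and variable names, and the numbers in the usage example, are purely illustrative and are not taken from the paper.

```python
import math

def objectives(station_times, completion_times, cycle_time):
    """Compute LE (%), SI, and CSI for one task assignment scheme.

    station_times[k]    -- total operation time ST_jk of the tasks assigned to station (j, k)
    completion_times[k] -- completion time Ct(j, k) of station (j, k)
    """
    n_m = len(station_times)                            # number of stations
    le = sum(station_times) / (cycle_time * n_m) * 100  # Eq. (4): line efficiency
    st_max, ct_max = max(station_times), max(completion_times)
    si = math.sqrt(sum((st_max - st) ** 2 for st in station_times) / n_m)      # Eq. (5)
    csi = math.sqrt(sum((ct_max - ct) ** 2 for ct in completion_times) / n_m)  # Eq. (6)
    return le, si, csi

# Illustrative example: four stations with cycle time 18
print(objectives([16, 15, 17, 14], [17, 16, 18, 15], cycle_time=18))
```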

4. Deep Reinforcement Learning Algorithm Based on DPPO and CNN (DPPO–CNN)

4.1. DPPO–CNN Agent

The architecture of DPPO–CNN with distributed multiple processes is shown in Figure 3; the main process and m subprocesses run simultaneously. The main process is responsible for network training and updating (gradient calculation and update), while each subprocess is only responsible for data acquisition without gradient calculation. The main process includes the experience pool and the main Actor–Critic network, and each subprocess includes a child Actor–Critic network. The Actor network is used to make task assignment decisions $a_t$ based on the environmental state $s_t$ of the TAL, while the Critic network is used to evaluate the quality of the task assignment decisions $a_t$. The Actor–Critic network structures are shown in Figure 4 and Figure 5.
(1) Initialize the main Actor–Critic network of the main process and send its parameters to each subprocess through a pipe.
(2) The child Actor–Critic network of each subprocess loads the main Actor–Critic network parameters, interacts with its environment, and transmits the interaction trajectory (state matrix s, task assignment decisions a, reward function r) to the main process through a pipe.
(3) The main process stores the interaction experience of all subprocesses in the experience pool. When the amount of experience stored in the experience pool exceeds the capacity of the experience pool, it is packaged as a training set to train the main Actor–Critic network.
(4) Transfer the updated main Actor–Critic network to each subprocess again and return to step (2).
The Actor network is the policy network, which approximates the optimal task assignment strategy $p_\theta(a_t|s_t)$ by using a neural network with parameters $\theta$. The network structure is shown in Figure 4, including a two-layer convolutional network and a three-layer fully connected network. An $M \times N$ matrix is the input of the network at time $t$, corresponding to the environmental state $s_t$, where $M$ is the number of feature vectors and $N$ is the total number of assembly tasks. After the matrix is input, two layers of convolution operations are carried out on it with a kernel size of (1, 3). The feature vector obtained after convolution is flattened and input into the three fully connected layers. The number of nodes in the first two layers is 256, and the number of nodes in the last layer is $N$. The output is then normalized by the SoftMax function to obtain $p_\theta(a_t|s_t)$, the probability of the task to be assigned $a_t$ when the Actor policy network is in state $s_t$.
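The description above can be turned into a small PyTorch sketch. The paper fixes the (1, 3) kernel, the 256-node fully connected layers, the N-node output, the Leaky ReLU activation (Table 3), and the SoftMax; the convolution channel counts, the padding, and the way the mask is applied are our own assumptions for illustration.

```python
import torch
import torch.nn as nn

class ActorNetwork(nn.Module):
    """Policy network p_theta(a_t | s_t): two (1, 3) conv layers, three fully connected layers, SoftMax."""
    def __init__(self, num_features=18, num_tasks=16, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=(1, 3), padding=(0, 1)),   # channel counts are assumptions
            nn.LeakyReLU(),
            nn.Conv2d(8, 16, kernel_size=(1, 3), padding=(0, 1)),
            nn.LeakyReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * num_features * num_tasks, hidden),
            nn.LeakyReLU(),
            nn.Linear(hidden, hidden),
            nn.LeakyReLU(),
            nn.Linear(hidden, num_tasks),                          # one logit per task
        )

    def forward(self, state, mask):
        # state: (batch, 1, num_features, num_tasks); mask: (batch, num_tasks), 1 for feasible tasks
        logits = self.fc(self.conv(state))
        logits = logits.masked_fill(mask == 0, float('-inf'))      # mask layer of Section 4.3
        return torch.softmax(logits, dim=-1)                       # p_theta(a_t | s_t)
```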
The Critic network is the evaluation network, which approximates the optimal strategy evaluation value $v_\Psi(s_t|a_t)$ by using a neural network with parameters $\Psi$. The network structure is shown in Figure 5. In this paper, the first several layers of the Critic network are the same as those of the Actor network, but the last layer is a linear regression layer; that is, the SoftMax layer in the Actor network is replaced with $v_\Psi(s_t|a_t) = f(h(t); \Psi) = \omega h(t) + b$, where $v_\Psi(s_t|a_t)$ is the output of the Critic network at time $t$ and $h(t)$ is the output of the previous fully connected layer. $\Psi$ denotes the parameters of the internal unit nodes of the network, including the weight $\omega$ and the bias item $b$.
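Continuing the sketch above, the Critic can reuse the same trunk and replace the SoftMax head with a scalar linear layer $v_\Psi = \omega h(t) + b$; sharing the architecture in exactly this way is our own reading of "the first several layers are the same" and not necessarily the authors' implementation.

```python
class CriticNetwork(nn.Module):
    """Evaluation network v_psi: same trunk as the Actor, with a linear regression output layer."""
    def __init__(self, num_features=18, num_tasks=16, hidden=256):
        super().__init__()
        actor = ActorNetwork(num_features, num_tasks, hidden)
        self.conv = actor.conv
        self.trunk = actor.fc[:-1]               # everything up to (but excluding) the task-logit layer
        self.value_head = nn.Linear(hidden, 1)   # v = w * h(t) + b

    def forward(self, state):
        h = self.trunk(self.conv(state))         # h(t): output of the last fully connected hidden layer
        return self.value_head(h)                # scalar value estimate per state
```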
In each subprocess, the learning process is shown in Figure 6. The agent observes the environmental state $s_t$ of task assignment, makes the selection decision, and outputs the task to be assigned $a_t$; the environment updates the state and gives back the reward $r_t$ after the assignment of task $a_t$. The agent can solve the problem through continuous interaction with the environment. After each problem is solved, the trajectory of the interaction between the agent and the environment (including states, decision tasks, and rewards) is stored in the experience pool. These data are preprocessed to obtain training data in the experience pool. The interaction between the agent and the environment is suspended when the amount of data stored in the experience pool reaches its capacity limit. Parts of the training data are selected randomly by the agent from the experience pool, the network (task assignment strategy) is updated, and the experience is learned. Higher reward values are obtained through repeated trial and error, and eventually, the maximization of the cumulative reward and the optimization of the task assignment strategy can both be realized.

4.2. Task Assignment State of TAL

As shown in Table 1, 18 task assignment state features are considered for the load balancing-oriented TALBP. The values of features 1–5 represent the sequence numbers of the related tasks; features 6–18 represent the overall situation of the task assignment scheme of the TAL; in particular, the values of state features 11–16 are rounded.
The original state feature information is abstracted and preprocessed by one-hot encoding to give it a two-dimensional arrangement of feature information, which is suitable for subsequent processing by the convolutional neural network. If the number of tasks in the TAL is N, a matrix with dimension 18 × N, used as the environment state $s_t$, can be obtained after preprocessing.
Figure 7 shows the two initial state matrices of P16 (Figure 8), and the state feature values of the initial states are shown in Table 2, corresponding to the two task assignment schemes of Figure 9.
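A minimal sketch of this preprocessing step is given below. The exact encoding used by the authors is not spelled out, so the clipping of out-of-range values and the interpretation of each feature value as a column index are assumptions; the feature values are those of P16 in the row order of Table 2 (task assignment scheme I).

```python
import numpy as np

def build_state_matrix(feature_values, num_tasks):
    """One-hot encode the 18 state feature values into an 18 x N state matrix s_t."""
    state = np.zeros((len(feature_values), num_tasks), dtype=np.float32)
    for row, value in enumerate(feature_values):
        col = int(np.clip(value, 0, num_tasks - 1))   # assumed handling of values outside [0, N-1]
        state[row, col] = 1.0
    return state

# Initial feature values of P16 in the row order of Table 2 (task assignment scheme I)
initial_features = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 2, 2, 10, 6, 1, 3, 3, 1]
s1 = build_state_matrix(initial_features, num_tasks=16)
print(s1.shape)   # (18, 16)
```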

4.3. Decision of Selecting Tasks

The mask layer is introduced to ensure that only the tasks that meet the constraints of precedence, operation orientation, and sequence dependence can be selected. Taking the P16 problem as an example, the network parameters $\theta$ of the agent's Actor strategy adopt orthogonal initialization. In the state $s_1$ of task assignment in the TAL environment, the output $p_\theta(a_1|s_1)$ is shown in Figure 9. At this time, if sampling is conducted according to this probability distribution, the task to be assigned is 3, which does not satisfy the precedence constraint. After processing by the mask layer—as shown in Figure 10—only task 1 and task 2 satisfy the constraints, i.e., they can be selected. At this time, if sampling is conducted according to the masked probability distribution, task 2 can be selected to be assigned.
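A small numeric sketch of the mask-layer step follows; the function name and the uniform raw distribution are illustrative only.

```python
import numpy as np

def masked_probs(raw_probs, feasible):
    """Zero out infeasible tasks and renormalize the Actor output (the mask layer of Figure 10)."""
    mask = np.zeros_like(raw_probs)
    mask[feasible] = 1.0
    masked = raw_probs * mask
    return masked / masked.sum()

# P16 initial state: only tasks 1 and 2 satisfy the precedence constraint (0-based indices 0 and 1)
raw = np.full(16, 1 / 16)                     # illustrative output of an untrained Actor
print(masked_probs(raw, feasible=[0, 1]))     # all probability mass moves onto tasks 1 and 2
```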

4.4. Task Assignment

Task assignment procedure is shown in Figure 11 and the contents are as follows.
Step 1: Initialize to generate the initial state of the TAL environment.
Step 2: Start a new workstation and set its earliest task-assignable time to 0.
Step 3: According to the precedence constraints, generate the task set $A_t$ of tasks that have no precedence tasks or whose precedence tasks have all been assigned.
Step 4: Select the station (left or right) with the smallest start time as the assigned station m. If the start times are the same, choose the left station.
Step 5: Check whether the current workstation is the last one. If yes, perform Step 6; otherwise, turn to Step 7.
Step 6: Generate the set of assignable tasks $B_t$ for station m.
The generation rules are as follows: (1) select the tasks that currently have no unassigned immediate predecessors; (2) select the tasks for which the completion times of their immediate predecessors are less than the earliest task-assignable time of the station; (3) select the tasks whose operating orientation matches the operating side of station m. A sketch of these rules is given after the procedure.
Step 7. Check whether the task output by the agent is in the set $B_t$. If it is, turn to Step 11; otherwise, go to Step 8.
Step 8. If $A_t$ is an empty set, stop the algorithm; otherwise, perform Step 9.
Step 9. If the earliest task-assignable time of the scanned station (t1) is earlier than that of its mated station (t2), set t1 = t2 and return to Step 5; otherwise, perform Step 10.
Step 10. If both sides of the station have been scanned, start a new workstation and return to Step 2; otherwise, scan the other side and return to Step 5.
Step 11. Assign the selected task by DPPO–CNN agent.
Step 12. Update the number of immediate predecessors and their related completion time of all the unassigned tasks, the earliest task assignable time of station m, and the assembly line state; then, return to Step 3.
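As referenced above, the following is a sketch of the Step 6 generation rules. The data structures and names are our own and are not the authors' implementation; E-type tasks are treated as fitting either side, consistent with the definition of $A_E$.

```python
def assignable_tasks(candidates, assigned, finish_time, preds, side_of, station_side, earliest_time):
    """Generate the set B_t for the scanned station according to the three rules of Step 6.

    candidates    -- tasks not yet assigned (the set A_t)
    assigned      -- set of already assigned tasks
    finish_time   -- completion time of each assigned task
    preds         -- immediate predecessors of each task
    side_of       -- 'L', 'R', or 'E' for each task
    station_side  -- 'L' or 'R', the operating side of the scanned station m
    earliest_time -- earliest task-assignable time of station m
    """
    feasible = []
    for task in candidates:
        if any(p not in assigned for p in preds[task]):                 # rule (1)
            continue
        if any(finish_time[p] > earliest_time for p in preds[task]):    # rule (2)
            continue
        if side_of[task] not in ('E', station_side):                    # rule (3)
            continue
        feasible.append(task)
    return feasible
```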

4.5. Reward Function

Reinforcement learning agents can achieve the optimal task assignment strategy $p_\theta(a_t|s_t)$ by maximizing the cumulative reward ($r_{sum}$) and thereby realize the optimization objectives. A sparse reward is adopted by traditional deep reinforcement learning algorithms; i.e., the rewards $r_1, r_2, \ldots, r_{n-1}$ are all equal to 0 until all tasks have been assigned, at which point the environment gives the feedback reward $r_n$ to the agent, so the cumulative reward is $r_{sum} = r_n$. This is an easy way to make the algorithm converge because, for many tasks, the agent can obtain positive samples with a certain probability by conducting random exploration in the environment, and the positive samples occupy a relatively significant proportion of the total samples at the early stage of learning.
However, with the increase in task complexity, the probability of obtaining positive samples through random exploration becomes small, and the sparse feedback signals cannot indicate the exploration direction for the agent; the algorithm will be difficult to converge, or the convergence speed will be very slow. To overcome this problem, it is necessary to add other reward or punishment items to make the reward function dense and to guide the agent to explore the environment more efficiently [39]. In this paper, for the objective LE, if both the operation times of the tasks and the number of stations are determined, the smaller the ct, the higher the LE. For the objective SI, if the workloads of the tasks assigned to each station are substantially equal to the cycle time of that station, the difference between $ST_{\max}$ and $ST_{jk}$ will be small, and so will the SI. Similarly, for the objective CSI, if the completion times of the stations are substantially equal, the value of CSI is small, which means the operation of the assembly line is stable and balanced. Therefore, the values of SI and CSI will decrease if the idle time of each station is reduced.
There are two types of idle time that can be generated on a two-sided assembly line due to its parallel structure and the sequence-dependent relationships of its tasks, as shown in Figure 12: (1) End idle time—If the completion time of the current station would exceed the cycle time when a new task is assigned to it, the new task has to be assigned to the next station. In this case, the completion time of the current station will be less than the cycle time, and the time difference between the completion time of the current station and the cycle time is the end idle time. As shown in Figure 12, on workstation 1, after task 5 is completed, task 4 needs to be assigned, but as the operation time of task 4 is 9, whether it is assigned to the left or right station of workstation 1, the completion time would exceed the cycle time of 18; therefore, task 4 has to be assigned to workstation 2, and there has to be idle time at the end of the left and right stations of workstation 1. (2) Sequence-dependent idle time—If sequence-related tasks are assigned to the left and right stations of the same workstation, then due to the absolute sequence constraint between them, some tasks may have to sit idle and wait; this kind of idle time inside stations is called sequence-dependent idle time. As shown in Figure 12, on workstation 2, task 7, assigned to the right station, cannot be started at time 0 because its immediate predecessor, task 4, is assigned to the left station of the same workstation. Task 7 has to wait until task 4 is completed; therefore, there are nine time units of sequence-dependent idle time before task 7, which is caused by task 4.
Moreover, the logic of task assignment in this paper may cause the cycle time of the last workstation to be much larger than that of all other workstations, as shown in Figure 12. Due to fewer tasks being assigned to the first two stations, more tasks are assigned to the last workstation (3), resulting in the overloading of the completion time of workstation 3. Therefore, the difference between the actual cycle time of the last station (i.e., its completion time) and the ideal cycle time should be controlled.
Under these conditions, the reward function is set as follows during the task assignment process:
(1) If a task is the last task at the current station and there is end idle time at the current station, the reward function r is calculated as Equation (18):
$r = -(\text{end idle time of the current station}); \quad (18)$
(2) If there is idle time in the station caused by the sequence-dependent relationship between tasks, i.e., sequence-dependent idle time, the reward function r is calculated as Equation (19):
$r = -(\text{sequence-dependent idle time}); \quad (19)$
(3) If all tasks have been assigned in the last station, the reward function r is calculated as Equation (20):
$r = -\left( (le - LE) + (SI - si) + (CSI - csi) + (ct_{actual} - ct_{ideal}) \right), \quad (20)$
where $ct_{actual}$ represents the actual cycle time of the last station (i.e., its completion time) and $ct_{ideal}$ represents the ideal cycle time;
(4) In all other states, the reward is equal to 0, as shown in Equation (21):
$r = 0. \quad (21)$
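The four cases above can be collected into a single function, sketched below. The argument names are ours, and the terminal term follows our reconstruction of Equation (20), in which the lowercase le, si, and csi are read as ideal (target) values of LE, SI, and CSI; this reading is an assumption, since the paper does not define them explicitly.

```python
def step_reward(end_idle, seq_idle, last_task_of_station, all_assigned,
                objectives=None, ideals=(100.0, 0.0, 0.0), ct_actual=None, ct_ideal=None):
    """Dense reward of Section 4.5 (a sketch). `objectives` is (LE, SI, CSI) of the finished
    scheme; `ideals` are assumed target values for LE, SI, and CSI."""
    if all_assigned:                                   # Eq. (20): terminal reward after the last task
        le, si, csi = objectives
        le_t, si_t, csi_t = ideals
        return -((le_t - le) + (si - si_t) + (csi - csi_t) + (ct_actual - ct_ideal))
    if last_task_of_station and end_idle > 0:
        return -end_idle                               # Eq. (18): end idle time penalty
    if seq_idle > 0:
        return -seq_idle                               # Eq. (19): sequence-dependent idle time penalty
    return 0.0                                         # Eq. (21): all other cases
```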

4.6. Overall Flow of DPPO–CNN

The overall flow of the proposed DPPO–CNN for TALBP-BL (Algorithm 1) is as follows.
Algorithm 1 Load balancing-oriented TALBP based on DPPO–CNN
1: Initialize the Actor–Critic network parameters $\theta$, $\varphi$ of the main process; the maximum iteration number; the experience pool capacity buffer size and the maximum experience pool capacity max buffer size; the batch data size; the number of network updates in each round, epochs; and the subprocess Actor–Critic network parameters $\theta$, $\varphi$.
2: for each episode do:
3: t = 1.
4: The main process empties the experience pool and sends its Actor–Critic network parameters $\theta$, $\varphi$ to each child process; each subprocess two-sided assembly line environment 1, 2, 3, …, m is initialized and generates states $S_t^1$, $S_t^2$, $S_t^3$, …, $S_t^m$; each agent 1, 2, 3, …, m loads the network parameters of the main process.
5: while buffer size < max buffer size do:
6: while not all tasks of the m respective two-sided assembly lines have been allocated do:
7: Agents 1, 2, 3, …, m observe the environment states $S_t^1$, $S_t^2$, $S_t^3$, …, $S_t^m$, respectively, and take the tasks $a_t^1$, $a_t^2$, $a_t^3$, …, $a_t^m$ to be assigned according to strategy $p_\theta(a_t|s_t)$.
8: Environments 1, 2, 3, …, m allocate tasks $a_t^1$, $a_t^2$, $a_t^3$, …, $a_t^m$ separately and feed back rewards $r_t^1$, $r_t^2$, $r_t^3$, …, $r_t^m$.
9: $t = t + 1$.
10: Environments 1, 2, 3, …, m update states $S_t^1$, $S_t^2$, $S_t^3$, …, $S_t^m$.
11: end while
12: Agent 1, 2, 3, …, m stores the interactive trajectory (solving experience) τ 1 , τ 2 , τ 3 , …, τ m , respectively, in the experience pool and packages it as training data.
13: end while
14: for epoch in {1, 2, …, epochs} do:
15: Randomly extract the training data with the size of batch size from the experience pool.
16: Calculate the loss function of the Actor policy network and the loss function of the Critic evaluation network of the main process.
17: Update the main process Actor’s policy network p θ ( a t | s t ) .
18: Update the main process Critic’s evaluation network v φ ( s t , a t ) .
19: end for
20: $\theta_{old}, \varphi_{old} \leftarrow \theta, \varphi$.
21: end for
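For concreteness, the sketch below shows one update of the main-process Actor–Critic (lines 14–19 of Algorithm 1) using the clipped PPO surrogate. The clipping coefficient, the value-loss weight, and the simple advantage estimate are standard PPO choices assumed here; the paper does not report these details, so this is not the authors' exact loss.

```python
import torch

def ppo_update(actor, critic, optimizer, states, masks, actions, old_log_probs, returns,
               clip_eps=0.2, value_coef=0.5):
    """One PPO update on a batch sampled from the experience pool (a sketch)."""
    probs = actor(states, masks)                              # p_theta(a_t | s_t)
    dist = torch.distributions.Categorical(probs)
    log_probs = dist.log_prob(actions)
    values = critic(states).squeeze(-1)                       # v_psi(s_t)

    advantages = returns - values.detach()                    # simple advantage estimate (assumption)
    ratio = torch.exp(log_probs - old_log_probs)              # new policy / old policy
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    actor_loss = -torch.min(ratio * advantages, clipped).mean()   # clipped surrogate objective
    critic_loss = (returns - values).pow(2).mean()            # value regression loss

    loss = actor_loss + value_coef * critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return actor_loss.item(), critic_loss.item()
```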

5. Experimental Verification and Discussions

5.1. Implementation of the DPPO–CNN

In this paper, 59 instances of the benchmark problems (P9 [42] (Figure A1), P12 [42] (Figure A1), P16 [6] (Figure 8), P24 [42] (Figure A1), P65 [6] (Figure A2), P148 [5,6] (Table A1), and P205 [6] (Table A2)) are utilized to test the performance of the proposed DPPO–CNN. The DPPO–CNN algorithm is programmed in Python 3.6 and runs on a personal computer with Ubuntu 20.04 LTS, a 2.90 GHz CPU, and 16 GB of memory.
The parameters of DPPO–CNN are mainly decided according to the empirical values and the actual data of the agent interaction process, as listed in Table 3.

5.2. Verification of DPPO–CNN in Terms of Model Training

To verify the performance of the distributed multiple processes of DPPO–CNN, the model training results of DPPO–CNN are compared with those of a single-process deep reinforcement learning algorithm, named PPO–CNN, which uses the same parameters, state matrix, and reward function. To clarify the effect of the distributed multiple processes, P16 is used as the representative of small-scale cases and P65 as the representative of large-scale cases for detailed explanation, as shown in Figure 13 and Figure 14.
Figure 13a shows the cumulative reward curves of DPPO–CNN and PPO–CNN for P16. We can see that: (1) the cumulative rewards of both DPPO–CNN and PPO–CNN increase gradually and converge after 60 rounds of training, which indirectly verifies the effectiveness of our algorithm; (2) the cumulative reward curve of the DPPO–CNN algorithm rises faster than that of the PPO–CNN algorithm, which shows that DPPO–CNN is better than PPO–CNN; (3) the gap is not obvious because P16 is relatively simple and both algorithms can obtain a good solution strategy quickly. However, from the related results of the large-scale case P65 (Figure 14a,b), the advantages of DPPO–CNN are very prominent; the convergence of PPO–CNN needs 1200 rounds of training, whereas that of DPPO–CNN only needs 600 rounds. We conclude that the distributed architecture designed in this paper is helpful in solving large-scale cases. A similar conclusion can also be obtained from the comparison between DPPO–CNN and PPO–CNN in terms of the loss curves for P16 and P65, respectively, as shown in Figure 13b,c and Figure 14c,d.

5.3. Verification of DPPO–CNN in Terms of Solutions

The trained Actor–Critic network model of the DPPO–CNN agent is saved and utilized to solve the load balancing-oriented TALBP for all 59 instances of different scales. Both the DPPO–CNN and PPO–CNN algorithms are run 20 times for each problem instance, and the best results are recorded and exhibited in Table 4.
As can be seen from Table 4, the solution of the DPPO–CNN algorithm is superior to that of the PPO–CNN algorithm in 49 out of all 59 test cases, which shows the clear advantage of DPPO–CNN. Furthermore, (1) both DPPO–CNN and PPO–CNN perform well in solving the load balancing-oriented TALBP, especially in small-scale cases (P9, P12, P16, and P24), which shows that the main architecture of the deep reinforcement learning algorithm combining distributed proximal policy optimization (DPPO) and the convolutional neural network (CNN) is tenable; (2) DPPO–CNN has outstanding performance in solving large-scale cases (P65, P148, and P205). Take P148 with cycle time 255 as an example: LE = 91.34%, SI = 34.17, and CSI = 26.78 are obtained by PPO–CNN, while LE = 99.53%, SI = 1.98, and CSI = 1.98 are obtained by DPPO–CNN. Obviously, both SI and CSI decrease significantly, which means the load is more balanced. In addition, through calculation, it is found that in the large-scale cases (P65, P148, and P205), the solutions obtained by DPPO–CNN show an average increase of 7.89% in LE, an average decrease of 76.71% in SI, and an average decrease of 74.92% in CSI compared to the solutions obtained by PPO–CNN. (3) Although both DPPO–CNN and PPO–CNN perform excellently in terms of calculation time, the solution speed of DPPO–CNN is significantly better than that of PPO–CNN. For example, each kind of P65 instance can be solved by DPPO–CNN within 0.04 s, while PPO–CNN needs 0.1 s; the calculation time here refers to the online solving time of the instance after training is complete. Considering the fact that model training time makes up the majority of the running time of deep reinforcement learning algorithms, a comparison of the offline training time of the proposed DPPO–CNN and PPO–CNN is shown in Table 5. We can see that, compared with PPO–CNN, DPPO–CNN also performs better in terms of offline training time.

6. Conclusions and Future Work Avenues

In this article, the load balancing-oriented TALBP has been studied, and a deep reinforcement learning algorithm combining distributed proximal policy optimization (DPPO) and a convolutional neural network (CNN) has been presented. To the best of our knowledge, this is the first attempt to solve the load balancing-oriented TALBP based on deep reinforcement learning. A mathematical model with the objectives of optimizing line efficiency (LE), the smoothness index (SI), and the completion time smoothness index (CSI) is provided. In the proposed deep reinforcement learning algorithm, distributed multiple processes have been adopted to improve the search speed and the quality of solutions; 18 task assignment state features of the load balancing-oriented TALBP environment have been considered, which ensure that the agent can obtain as much useful information from the environment as possible and perform optimal selection; and different reward functions have been designed according to the objectives and their implicit information to guide the search toward good solutions. Finally, the performance of the proposed algorithm has been verified on benchmark instances of all scales via a comparison with the single-process deep reinforcement learning algorithm in terms of model training and solution results.
Although the proposed algorithm performs well in this study, there is still room for improvement; for example, we could directly obtain the Pareto-optimal solution set using the deep reinforcement learning algorithm and/or extend the application of the proposed algorithm to other versions of the TALBP or other types of assembly lines, such as linear and parallel assembly lines, as well as to similar production planning problems, such as the two-sided disassembly line balancing problem [43], the assembly sequence problem [44], and so on.

Author Contributions

Conceptualization, G.J. and S.S.; methodology, G.J. and Y.Z.; validation, C.W. and B.L.; formal analysis, G.J. and C.W.; investigation, C.W.; resources, S.S. and B.L.; data curation, Y.Z.; writing—original draft preparation, G.J., S.S. and B.L.; writing—review and editing, Y.Z., X.H., and C.W.; supervision, X.H.; project administration, Y.Z. and X.H.; funding acquisition, Y.Z. and X.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by China National Heavy Duty Truck Group Co., Ltd. (Project No. 22H010100926), and the New Young Teachers Research Start-up Fund of Shanghai Jiao Tong University (Project No. 22X010503668).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors are grateful to the anonymous reviewers, whose valuable comments and suggestions helped a lot to improve this paper, and to the editors for their kind and sincere support.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Figure A1. Small-scale cases: (a) P9; (b) P12; (c) P24.
Figure A2. P65.
Table A1. P148.
NO.TimeSideImmediate SucessorsNO.TimeSideImmediate Sucessors
116E5, 6, 7, 875101E88, 97
230E3765E77
37E4, 5, 6, 77728E78
447E8788E79
529E1479111E 80
68E9807E81
739E148126E106
837E108210E83, 89, 143, 146
932E148321E
1029E148426E85
1117E128520E
1211E138621E
1332E 8747E
1415E15, 168823E111
1553L178913E90
1653R179019E79
178E18, 1991115E105
1824L209235E135
1924R209326L
208E21, 22, 23, 249446E
217R25, 26, 27, 289520E101
228L25, 26, 27, 289631E104
2314L25, 26, 27, 289719E
2413R25, 26, 27, 289834E101
2510R299951E100
2625R2910039E101
2711L2910130E102, 103
2825L2910226E127
2911E3110313E127
3029R 10445E
3125E3610558E119
3210L3410628E107
3314R351078E108
3441L3610843E109
3542R3610940E110
3647R3711034E
377R 38, 4511123E112
3880R39112162L113
397R4011311L114, 116, 120, 123, 128
4041R41, 48, 5511419E115
4147R 11514E125
4216L4311631E117
4332L4411732E118
4466L 11826E126
4580L4611955E
467L4712031E121
4741L48, 49, 5512132E122
4813E 12226E126
4947L 12319E124
5033E5112414E125
5134L53,6912519E
5211L5312648E
53118L 12755E
5425L1331288L129
557R 54, 72, 76, 87, 8812911L130
5628E7313027L131, 137
5712L7913118L
5852L84, 8613236E135
5914E75, 8713323L135
603E 13420R135
613E6213546E136
628E6313664E
6316E6713722L
6433R65, 71, 7213815E139
658E66,9913934E140
6618E6714022E
6710E68141151L142
6814E95,98142148R143, 146, 147, 148
6928R8214364L
7011R71144170L145
71118R 145137R147, 148
7225R 13414664R
7340E84, 86, 87, 88, 9614778L
7440E7514878R
Table A2. P205.
NO.TimeSideImmediate SucessorsNO.TimeSideImmediate Sucessors
1692E3610468R113
242E3, 4105232L106, 107
3261R5106122L108
4261L5107151E108
5157E7, 1310831L113
690E3610997E113
754R8110308R113
867R9111116L113
930R10112312R113
10106R1111334E114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 161, 162, 163, 169, 171, 174, 203, 204, 205
1132R12114128L160
1262R3611554E160
1354L14116175R160
1467L1511755E160
1530L16118306E126
16106L1711959E126
1732L1812059E126
1862L3612166E126
1956E3612266E126
2067E2212323E126
2186E22124244E125
2237E2312554E126
2341E24, 34126294R127, 128, 129
2472E26, 27, 2812784E135
2586R2812861E135
2616L3512957E135
2751R3513038R136
2866R29131944E132
2941R30, 33132511R133
3072R31, 32133625R189
3151R35134445R189
3216R3513568L136, 137, 138, 139, 140, 141, 142, 144, 145, 147, 148, 149, 150, 151, 152, 153, 158
3315R3513653L189
3415L3513749E160
3585E3613892E160
3659E37, 40, 41, 42, 62, 69, 72, 75, 83, 110, 111, 112139236E160
3723L38140116L143
3813L39141265L143
3919L45142149L143
40108E43, 5414374L160
41214E92144332E160
4280E43, 54145324E146
4337L44146104L160
4484L4514751L160
4518L46, 48, 51, 5314858R160
4612L4714967R160
4729L9215049R160
4837L49151107E160
4913L5015238L160
5070L9215327L154
51217L5215468E155
5272L92155207E156
5385L92156202E157
5443R5515783E189
5597R56, 59, 6115835R159
5637R5715958R189
5713R5816042E164, 170, 178, 179, 184
5835R9216168R167
59217R6016268R165
6072R9216368R164
6185R92164103R165
6225E63165103R166
6337E64166103R167
6437E65, 68167103R168
65103E66168103R177
66140E6716968L170
6749E80170103L172
6835E8017168L172
6951E70172103L173
7088E71173103L175
7153E7317468L175
72144E73175103L176
73337E74176103L177
74107E7617710E185, 186, 187, 188, 194, 195
75371E92178187E180
7697E77, 78, 79179134L180
77166E80, 8218089L181, 183
7892L8018158L182
7992R8018249L
80106E81183134L
8149E8418453L
8292E92185334E189
83371E9218624R189
8487E8518776R189
85162E86, 88, 9018876L189
8696E87189192E190, 191, 193
8779E9219098E
8896E89191258R192
8942E92192165E
9088R9119338R
9190R92194115E197
9297R93, 94, 95, 96, 97, 98, 9919583L196
93270R13519656R197
94452E13519729R198, 199, 201
9548R113198303R
96338E11319918R200
9734E10020029R
9865E100201154L202
9950E10020290L
100112E101, 103, 105, 109, 130, 131, 13420393L
10148E10220494E
102117E113205165E
10350E104

References

1. Zhang, Y.; Hu, X.; Wu, C. A modified multi-objective genetic algorithm for two-sided assembly line re-balancing problem of a shovel loader. Int. J. Prod. Res. 2018, 56, 3043–3063.
2. Zhang, Y.; Hu, X.; Wu, C. Improved imperialist competitive algorithms for rebalancing multi objective two-sided assembly lines with space and resource constraints. Int. J. Prod. Res. 2020, 58, 3589–3617.
3. Hu, X.; Wu, E.; Bao, J.; Jin, Y. A Branch-and-bound Algorithm to Minimize the Line Length of a Two-sided Assembly Line. Eur. J. Oper. Res. 2010, 206, 703–707.
4. Hu, X. Two-Sided Assembly Line Balancing Algorithm and Its Application, 1st ed.; Science Press: Beijing, China, 2015; pp. 3–10.
5. Bartholdi, J.J. Balancing Two-sided Assembly Lines: A Case Study. Int. J. Prod. Res. 1993, 31, 2447–2461.
6. Li, J.; Pang, D.; Zheng, Y.; Le, X. Digital Twin Enhanced Assembly Based on Deep Reinforcement Learning. In Proceedings of the 11th International Conference on Information Science and Technology (ICIST), Chengdu, China, 21–23 May 2021; pp. 432–437.
7. Lv, Y.; Tan, Y.; Zhong, R.; Zhang, P.; Wang, J.; Zhang, J. Deep reinforcement learning-based balancing and sequencing approach for mixed model assembly lines. IET Coll. Intell. Manuf. 2022, 4, 181–193.
8. Lee, T.O.; Kim, Y.; Kim, Y.K. Two-sided assembly line balancing to maximize work relatedness and slackness. Comput. Ind. Eng. 2001, 40, 273–292.
9. Hu, X.; Wu, E.; Jin, Y. A station-oriented enumerative algorithm for two-sided assembly line balancing. Eur. J. Oper. Res. 2008, 186, 435–440.
10. Wei, N.; Liu, S.; Chen, C.; Xu, Y.; Shih, Y. An Integrated Method for Solving the Two-Sided Assembly Line Balancing Problems. J. Adv. Manuf. Syst. 2023, 22, 181–203.
11. Li, Z.; Tang, Q.; Zhang, L. Two-sided assembly line balancing problem of type I: Improvements, a simple algorithm and a comprehensive study. Comput. Oper. Res. 2016, 79, 78–93.
12. Álvarez-Miranda, E.; Pereira, J.; Vargas, C.; Vilà, M. Variable-depth local search heuristic for assembly line balancing problems. Int. J. Prod. Res. 2023, 61, 3102–3120.
13. Yılmaz, D.; Emel, K.A.; Uğur, Ö.; Mehmet, S.İ. A modified particle swarm optimization algorithm to mixed-model two-sided assembly line balancing. J. Intell. Manuf. 2017, 28, 23–36.
14. Huang, D.; Mao, Z.; Fang, K.; Yuan, B. Combinatorial Benders decomposition for mixed-model two-sided assembly line balancing problem. Int. J. Prod. Res. 2022, 60, 2598–2624.
15. Kim, Y.K.; Song, W.S.; Kim, J.H. A Mathematical Model and a Genetic Algorithm for Two-sided Assembly Line Balancing. Comput. Oper. Res. 2009, 36, 853–865.
16. Lei, D.; Guo, X. Variable neighborhood search for the second type of two-sided assembly line balancing problem. Comput. Oper. Res. 2016, 72, 183–188.
17. Kang, H.; Lee, A.H.I. An evolutionary genetic algorithm for a multi-objective two-sided assembly line balancing problem: A case study of automotive manufacturing operations. Qual. Technol. Quant. Manag. 2023, 20, 66–88.
18. Azizoglu, M.; Imat, S. Workload smoothing in simple assembly line balancing. Comput. Oper. Res. 2018, 89, 51–57.
19. Walter, R.; Schulze, P. On the performance of task-oriented branch-and-bound algorithms for workload smoothing in simple assembly line balancing. Int. J. Prod. Res. 2022, 60, 4654–4667.
20. Uğur, Ö.; Bilal, T. A tabu search algorithm for two-sided assembly line balancing. Int. J. Adv. Manuf. Technol. 2009, 43, 822–829.
21. Lan, C.H.; Yee, M.S. Construct an INLP Mathematical Model to solve the Two-sided Assembly Line Balancing problem of Type-3. Adv. Mater. Res. 2012, 383–390, 4302–4305.
22. Purnomo, H.D.; Wee, H.M. Maximizing production rate and workload balancing in a two-sided assembly line using Harmony Search. Comput. Ind. Eng. 2014, 76, 222–230.
23. Li, D.; Zhang, C.; Shao, X.; Lin, W. A multi-objective TLBO algorithm for balancing two-sided assembly line with multiple constraints. J. Intell. Manuf. 2016, 7, 725–739.
24. Li, Z.; Tang, Q.; Zhang, L.; Hu, J. Balancing two-sided assembly line with a simple and effective iterated local search algorithm. ICIC Express Lett. 2015, 9, 2695–2701.
25. Wu, Y.; Tang, Q.; Zhang, L. A hybrid gravitational search algorithm for two-sided assembly line balancing problem with zoning constraints. ICIC Express Lett. Part B Appl. 2016, 7, 2633–2640.
26. Buyukozkan, K.; Kucukkoc, I.; Satoglu, S.I.; Zhang, D.Z. Lexicographic bottleneck mixed-model assembly line balancing problem-Artificial bee colony and tabu search approaches with optimised parameters. Expert Syst. Appl. 2016, 50, 151–166.
27. Yadav, A.; Agrawal, S. Minimize idle time in two sided assembly line balancing using exact search approach. In Proceedings of the 2019 International Conference on Management Science and Industrial Engineering, Phuket, Thailand, 24–26 May 2019; pp. 220–227.
28. Yadav, A.; Verma, P.; Agrawal, S. Mixed model two sided assembly line balancing problem: An exact solution approach. Int. J. Syst. Assur. Eng. 2020, 11, 335–348.
29. Abdullah Make, M.R.; Ab Rashid, M.F.F. Optimization of two-sided assembly line balancing with resource constraints using modified particle swarm optimisation. Sci. Iran. 2022, 29, 2084–2098.
30. Chen, X. Research on Scheduling and Navigation Strategies Based on Reinforcement Learning. Master's Thesis, Zhejiang University, Hangzhou, China, 2021.
31. Tercan, H.; Meisen, T. Machine learning and deep learning based predictive quality in manufacturing: A systematic review. J. Intell. Manuf. 2022, 33, 1879–1905.
32. Li, C.; Zheng, P.; Yin, Y.; Wang, B.; Wang, L. Deep reinforcement learning in smart manufacturing: A review and prospects. CIRP J. Manuf. Sci. Technol. 2023, 40, 75–101.
33. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
34. Chen, X.; Yao, L.; McAuley, J.; Zhou, G.; Wang, X. Deep reinforcement learning in recommender systems: A survey and new perspectives. Knowl.-Based Syst. 2023, 264, 110335.
35. Goby, N.; Brandt, T.; Neumann, D. Deep reinforcement learning with combinatorial actions spaces: An application to prescriptive maintenance. Comput. Ind. Eng. 2023, 179, 109165.
36. Kallestad, J.; Hasibi, R.; Hemmati, A.; Sorensen, K. A general deep reinforcement learning hyper heuristic framework for solving combinatorial optimization problems. Eur. J. Oper. Res. 2023, 309, 446–468.
37. Zhang, Z.; Liu, H.; Zhou, M.; Wang, J. Solving Dynamic Traveling Salesman Problems With Deep Reinforcement Learning. IEEE Trans. Neural Netw. Learn. 2023, 34, 2119–2132.
38. De Sirisuriya, S.C.M.S.; Fernando, T.G.I.; Ariyaratne, M.K.A. Algorithms for path optimizations: A short survey. Computing 2023, 105, 293–319.
39. Jiang, Y.; Cao, Z.; Zhang, J. Learning to Solve 3-D Bin Packing Problem via Deep Reinforcement Learning and Constraint Programming. IEEE Trans. Cybern. 2023, 53, 2864–2875.
40. Dai, H.; Khalil, E.B.; Zhang, Y.; Dilkina, B.; Song, L. Learning combinatorial optimization algorithms over graphs. Adv. Neural Inf. Process. Syst. 2017, 30, 6349–6359.
41. Zhang, C.; Wu, Y.; Ma, Y.; Song, W.; Le, Z.; Cao, Z.; Zhang, J. A review on learning to solve combinatorial optimisation problems in manufacturing. IET Collab. Intell. Manuf. 2023, 5, e12072.
42. Kim, Y.K.; Kim, Y.; Kim, Y.J. Two-sided assembly line balancing: A genetic algorithm approach. Prod. Plan. Control 2000, 11, 44–53.
43. Wang, K.; Li, X.; Gao, L. A multi-objective discrete flower pollination algorithm for stochastic two-sided partial disassembly line balancing problem. Comput. Ind. Eng. 2019, 130, 634–649.
44. Raju Bahubalendruni, M.V.A.; Biswal, B.B. An efficient stable subassembly identification method towards assembly sequence generation. Natl. Acad. Sci. Lett. 2018, 41, 375–378.
Figure 1. Two-sided assembly line.
Figure 2. Solving process of reinforcement learning.
Figure 3. The architecture of DPPO–CNN.
Figure 4. Actor network.
Figure 5. Critic network.
Figure 6. Learning process of each subprocess.
Figure 7. State matrix of P16: (a) state matrix $s_1$ generated for the third column of Table 2; (b) state matrix $s_2$ generated for the fourth column of Table 2.
Figure 8. P16.
Figure 9. Task assignment schemes of P16: (a) task assignment scheme I; (b) task assignment scheme II.
Figure 10. Mask layer process.
Figure 11. Task assignment.
Figure 12. A task assignment scheme of P16.
Figure 13. Comparison between DPPO–CNN and PPO–CNN in term of model training (P16): (a) cumulative reward curves; (b) loss curves of the Actor network; (c) loss curves of the Critic network.
Figure 13. Comparison between DPPO–CNN and PPO–CNN in term of model training (P16): (a) cumulative reward curves; (b) loss curves of the Actor network; (c) loss curves of the Critic network.
Applsci 13 07439 g013
Figure 14. Comparison between DPPO–CNN and PPO–CNN in term of model training (P65): (a) cumulative reward curve of PPO–CNN; (b) cumulative reward curve of DPPO–CNN; (c) loss curves of the Actor network; (d) loss curves of the Critic network.
Figure 14. Comparison between DPPO–CNN and PPO–CNN in term of model training (P65): (a) cumulative reward curve of PPO–CNN; (b) cumulative reward curve of DPPO–CNN; (c) loss curves of the Actor network; (d) loss curves of the Critic network.
Applsci 13 07439 g014
Table 1. Task assignment state of TAL.

No. | Features | Description
1 | PTime | A task with the longest operation time in the set of tasks without precedence tasks
2 | MFlow | A task with the largest spare time caused by its matched task in the set of tasks without precedence tasks
3 | SucNum | A task with the largest number of successor tasks in the set of tasks without precedence tasks
4 | AllSucNum | A task with the largest number of successor tasks
5 | MTNum | A task with the smallest number of matched tasks in the set of tasks without precedence tasks
6 | Side | The operating orientation of the task with the smallest sequence number in the set of tasks without precedence tasks (0: E-type task; 1: non-E-type task)
7 | FAN | The number of tasks that can be selected at present
8 | NPN | The number of tasks without predecessors at present
9 | NER | The number of remaining E-type tasks
10 | NR | The number of remaining non-E-type tasks
11 | LRDiffoverAE | The load difference of remaining non-E-type tasks / the average time of remaining E-type tasks
12 | RToverAveptO | Remaining spare time of the current station / the average time of tasks of the current station
13 | RToverAveptT | Remaining spare time of the matched station / the average time of tasks of the matched station
14 | OPToverRemain | The operation time / remaining time of tasks
15 | ARPW | Improved position weight
16 | RRPW | Reverse position weight
17 | PreNum | A task with the largest number of immediate predecessor tasks in the set of tasks without precedence tasks
18 | AllPreNum | A task with the smallest number of precedence tasks
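Most of the features in Table 1 can be computed directly from the precedence data of the instance. The following minimal Python sketch illustrates a few of them (PTime, SucNum, and the FAN/NPN-style counts) under an assumed, simplified task record with operation time, operating side, and predecessor/successor sets; the data structure, function names, and the small example instance are hypothetical and are not taken from the paper.

```python
# Illustrative sketch (not the paper's code): computing a few Table 1 features
# from a hypothetical task representation of a two-sided assembly line instance.
from dataclasses import dataclass, field

@dataclass
class Task:
    tid: int
    time: float                         # operation time
    side: str                           # 'L', 'R', or 'E' (either side)
    successors: set = field(default_factory=set)
    predecessors: set = field(default_factory=set)

def available_tasks(tasks, assigned):
    """Tasks whose predecessors are all assigned ('set of tasks without precedence tasks')."""
    return [t for t in tasks.values()
            if t.tid not in assigned and t.predecessors <= assigned]

def ptime_feature(tasks, assigned):
    """PTime: the available task with the longest operation time."""
    cand = available_tasks(tasks, assigned)
    return max(cand, key=lambda t: t.time).tid if cand else -1

def sucnum_feature(tasks, assigned):
    """SucNum: the available task with the largest number of successor tasks."""
    cand = available_tasks(tasks, assigned)
    return max(cand, key=lambda t: len(t.successors)).tid if cand else -1

def fan_npn_features(tasks, assigned):
    """FAN / NPN style counts: how many tasks are currently selectable / predecessor-free.
    A full model would additionally check station capacity and side feasibility for FAN."""
    cand = available_tasks(tasks, assigned)
    return len(cand), len(cand)

# Tiny usage example on a made-up three-task instance.
tasks = {
    1: Task(1, 4.0, 'E', successors={3}),
    2: Task(2, 2.0, 'L', successors={3}),
    3: Task(3, 5.0, 'R', predecessors={1, 2}),
}
print(ptime_feature(tasks, assigned=set()))   # -> 1 (longest available task)
```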
Table 2. Each state feature value of the initial state for P16.

No. | Features | State Feature Value (Task Assignment Scheme I) | State Feature Value (Task Assignment Scheme II)
1 | PTime | 1 | 1
2 | MFlow | 1 | 1
3 | SucNum | 1 | 1
4 | AllSucNum | 1 | 1
5 | MTNum | 1 | 1
6 | ARPW | 1 | 1
7 | RRPW | 1 | 1
8 | PreNum | 1 | 1
9 | AllPreNum | 1 | 1
10 | Side | 0 | 0
11 | FAN | 2 | 1
12 | NPN | 2 | 2
13 | NER | 10 | 9
14 | NR | 6 | 6
15 | LRDiffoverAE | 1 | 1
16 | RToverAveptO | 3 | 3
17 | RToverAveptT | 3 | 2
18 | OPToverRemain | 1 | 1
Table 3. Parameters of DPPO–CNN.

Parameter | Value
Number of hidden-layer neurons in the Actor and Critic networks | 256
Activation function of hidden layers in the Actor and Critic networks | Leaky ReLU
Activation function of the output layer of the Actor network | Softmax
Activation function of the output layer of the Critic network | Leaky ReLU
Convolution kernel of the convolution layers in the Actor and Critic networks | (1, 3)
Network parameter initialization | Orthogonal initialization
Optimizer | Adam
Learning rate of the Actor network | 1 × 10−4
Learning rate of the Critic network | 2 × 10−4
Sample size | 256
Maximum capacity of the experience pool | 4096
Learning epochs per round | 8
Return discount factor | 0.99
Clipping coefficient | 0.2
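To show how the settings in Table 3 fit together, the sketch below instantiates Actor and Critic networks in PyTorch with the listed hyperparameters (hidden width 256, Leaky ReLU activations, a (1, 3) convolution kernel, softmax Actor output, orthogonal initialization, Adam with learning rates 1 × 10−4 and 2 × 10−4, clipping coefficient 0.2, discount factor 0.99). The channel count, layer depth, input shape (batch, 1, 1, 18), and the optional mask handling (loosely mirroring the mask layer of Figure 10) are illustrative assumptions, not the paper's exact architecture.

```python
# A minimal sketch of the Actor/Critic networks under the Table 3 settings (assumed layout).
import torch
import torch.nn as nn

def orthogonal_init(module):
    # Orthogonal initialization for all linear and convolutional layers.
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.orthogonal_(module.weight)
        nn.init.zeros_(module.bias)

class Actor(nn.Module):
    def __init__(self, in_channels=1, n_features=18, n_actions=16, hidden=256):
        super().__init__()
        # Input assumed to be a (batch, 1, 1, n_features) state matrix.
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=(1, 3), padding=(0, 1)),
            nn.LeakyReLU(),
            nn.Flatten(),
            nn.Linear(16 * n_features, hidden),
            nn.LeakyReLU(),
        )
        self.head = nn.Linear(hidden, n_actions)  # softmax applied in forward

    def forward(self, x, mask=None):
        logits = self.head(self.body(x))
        if mask is not None:                       # forbid infeasible tasks (cf. Figure 10)
            logits = logits.masked_fill(~mask, float('-inf'))
        return torch.softmax(logits, dim=-1)

class Critic(nn.Module):
    def __init__(self, in_channels=1, n_features=18, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=(1, 3), padding=(0, 1)),
            nn.LeakyReLU(),
            nn.Flatten(),
            nn.Linear(16 * n_features, hidden),
            nn.LeakyReLU(),
            nn.Linear(hidden, 1),
            nn.LeakyReLU(),                        # Table 3: Leaky ReLU on the Critic output
        )

    def forward(self, x):
        return self.body(x)

actor, critic = Actor(), Critic()
actor.apply(orthogonal_init); critic.apply(orthogonal_init)
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_critic = torch.optim.Adam(critic.parameters(), lr=2e-4)
CLIP_EPS, GAMMA = 0.2, 0.99                        # PPO clipping coefficient and discount factor
```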
Table 4. Result comparison.

Instance | ct | ns | PPO–CNN LE | PPO–CNN SI | PPO–CNN CSI | PPO–CNN Time (s) | DPPO–CNN LE | DPPO–CNN SI | DPPO–CNN CSI | DPPO–CNN Time (s)
P9 | 3 | 3 | 94.44% | 0.41 | 0.41 | 0.004 | 94.44% | 0.41 | 0.41 | 0.001
P9 | 4 | 3 | 85.00% | 1.87 | 1.87 | 0.004 | 94.44% | 0.41 | 0.41 | 0.001
P9 | 5 | 2 | 85.00% | 1.12 | 1.12 | 0.004 | 85.00% | 1.12 | 1.12 | 0.001
P9 | 6 | 2 | 70.83% | 2.50 | 1.58 | 0.005 | 85.00% | 1.12 | 1.12 | 0.001
P9 | 7 | 2 | 60.71% | 3.57 | 3.57 | 0.005 | 85.00% | 1.12 | 1.12 | 0.001
P12 | 4 | 4 | 78.13% | 1.37 | 1.37 | 0.007 | 78.13% | 1.37 | 1.37 | 0.002
P12 | 5 | 3 | 83.33% | 1.23 | 0.91 | 0.007 | 83.33% | 1.23 | 0.91 | 0.002
P12 | 6 | 3 | 83.33% | 2.97 | 2.97 | 0.008 | 83.33% | 1.23 | 0.91 | 0.002
P12 | 7 | 2 | 89.29% | 1.12 | 1.12 | 0.008 | 89.29% | 1.12 | 1.12 | 0.003
P12 | 8 | 2 | 78.13% | 1.94 | 1.23 | 0.008 | 89.29% | 1.12 | 1.12 | 0.003
P12 | 9 | 2 | 69.44% | 3.35 | 1.87 | 0.009 | 89.29% | 1.12 | 1.12 | 0.003
P16 | 15 | 4 | 78.095% | 7.18 | 7.09 | 0.009 | 78.095% | 7.18 | 7.09 | 0.003
P16 | 16 | 3 | 85.42% | 2.94 | 2.52 | 0.009 | 85.42% | 2.94 | 2.52 | 0.004
P16 | 18 | 3 | 75.93% | 4.97 | 2.55 | 0.010 | 85.42% | 2.94 | 2.52 | 0.004
P16 | 19 | 3 | 71.73% | 5.86 | 2.38 | 0.011 | 85.42% | 2.94 | 2.52 | 0.006
P16 | 20 | 3 | 68.33% | 6.81 | 2.83 | 0.011 | 85.42% | 2.94 | 2.52 | 0.006
P16 | 21 | 3 | 65.08% | 8.56 | 6.18 | 0.012 | 85.42% | 2.94 | 2.52 | 0.007
P16 | 22 | 2 | 93.18% | 1.87 | 1.58 | 0.013 | 93.18% | 1.87 | 1.58 | 0.007
P24 | 18 | 4 | 97.22% | 0.71 | 0.61 | 0.014 | 97.22% | 0.71 | 0.61 | 0.009
P24 | 20 | 4 | 87.5% | 2.92 | 2.03 | 0.015 | 97.22% | 0.71 | 0.61 | 0.009
P24 | 24 | 3 | 97.22% | 1.00 | 0.71 | 0.017 | 97.22% | 1.00 | 0.71 | 0.010
P24 | 25 | 3 | 95.33% | 2.52 | 2.52 | 0.017 | 97.22% | 1.00 | 0.71 | 0.010
P24 | 30 | 3 | 77.78% | 7.92 | 5.21 | 0.020 | 97.22% | 1.00 | 0.71 | 0.013
P24 | 35 | 2 | 100% | 0.00 | 0.00 | 0.020 | 100% | 0.00 | 0.00 | 0.014
P24 | 40 | 2 | 87.50% | 6.82 | 2.55 | 0.020 | 100% | 0.00 | 0.00 | 0.014
P65 | 326 | 8 | 97.76% | 9.99 | 9.94 | 0.076 | 98.36% | 8.79 | 8.79 | 0.024
P65 | 381 | 7 | 95.59% | 27.42 | 25.03 | 0.079 | 98.98% | 6.10 | 6.10 | 0.027
P65 | 435 | 6 | 97.68% | 12.37 | 11.58 | 0.083 | 99.28% | 4.59 | 4.59 | 0.030
P65 | 490 | 6 | 86.72% | 102.11 | 40.07 | 0.087 | 99.28% | 4.59 | 4.59 | 0.032
P65 | 512 | 5 | 99.59% | 2.70 | 2.70 | 0.092 | 99.59% | 2.70 | 2.70 | 0.037
P65 | 544 | 5 | 93.73% | 43.76 | 36.76 | 0.100 | 99.59% | 2.70 | 2.70 | 0.038
P148 | 204 | 13 | 96.61% | 9.52 | 9.52 | 0.175 | 99.53% | 1.66 | 1.66 | 0.094
P148 | 228 | 12 | 93.64% | 22.61 | 17.38 | 0.181 | 99.30% | 2.43 | 2.43 | 0.097
P148 | 255 | 11 | 91.34% | 34.17 | 26.78 | 0.193 | 99.53% | 1.98 | 1.98 | 0.101
P148 | 306 | 9 | 93.03% | 32.58 | 20.76 | 0.211 | 99.35% | 2.08 | 2.08 | 0.113
P148 | 357 | 8 | 89.71% | 57.34 | 52.67 | 0.242 | 99.46% | 2.18 | 2.18 | 0.118
P148 | 378 | 7 | 96.83% | 20.44 | 5.09 | 0.256 | 99.73% | 2.00 | 2.00 | 0.123
P148 | 408 | 7 | 89.71% | 72.79 | 31.40 | 0.263 | 99.73% | 2.00 | 2.00 | 0.127
P148 | 454 | 6 | 94.05% | 40.96 | 13.82 | 0.271 | 99.77% | 1.58 | 1.58 | 0.136
P148 | 459 | 6 | 93.03% | 41.95 | 33.69 | 0.281 | 99.77% | 1.58 | 1.58 | 0.144
P148 | 510 | 6 | 83.73% | 133.08 | 128.98 | 0.297 | 99.77% | 1.58 | 1.58 | 0.149
P205 | 1133 | 11 | 93.66% | 103.255 | 76.98 | 0.761 | 98.44% | 24.44 | 16.49 | 0.391
P205 | 1275 | 10 | 91.55% | 157.90 | 108.27 | 0.783 | 98.34% | 26.75 | 19.20 | 0.420
P205 | 1322 | 9 | 98.11% | 35.00 | 29.71 | 0.789 | 98.70% | 29.50 | 12.09 | 0.438
P205 | 1455 | 9 | 89.14% | 202.97 | 119.14 | 0.821 | 98.70% | 29.50 | 12.09 | 0.445
P205 | 1510 | 8 | 96.63% | 63.64 | 34.84 | 0.843 | 98.85% | 25.94 | 17.83 | 0.461
P205 | 1650 | 8 | 88.43% | 265.00 | 96.17 | 0.855 | 98.85% | 25.94 | 17.83 | 0.474
P205 | 1699 | 7 | 98.15% | 46.30 | 19.52 | 0.857 | 98.79% | 30.93 | 13.99 | 0.480
P205 | 1888 | 7 | 88.32% | 374.85 | 253.90 | 0.864 | 98.79% | 30.93 | 13.99 | 0.499
P205 | 1920 | 7 | 86.85% | 447.20 | 371.80 | 0.869 | 98.79% | 30.93 | 13.99 | 0.502
P205 | 2077 | 6 | 93.67% | 197.45 | 121.67 | 0.888 | 98.90% | 36.08 | 11.42 | 0.518
P205 | 2100 | 6 | 92.64% | 205.35 | 157.83 | 0.891 | 98.90% | 36.08 | 11.42 | 0.523
P205 | 2266 | 6 | 85.85% | 459.64 | 395.98 | 0.907 | 98.90% | 36.08 | 11.42 | 0.542
P205 | 2300 | 6 | 84.58% | 516.95 | 456.03 | 0.912 | 98.90% | 36.08 | 11.42 | 0.549
P205 | 2454 | 5 | 95.13% | 219.16 | 63.68 | 0.920 | 99.08% | 34.69 | 11.02 | 0.558
P205 | 2500 | 5 | 93.38% | 251.86 | 112.88 | 0.944 | 99.08% | 34.69 | 11.02 | 0.565
P205 | 2643 | 5 | 88.33% | 419.17 | 277.99 | 0.952 | 99.08% | 34.69 | 11.02 | 0.577
P205 | 2800 | 5 | 83.38% | 683.82 | 659.15 | 0.981 | 99.08% | 34.69 | 11.02 | 0.582
P205 | 2832 | 5 | 82.43% | 735.43 | 711.63 | 0.992 | 99.08% | 34.69 | 11.02 | 0.589
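Since Table 4 reports LE, SI, and CSI without restating their formulas, the LaTeX sketch below recalls the definitions most commonly used in the TALBP literature, purely as a reading aid; the article's own formulations take precedence, and CSI is defined analogously to SI but over station completion times rather than workloads.

```latex
% Interpretive sketch only: common TALBP definitions, which may differ in detail
% from the formulations given earlier in this article. N tasks with times t_i,
% n_s mated stations (2 n_s station sides), cycle time ct, ST_k the load of side k.
\begin{align}
  LE &= \frac{\sum_{i=1}^{N} t_i}{2\, n_s \cdot ct} \times 100\%, &
  SI &= \sqrt{\frac{1}{2 n_s} \sum_{k=1}^{2 n_s} \bigl(ct - ST_k\bigr)^{2}}.
\end{align}
% Worked check against Table 4: for P24 with ct = 35 and n_s = 2, the reported
% LE = 100% and SI = 0 mean all four station sides carry exactly 35 time units,
% which would imply a total task time of 2 * 2 * 35 = 140 under this definition.
```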
Table 5. Comparison of the offline training times of DPPO–CNN and PPO–CNN.

Instance | DPPO–CNN Training Time (h) | PPO–CNN Training Time (h)
P9 | 0.15 | 0.10
P12 | 0.20 | 0.12
P16 | 0.35 | 0.20
P24 | 0.50 | 0.35
P65 | 1.00 | 0.68
P148 | 1.50 | 0.95
P205 | 4.00 | 2.34