Article

Load Balancing Strategies for Slice-Based Parallel Versions of JEM Video Encoder

by Héctor Migallón *,†, Otoniel López-Granado †, Miguel O. Martínez-Rach †, Vicente Galiano † and Manuel P. Malumbres †
Computer Engineering Department, Miguel Hernández University, 03202 Elche, Spain
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Submission received: 30 September 2021 / Revised: 27 October 2021 / Accepted: 30 October 2021 / Published: 1 November 2021
(This article belongs to the Collection Parallel and Distributed Computing: Algorithms and Applications)

Abstract: The proportion of video traffic on the internet is expected to reach 82% by 2022, mainly due to the increasing number of consumers and the emergence of new video formats with more demanding features (depth, resolution, multiview, 360°, etc.). Efforts are therefore being made to constantly improve video compression standards in order to minimize the necessary bandwidth while retaining high video quality levels. In this context, the Joint Video Exploration Team has been analyzing new video coding technologies to improve the compression efficiency with respect to the HEVC video coding standard. A software package known as the Joint Exploration Test Model has been proposed to implement and evaluate new video coding tools. In this work, we present parallel versions of the JEM encoder that are particularly suited for shared memory platforms and can significantly reduce its huge encoding time. The proposed parallel algorithms are shown to achieve high levels of parallel efficiency; in particular, in the All Intra coding mode, the best of our proposed parallel versions achieves an average efficiency value of 93.4%. They also show high levels of scalability, owing to the inclusion of an automatic load balancing mechanism.

1. Introduction

The High Efficiency Video Coding (HEVC) standard [1] was developed by the Joint Collaborative Team on Video Coding (JCT-VC) in 2013, and replaced the previous H.264/Advanced Video Coding (AVC) standard [2]. The HEVC standard obtains savings in terms of bit rate of almost 50%, with the same visual quality as the previous H.264/AVC standard. However, this reduction is obtained at the expense of a huge increase in the computational complexity of the encoding process [3].
Recently, Cisco released a report called “Forecast and Trends: 2017–2022 White Paper” [4], in which they state that IP video traffic will form 82% of all IP traffic by 2022, representing a four-fold increase between 2017 and 2022. This represents a situation where each second, a million minutes of video content travel through the network. The report also predicts a constant increase in novel services such as video-on-demand (VoD), live internet video, virtual reality (VR) and augmented reality (AR). VoD traffic is expected to double by 2022, mainly due to the increasing numbers of consumers and higher video resolution (4 K and 8 K), bringing the amount of VoD traffic to the equivalent of 10 billion DVDs per month. The impact of user devices on global traffic is even more important when we consider popular services such as ultra-high-definition (UHD) video streaming. We need to take into account the fact that the bit rate for 4 K video is about 15 to 18 Mbps, more than double the bit rate for HD video and nine times more than standard definition (SD) video. The Cisco report estimates that by 2022, nearly 62% of the flat-panel TV sets installed will be UHD.
In order to deal with this increase in IP video traffic, new video coding techniques are required to obtain higher compression rates. Since the release of HEVC, both the ITU-T Video Coding Expert Group (VCEG) and the ISO/IEC Moving Picture Expert Group (MPEG) have been studying and analyzing new video coding technologies in order to improve the compression capability compared to that obtained by HEVC. To achieve this, a framework of collaboration has been created called the Joint Video Exploration Team (JVET).
The compression enhancements studied by the JVET have been implemented in a software package known as the Joint Exploration Test Model (JEM) [5]. Its main purpose is to explore new coding tools that can provide significant improvements at the video coding layer. Following an analysis of the new coding tools proposed within the last few years, the JVET has begun developing a future video coding standard called Versatile Video Coding (VVC) [6]. The main goal of this coding standard is to achieve bit rate savings of between 25% and 30% compared to HEVC [7,8].
Preliminary results obtained with the new model (JEM 3.0) show an 18% reduction in bit rate using the All Intra (AI) coding mode configuration [9]. However, this is achieved at the expense of an extremely large increase in computational complexity (60x) with respect to HEVC.
This increase requires the introduction of acceleration techniques that leverage hardware architectures to reduce encoding time. Since JEM is an exploration model, only a few articles have been published on the subject, and most of them focus on rate distortion (R/D) comparisons between the JEM, HEVC and AV1 codecs [10]. Recently, the authors of [11] proposed a pre-analysis algorithm designed to extract motion information from a frame in order to accelerate the motion estimation (ME) stage. Their proposal showed that around 27% of the reference frames could be skipped and that a time saving of more than 62% was achieved on the integer ME operation, with a negligible impact of 0.11% on the Bjøntegaard delta rate (BD rate) [12]. The authors of [13] proposed parallel algorithms based on the group of pictures (GOP) structure, at the cost of an increased BD rate when temporal redundancy is exploited.
In this paper, we present two parallel versions of the JEM encoder that are specifically designed for shared memory platforms, in order to speed up the encoding process in the AI coding mode, which is especially useful for video editing. We performed several experimental tests to illustrate the behavior of the parallel versions in terms of their parallel efficiency and scalability. The first algorithm, a synchronous one named JEM-SP-Sync, performs a domain decomposition in which the data are almost equally distributed, although the computational load is not balanced. The second, named JEM-SP-ASync, is an asynchronous algorithm, also based on a domain decomposition, that is able to balance the load automatically.
The rest of this paper is organized as follows. In Section 2, we present a brief description of the coding tools introduced in the new JEM video coding standard. Section 3 describes the parallel algorithms developed for the AI coding mode of the JEM standard. Experimental numerical results are presented in Section 4, and in Section 5, some conclusions are drawn.

2. Description of the New Characteristics of the Joint Exploration Test Model (JEM)

The JEM codec is based on the HEVC reference software, called HM, meaning that the overall architecture of the two codecs is quite similar, since they share a hybrid video codec design. However, some of the coding stages have been modified in the JEM implementation in order to improve on the previous standard [14,15]. The R/D performance of JEM is better than that of HEVC due to the use of these techniques, but this is achieved at the expense of an increased computational cost for the intra-prediction stage. This section describes the main improvements offered by JEM in comparison to the previous standards, as these could lead to a load imbalance when using parallel algorithms such as those proposed in this work.

2.1. Picture Partitioning

The way in which a video frame is split into a set of non-overlapping blocks is called picture partitioning. These non-overlapping blocks are arranged into a quadtree structure, where the root is called a coding tree unit (CTU) [16], and each CTU is further partitioned into smaller blocks. Figure 1 shows the division of a 1280 × 720 pixel frame into 240 CTUs, split into 20 columns by 12 rows, where the last row is composed of incomplete CTUs. The complete CTUs are composed of 64 × 64 pixels.
Two of the major differences between HEVC and JEM are the way in which a CTU is further partitioned and the size of the CTU itself. In HEVC, the maximum CTU size is 64 × 64 pixels, and there is the option to further recursively partition it into four square coding units (CU) whose sizes range from 64 × 64 pixels (i.e., no partitioning) to 8 × 8 pixels. The leaf blocks in a CU quadtree form the roots of two independent trees that contain prediction units (PUs) and transform units (TUs).
A PU can have the same size as the CU, or can be further split into smaller PUs of up to 8 × 8 pixels. The PUs store the prediction information in the form of motion vectors (MVs). In intra-prediction mode, HEVC uses a quadtree structure with only square PUs, while in inter-prediction mode, asymmetric splitting of PUs is possible, giving up to eight possible partitions for each PU block: 2N × 2N, 2N × N, N × 2N, N × N, 2N × nU, 2N × nD, nL × 2N and nR × 2N. Figure 2 shows an example of a CTU partition in HEVC and the relationship between CU partitioning, PU partitioning and TU partitioning.
The picture partitioning schema is modified in JEM in order to simplify the prediction and transform stages, and further partitions of CUs to form PU and TU trees are avoided. The JEM partitioning schema, called quadtree plus binary tree (QTBT), offers a better match with the local characteristics of each frame [17]. The highest level is a CTU, as in HEVC, but the main change is that, below the quadtree, blocks are split by a binary tree whose leaves are the final coding units.
The CTU is larger than in HEVC, with a maximum size of 256 × 256 pixels, and only the quadtree partitions are required to be square. The quadtree can be applied recursively for as many levels as desired, and from that level downwards the binary tree completes the partitioning schema. There are two types of splitting in the binary tree: symmetric horizontal and symmetric vertical. The binary tree leaf node is the CU, which is used for prediction and transformation with no further partitioning; hence, in most cases, the CU, PU and TU have the same size. An example of QTBT is shown in Figure 3; here, the quadtree has two levels (continuous lines), after which the binary tree starts (dotted lines).
In JEM, a CU can have either a square or a rectangular shape, and consists of coding blocks (CBs) of different color components; for example, a CU may contain one luma CB and two chroma CBs in the YUV420 chroma format.
In HEVC, inter prediction for small blocks is restricted in order to reduce memory accesses for motion compensation, i.e., bi-directional prediction for 4 × 8 and 8 × 4 blocks is not allowed, and 4 × 4 inter-prediction is also disabled. In QTBT, these restrictions are removed, which increases the computational cost of the JEM codec.
The CUs are not partitioned further for transforming or prediction unless the CU is too large for the maximum transform size. The maximum transform size is 128 × 128 pixels, which improves the coding efficiency for higher resolution video, e.g., 1080p and 4 K sequences.
The following parameters are defined in order to obtain efficient partitioning in a QTBT tree:
  • CTU size: The root node size of a quadtree; the same concept as in HEVC.
  • MinQTSize: The minimum allowed size of the leaf node in the quadtree.
  • MaxBTSize: The maximum allowed size of the root node in the binary tree.
  • MaxBTDepth: The maximum allowed depth of the binary tree.
  • MinBTSize: The minimum allowed size of the leaf node in the binary tree.
MaxBTSize and MinQTSize are the two factors that are most critical to the R/D performance and the encoding time. In JEM, for P and B slices only, these two parameters are set adaptively for the current slice: they are increased when the average CU size of the previously encoded picture is larger, and decreased otherwise [17].
At the transform stage, only the lower-frequency coefficients are maintained for transform blocks with sizes (width or height) larger than or equal to 64. For example, for an M × N transform block (where M is the width and N the height), when M is larger than or equal to 64, only the left M/2 columns of transform coefficients are retained. Similarly, when N is larger than or equal to 64, only the top N/2 rows of the transform coefficients are retained. This zeroing is not applied when transform skip mode is used for large blocks.
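As a minimal illustration of this rule (our own sketch, not the JEM reference code), the retained low-frequency region of a transform block can be computed as follows:

    // Sketch: retained low-frequency region of an M x N transform block,
    // following the rule described above (a dimension >= 64 keeps only its
    // lower-frequency half). Not taken from the JEM reference software.
    void retainedRegion(int M, int N, int& keptCols, int& keptRows) {
        keptCols = (M >= 64) ? M / 2 : M;   // left M/2 columns are kept
        keptRows = (N >= 64) ? N / 2 : N;   // top N/2 rows are kept
    }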
The proposed QTBT approach in JEM uses more partition types than HEVC in order to adapt the resulting partition tree to the contents of the scene. It is guided in this task by a trade-off between rate reduction and distortion reduction, and as we will see in the next section, this is a computationally expensive task. The whole video frame is first partitioned into equally sized (up to 256 × 256) CTUs, and each CTU is then further partitioned into CUs based on the scene contents, i.e., the time needed to process a whole CTU depends on the complexity of the underlying scene in each CTU. A summary of the main differences between HEVC and JEM related to picture partitioning is shown in Table 1.

2.2. Spatial Prediction

In order to be able to capture the finer edge directions presented in natural videos, the directional intra-modes in JEM have been extended from 33, as defined in HEVC, to 65. The addition of planar and DC modes gives a total of 67 different prediction modes for JEM. These denser directional intra-prediction modes (see Figure 4) are applied to all PU sizes and both luma and chroma intra-predictions.
The partitioning schema described in the previous section is directed by a rate-distortion optimization (RDO) algorithm that recursively searches for the best possible partitioning schema in terms of an R/D estimation. This algorithm tries all directional intra-modes for each of the possible partitions (i.e., no partitioning, vertical partitions, horizontal partitions and quadtree partitions) at each recursion level, to find the one with the lowest cost. For a CTU in which the underlying scene is smooth, this recursion ends rapidly, and the number of trials for the 67 directional modes is therefore much lower than if the CTU belongs to a highly textured area within the scene. Thus, the computational effort is not evenly distributed over the CTUs in a video frame, and depends on the content of the scene.
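To make the shape of this search concrete, the following is a heavily simplified sketch of such a recursive partition search; Block, the placeholder cost function and the split rule are hypothetical stand-ins, as the actual JEM implementation is far more involved (it also evaluates binary splits and prunes the recursion early for smooth content):

    #include <algorithm>

    struct Block { int x, y, w, h; };

    // Placeholder: in JEM this would evaluate up to 67 intra-modes via RDO.
    double rdCostOfBestIntraMode(const Block& b) {
        return static_cast<double>(b.w) * b.h;
    }

    double searchBestPartition(const Block& b, int minSize) {
        double best = rdCostOfBestIntraMode(b);     // cost without splitting
        if (b.w > minSize && b.h > minSize) {       // quadtree split allowed
            int hw = b.w / 2, hh = b.h / 2;
            Block sub[4] = { {b.x,      b.y,      hw, hh},
                             {b.x + hw, b.y,      hw, hh},
                             {b.x,      b.y + hh, hw, hh},
                             {b.x + hw, b.y + hh, hw, hh} };
            double splitCost = 0.0;
            for (const Block& s : sub)
                splitCost += searchBestPartition(s, minSize);
            best = std::min(best, splitCost);       // keep the cheaper option
        }
        return best;
    }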
In JEM, the list of most probable modes (MPMs) is extended from the three used in HEVC to six, and the selection procedure for these modes is also changed. In HEVC, the method proposed in [18] was adopted in the standard for building the MPMs list. The 35 intra-modes are divided into two groups: three MPMs and 32 remaining modes. The three MPMs are derived based on the modes of the PUs to the left of and above the current PU. The new procedure followed in JEM [19] uses five neighbors of the current PU: left, above, above left, below left and above right, as shown in Figure 5 [19].
The improvements in JEM described in the following paragraphs, which increase the complexity of the encoder, can also lead to a load imbalance when the slices of a frame are processed in parallel.
In addition to the changes in the JEM encoder mentioned above, there are also differences between HEVC and JEM in the entropy coding of the MPM list, as explained in [17,19]: the number of contexts used by the entropy encoder to signal the MPM index is reduced from nine to three, corresponding to the vertical, horizontal and non-angular MPM mode classes.
The interpolation filters are also changed in JEM with respect to HEVC [15]. In HEVC, a two-tap linear interpolation filter is used to generate the intra-prediction block in the directional prediction modes (i.e., excluding the planar and DC predictors). In JEM, four-tap intra-interpolation filters are used for directional intra-prediction filtering. Cubic interpolation filters are used for blocks smaller than or equal to 64 samples, and Gaussian interpolation filters are used for larger blocks. The filter parameters are set based on the block size, and the same filter is used for all modes.
Another improvement to JEM is made in the boundary prediction filters [15]. In HEVC, after the prediction block has been generated for the vertical or horizontal intra-modes, the leftmost column or the top row of the predicted block are adjusted further using the values of the boundary samples. In JEM, the number of boundary samples is increased from one to four (rows or columns) in order to obtain the predicted value using a two-tap filter (for the first and last angular modes, corresponding to intra-modes 2 and 34 in HEVC) or a three-tap filter (for modes between intra-modes 3–6 and 30–33 in HEVC), as shown in the example in Figure 6.
In JEM, the results of the intra-prediction planar mode are also improved by including a position-dependent intra-prediction component (PDPC) method that processes a specific combination of the unfiltered boundary reference samples with the filtered ones, thus improving the perceived quality of the predicted block when the planar mode is used. This process uses different weights and filter sizes (three-tap, five-tap, seven-tap) based on the block size.
To reduce some of the redundancy that remains after the prediction process between the luma and chroma components, JEM uses cross-component linear model (CCLM) prediction. In this process, the chroma samples are predicted based on the reconstructed down-sampled luma samples of the same CU, using a linear model for square blocks. For non-square blocks, additional down-sampling is needed to match the shorter boundary. There are two CCLM modes: single- and multiple-model CCLM modes (MMLM). In the single-model CCLM mode, JEM employs only one linear model to predict the chroma samples, while in MMLM, there can be two models. In MMLM, the models are built based on two groups of boundary samples that serve as a training set for deriving the linear models [15]. A summary of the main differences between HEVC and JEM related to spatial prediction is shown in Table 2.
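In its usual formulation (see the JEM algorithm description [15]), the single-model CCLM prediction is a simple linear mapping from the down-sampled reconstructed luma samples of the CU to the predicted chroma samples:

    pred_C(i, j) = α · rec′_L(i, j) + β

where the parameters α and β are derived from the neighboring reconstructed luma and chroma samples; MMLM derives two such (α, β) pairs, one for each group of training samples.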

3. Parallel Approaches

Slices are fragments of a frame formed by correlative (in raster scan order) CTUs (see Figure 7). These are regions of the same frame that can be decoded (and also encoded) independently, which offers a valid parallelization approach for video encoding and decoding processes. However, slice independence has a drawback in that the existing redundancy between data belonging to different slices cannot be exploited, and thus the coding efficiency in terms of the R/D performance decreases. Moreover, slices are composed of a header and data, and although this is useful in terms of providing an encoded video sequence with error resilience features (since the loss of a single slice does not prevent the rest of the slices in the same frame from being properly decoded), the inclusion of a header in each slice also causes a decrease in the R/D performance.
Based on the slice partitioning of the JEM video encoder, each video frame is divided into as many slices as threads to be spawned. The number of threads in the parallel region can be set as a parameter or can be obtained depending on the current state of the computer system. Hence, the number of threads (and consequently the number of slices), and the slice size are computed before starting the encoding process. In this work, we have developed two algorithms, the first of which requires synchronization processes while the second is a completely asynchronous algorithm.
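As a minimal sketch of this setup (our own illustration using OpenMP; the function and parameter names are hypothetical, not JEM code), the slice/thread count could be fixed as follows:

    #include <omp.h>

    // Sketch: choose the number of slices (one per thread), either from an
    // encoder parameter or from the threads available on the system.
    int chooseNumSlices(int requestedThreads) {   // 0 means "use system default"
        int noT = (requestedThreads > 0) ? requestedThreads
                                         : omp_get_max_threads();
        omp_set_num_threads(noT);                 // one thread per slice
        return noT;
    }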

3.1. Synchronous Algorithm: JEM-SP-Sync

Algorithm 1 shows the parallel algorithm, called JEM-SP-Sync, which includes synchronization processes. In Algorithm 1, the size of the slice is first computed in numbers of CTUs. To do this, we initially compute the number of horizontal (NoHzCTUs) and vertical (NoVrCTUs) CTUs and, from these, the total number of CTUs (NoCTUs) in a frame, which depends on the video resolution. It is worth noting that both the right-hand and bottom CTUs in the frame may be incomplete (see lines 4 and 8). Furthermore, since the slices may not all have the same number of CTUs, the algorithm sets the size of the last slice equal to or smaller than the size of the rest of the slices in order to achieve a better load balance (lines 13–18).
Before starting the encoding process of the whole video sequence, each thread computes the CTUs of the slice assigned to that thread, which always remains the same throughout the encoding process (lines 20–26). The encoding process starts by reading the frame to be encoded and storing it in memory. This initial process is performed by a single thread, and a synchronization point is therefore needed to ensure that every thread waits until the frame is available (line 30). In a similar way, the reconstructed frame is also stored in the shared memory. This task is carried out by each individual thread after encoding the assigned slice, meaning that another synchronization point is required before applying the “loop filter” process (line 32). The encoded video data stream (i.e., the bit stream) is organized into network abstraction layer units (NALUs), where each NALU is a packet containing an integer number of bytes. To finish the encoding process, the NALUs corresponding to each slice must be written in the correct order to form the final bit stream (line 35). It is worth mentioning that the slice-based parallel strategy for HEVC proposed in [20] obtained good speed-ups when all slices had the same number of CTUs, or differed by a maximum of a single CTU. However, this behavior changes greatly when the JEM encoder is used. As explained in Section 2, the changes to the coding procedure introduced in JEM with respect to HEVC result in significant differences in computational cost when encoding different CTUs. Note that this load imbalance is mainly due to the intrinsic characteristics of the video content.
Algorithm 1 JEM-SP-Sync: Slice-based parallel algorithm with synchronization processes.
 1: Set NoT to the number of threads (equal to the number of slices)
 2: procedure Compute the number of CTUs per frame
 3:     NoHzCTUs = FrWidth / CTUSize
 4:     if FrWidth % CTUSize != 0 then
 5:         NoHzCTUs++
 6:     end if
 7:     NoVrCTUs = FrHeight / CTUSize
 8:     if FrHeight % CTUSize != 0 then
 9:         NoVrCTUs++
10:     end if
11:     NoCTUs = NoHzCTUs × NoVrCTUs
12: end procedure
13: procedure Compute the number of CTUs per slice
14:     NoCTUsSlice = NoCTUs / NoT
15:     if NoCTUs % NoT != 0 then
16:         NoCTUsSlice++
17:     end if
18: end procedure
19: IN PARALLEL (NoT):
20: procedure Assign the dimensions of the slice(Tid)
21:     iniCTU_Tid = Tid × NoCTUsSlice
22:     endCTU_Tid = iniCTU_Tid + NoCTUsSlice − 1
23:     if Tid == (NoT − 1) then
24:         endCTU_Tid = NoCTUs − 1
25:     end if
26: end procedure
27: for i = 1 to NumFramesInSequence do
28:     Single thread:
29:         Read frame i into global memory
30:     Synchronization point #1
31:     Encode slice (iniCTU_Tid to endCTU_Tid)
32:     Synchronization point #2
33:     Apply the loop filter process to the whole frame
34:     Generate NALUs
35:     Ordered:
36:         Write NALUs to bitstream
37: end for
Figure 8 shows graphically the structure of the synchronous parallel algorithm JEM-SP-Sync. As explained in Algorithm 1, each thread (Tx) always processes the same slice (Sx) in all frames. Before starting the processing of a new frame, the threads must synchronize (Sync) to compose the bitstream of the last frame (line 35 of Algorithm 1).
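The structure of JEM-SP-Sync maps naturally onto OpenMP constructs. The following is a minimal sketch under that assumption; readFrame, encodeSlice, filterAndGenerateNALUs and writeSliceNALUs are hypothetical stand-ins for the corresponding JEM routines, not the actual code:

    #include <omp.h>

    void readFrame(int frame) {}                  // placeholder JEM routines
    void encodeSlice(int firstCTU, int lastCTU) {}
    void filterAndGenerateNALUs(int tid) {}
    void writeSliceNALUs(int slice) {}

    void encodeSequenceSync(int numFrames, int noT,
                            const int* iniCTU, const int* endCTU) {
        #pragma omp parallel num_threads(noT)
        {
            int tid = omp_get_thread_num();
            for (int i = 0; i < numFrames; ++i) {
                #pragma omp single
                readFrame(i);                     // sync point #1 (implicit barrier)
                encodeSlice(iniCTU[tid], endCTU[tid]);
                #pragma omp barrier               // sync point #2: wait for all slices
                filterAndGenerateNALUs(tid);
                #pragma omp for ordered schedule(static, 1)
                for (int s = 0; s < noT; ++s) {
                    #pragma omp ordered
                    writeSliceNALUs(s);           // NALUs written in slice order
                }                                 // implicit barrier before next frame
            }
        }
    }

If one slice takes longer than the others, its thread delays every other thread at the barriers, which is exactly the imbalance analyzed in Section 4.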

3.2. Asynchronous Algorithm: JEM-SP-ASync

To solve the load imbalance drawback, we designed a parallel algorithm called JEM-SP-ASync, shown in Algorithm 2, which does not use any type of synchronization during the overall encoding process. The calculation of both the number of CTUs per frame and the number of CTUs per slice (line 2) is identical to that in Algorithm 1 (lines 2–18). Before starting the parallel region, the dimensions of all slices are calculated (lines 3–10) and stored in memory, which will be configured as shared memory (the iniCTUs and endCTUs arrays), since these values will be used by all threads. At the beginning of the parallel region, the first slice to be encoded by every thread is set (line 12) based on the thread identifier (Tid). However, the mapping of slices to threads changes for every new frame, following a round-robin-like scheduling (lines 24–27). For this reason, each thread must update the CTUs to be encoded when starting the encoding of a new frame (lines 15 and 16). Since there are no synchronization points, the encoded NALUs must be stored in shared memory.
When the encoding process for a slice is complete, each thread checks whether there is a frame for which all of the slices have been encoded; if so, this thread writes the complete encoded frame to the bit stream. This procedure is performed within a parallel critical region (lines 20–23). The mapping of slices to threads included in Algorithm 2 provides an automatic load balancing mechanism without the need for synchronization points or a coordinating process. By alternating the slice that each thread encodes from one frame to the next, the computational cost per thread tends to even out, with greater probability as the number of frames to be encoded increases.
In the asynchronous algorithm, shown in Figure 9, there are no synchronization points. To compose the bit stream, each thread, after processing a slice, checks whether all the slices of the frame following the last fully written one are already encoded; if so, the thread writes that frame to the bit stream and removes the stored data. This process (lines 20–23 in Algorithm 2) does not involve any synchronization point. Furthermore, each thread (Tx) processes a different slice (Sx) in each frame.
Algorithm 2 JEM-SP-ASync: Slice-based parallel algorithm without synchronization processes.
 1: Set NoT to the number of threads (equal to the number of slices)
 2: NoCTUs and NoCTUsSlice computed as in Algorithm JEM-SP-Sync
 3: procedure Compute dimensions of all slices
 4:     for i = 0 to (NoT − 2) do
 5:         iniCTUs[i] = i × NoCTUsSlice
 6:         endCTUs[i] = (i + 1) × NoCTUsSlice − 1
 7:     end for
 8:     iniCTUs[NoT − 1] = (NoT − 1) × NoCTUsSlice
 9:     endCTUs[NoT − 1] = NoCTUs − 1
10: end procedure
11: IN PARALLEL (NoT):
12: idSlice = Tid
13: for i = 1 to NumFramesInSequence do
14:     Read slice idSlice from frame i into private memory
15:     currentIniCTU = iniCTUs[idSlice]
16:     currentEndCTU = endCTUs[idSlice]
17:     Encode slice (currentIniCTU to currentEndCTU)
18:     Apply the loop filter process to own slice
19:     Store NALU(s) in global memory
20:     critical:
21:         if there is a fully encoded frame then
22:             Write all NALUs of that frame to bitstream
23:         end if
24:     idSlice++
25:     if idSlice > (NoT − 1) then
26:         idSlice = 0
27:     end if
28: end for
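A minimal OpenMP sketch of this scheme follows (our own illustration; encodeSliceOfFrame, frameFullyEncoded and writeFrameToBitstream are hypothetical helpers, and the per-frame, per-slice NALU storage is assumed to live in shared memory):

    #include <omp.h>

    void encodeSliceOfFrame(int frame, int slice) {}   // placeholder helpers
    bool frameFullyEncoded(int frame) { return false; }
    void writeFrameToBitstream(int frame) {}

    void encodeSequenceAsync(int numFrames, int noT) {
        int nextFrameToWrite = 0;                 // shared output cursor
        #pragma omp parallel num_threads(noT)
        {
            int idSlice = omp_get_thread_num();   // first slice = thread id
            for (int i = 0; i < numFrames; ++i) {
                encodeSliceOfFrame(i, idSlice);   // no barriers anywhere
                #pragma omp critical
                {
                    // Flush, in order, every frame whose slices are all done.
                    while (nextFrameToWrite < numFrames &&
                           frameFullyEncoded(nextFrameToWrite)) {
                        writeFrameToBitstream(nextFrameToWrite);
                        ++nextFrameToWrite;
                    }
                }
                idSlice = (idSlice + 1) % noT;    // round-robin slice remapping
            }
        }
    }

Only the short critical section serializes the threads; no thread ever waits for the others to finish a frame, so a slow slice simply delays the writing of its frame, not the encoding work of the other threads.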

4. Experimental Results

The reference software used in our experiments was JEM-7.0rc1 [21], and the parallel algorithms were developed and tested using the GCC v.4.8.5 compiler [22] and OpenMP [23]. The shared memory platform used consisted of two Intel Xeon X5660 hexa-core processors, running at up to 2.8 GHz with 12 MB of cache per processor, and 48 GB of RAM. The operating system was CentOS Linux 5.6 for x86/64-bit systems.
The proposed algorithms were tested for the video sequences listed in Table 3, each of which had a different resolution and a different frame rate. In all of the experiments, we used the same number of frames to be encoded (120) for all video sequences. Hence, the number of seconds to be encoded varied depending on the frame rate of the video sequence tested. We used a small number of frames for encoding in order to evaluate the load balancing capability of the JEM-SP-ASync algorithm (Algorithm 2). Note that as the number of frames to be encoded increases, the automatic load balancing method of Algorithm 2 is expected to improve, since it is statistically more likely that the computational cost per thread will be balanced. The values of the quantization parameter (QP) used were 37, 32, 27 and 22.
Before addressing the efficiency of the parallel algorithms described in Section 3, we analyze the theoretical load balance index, which will depend on both the resolution of the video sequence and the number of slices (i.e., the number of threads). In all of the experiments, the CTU size was set to 128, based on common testing conditions [24]. Table 4 shows the dimensions of the different video resolutions tested, in numbers of CTUs of 128 × 128 pixels. As mentioned above, there may be incomplete CTUs at the right-hand and bottom edges of the video sequence. This occurs when the number of horizontal/vertical CTUs is not an integer, as shown in Table 4. This is the primary source of potential load imbalance, even if the computational costs associated with different CTUs are similar.
Table 5 shows the theoretical size of the slices, in number of CTUs, required to obtain a balanced load. As previously explained for Algorithm 1, when the number of CTUs per dimension is not an integer, the size is rounded up to the next integer value; otherwise, the number of slices would not match the number of threads.
Table 6 shows the size differences, in numbers of CTUs, between the last slice and the other slices in the same frame. As can be seen, when 12 slices are used in the HD video sequence, the difference reaches nine CTUs. This is the second source of potential load imbalance, and as in the previous case, this holds even if the computational costs associated with different CTUs are similar. In addition, the use of a given number of threads is not recommended in certain cases. For example, when 11 threads are used in the 1280 × 720 video sequence (60 CTUs in Table 6), all slices have six CTUs while the last has none, meaning that it will remain idle throughout the encoding process.
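These two structural sources of imbalance are easy to reproduce with the ceiling rule of Algorithm 1; the following sketch (our own illustration) computes the slice sizes for a given CTU count and thread count:

    #include <cstdio>

    // Sketch: CTUs per slice (rounded up, as in Algorithm 1) and the size
    // of the last slice; a non-positive last slice means idle threads.
    void sliceBalance(int noCTUs, int noT) {
        int perSlice  = (noCTUs + noT - 1) / noT;   // ceiling division
        int lastSlice = noCTUs - perSlice * (noT - 1);
        std::printf("%d CTUs / %d slices: %d CTUs per slice, last slice %d\n",
                    noCTUs, noT, perSlice, lastSlice);
    }
    // sliceBalance(60, 11) reports 6 CTUs per slice with a last slice of 0,
    // reproducing the idle-thread case for the 1280 × 720 sequence above.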
The third and final source of potential load imbalance depends on the encoding complexity of JEM, which does not depend on the number of CTUs but on the intrinsic characteristics of the intra-prediction, which may be affected by the CTU contents. As explained in Section 2, this load imbalance cannot be predicted, as it depends on the intrinsic characteristics of the video content to be encoded, which modifies the amount of processing required to encode each CTU.
To show that domain decomposition using slices does not ensure that the load is balanced, we experimentally obtained the computational cost of each slice for the sequences listed in Table 3. Table 7, Table 8, Table 9 and Table 10 show the experimental percentages of the computational cost assigned to each slice for the Park Scene, Four People, Party Scene and BQ Square video sequences, respectively. These tables include different numbers of slices per frame for each video sequence, and show the relative computation time required by each slice at different compression rates (QPs). As can be observed, none of these schemes achieve correct load balancing, regardless of whether or not the volume of data assigned to each process is balanced.
Table 11 shows the computation times (in seconds) needed to encode 120 frames using the AI coding mode for all video sequences tested, and the average computation time per frame for all QPs tested. As can be seen, the sequential algorithm required an average of 14 min to encode a frame of the Park Scene video sequence, meaning that 28.2 h would be needed to encode the 120 frames of the video sequence (five seconds of video).
Table 12 and Table 13 show the parallel efficiency and the computational times, respectively, of the JEM-SP-Sync algorithm for the BQ Square, Party Scene, Four People and Park Scene video sequences encoded at four QP values (22, 27, 32, 37). The computation times were obtained using OpenMP functions and the parallel efficiency, in percentage, was calculated according to Equation (1).
Efficiency = (Sequential_time / (Parallel_time × NoT)) × 100    (1)
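As a purely hypothetical example, a sequence encoded in 1000 s sequentially and in 130 s by 8 threads would yield an efficiency of 1000/(8 × 130) × 100 ≈ 96.2%.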
The JEM-SP-Sync algorithm, which includes synchronization points, generally obtains good parallel efficiency when few threads are used (set by NoT value), but does not have good scalability, meaning that its efficiency degrades as the number of threads increases. The high computational cost of the JEM video encoder means that if the load is not perfectly balanced, the parallel performance is strongly affected. In this case, and as mentioned above, the load imbalance may be caused by differences in slice sizes, by some slices having incomplete CTUs, or by the difference in the intrinsic complexity of the CTUs in the coding process in JEM. As shown in Table 7, Table 8, Table 9 and Table 10, the computational cost is not balanced despite the quasi-balanced domain decomposition according to the volume of data assigned to each processor. The load imbalance is due to the nature of the data, i.e., the content of each CTU. Since the JEM-SP-Sync algorithm is synchronous, when a thread has completed the processing of the assigned CTUs of one frame, it must wait in an idle state until the rest of threads also complete the assigned CTUs, which in turn decreases parallel efficiency, as shown in Table 12.
The second algorithm, JEM-SP-ASync, was developed to avoid the use of synchronization processes, and can improve the parallel efficiency if load balancing is achieved. The process implemented in this algorithm to balance the load assigns a different slice to each thread depending on the frame to be encoded, thus providing automatic load balancing. Table 14 and Table 15 show the parallel efficiency and the computational times obtained by the JEM-SP-ASync algorithm for all video sequences tested here. The results confirm that the automatic load balancing process implemented in the JEM-SP-ASync algorithm works correctly and shows very good scalability, especially compared to the JEM-SP-Sync algorithm (Table 12).
It should be noted that in order to develop both algorithms, several procedures of the reference software had to be modified; the number of modified processes is greater in JEM-SP-ASync than in JEM-SP-Sync. However, none of these changes affect the encoding processes implemented in the reference software. In the reference software, the encoded slices are totally independent, which is what makes it possible to develop parallel algorithms without modifying the standard encoding process. We did detect, however, that the encoded output of a slice depends slightly on the order in which the slices are encoded; that is, when the slices are independently encoded out of order, the result differs slightly from that of the reference software, although the slices remain totally independent regardless of their encoding order. Table 16 and Table 17 show the bit rate and PSNR values for the sequential and parallel algorithms, respectively. As can be seen, the differences between the two types of algorithm are almost negligible, although the bit rate is slightly lower and the PSNR slightly better in the parallel algorithms. This is because in the reference software, the initial values of some of the variables that slightly modify the intra-prediction procedure vary from one slice to the next, whereas in our parallel algorithms, these variables have the same initial values for all slices.

5. Conclusions

Some of the most important features of the JEM video encoder in relation to intra-prediction have been briefly described here. These features focus on reducing the bit rate in order to minimize the bandwidth required for transmission, but they cause a dramatic increase in the computational cost of encoding compared with previous video encoder standards, including HEVC. The parallel algorithms developed here make use of slices to implement domain decomposition; however, if the domain decomposition does not allow the volume of data assigned to each process to be perfectly balanced, the high computational cost causes significant load imbalances. Moreover, unlike in the case of the HEVC encoder, even a perfect balance of the data to be encoded by each process does not ensure load balancing: in JEM, a perfect domain decomposition is not guaranteed to give rise to a perfect computational load balance. The JEM-SP-ASync parallel algorithm was proposed to solve these drawbacks, which, as explained above, did not arise in previous standards. The automatic computational cost balancing mechanism included in its design was validated by the experimental results, which show that the average efficiency rose from 71.3% to 93.4% when the JEM-SP-ASync algorithm was used instead of the JEM-SP-Sync algorithm. The improvement in parallel scalability is also significant; for example, the average efficiency when coding the FOUR video sequence using 12 processes increased from 51.6% to 88.8%. These results were obtained by encoding only 120 frames (corresponding to between 2 and 5 s of video, depending on the frame rate of the video sequence), and demonstrate correct load balancing even for short video sequences.

List of Acronyms

AI—All Intra
AVC—Advanced Video Coding
CB—Coding Blocks
CCLM—Cross Component Linear Model
CTU—Coding Tree Unit
CU—Coding Unit
HEVC—High Efficiency Video Coding
JCT-VC—Joint Collaborative Team on Video Coding
JEM—Joint Exploration Test Model
JVET—Joint Video Exploration Team
ME—Motion Estimation
MMLM—Multiple Model CCLM Mode
MPEG—Moving Picture Expert Group
MPM—Most Probable Modes
MV—Motion Vectors
NALU—Network Abstraction Layer Unit
PDPC—Position Dependent intra-Prediction Component
PSNR—Peak Signal to Noise Ratio
PU—Prediction Unit
QP—Quantization Parameter
QTBT—Quadtree Plus Binary Tree
R/D—Rate/Distortion
RDO—Rate Distortion Optimization
TU—Transform Unit
VCEG—Video Coding Expert Group
VoD—Video on Demand
VVC—Versatile Video Coding

Author Contributions

H.M. and M.O.M.-R. conceived the analytical study; O.L.-G., H.M. and M.P.M. designed the experimental tests; H.M., V.G. and M.O.M.-R. performed the validation; H.M., M.O.M.-R., O.L.-G. and M.P.M. analyzed the data; M.O.M.-R. and H.M. wrote the original draft; O.L.-G., V.G. and M.P.M. reviewed and edited the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Spanish Ministry of Science, Innovation and Universities and the Research State Agency under Grant RTI2018-098156-B-C54 co-financed by FEDER funds, and by the Valencian Ministry of Innovation, Universities, Science and Digital Society under Grant GV/2021/152.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sullivan, G.; Ohm, J.; Han, W.; Wiegand, T. Overview of the High Efficiency Video Coding (HEVC) standard. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1648–1667. [Google Scholar] [CrossRef]
  2. ITU-T.; ISO/IEC JTC 1. Advanced Video Coding for Generic Audiovisual Services. Available online: https://www.itu.int/rec/T-REC-H.264-201610-S (accessed on 30 September 2021).
  3. Ohm, J.; Sullivan, G.; Schwarz, H.; Tan, T.K.; Wiegand, T. Comparison of the Coding Efficiency of Video Coding Standards - Including High Efficiency Video Coding (HEVC). IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1669–1684. [Google Scholar] [CrossRef]
  4. Cisco. Cisco Visual Networking Index: Forecast and Methodology, 2017–2022. Technical Report. 2019. Available online: https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white-paper-c11-741490.html (accessed on 30 September 2021).
  5. Chen, J.; Alshina, E.; Sullivan, G.J.; Ohm, J.R.; Boyce, J. Algorithm Description of Joint Exploration Test Model 7. Available online: https://mpeg.chiariglione.org/standards/exploration/future-video-coding/n17055-algorithm-description-joint-exploration-test-model (accessed on 30 September 2021).
  6. Bross, B. Versatile Video Coding (Draft 2). Available online: https://mpeg.chiariglione.org/standards/mpeg-i/versatile-video-coding (accessed on 30 September 2021).
  7. Alshina, E.; Alshin, A.; Choi, K.; Park, M. Performance of JEM 1 Tools Analysis. Technical report. In Proceedings of the 2nd JVET Meeting, San Diego, CA, USA, 20–26 February 2016. [Google Scholar]
  8. Schwarz, H.; Rudat, C.; Siekmann, M.; Bross, B.; Marpe, D.; Wiegand, T. Coding Efficiency/Complexity Analysis of JEM 1.0 Coding Tools for the Random Access Configuration. Technical report. In Proceedings of the 2nd JVET Meeting, San Diego, CA, USA, 20–26 February 2016. [Google Scholar]
  9. Karczewicz, M.; Alshina, E. JVET AHG Report: Tool Evaluation (AHG1). Available online: http://phenix.it-sudparis.eu/jvet/doc_end_user/documents/4_Chengdu/wg11/JVET-D0001-v4.zip (accessed on 30 September 2021).
  10. Grois, D.; Nguyen, T.; Marpe, D. Performance Comparison of AV1, JEM, VP9, and HEVC Encoders. Available online: https://www.researchgate.net/publication/319925128_Performance_comparison_of_AV1_JEM_VP9_and_HEVC_encoders_Conference_Presentation (accessed on 30 September 2021).
  11. García-Lucas, D.; Cebrián-Márquez, G.; Díaz-Honrubia, A.J.; Cuenca, P. Acceleration of the integer motion estimation in JEM through pre-analysis techniques. J. Supercomput. 2018, 75, 1–12. [Google Scholar] [CrossRef]
  12. Bjontegaard, G. Calculation of average PSNR differences between RD-curves. In Technical Report VCEG-M33; Video Coding Experts Group (VCEG): Austin, TX, USA, April 2001. [Google Scholar]
  13. López-Granado, O.; Migallón, H.; Martínez-Rach, M.; Galiano, V.; Malumbres, M.P.; Van Wallendael, G. A highly scalable parallel encoder version of the emergent JEM video encoder. J. Supercomput. 2019, 74, 1429–1442. [Google Scholar] [CrossRef]
  14. Sullivan, G.; Ohm, J.R. Meeting Report of the 6th Meeting of the Joint Video Exploration Team (JVET). Technical report. In Proceedings of the Joint Video Exploration Team (JVET) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11, Hobart, Australia, 31 March–7 April 2017. [Google Scholar]
  15. Chen, J.; Sullivan, G.; Ohm, J.R. Algorithm Description of Joint Exploration Test Model 7 (JEM 7). Technical Report JVET-G1001-v1. In Proceedings of the Joint Video Exploration Team (JVET) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11, Turin, Italy, 13–21 July 2017. [Google Scholar]
  16. Sze, V.; Budagavi, M.; Sullivan, G. High Efficiency Video Coding (HEVC); Springer: New York, NY, USA, 2014; pp. 1–375. [Google Scholar]
  17. ITU-T.; ISO/IEC. Algorithm Description of Joint Exploratory Test Model (JEM); Technical Report; Joint Video Exploration Team (JVET): Geneva, Switzerland, May 2015. [Google Scholar]
  18. Zhang, X.; Liu, S.; Lei, S. Intra mode coding in HEVC standard. In Proceedings of the 2012 Visual Communications and Image Processing, San Diego, CA, USA, 27–30 November 2012; pp. 1–6. [Google Scholar] [CrossRef]
  19. Seregin, V.; Zhao, X.; Said, A.; Karczewicz, M. Neighbor Based Intra Most Probable Modes List Derivation. Technical Report JVET-C0055, Joint Video Exploration Team (JVET). In Proceedings of the 3rd Meeting, Geneva, Switzerland, 26–27 January 2016. [Google Scholar]
  20. Piñol, P.; Migallón, H.; López-Granado, O.; Malumbres, M.P. Slice-based parallel approach for HEVC encoder. J. Supercomput. 2015, 71, 1882–1892. [Google Scholar] [CrossRef]
  21. JEM Reference Software. 2017. Available online: https://jvet.hhi.fraunhofer.de/svn/svn_HMJEMSoftware/tags/HM-16.6-JEM-7.0rc1/ (accessed on 30 September 2021).
  22. GCC, the GNU Compiler Collection. Free Software Foundation, Inc. 2009–2012. Available online: http://gcc.gnu.org (accessed on 30 September 2021).
  23. OpenMP ARB (Architecture Review Boards). OpenMP Application Program Interface, Version 3.1. 2011. Available online: http://www.openmp.org (accessed on 30 September 2021).
  24. Suehring, K. JVET Common Test Conditions and Software Reference Configurations. Technical Report JVET-B1010, Joint Video Exploration Team (JVET), San Diego, CA, USA, 15–21 October 2016. [Google Scholar]
Figure 1. Division of a 1280 × 720 pixel frame into CTUs: 240 (20 × 12).
Figure 2. HEVC QT partition schema and the relationship between CU, PU and TU partitioning.
Figure 3. JEM QTBT partition schema.
Figure 4. Extended (red) prediction modes in JEM [17].
Figure 5. MPMs: neighbors of the current PU in JEM.
Figure 6. Examples of boundary prediction filters for intra-modes corresponding to modes 30–34 in HEVC [15].
Figure 7. Division of a full-HD frame (1920 × 1080 pixels) into 10 slices of 51 CTUs each.
Figure 8. JEM-SP-Sync: Graphical scheme of the slice-based synchronous parallel algorithm.
Figure 9. JEM-SP-ASync: Graphical scheme of the slice-based asynchronous parallel algorithm.
Table 1. Differences between HEVC and JEM with respect to picture partitioning.

Characteristics | HEVC | JEM
CTU size | 64 × 64 | 256 × 256
CTU partition | Quad-tree with separate trees for CU, PU and TU | Quad-tree + binary tree (QTBT), shared by CU, PU and TU
Inter-prediction for 4 × 8 and 8 × 4 sizes | No bi-directional | Bi-directional allowed
Max transform unit size | 32 × 32 | 128 × 128
Table 2. Differences between HEVC and JEM with respect to the spatial prediction.

Characteristics | HEVC | JEM
Intra-modes | 33 | 67
List of MPMs | 3 | 6
N° of neighbors for MPM derivation | 2 | 5
Interpolation filters | 2-tap linear | 4-tap cubic or Gaussian
Boundary filter samples | 1 | 4
Table 3. Video sequences.

Video Seq. | Acronym | Resolution | Rate (Hz) | N. Frames | Time (s)
Park Scene | PARK | 1920 × 1080 | 24 | 120 | 5
Four People | FOUR | 1280 × 720 | 60 | 120 | 2
Party Scene | PART | 832 × 480 | 50 | 120 | 2.4
BQ Square | BQSQ | 416 × 240 | 60 | 120 | 2
Table 4. Number of 128 × 128 CTUs for each video sequence resolution.

Resolution | Horizontal CTUs (exact/used) | Vertical CTUs (exact/used) | Number of CTUs
1920 × 1080 | 15 / 15 | 8.4 / 9 | 135
1280 × 720 | 10 / 10 | 5.6 / 6 | 60
832 × 480 | 6.5 / 7 | 3.8 / 4 | 28
416 × 240 | 3.25 / 4 | 1.9 / 2 | 8
Table 5. Number of CTUs per slice.

Number of slices (and threads)
N. CTUs | 2 | 4 | 5 | 7 | 8 | 10 | 11 | 12

Slice size (in CTUs)
135 | 67.5 | 33.8 | 27.0 | 19.3 | 16.9 | 13.5 | 12.3 | 11.3
60 | 30.0 | 15.0 | 12.0 | 8.6 | 7.5 | 6.0 | 5.5 | 5.0
28 | 14.0 | 7.0 | 5.6 | 4.0 | 3.5 | 2.8 | 2.5 | 2.3
8 | 4.0 | 2.0 | 1.6 | 1.1 | 1.0 | 0.8 | 0.7 | 0.7

Slice size (in CTUs) rounded up
135 | 68 | 34 | 27 | 20 | 17 | 14 | 13 | 12
60 | 30 | 15 | 12 | 9 | 8 | 6 | 6 | 5
28 | 14 | 7 | 6 | 4 | 4 | 3 | 3 | 3
8 | 4 | 2 | 2 | 2 | 1 | 1 | 1 | 1
Table 6. Difference in size of the last slice (in CTUs).

Number of slices (and threads)
Number of CTUs | 2 | 4 | 5 | 7 | 8 | 10 | 11 | 12
135 | −1 | −1 | 0 | −5 | −1 | −5 | −8 | −9
60 | 0 | 0 | 0 | −3 | −4 | 0 | −6 | 0
28 | 0 | 0 | −2 | 0 | −4 | −2 | −5 | −8
8 | 0 | 0 | −2 | −6 | 0 | −2 | −3 | −4
Table 7. Computational cost per slice for the Park Scene video sequence.

Slice size 68: QP | (0–68) | (68–135)
22 | 53% | 47%
27 | 53% | 47%
32 | 53% | 47%
37 | 53% | 47%

Slice size 45: QP | (0–45) | (45–90) | (90–135)
22 | 37% | 36% | 28%
27 | 37% | 36% | 28%
32 | 37% | 35% | 27%
37 | 37% | 36% | 27%

Slice size 34: QP | (0–34) | (34–68) | (68–102) | (102–135)
22 | 27% | 27% | 29% | 18%
27 | 27% | 26% | 29% | 17%
32 | 27% | 26% | 30% | 16%
37 | 28% | 25% | 31% | 16%

Slice size 27: QP | (0–27) | (27–54) | (54–81) | (81–108) | (108–135)
22 | 22% | 21% | 22% | 23% | 13%
27 | 22% | 21% | 22% | 23% | 12%
32 | 22% | 21% | 22% | 24% | 11%
37 | 23% | 19% | 23% | 26% | 10%

Slice size 23: QP | (0–23) | (23–46) | (46–69) | (69–92) | (92–115) | (115–135)
22 | 18% | 20% | 17% | 19% | 18% | 7%
27 | 17% | 21% | 17% | 20% | 19% | 6%
32 | 18% | 21% | 16% | 21% | 19% | 6%
37 | 18% | 20% | 16% | 22% | 19% | 5%

Slice size 20: QP | (0–20) | (20–40) | (40–60) | (60–80) | (80–100) | (100–120) | (120–135)
22 | 15% | 17% | 17% | 15% | 17% | 16% | 4%
27 | 15% | 17% | 17% | 15% | 17% | 15% | 4%
32 | 15% | 17% | 17% | 15% | 18% | 15% | 3%
37 | 16% | 17% | 16% | 16% | 18% | 14% | 3%

Slice size 17: QP | (0–17) | (17–34) | (34–51) | (51–68) | (68–85) | (85–102) | (102–119) | (119–135)
22 | 14% | 14% | 13% | 14% | 14% | 15% | 13% | 4%
27 | 13% | 14% | 13% | 14% | 14% | 16% | 13% | 4%
32 | 14% | 14% | 12% | 14% | 14% | 16% | 12% | 4%
37 | 14% | 14% | 12% | 13% | 14% | 17% | 12% | 3%

Slice size 15: QP | (0–15) | (15–30) | (30–45) | (45–60) | (60–75) | (75–90) | (90–105) | (105–120) | (120–135)
22 | 12% | 13% | 12% | 12% | 11% | 12% | 13% | 11% | 4%
27 | 12% | 13% | 12% | 12% | 11% | 12% | 13% | 11% | 4%
32 | 12% | 13% | 12% | 12% | 11% | 13% | 13% | 10% | 3%
37 | 12% | 14% | 12% | 11% | 11% | 13% | 14% | 10% | 3%
Table 8. Computational cost per slice for the Four People video sequence.

Slice size 30: QP | (0–30) | (30–60)
22 | 56% | 44%
27 | 56% | 44%
32 | 56% | 44%
37 | 57% | 43%

Slice size 20: QP | (0–20) | (20–40) | (40–60)
22 | 31% | 52% | 17%
27 | 30% | 54% | 15%
32 | 31% | 54% | 15%
37 | 32% | 54% | 14%

Slice size 15: QP | (0–15) | (15–30) | (30–45) | (45–60)
22 | 25% | 31% | 32% | 11%
27 | 25% | 31% | 34% | 10%
32 | 26% | 31% | 34% | 9%
37 | 27% | 31% | 33% | 9%

Slice size 12: QP | (0–12) | (12–24) | (24–36) | (36–48) | (48–60)
22 | 19% | 24% | 29% | 19% | 8%
27 | 18% | 25% | 30% | 20% | 8%
32 | 18% | 26% | 29% | 19% | 7%
37 | 19% | 26% | 29% | 19% | 7%

Slice size 10: QP | (0–10) | (10–20) | (20–30) | (30–40) | (40–50) | (50–60)
22 | 14% | 18% | 24% | 27% | 11% | 5%
27 | 13% | 18% | 25% | 28% | 10% | 5%
32 | 13% | 18% | 26% | 29% | 10% | 5%
37 | 14% | 18% | 25% | 28% | 10% | 5%

Slice size 9: QP | (0–9) | (9–18) | (18–27) | (27–36) | (36–45) | (45–54) | (54–60)
22 | 12% | 16% | 22% | 22% | 17% | 8% | 3%
27 | 12% | 16% | 23% | 22% | 17% | 7% | 3%
32 | 12% | 16% | 23% | 22% | 17% | 7% | 3%
37 | 13% | 16% | 23% | 22% | 17% | 7% | 2%

Slice size 8: QP | (0–8) | (8–16) | (16–24) | (24–32) | (32–40) | (40–48) | (48–56) | (56–60)
22 | 11% | 15% | 16% | 19% | 22% | 8% | 6% | 2%
27 | 11% | 15% | 17% | 19% | 23% | 8% | 6% | 2%
32 | 12% | 15% | 17% | 18% | 23% | 7% | 5% | 2%
Table 9. Computational cost per slice for the Party Scene video sequence.

Slice size 14: QP | (0–14) | (14–28)
22 | 52% | 48%
27 | 51% | 49%
32 | 51% | 49%
37 | 51% | 49%

Slice size 10: QP | (0–10) | (10–20) | (20–28)
22 | 36% | 40% | 24%
27 | 35% | 41% | 23%
32 | 35% | 42% | 22%
37 | 35% | 43% | 22%

Slice size 7: QP | (0–7) | (7–14) | (14–21) | (21–28)
22 | 25% | 27% | 27% | 21%
27 | 24% | 27% | 28% | 21%
32 | 24% | 27% | 29% | 19%
37 | 25% | 26% | 30% | 19%

Slice size 6: QP | (0–6) | (6–12) | (12–18) | (18–24) | (24–28)
22 | 22% | 22% | 23% | 20% | 12%
27 | 22% | 22% | 24% | 20% | 12%
32 | 21% | 21% | 25% | 20% | 12%
37 | 21% | 21% | 25% | 20% | 13%

Slice size 5: QP | (0–5) | (5–10) | (10–15) | (15–20) | (20–25) | (25–28)
22 | 18% | 19% | 19% | 21% | 15% | 8%
27 | 17% | 19% | 19% | 23% | 14% | 8%
32 | 17% | 19% | 19% | 24% | 13% | 9%
37 | 17% | 19% | 18% | 24% | 13% | 9%

Slice size 4: QP | (0–4) | (4–8) | (8–12) | (12–16) | (16–20) | (20–24) | (24–28)
22 | 15% | 14% | 16% | 15% | 18% | 11% | 12%
27 | 14% | 14% | 15% | 15% | 19% | 11% | 12%
32 | 14% | 15% | 14% | 15% | 20% | 10% | 12%
37 | 14% | 15% | 13% | 15% | 21% | 9% | 12%
Table 10. Computational cost per slice for the BQ Square video sequence.

Slice size 4: QP | (0–4) | (4–8)
22 | 56% | 44%
27 | 57% | 43%
32 | 58% | 42%
37 | 61% | 39%

Slice size 3: QP | (0–3) | (3–6) | (6–8)
22 | 51% | 31% | 18%
27 | 51% | 31% | 18%
32 | 53% | 29% | 18%
37 | 56% | 27% | 17%

Slice size 2: QP | (0–2) | (2–4) | (4–6) | (6–8)
22 | 35% | 21% | 26% | 18%
27 | 36% | 20% | 25% | 18%
32 | 38% | 20% | 24% | 18%
37 | 40% | 21% | 22% | 17%

Slice size 1: QP | (0–1) | (1–2) | (2–3) | (3–4) | (4–5) | (5–6) | (6–7) | (7–8)
22 | 18% | 18% | 16% | 5% | 13% | 13% | 14% | 4%
27 | 20% | 17% | 15% | 5% | 13% | 12% | 14% | 4%
32 | 22% | 17% | 15% | 5% | 12% | 12% | 14% | 4%
37 | 22% | 17% | 16% | 5% | 11% | 11% | 13% | 4%
Table 11. Sequential computational times.

Video seq. | Resolution | Time (s): QP 22 / 27 / 32 / 37 | Time per frame (s): QP 22 / 27 / 32 / 37
PARK | 1920 × 1080 | 163,395 / 114,688 / 78,010 / 50,474 | 1362 / 956 / 650 / 421
FOUR | 1280 × 720 | 46,805 / 33,763 / 24,928 / 18,272 | 390 / 281 / 208 / 152
PART | 832 × 480 | 43,844 / 36,796 / 29,754 / 22,121 | 365 / 307 / 248 / 184
BQSQ | 416 × 240 | 9896 / 8099 / 6265 / 5010 | 82 / 67 / 52 / 42
Table 12. Parallel efficiency for Algorithm JEM-SP-Sync.

Video sequence | NoT | QP 22 | QP 27 | QP 32 | QP 37
BQSQ | 2 | 85.0% | 85.2% | 82.6% | 79.1%
BQSQ | 3 | 61.5% | 61.3% | 59.4% | 57.2%
BQSQ | 4 | 67.6% | 65.2% | 62.5% | 60.3%
BQSQ | 8 | 61.7% | 56.7% | 52.2% | 51.1%
PART | 2 | 94.1% | 96.1% | 96.9% | 94.1%
PART | 3 | 77.3% | 76.3% | 76.0% | 75.5%
PART | 4 | 86.1% | 83.1% | 82.4% | 78.9%
PART | 5 | 79.6% | 78.7% | 75.6% | 74.2%
PART | 6 | 73.1% | 69.0% | 66.0% | 64.6%
PART | 7 | 74.2% | 70.9% | 67.8% | 64.7%
PART | 8 | 65.4% | 61.8% | 59.2% | 56.8%
PART | 10 | 72.5% | 70.4% | 68.4% | 67.1%
FOUR | 2 | 88.4% | 88.1% | 86.3% | 85.3%
FOUR | 3 | 62.2% | 60.1% | 57.9% | 60.1%
FOUR | 4 | 74.5% | 70.9% | 70.7% | 71.5%
FOUR | 5 | 65.0% | 62.9% | 63.8% | 65.3%
FOUR | 6 | 59.6% | 56.1% | 55.1% | 55.5%
FOUR | 7 | 61.0% | 57.8% | 57.8% | 58.3%
FOUR | 8 | 55.5% | 50.2% | 50.1% | 51.2%
FOUR | 9 | 59.6% | 56.8% | 57.0% | 57.8%
FOUR | 10 | 59.5% | 54.5% | 54.7% | 56.3%
FOUR | 11 | 53.8% | 50.1% | 49.9% | 50.8%
FOUR | 12 | 55.7% | 51.7% | 49.5% | 49.3%
PARK | 2 | 91.7% | 92.9% | 92.1% | 93.9%
PARK | 3 | 87.9% | 89.0% | 89.8% | 91.0%
PARK | 4 | 88.6% | 88.3% | 84.3% | 81.6%
PARK | 5 | 89.8% | 87.0% | 84.3% | 80.1%
PARK | 6 | 82.1% | 80.0% | 78.8% | 78.6%
PARK | 7 | 83.6% | 82.4% | 78.0% | 78.1%
PARK | 8 | 84.5% | 80.8% | 77.0% | 73.1%
PARK | 9 | 86.3% | 84.1% | 79.3% | 78.2%
PARK | 10 | 81.9% | 77.9% | 73.3% | 69.4%
PARK | 11 | 78.5% | 75.5% | 72.6% | 70.2%
PARK | 12 | 77.0% | 73.2% | 66.9% | 63.7%
Table 13. Computational times (s) for Algorithm JEM-SP-Sync.

Video sequence | NoT | QP 22 | QP 27 | QP 32 | QP 37
BQSQ | 2 | 5882 | 4751 | 3857 | 3215
BQSQ | 3 | 5415 | 4402 | 3577 | 2964
BQSQ | 4 | 3694 | 3694 | 3694 | 3694
BQSQ | 8 | 2026 | 2026 | 2026 | 2026
PART | 2 | 23,305 | 19,153 | 15,346 | 11,753
PART | 3 | 18,911 | 16,069 | 13,053 | 9763
PART | 4 | 12,733 | 11,070 | 9030 | 7007
PART | 5 | 11,012 | 9347 | 7867 | 5964
PART | 6 | 9991 | 8887 | 7518 | 5703
PART | 7 | 8445 | 7412 | 6266 | 4882
PART | 8 | 8384 | 7444 | 6287 | 4864
PART | 10 | 6051 | 5224 | 4349 | 3295
FOUR | 2 | 26,479 | 19,172 | 14,441 | 10,709
FOUR | 3 | 25,103 | 18,716 | 14,362 | 10,141
FOUR | 4 | 15,705 | 11,898 | 8809 | 6385
FOUR | 5 | 14,407 | 10,732 | 7818 | 5595
FOUR | 6 | 13,085 | 10,034 | 7547 | 5484
FOUR | 7 | 10,960 | 8348 | 6158 | 4476
FOUR | 8 | 10,545 | 8410 | 6218 | 4464
FOUR | 9 | 8727 | 6606 | 4860 | 3512
FOUR | 10 | 7861 | 6200 | 4558 | 3244
FOUR | 11 | 7909 | 6128 | 4546 | 3269
FOUR | 12 | 7007 | 5440 | 4194 | 3089
PARK | 2 | 95,117 | 66,946 | 45,846 | 29,237
PARK | 3 | 66,202 | 46,565 | 31,348 | 20,128
PARK | 4 | 49,218 | 35,234 | 25,017 | 16,821
PARK | 5 | 38,857 | 28,583 | 20,025 | 13,708
PARK | 6 | 35,411 | 25,915 | 17,845 | 11,654
PARK | 7 | 29,830 | 21,576 | 15,453 | 10,048
PARK | 8 | 25,811 | 19,243 | 13,695 | 9396
PARK | 9 | 22,458 | 16,426 | 11,831 | 7804
PARK | 10 | 21,319 | 15,965 | 11,511 | 7916
PARK | 11 | 20,207 | 14,978 | 10,567 | 7110
PARK | 12 | 18,894 | 14,159 | 10,507 | 7188
Table 14. Parallel efficiency for Algorithm JEM-SP-ASync.

Video sequence | NoT | QP 22 | QP 27 | QP 32 | QP 37
BQSQ | 2 | 93.1% | 96.1% | 95.0% | 96.7%
BQSQ | 3 | 93.7% | 92.6% | 94.6% | 95.2%
BQSQ | 4 | 92.7% | 92.7% | 94.8% | 95.7%
BQSQ | 8 | 86.2% | 88.3% | 88.8% | 88.9%
PART | 2 | 98.3% | 98.6% | 99.4% | 99.0%
PART | 3 | 95.3% | 96.3% | 97.1% | 97.9%
PART | 4 | 94.4% | 95.1% | 96.7% | 96.4%
PART | 5 | 93.3% | 95.2% | 94.6% | 93.2%
PART | 6 | 92.6% | 93.4% | 93.0% | 92.2%
PART | 7 | 92.0% | 91.2% | 94.2% | 93.3%
PART | 8 | 91.3% | 91.3% | 92.1% | 92.5%
PART | 10 | 91.9% | 90.0% | 93.7% | 90.8%
FOUR | 2 | 97.9% | 98.9% | 97.2% | 98.8%
FOUR | 3 | 95.2% | 94.5% | 94.3% | 95.2%
FOUR | 4 | 95.0% | 92.3% | 94.1% | 94.9%
FOUR | 5 | 97.8% | 93.9% | 94.7% | 94.6%
FOUR | 6 | 93.9% | 93.1% | 92.7% | 91.6%
FOUR | 7 | 95.3% | 93.8% | 93.3% | 93.5%
FOUR | 8 | 94.1% | 92.2% | 92.6% | 91.0%
FOUR | 9 | 95.0% | 93.8% | 92.9% | 93.0%
FOUR | 10 | 92.6% | 91.2% | 91.8% | 90.0%
FOUR | 12 | 89.9% | 89.5% | 87.6% | 88.4%
PARK | 2 | 98.5% | 98.2% | 97.4% | 98.1%
PARK | 3 | 95.4% | 95.1% | 93.1% | 94.7%
PARK | 4 | 94.0% | 94.2% | 90.5% | 94.6%
PARK | 5 | 94.2% | 95.7% | 92.9% | 93.1%
PARK | 6 | 94.7% | 93.6% | 89.9% | 93.4%
PARK | 7 | 93.6% | 91.2% | 93.2% | 94.2%
PARK | 8 | 93.7% | 93.4% | 91.8% | 92.7%
PARK | 9 | 94.2% | 92.6% | 91.2% | 92.9%
PARK | 10 | 93.3% | 92.4% | 91.8% | 89.1%
PARK | 11 | 92.4% | 92.2% | 87.9% | 89.9%
PARK | 12 | 89.5% | 89.4% | 88.5% | 88.5%
Table 15. Computational times (s) for Algorithm JEM-SP-ASync.

Video sequence | NoT | QP 22 | QP 27 | QP 32 | QP 37
BQSQ | 2 | 5368 | 4215 | 3354 | 2629
BQSQ | 3 | 3556 | 2915 | 2244 | 1781
BQSQ | 4 | 2695 | 2695 | 2695 | 2695
BQSQ | 8 | 1449 | 1449 | 1449 | 1449
PART | 2 | 22,302 | 18,668 | 14,968 | 11,172
PART | 3 | 15,343 | 12,730 | 10,216 | 7532
PART | 4 | 11,613 | 9677 | 7693 | 5737
PART | 5 | 9403 | 7729 | 6288 | 4748
PART | 6 | 7891 | 6568 | 5332 | 3997
PART | 7 | 6807 | 5761 | 4513 | 3386
PART | 8 | 6000 | 5039 | 4040 | 2990
PART | 10 | 4769 | 4087 | 3176 | 2435
FOUR | 2 | 23,893 | 17,072 | 12,822 | 9247
FOUR | 3 | 16,392 | 11,909 | 8811 | 6401
FOUR | 4 | 12,311 | 9143 | 6624 | 4811
FOUR | 5 | 9569 | 7195 | 5263 | 3862
FOUR | 6 | 8306 | 6046 | 4482 | 3325
FOUR | 7 | 7019 | 5140 | 3818 | 2791
FOUR | 8 | 6220 | 4578 | 3367 | 2509
FOUR | 9 | 5473 | 4000 | 2983 | 2182
FOUR | 10 | 5057 | 3703 | 2714 | 2029
FOUR | 12 | 4341 | 3144 | 2370 | 1723
PARK | 2 | 82,978 | 58,401 | 40,028 | 25,722
PARK | 3 | 57,119 | 40,202 | 27,935 | 17,760
PARK | 4 | 43,445 | 30,449 | 21,543 | 13,332
PARK | 5 | 34,683 | 23,956 | 16,803 | 10,840
PARK | 6 | 28,754 | 20,416 | 14,461 | 9007
PARK | 7 | 24,939 | 17,967 | 11,961 | 7656
PARK | 8 | 21,809 | 15,356 | 10,624 | 6805
PARK | 9 | 19,275 | 13,767 | 9508 | 6034
PARK | 10 | 17,512 | 12,408 | 8499 | 5666
PARK | 11 | 16,084 | 11,304 | 8064 | 5105
PARK | 12 | 15,207 | 10,695 | 7347 | 4750
Table 16. Comparison of bitrate.

Bitrate (Kbps)
Video sequence | NoT | Slice size | | QP 22 | QP 27 | QP 32 | QP 37
BQSQ | 2 | 4 | Seq. | 13,024 | 8652 | 5578 | 3474
 | | | Par. | 12,579 | 8361 | 5356 | 3370
 | | | Diff. | −3.4% | −3.4% | −4.0% | −3.0%
BQSQ | 4 | 2 | Seq. | 13,366 | 8855 | 5736 | 3563
 | | | Par. | 12,640 | 8410 | 5393 | 3398
 | | | Diff. | −5.4% | −5.0% | −6.0% | −4.6%
BQSQ | 8 | 1 | Seq. | 13,508 | 9113 | 5898 | 3644
 | | | Par. | 12,826 | 8548 | 5495 | 3483
 | | | Diff. | −5.0% | −6.2% | −6.8% | −4.4%
PART | 2 | 14 | Seq. | 45,454 | 28,760 | 17,506 | 9660
 | | | Par. | 45,291 | 28,643 | 17,362 | 9675
 | | | Diff. | −0.4% | −0.4% | −0.8% | 0.2%
PART | 4 | 7 | Seq. | 46,018 | 29,001 | 17,678 | 9736
 | | | Par. | 45,499 | 28,794 | 17,468 | 9750
 | | | Diff. | −1.1% | −0.7% | −1.2% | 0.1%
PART | 8 | 4 | Seq. | 46,559 | 29,558 | 17,833 | 9810
 | | | Par. | 45,689 | 28,919 | 17,552 | 9814
 | | | Diff. | −1.9% | −2.2% | −1.6% | 0.0%
FOUR | 2 | 30 | Seq. | 27,225 | 16,817 | 10,498 | 6513
 | | | Par. | 26,866 | 16,631 | 10,398 | 6464
 | | | Diff. | −1.3% | −1.1% | −1.0% | −0.8%
FOUR | 4 | 15 | Seq. | 27,717 | 17,160 | 10,738 | 6671
 | | | Par. | 27,182 | 16,871 | 10,575 | 6595
 | | | Diff. | −1.9% | −1.7% | −1.5% | −1.1%
FOUR | 8 | 8 | Seq. | 28,180 | 17,397 | 10,960 | 6831
 | | | Par. | 27,506 | 17,106 | 10,746 | 6719
 | | | Diff. | −2.4% | −1.7% | −1.9% | −1.6%
PARK | 2 | 68 | Seq. | 49,209 | 26,911 | 14,227 | 7118
 | | | Par. | 49,084 | 26,873 | 14,222 | 7132
 | | | Diff. | −0.3% | −0.1% | 0.0% | 0.2%
PARK | 4 | 34 | Seq. | 49,288 | 26,987 | 14,284 | 7154
 | | | Par. | 49,218 | 26,983 | 14,301 | 7182
 | | | Diff. | −0.1% | 0.0% | 0.1% | 0.4%
PARK | 8 | 17 | Seq. | 49,452 | 27,135 | 14,395 | 7232
 | | | Par. | 49,468 | 27,166 | 14,436 | 7271
 | | | Diff. | 0.0% | 0.1% | 0.3% | 0.5%
Table 17. Comparison of PSNR.

PSNR (dB)
Video sequence | NoT | Slice size | | QP 22 | QP 27 | QP 32 | QP 37
BQSQ | 2 | 4 | Seq. | 42.7616 | 38.6735 | 34.9935 | 31.4817
 | | | Par. | 42.9435 | 38.8538 | 35.2465 | 31.7929
 | | | Diff. | 0.4% | 0.5% | 0.7% | 1.0%
BQSQ | 4 | 2 | Seq. | 42.6559 | 38.5985 | 34.9282 | 31.3503
 | | | Par. | 42.9435 | 38.8538 | 35.2465 | 31.7929
 | | | Diff. | 0.7% | 0.7% | 0.9% | 1.4%
BQSQ | 8 | 1 | Seq. | 42.6338 | 38.4881 | 34.7934 | 31.2209
 | | | Par. | 42.9312 | 38.8410 | 35.2348 | 31.7715
 | | | Diff. | 0.7% | 0.9% | 1.3% | 1.8%
PART | 2 | 14 | Seq. | 42.1050 | 38.0071 | 34.2767 | 30.7610
 | | | Par. | 42.1459 | 38.0528 | 34.4646 | 30.9932
 | | | Diff. | 0.1% | 0.1% | 0.5% | 0.8%
PART | 4 | 7 | Seq. | 42.0037 | 37.9629 | 34.188 | 30.6701
 | | | Par. | 42.1434 | 38.0467 | 34.455 | 30.9788
 | | | Diff. | 0.3% | 0.2% | 0.8% | 1.0%
PART | 8 | 4 | Seq. | 41.8932 | 37.7962 | 34.1187 | 30.5650
 | | | Par. | 42.1409 | 38.0440 | 34.4525 | 30.9748
 | | | Diff. | 0.6% | 0.7% | 1.0% | 1.3%
FOUR | 2 | 30 | Seq. | 45.1104 | 42.6808 | 39.9899 | 37.1100
 | | | Par. | 45.1828 | 42.7848 | 40.1469 | 37.3029
 | | | Diff. | 0.2% | 0.2% | 0.4% | 0.5%
FOUR | 4 | 15 | Seq. | 45.0631 | 42.6244 | 39.9007 | 36.9927
 | | | Par. | 45.1802 | 42.7800 | 40.1387 | 37.2943
 | | | Diff. | 0.3% | 0.4% | 0.6% | 0.8%
FOUR | 8 | 8 | Seq. | 45.0372 | 42.6250 | 39.8448 | 36.9191
 | | | Par. | 45.1785 | 42.7756 | 40.1334 | 37.2841
 | | | Diff. | 0.3% | 0.4% | 0.7% | 1.0%
PARK | 2 | 68 | Seq. | 42.4621 | 39.5641 | 36.8565 | 34.2881
 | | | Par. | 42.5295 | 39.6277 | 36.9372 | 34.3736
 | | | Diff. | 0.2% | 0.2% | 0.2% | 0.2%
PARK | 4 | 34 | Seq. | 42.4194 | 39.5221 | 36.8147 | 34.2439
 | | | Par. | 42.5264 | 39.6253 | 36.9344 | 34.3672
 | | | Diff. | 0.3% | 0.3% | 0.3% | 0.4%
PARK | 8 | 17 | Seq. | 42.3886 | 39.4936 | 36.7939 | 34.2173
 | | | Par. | 42.5215 | 39.6188 | 36.9288 | 34.3554
 | | | Diff. | 0.3% | 0.3% | 0.4% | 0.4%
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
