Article

Single-Image Super-Resolution Neural Network via Hybrid Multi-Scale Features

1 Shenzhen Key Laboratory of Virtual Reality and Human Interaction Technology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
2 Department of Computer Science and Engineering, Chinese University of Hong Kong, Hong Kong, China
3 School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
* Author to whom correspondence should be addressed.
Submission received: 24 January 2022 / Revised: 12 February 2022 / Accepted: 16 February 2022 / Published: 19 February 2022
(This article belongs to the Topic Machine and Deep Learning)

Abstract

In this paper, we propose HMSF, an end-to-end single-image super-resolution neural network that leverages hybrid multi-scale features of images. Unlike most existing convolutional neural network (CNN) based solutions, our network builds on the observation that the features extracted by a CNN contain hybrid multi-scale information: multi-scale local texture features as well as global structural features. By effectively exploiting these local and global multi-scale features, our network requires far fewer parameters, leading to a large decrease in memory usage and computation during inference. The network benefits from three key modules: (1) an efficient and lightweight feature extraction module (EFblock); (2) a hybrid multi-scale feature enhancement module (HMblock); and (3) a reconstruction–restoration module (DRblock). Experiments on five popular benchmarks demonstrate that HMSF achieves better performance than more than 20 state-of-the-art methods while using fewer parameters, less memory, and less execution time. This makes the method practical and well suited to constrained devices, such as PCs and mobile devices, without the need for a high-performance server.

1. Introduction

Single-image super-resolution (SISR) seeks to reconstruct a high-resolution (HR) image, with its high-frequency details restored, from its low-resolution (LR) counterpart [1]. SISR has many practical applications, such as video monitoring, remote sensing, video coding and medical imaging. On the one hand, SISR reduces the cost of obtaining high-resolution images, allowing researchers to acquire HR images using personal computers instead of sophisticated and expensive optical imaging equipment. On the other hand, SISR reduces the cost of information transmission, i.e., high-resolution images can be obtained by decoding transmitted low-resolution image information using SISR. Many efforts have been made to deal with this challenging and ill-posed problem, ill-posed because the high-resolution version of a given low-resolution image is unknown.
Many traditional methods [2,3,4] have been proposed to obtain high-resolution (HR) images from their low-resolution (LR) versions by establishing a mapping relationship between LR and HR images. These methods are fast, lightweight and effective, which makes them preferable as basic tools in SISR tasks [5]. However, they share an inherent problem: tedious parameter adjustment. Obtaining the desired results relies on continually tweaking parameters to accommodate various inputs, which adversely affects both efficiency and the user experience.
In recent years, considerable effort has gone into CNN-based SISR methods, such as EDSR [6], DRRN [7], LapSRN [8], MemNet [9], CARN [10], IDN [11], MADNet [12], MRFN [13], and DRFN [14]. Pioneering methods [15,16], despite using only a few convolution layers, validated that CNNs outperform many traditional SISR methods. Subsequent efforts have focused mainly on improving the network structure by increasing its depth and width, which can yield better performance by using more trainable parameters to establish a more stable mapping between LR and HR images. Extensive studies (e.g., [17]) with very deep neural networks [13] have shown that a deeper network [18] performs better than a shallower one [15,16]. However, deep neural networks have many parameters to optimize, and their training consumes massive amounts of data to avoid overfitting. A common drawback is that such efforts have considered only deepening the network while not sufficiently leveraging the various extracted features.
Reducing the number of both parameters and calculations is important in actual application scenarios. For example, it is difficult to execute a model on mobile devices if it requires too much memory, and slow execution further degrades the user experience. To achieve better performance with a lightweight model, we define the hybrid multi-scale features in SISR as local multi-scale features and global multi-scale features, covering texture features of varying coarseness as well as structural features, from detail to structure. We use the feature map obtained by interpolation as the basic frame of SISR. After features are extracted by the efficient, lightweight residual feature extraction module, they are enhanced and fused by the local and global multi-scale enhancement modules and the reconstruction module, and the Charbonnier loss function is used for training. Finally, the mapping relationship between the LR and HR images is established, and the result is the SR image. We focus on obtaining good enhancement effects with few parameters. Unlike other networks that use only convolutional layers to extract features, ours uses an efficient and lightweight feature extraction module, EFblock, to extract features from LR images, then feeds them into the hybrid multi-scale enhancement module, HMblock, which includes the local multi-scale feature extraction module, RF, as well as a global feature enhancement module. After the hybrid multi-scale features are extracted and enhanced, the reconstruction and fusion module, DRblock, restores the features onto the up-sampled LR image to obtain the SR image. In terms of hybrid multi-scale features, a deeper convolutional module is sufficient to extract the structural features of the image, while a shallower convolutional module extracts the texture features; residual connections compensate for the loss of global characteristics during feature extraction. In terms of local multi-scale features, convolution kernels with different receptive field sizes extract texture features of different coarseness. Therefore, we study how network depth and receptive field size complement each other in capturing structural and texture features. In summary, the main contributions of this article are as follows:
  • We propose a novel efficient and lightweight feature extraction module, called EFblock. It uses both grouped convolution and point convolution, and introduces both a global and local residual connection to improve the integrity of the feature extraction. Although EFblock has fewer parameters and less computation, it exhibits better performance than existing CNN-based methods.
  • We formulate the concept of hybrid multi-scale features, qualitatively dividing multi-scale features into local multi-scale and global multi-scale features; unlike other multi-scale based solutions, ours uses only shallow local multi-scale features. This shallow local multi-scale extraction block mainly extracts texture features, while the deep convolutional layers extract skeleton and structural features; together, they constitute hybrid multi-scale features.
  • Existing multi-scale based methods have a large number of parameters and are not flexible enough: they focus only on regular receptive areas and cannot fully extract multi-scale information. To solve this problem, we propose a bottleneck stack structure, RF, to extract local multi-scale texture features. Different from schemes that only use convolution kernels of different sizes [19] or stacked 3 × 3 convolution kernels [13], we flexibly utilize dilated convolution [20], obtaining receptive fields of different sizes by controlling the dilation rate, and add deformable convolution [21] to further yield irregular receptive areas. Thanks to the bottleneck structure, the RF module not only yields multi-scale receptive fields, but also greatly reduces the number of parameters and the computational burden.

2. Related Works

In the early days, the task of SISR was defined as a mathematical mapping problem [2,3,4]. Some methods used regression to restore the image [4], some used random forests to tackle the problem [2], and others [22] used decision-making theory [23,24] to restore LR images. Recently, deep learning has been successful in many computer vision tasks, including SISR. Driven by the needs of practical applications, lightweight models are currently the focus of attention. Here, we summarize SISR methods based on deep learning, focusing on lightweight SISR methods.

2.1. Deep CNN-Based SISR

Like other computer vision tasks, SISR has made significant progress through deep convolutional neural networks. Dong et al. first proposed SRCNN [15], based on a shallow CNN. That method upsamples images through bicubic interpolation and builds the network from three convolutional layers responsible for patch extraction and representation, nonlinear mapping, and image reconstruction. Later, that team proposed FSRCNN [16], while Shi et al. proposed ESPCN [25]. Meanwhile, Lai et al. proposed a Laplacian pyramid super-resolution network [8], which takes low-resolution images as input and gradually reconstructs the sub-band residuals of high-resolution images. Tai et al. proposed a very deep persistent memory network (MemNet) [9]. Tian et al. proposed a coarse-to-fine CNN method [26] that, from the perspective of low-frequency and high-frequency features, adds heterogeneous convolutions and refinement blocks to extract and process high-frequency and low-frequency features separately. Wei et al. [27] used cascading dense connections to extract features of different fineness from convolutional layers at different depths. Jin et al. adopted a framework [28] that flexibly adjusts the architecture of the network to adapt to different kinds of images. DRCN [29] used a deeply recursive convolutional network to improve performance without introducing new parameters for additional convolutions. DRRN [7] improved DRCN by using residual networks. Lim et al. proposed an enhanced deep residual network (EDSR) [6]. Liu et al. [30] proposed an improved version of U-Net based on a multi-level wavelet. Li et al. [31] proposed exploiting self-attention and facial semantics to obtain super-resolution face images. Most SISR studies achieved better performance by deepening the network or adding residual connections. However, the increased depth makes these methods difficult to train, and the additional parameters not only cause excessive memory consumption during inference but also slow down execution. Therefore, we introduce a lightweight and efficient SISR model.
In terms of lightweight models, Hui et al. proposed IDN [11], which uses knowledge distillation to extract the features of each layer of the network and learn the complementary relationships among them, thereby reducing parameters. CARN [10] used a lightweight cascaded residual network, in which cascading mechanisms at the local and global levels integrate features from layers at different scales to receive more information. However, that method still involves 1.5 M parameters and consumes too much memory. Ahn et al. [32] proposed a lightweight residual network that uses grouped convolution to reduce the number of parameters, as well as weight classification to enhance the super-resolution effect. Yao et al. proposed GLADSR [33] with dense connections. Tian et al. proposed LESRCNN [34], using dense cross-layer connections and advanced sub-pixel convolution to reconstruct images. Lan et al. proposed MADNet [12], which contains many kinds of networks. He et al. [13] introduced a multi-scale residual network.
Existing lightweight SISR methods can compress the number of parameters and calculations, but doing so results in loss of performance. In contrast, our method can achieve better super-resolution performance despite a small number of parameters and reduced memory consumption.

2.2. Lightweight Neural Networks

Many recent super-resolution methods have focused on the lightweight nature of neural networks, and we do as well. Many lightweight network structures have been proposed, including dense networks [10,34], which use dense or residual connections to fully reuse features. These methods are an efficient improvement for deep neural networks, but are inadequate for lightweight networks; therefore, we pay more attention to efficient lightweight network backbones. In subsequent works, researchers proposed several derivative versions that introduce cross-layer connections within the network, reusing features to achieve better performance. Iandola et al. proposed SqueezeNet [35], using a squeeze layer and a convolution layer with a kernel size of 1 × 1 to convolve the feature map of the previous layer, thereby reducing its dimensionality. ShuffleNet V1 [36] and V2 [37] flexibly used pointwise grouped convolution and channel shuffle to achieve efficient classification on ImageNet [38]. MobileNet [39] constructed an effective network by applying the depthwise separable convolution introduced by Sifre et al. MobileNet-V2 [40] also made use of methods such as grouped convolution and point convolution, and introduced an attention mechanism. The design of the MobileNet-V3 [41] network utilized the NAS (network architecture search [42]) algorithm to search for a very efficient network structure. In contrast, the EFblock that we propose uses global and local residual connections, depthwise separable convolution, grouped convolution and point convolution. Our method comprehensively considers the needs of lightweight design and super-resolution, and extracts features efficiently with a small number of parameters.

2.3. Multi-Scale Feature Extraction

Multi-scale feature extraction is widely used in computer vision tasks, such as semantic segmentation, image restoration and image super-resolution. The basic idea is that filters with different convolution kernel sizes extract features of different fineness. Szegedy et al. proposed a multi-scale module [19] called the Inception module. It uses convolution filters with different kernel sizes to extract features in parallel, enabling the network to obtain receptive fields of different sizes and thus extract features of different fineness. In a subsequent version, the authors introduced batch normalization in Inception-V2 [43], which accelerates the training of the network, and in Inception-V3 [44], they added a new optimizer and asymmetric convolution. Recently, multi-scale convolutional layers have been widely applied in tasks such as deblurring and denoising. He et al. [13] introduced a multi-scale residual network that significantly improves image super-resolution performance. However, these methods focus only on local multi-scale features, ignoring the concept of a global scale, so there is room for further improvement in multi-scale network structures. As discussed above, we propose hybrid multi-scale features that, broadly, can be divided into local multi-scale and global multi-scale features: the “local multi-scale” refers to texture features, and the “global multi-scale” refers to structural features. We experimented with this idea; the specific experimental details are introduced later.

3. Methodology

We propose a hybrid multi-scale feature neural network (HMSF). In this section, we first introduce the overall structure of HMSF and analyze the detailed information of each component. Next, we focus on analyzing our hybrid multi-scale enhancement module, HMblock.

3.1. Network Structure

The proposed method consists of three main parts (see Figure 1): an efficient feature extraction module, EFblock; a global and local multi-scale enhancement module, HMblock; and an image reconstruction module, DRBlock.
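To make the data flow concrete, the following PyTorch-style sketch shows how the three modules might be composed. The sub-modules are passed in as arguments so the sketch stays self-contained; EFBlock, HMBlock and DRBlock in the usage comment refer to the illustrative sketches given in Sections 3.1.1–3.1.3 below, and the channel width of 64 is our assumption rather than a value taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HMSF(nn.Module):
    """High-level sketch of the pipeline: feature extraction (EFblock), hybrid
    multi-scale enhancement (HMblock) and reconstruction (DRblock), with a
    bicubically upsampled copy of the LR input used as the reconstruction base."""
    def __init__(self, ef_block, hm_block, dr_block, scale=2):
        super().__init__()
        self.ef, self.hm, self.dr = ef_block, hm_block, dr_block
        self.scale = scale

    def forward(self, lr):
        base = F.interpolate(lr, scale_factor=self.scale,
                             mode='bicubic', align_corners=False)  # interpolated "basic frame"
        feat = self.hm(self.ef(lr))   # extract, then enhance, hybrid multi-scale features
        return self.dr(feat, base)    # restore features onto the upsampled LR image

# Example composition with the block sketches from Sections 3.1.1-3.1.3:
# model = HMSF(EFBlock(3, out_channels=64), HMBlock(64), DRBlock(64, 3, scale=2), scale=2)
```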

3.1.1. EFblock

As shown in Figure 2, many lightweight super-resolution methods use either stacked 3 × 3 convolutional layers or 1 × 1 convolutional layers to extract features. We believe that features extracted in this way are inefficient, because each additional 3 × 3 convolutional layer adds many parameters. In light of that dilemma, and inspired by other efficient networks [35,36,41], we designed a lightweight and efficient feature extraction module, EFblock. Its performance is stronger than that of standard convolution, while its parameter and computation counts are relatively small. We first use a point convolution layer to project the original image to the number of channels that needs to be processed, then use grouped convolution, whose input and output sizes are equal, to extract features. Finally, grouped point convolution promotes the features to the output dimension. This combination of grouped convolution and point convolution effectively reduces parameters and calculations.
As shown in Table 1, we compare our proposed EFblock with the standard convolutional module. With the standard 3 × 3 convolution, there are about 19,000 parameters; with EFblock, the number of parameters is reduced to about 16,000. In terms of computation, we use multi-adds as the evaluation standard: the standard convolution module requires 4.4 G, and EFblock requires 3.7 G, so both parameters and calculations are reduced. Looking at the dimensional changes, EFblock has more activation functions, making feature extraction more nonlinear and giving it stronger representational power. We next consider the global and local residual connections. We believe that global information is lost in the process of extracting image features; therefore, we use a global residual connection to compensate for the loss of information, and use point convolution to match the input dimension with the compensation dimension. Between the grouped convolutions, we add a local residual connection to compensate for the lack of feature correlation between channels caused by too many groups.
The overall expression of EFblock is as follows. $F_i$ is the feature output by the $i$-th module; PGConv represents the grouped point convolution; PConv represents the point convolution; and Gconv represents the grouped convolution. GR and LR represent the global and local residual connections, respectively:
$F_i = \mathrm{PGConv}\left(\mathrm{Gconv}\left(\mathrm{PConv}(F_{i-1})\right) + \mathrm{LR}\right) + \mathrm{GR}$
If the two residuals cannot be directly connected at the required dimension, point convolution is needed to increase or decrease the dimension:
$\mathrm{LR} = \mathrm{PConv}(F_{i-1}), \quad C_{out} = C_{in}^{\mathrm{Gconv}}$
$\mathrm{GR} = \mathrm{PConv}(F_{i-1}), \quad C_{out} = C_{out}^{\mathrm{Gconv}}$
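As an illustration of this structure, a minimal PyTorch sketch of an EFblock-style module follows. The residual placement mirrors the formula above as reconstructed; the group count, activation functions and default channel widths are our assumptions, not values quoted from the paper.

```python
import torch
import torch.nn as nn

class EFBlock(nn.Module):
    """Sketch of the efficient feature-extraction block (EFblock): a point (1x1)
    convolution, a grouped 3x3 convolution and a grouped point convolution, with a
    local residual (LR) added after the grouped convolution and a global residual
    (GR) added at the block output, both projected by point convolutions when the
    channel counts differ."""
    def __init__(self, in_channels=3, mid_channels=64, out_channels=64, groups=4):
        super().__init__()
        self.pconv = nn.Conv2d(in_channels, mid_channels, 1)                   # PConv: raise channels
        self.gconv = nn.Conv2d(mid_channels, mid_channels, 3, padding=1,
                               groups=groups)                                  # Gconv: same in/out size
        self.pgconv = nn.Conv2d(mid_channels, out_channels, 1, groups=groups)  # PGConv: output channels
        self.lr = nn.Conv2d(in_channels, mid_channels, 1)   # LR branch projected to Gconv channels
        self.gr = nn.Conv2d(in_channels, out_channels, 1)   # GR branch projected to output channels
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        f = self.act(self.pconv(x))
        f = self.act(self.gconv(f))
        f = self.pgconv(f + self.lr(x))   # local residual compensates for channel grouping
        return f + self.gr(x)             # global residual compensates for lost global information

# Example: EFBlock()(torch.randn(1, 3, 24, 24)).shape -> torch.Size([1, 64, 24, 24])
```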

3.1.2. HMblock

HMblock is the core part of the method. In this module, we define global and local multi-scale features: local texture features and global structural features. As shown in Figure 3, HMblock is divided into shallow multi-scale texture feature extraction blocks and deep structure feature extraction blocks. When the convolutional layers are shallow, the feature information extracted by the network is rich, including both the internal texture and the external contour of the objects in the picture. As the convolutional layers become deeper, the network ignores some detailed textures, retaining only the important skeletons in the picture. The texture features do contain some structural information, but the information is so rich that texture and structure are entangled and difficult to distinguish.
The role of HMblock is to merge texture features, which already contain various kinds of information, with the original structural features, so that the reconstructed SR image possesses an accurate structure and rich texture. When extracting texture features, we consider multiple scales. Usually, convolutional layers with different kernel sizes are used; the Inception [19] module, for example, uses 3 × 3, 5 × 5, and 7 × 7 convolutional layers to achieve multi-scale features. Instead of using convolutional layers with different kernel sizes, we construct a bottleneck structure and use dilated convolution to obtain receptive fields of different sizes. For a standard discrete convolutional layer, with ⊗ as the convolution operator, a discrete function $O$ defined on $\mathbb{Z}^2$, and a discrete convolution kernel $P$, the expression is
$(O \otimes P)(i) = \sum_{s+t=i} O(s)\, P(t)$
When using a standard convolutional layer, the size of the receptive field depends on the size of the convolution kernel. Methods such as MRFN [13], which repeatedly stacks 3 × 3 convolutional layers to obtain a large receptive field, incur a large increase in the number of parameters and calculations. To minimize that problem, we use dilated convolution with a controllable dilation factor $r$ to control the size of the receptive field. The operation of dilated convolution can be stated as
$(O \otimes_r P)(i) = \sum_{s + r t = i} O(s)\, P(t)$
The dilated convolution [20] operator with dilation $r$ is denoted $\otimes_r$. Figure 4 shows that, by using dilation factors of 1, 2 and 3, we obtain receptive fields of 3 × 3, 5 × 5 and 7 × 7.
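For completeness, the effective kernel size of a dilated convolution follows directly from the dilation rate. For a $k \times k$ kernel with dilation $r$:
$k_{\mathrm{eff}} = k + (k-1)(r-1)$
so that, for $k = 3$: $r = 1 \Rightarrow k_{\mathrm{eff}} = 3$, $r = 2 \Rightarrow k_{\mathrm{eff}} = 5$, and $r = 3 \Rightarrow k_{\mathrm{eff}} = 7$, matching the three receptive fields in Figure 4.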
The features of scenes and objects in natural images are often irregular. In the RF module, even though multi-scale receptive fields have been obtained, the ability to learn irregular features remains inadequate.
For the model to further learn irregular features, we use deformable convolution [21] with offset:
$(O \otimes_d P)(i) = \sum_{s + t + \Delta x_n = i} O(s)\, P(t)$
Here, $\Delta x_n$ is a learnable offset, processed by bilinear interpolation to match the features. We flexibly use residual connections to avoid the checkerboard effect and to compensate for the information lost due to the enlargement of the receptive field. Before and after the RF module, we add a global residual connection; furthermore, in the middle layers of RF5 and RF7, we add two local residual connections. Table 2 shows that the number of parameters of our multi-scale method is greatly reduced compared to the Inception style [19] and the MRFN style [13].
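The following sketch illustrates one RF branch built along these lines, pairing a dilated 3 × 3 convolution with a deformable 3 × 3 convolution (via torchvision.ops.DeformConv2d) inside a bottleneck. The channel-reduction ratio, the offset-prediction layer and the activation choices are assumptions made for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class RFBlock(nn.Module):
    """Sketch of one RF branch: a bottleneck pairing a dilated 3x3 convolution
    (dilation r = 1/2/3 gives 3x3/5x5/7x7 receptive fields) with a deformable 3x3
    convolution that samples irregular receptive areas."""
    def __init__(self, channels=64, dilation=1, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.reduce = nn.Conv2d(channels, mid, 1)                    # bottleneck: shrink channels
        self.dilated = nn.Conv2d(mid, mid, 3, padding=dilation,
                                 dilation=dilation)                  # dilated conv sets the receptive field
        self.offset = nn.Conv2d(mid, 2 * 3 * 3, 3, padding=1)        # predict (dx, dy) per kernel tap
        self.deform = DeformConv2d(mid, mid, 3, padding=1)           # irregular sampling locations
        self.expand = nn.Conv2d(mid, channels, 1)                    # bottleneck: restore channels
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        f = self.act(self.reduce(x))
        f = self.act(self.dilated(f))
        f = self.act(self.deform(f, self.offset(f)))
        return self.expand(f) + x                                    # residual keeps global information
```

Three such branches with dilation rates 1, 2 and 3 would correspond to the RF3, RF5 and RF7 paths of Figure 3.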
In the super-resolution task, the structural similarity between the SR image and the HR image is very important, since problems such as distortion and deformation often crop up during processing. After obtaining multi-scale texture features, we consider structural features. Many existing methods [13] quit increasing the depth of this module after extracting the multi-scale texture features. However, doing so is not enough; although texture features contain structural features, the amount of information remains too large—with a gap between texture details and structural details. Without an obvious boundary, the contour of the reconstructed image remains blurred and too smooth. Our idea is to use a deeper convolutional layer to extract the structural features contained in the multi-scale texture features, and merge them again to strengthen the structure outline and fix the texture details to an accurate position.
As shown in Figure 3, we input the features into three multi-scale blocks RF, and connect them with the residual to compensate for the loss of global information:
$F_{[i,T]} = \sum_{r=3,5,7} \left( \mathrm{RF}_r(F_{i-1}) + \mathrm{Res}_r \right)$
After obtaining the local multi-scale texture features, we further extract the corresponding structural features through three deeper convolutional layers. We use the local residual connection to register the texture features to the corresponding positions of the structural features. We do not directly add the texture and structural features; instead, we first add a global residual connection link to the texture feature, then use concat to stack the texture feature and the global residual feature, finally adding the structural feature:
$F_i = F_{[i,S]} + \mathrm{Concat}\left( F_{[i,T]},\ F_{i-1} \right)$
Finally, we use a point convolutional layer to fuse the texture features and structural features. Table 2 shows the characteristics of our HMblock compared to other super-resolution core modules. We have used a variety of methods to reduce the number of parameters and made improvements in terms of hybrid multi-scale features. Table 3 shows the differences between HMSF and other methods used for multi-scale feature extraction.
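Putting the pieces together, a minimal sketch of an HMblock-style module could look as follows, reusing the RFBlock sketch above. The depth of the structural convolution stack and the channel bookkeeping around the concatenation are assumptions made only to keep the example runnable.

```python
import torch
import torch.nn as nn

class HMBlock(nn.Module):
    """Sketch of the hybrid multi-scale enhancement block (HMblock): three RF
    branches (dilation 1/2/3) provide local multi-scale texture features, a deeper
    stack of 3x3 convolutions extracts structural features, the texture features
    are concatenated with the block input (global residual), the structural
    features are added, and a point convolution fuses the result."""
    def __init__(self, channels=64):
        super().__init__()
        self.rf_branches = nn.ModuleList(
            [RFBlock(channels, dilation=r) for r in (1, 2, 3)])      # RF3, RF5, RF7
        self.structure = nn.Sequential(                               # deeper convs -> structure features
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2 * channels, 3, padding=1))
        self.fuse = nn.Conv2d(2 * channels, channels, 1)              # point conv fuses texture + structure

    def forward(self, x):
        texture = sum(branch(x) for branch in self.rf_branches)       # F_[i,T]: local multi-scale texture
        structure = self.structure(texture)                           # F_[i,S]: global structural features
        fused = structure + torch.cat([texture, x], dim=1)            # add structure to concat(texture, input)
        return self.fuse(fused)
```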

3.1.3. DRblock

The last module is DRBlock. This module includes a 1 × 1 convolution for dimensionality-reduction integration, as well as a PixelShuffle [25] layer for upsampling. DRBlock takes an $H \times W$ low-resolution input and, through a sub-pixel operation, turns it into a high-resolution image of $rH \times rW$. However, this is not achieved by directly generating the high-resolution image through interpolation; instead, convolution first yields a feature map with $r^2$ times as many channels, and the high-resolution image is then obtained through periodic shuffling, where $r$ is the upsampling factor, i.e., the magnification of the image. The preceding 1 × 1 convolution produces the feature map to be up-sampled with matching dimensions. After upsampling, we add the feature map that matches the target resolution to the enlarged low-resolution image obtained by bicubic interpolation to obtain the super-resolution image.
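A minimal sketch of this reconstruction step, using PyTorch's PixelShuffle, is shown below; the width of the 1 × 1 projection follows directly from the sub-pixel operation, while the default channel count is an assumption.

```python
import torch
import torch.nn as nn

class DRBlock(nn.Module):
    """Sketch of the reconstruction block: a 1x1 convolution produces
    out_channels * r^2 feature maps, PixelShuffle rearranges them into an
    r-times larger image, and the bicubically upsampled LR image is added
    as the reconstruction base."""
    def __init__(self, channels=64, out_channels=3, scale=2):
        super().__init__()
        self.proj = nn.Conv2d(channels, out_channels * scale ** 2, 1)  # match dims for sub-pixel shuffle
        self.shuffle = nn.PixelShuffle(scale)                          # (C*r^2, H, W) -> (C, rH, rW)

    def forward(self, feat, bicubic_base):
        return self.shuffle(self.proj(feat)) + bicubic_base            # residual onto the interpolated image
```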

3.2. Loss Function

We considered two loss functions to evaluate the difference between the ground-truth HR image and the SR image predicted and reconstructed by the model. The first is the $L_2$ loss, i.e., the mean-squared error (MSE), expressed as follows:
$L_{\mathrm{MSE}}(\hat{I}, I) = \frac{1}{hwc} \sum_{i,j,k} \left( \hat{I}_{i,j,k} - I_{i,j,k} \right)^2$
However, the $L_2$ loss is difficult to use in the presence of noise: its correlation with human visual perception is weak, and it cannot capture the potentially multi-modal distribution of HR images corresponding to a given LR image, so the reconstructed image is often too smooth. In contrast, the $L_1$ loss is widely used in super-resolution tasks, and many experiments have shown that it improves the super-resolution effect. Therefore, we considered the Charbonnier loss function, an improved form of the $L_1$ loss [8]. To prevent over-fitting, it adds a regularization term $\epsilon = 1 \times 10^{-3}$:
$L_{\mathrm{Charbonnier}}(\hat{I}, I) = \frac{1}{hwc} \sum_{i,j,k} \sqrt{ \left( \hat{I}_{i,j,k} - I_{i,j,k} \right)^2 + \epsilon^2 }$
We have used the L2 and Charbonnier loss functions to train our model. Experimental results show that the Charbonnier loss function can achieve better training results.
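For reference, the Charbonnier loss above can be written in a few lines of PyTorch; the ε value follows the text, while reducing by the mean is an assumption about the implementation.

```python
import torch

def charbonnier_loss(sr, hr, eps=1e-3):
    """Charbonnier loss: a smooth, robust variant of the L1 loss, averaged over
    all pixels and channels; eps = 1e-3 follows the value quoted in the text."""
    return torch.sqrt((sr - hr) ** 2 + eps ** 2).mean()
```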

4. Experiments

4.1. Datasets

4.1.1. Training Datasets

In line with state-of-the-art methods [6,10,12,34,45,46,47], we utilized the DIV2K dataset [48] to train our image super-resolution network. DIV2K is a high-quality image dataset, containing 800 training images, 100 testing images and 100 validation images.

4.1.2. Testing Datasets

In keeping with most existing methods, we also evaluated the effectiveness of the developed network on the following benchmark datasets: Set5 [49], Set14 [50], BSD100 [51], Urban100 [52] and Manga109 [53]. Among these datasets, BSD100, Set5, and Set14 consist of images with natural scenes; Urban100 contains urban scenes; and Manga109 consists of Japanese manga pictures with rich colors. As usual, we used bicubic interpolation to downsample the image to obtain the LR/HR image pair. By convention, we converted the picture from RGB format to YCbCr format, evaluated only the Y channel, and used bicubic interpolation to upsample the color components.
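A sketch of this evaluation protocol is given below. The ITU-R BT.601 conversion coefficients and the border cropping by the scale factor are common conventions in SR evaluation, assumed here for illustration rather than quoted from the paper.

```python
import numpy as np

def rgb_to_y(img):
    """Luminance (Y) channel of an RGB image with values in [0, 255],
    using the ITU-R BT.601 coefficients commonly used in SR evaluation."""
    img = img.astype(np.float64) / 255.0
    return 16.0 + 65.481 * img[..., 0] + 128.553 * img[..., 1] + 24.966 * img[..., 2]

def psnr_y(sr, hr, shave=2):
    """PSNR on the Y channel, cropping `shave` border pixels (often set to the scale factor)."""
    y_sr, y_hr = rgb_to_y(sr), rgb_to_y(hr)
    if shave:
        y_sr = y_sr[shave:-shave, shave:-shave]
        y_hr = y_hr[shave:-shave, shave:-shave]
    mse = np.mean((y_sr - y_hr) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```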

4.2. Implementation Details

To obtain the image pairs required for training, we first used bicubic interpolation to downsample the original HR images by scale factors $M$ ($M = 2, 3, 4$) to obtain the LR images. We then cropped each LR image into $L \times L$ sub-images and, to match them, cropped the HR image into $ML \times ML$ sub-images. For example, for the 2× task, we set the LR sub-image size to 24 × 24, the HR sub-image size to 48 × 48, and the stride of the cropping sliding window to 24 pixels. Meanwhile, we applied the same three data augmentation operations as in [10,12]: (1) flipping the picture horizontally and vertically; (2) rotating the picture by 90, 180 and 270 degrees; and (3) scaling the image to 0.6, 0.7, 0.8 and 0.9 times its original size. During training, we used the Charbonnier loss [8], set the learning rate to $4 \times 10^{-4}$, and used the Adam optimizer [54] with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. To make the network converge faster, we initialized the parameters with Kaiming initialization [55]. Our method was implemented in PyTorch on NVIDIA RTX 2080 Ti GPUs; training HMSF 2× took about 24 h on 4 GPUs.
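The following sketch summarizes this training setup. The data loader, the composition of the HMSF sketch from Section 3.1, and the use of kaiming_normal_ for the convolution weights are assumptions that only illustrate the hyper-parameters described above.

```python
import torch
import torch.nn as nn

def kaiming_init(model):
    """Kaiming initialization for convolution weights, as described in the text."""
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
            if m.bias is not None:
                nn.init.zeros_(m.bias)

def train_step(model, optimizer, lr_patch, hr_patch, eps=1e-3):
    """One training step with the Charbonnier loss; hyper-parameters follow Section 4.2."""
    optimizer.zero_grad()
    sr = model(lr_patch)
    loss = torch.sqrt((sr - hr_patch) ** 2 + eps ** 2).mean()
    loss.backward()
    optimizer.step()
    return loss.item()

# Setup sketch (HMSF, EFBlock, HMBlock, DRBlock refer to the sketches in Section 3.1):
# model = HMSF(EFBlock(3), HMBlock(64), DRBlock(64, 3, scale=2), scale=2).cuda()
# kaiming_init(model)
# optimizer = torch.optim.Adam(model.parameters(), lr=4e-4, betas=(0.9, 0.999))
# for lr_patch, hr_patch in loader:   # 24x24 LR / 48x48 HR crops with flip/rotate/scale augmentation
#     train_step(model, optimizer, lr_patch.cuda(), hr_patch.cuda())
```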

4.3. Analysis

4.3.1. Study of EFblock

EFblock is an efficient feature extraction block. Compared with the convolution block that consists of two standard 3 × 3 convolutional layers, its parameter number is reduced, but the performance is improved. We designed an ablation experiment by replacing the EFblock with a two-layer 3 × 3 convolutional block similar to that in other methods, similarly aiming to obtain 64-channel features.
Figure 5 shows all 64-dimensional features, as well as the heat map of the features, the average features, and the SR result produced by Convblock and EFblock when Barbara.bmp from Set14 (2×) is processed. From the full set of features, it can be seen that the features extracted by Convblock are dark, and many are almost black; these feature maps are not effective. In contrast, almost all of the EFblock feature maps carry useful information. The heat map makes clear that the features extracted by EFblock are richer, especially in the details; furthermore, the color contrast, such as for the several items on the desk, is sharper. The final HMSF-EFblock also yields higher PSNR and SSIM values for Barbara.bmp. Table 4 shows the results of training the HMSF model with Convblock and with EFblock, tested on three general datasets at 2×. The number of parameters of HMSF-EFblock is about 3000 less than that of HMSF using Convblock, with fewer multi-adds, too. HMSF-EFblock excels on all three datasets, especially Manga109 and Set14. The ablation experiment shows that, compared with Convblock, the efficient and lightweight feature-extraction module EFblock has the advantages of a small number of parameters and calculations, while performing quite well.

4.3.2. Study of Local Multi-Scale Learning

Local multi-scale learning mainly targets texture features. Convolutional layers with different receptive fields can extract texture features of different coarseness and precision. Figure 6 shows the feature maps and heat maps extracted by the RF series of multi-scale modules with different receptive field sizes. As the receptive field increases, the extracted texture features become more and more concise, moving from fine details to image contours. The fused image shows the result of fusing texture features with different levels of fineness: both fine features and rough contours.
NMT and HMblock in Figure 7 illustrate the difference between local single-scale and local multi-scale texture-extraction structures. We kept the local and global residual connections and adjusted only the number of local multi-scale modules, RF. HMSF-NMT in Table 4 denotes a model that uses a local single-scale structure during texture extraction. Although the parameters and calculations of the local single-scale model are slightly reduced, its performance is greatly reduced, declining on every test dataset, with the decline on Urban100 being the most obvious. The experiments show that local multi-scale texture features are very important for reconstructing the local details of the image.

4.3.3. Study of Irregular Convolution

We propose an RF module that uses two types of irregular convolution, dilated convolution and deformable convolution, to obtain receptive fields of different sizes. We view these two irregular convolutions as a whole: the dilated convolution enlarges the receptive field, while the deformable convolution extracts irregular features. To compare our approach with regular convolution, we designed an ablation experiment with five sub-experiments: (a) an HMSF that uses regular convolutions with different kernel sizes (i.e., 3 × 3, 5 × 5, 7 × 7) to obtain different receptive fields; (b) stacked regular 3 × 3 convolutions that achieve the same receptive field sizes; (c,d) HMblock with one of the two types of irregular convolution replaced by a regular 3 × 3 convolution; and (e) the full HMSF. Note that DI means dilated convolution, and DE means deformable convolution. Table 5 shows a performance comparison among the five variants at a factor of 2×. The comparison between (d) and (e) proves the advantage of dilated convolution (DI), which improves performance on all three datasets; likewise, the comparison between (c) and (e) shows that deformable convolution (DE) achieves better performance. Moreover, compared to the traditional regular-convolution-based multi-scale method (a), the full model (e) with the two types of irregular convolution not only greatly reduces the number of parameters, but also improves performance, achieving better PSNR/SSIM on all three datasets.

4.3.4. Study of Global Multi-Scale Learning

Besides local multi-scale texture features, the method also extracts global multi-scale structure features; together, these two kinds of features constitute the hybrid multi-scale features. The two feature maps at the bottom of Figure 3 show the average feature maps of the texture features and the structural features. The texture features are dark and retain the fine textures of the baby, the hat and other items, while the structural features retain only the definite outline and the main features of the face and facial organs. The final fused feature map registers the texture features accurately onto the structural features, yielding effective features with texture of varying coarseness and an accurate structure. We conducted ablation experiments on the global multi-scale structural features. NST in Figure 7 indicates that only the local multi-scale texture-extraction structure was used; structural features were not extracted. The trained model is named HMSF-NST. Table 4 shows the greatly reduced performance of HMSF-NST with the global multi-scale structural features removed; on each dataset, there was significant performance degradation.

4.3.5. Study of Loss Function

Table 6 shows the performance of our model trained with three loss functions, using a factor of 2×, and evaluating it using Set14 and Urban100. Among them, the model trained by Charbonnier loss has better performance than the L2 loss.

4.4. Comparisons with State-of-the-Arts

As shown in Table 7, we compare our proposed HMSF with several state-of-the-art lightweight SR methods published from 2014 to 2020—18 methods in total—including SRCNN [15], FSRCNN [16], VDSR [17], DRCN [29], LapSRN [8], DRRN [7], IDN [11], CARN [10] and MRFN [13]. We compare HMSF on five common datasets, using two common quality evaluation indicators: PSNR and SSIM. To further evaluate model parameters, memory usage, and multi-adds (Madds), we use the reconstruction of 1280 × 720 images as the benchmark for calculating these indicators.
As shown in Figure 8 and Figure 9, compared with some recent methods, ours requires fewer parameters, but performs excellently. Further experiments show that our method bests the existing methods in terms of model size and memory consumption, and exceeds the state-of-the-art method in terms of performance. Table 8 shows the difference between our method and others.
Since our proposed method is based on the idea of hybrid multi-scale features, it ranks second on only two items—Set5 2× and Set14 4×—among all PSNR and SSIM comparisons, and best on all the others. Our method makes the SR image closer to the original structure and texture of the ground-truth image, and also produces similar brightness, contrast, and structure. At the 2× stage, some models come close to ours: EDSR-baseline, MRFN and GLADSR. Compared with the EDSR-baseline, however, we lead in PSNR and SSIM on all datasets, although the parameter count of the EDSR-baseline is about twice ours. In comparison with MRFN, only Set5’s SSIM is slightly behind, by 0.003, while we are far ahead in other aspects, especially on the Set14 and Urban100 datasets. On Urban100, our method exceeds GLADSR in PSNR by 0.36. The Urban100 dataset contains many urban scenes; the texture of houses and streets in the pictures is complex, but the structure is very regular. Because our method focuses on the extraction and fusion of hybrid texture and structural features, it can be better applied to complex scenes, and the experimental data confirm this advantage. Figure 10 shows a visual comparison. In Urban100’s img067.png, there are many windows. These windows are highly structured, which means that their texture features are not rich, although the structure is very prominent. Among the six visualized methods, only ours captures the structural features and produces the least blur; the other methods leave many gray areas between the panes. In another picture, ppt3.bmp from Set14, our method better restores the letters in the picture, with noticeably less blur.
In the 3× stage, our method performs best on PSNR and SSIM. In contrast, other methods, such as MADNet and SRMDNF, perform less well as the image size increases, necessitating a considerable increase in the number of parameters. For example, the parameter number of MADNet exceeds one million at 4×: too much, perhaps, to execute well on devices with limited memory. With our method, on the other hand, the number of parameters always remains at about 730,000—not increasing excessively as the super-resolution ratio changes. For comparison purposes, we selected some images from B100 that contain natural scenes. As shown in Figure 11 in the image ‘14037.png’, our HMSF restores terrestrial details best; moreover, in the image ‘108005.png’, our method excels by restoring the stripe of the tiger more accurately.
In the 4× stage, our method is at just a slight disadvantage on Set14. However, methods with performance similar to ours, such as GLADSR, have more parameters, exceeding ours by 95,000; MADNet’s parameter count is also much larger than ours. Figure 12 shows a visual comparison. We chose Urban100’s ‘Img092.png’. In this picture, there are both structured features and rich textures. We notice that, among all six compared methods, almost all of the others reconstruct the original texture into an oblique texture, a serious problem that even changes the key structure of the picture. Our method benefits from the fusion of hybrid multi-scale features to better restore the original texture and structure of the image. The superiority of our method can be seen in yet another picture: ‘148026.png’ in B100. The bridge stripes in this picture, although very rich in texture details, are very subtle and difficult to restore. Only our method restores this much detail so well.
We also compared the models from the perspective of memory usage. With ours, the small number of parameters is a major advantage, directly affecting the memory consumption of the model inference. To keep the comparison fair, we ensured that all methods were experimented with using the same Pytorch platform. As shown in Table 9, we used four datasets for evaluation at a scale of 4×, and recorded memory consumption during inference. We selected several open-source representative methods: (1) DRRN, with the number of parameters much smaller than ours; (2) CARN, with the number of parameters much larger than ours; and (3) EDSR-baseline and LESRCNN, with a number of parameters similar to ours. It can be seen that DRRN, despite its small number of parameters, required very large memory consumption, reaching 8211 MB on the Urban100 dataset—too much for most personal computers. Several other methods performed satisfactorily, but consumed more than 2 GB of memory on the Urban100 dataset, more than can be supported by personal computers and mobile devices. Our method consumed the least memory for each dataset, but achieved satisfactory performance, especially on the Urban100 dataset; our memory consumption was about 800 MB lower than CARN’s, while our PSNR was 0.08 higher, and our SSIM was 0.005 higher. As shown in Figure 13, compared with four SOTAs, the proposed model achieves the best performance with less memory consumption in Urban100 4×.
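The paper does not state which tool was used to record memory, but on the same PyTorch platform the peak GPU memory during inference could be measured roughly as follows; this is an illustrative assumption, not the authors' measurement script.

```python
import torch

@torch.no_grad()
def peak_inference_memory_mb(model, lr_images, device='cuda'):
    """Run a model over a list of LR image tensors and report the peak GPU
    memory allocated during inference, in megabytes."""
    model = model.to(device).eval()
    torch.cuda.reset_peak_memory_stats(device)
    for img in lr_images:                        # e.g., iterate over Urban100 at 4x
        model(img.unsqueeze(0).to(device))
    return torch.cuda.max_memory_allocated(device) / 2 ** 20   # bytes -> MB
```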
We further add the comparisons between multi-scale-based SOTAs [12,13,57,58]. Among them, Wang et al. proposed a traditional theory-based multi-scale method that uses multi-scale dictionary training to construct the mapping of SR and LR. On the other hand, MADNet, MRFN, and the method proposed by Du et al. consider regular convolutions with different kernel sizes to obtain multi-scale receptive fields. Table 10 shows that the CNN-based methods achieve better performance than the traditional theory-based method. What is more, our HMSF is based on dilated convolutions and deformable convolutions, which leads to a more powerful ability to learn multi-scale features and achieves better performance in both Set5 and Set14 than other methods based on regular convolutions.
In addition, we compared several open-source methods with similar parameters. Table 11 shows a comparison of our multi-adds with the performance of several methods with similar parameters. Generally speaking, the number of parameters affects memory usage during execution, to a certain extent. Multi-adds can reflect execution speed. We can see that CARN, the second method listed in the table, had slightly fewer multi-adds, but its number of parameters was greatly increased, with performance much lower than ours. Although MemNet had fewer parameters, its number of multi-adds was excessive: tens of times more than those of other methods. For LESRCNN, the number of parameters and multi-adds was slightly larger than ours, but its performance was poor. In general, our method requires moderate parameters and calculations, but provides satisfactory performance. For example, our number of parameters was 861,000 fewer than that of CARN. While our number of multi-adds was increased by only 32 G, the performance of our method exceeded that of CARN on all three datasets, especially on Urban100, for which the PSNR was increased by 0.08.
Moreover, we added a comparison about the run times with several state-of-the-art methods [6,9,59,60]; in keeping with some recent research [61], we used the same GPU (Nvidia GTX 1080 Ti) to run models including HMSF and HMSF-L, and processed 100 images from Urban100 for 4×. Table 12 and Figure 14 show that, compared with four state-of-the-art methods, our HMSF ran the fastest; HMSF-L had the second-best speed. HMSF-L achieved better image quality with higher PSNR/SSIM, meaning that HMSF-L achieved both efficiency and accuracy.

4.5. Further Comparisons with Larger State-of-the-Art Methods

Although our HMSF provides the benefits of being lightweight and effective, we designed a series of comparisons to further demonstrate its potential. Given that our model initially aimed to build a lightweight and tight neural network architecture, we preferred to keep the original network structure and simply changed the number of channels in some layers to add parameters and enlarge the model. We first enlarged HMSF to a middle size, named HMSF-L, by increasing the number of blocks to three and applying more channels. We selected five state-of-the-art methods, including SRDenseNet [62], EDSR [6], FRSR [63], SRGAN [64], and NatSR [63], most of which have a similar number of parameters. As shown in Table 13, our HMSF-L has a middle size, includes 4.48 million parameters, and achieved strong performance: it obtained the best scores on seven comparative items at the 4× factor and, compared to the second-best method, EDSR, trailed only on Set5, by 0.02 in PSNR, with far fewer parameters—EDSR has almost 9.6 times as many parameters as HMSF-L. Furthermore, to compare with state-of-the-art methods designed at a large size, we enlarged HMSF further into HMSF-XL by increasing the number of HMblocks to five and further increasing the channels in some layers, raising the parameter count to 15.4 million. Since a model with more than 15 million parameters can be categorized as a standard large model, we selected five state-of-the-art methods, most with more than 15 million parameters: EDSR [6], SRGAT [65], RDN [59], RCAN [66] and SAN [67]; at least four of them require more parameters than our HMSF-XL. For comparison, we evaluated the methods on four common datasets, considering average PSNR/SSIM as well as the number of parameters to keep the comparison fair and comprehensive. As shown in Table 14, the three methods RCAN, SAN and our HMSF-XL each displayed certain advantages. RCAN performed better than HMSF-XL on 4 of the 11 items; at best—on Urban100—RCAN outperformed our method by 0.14 dB in PSNR. On the other hand, HMSF-XL achieved better performance on 6 of the 11 items. Moreover, compared to SAN, our HMSF-XL performed better on 7 of the 11 items, while SAN failed to place first or second on 4 of the 11, despite having 0.3 million more parameters than ours. In summary, HMSF-XL was enlarged simply by increasing the number of channels in some network layers, without modifying the main architecture of the network. Several methods performed better than HMSF-XL on some items, but on most items our method excelled; in short, HMSF-XL provides better overall performance with a smaller number of parameters.

4.6. Further Test on Real-World Images

As a powerful image processing tool, super-resolution is widely used. In the real world, however, there are few LR–HR image pairs on which image quality can be evaluated. To simulate a real-world scenario, we followed the methods described in [8,56] and used some real-world images to test ours. Because there are no HR images corresponding to these real-world images, we provide only the visualized results for evaluation. Figure 15 shows a real-world historical image processed at a factor of 2×; in the visual comparison, it can be seen that our HMSF restores the font details better. Figure 16 shows a visualized test result on the real LR image, Cat.png, which is more difficult to process because of its blur. Compared to the results of LapSRN [8], Waifu2x [68] and CARN [10], the image processed by HMSF has sharper edges as well as a purer black color.

4.7. Analysis of Limitations

As previously mentioned, our HMSF achieves better performance in terms of PSNR/SSIM and visualization quality. However, many deep-learning-based models [10,11,17,34] (including our HMSF) share a problem, shown in Figure 17: none of the methods can correctly restore the direction of a stripe from a bicubic low-resolution input. In the bicubic image, the stripes appear as staggered black and white pixels, and each white/black pixel connects to the next black/white pixel in two opposite directions. Such a phenomenon misleads trained models into producing undesirable results.

5. Discussion

As this paper has shown, our proposed HMSF is lightweight, accurate, and fast, with fewer parameters yet better performance than SOTAs. We found that existing lightweight methods ignore the texture and structural characteristics of features and do not extract them effectively, which leads to unsatisfactory image restoration quality. To tackle this problem, we proposed an efficient feature extraction module and a hybrid multi-scale mechanism that efficiently extracts multi-level image structure and texture features. Unlike previous multi-scale mechanisms, our method combines local and global multi-scale features, including both local multi-scale receptive fields and global multi-scale features across network depth. Furthermore, in the super-resolution task, the processed images often contain blurry and irregular artifacts, which previous methods cannot recover well. To solve this problem, we use dilated convolution and deformable convolution to further improve the ability to extract local multi-scale texture features; since the receptive field of deformable convolution is irregular, it adapts well to complex irregular artifacts. However, some problems remain unsolved: as Section 4.7 shows, existing SOTAs cannot correctly restore the direction of the stripes in Figure 17, leaving a research gap in how to properly restore images with misleading information. We believe that some prior knowledge must be added to guide the model to identify misleading information (i.e., wrong texture orientation) when processing such images. In future research, we will try to incorporate priors that verify misleading information (such as probabilistic predictions of the correct texture orientation) into our HMSF to better recover images with misleading or redundant information.

6. Conclusions

In this paper, we propose a lightweight and fast super-resolution method based on hybrid multi-scale features. We developed a feature-extraction module, EFblock, with a novel structure that flexibly uses point convolution and grouped convolution, and adds local and global residual connections while using fewer parameters than the usual convolutional layers. We also propose a novel hybrid multi-scale feature-extraction block, HMblock, with an efficient bottleneck structure, dilated convolution and deformable convolution, to achieve local and global multi-scale learning; it can accurately match the image structure and completely restore texture details. Compared with other state-of-the-art methods, ours performed competitively on five datasets, with high efficiency while remaining lightweight. In particular, our method requires fewer parameters and consumes less memory during execution, yet performs better, offering promising benefits for memory-constrained devices. In future work, we hope to develop a lightweight and efficient video super-resolution method; given the similarities and differences between the tasks, we hope to incorporate associated video-frame information into our work so that video super-resolution can also be executed on memory-constrained devices.

Author Contributions

Conceptualization, W.H. and X.L.; methodology, W.H.; software, W.H.; validation, W.H., X.L. and M.W.; formal analysis, L.Z. and Q.W.; investigation, Q.W.; resources, Q.W.; data curation, W.H.; writing—original draft preparation, W.H.; writing—review and editing, L.Z., Q.W. and M.W.; visualization, W.H.; supervision, Q.W.; project administration, Q.W.; funding acquisition, Q.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by multiple grants, including: The National Key Research and Development Program of China (2020YFB1313900), National Natural Science Foundation of China (62072452, 61902386), Shenzhen Science and Technology Program (JCYJ20200109115627045, JCYJ20200109114233670, JCYJ20200109115201707).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are openly available at https://github.com/H-Wenfeng/SR (accessed on 20 November 2021).

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Yang, W.; Zhang, X.; Tian, Y.; Wang, W.; Xue, J.; Liao, Q. Deep Learning for Single Image Super-Resolution: A Brief Review. IEEE Trans. Multimed. 2019, 21, 3106–3121.
  2. Schulter, S.; Leistner, C.; Bischof, H. Fast and accurate image upscaling with super-resolution forests. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3791–3799.
  3. Yang, C.Y.; Yang, M.H. Fast direct super-resolution by simple functions. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 561–568.
  4. Timofte, R.; De Smet, V.; Van Gool, L. A+: Adjusted anchored neighborhood regression for fast super-resolution. In Asian Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 111–126.
  5. Yao, X.; Wu, Q.; Zhang, P.; Bao, F. Weighted Adaptive Image Super-Resolution Scheme based on Local Fractal Feature and Image Roughness. IEEE Trans. Multimed. 2020, 23, 1426–1441.
  6. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144.
  7. Tai, Y.; Yang, J.; Liu, X. Image super-resolution via deep recursive residual network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3147–3155.
  8. Lai, W.S.; Huang, J.B.; Ahuja, N.; Yang, M.H. Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 624–632.
  9. Tai, Y.; Yang, J.; Liu, X.; Xu, C. Memnet: A persistent memory network for image restoration. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4539–4547.
  10. Ahn, N.; Kang, B.; Sohn, K.A. Fast, accurate, and lightweight super-resolution with cascading residual network. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 252–268.
  11. Hui, Z.; Wang, X.; Gao, X. Fast and accurate single image super-resolution via information distillation network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 723–731.
  12. Lan, R.; Sun, L.; Liu, Z.; Lu, H.; Pang, C.; Luo, X. MADNet: A Fast and Lightweight Network for Single-Image Super Resolution. IEEE Trans. Cybern. 2020, 51, 1443–1453.
  13. He, Z.; Cao, Y.; Du, L.; Xu, B.; Yang, J.; Cao, Y.; Tang, S.; Zhuang, Y. Mrfn: Multi-receptive-field network for fast and accurate single image super-resolution. IEEE Trans. Multimed. 2019, 22, 1042–1054.
  14. Yang, X.; Mei, H.; Zhang, J.; Xu, K.; Yin, B.; Zhang, Q.; Wei, X. Drfn: Deep recurrent fusion network for single-image super-resolution with large factors. IEEE Trans. Multimed. 2019, 21, 328–337.
  15. Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 184–199.
  16. Dong, C.; Loy, C.C.; Tang, X. Accelerating the super-resolution convolutional neural network. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 391–407.
  17. Kim, J.; Kwon Lee, J.; Mu Lee, K. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654.
  18. Zhang, D.; Shao, J.; Liang, Z.; Gao, L.; Shen, H.T. Large Factor Image Super-Resolution with Cascaded Convolutional Neural Networks. IEEE Trans. Multimed. 2020, 23, 2172–2184.
  19. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
  20. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122.
  21. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773.
  22. Riaz, M.; Smarandache, F.; Firdous, A.; Fakhar, A. On soft rough topology with multi-attribute group decision making. Mathematics 2019, 7, 67.
  23. Khan, N.; Yaqoob, N.; Shams, M.; Gaba, Y.U.; Riaz, M. Solution of Linear and Quadratic Equations Based on Triangular Linear Diophantine Fuzzy Numbers. J. Funct. Spaces 2021, 2021, 8475863.
  24. Mahmood, T.; Ali, Z.; Aslam, M.; Chinram, R. Generalized Hamacher Aggregation Operators Based on Linear Diophantine Uncertain Linguistic Setting and Their Applications in Decision-Making Problems. IEEE Access 2021, 9, 126748–126764.
  25. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883.
  26. Tian, C.; Xu, Y.; Zuo, W.; Zhang, B.; Fei, L.; Lin, C.W. Coarse-to-fine CNN for image super-resolution. IEEE Trans. Multimed. 2020, 23, 1489–1502.
  27. Wei, W.; Feng, G.; Zhang, Q.; Cui, D.; Zhang, M.; Chen, F. Accurate single image super-resolution using cascading dense connections. Electron. Lett. 2019, 55, 739–742.
  28. Jin, Z.; Iqbal, M.Z.; Bobkov, D.; Zou, W.; Li, X.; Steinbach, E. A Flexible Deep CNN Framework for Image Restoration. IEEE Trans. Multimed. 2020, 22, 1055–1068.
  29. Kim, J.; Kwon Lee, J.; Mu Lee, K. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1637–1645. [Google Scholar]
  30. Liu, P.; Zhang, H.; Zhang, K.; Lin, L.; Zuo, W. Multi-level wavelet-CNN for image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 773–782. [Google Scholar]
  31. Li, M.; Zhang, Z.; Yu, J.; Chen, C.W. Learning Face Image Super-Resolution through Facial Semantic Attribute Transformation and Self-Attentive Structure Enhancement. IEEE Trans. Multimed. 2021, 23, 468–483. [Google Scholar] [CrossRef]
  32. Ahn, N.; Kang, B.; Sohn, K.A. Efficient Deep Neural Network for Photo-realistic Image Super-Resolution. arXiv 2019, arXiv:1903.02240. [Google Scholar]
  33. Zhang, X.; Gao, P.; Liu, S.; Zhao, K.; Li, G.; Yin, L.; Chen, C.W. Accurate and efficient image super-resolution via global-local adjusting dense network. IEEE Trans. Multimed. 2020, 23, 1924–1937. [Google Scholar] [CrossRef]
  34. Tian, C.; Zhuge, R.; Wu, Z.; Xu, Y.; Zuo, W.; Chen, C.; Lin, C.W. Lightweight image super-resolution with enhanced CNN. Knowl.-Based Syst. 2020, 205, 106235. [Google Scholar] [CrossRef]
  35. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
  36. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856. [Google Scholar]
  37. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  38. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar] [CrossRef]
  39. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  40. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
  41. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 1314–1324. [Google Scholar]
  42. Zoph, B.; Le, Q.V. Neural architecture search with reinforcement learning. arXiv 2016, arXiv:1611.01578. [Google Scholar]
  43. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv 2015, arXiv:1502.03167. [Google Scholar]
  44. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  45. Xie, C.; Zeng, W.; Lu, X. Fast Single-Image Super-Resolution via Deep Network With Component Learning. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 3473–3486. [Google Scholar] [CrossRef]
  46. Li, F.; Bai, H.; Zhao, Y. FilterNet: Adaptive information filtering network for accurate and fast image super-resolution. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 1511–1523. [Google Scholar] [CrossRef]
  47. Choi, J.S.; Kim, M. A deep convolutional neural network with selection units for super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 154–160. [Google Scholar]
  48. Agustsson, E.; Timofte, R. Ntire 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 126–135. [Google Scholar]
  49. Bevilacqua, M.; Roumy, A.; Guillemot, C.; Alberi-Morel, M.L. Low-Complexity Single-Image Super-Resolution Based on Nonnegative Neighbor Embedding. In Proceedings of the British Machine Vision Conference, Guildford, UK, 3–7 September 2012; pp. 135.1–135.10. [Google Scholar]
  50. Yang, J.; Wright, J.; Huang, T.S.; Ma, Y. Image super-resolution via sparse representation. IEEE Trans. Image Process. 2010, 19, 2861–2873. [Google Scholar] [CrossRef]
  51. Martin, D.; Fowlkes, C.; Tal, D.; Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the Eighth IEEE International Conference on Computer Vision, ICCV 2001, Vancouver, BC, Canada, 7–14 July 2001; Volume 2, pp. 416–423. [Google Scholar]
  52. Huang, J.B.; Singh, A.; Ahuja, N. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5197–5206. [Google Scholar]
  53. Matsui, Y.; Ito, K.; Aramaki, Y.; Fujimoto, A.; Ogawa, T.; Yamasaki, T.; Aizawa, K. Sketch-based manga retrieval using manga109 dataset. Multimed. Tools Appl. 2017, 76, 21811–21838. [Google Scholar] [CrossRef] [Green Version]
  54. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  55. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar]
  56. Zhang, K.; Zuo, W.; Zhang, L. Learning a single convolutional super-resolution network for multiple degradations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3262–3271. [Google Scholar]
  57. Wang, W.; Hu, J.; Liu, X.; Zhao, J.; Chen, J. Single image super resolution based on multi-scale structure and non-local smoothing. Eurasip J. Image Video Process. 2021, 2021, 16. [Google Scholar] [CrossRef]
  58. Du, X.; Qu, X.; He, Y.; Guo, D. Single image super-resolution based on multi-scale competitive convolutional neural network. Sensors 2018, 18, 789. [Google Scholar] [CrossRef] [Green Version]
  59. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual dense network for image restoration. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2480–2495. [Google Scholar] [CrossRef] [Green Version]
  60. Hu, X.; Mu, H.; Zhang, X.; Wang, Z.; Tan, T.; Sun, J. Meta-SR: A magnification-arbitrary network for super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1575–1584. [Google Scholar]
  61. Behjati, P.; Rodriguez, P.; Mehri, A.; Hupont, I.; Tena, C.F.; Gonzalez, J. Overnet: Lightweight multi-scale super-resolution with overscaling network. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 5–9 January 2021; pp. 2694–2703. [Google Scholar]
  62. Tong, T.; Li, G.; Liu, X.; Gao, Q. Image super-resolution using dense skip connections. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4799–4807. [Google Scholar]
  63. Soh, J.W.; Park, G.Y.; Jo, J.; Cho, N.I. Natural and realistic single image super-resolution with explicit natural manifold discrimination. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8122–8131. [Google Scholar]
  64. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690. [Google Scholar]
  65. Behjati, P.; Rodríguez, P.; Mehri, A.; Hupont, I.; Gonzàlez, J.; Tena, C.F. Overnet: Lightweight multi-scale super-resolution with overscaling network. arXiv 2020, arXiv:2008.02382. [Google Scholar]
  66. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar]
  67. Dai, T.; Cai, J.; Zhang, Y.; Xia, S.T.; Zhang, L. Second-order attention network for single image super-resolution. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 11057–11066. [Google Scholar] [CrossRef]
  68. Waifu2x. Image Super-Resolution for Anime-Style Art Using Deep Convolutional Neural Networks. Available online: http://waifu2x.udp.jp/ (accessed on 20 November 2021).
Figure 1. The overall structure of the proposed network. EFblock, HMblock and DRblock are used for feature extraction, image enhancement, and image reconstruction, respectively.
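For readers who prefer code to diagrams, the following is a minimal structural sketch of the three-stage pipeline summarized in the caption (feature extraction, hybrid multi-scale enhancement, reconstruction). All module internals, channel widths, and the block count are illustrative placeholders, not the authors' implementation; only the overall EFblock–HMblock–DRblock composition and the global residual connection follow the figure.

```python
# Minimal structural sketch of the EFblock -> HMblock -> DRblock pipeline.
# All internals are illustrative placeholders, NOT the authors' implementation.
import torch
import torch.nn as nn

class HMSFSketch(nn.Module):
    def __init__(self, channels=64, num_hmblocks=4, scale=2):
        super().__init__()
        # EFblock stand-in: shallow feature extraction from the RGB input.
        self.efblock = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True)
        )
        # HMblock stand-ins: hybrid multi-scale feature enhancement.
        self.hmblocks = nn.Sequential(
            *[nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                            nn.ReLU(inplace=True)) for _ in range(num_hmblocks)]
        )
        # DRblock stand-in: sub-pixel (PixelShuffle) reconstruction to HR space.
        self.drblock = nn.Sequential(
            nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, lr):
        feat = self.efblock(lr)
        feat = feat + self.hmblocks(feat)   # global residual connection
        return self.drblock(feat)

sr = HMSFSketch()(torch.randn(1, 3, 48, 48))   # -> torch.Size([1, 3, 96, 96])
```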
Figure 2. Feature extraction blocks used in several recent SR methods.
Figure 3. An overview of HMblock. HMblock includes the local multi-scale feature extraction module RF and the global residual connection used to extract the global multi-scale features, which together constitute the hybrid multi-scale features.
Figure 4. Three scales of the multi-scale RF module, expressed in the form of different receptive field sizes.
Figure 5. Visual qualitative comparison of average feature maps from HMSF-CONV and HMSF.
Figure 6. Visual qualitative comparison of average feature maps and average heat maps via local multi-scale learning.
Figure 7. NST: without global multi-scale structural features. NMT: without local multi-scale texture features. HMblock: complete hybrid multi-scale features.
Figure 8. PSNR of recent CNN models at the 2× scale on Set5 [49]. The red point is the result of our HMSF. Our model achieves the best performance, although the number of network parameters is relatively small.
Figure 9. PSNR of recent CNN models at the 4× scale on Set5 [49]. The red point is the result of our HMSF. Our model achieves the best performance, although the number of network parameters is relatively small.
Figure 10. Visual qualitative comparison on 2× scale datasets.
Figure 11. Visual qualitative comparisons on 3× scale datasets.
Figure 12. Visual qualitative comparisons on 4× scale datasets.
Figure 13. Model memory consumption versus performance. The proposed model achieves the best performance with less memory consumption.
Figure 14. Model running time versus performance. The proposed model achieves the best performance with lower running time.
Figure 15. Visual test on a real-world historical image at 2×.
Figure 16. Visual test on the real LR image ‘Cat.png’ at 2×.
Figure 17. Failure case: an example of 2× SR in which our HMSF cannot correctly restore the direction of the stripes on the textile, owing to its complex detail and misleading texture information.
Table 1. Comparison of EFblock and standard convolution layers in terms of parameter count, computation (Madd), and number of activation functions.
Method | Layers (Dimensions) | Total Param. | Total Madd | Activation
Conv 3 × 3 block | Conv1 (3-32), Conv2 (32-64) | 19 K | 4.4 G | 2
EFblock | Block1 (3-32-64), Block2 (64-64-64) | 16 K | 3.7 G | 6
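As a sanity check on the first row, the parameter and multiply–add counts of a plain 3 × 3 convolution stack follow the standard formulas params = (k²·C_in + 1)·C_out and Madds ≈ H·W·k²·C_in·C_out. The sketch below (plain Python, illustrative only; the output resolution is our assumption, not stated in the table) reproduces the ≈19 K parameter figure for the Conv1 (3→32) + Conv2 (32→64) stack and a Madd count close to the reported 4.4 G.

```python
# Parameter and multiply-add counting for a stack of 3x3 convolutions
# (illustrative sanity check for Table 1; the 640x360 resolution is an assumption).
def conv_params(c_in, c_out, k=3):
    return (k * k * c_in + 1) * c_out          # weights + biases

def conv_madds(c_in, c_out, h, w, k=3):
    return h * w * k * k * c_in * c_out        # one multiply-add per weight per output pixel

layers = [(3, 32), (32, 64)]                   # Conv1: 3->32, Conv2: 32->64
print(sum(conv_params(ci, co) for ci, co in layers))           # 19392  (~19 K)
print(sum(conv_madds(ci, co, 360, 640) for ci, co in layers))  # ~4.45e9 (~4.4 G)
```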
Table 2. Comparing different local multi-scale implementation methods.
Method | RF | Parameter | Madd | Receptive Field
HMblock-inc [19] | Different conv kernel sizes | 0.865 M | 138 G | 3, 5, 7
HMblock-mrf [13] | Only stack 3 × 3 conv | 0.747 M | 111 G | 3, 5, 7
HMblock | Bottleneck + Dilated + Deformable | 0.523 M | 59 G | [3, 5, 7] × Deformable
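To make the "Bottleneck + Dilated + Deformable" row concrete, the sketch below shows one common way to obtain 3/5/7 effective receptive fields from 3 × 3 kernels by varying the dilation rate, followed by a deformable refinement via torchvision.ops.DeformConv2d. The channel widths, the concatenation-based fusion, and the placement of the deformable layer are assumptions for illustration, not the exact HMblock design.

```python
# Illustrative multi-scale receptive-field branches: 3x3 kernels with
# dilation 1/2/3 give effective receptive fields of 3/5/7 (cf. Table 2).
# Channel widths and the concat fusion are assumptions, not the exact HMblock.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class MultiScaleRF(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in (1, 2, 3)                      # receptive fields 3, 5, 7
        ])
        self.fuse = nn.Conv2d(3 * channels, channels, 1)
        # Deformable refinement: 2 * 3 * 3 offset channels are predicted from
        # the fused features and used to sample irregular spatial locations.
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, 3, padding=1)
        self.deform = DeformConv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        y = self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
        return self.deform(y, self.offset(y))

out = MultiScaleRF()(torch.randn(1, 64, 32, 32))   # -> (1, 64, 32, 32)
```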
Table 3. Comparison of different image enhancement blocks: “LMS” means “local multi-scale”, “GMS” means “global multi-scale”, and “HMS” means “hybrid multi-scale”.
Block | LMS | GMS | HMS
CARN | × | × | ×
MRF (MRFN) | ✓ | × | ×
LESRCNN | × | × | ×
HMblock (Ours) | ✓ | ✓ | ✓
Table 4. The ablation experiment results of EFblock, local multi-scale learning and global multi-scale learning, evaluated on three common 2× datasets with PSNR/SSIM. A × marks the component removed in each variant. Red text means the best performance.
Methods | Parameters | EFB | NST | NMT | Set14 | B100 | Manga109
HMSF-CONV | 732 K | × | ✓ | ✓ | 33.77/0.9192 | 32.27/0.9008 | 38.87/0.9776
HMSF-NMT | 610 K | ✓ | ✓ | × | 33.72/0.9186 | 32.25/0.9007 | 38.87/0.9776
HMSF-NST | 462 K | ✓ | × | ✓ | 33.55/0.9170 | 32.17/0.8996 | 38.41/0.9768
HMSF | 729 K | ✓ | ✓ | ✓ | 33.81/0.9194 | 32.28/0.9009 | 38.94/0.9778
Table 5. Ablation experiments on the use of RC/DI/DE (regular/dilated/deformable convolution). Red text means the best performance.
HMSF Variant | Parameters | Scale | Set14 (PSNR/SSIM) | B100 (PSNR/SSIM) | Manga109 (PSNR/SSIM)
(a) RC_based | 762 K | 2 | 33.60/0.9174 | 32.19/0.8999 | 38.75/0.9772
(b) RC_RC_based | 695 K | 2 | 33.71/0.9159 | 32.25/0.9005 | 38.77/0.9775
(c) DI_RC_based | 675 K | 2 | 33.71/0.9179 | 32.25/0.9006 | 38.79/0.9775
(d) RC_DE_based | 729 K | 2 | 33.70/0.9158 | 32.24/0.9005 | 38.74/0.9775
(e) DI_DE_based | 729 K | 2 | 33.81/0.9194 | 32.28/0.9009 | 38.94/0.9777
Table 6. The performance of models trained with different loss functions.
Loss Function | Scale | Set5 PSNR (dB)/SSIM | B100 PSNR (dB)/SSIM
L2 | 2× | 38.04/0.9606 | 32.24/0.9006
Charbonnier (L1) | 2× | 38.10/0.9609 | 32.28/0.9009
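For reference, the two training objectives compared above can be sketched as follows; the Charbonnier loss is the usual differentiable approximation of the L1 norm, and the ε value used here is a common default, not necessarily the one used in the paper.

```python
# L2 vs. Charbonnier (smooth L1) loss -- a minimal sketch; eps = 1e-3 is a
# common default and an assumption, not necessarily the paper's setting.
import torch

def l2_loss(sr, hr):
    return torch.mean((sr - hr) ** 2)

def charbonnier_loss(sr, hr, eps=1e-3):
    return torch.mean(torch.sqrt((sr - hr) ** 2 + eps ** 2))

sr, hr = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
print(l2_loss(sr, hr).item(), charbonnier_loss(sr, hr).item())
```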
Table 7. Quantitative comparisons (PSNR (dB)/SSIM for 2×, 3× and 4×) of SOTA SR models. Red/blue text means the best/second-best performance.
Model | Scale | Param | Set5 | Set14 | B100 | Urban100 | Manga109
SRCNN [15] | 2 | 57 K | 36.66/0.9542 | 32.42/0.9063 | 31.36/0.8879 | 29.50/0.8946 | 35.60/0.9663
FSRCNN [16] | 2 | 12 K | 37.00/0.9558 | 32.63/0.9088 | 31.53/0.8920 | 29.88/0.9020 | 36.67/0.9710
VDSR [17] | 2 | 665 K | 37.53/0.9587 | 33.03/0.9124 | 31.90/0.8960 | 30.76/0.9140 | 37.22/0.9750
DRCN [29] | 2 | 1774 K | 37.63/0.9588 | 33.04/0.9118 | 31.85/0.8942 | 30.75/0.9133 | 37.55/0.9732
LapSRN [8] | 2 | 813 K | 37.52/0.9590 | 33.08/0.9130 | 31.80/0.8950 | 30.41/0.9100 | 37.27/0.9740
DRRN [7] | 2 | 297 K | 37.74/0.9591 | 33.23/0.9136 | 32.05/0.8973 | 31.23/0.9188 | 37.88/0.9749
MemNet [9] | 2 | 677 K | 37.78/0.9597 | 33.28/0.9142 | 32.08/0.8978 | 31.31/0.9195 | 37.72/0.9740
EDSRbase [6] | 2 | 1370 K | 37.99/0.9604 | 33.57/0.9175 | 32.16/0.8994 | 31.98/0.9272 | 38.54/0.9769
SRMDNF [56] | 2 | 1513 K | 37.79/0.9600 | 33.32/0.9150 | 32.05/0.8980 | 31.33/0.9200 | 38.07/0.9761
IDN [11] | 2 | 796 K | 37.83/0.9600 | 33.30/0.9148 | 32.08/0.8985 | 31.27/0.9196 | 38.01/0.9749
CARN [10] | 2 | 1592 K | 37.76/0.9590 | 33.52/0.9166 | 32.09/0.8978 | 31.92/0.9256 | /
DRFN [14] | 2 | - | 37.11/0.9595 | 33.29/0.9142 | 32.02/0.8979 | 31.08/0.9179 | /
MADNet [12] | 2 | 878 K | 37.85/0.9600 | 33.39/0.9161 | 32.05/0.8981 | 17.59/0.9234 | /
LESRCNN [34] | 2 | 626 K | 37.65/0.9586 | 33.32/0.9148 | 31.95/0.8964 | 31.45/0.9206 | /
CFSRCNN [26] | 2 | 1200 K | 37.79/0.9591 | 33.51/0.9165 | 32.11/0.8988 | 32.07/0.9273 | /
MRFN [13] | 2 | - | 37.98/0.9611 | 33.41/0.9159 | 32.14/0.8997 | 31.45/0.9221 | /
GLADSR [33] | 2 | 812 K | 37.99/0.9608 | 33.63/0.9179 | 32.16/0.8996 | 32.16/0.9283 | /
HMSF | 2 | 729 K | 38.10/0.9609 | 33.81/0.9194 | 32.28/0.9009 | 32.52/0.9322 | 38.94/0.9777
SRCNN [15] | 3 | 57 K | 32.66/0.9089 | 29.30/0.8215 | 28.41/0.7863 | 26.24/0.7989 | 30.48/0.9117
FSRCNN [16] | 3 | 12 K | 33.16/0.9140 | 29.43/0.8242 | 28.53/0.7910 | 26.43/0.8080 | 31.10/0.9210
VDSR [17] | 3 | 665 K | 33.67/0.9210 | 29.54/0.8277 | 28.55/0.7945 | 26.48/0.8175 | 32.01/0.9340
DRCN [29] | 3 | 1774 K | 33.85/0.9215 | 29.89/0.8317 | 28.81/0.7954 | 27.16/0.8311 | 32.24/0.9343
LapSRN [8] | 3 | 813 K | 33.82/0.9227 | 29.87/0.8320 | 28.82/0.7980 | 27.07/0.8280 | 32.21/0.9350
DRRN [7] | 3 | 297 K | 34.03/0.9244 | 29.96/0.8349 | 28.95/0.8004 | 27.53/0.8378 | 32.71/0.9379
MemNet [9] | 3 | 677 K | 34.09/0.9248 | 30.00/0.8350 | 28.96/0.8001 | 27.56/0.8376 | 32.71/0.9381
EDSRbase [6] | 3 | 1555 K | 34.37/0.9270 | 30.28/0.8417 | 29.09/0.8052 | 28.15/0.8527 | 33.45/0.9439
SRMDNF [56] | 3 | 1530 K | 34.12/0.9254 | 30.04/0.8370 | 28.97/0.8030 | 27.57/0.8400 | 33.00/0.9403
IDN [11] | 3 | 796 K | 34.11/0.9253 | 29.99/0.8354 | 28.95/0.8013 | 27.42/0.8359 | 32.71/0.9381
CARN [10] | 3 | 1592 K | 34.29/0.9255 | 30.29/0.8407 | 29.06/0.8034 | 28.06/0.8493 | /
DRFN [14] | 3 | - | 34.01/0.9234 | 30.06/0.8366 | 28.93/0.8010 | 27.43/0.8359 | /
MADNet [12] | 3 | 930 K | 34.14/0.9251 | 30.20/0.8395 | 28.98/0.8023 | 27.78/0.8439 | /
LESRCNN [34] | 3 | 811 K | 33.93/0.9231 | 30.12/0.8380 | 28.91/0.8005 | 27.70/0.8415 | /
CFSRCNN [26] | 3 | 1200 K | 34.34/0.9256 | 30.27/0.8410 | 29.03/0.8035 | 28.04/0.8496 | /
MRFN [13] | 3 | - | 34.21/0.9267 | 30.03/0.8363 | 28.99/0.8029 | 27.53/0.8389 | /
GLADSR [33] | 3 | 821 K | 34.41/0.9272 | 30.37/0.8418 | 29.08/0.8050 | 28.24/0.8537 | /
HMSF | 3 | 730 K | 34.49/0.9280 | 30.42/0.8438 | 29.15/0.8065 | 28.33/0.8566 | 33.71/0.9453
SRCNN [15] | 4 | 57 K | 30.48/0.8628 | 27.50/0.7513 | 26.90/0.7101 | 24.52/0.7221 | 27.58/0.8555
FSRCNN [16] | 4 | 12 K | 30.73/0.8601 | 27.71/0.7488 | 26.98/0.7029 | 24.62/0.7272 | 27.90/0.8610
VDSR [17] | 4 | 665 K | 31.35/0.8830 | 28.02/0.7680 | 27.29/0.7267 | 25.18/0.7540 | 28.83/0.8870
DRCN [29] | 4 | 1774 K | 31.56/0.8810 | 28.15/0.7627 | 27.24/0.7150 | 25.15/0.7530 | 28.93/0.8854
LapSRN [8] | 4 | 813 K | 31.54/0.8850 | 28.19/0.7720 | 27.32/0.7270 | 25.21/0.7560 | 29.09/0.8900
DRRN [7] | 4 | 297 K | 31.68/0.8888 | 28.21/0.7720 | 27.38/0.7284 | 25.44/0.7638 | 29.45/0.8946
MemNet [9] | 4 | 677 K | 31.74/0.8893 | 28.26/0.7723 | 27.40/0.7281 | 25.50/0.7630 | 29.42/0.8942
EDSRbase [6] | 4 | 1518 K | 32.09/0.8938 | 28.58/0.7813 | 27.57/0.7357 | 26.04/0.7849 | 30.35/0.9067
SRMDNF [56] | 4 | 1555 K | 31.96/0.8930 | 28.35/0.7770 | 27.49/0.7340 | 25.68/0.7730 | 30.09/0.9024
IDN [11] | 4 | 796 K | 31.82/0.8903 | 28.25/0.7730 | 27.41/0.7297 | 25.41/0.7632 | 29.41/0.8942
CARN [10] | 4 | 1592 K | 32.13/0.8937 | 28.60/0.7806 | 27.58/0.7349 | 26.07/0.7837 | /
DRFN [14] | 4 | - | 31.55/0.8861 | 28.30/0.7737 | 27.39/0.7293 | 26.45/0.7629 | /
MADNet [12] | 4 | 1002 K | 32.01/0.8925 | 28.45/0.7781 | 27.47/0.7327 | 25.77/0.7751 | /
LESRCNN [34] | 4 | 774 K | 31.88/0.8903 | 28.44/0.7772 | 27.45/0.7313 | 25.77/0.7732 | /
CFSRCNN [26] | 4 | 1200 K | 32.06/0.8920 | 28.57/0.7800 | 27.53/0.7333 | 26.03/0.7824 | /
MRFN [13] | 4 | - | 31.90/0.8916 | 28.31/0.7746 | 27.43/0.7309 | 25.46/0.7654 | /
GLADSR [33] | 4 | 826 K | 32.14/0.8940 | 28.62/0.7813 | 27.59/0.7361 | 26.12/0.7851 | /
HMSF | 4 | 731 K | 32.15/0.8947 | 28.61/0.7821 | 27.61/0.7372 | 26.15/0.7887 | 30.52/0.9082
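The PSNR values above follow the usual SISR evaluation convention: the metric is computed on the luminance (Y) channel with a border of scale pixels cropped. The sketch below implements that convention; the BT.601 luminance conversion and the cropping rule are the common practice in the literature, which we assume the paper also follows. SSIM is typically computed on the same cropped Y channel, e.g., with skimage.metrics.structural_similarity.

```python
# PSNR on the Y (luminance) channel with an s-pixel border crop -- the common
# SISR evaluation convention (assumed here); images are float RGB in [0, 1].
import numpy as np

def rgb_to_y(img):
    # ITU-R BT.601 luma in the [16, 235] range, for float RGB input in [0, 1].
    return 16.0 + 65.481 * img[..., 0] + 128.553 * img[..., 1] + 24.966 * img[..., 2]

def psnr_y(sr, hr, scale):
    # Crop a border of `scale` pixels before measuring, as is common practice.
    y_sr = rgb_to_y(sr)[scale:-scale, scale:-scale]
    y_hr = rgb_to_y(hr)[scale:-scale, scale:-scale]
    mse = np.mean((y_sr - y_hr) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)

sr, hr = np.random.rand(128, 128, 3), np.random.rand(128, 128, 3)
print(psnr_y(sr, hr, scale=2))
```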
Table 8. Comparison of parameters and feature extraction layers of several methods at 2×.
Methods | SRCNN [15] | VDSR [17] | LapSRN [8] | CARN [10] | IDN [11] | MRFN [13] | HMSF
Parameters | 57 K | 665 K | 813 K | 1592 K | 796 K | - | 729 K
Feature Extraction | Conv Layers | Conv Layers | Conv Layers | Conv Layers | Conv Layers | Conv Layers | EFblock
Global Multi-Scale | × | × | × | × | × | × | ✓
Local Multi-Scale | × | × | × | × | × | ✓ | ✓
Hybrid Multi-Scale | × | × | × | × | × | × | ✓
Table 9. Using memory consumption as the evaluation criterion, in a comparison of five open-source methods at 4× scale, our method has the best performance and the smallest memory consumption. Red text means the best performance. Each cell reports PSNR/SSIM with the memory consumption in parentheses.
Methods | Scale | Parameters | Set5 | Set14 | B100 | Urban100
DRRN | 4 | 301 K | 31.68/0.8888 (1071 M) | 28.21/0.7720 (1845 M) | 27.38/0.7284 (859 M) | 25.44/0.7638 (8211 M)
CARN | 4 | 1592 K | 32.13/0.8937 (803 M) | 28.60/0.7806 (983 M) | 27.58/0.7349 (695 M) | 26.07/0.7837 (2697 M)
EDSR-baseline | 4 | 1518 K | 32.09/0.8938 (727 M) | 28.58/0.7813 (921 M) | 27.57/0.7357 (659 M) | 26.04/0.7849 (2497 M)
LESRCNN | 4 | 774 K | 31.88/0.8903 (1149 M) | 28.44/0.7772 (1805 M) | 27.45/0.7313 (903 M) | 25.77/0.7732 (7307 M)
HMSF | 4 | 731 K | 32.15/0.8947 (652 M) | 28.61/0.7821 (729 M) | 27.61/0.7372 (611 M) | 26.15/0.7887 (1854 M)
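Memory figures of this kind can be reproduced in spirit by tracking the peak allocation of the CUDA caching allocator during inference; a minimal sketch, assuming a PyTorch model on a CUDA device (both placeholders here), is:

```python
# Peak GPU memory during inference -- a minimal sketch; `model` and
# `lr_image` are placeholders, and a CUDA device is assumed.
import torch

def peak_inference_memory_mb(model, lr_image):
    model.eval().cuda()
    lr_image = lr_image.cuda()
    torch.cuda.reset_peak_memory_stats()   # start tracking from zero
    with torch.no_grad():
        model(lr_image)
    torch.cuda.synchronize()               # make sure all kernels finished
    return torch.cuda.max_memory_allocated() / (1024 ** 2)
```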
Table 10. Quantitative comparisons (PSNR (dB)/SSIM for 3×) of SOTA multi-scale-based SR models. Red text means the best performance.
Method | Multi-Scale Implementation | Scale | Set5 | Set14
Wang et al. [57] | Multi-Scale Dictionary | 3 | 33.40/0.9200 | 29.51/0.8300
Du et al. [58] | Regular Convolution | 3 | 33.44/0.9185 | 29.59/0.9290
MADNet [12] | Regular Convolution | 3 | 34.14/0.9250 | 30.20/0.8390
MRFN [13] | Regular Convolution | 3 | 34.21/0.9260 | 30.03/0.8360
HMSF (ours) | Dilated + Deformable Convolution | 3 | 34.49/0.9280 | 30.42/0.8438
Table 11. Multiply–adds (Madds) compared between six similar methods at 4× scale. Madds counts the multiplication and addition operations of a model; together with the parameter count it characterizes model complexity. Red text means the best performance.
Methods | Scale | Parameters | Madds | Set14 (PSNR/SSIM) | B100 (PSNR/SSIM) | Urban100 (PSNR/SSIM)
LapSRN | 4 | 813 K | 149 G | 28.19/0.7720 | 27.32/0.7270 | 25.21/0.7560
MemNet | 4 | 677 K | 2662 G | 28.26/0.7723 | 27.40/0.7281 | 25.50/0.7630
CARN | 4 | 1592 K | 90 G | 28.60/0.7806 | 27.58/0.7349 | 26.07/0.7837
MADNet | 4 | 1002 K | 54 G | 28.45/0.7781 | 27.47/0.7327 | 25.77/0.7751
LESRCNN | 4 | 774 K | 241 G | 28.44/0.7772 | 27.45/0.7313 | 25.77/0.7732
HMSF | 4 | 731 K | 122 G | 28.61/0.7821 | 27.61/0.7372 | 26.15/0.7887
Table 12. Comparison of average running time on Urban100 at 4×. Red/blue text means the best/second-best performance.
Model | Parameters | Running Time (s) | PSNR | SSIM
MemNet | 0.6 M | 0.481 | 25.50 | 0.7630
EDSR | 43 M | 1.218 | 26.64 | 0.8029
RDN | 22 M | 1.268 | 26.61 | 0.8028
Meta-RDN | 22 M | 1.35 | 26.65 | -
HMSF (Ours) | 0.7 M | 0.069 | 26.15 | 0.7887
HMSF-L (Ours) | 4.48 M | 0.31 | 26.71 | 0.8056
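Average running time is sensitive to how GPU asynchrony is handled. A minimal sketch of the measurement loop, with explicit synchronization and a warm-up pass (standard practice, assumed here; `model` and `images` are placeholders), is:

```python
# Average per-image inference time with warm-up and explicit CUDA
# synchronization; `model` and `images` (list of LR tensors) are placeholders.
import time
import torch

def average_runtime_s(model, images, warmup=3):
    model.eval().cuda()
    with torch.no_grad():
        for x in images[:warmup]:          # warm-up excludes one-off setup costs
            model(x.cuda())
        torch.cuda.synchronize()
        start = time.perf_counter()
        for x in images:
            model(x.cuda())
        torch.cuda.synchronize()           # wait for queued kernels before stopping the clock
    return (time.perf_counter() - start) / len(images)
```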
Table 13. We enlarge HMSF to a medium size and compare it with other state-of-the-art models that have a similar number of parameters. Red/blue text means the best/second-best performance. All results are for 4×; each cell reports PSNR/SSIM.
Dataset | SRDenseNet | EDSR | FRSR | SRGAN | NatSR | HMSF-L
Set5 | 32.02/0.8934 | 32.46/0.8968 | 32.20/0.8939 | 29.41/0.8345 | 30.98/0.8606 | 32.44/0.8987
Set14 | 28.50/0.7782 | 28.80/0.7876 | 28.54/0.7808 | 26.02/0.6934 | 25.67/0.6757 | 28.82/0.7876
B100 | 27.53/0.7337 | 27.71/0.7420 | 27.60/0.7366 | 25.18/0.6401 | 24.93/0.6259 | 27.75/0.7428
Urban100 | 26.05/0.7819 | 26.64/0.8033 | 26.21/0.7904 | / | 23.54/0.6926 | 26.71/0.8056
Parameters | 2.0 M | 43 M | 4.8 M | 1.5 M | 4.8 M | 4.48 M
Table 14. We enlarge HMSF to a large size and compare it with other state-of-the-art models that have a similar number of parameters. Red/blue text means the best/second-best performance. All results are for 2×; each cell reports PSNR/SSIM.
Dataset | EDSR | SRGAT | RDN | RCAN | SAN | HMSF-XL
Set5 | 38.11/0.9601 | 38.20/0.9610 | 38.24/0.9614 | 38.27/0.9614 | 38.31/0.9620 | 38.28/0.9616
Set14 | 33.92/0.9195 | 33.93/0.9201 | 34.01/0.9212 | 34.12/0.9216 | 34.07/0.9213 | 34.13/0.9219
B100 | 32.32/0.9013 | 32.34/0.9014 | 32.34/0.9017 | 32.41/0.9027 | 32.42/0.9028 | 32.40/0.9026
Urban100 | 32.93/0.9351 | 32.90/0.9359 | 32.89/0.9353 | 33.34/0.9384 | 33.10/0.9370 | 33.20/0.9384
Average | 34.32/0.9290 | 34.34/0.9296 | 34.37/0.9299 | 34.54/0.9310 | 34.48/0.9308 | 34.50/0.9311
Parameters | 43 M | / | 22.3 M | 16 M | 15.7 M | 15.4 M
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
