Article

An Adaptive Multitask Network for Detecting the Region of Water Leakage in Tunnels

1 School of Information and Control Engineering, Xi’an University of Architecture and Technology, Xi’an 710055, China
2 School of Building Services Science and Engineering, Xi’an University of Architecture and Technology, Xi’an 710055, China
* Author to whom correspondence should be addressed.
Submission received: 3 May 2023 / Revised: 13 May 2023 / Accepted: 15 May 2023 / Published: 19 May 2023
(This article belongs to the Special Issue Machine/Deep Learning: Applications, Technologies and Algorithms)

Abstract

Detecting water leakage in tunnels under complex environments is difficult because the structural similarity between water seepage regions and wet stains obscures edge information. To address this issue, this study proposes a model comprising a multilevel transformer encoder and an adaptive multitask decoder. The multilevel transformer encoder is a hierarchical transformer that extracts multilevel characteristics of the water leakage information, and the adaptive multitask decoder comprises adaptive network branches. The adaptive network branches generate the ground truths of wet stains and water seepage through a threshold value and transmit them to the network for training. A converged network based on U-Net fuses the coarse images from the adaptive multitask decoder, and the fused images are the final segmentation results of water leakage in tunnels. The experimental results indicate that the proposed model achieves 95.1% Dice and 90.4% MIOU. The proposed model demonstrates a superior level of precision and generalization when compared to other related models.

1. Introduction

Due to the rapid development of tunnel traffic in recent years, the public’s attention has shifted from tunnel construction to tunnel maintenance [1]. As a result, water leakage detection has become increasingly relevant. The traditional way of disease detection is mainly manual inspection inside the tunnel. This method is inefficient, and the inspection results depend too heavily on the subjectivity of the inspector, so problems such as missed detections and false detections may occur in the actual detection process [2,3]. In addition to manual visual detection, a number of detection algorithms have been applied to detecting water leakage with the development of computer technology.
There are traditional segmentation algorithms for disease detection, such as threshold [4], edge [5], region [6], and fusion algorithms. Fujita et al. first used a two-image preprocessing method with subtraction and linear emphasis to reduce the influence of shadows [7]. However, this kind of algorithm lacks generalization capability, as it struggles to adapt to the tunnel’s constantly changing light and background. Some researchers have used machine learning methods such as SVM and random forests for tunnel disease detection tasks [8,9]. Although machine learning is easy to explain, machine learning algorithms still have serious disadvantages: they are limited to small sample clusters, are inefficient, and overfit easily when there are too many samples.
In recent years, owing to their strong nonlinear expressive ability, convolutional neural networks have gradually been applied to detect diseases in tunnels [10,11]; they are more intelligent and automatic than manual detection and traditional image processing technology. Most studies initially treated the segmentation task as a pixel-by-pixel classification task. For example, Khalaf, A.F. et al. proposed a new formulation of deep convolutional neural networks to implement segmentation tasks [12]. However, pixel-by-pixel segmentation cannot achieve structured predictions. The advent of FCN [13] solved the problem of structured prediction. Huang et al. adopted the FCN network to achieve segmentation and identification of tunnel leakage diseases [14]. Various algorithms were proposed based on the FCN algorithm, including the Deeplab series of algorithms [15,16,17,18], Segnet [19], U-Net [20], U-Net++ [21], and Unet3+ [22]. Dong et al. proposed a new method that integrates the basic SegNet with a focal loss function; this method can accurately predict small cracks and structural damage in the tunnel [23]. Zhang et al. proposed CrackUnet, an improved automatic crack detection algorithm based on U-Net, and studied the impact of dataset size and model depth on training time, detection accuracy, and speed [24]. Yang et al. used the U-Net++ model to perform pixel-level semantic segmentation of crack images [25]. Li et al. improved the deeplabv3+ algorithm to detect water leakage in subway tunnels and added an ECA-Net channel attention mechanism to two effective feature layers in the codec part of the DeeplabV3+ model, resulting in an effective improvement in detection accuracy [26]. These algorithms perform well and are widely used at present.
However, water leakage detection remains difficult and currently faces two significant challenges:
(1)
The background of the leakage image is complex and strongly affected by the actual engineering environment.
(2)
Because water seepage and wet stains have similar structures, and because of the influence of illumination and shooting angle, the edge area is not easy to detect.
Algorithms such as FCN and U-Net rely on convolution. The lack of a global receptive field leads to information loss during the transmission of water leakage characteristic information, so these algorithms are not very effective in environments with complex water leakage. Some studies have proposed that the Vision Transformer (ViT) can solve this problem [27]. Yang et al. used a ViT as the backbone network to segment crack diseases in tunnels and achieved good results [28]. Although the transformer has a natural global receptive field, it has limitations, such as being computationally intensive and requiring the input resolution of the test data to match that of the training data. In addition, to solve the problem of unclear edge detection, Yang et al. tried to split the labels into complex and simple labels [29]. It turned out that using two decoders trained separately leads to an over-reliance on hand-designed labels, so such methods are inefficient and not suitable for end-to-end networks.
To solve the above problems, we designed the following network:
(1)
Due to the complex background of water leakage images, information loss easily occurs during detection. To solve this problem, the encoder designed in this paper adopts a multilevel transformer and uses depthwise separable convolution to encode positional information. The multilevel transformer also reduces the computational effort by shortening the sequences in the self-attention mechanism and uses a hierarchical structure to extract multiple levels of features.
(2)
To solve the problem of unclear edge segmentation, we designed adaptive multitask network branches that can automatically generate water seepage and wet stain labels without manual labeling. The labels are then fed into the network for training, and the fusion network fuses the rough segmentation maps from the adaptive multitask decoder to obtain the final segmentation image.
The organization of the manuscript is as follows: Section 2 expounds the structure and function of the model; Section 3 covers the datasets and data augmentation methods; Section 4 details the experimental parameters and metrics; Section 5 presents the experimental results; and Section 6 offers the conclusions and limitations of the research.

2. Methods

The structure of the model is shown in Figure 1. The proposed model comprises 3 parts: a multilevel transformer encoder, a decoder composed of adaptive network branches, and a converged network. Section 2.1 provides an overview of the foundational knowledge required for the subsequent sections. Section 2.2 introduces the multilevel transformer encoder, Section 2.3 introduces the decoder, Section 2.4 introduces the converged network, and Section 2.5 introduces the loss function.

2.1. Preliminaries

Vision Transformer (ViT), which has gained popularity as a backbone network in recent years, is derived from the transformer model widely used in Natural Language Processing (NLP). Compared with CNN networks, ViT can model long-range dependence on the input data while guaranteeing only a weak inductive bias on the learned representation.
ViT primarily comprises 3 modules: the Linear Projection module, the Transformer Encoder module, and the MLP Head module. In the Linear Projection module, the image is segmented into patches, the patches are converted into token embeddings, and the corresponding position embeddings are added to the token embeddings. The module also generates a token embedding for the class token and positional encodings for all sequences, finally combining the positional encodings with the token embeddings. The core of the Transformer Encoder module is the multi-head attention mechanism: the image is split into several patches that are reshaped into a 3-dimensional sequence. To obtain the 3 matrices Q, K, and V, the weight matrices W_q, W_k, and W_v of each head are multiplied by the sequence. The image feature information is then extracted by feeding these 3 matrices into the multi-head self-attention mechanism, whose calculation is expressed as Equation (1).
$$A(Q, K, V) = S\!\left(\frac{Q \cdot K^{T}}{\sqrt{d_{head}}}\right) V \tag{1}$$
where Q, K, and V represent the matrices obtained as the products of the different weight matrices and the input vectors; 1/√d_head is the scaling scalar, with d_head set to 64 in this paper; Q·K^T computes the attention score between Q and K^T; and S denotes the Softmax function, which converts the attention scores into probabilities that are multiplied by V to obtain the output matrix. The cost of this calculation governs the efficiency of the multi-head self-attention layer. In the MLP Head module, the tensor output by the transformer encoder has the same shape as the input tensor; if the downstream task is classification, the corresponding class token is extracted to obtain the classification result.
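To make Equation (1) concrete, here is a minimal single-head sketch in PyTorch; the function name and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, d_head=64):
    """Minimal sketch of Equation (1): A(Q,K,V) = S(Q.K^T / sqrt(d_head)) V."""
    # q, k, v: (batch, seq_len, d_head)
    scores = q @ k.transpose(-2, -1) / (d_head ** 0.5)  # attention scores
    weights = F.softmax(scores, dim=-1)                 # S(.): scores -> probabilities
    return weights @ v                                  # weighted sum of value vectors

# Example: out = scaled_dot_product_attention(
#     torch.randn(1, 196, 64), torch.randn(1, 196, 64), torch.randn(1, 196, 64))
```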

2.2. The Multilevel Transformer Encoder

Although many studies have shown that the transformer performs more robustly than convolution, which has attracted considerable attention, using a transformer effectively remains a challenging task. This paper designs a transformer better suited to the water leakage detection task. Figure 2 shows the components of the transformer encoder, and the transformer is described in more detail below.

2.2.1. Multiscale Feature

The transformer component in the encoder employs a multiscale feature extraction approach, allowing it to extract features at a variety of scales from the input image. These multiscale features provide both high-resolution coarse features and low-resolution fine-grained features.

2.2.2. Overlapping Patch Merging

The encoder applies an overlapping patch merging process to the input, as illustrated in Figure 3. The blue part is patch 1, which overlaps with patch 2 and patch 3. Non-overlapping image or feature patches fail to maintain local continuity around patch boundaries. To address this issue, we utilize an overlapping patch merging process: the image is cut into multiple patches, and to let information interact between different patches, adjacent patches must overlap. Patch overlap is achieved by choosing the stride size, where the stride is the number of pixels the cutting window moves, starting from the upper left corner of the image. In this paper, the patch size is set to 7 and the stride to 4.
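One common way to realize overlapping patch merging, as in SegFormer-style encoders, is a strided convolution whose kernel exceeds its stride. The sketch below uses the paper's patch size 7 and stride 4, with padding 3 so neighboring patches share pixels; the class name and defaults are assumptions for illustration.

```python
import torch
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    """Sketch of overlapping patch merging: kernel (7) > stride (4),
    so adjacent patches overlap and local continuity is preserved."""
    def __init__(self, in_ch=3, embed_dim=64, patch_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size,
                              stride=stride, padding=patch_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, C, H/4, W/4)
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)     # (B, N, C) token sequence
        return self.norm(x), H, W

# Usage: tokens, H, W = OverlapPatchEmbed()(torch.randn(1, 3, 224, 224))
```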

2.2.3. Self-Attention Mechanism

For the multi-head self-attention layer in ViT, the computational workload remains heavy; it can be calculated using Equation (2).
$$O = 4NC^{2} + 2CN^{2} \sim N^{2} \tag{2}$$
where O represents the complexity, N indicates the product of the width and height of the feature map (for an input of shape N × C), and C denotes the number of channels. According to Equation (2), the complexity of the self-attention layer is proportional to the square of the product of the image width and height, so the complexity of the multi-head self-attention layer is O(N²). Inspired by Segformer [30], the attenuation ratio R is introduced into the self-attention layer of the multilevel transformer encoder to reduce the product of image width and height, thus lowering the complexity of the self-attention layer. The calculation process is shown in Equations (3) and (4).
$$\hat{K} = \mathrm{Reshape}\!\left(\frac{N}{R},\; C \cdot R\right)(K) \tag{3}$$

$$K = \mathrm{Linear}(C \cdot R,\; C)(\hat{K}) \tag{4}$$
where K̂ represents the intermediate variable, whose dimension varies from (N, C) to (N/R, C·R); the Linear layer then maps K̂ to a matrix K of dimension (N/R, C). When the input matrix is V, its dimension is likewise reduced to (N/R, C) by Equations (3) and (4). To explain Equations (3) and (4) clearly, we draw Figure 4. The orange sequence in Figure 4 is the sequence K, and the figure describes how its length is reduced; the length of sequence V is reduced in the same way. The size of the input matrix Q remains unchanged. Finally, the matrices Q, K, and V are fed into Equation (1) for calculation.
Equation (5) shows the formula used to calculate complexity after the introduction of R .
$$O = \frac{4NC^{2}}{R} + \frac{2CN^{2}}{R} \sim \frac{N^{2}}{R} \tag{5}$$
As shown in Equation (5), the complexity of the self-attention layer of the multilevel transformer encoder decreases to O(N²/R). From stage 1 to stage 4, R is set to 64, 16, 4, and 1, respectively.
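The following PyTorch sketch illustrates the sequence-reduction idea behind Equations (3)–(5). Following SegFormer, the Reshape-plus-Linear pair is realized here as a strided convolution with stride r, which shortens K and V by a factor of r² (so R = r²); the class name, defaults, and this convolutional realization are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ReducedSelfAttention(nn.Module):
    """Sketch of attenuated self-attention: K and V are shortened before
    attention, cutting complexity from O(N^2) to O(N^2 / R) with R = r * r."""
    def __init__(self, dim=64, num_heads=1, r=2):
        super().__init__()
        self.r = r
        if r > 1:
            # Strided conv performs the Reshape(N/R, C*R) + Linear(C*R, C) step.
            self.sr = nn.Conv2d(dim, dim, kernel_size=r, stride=r)
            self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):                    # x: (B, N, C), N = H * W
        B, N, C = x.shape
        kv = x
        if self.r > 1:
            feat = x.transpose(1, 2).reshape(B, C, H, W)
            kv = self.sr(feat).flatten(2).transpose(1, 2)  # (B, N/r^2, C)
            kv = self.norm(kv)
        out, _ = self.attn(query=x, key=kv, value=kv)      # Q keeps full length N
        return out

# Usage: out = ReducedSelfAttention()(torch.randn(1, 56 * 56, 64), 56, 56)
```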

2.2.4. Positional Encoding

The ViT model incorporates positional encoding (PE) to incorporate positional information. However, the utilization of PE necessitates that the input resolution of the test dataset is congruent with the input resolution of the training dataset, thus resulting in a decline in accuracy when tested on datasets with disparate resolutions. Inspired by CPVT [31], the MLP implements 3 × 3 depthwise separable conv (DWconv) with zero padding to circumvent this issue. The DWconv comprises depthwise conv and pointwise conv [32]. In the depthwise convolution stage, each channel of the input feature map is convolved with a convolution kernel to generate an intermediate feature map with the same number of channels as the input feature map. Then, pointwise convolution is carried out, the 1 × 1 convolution kernel is applied to each channel of the intermediate feature map of the previous step, and the output feature map is finally generated.
Depthwise separable convolution effectively reduces the number of parameters and the computational cost by decomposing a standard convolution into depthwise and pointwise steps. In addition, the zero-padding operation encodes absolute position information. Therefore, a 3 × 3 depthwise separable convolution with zero padding is used in the MLP structure of the encoder to transmit positional information. Figure 5 illustrates the structure of the MLP.
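A rough sketch of this MLP design (module and parameter names are assumptions): two linear layers with a zero-padded 3 × 3 depthwise convolution between them, where the second linear layer plays the pointwise role.

```python
import torch
import torch.nn as nn

class DWMLP(nn.Module):
    """Sketch of the MLP block with a zero-padded 3x3 depthwise convolution,
    which injects positional information without explicit positional encoding."""
    def __init__(self, dim=64, expand=4):
        super().__init__()
        hidden = dim * expand
        self.fc1 = nn.Linear(dim, hidden)
        # Depthwise conv: one 3x3 kernel per channel (groups=hidden), padding=1.
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)   # plays the pointwise (1x1) role

    def forward(self, x, H, W):             # x: (B, N, C), N = H * W
        x = self.fc1(x)
        B, N, C = x.shape
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        x = self.dwconv(feat).flatten(2).transpose(1, 2)  # zero padding encodes position
        return self.fc2(self.act(x))
```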

2.2.5. Hyperparameter Configuration in Multi-Level Transformer

To clearly show the hyperparameter configuration of the multilevel transformer, we have designed a hyperparameter configuration table. The configurations are shown in Table 1, which lists the stage, stride, layer, and parameters. Stage = i represents the i-th stage of the multilevel transformer model, and Stride represents the downsampling multiple at each stage. The Layer column gives the name of the module in the current stage; a stage consists of 2 modules, Patch Embed and Transformer Block. Patch Embed halves the input feature map and sends the output to the Transformer Block, which extracts global self-attention and realizes the feature mapping. In Table 1, P_i represents the downsampling multiple of stage i (each stage downsamples by a factor of 2), C_i represents the number of feature map channels after sampling at stage i, H_i represents the number of heads of the multi-head attention mechanism, and R_i represents the channel expansion ratio of the MLP.

2.3. The Adaptive Multitask Decoder

Because there is no obvious boundary between the wet stains at the edge of the water leakage area and the water seepage, it is challenging to segment the edge area with only 1 decoder. To solve this problem, some researchers have used manually made labels to separate the difficult-to-segment and easily segmented areas [29], and this method has been proven to work well. However, it is inefficient. Therefore, this paper proposes an adaptive multitask decoder, in which 3 network branches adaptively generate labels for water seepage and wet stains.
Each adaptive network branch has the same structure and is composed of ordinary convolutions. However, the labels supervising the training of the branches are distinct, enabling adaptive multitask training. The supervision labels are generated adaptively according to a threshold value, and Figure 6 illustrates the process. The input is passed through the Sigmoid function, yielding a matrix of probability values. The network is divided into 2 branches to generate the labels of water seepage and wet stains. In the first branch, if a probability value in the matrix is less than the threshold value R, the corresponding value is set to 0, producing a new matrix. This new matrix is then combined with an all-1 matrix, and the resulting matrix and the original label are subjected to an inner product operation, automatically generating a feature map of 0s and 1s, which is the label of water seepage.
In the second branch, if a probability value in the matrix is greater than the threshold value R, the corresponding value is set to 0, producing a new matrix. This new matrix is then combined with an all-1 matrix, and the resulting matrix and the original label are subjected to an inner product operation, automatically generating a feature map of 0s and 1s, which is the label of wet stains. The different labels then individually supervise the decoder’s network branches during training, thereby achieving adaptive multitasking.
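A minimal sketch of this two-branch label generation, assuming the combination with the all-1 matrix amounts to binarizing the thresholded probabilities before the inner product with the original label; the function name and tensor layout are hypothetical.

```python
import torch

def split_labels(prob, label, R=0.75):
    """Sketch of the adaptive label generation in Figure 6.
    prob: sigmoid output in [0, 1]; label: original 0/1 ground truth."""
    # Branch 1: zero out values below R, i.e., keep confident pixels -> seepage.
    seepage_label = (prob >= R).float() * label
    # Branch 2: zero out values above R, i.e., keep uncertain pixels -> wet stains.
    stain_label = (prob < R).float() * label
    return seepage_label, stain_label

# Usage: seep, stain = split_labels(torch.rand(1, 1, 256, 256),
#                                   torch.randint(0, 2, (1, 1, 256, 256)).float())
```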

2.4. Converged Network

The adaptive multitask decoder outputs a rough segmentation image of the water leakage; its edge information is incomplete, and it contains noise and other interference. Therefore, this paper designs a converged network to calibrate the rough segmentation image and obtain a fine segmentation image of the tunnel water leakage. Much previous research regards UNet as a standard network for image segmentation. UNet has many advantages: it is an encoder-decoder structure shaped like a U, and it is simple and easy to implement. The UNet encoder extracts low-resolution feature information (global features) by downsampling; the decoder recovers high-resolution feature information (local features such as the edge and texture of the water leakage area) by upsampling; and UNet’s skip connections fuse the low-resolution features extracted by the encoder with same-sized high-resolution features in the decoder, alleviating the problem of insufficient information during upsampling and improving the model’s performance. UNet realizes pixel-level segmentation with high accuracy and outputs fine tunnel leakage images. Based on these advantages, we adopt a shallow UNet as the converged network. The converged network, based on the original U-Net, fuses the segmentation images of water leakage from the adaptive multitask decoder. Its convolutional component is a standard convolution, and the convolutional module is composed of a 3 × 3 convolution, a batch norm layer, and the ReLU activation function.
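As a small illustration (the helper name is hypothetical), the convolutional module just described can be written as follows.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Sketch of the converged network's convolutional module:
    3x3 conv -> batch norm -> ReLU, as described above."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# In a shallow UNet, such blocks sit on both the downsampling and upsampling
# paths, with skip connections concatenating encoder and decoder features:
# x = torch.cat([encoder_feat, upsampled_decoder_feat], dim=1)
```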

2.5. Loss Function

The loss function is also a crucial aspect of the segmentation network. The model’s loss function and its definition are presented in Equation (6).
$$Loss_{our} = \sum_{i} \alpha_{i}\, Loss_{BCE}\!\left(GT_{i},\; Pre_{i}\right) \tag{6}$$
where i indexes the different labels, α_i is the corresponding weight, set to 0.25, 0.25, 0.2, 0.2, and 0.25, respectively, and Loss_BCE is the binary cross-entropy loss function, whose definition is presented in Equation (7).
$$Loss_{BCE} = -\frac{1}{n}\sum_{i}\left[y_{i}\log \hat{y}_{i} + (1 - y_{i})\log(1 - \hat{y}_{i})\right] \tag{7}$$
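A sketch of Equation (6) in PyTorch; the pairing of the five α weights with specific supervised outputs is an assumption, since the text lists only the weight values.

```python
import torch
import torch.nn.functional as F

def multitask_loss(preds, gts, alphas=(0.25, 0.25, 0.2, 0.2, 0.25)):
    """Sketch of Equation (6): a weighted sum of binary cross-entropy terms,
    one per supervised output. preds are raw logits; gts are float 0/1 maps."""
    assert len(preds) == len(gts) == len(alphas)
    total = 0.0
    for pred, gt, a in zip(preds, gts, alphas):
        total = total + a * F.binary_cross_entropy_with_logits(pred, gt)
    return total
```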

3. Datasets

3.1. Original Datasets

To demonstrate the method’s efficacy, this paper collects water leakage images from the tunnel between XinWang Road Station of Xi’an Metro Line 14 and the Sports Center, comprising 574 tunnel images obtained through photography, with resolutions ranging from 400 to 500 pixels. Following the definition of the leakage phenomenon outlined in article C.1.5 of the underground waterproofing engineering quality code, the damp areas present in the leakage region are referred to as wet stains. The water leakage images collected in this study exhibit a diverse array of background conditions (different lighting, wire interference, occlusion, dry stains, noise, and interference from other diseases). The background features of the dataset include the following six points:
(1)
Segment joints are prevalent. The tunnel is assembled from jointed segments, so the likelihood of leakage at these joints is greater.
(2)
The seepage area is not a simple, uniform feature; the variation within the class is substantial, and part of the leakage area is disconnected.
(3)
Part of the target is occluded. The lining surface carries lights, supports, equipment, pipelines, etc.
(4)
Background noise. Noise such as cement mortar and scratches is inevitably left on the segments during construction.
(5)
Illumination conditions. Parts of the lining surface are far from the light source, so the appearance of seepage varies considerably with the lighting.
(6)
The edges of the leakage water are faint, and the difference between seepage and the wall is not clear.
Based on the above characteristics, water leakage images are divided into six categories, and Table 2 illustrates the details.
The labeling software Labelme is employed to label the water leakage images, and Figure 7 illustrates the annotation procedure. For object detection, annotation only requires a rectangle to designate the target area. For the semantic segmentation task, however, it is necessary to magnify the water leakage images and annotate them point by point along the contour of the relevant region to attain sub-pixel accuracy; closing all points into a loop completes the annotation of a leakage region. In Figure 7, the label color of the water leakage area is white, and the background area is black. Figure 8 shows some original images and labels to give a sense of the dataset.

3.2. Enhance Datasets

In this paper, spatial-level transformations (rotation, flipping, deformation scaling, etc.) and pixel-level transformations (noise, brightness adjustment) are applied to augment the quantity and diversity of the dataset and to avert overfitting during network training. Figure 9 illustrates some of the data augmentation methods.
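A sketch of such an augmentation pipeline using the albumentations library; the exact transforms and probabilities used by the authors are not specified, so these are illustrative choices.

```python
import albumentations as A

# Spatial- and pixel-level transformations as described above.
augment = A.Compose([
    A.HorizontalFlip(p=0.5),              # spatial: flipping
    A.RandomRotate90(p=0.5),              # spatial: rotation
    A.ShiftScaleRotate(p=0.3),            # spatial: deformation scaling
    A.GaussNoise(p=0.3),                  # pixel: noise
    A.RandomBrightnessContrast(p=0.3),    # pixel: brightness adjustment
])

# Applying the same transform to the image and its mask keeps labels aligned:
# out = augment(image=image, mask=mask); image, mask = out["image"], out["mask"]
```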

4. Experiment Configuration

The computer system is Windows 10, the working memory is 16 GB, and the GPU is NVIDIA GeForce RTX 2080 Ti (11 GB). The experiment used the RMSprop optimizer. The initial learning rate was 1 × 10−4, the weight decay was 1 × 10−8, and the momentum was 0.9. The batch size of the experiment was 300.
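For reference, this optimizer configuration corresponds to the following PyTorch setup; the model here is a placeholder, not the proposed network.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, kernel_size=3, padding=1)  # placeholder for the proposed network
optimizer = torch.optim.RMSprop(
    model.parameters(),
    lr=1e-4,            # initial learning rate
    weight_decay=1e-8,  # weight decay
    momentum=0.9,       # momentum
)
```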

Evaluated Metrics

Some previous works introduced a confusion matrix to intuitively calculate the F1 score, PA, and FWIOU [33,34,35]. As shown in Table 3, the confusion matrix consists of TP, FP, FN, and TN. In addition, MIOU and Dice Loss can be calculated according to their formulas.
The F1 score is a comprehensive evaluation index, defined as shown in Equation (8).
$$F_{1} = \frac{2 \cdot P \cdot R}{P + R} \tag{8}$$
Among them, P = TP/(TP + FP) and R = TP/(TP + FN).
Pixel accuracy ( P A ) represents the ratio of correct pixels to all pixels, and Equation (9) shows its definition:
$$PA = \frac{TP + TN}{TP + FN + FP + TN} \tag{9}$$
FWIOU is the frequency-weighted intersection over union, and Equation (10) shows its definition.
$$FWIOU = \frac{TP + FN}{TP + FP + TN + FN} \cdot \frac{TP}{TP + FP + FN} \tag{10}$$
M I O U is a standard measure of semantic segmentation, and Equation (11) shows its definition.
$$MIOU = \frac{1}{k + 1}\sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}} \tag{11}$$
where k is the number of classes, p_ij denotes the number of pixels with ground-truth label i that are predicted as label j, and p_ii denotes the number of correctly predicted pixels.
D i c e   L o s s is a set similarity measure and is usually used to calculate the similarity of two samples. Equation (12) shows its definition.
$$Dice\ Loss = 1 - \frac{2\left|X \cap Y\right|}{\left|X\right| + \left|Y\right|} \tag{12}$$
where |X ∩ Y| represents the size of the intersection of the sets X and Y, and |X| + |Y| represents the sum of the numbers of elements in the two sets.
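These confusion-matrix definitions can be computed directly, as in the following sketch (the function name and the small epsilon guard are assumptions); note that the Dice value below equals 1 minus the Dice Loss of Equation (12).

```python
import numpy as np

def binary_metrics(pred, gt, eps=1e-8):
    """Sketch of Equations (8), (9), and (12) from a binary confusion matrix.
    pred and gt are 0/1 arrays of the same shape."""
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)  # Equation (8)
    pa = (tp + tn) / (tp + fn + fp + tn + eps)                # Equation (9)
    dice = 2 * tp / (2 * tp + fp + fn + eps)                  # 1 - Dice Loss, Eq. (12)
    return f1, pa, dice
```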

5. Results

5.1. Quality Result

This paper conducts a visualization experiment on the adaptive multitask branches. Figure 10 shows that the adaptive network branches can automatically generate labels for water seepage and wet stains and that the model can separate the seepage area from the wet stain area; the figure also shows the segmentation results for the two regions.
It can be seen from the figure that the water seepage and wet stain labels are generated automatically and then sent to the network for training to obtain the prediction maps of water seepage and wet stains. The prediction maps of the two regions are largely consistent with the labels, which indicates the reliability of the model.

5.2. Stability Results

To demonstrate the stability of the proposed algorithm, we selected images with additive noise, images with chaotic backgrounds, flipped images, images with uneven illumination, and images with object occlusion as the test set. Some of the visualization results are shown in Figure 11.
As can be seen from Figure 11, for the image with added noise in Figure 11a, the proposed model can suppress noise and other interference factors and extract the details of the edge of the leakage area. Figure 11b is an image with a chaotic background and multiple bolt holes, and the model can still effectively segment the leakage area. For flipped images, unevenly illuminated images, and occluded images, the proposed model can still segment the target region, which demonstrates its stability. The stability can also be shown quantitatively by the changes in the evaluation indexes when testing on images from different environments; the changes in MIOU, Dice, PA, and F1 are shown in Table 4.
As can be seen from Table 4, when images with complex backgrounds are tested, Dice is 95.6%, MIOU is 90.7%, PA is 98.5%, and F1 is 95.1%, partly because the training set also contains images with complex environments. When images with noise are tested, Dice is 94.8%, MIOU is 90.3%, PA is 98.1%, and F1 is 94.6%, which is still relatively high. In the other environments, the model’s performance indicators remain good. The stability of the model is thus demonstrated quantitatively by the changes in the indexes across different environments.

5.3. Overfitting Analysis

To show that data enhancement methods such as flipping, rotation, contrast transformation, and noise addition effectively prevent overfitting of the network model, Figure 12 plots the training loss and validation loss curves with and without data enhancement.
Figure 12a shows the overfitting of the network model without data enhancement: blue represents the training loss curve, and orange represents the validation loss curve. The validation loss keeps fluctuating and increasing until the end of validation, while the training loss keeps fluctuating and decreasing during training. This result shows that without data enhancement, the network model overfits. Figure 12b shows the loss curves when the data enhancement method is used; again, blue represents the training loss and orange the validation loss. The validation loss curve becomes stable, with only slight fluctuation, at the 45th epoch, by which time the training loss has settled to around 0.5. This result shows that data enhancement helps the model maintain its optimal performance, and from the trend of the loss curves, we can conclude that the proposed network model does not overfit.

5.4. Compared with SOTA Methods

To provide a clear understanding of the segmentation effect of each model on the water leakage images in tunnels, we compared different algorithms, including TransUNet [36], UNet++, UNet+++, and Deeplabv3+. Figure 13 illustrates some results of the different algorithms.
Figure 13a shows the segmentation results of water leakage images against a background of pipelines, lights, joints, and bolt holes. Even under strong light interference and pipeline occlusion, the proposed model maintains a good segmentation effect and accurately identifies the shape of the tunnel leakage. In the pipeline, light, and tunnel background of Figure 13b, there are pipes, bolt holes, and other objects that resemble the appearance and color of the leak; the proposed model rarely mistakes this background interference for a leak, whereas the other four models misclassify to varying degrees. In the image background of Figure 13c, there are complex elements such as disconnected water seepage areas, joints, and bolt holes; the segmentation results of the proposed model do not spuriously connect the water seepage areas, while the other four models show various segmentation faults. For water leakage images with relatively complex environmental conditions, the proposed network model segments most of the water leakage region from the background. These results show that, compared with the other models, the proposed model segments edge contours more finely and has better anti-interference ability, realizing feature recognition under complex background and lighting conditions.
To illustrate performance, this paper tests the five network models on the same test dataset; Table 5 shows the resulting performance indicators. The PA of the proposed model is 98.3%, which is 2.2%, 19.2%, 14.9%, and 19.9% higher than those of the TransUNet, UNet++, UNet+++, and DeepLabv3+ models, respectively. The MIOU of the proposed model is 90.4%, which is 4.6%, 24.2%, 17.9%, and 8.3% higher than those of the TransUNet, UNet++, UNet+++, and DeepLabv3+ models, respectively. To test whether the complexity of the model decreases after introducing the attenuation ratio R into the self-attention calculation, a comparative experiment was conducted on the number of parameters of the adaptive multitask network and of TransUNet with ViT as the encoder. The number of parameters of TransUNet is 726 M, while that of the adaptive multitask network is 73.9 M, about one-tenth of TransUNet’s; the adaptive multitask network therefore yields higher efficiency and better performance, confirming the effectiveness of the attenuation ratio R. The above results show that the proposed algorithm effectively improves the segmentation accuracy of water leakage in the tunnel lining. Compared with traditional semantic segmentation algorithms, the proposed semantic segmentation algorithm identifies tunnel water leakage more accurately and can locate different types of tunnel leakage and segment their shapes at the pixel level.

5.5. Ablation Study

Because some traditional models require labels to be divided manually, the adaptive multitask network proposed in this paper is intended to achieve satisfactory results without manually partitioned labels. However, the adaptive multitask decoder requires a threshold R, which is reasonable in the range of 0.5–1. To identify the value in this range that makes the model perform best, extensive experiments were conducted empirically. The experimental results are shown in Figure 14, where the blue polyline represents the change in Dice.
The red polyline indicates the change in MIOU as the model takes different thresholds. In the experiment, the threshold is varied from 0.5 to 1, and both polylines peak at around 0.75. Additionally, Table 6 shows the evaluation metrics at threshold values of 0.5, 0.75, and 0.9. According to the table, MIOU, Dice, PA, and F1 reach their peak values when the threshold is set to 0.75.
This paper conducts an ablation study to show that the decoder composed of the adaptive multitask network branches performs better than a decoder composed of a single-task network. Figure 15 shows the visual segmentation results. As can be seen from the figure, compared with the single-task network, the water leakage image segmented by the adaptive multitask decoder is more precise and retains more edge information. The visual results show that the multitask decoder performs better than the single-task one.
Evaluation indexes are introduced to show that the multitask decoder performs better than the single-task decoder; Tables 7 and 8, respectively, give the specific values. The results indicate that the multitask network outperforms the single-task network for both the water seepage and the wet stain regions. The F1 of the adaptive multitask network on the water seepage area is 95.8%, higher than that of the single-task segmentation network, and its MIOU on the wet stain area is 88.4%, also higher than that of the single-task network. These results demonstrate that the proposed model effectively enhances the segmentation accuracy of water leakage in tunnels.

6. Conclusions

This paper designs a multilevel transformer network and an adaptive multitask decoder, which address the challenge of extracting edge details in tunnel water leakage identification and realize intelligent, high-precision detection of tunnel water leakage diseases in complex environments. The experimental results indicate that the proposed model performs excellently. Our contributions are as follows:
(1)
The encoder is a multilevel transformer that addresses the limitations of ViT.
(2)
An adaptive multitask decoder is proposed to accurately segment the water seepage and wet stains from water leakage images in tunnels.
(3)
A converged network is designed to fuse the coarse images of the adaptive multitask decoder.
Our methodology exhibits a commendable degree of efficacy in the segmentation of water leakage, as demonstrated by comparison with other state-of-the-art segmentation techniques. However, the method still requires further enhancement: the computational cost of the model remains substantial, preventing real-time segmentation. In future research, we will continue to improve the image segmentation of water leakage in tunnels.

Author Contributions

Conceptualization, L.Z. and X.Y.; Funding acquisition, S.L.; Investigation, X.Y.; Methodology, L.Z.; Project administration, S.L.; Resources, S.L.; Software, X.Y.; Validation, L.Z. and J.W.; Visualization, J.W.; Writing—original draft, L.Z. and J.W.; Writing—review & editing, S.L. All authors have read and agreed to the published version of the manuscript.

Funding

National Natural Science Foundation of China, No. 51209167, 12002251; Natural Science Foundation of Shaanxi Province, No. 2019JM-474.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Xue, Y.; Cai, X.; Shadabfar, M.; Shao, H.; Zhang, S. Deep learning-based automatic recognition of water leakage area in shield tunnel lining. Tunn. Undergr. Space Technol. 2020, 104, 103524.
2. Wei, F.; Yao, G.; Yang, Y.; Sun, Y. Instance-level recognition and quantification for concrete surface bug hole based on deep learning. Autom. Constr. 2019, 107, 102920.
3. Huang, H.; Li, Q.; Zhang, D. Deep learning based image recognition for crack and leakage defects of metro shield tunnel. Tunn. Undergr. Space Technol. 2019, 77, 166–176.
4. Otsu, N. A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 1979, 9, 62–66.
5. Ma, L. Research on Pavement Crack Recognition Method Based on Digital Image Processing; Southeast University: Nanjing, China, 2018.
6. Li, Q.; Zou, Q.; Zhang, D.; Mao, Q. FoSA: F* seed-growing approach for crack-line detection from pavement images. Image Vis. Comput. 2011, 29, 861–872.
7. Fujita, Y.; Mitani, Y.; Hamamoto, Y. A method for crack detection on a concrete structure. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006; Volume 3, pp. 901–904.
8. Almusawi, A.; Amintoosi, H. DNS tunneling detection method based on multilabel support vector machine. Secur. Commun. Netw. 2018, 2018, 6137098.
9. Buczak, A.L.; Hanke, P.A.; Cancro, G.J.; Toma, M.K.; Watkins, L.A.; Chavis, J.S. Detection of tunnels in PCAP data by random forests. In Proceedings of the 11th Annual Cyber and Information Security Research Conference, Oak Ridge, TN, USA, 5–7 April 2016; pp. 1–4.
10. Bao, Y.; Li, H. Artificial Intelligence for civil engineering. China Civ. Eng. J. 2019, 52, 1–11.
11. Yufei, L.; Jiansheng, F.; Jianguo, N. Review and prospect of digital-image-based crack detection of structure surface. China Civ. Eng. J. 2021, 54, 79–98.
12. Khalaf, A.F.; Yassine, I.A.; Fahmy, A.S. Convolutional neural networks for deep feature learning in retinal vessel segmentation. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 385–388.
13. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
14. Huang, H.; Li, Q. Image recognition for water leakage in shield tunnel based on deep learning. Chin. J. Rock Mech. Eng. 2017, 36, 2861–2871.
15. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv 2014, arXiv:1412.7062.
16. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
17. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
18. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
19. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
20. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2015; pp. 234–241.
21. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support; Springer: Cham, Switzerland, 2018; pp. 3–11.
22. Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Wu, J. Unet 3+: A full-scale connected unet for medical image segmentation. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1055–1059.
23. Dong, Y.; Wang, J.; Wang, Z.; Zhang, X.; Gao, Y.; Sui, Q.; Jiang, P. A deep-learning-based multiple defect detection method for tunnel lining damages. IEEE Access 2019, 7, 182643–182657.
24. Zhang, L.; Shen, J.; Zhu, B. A research on an improved Unet-based concrete crack detection algorithm. Struct. Health Monit. 2021, 20, 1864–1879.
25. Yang, Q.; Ji, X. Automatic pixel-level crack detection for civil infrastructure using Unet++ and deep transfer learning. IEEE Sens. J. 2021, 21, 19165–19175.
26. Li, M.; Wang, H.; Zhang, S.; Gao, P. Subway Water Leakage Detection Based on Improved deeplabV3+. In Proceedings of the 2022 IEEE 2nd International Conference on Computer Systems (ICCS), Qingdao, China, 23–25 September 2022; pp. 93–97.
27. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Houlsby, N. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
28. Yang, G.; Liu, K.; Zhang, J.; Zhao, B.; Zhao, Z.; Chen, X.; Chen, B.M. Datasets and processing methods for boosting visual inspection of civil infrastructure: A comprehensive review and algorithm comparison for crack classification, segmentation, and detection. Constr. Build. Mater. 2022, 356, 129226.
29. Yang, L.; Wang, H.; Zeng, Q.; Liu, Y.; Bian, G. A hybrid deep segmentation network for fundus vessels via deep-learning framework. Neurocomputing 2021, 448, 168–178.
30. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090.
31. Chu, X.; Tian, Z.; Zhang, B.; Wang, X.; Wei, X.; Xia, H.; Shen, C. Conditional positional encodings for vision transformers. arXiv 2021, arXiv:2102.10882.
32. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258.
33. Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020, 21, 6.
34. Zhang, D.; Wang, J.; Zhao, X. Estimating the uncertainty of average F1 scores. In Proceedings of the 2015 International Conference on the Theory of Information Retrieval, Northampton, MA, USA, 27–30 September 2015; pp. 317–320.
35. Li, R.; Zheng, S.; Duan, C.; Wang, L.; Zhang, C. Land cover classification from remote sensing images based on multi-scale fully convolutional network. Geo-Spat. Inf. Sci. 2022, 25, 278–294.
36. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306.
Figure 1. The overall model. The model consists of a multilevel transformer encoder, an adaptive multitask decoder, and a converged network.
Figure 2. Structure diagram of the multilevel transformer encoder. The transformer block comprises a layer norm, self-attention layer, DWMLP, etc.
Figure 3. Interaction of overlapping patches. Two patches overlap to exchange information.
Figure 4. Flow diagram of decreasing the length of sequence K.
Figure 5. The composition structure of the MLP module. The MLP module introduces a 3 × 3 depthwise separable convolution with zero padding (DWconv).
Figure 6. Schematic diagram of the adaptive multitask decoder. Two kinds of labels are automatically generated using the threshold R.
Figure 7. Image labeling process. (a) Original images; (b) labeling images.
Figure 8. Original images and labels. (a–f) Original images; (g–l) ground truth.
Figure 9. Original images and data-enhanced images: (a–f) original images; (g) image flip; (h) image rotated 90°; (i) image rotated 180°; (j) Gaussian noise; (k) image brightened; (l) image darkened.
Figure 10. The experiments of the adaptive multitask decoder. The adaptive multitask decoder automatically generates the water seepage and wet stain labels. (a,g) Original image; (b,h) ground truth of the whole image; (c,i) segmentation results of the water seepage; (d,j) ground truth of the water seepage; (e,k) segmentation results of the wet stains; (f,l) ground truth of the wet stains.
Figure 11. Visualization results in different environments. (a) The image with additive noise; (b) the image with chaotic backgrounds; (c) the flipped image; (d) the image with uneven illumination; (e) the image with object occlusion; (f–j) ground truth; (k–o) visualization results of water leakage area segmentation.
Figure 12. The variation curves of training loss and validation loss with and without data enhancement. (a) Without data enhancement; (b) with data enhancement.
Figure 13. Visual results of different algorithms. (a–e) Original images; (a1–e1) ground truth; (a2–e2) proposed method; (a3–e3) TransUNet; (a4–e4) UNet++; (a5–e5) UNet+++; (a6–e6) Deeplabv3+.
Figure 14. The changing polylines of MIOU and Dice when the model adopted different thresholds.
Figure 15. Visual results of the adaptive multitask network and the single-task network. (a,e) Original image; (b,f) ground truth; (c,g) single-task network; (d,h) adaptive multitask network.
Table 1. Parameter configuration table of the multilevel Transformer.

| Stage | Stride | Layer             | Parameter          |
|-------|--------|-------------------|--------------------|
| 1     | 1      | Patch Embed       | P1 = 2, C1 = 64    |
|       |        | Transformer Block | H1 = 1, R1 = 4 × 2 |
| 2     | 2      | Patch Embed       | P2 = 2, C2 = 128   |
|       |        | Transformer Block | H2 = 2, R2 = 4 × 2 |
| 3     | 2      | Patch Embed       | P3 = 2, C3 = 256   |
|       |        | Transformer Block | H3 = 4, R3 = 4 × 2 |
| 4     | 2      | Patch Embed       | P4 = 2, C4 = 512   |
|       |        | Transformer Block | H4 = 8, R4 = 4 × 2 |
Table 2. Shield tunnel leakage in different backgrounds. Water leakage images are divided into six categories.

| Category | Description                                           | Train | Validation | Test |
|----------|-------------------------------------------------------|-------|------------|------|
| 1        | stitching + screw bolt                                | 41    | 15         | 6    |
| 2        | stitching + screw bolt + shielding                    | 35    | 12         | 7    |
| 3        | stitching + screw bolt + shadow                       | 60    | 14         | 5    |
| 4        | stitching + screw bolt + pipe + light                 | 31    | 7          | 8    |
| 5        | stitching + screw bolt + pipeline + light + shielding | 47    | 8          | 4    |
| 6        | region not connected                                  | 45    | 9          | 7    |
Table 3. Confusion matrix.

| Predicted \ Actual | Water Leakage | Background |
|--------------------|---------------|------------|
| Water Leakage      | TP            | FP         |
| Background         | FN            | TN         |
Table 4. The changes of MIOU, Dice, PA, and F1 in different environments.

| Environments            | Dice  | MIOU  | PA    | F1    |
|-------------------------|-------|-------|-------|-------|
| Additive noise          | 0.948 | 0.903 | 0.981 | 0.946 |
| Chaotic backgrounds     | 0.956 | 0.907 | 0.985 | 0.951 |
| Geometric modifications | 0.946 | 0.898 | 0.972 | 0.946 |
| Uneven illumination     | 0.949 | 0.896 | 0.965 | 0.939 |
| Object occlusion        | 0.958 | 0.909 | 0.961 | 0.938 |
Table 5. Segmentation results of different methods on the water leakage dataset.

| Method     | F1    | PA    | MIOU  | FWIOU | Dice  |
|------------|-------|-------|-------|-------|-------|
| Ours       | 0.947 | 0.983 | 0.904 | 0.971 | 0.951 |
| TransUNet  | 0.916 | 0.961 | 0.858 | 0.936 | 0.915 |
| UNet++     | 0.801 | 0.791 | 0.662 | 0.833 | 0.794 |
| UNet+++    | 0.848 | 0.834 | 0.725 | 0.856 | 0.844 |
| DeepLabv3+ | 0.785 | 0.784 | 0.821 | 0.832 | 0.756 |
Table 6. Threshold R experimental setup.

| R    | MIOU  | Dice  | PA    | F1    |
|------|-------|-------|-------|-------|
| 0.5  | 0.822 | 0.852 | 0.937 | 0.931 |
| 0.75 | 0.904 | 0.951 | 0.983 | 0.947 |
| 0.9  | 0.833 | 0.862 | 0.943 | 0.926 |
Table 7. Evaluation metrics of the adaptive multitask network and the branch of the water seepage.

| Method                     | F1    | PA    | MIOU  | FWIOU | Dice  |
|----------------------------|-------|-------|-------|-------|-------|
| Adaptive Multitask Network | 0.958 | 0.988 | 0.906 | 0.962 | 0.948 |
| Single-task Network        | 0.912 | 0.951 | 0.852 | 0.923 | 0.921 |
Table 8. Evaluation metrics of the adaptive multitask network and the branch of the wet stains.

| Method                     | F1    | PA    | MIOU  | FWIOU | Dice  |
|----------------------------|-------|-------|-------|-------|-------|
| Adaptive Multitask Network | 0.937 | 0.964 | 0.884 | 0.956 | 0.928 |
| Single-task Network        | 0.906 | 0.932 | 0.836 | 0.896 | 0.906 |
