Article

A Cross-Attention Mechanism Based on Regional-Level Semantic Features of Images for Cross-Modal Text-Image Retrieval in Remote Sensing

School of Information and Communication, National University of Defense Technology, Wuhan 430074, China
*
Author to whom correspondence should be addressed.
Submission received: 25 October 2022 / Revised: 12 November 2022 / Accepted: 25 November 2022 / Published: 29 November 2022
(This article belongs to the Section Earth Sciences)

Abstract

With the rapid development of remote sensing (RS) observation technology over recent years, cross-modal retrieval of RS images based on high-level semantic association has drawn increasing attention. However, few existing studies on cross-modal retrieval of RS images address the mutual interference between image semantic features caused by “multi-scene semantics”. Therefore, we propose a novel cross-attention (CA) model, called CABIR, based on regional-level semantic features of RS images for cross-modal text-image retrieval. The CA mechanism implements cross-modal information interaction and lets textual semantics guide the network in allocating weights to image regions and filtering redundant features, thereby reducing the effect of irrelevant scene semantics on retrieval. Furthermore, we propose BERT plus Bi-GRU, a new approach to generating statement-level textual features, and design an effective temperature control function to keep the CA network running smoothly. Our experiments suggest that CABIR not only outperforms other state-of-the-art cross-modal image retrieval methods but also demonstrates high generalization ability and stability, with mean recall reaching 18.12%, 48.30%, and 55.53% on the RSICD, UCM, and Sydney datasets, respectively. The proposed model offers a possible solution to the mutual interference that arises in RS images with “multi-scene semantics” caused by complex terrain objects.

1. Introduction

With the rapid development of remote sensing observation technology over the past few years, the volume of RS image data has grown rapidly [1]. How to quickly and accurately retrieve useful information from such large amounts of RS data has become a challenge and has gained widespread attention from researchers [2,3,4]. This technology has broad prospects in disaster relief, disaster forewarning, resource management, and other fields [5,6,7]. In terms of the data modalities involved in the retrieval task, RS image retrieval can be classified into two modes: single-modal retrieval and cross-modal retrieval [8]. In single-modal retrieval, both the query data and the archived data are RS images, and existing research has mostly concentrated on content-based image retrieval (CBIR) [9], whose accuracy has improved tremendously owing to the application and development of deep metric learning over the past few years [10,11,12,13]. At the same time, the application of deep hashing methods has made large-scale RS image retrieval possible [14,15,16,17]. While single-modal retrieval has made enormous progress, it has inherent weaknesses in retrieval flexibility and semantic richness. Thus, cross-modal retrieval has recently become appealing to researchers [18].
So-called cross-modal retrieval means that the query data and the target data belong to different modalities (e.g., searching images with text or voice). Cross-modal retrieval is typically more challenging than single-modal retrieval because there are heterogeneous gaps between data from different modalities [19,20], making it hard to maintain semantic alignment. In fact, cross-modal retrieval has already achieved satisfactory accuracy in the field of natural images [21,22]. Therefore, predecessors have explored in depth how cross-modal retrieval differs between the two fields. Ref. [7] suggests that the multiple scales, small targets, high resolution, and lack of annotation information of RS images impede effective cross-modal retrieval. Ref. [18] further summarizes three major challenges in this field: a large number of redundant features in RS images, high intraclass similarity, and the coarse semantic granularity of traditional RS image datasets. To address these issues, a number of cross-modal RS image retrieval methods have been proposed in the past three years.
(1) Semantic alignment. To address the problem of semantic alignment, Ref. [9] designed a novel multi-objective loss function that maintains the semantic alignment between images and texts via an adversarial loss. Ref. [23] proposed an adversarial learning-guided multilabel attention mechanism to ensure semantic association between modalities and combined the hashing module with an asymmetric loss, a cosine triplet margin loss, and a cosine quantization loss so as to give higher similarity to the hash codes of matched sample pairs. The image-sound retrieval model SCRL narrows heterogeneous semantic gaps by simultaneously modeling the relations among cross-modal paired data, cross-modal nonpaired data, and single-modal data [24]. Researchers [19] have also designed an effective triplet selection function that uses the hardest negative example to constrain cross-modal semantic alignment. All the above methods start from improving the objective function, trying to model more compact cross-modal semantic associations to constrain network training, and they have achieved good results.
(2) Redundant features. Focusing on the multiple scales and redundant features of RS images, Ref. [18] designed an asymmetric multimodal feature matching network (AMFMN), which not only accommodates multiscale feature input but also filters redundant features out of images via a multiscale visual self-attention (MVSA) module. Specifically, the MVSA module includes a multiscale feature fusion network, which extracts the joint semantic information between layers, as well as a redundant feature filter, which extracts the salient information from the joint feature representation. Ref. [7] proposed a deep semantic alignment network based on gate and attention mechanisms, which achieved satisfactory results in strengthening the correspondence between image regions and words and filtering out redundant information. Both methods have achieved good results.
(3) High intraclass similarity. To address this problem, researchers [9] achieved a more balanced distribution inside unimodal data by embedding a contrastive learning module into each of the two unimodal encoders. In particular, they [9] experimentally examined vision augmentation methods in detail, drew a reliable conclusion, and proposed a rule-based text augmentation method in which words are replaced by semantically similar ones. According to their ablation study [9], removing the two contrastive learning modules impairs the model’s performance far more seriously than any other modification. Ref. [18] designed a triplet loss function with a dynamic variable margin, which replaces the fixed margin of the traditional triplet loss with one based on the prior similarity of sample pairs, easing the problem of high intraclass similarity of RS images.
Although all the above methods have to some extent addressed the difficulties of cross-modal RS image retrieval and improved model precision, it seems to us that this field still faces the challenge of multi-scene semantics in RS images. We interpret multi-scene semantics in the following way: unlike natural images, which typically focus on a single subject, RS images, as a reflection of objective things, contain richer and more obvious spatial information, geometric structure, texture information, and other details [25]. RS images are referred to as “multi-scene semantic remote sensing images” (or “multilabel remote sensing images” in [26]) because they typically contain multiple objects, which may be loosely correlated in semantic terms and belong to different scenes. As a result, for a specific query, only some areas of an RS image are related to it, while the others become “redundant features” disturbing the semantic association between the query data and the target data. Despite the description of “redundant features” in [18], it seems to us that the so-called “redundancy” is not fixed but closely tied to the query; for instance, if the same image is queried using two different textual statements, the “redundant features” will differ. Therefore, it may not be optimal to filter out redundant features through image processing alone without considering the query data. Adopting a cross-attention mechanism, with textual semantics guiding the network to filter redundant features out of images, is likely a feasible solution to this problem [27]. The principle of this solution is shown in Figure 1. Indeed, the attention mechanism has been widely applied in cross-modal retrieval of natural images, and its feasibility has been proven experimentally [27,28,29,30].
Inspired by the above methods and focusing on the mutual interference between multi-scene semantic information in RS images, this paper puts forward a new model for cross-modal image-text retrieval of RS images, called the cross-attention model based on image region-level semantic features (CABIR). To acquire region-level image features better suited to the cross-attention mechanism, we abandoned the global average pooling layer common in convolutional neural networks (CNNs) and instead adopted the output of the last convolutional layer as the region-level semantic features of images. To acquire statement-level text features, we designed a deep network combining BERT and Bi-GRU. The reason is that RS images have such high spatial resolution that it is difficult to build close connections, such as quantitative and spatial position relations, between a single image region and individual words; it is more appropriate to model the whole statement semantically. Next, in order to constrain paired semantic relations, we created a cross-attention layer (CAL) and designed a dedicated temperature control function to control this cross-modal information exchange process. Finally, we used a triplet loss [7], which is well suited to cross-modal tasks, to constrain network training. In summary, this paper makes the following two major contributions:
  • First, it has proposed a cross-attention mechanism-based cross-modal network framework to tackle the problem of mutual interference of RS images with multi-scene semantics.
  • Second, different from the combination of word-level textual information and region-level image information in previous cross-attention models, we adopted the correspondence between statement-level textual information and region-level image information and proposed a method for statement-level text feature generation based on the combination of the pretrained BERT model and Bi-GRU.
The remainder of this paper is organized as follows: our proposed method is described in detail in Section 2; the experimental results are presented and analyzed in Section 3; Section 4 further analyzes the experimental results and discusses the details of the method; and Section 5 concludes the paper.

2. Methodology

In this section, our proposed model CABIR is presented in five respects: (1) problem analysis; (2) the image feature extraction module; (3) the text feature extraction module; (4) the cross-attention module; and (5) the objective function. The overall framework of this model is shown in Figure 2, mainly including an image extraction network, a text extraction network, and a cross-attention network.

2.1. Problem Analysis

In cross-modal image-text retrieval tasks on RS images, a basic assumption can be made: the semantic information contained in a single RS image is typically richer than that of the matched texts (see Figure 1). In other words, a single text typically describes only one aspect of an RS image. Then, for a specific user query, typically only a few areas of an image are related to it. With the global average pooling strategy adopted at the end of traditional CNNs, all image areas (including the unrelated ones) contribute to the final image representation at the same rate; the inability to suppress unrelated areas or enhance related ones at the source inevitably reduces the similarity between the query texts and the target images, thereby limiting model precision. Briefly, this can be reasoned mathematically:
Denote an arbitrary image consisting of $L$ areas by $x_{img}$, and denote these image areas by $x_i$, $i \in [1, L]$. After CNN processing, the image feature vector and the area feature vectors are $x_{img} = \{a_j \mid j = 1, \dots, n,\; a_j \in \mathbb{R}\}$ and $x_i = \{b_{ij} \mid j = 1, \dots, n,\; b_{ij} \in \mathbb{R}\}$, $i \in [1, L]$, respectively. In a CNN where the global average pooling strategy is adopted, the elements of $x_{img}$ and those of $x_i$ are related by
$$a_j = \frac{1}{L}\sum_{i=1}^{L} b_{ij}, \quad j \in [1, n] \tag{1}$$
The feature vector extracted from the text via the text encoder is $y_{text} = \{c_j \mid j = 1, \dots, n,\; c_j \in \mathbb{R}\}$. At this point, by computing the cosine distance between the whole image and the text and that between each image area and the text, one obtains the relations:
$$\mathrm{sim}(x_{img}, y_{text}) = \frac{x_{img}\, y_{text}^{T}}{\|x_{img}\| \times \|y_{text}\|} = \frac{\sum_{j=1}^{n} a_j c_j}{\sqrt{\sum_{j=1}^{n} a_j^2}\sqrt{\sum_{j=1}^{n} c_j^2}} = \frac{\sum_{i=1}^{L}\sum_{j=1}^{n} b_{ij} c_j}{\sqrt{\sum_{j=1}^{n}\left(\sum_{i=1}^{L} b_{ij}\right)^2}\sqrt{\sum_{j=1}^{n} c_j^2}} \tag{2}$$
$$\mathrm{sim}(x_i, y_{text}) = \frac{\sum_{j=1}^{n} b_{ij} c_j}{\sqrt{\sum_{j=1}^{n} b_{ij}^2}\sqrt{\sum_{j=1}^{n} c_j^2}} \tag{3}$$
From Formula (3), if the text is negatively correlated with image area $x_i$, then we must have
$$\sum_{j=1}^{n} b_{ij} c_j < 0 \tag{4}$$
The last expression in Formula (2) decomposes the image-text cosine distance into contributions from each image area $x_i$. Combining Formulas (2) and (3), if some image area is negatively correlated with the text, then the contribution from that area is also negative. Conversely, if the image areas negatively correlated with the query text are suppressed when the final image representation is computed, and greater weights are assigned to strongly correlated image areas, then the similarity between positive examples and query texts must increase, thus alleviating the problem of multi-scene semantic image retrieval.
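To make this argument concrete, the following toy sketch (ours, not from the paper; the feature dimension and values are arbitrary) contrasts global average pooling with text-guided region weighting for an image containing one related and one unrelated region:

```python
# Toy illustration: a region negatively correlated with the query text drags
# the globally pooled image-text similarity down; re-weighting regions by
# their relevance to the text raises it.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
text = rng.normal(size=8)                      # toy text feature y_text
related = text + 0.1 * rng.normal(size=8)      # region aligned with the text
unrelated = -text + 0.1 * rng.normal(size=8)   # region negatively correlated
regions = np.stack([related, unrelated])       # L = 2 image regions

pooled = regions.mean(axis=0)                  # global average pooling, Formula (1)
print("pooled similarity:", cosine(pooled, text))

# text-guided weighting: softmax over region-text cosine similarities
scores = np.array([cosine(r, text) for r in regions])
weights = np.exp(5 * scores) / np.exp(5 * scores).sum()
weighted = (weights[:, None] * regions).sum(axis=0)
print("weighted similarity:", cosine(weighted, text))
```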

2.2. Image Feature Extraction

The first task in this paper is to extract region-level image features. Existing studies show that CNNs can perform deep semantic feature extraction from images excellently. Here, we used ResNet152 [31] pretrained on the ImageNet dataset as the image feature extractor and removed the final fully connected layer and the global average pooling layer to acquire the region-level semantic features of images. Bypassing the global average pooling layer means the high-level semantic features of all image areas are preserved intact, with their weights left undetermined so that they can be further allocated and selected by the CAL. The specific process is shown in Figure 3. During training, we fine-tuned the parameters of ResNet152 to adapt them to our specific task. Given the image set $X$, the image feature extraction process can be formulated as $F = f(X, \theta_v)$.
One thing to note is that, since the global average pooling layer was removed, the features extracted from a single image are no longer a one-dimensional vector but a two-dimensional matrix, each row of which serves as the feature representation of one image area.
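A minimal sketch of this extraction step, assuming torchvision ≥ 0.13; the fine-tuning setup and data pipeline are omitted:

```python
# ResNet152 pretrained on ImageNet, with avgpool and fc removed so that the
# last convolutional feature map is kept as region-level features.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
# keep everything up to (and including) the last convolutional stage
encoder = nn.Sequential(*list(backbone.children())[:-2])

images = torch.randn(4, 3, 224, 224)            # a dummy minibatch
feat_map = encoder(images)                      # (4, 2048, 7, 7)
# flatten the spatial grid into L = 49 regions, each a 2048-d vector
regions = feat_map.flatten(2).permute(0, 2, 1)  # (4, 49, 2048)
print(regions.shape)
```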

2.3. Text Feature Extraction

Since this paper uses “statement-level text features” and “region-level image features” for the cross-attention operation, generating more discriminative statement-level text features is a key concern. Here, we began by using the pretrained BERT-base [32] network to extract word features from each statement. Denoting the text set by $Y$, this process can be formulated as $T = g(Y, \theta_t)$.
After BERT processing, each statement can be expressed as a sequence of word vectors. Assuming that a statement contains $M$ words, it can be expressed as the vector set $t_i = \{y_j \mid j = 1, \dots, M,\; y_j \in \mathbb{R}^{768}\}$, where $y_j$ denotes a word vector. Afterwards, in order to obtain an overall semantic representation of the statement and to map the statement features into a Euclidean space with the same dimension as the image features, we feed $t_i$ to a bidirectional GRU [33]. The process is as follows:
$$\overrightarrow{h_j} = \mathrm{GRU}_{forward}(\overrightarrow{h_{j-1}}, y_j) \tag{5}$$
$$\overleftarrow{h_j} = \mathrm{GRU}_{backward}(\overleftarrow{h_{j+1}}, y_j) \tag{6}$$
where $\overrightarrow{h_j}$ and $\overleftarrow{h_j}$ denote the hidden states of the GRU in forward and backward propagation, respectively. Finally, the last-layer hidden states in both directions are taken, and their mean is used as the final feature of the statement:
$$t_i = \frac{\overrightarrow{h_M} + \overleftarrow{h_M}}{2} \tag{7}$$
Up to this point, we have obtained the statement-level features $T = \{t_i \mid i = 1, \dots, N,\; t_i \in \mathbb{R}^{2048}\}$ of the text set.
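A minimal sketch of the BERT plus Bi-GRU encoder described above, using the Hugging Face transformers library; the GRU hidden size follows the 2048-dimensional feature space in the text, while other settings are our assumptions:

```python
# Word vectors from pretrained BERT-base are fed to a bidirectional GRU; the
# final forward and backward hidden states are averaged, as in Formula (7).
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bigru = nn.GRU(input_size=768, hidden_size=2048,
               batch_first=True, bidirectional=True)

sentences = ["a plane is parked on the apron",
             "there are large areas of brown farmland"]
batch = tokenizer(sentences, padding=True, return_tensors="pt")
word_feats = bert(**batch).last_hidden_state     # (N, M, 768)
_, h_n = bigru(word_feats)                       # h_n: (2, N, 2048)
statement_feats = (h_n[0] + h_n[1]) / 2          # (N, 2048)
print(statement_feats.shape)
```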

2.4. Cross-Attention Layer (CAL)

This layer performs attention aggregation over region-level image features, using text features as the clue. First, the correlation weights between each area and the texts are computed. Next, a weighted summation with these weights generates the overall semantic features of the images. Then, the image-text similarities are calculated. Finally, the matrix of image-text similarities is output. The detailed structure of the CAL is shown in Figure 4. The concrete calculation process is as follows.
Step 1: Use three fully connected layers to map the region-level image features into a key tensor and a value tensor containing all semantic information of the images, and the text features into a query matrix containing all semantic information of the texts.
$$M_{key} = Fc\_key(F, \theta_{key}) \tag{8}$$
$$M_{val} = Fc\_val(F, \theta_{val}) \tag{9}$$
$$M_{que} = Fc\_que(T, \theta_{que}) \tag{10}$$
At this point, $M_{key}$ and $M_{val}$ are both $N \times 2048 \times L$ tensors, where $N$ is the number of images, 2048 is the feature dimension, and $L$ is the number of image areas. $M_{que}$ is an $N \times 2048$ matrix, where $N$ is the number of texts and 2048 is the feature dimension.
Step 2: Calculate the cosine distance between each area $k_{ij} \in \mathbb{R}^{2048}$ of each image $k_i$ in $M_{key}$ and each text $q_l \in \mathbb{R}^{2048}$ in $M_{que}$:
$$\mathrm{sim}(k_{ij}, q_l) = \frac{k_{ij}^{T} q_l}{\|k_{ij}\|\,\|q_l\|} \tag{11}$$
This formula expresses the cosine distance between the jth area of the ith image and the lth text.
Step 3: Normalize the cosine distance found from (11) to give the weight of correlation between the jth area of the ith image and the lth text.
$$\alpha_{ij}^{(l)} = \frac{e^{g(h)\,\mathrm{sim}(k_{ij}, q_l)}}{\sum_{j'=1}^{J} e^{g(h)\,\mathrm{sim}(k_{ij'}, q_l)}} \tag{12}$$
where $g(h)$ is a temperature control function, which will be discussed later. According to the weight scores, one can find the feature representation of the ith image fused with the lth piece of textual information (hereinafter referred to as the image fusion representation):
$$f_i^{(l)} = \sum_{j=1}^{J} \alpha_{ij}^{(l)} v_{ij} \tag{13}$$
where $\alpha_{ij}^{(l)}$ is the weight of correlation between the jth area of the ith image and the lth text, as defined in (12), and $v_{ij}$ is the jth area of image $v_i$ in $M_{val}$.
Step 4: Calculate the similarity matrix $S$ between the image features $F$ and the text features $T$. The similarity between the ith image and the lth text can be formulated as:
$$S_{il} = \frac{f_i^{(l)T} t_l}{\|f_i^{(l)}\|\,\|t_l\|} \tag{14}$$
Step 5: On the temperature control function $g(h)$: since the cosine distances lie within [−1, 1], there would be no obvious distinction in the weight scores $\alpha$ between different areas, so it is best to multiply the distances by an amplification coefficient. However, this coefficient should not be a constant, because the weighting process is essentially one of judging whether an image area is important. At the initial stage of model training, the image and text features extracted by the different networks may be so loosely correlated in semantic terms that the CAL cannot accurately judge which area outweighs the others. If the disparity between image areas were amplified at this stage, the model would very likely fall into a locally optimal solution. Therefore, following an idea similar to simulated annealing, we designed a temperature control function $g(h)$ that satisfies the criteria below.
  • $g(h)$ should be close to zero at the initial stage of network training, so that the weights on image areas are close to one another, effectively degenerating towards the global average pooling strategy.
  • $g(h)$ should then grow and converge to a certain constant after some epochs of training, which avoids the function value becoming so large that a minority of image areas take up an extremely large share of the weight.
$$g(h) = \frac{u}{1 + e^{(r - h)}} \tag{15}$$
where $h$ denotes the training epoch, $u$ is the constant to which the function ultimately converges, and $r$ denotes the epoch at which the midpoint is reached. The graph of $g(h)$ is shown in Figure 5.
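The following sketch summarizes Steps 1–5 in code; module and variable names are ours, and it is an illustration of the computation rather than the authors’ implementation:

```python
# Cross-attention layer (CAL) with the temperature control function g(h).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def g(h, u=5.0, r=3.0):
    # Formula (15): near 0 early in training, converging to u, midpoint at h = r
    return u / (1.0 + math.exp(r - h))


class CrossAttentionLayer(nn.Module):
    def __init__(self, dim=2048):
        super().__init__()
        self.fc_key = nn.Linear(dim, dim)
        self.fc_val = nn.Linear(dim, dim)
        self.fc_que = nn.Linear(dim, dim)

    def forward(self, regions, texts, epoch):
        # regions: (N_img, L, dim) region-level image features
        # texts:   (N_txt, dim)    statement-level text features
        key = F.normalize(self.fc_key(regions), dim=-1)      # (N_img, L, dim)
        val = self.fc_val(regions)                           # (N_img, L, dim)
        que = F.normalize(self.fc_que(texts), dim=-1)        # (N_txt, dim)

        # cosine similarity of every region with every text, Formula (11)
        sim = torch.einsum("ild,td->ilt", key, que)          # (N_img, L, N_txt)
        # temperature-scaled softmax over regions, Formula (12)
        alpha = torch.softmax(g(epoch) * sim, dim=1)
        # text-conditioned image representation, Formula (13)
        fused = torch.einsum("ilt,ild->itd", alpha, val)     # (N_img, N_txt, dim)
        # image-text similarity matrix, Formula (14)
        return F.cosine_similarity(fused, texts.unsqueeze(0), dim=-1)


cal = CrossAttentionLayer()
S = cal(torch.randn(4, 49, 2048), torch.randn(4, 2048), epoch=10)
print(S.shape)  # torch.Size([4, 4])
```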

2.5. Objective Function

To better guide model training, this paper uses the image-text bidirectional triplet ranking loss proposed in [34] as a constraint. The basic idea is that, in the public feature space, mutually paired images and texts serve as positive examples, while unpaired images and texts serve as negative ones, and a closer distance is expected between positive examples, as defined by the formula below.
$$L(F, T) = \sum_{i=1}^{m}\left\{\max\left[\beta + S(f_i, t_i^{-}) - S(f_i, t_i),\ 0\right] + \max\left[\beta + S(f_i^{-}, t_i) - S(f_i, t_i),\ 0\right]\right\} \tag{16}$$
where $\beta$ is the margin. Since a minibatch training strategy is usually adopted to update network parameters more stably, $f_i^{-}$ denotes the negative image samples within the same batch, $t_i^{-}$ denotes the negative text samples within the same batch, and $m$ is the minibatch size. During training, the optimal model parameters are obtained by minimizing $L(F, T)$.
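A minimal sketch of this loss over a minibatch similarity matrix; the batch-all summation follows Formula (16), and the margin value used here is an assumption:

```python
# Bidirectional triplet ranking loss over a minibatch, given the image-text
# similarity matrix S (diagonal entries are the matched pairs).
import torch

def triplet_ranking_loss(S, margin=0.2):
    # S: (m, m) similarity matrix; S[i, i] is a positive pair
    m = S.size(0)
    pos = S.diag().view(m, 1)
    cost_txt = (margin + S - pos).clamp(min=0)       # image vs. negative texts
    cost_img = (margin + S - pos.t()).clamp(min=0)   # text vs. negative images
    mask = torch.eye(m, dtype=torch.bool, device=S.device)
    cost_txt = cost_txt.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)
    return cost_txt.sum() + cost_img.sum()

loss = triplet_ranking_loss(torch.rand(64, 64), margin=0.2)
```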

3. Experimental Results and Analysis

To verify the effectiveness of the proposed method, we conducted extensive experiments on three public cross-modal datasets: Sydney-Captions, UCM-Captions, and RSICD. This section presents the experimental results and draws conclusions from the data.

3.1. About the Datasets and Evaluation Indexes

RSICD: This is a large-scale RS image captioning dataset created by Lu et al. [35], including 10,921 RS images of size 224 × 224 classified into 30 categories such as “Airport” and “Industrial Zone”, with each image matched with five descriptive sentences.
UCM-Captions: This dataset was created by Qu et al. [36] based on [37] and includes 2100 RS images of size 256 × 256 classified into 21 scene categories, with each image matched with five descriptive sentences.
Sydney-Captions: This dataset was also created by Qu et al. [36], based on [38], and includes 613 RS images of size 500 × 500 classified into seven scene categories, with each image matched with five descriptive sentences.
Among the three, RSICD, given its larger size and wider variety of scenes, can evaluate the performance of cross-modal retrieval models more effectively and has become the favored benchmark for tasks in this field [18].
Evaluation indexes: This paper uses three indexes to evaluate the effectiveness of the model. The first is Recall at K (R@K, K = 1, 5, and 10), which is common in image retrieval and denotes the percentage of queries for which a positive example occurs among the top K results. The second is mR, proposed by Huang et al. [39], which is the mean of the six recall rates. The third is R@SUM, proposed by Cheng et al. [7], which is the sum of the six recall rates.
It should also be noted that all the above datasets were split in the same way: 80% of the images and captions were used for training, 10% for validation, and 10% for testing.
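For illustration, the sketch below computes R@K, mR, and R@SUM from a similarity matrix, assuming one ground-truth match per query (the actual datasets pair each image with five captions, which changes the bookkeeping slightly):

```python
# R@K: fraction of queries whose ground-truth match is in the top K results;
# mR is the mean of the six recall values and R@SUM their sum.
import numpy as np

def recall_at_k(S, k):
    # S: (num_queries, num_targets); the ground truth of query i is target i
    ranks = (-S).argsort(axis=1)
    hits = [i in ranks[i, :k] for i in range(S.shape[0])]
    return 100.0 * np.mean(hits)

S_t2i = np.random.rand(100, 100)      # text-to-image similarities (toy data)
S_i2t = S_t2i.T                       # image-to-text similarities
recalls = [recall_at_k(S, k) for S in (S_i2t, S_t2i) for k in (1, 5, 10)]
mR, R_sum = np.mean(recalls), np.sum(recalls)
```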

3.2. Details in Training

In the experiments, the pretrained ResNet152 extracts visual features of dimension $2048 \times L$, where $L$ varies with image size; for instance, $L = 49$ when the image size is 224 × 224. BERT-base extracts word-level text features of dimension $768 \times M$, from which 2048-dimensional statement-level text features are generated via the bidirectional GRU network. The network was trained with the Adam optimizer for 80 epochs. The minibatch size was 64, and the learning rate started from 0.0001 and dropped by 90% every 10 epochs.
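A sketch of this optimization schedule in PyTorch; the model and data pipeline are placeholders, and only the optimizer and learning-rate decay follow the settings above:

```python
import torch
import torch.nn as nn

model = nn.Linear(2048, 2048)   # placeholder for CABIR; not the real network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# drop the learning rate by 90% every 10 epochs over 80 training epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(80):
    # ... iterate over minibatches of size 64 and minimize the triplet loss ...
    scheduler.step()
```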

3.3. Experiment Design

In order to demonstrate that the proposed model does, to some extent, address the challenges currently confronting cross-modal RS image retrieval, we designed three kinds of experiments:

3.3.1. Basic Experiments Design

CABIR was compared with selected baseline models in terms of the performance indexes achieved on the three public datasets (under the same test conditions). Four baseline models were selected, detailed below.
  • VSE++ [34]: a natural-image cross-modal retrieval model adopting data augmentation and fine-tuning strategies based on an improved triplet loss function.
  • MTFN [40]: a model with a self-defined similarity function that trains the network via a ranking loss.
  • SCAN [27]: a model with the Stacked Cross Attention mechanism proposed for exploring fine-grained semantic correspondence between images and texts, signifying a successful application of the cross-attention mechanism in the field of natural images. In this paper, the model CABIR was compared with both SCAN i2t and t2i.
  • AMFMN [18]: See Section 1 for a detailed introduction. In this paper, the CABIR model was compared simultaneously with AMFMN-soft, AMFMN-fusion, and AMFMN-sim, three versions of AMFMN.

3.3.2. Ablative Experiments Design

Variants of CABIR, obtained by replacing some of its components/modules or freezing some module parameters, were compared to analyze the role of each component in model training.

3.3.3. Multi-Scene Semantic Image Retrieval Experiments Design

In order to examine whether CABIR alleviates the mutual interference between semantic information in multi-scene semantic image retrieval, this paper selected from RSICD 40 RS images covering eight scene categories, “Airport”, “Farmland”, “Dense residential”, “Playfields”, “Port”, “Parking”, “Center”, and “River”, as the multi-scene RS image retrieval test set (hereinafter MSIRTS), which includes five multi-scene semantic images, as shown in Figure 6. In MSIRTS, all semantic information contained in the five multi-scene semantic images can also be found in other images, ensuring that distinguishing these images is difficult for the model. During the experiments, we retrieved each of these five images by its separate semantic types and evaluated the model’s performance by the ranking of the retrieval results.

3.4. Experimental Results

The results of all baseline models in this paper are quoted from [18] (obtained under the same experimental conditions).

3.4.1. Results of the Basic Experiments

As shown in Table 1, all experimental indexes of our proposed CABIR model are superior to those of the four baseline models on the RSICD dataset. Among the baselines, AMFMN [18] has the best overall performance; however, CABIR is 10.3% higher in mR, which reflects the balance of a model, and in R@SUM, which reflects overall performance. The top-performing baseline in R@1 for text retrieval is SCAN [27], which our model improves upon by 46.8%. Given the higher difficulty of the RSICD dataset, achieving the best results on this dataset relative to the baselines is particularly strong evidence of the validity of the method.
As shown in Table 2, our CABIR model achieves the best performance in five of the experimental indexes on the UCM-Captions dataset. For the few indexes where it does not reach the best performance, such as R@1 in text retrieval, it still attains satisfactory accuracy, with only a very small gap to the top-performing baseline. Moreover, the mR and R@SUM of our model are significantly higher than their counterparts in all baseline models, with a rise of 5.7%. In particular, R@10 in image retrieval reaches an extremely high level, 12.3% above the top-performing baseline AMFMN [18].
As shown in Table 3, our CABIR model achieves the best results in six of the experimental indexes on the Sydney-Captions dataset. The nonoptimal indexes, R@5 and R@10 in image retrieval, trail the best baseline by at most 4.1%. In mR and R@SUM, our model achieves a rise of 7.3% over the top-performing baseline AMFMN [18]. The largest improvement is in R@5 for text retrieval, which is significantly higher than those of the four baselines, with a rise of 11.8% over the top performer.
In summary, although CABIR does not achieve the top performance in every experimental index over the three datasets, its optimality rate reaches 79.1%. Further analysis reveals that the indexes in which it is not optimal are all associated with the UCM-Captions and Sydney-Captions datasets. We believe this may result from the higher repetitiveness and coarser semantic granularity of the textual descriptions in these two datasets; a more rigorous and more distinctive text is more advantageous for unleashing the effect of the CAL.

3.4.2. Results of the Ablative Experiments

In order to compare the roles of all structural components of CABIR, we designed a total of six variants of CABIR by making separate changes to the following modules:
  • the CAL, which was replaced with a global average pooling layer, hence the name “NO Attention” for the variant resulting from this change;
  • the pretrained BERT network, which was removed and replaced with an embedded matrix to change the dimension of the word vector into 300, hence the name “NO BERT” for the variant resulting from this change;
  • the GRU module, which was removed, with the [CLS] vector of the BERT output sequence taken directly as the statement-level text feature and mapped into a 2048-dimensional vector space via a fully connected layer, hence the name “NO GRU” for the variant resulting from this change;
  • the temperature control function, which was removed, hence the name “NO Temperature” for the variant resulting from this change;
  • the parameters of the pretrained ResNet152, which were frozen during training, hence the name “Freeze ResNet152” for the variant resulting from this change;
  • the parameters of pretrained BERT, which were frozen during training, hence the name “Freeze BERT” for the variant resulting from this change.
From the data in Table 4, the CAL has the greatest influence on the overall performance of the model: performance differs by 39.3% with and without this component. The second greatest influence comes from freezing the parameters of the two pretrained models, which leads to a performance degradation of more than 20%. Although previous literature [7] has warned that fine-tuning pretrained models increases the risk of overfitting, the experimental results show that training without fine-tuning clearly hampers the formation of an optimal joint semantic space. The “NO BERT” and “NO GRU” variants were intended to verify the effectiveness of the statement-level text feature extraction method proposed in this paper, while the “NO Temperature” variant was intended to check whether the proposed temperature control strategy works. From the experimental data, removing any of these modules leads to a performance difference of around 10% or more, which is strong evidence. These results show that the designed CABIR model is complete and reasonable in structure, with all components being indispensable, closely connected, and interacting with one another.

3.4.3. Results of the Multi-Scene Semantic Image Retrieval Experiments

Using the ten sentences shown in Table 5, we performed retrieval with the models with and without the CAL. During retrieval, we ensured that each text involved the semantics of only one scene of the target image; in other words, only parts of the target image were correlated with the retrieval text. The purpose was to expose as much as possible the influence of multi-scene semantics in RS images on retrieval precision and to check whether the proposed CABIR can shield against unrelated areas.
The results in Table 5 show that, after the cross-attention technique was applied, the retrieval rankings of the target images in MSIRTS improved in most cases; the mean ranking improved from 7.2 to 4.7, and no ranking became worse. This suggests that, when retrieving multi-scene semantic images, the proposed CABIR model can effectively screen out image features unrelated to the current retrieval (as shown in Figure 7), thereby increasing the relevance between the target image and the query texts and enhancing the model’s precision.
Furthermore, another phenomenon worth studying emerged during the experiments. In Figure 6e, more areas are related to farmland than to the river, but the retrieval results indicate that its deep semantic features are more closely related to rivers. We speculate this has something to do with the inductive bias of the learner, which may aggravate the mutual interference between multi-scene semantics.

4. Discussion

This section begins with a further analysis of the experimental results in the previous section, which raises the question: how much do variations in the temperature control function and in the image region division affect our model? To answer it, we conducted controlled-variable experiments and drew conclusions from the experimental data.

4.1. Further Analysis

In Section 3, we showed through the basic experiments that CABIR is superior in overall performance to the baseline models on the three public datasets, demonstrated through the ablation study that each component is necessary, and illustrated through the multi-scene semantic image retrieval experiments the effect of the cross-attention mechanism in masking redundant features. However, such results remain incomplete. For instance, although we know the temperature control function is necessary, we do not know at which settings it operates most reasonably or whether different settings would make the results worse; no such conclusion can be drawn from the available data. In addition, given that the proposed cross-attention mechanism is based on region-level image features, the results should differ depending on the number of regions into which an image is divided; whether a finer-grained or coarser-grained division is better is another question that needs further discussion. Therefore, we conducted the following two additional experiments to examine the generalization ability and stability of the model.

4.2. Influences of Temperature Control Function Parameter Settings on Model Precision

There are two parameters to set in the temperature control function, namely the extremum u and the midpoint r, each of which affects the model’s performance in a different way. From the perspective of mechanism analysis, the magnitude of r affects the curvature of the function: for small r, the function value converges quickly, which means the model moves quickly into stable operation but also that the odds of falling into a locally optimal solution increase. The magnitude of the extremum u determines by what factor the probability of area selection is amplified during stable operation. In the standard version of CABIR, we set r = 3 and u = 5. By means of controlled variables, we fixed r = 3 while varying u over [1, 5, 10, 15, 20], and then fixed u = 5 while varying r, obtaining the results shown in Figure 8. The variation in model precision is consistent with our theoretical expectation: neither u nor r can be set too large or too small, and the model is more sensitive to the variation of r. In this experiment, the wide value interval only indicates a trend (the settings in the standard version are not necessarily the optimal parameter values), but as long as the temperature control parameters are set within relatively reasonable intervals, say r ∈ [2, 10] and u ∈ [3, 15], the model’s performance is boosted markedly. Moreover, even relatively unreasonable parameter values did not make the results worse than when the temperature control function was removed. This suggests that the proposed temperature control function has indeed enhanced the stability of the model.

4.3. Influence of Different Image Region Divisions on Model Precision

As mentioned above, for ResNet152 with an image size of 224 × 224, the image is divided into 49 areas. Evidently, this number can be adjusted by changing either the network structure or the image size; for instance, adding a pooling layer for downsampling decreases the number of regions, while using higher-resolution images increases it. From the perspective of the mechanism, the number of regions should be neither too large nor too small: too fine-grained a division is likely to segment a single target into multiple regions, while too coarse-grained a division works against semantic distinction. In this experiment, we changed the number of image areas to 36, 25, 16, 9, and 4 (on the RSICD dataset) by adding a convolutional layer to ResNet152 and observed the variations in model precision. As shown in Figure 9, the number of image areas has a great influence on model precision, which first decreases quickly as the number of areas decreases and then remains relatively stable at a low level.
The above results seem to imply that a finer-grained region division is more advantageous to image-text semantic association, which disagrees with the foregoing analysis. However, we conjectured that 49 regions might not yet have reached the turning point. Thus, we conducted another controlled experiment on the Sydney-Captions dataset, whose images are larger, with a resolution of 500 × 500, so that the number of original regions produced by ResNet152 is 16 × 16. In this experiment, the number of image regions was varied over [144, 100, 64, 49, 36]. According to the results in Figure 10, the largest number of regions (16 × 16) corresponds to the best result, even though the differences between groups decrease significantly: the region divisions of 12 × 12 and 8 × 8 fall short of the optimal result by only 3.9% and 3.3%, respectively. Moreover, between these two divisions we did observe the phenomenon that “the number of regions decreases while the model precision increases”. Why does the 16 × 16 region division achieve the best result? We suggest two possibilities: (1) the 16 × 16 region division has not yet reached the turning point at which the model’s performance degrades; (2) the newly added convolutional layer has not undergone pretraining, which may have interfered with precision. This reminds us that the pretrained ResNet152 does not fully match our method and that a pretrained convolutional architecture that can adapt to scale variations of images is crucial to CABIR.
It should also be noted that a larger number of image areas means higher computational overhead. From our recorded experimental data, as the number of areas increases, the computational time cost grows by 25% to 50%. In practical applications, it is necessary to balance accuracy and efficiency.
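As an illustration of varying the region count fed to the CAL (the authors add a convolutional layer; the adaptive pooling shown here is a simpler stand-in we use for the sketch):

```python
# Resample the 7x7 ResNet152 feature map to a chosen grid size before
# flattening it into regions, yielding 49, 36, 25, 16, 9, or 4 regions.
import torch
import torch.nn as nn

feat_map = torch.randn(4, 2048, 7, 7)            # ResNet152 output for 224x224 input
for grid in (7, 6, 5, 4, 3, 2):
    pooled = nn.AdaptiveAvgPool2d(grid)(feat_map)
    regions = pooled.flatten(2).permute(0, 2, 1)  # (4, grid*grid, 2048)
    print(grid * grid, regions.shape)
```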

5. Conclusions

In this paper, we have proposed a cross-attention network based on regional-level semantic features of images, which to some extent solves the “multi-scene semantic” puzzle of RS images. Specifically, the network adopts a novel way of employing the attention mechanism in the CAL module and includes an effective statement-level text feature generation module and a temperature control function governing the network’s operation. To demonstrate the effectiveness of the method, we conducted extensive experiments on the RSICD, UCM, and Sydney datasets. The experimental results show that, in image-text retrieval from RS images, the proposed method surpasses the four state-of-the-art baseline models in overall performance, with the index mR reaching 18.12%, 48.30%, and 55.53% on the three datasets, respectively. In future work, we will focus on improving the existing convolutional network structure so that it can adapt to multiscale RS image input, divide image regions more intelligently, and enhance the precision and generalization ability of the model.

Author Contributions

Conceptualization, F.Z. and H.Z.; methodology, F.Z. and H.Z.; software, F.Z. and W.L.; validation, W.L. and X.Z.; formal analysis, F.Z.; investigation, F.Z. and X.Z.; resources, H.Z. and X.W.; data curation, F.Z.; writing—original draft preparation, F.Z.; writing—review and editing, H.Z. and L.W.; visualization, F.Z. and X.W.; supervision, H.Z.; project administration, F.Z.; funding acquisition, H.Z. and X.W. All authors have read and agreed to the published version of the manuscript.

Funding

The research was supported by the National Natural Science Foundation of China (NSFC) (Grant No.62102423).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors thank all reviewers who helped to improve this manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ma, J.; Zhou, H.; Zhao, J.; Gao, Y.; Jiang, J.; Tian, J. Robust Feature Matching for Remote Sensing Image Registration via Locally Linear Transforming. IEEE Trans. Geosci. Remote Sens. 2015, 53, 6469–6481.
  2. Scott, G.J.; Klaric, M.N.; Davis, C.H.; Shyu, C.-R. Entropy-Balanced Bitmap Tree for Shape-Based Object Retrieval From Large-Scale Satellite Imagery Databases. IEEE Trans. Geosci. Remote Sens. 2011, 49, 1603–1616.
  3. Demir, B.; Bruzzone, L. Hashing-Based Scalable Remote Sensing Image Search and Retrieval in Large Archives. IEEE Trans. Geosci. Remote Sens. 2016, 54, 892–904.
  4. Chi, M.; Plaza, A.; Benediktsson, J.A.; Sun, Z.; Shen, J.; Zhu, Y. Big Data for Remote Sensing: Challenges and Opportunities. Proc. IEEE 2016, 104, 2207–2219.
  5. Li, P.; Ren, P. Partial Randomness Hashing for Large-Scale Remote Sensing Image Retrieval. IEEE Geosci. Remote Sens. Lett. 2017, 14, 464–468.
  6. Li, Y.; Zhang, Y.; Huang, X.; Zhu, H.; Ma, J. Large-Scale Remote Sensing Image Retrieval by Deep Hashing Neural Networks. IEEE Trans. Geosci. Remote Sens. 2018, 56, 950–965.
  7. Cheng, Q.; Zhou, Y.; Fu, P.; Xu, Y.; Zhang, L. A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 4284–4297.
  8. Tobin, K.W.; Bhaduri, B.L.; Bright, E.A.; Cheriyadat, A.; Karnowski, T.P.; Palathingal, P.J.; Potok, T.E.; Price, J.R. Automated Feature Generation in Large-Scale Geospatial Libraries for Content-Based Indexing. Photogramm. Eng. Remote Sens. 2006, 72, 531–540.
  9. Mikriukov, G.; Ravanbakhsh, M.; Demir, B. Deep Unsupervised Contrastive Hashing for Large-Scale Cross-Modal Text-Image Retrieval in Remote Sensing. arXiv 2022, arXiv:2201.08125.
  10. Cao, R. Enhancing remote sensing image retrieval using a triplet deep metric learning network. Int. J. Remote Sens. 2020, 41, 740–751.
  11. Sumbul, G.; Demir, B. Informative and Representative Triplet Selection for Multilabel Remote Sensing Image Retrieval. IEEE Trans. Geosci. Remote Sens. 2022, 60, 11.
  12. Yun, M.-S.; Nam, W.-J.; Lee, S.-W. Coarse-to-Fine Deep Metric Learning for Remote Sensing Image Retrieval. Remote Sens. 2020, 12, 219.
  13. Roy, S.; Sangineto, E.; Demir, B.; Sebe, N. Metric-Learning-Based Deep Hashing Network for Content-Based Retrieval of Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2021, 18, 226–230.
  14. Han, L.; Li, P.; Bai, X.; Grecos, C.; Zhang, X.; Ren, P. Cohesion Intensive Deep Hashing for Remote Sensing Image Retrieval. Remote Sens. 2019, 12, 101.
  15. Shan, X.; Liu, P.; Gou, G.; Zhou, Q.; Wang, Z. Deep Hash Remote Sensing Image Retrieval with Hard Probability Sampling. Remote Sens. 2020, 12, 2789.
  16. Kong, J.; Sun, Q.; Mukherjee, M.; Lloret, J. Low-Rank Hypergraph Hashing for Large-Scale Remote Sensing Image Retrieval. Remote Sens. 2020, 12, 1164.
  17. Ye, D.; Li, Y.; Tao, C.; Xie, X.; Wang, X. Multiple Feature Hashing Learning for Large-Scale Remote Sensing Image Retrieval. ISPRS Int. J. Geo.-Inf. 2017, 6, 364.
  18. Yuan, Z.; Zhang, W.; Fu, K.; Li, X.; Deng, C.; Wang, H.; Sun, X. Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–19.
  19. Chen, Y.; Lu, X. A Deep Hashing Technique for Remote Sensing Image-Sound Retrieval. Remote Sens. 2019, 12, 84.
  20. Rahhal, M.M.A.; Bazi, Y.; Abdullah, T.; Mekhalfi, M.L.; Zuair, M. Deep Unsupervised Embedding for Remote Sensing Image Retrieval Using Textual Cues. Appl. Sci. 2020, 10, 8931.
  21. Socher, R.; Karpathy, A.; Le, Q.V.; Manning, C.D.; Ng, A.Y. Grounded Compositional Semantics for Finding and Describing Images with Sentences. Trans. Assoc. Comput. Linguist. 2014, 2, 207–218.
  22. Karpathy, A.; Joulin, A.; Li, F.F.F. Deep Fragment Embeddings for Bidirectional Image Sentence Mapping. Adv. Neural Inf. Process. Syst. 2014, 27, 9.
  23. Gu, W.; Gu, X.; Gu, J.; Li, B.; Xiong, Z.; Wang, W. Adversary Guided Asymmetric Hashing for Cross-Modal Retrieval. In Proceedings of the 2019 on International Conference on Multimedia Retrieval, Ottawa, ON, Canada, 10–13 June 2019.
  24. Ning, H.; Zhao, B.; Yuan, Y. Semantics-Consistent Representation Learning for Remote Sensing Image–Voice Retrieval. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14.
  25. Mao, G.; Yuan, Y.; Xiaoqiang, L. Deep Cross-Modal Retrieval for Remote Sensing Image and Audio. In Proceedings of the 2018 10th IAPR Workshop on Pattern Recognition in Remote Sensing (PRRS), Beijing, China, 19–20 August 2018; pp. 1–7.
  26. Cheng, Q.; Huang, H.; Ye, L.; Fu, P.; Gan, D.; Zhou, Y. A Semantic-Preserving Deep Hashing Model for Multi-Label Remote Sensing Image Retrieval. Remote Sens. 2021, 13, 4965.
  27. Lee, K.-H.; Chen, X.; Hua, G.; Hu, H.; He, X. Stacked Cross Attention for Image-Text Matching. arXiv 2018, arXiv:1803.08024.
  28. Huang, Y.; Wang, W.; Wang, L. Instance-Aware Image and Sentence Matching with Selective Multimodal LSTM. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 7254–7262.
  29. Wang, Y.; Yang, H.; Bai, X.; Qian, X.; Ma, L.; Lu, J.; Li, B.; Fan, X. PFAN++: Bi-Directional Image-Text Retrieval With Position Focused Attention Network. IEEE Trans. Multimed. 2021, 23, 3362–3376.
  30. Nam, H.; Ha, J.-W.; Kim, J. Dual Attention Networks for Multimodal Reasoning and Matching. arXiv 2016, arXiv:1611.00471.
  31. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385.
  32. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805.
  33. Schuster, M.; Paliwal, K.K. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 1997, 45, 2673–2681.
  34. Faghri, F.; Fleet, D.J.; Kiros, J.R.; Fidler, S. VSE++: Improving visual-semantic embeddings with hard negatives. arXiv 2017, arXiv:1707.05612.
  35. Lu, X.; Wang, B.; Zheng, X.; Li, X. Exploring Models and Data for Remote Sensing Image Caption Generation. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2183–2195.
  36. Qu, B.; Li, X.; Tao, D.; Lu, X. Deep semantic understanding of high resolution remote sensing image. In Proceedings of the 2016 International Conference on Computer, Information and Telecommunication Systems (CITS), Kunming, China, 6–8 July 2016; pp. 1–5.
  37. Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems—GIS ’10, San Jose, CA, USA, 2–5 November 2010; p. 270.
  38. Zhang, F.; Du, B.; Zhang, L. Saliency-Guided Unsupervised Feature Learning for Scene Classification. IEEE Trans. Geosci. Remote Sens. 2015, 53, 2175–2184.
  39. Huang, Y.; Wu, Q.; Song, C.; Wang, L. Learning Semantic Concepts and Order for Image and Sentence Matching. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6163–6171.
  40. Wang, T.; Xu, X.; Yang, Y.; Hanjalic, A.; Shen, H.T.; Song, J. Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019.
Figure 1. (a) Displays a large distance between image and text features over the hypersphere under the interference of redundant information. (b) Displays the compression of image information in a specific direction, such that the distance (between image and text features) is shrunk under the cross-attention mechanism.
Figure 2. Overall network structure diagram of CABIR. This network consists of three modules, which are: (A) the text feature extraction module composed of BERT + Bi-GRU, which is used to extract advanced semantic features of texts; (B) the image feature extraction module composed of ResNet152, which is used to acquire regional-level advanced semantic features of images; and (C) the CAL module, which is used to filter redundant features of images and solve the similarity matrix between the image set and the text set. Finally, based on this matrix, the model will solve and return the gradient information using the triplet loss to constrain network training.
Figure 3. The general process of region-level feature extraction from an image. The red squares passed through by the straight line at the last step represent the vector of image features within some area, whose dimension is 2048 in ResNet152.
Figure 4. Detailed structural diagram of CAL. CAL is used for cross-modal information fusion, with three fully connected layers as its components. The inputs of the Fc_Value Layer and Fc_Key Layer are regional-level image features, and the outputs are a value matrix and a key matrix, respectively; the input of the Fc_Query Layer is text features, and the output is a query matrix. Finally, this layer will converge on the query matrix as a clue to get the final image feature representation and calculate the similarity score between it and texts. The output of the entire CAL is a similarity matrix of image-text pairs.
Figure 5. Graph of the temperature control function, with the values of the standard model’s parameters at the upper right corner.
Figure 6. Multi-scene semantic RS images in MSIRTS, each of which contains two distinct types of scene semantics. (a) Airport and Farmland; (b) Playfields and Dense residential; (c) Port and Dense residential; (d) Parking and Center; (e) River and Farmland.
Figure 7. When the same “multi-scene semantics” RS image is retrieved by different statements, the weight distribution of each region of the image is significantly different. The change in color from white to red represents an increase in weight. This shows that CABIR can realize semantic recognition and shield image regions irrelevant to current retrieval. From left to right, the retrieval statements are: (a) “A plane is parked on the apron” and “There are large areas of brown farmland”; (b) “There is a large area of neat red buildings in the residential area” and “A playground with several basketball fields”.
Figure 8. Variations of model precision with temperature control parameters.
Figure 9. Variations of model precision over RSICD dataset with the number of image areas.
Figure 10. Variations of model precision over the Sydney-Captions dataset with the number of image areas.
Table 1. Comparison of the results of all models on the dataset RSICD.

| Method | Text R@1 | Text R@5 | Text R@10 | Image R@1 | Image R@5 | Image R@10 | mR | R@sum |
|---|---|---|---|---|---|---|---|---|
| VSE++ | 3.38 | 9.51 | 17.46 | 2.82 | 11.32 | 18.1 | 10.43 | 62.59 |
| SCAN t2i | 4.39 | 10.9 | 17.64 | 3.91 | 16.2 | 26.49 | 13.25 | 79.53 |
| SCAN i2t | 5.85 | 12.89 | 19.84 | 3.71 | 16.4 | 26.73 | 14.23 | 85.42 |
| MTFN | 5.02 | 12.52 | 19.74 | 4.9 | 17.17 | 29.49 | 14.81 | 88.84 |
| AMFMN-soft | 5.05 | 14.53 | 21.57 | 5.05 | 19.74 | 31.04 | 16.02 | 96.98 |
| AMFMN-fusion | 5.39 | 15.08 | 23.4 | 4.9 | 18.28 | 31.44 | 16.42 | 98.49 |
| AMFMN-sim | 5.21 | 14.72 | 21.57 | 4.08 | 17 | 30.6 | 15.53 | 93.18 |
| CABIR | 8.59 | 16.27 | 24.13 | 5.42 | 20.77 | 33.58 | 18.12 | 108.76 |
Table 2. Comparison of results of all models over the UCM-Captions dataset.

| Method | Text R@1 | Text R@5 | Text R@10 | Image R@1 | Image R@5 | Image R@10 | mR | R@sum |
|---|---|---|---|---|---|---|---|---|
| VSE++ | 12.38 | 44.76 | 65.71 | 10.1 | 31.8 | 56.85 | 36.93 | 221.6 |
| SCAN t2i | 14.29 | 45.71 | 67.62 | 12.76 | 50.38 | 77.24 | 44.67 | 268 |
| SCAN i2t | 12.85 | 47.14 | 69.52 | 12.48 | 46.86 | 71.71 | 43.43 | 260.56 |
| MTFN | 10.47 | 47.62 | 64.29 | 14.19 | 52.38 | 78.95 | 44.65 | 267.9 |
| AMFMN-soft | 12.86 | 51.9 | 66.67 | 14.19 | 51.71 | 78.48 | 45.97 | 275.81 |
| AMFMN-fusion | 16.67 | 45.71 | 68.57 | 12.86 | 53.24 | 79.43 | 46.08 | 276.48 |
| AMFMN-sim | 14.76 | 49.52 | 68.1 | 13.43 | 51.81 | 76.48 | 45.68 | 274.1 |
| CABIR | 15.17 | 45.71 | 72.85 | 12.67 | 54.19 | 89.23 | 48.3 | 289.82 |
Table 3. Comparison of results of all models on the Sydney-Captions dataset.

| Method | Text R@1 | Text R@5 | Text R@10 | Image R@1 | Image R@5 | Image R@10 | mR | R@sum |
|---|---|---|---|---|---|---|---|---|
| VSE++ | 24.14 | 53.45 | 67.24 | 6.21 | 33.56 | 51.03 | 39.27 | 235.63 |
| SCAN t2i | 18.97 | 51.72 | 74.14 | 17.59 | 56.9 | 76.21 | 49.26 | 295.53 |
| SCAN i2t | 20.69 | 55.17 | 67.24 | 15.52 | 57.59 | 76.21 | 48.74 | 292.42 |
| MTFN | 20.69 | 51.72 | 68.97 | 13.79 | 55.51 | 77.59 | 48.05 | 288.27 |
| AMFMN-soft | 20.69 | 51.72 | 74.14 | 15.17 | 58.62 | 80 | 50.06 | 300.34 |
| AMFMN-fusion | 24.14 | 51.72 | 75.86 | 14.83 | 56.55 | 77.89 | 50.17 | 300.99 |
| AMFMN-sim | 29.31 | 58.62 | 67.24 | 13.45 | 60 | 81.72 | 51.72 | 310.34 |
| CABIR | 32.76 | 65.52 | 79.31 | 19.66 | 57.59 | 78.51 | 55.53 | 333.15 |
Table 4. Comparison of the results of all variants of CABIR on the RSICD dataset.

| Method | Text R@1 | Text R@5 | Text R@10 | Image R@1 | Image R@5 | Image R@10 | mR | R@sum |
|---|---|---|---|---|---|---|---|---|
| NO Attention | 4.39 | 10.24 | 17 | 3.27 | 15.04 | 28.12 | 13.01 | 78.06 |
| NO BERT | 9.05 | 15.9 | 22.03 | 4.86 | 16.98 | 29.52 | 16.39 | 98.34 |
| NO GRU | 7.86 | 14.63 | 22.21 | 4.92 | 18.88 | 31.3 | 16.63 | 99.8 |
| NO Temperature | 6.3 | 14.16 | 21.85 | 5.13 | 19.07 | 32.34 | 16.48 | 98.85 |
| Freeze ResNet152 | 6.39 | 12.34 | 21.76 | 3.91 | 15.87 | 27.77 | 14.67 | 88.04 |
| Freeze BERT | 7.13 | 11.33 | 16.73 | 2.72 | 12.27 | 29.83 | 12 | 80.01 |
| CABIR | 8.59 | 16.27 | 24.13 | 5.42 | 20.77 | 33.58 | 18.12 | 108.76 |
Table 5. Comparison of rankings between the statements for retrieval and the target images used in the experiments.

| Text for Retrieval | Target Image | Ranking (No Attention) | Ranking (With Attention) |
|---|---|---|---|
| a plane is parked on the apron | Figure 6a | 3 | 3 |
| there are some roads in the large brown farmland | Figure 6a | 6 | 5 |
| there is a large area of neat red buildings in the residential area | Figure 6b | 4 | 2 |
| a playground with several basketball fields | Figure 6b | 5 | 5 |
| beside the highway is a dense residential area | Figure 6c | 13 | 9 |
| next to the sea is a large port | Figure 6c | 5 | 4 |
| There are many vehicles parked in the parking lot around | Figure 6d | 10 | 9 |
| The center is a gray square building | Figure 6d | 1 | 1 |
| There is a forest beside the large farmland | Figure 6e | 22 | 7 |
| a large number of tall trees are planted on both sides of the river | Figure 6e | 3 | 2 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
