Article

FLsM: Fuzzy Localization of Image Scenes Based on Large Models

1 School of Automation, Beijing Institute of Technology, Beijing 100081, China
2 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100101, China
* Author to whom correspondence should be addressed.
Submission received: 2 April 2024 / Revised: 23 April 2024 / Accepted: 14 May 2024 / Published: 29 May 2024

Abstract

This article focuses on image-based localization technology. While traditional navigation methods have made significant advances in technology and applications, visual image-based localization remains an emerging field with great research potential. Deep learning has shown strong performance in image processing, particularly in developing visual navigation and localization techniques built on large-scale visual models. This paper introduces a scene image localization technique based on large models in a vast spatial sample environment. The study involved training convolutional neural networks on millions of geographically labeled images, extracting image position information with large-model algorithms, and collecting sample data under various conditions in an elastic scene space. Through visual computation, the shooting position of a photo is inferred to obtain the approximate position of the user. The method uses geographic location information to classify images and combines it with landmarks, natural features, and architectural styles to determine their locations. The experimental results show variations in positioning accuracy among different models, with the best model obtained through training on a large-scale dataset. They also indicate that the positioning error for urban street images is relatively small, whereas the positioning performance for outdoor and local scenes, especially in large-scale spatial environments, is limited.

1. Introduction

In recent years, the BeiDou Navigation Satellite System (BDS), the Galileo satellite navigation system (Galileo), the modernized Global Positioning System (GPS), and the Russian GLONASS have been developed, and more than 140 GNSS satellites are now available [1]. Global Navigation Satellite Systems (GNSSs) are increasingly used for outdoor navigation [2]. The strapdown inertial navigation system (SINS) can autonomously measure the user's position, velocity, and attitude [3]. However, inertial navigation is subject to its own limitations and high costs; its primary drawback is that error grows over time, leading to drift. The systems above represent traditional navigation and positioning technologies. Over years of development, they have established a relatively comprehensive technical framework that is widely used in human activities and daily life. As science and technology advance, the need for navigation and positioning continues to evolve. Robust visual localization over long periods of time is one of the biggest challenges for the long-term navigation of mobile robots [4]. Monocular visual-inertial navigation systems (VINSs) are widely used in fields such as robot navigation, autonomous driving, and augmented/virtual reality [5].
Visual navigation is a navigation method using visible and invisible imaging technology, which has the advantages of good concealment, strong autonomy, fast and accurate measurement, low cost, and high reliability [6]. At present, the emergence of diverse intelligent robots not only significantly facilitates human life, but also presents elevated demands for robot performance, thereby establishing a prerequisite for the advancement of visual positioning technology. Visual information is increasingly utilized in navigation applications. With the introduction of numerous new concepts, methods, and theories, image processing technology based on deep learning has progressively matured. Visual navigation technology is expected to be developed and widely used in the fields of aircraft, unmanned aerial vehicles, various cruise missiles, deep space probes, indoor mobile navigation, and so on [7]. Therefore, visual navigation technology has high research and application value in the field of navigation. Usually, visual navigation on robots is achieved by installing a monocular or binocular camera to obtain local images of the environment and to make navigation decisions. The research on intelligent robots began in the late 1960s, marked by Shakey, the first mobile robot developed by Stanford Research Institute (SRI) [8]. Its main objective is to study the real-time control of robot systems in complex environments. Representative examples include urban robots and tactical robots developed by Jet Propulsion Laboratory [9,10,11]. These robots are equipped with binocular stereo vision systems for obstacle detection. Visual navigation and positioning can also be applied to spacecraft or interplanetary detectors, such as lunar probes. The lunar rover has a high degree of autonomy and is suitable for performing exploration tasks in a complex and unstructured lunar environment [12,13,14,15]. The stereo vision system of the lunar rover is the most direct and effective tool for close-range and high-altitude moon detection. It serves as a tool for understanding the lunar environment and provides crucial information for lunar rover survival in complex environments. Using the stereo vision system, we can not only reconstruct the terrain of the environment in real time to avoid obstacles, but also use the obtained stereo sequence images to estimate the movement of the rover itself. Therefore, the application of visual navigation in the field of robotics is extremely extensive and significant [16,17,18].
The essence of visual navigation is to obtain two-dimensional image information of the scene through one or more cameras and then derive the information needed for navigation by using image processing, computer vision, pattern recognition, and other algorithms. The techniques involved include camera calibration, stereo image matching, path identification, and 3D reconstruction [19,20]. Inspired by the process of robot positioning and navigation, we consider whether positional data can be extracted from images and whether a single image can support robot navigation and positioning tasks. To address this question, we have conducted extensive research, focusing primarily on the feasibility and precision of obtaining location information from images. If geographic location data can be inferred probabilistically from images, this constitutes a highly significant area of study. Visual positioning can play an important role when satellite navigation fails and has broad application value: when satellite navigation signals are obstructed or unavailable, visual positioning systems can provide reliable positioning information for many fields.
  • Autonomous driving: with the help of high-precision visual positioning, autonomous vehicles can locate and navigate in environments without satellite signals, ensuring safe driving in cities.
  • Indoor navigation: in indoor environments, satellite signals are usually weak or unavailable; visual positioning allows people to navigate accurately within large buildings such as shopping malls, airports, and hospitals.
  • Industrial robots: in manufacturing and logistics, industrial robots require precise positioning information to perform their tasks, and visual positioning systems can provide this in real time.
  • Search and rescue: when a disaster disrupts or damages satellite navigation, visual positioning can help rescuers locate trapped individuals without satellite signals.
  • Military applications: visual positioning supports precise positioning and navigation without being jammed or monitored by hostile forces.
  • Mobile devices: smartphones and tablets can use cameras and sensors for visual positioning, providing indoor navigation, augmented reality experiences, and other functions.
This paper focuses on geographical fuzzy positioning using image information mining, consolidates the relevant technical accomplishments in navigation and positioning, and scrutinizes the limitations of current navigation and positioning technology based on their characteristics. Addressing the requirements of visual navigation and positioning, this paper aims to achieve visual positioning capability in a lightweight manner for various scenarios and applications. It introduces a method of image fuzzy positioning based on a large model, enabling scene location determination through images in GPS failure environments. A schematic diagram of FLsM, based on image localization, is shown in Figure 1. The key contributions of this paper are summarized as follows:
  • The concept of elastic scale space is introduced, which refers to a coupling space between large-scale scenes and fine scenes, emphasizing the variability and unpredictability of the environment.
  • A vision-based fuzzy positioning technology is suggested, emphasizing the semantic information extraction from the visual image itself and providing geographic location.
  • By leveraging multiple models for image training, employing advanced deep learning models, and utilizing a large dataset of Internet data for pre-training, we can efficiently match images and texts and accomplish the fuzzy positioning of images.

2. Related Works

In this section, we introduce technologies related to visual navigation and positioning, including traditional SLAM technology, deep learning models for visual navigation, and geographic information positioning research.

2.1. Traditional SLAM Technology

Traditional Simultaneous Localization and Mapping (SLAM) techniques have long been a beacon in the odyssey of robotic navigation and mapping [21]. These methods typically harness the power of sensor fusion, utilizing data from cameras, lidars, and other sensors to simultaneously decipher the robot’s location and construct a map of its environment. Algorithms such as ORB-SLAM2 and ORB-SLAM3 have been luminaries in this domain, demonstrating significant advancements in accuracy and stability over the past decade [22]. However, as underscored in recent scholarly pursuits, challenges persist in achieving long-term robustness, especially in the face of diverse and dynamic environmental perturbations.

2.2. Deep Learning Models for Visual Navigation

The advent of deep learning has revolutionized the landscape of visual navigation [23,24]. Contemporary approaches leverage the prowess of neural networks to learn complex mappings between visual inputs and navigational outputs. CLIP, as discussed in the previous section, emerges as a model capable of learning transferable visual representations from the vast expanse of natural language supervision [25,26]. It extends the scope of computer vision systems by directly learning from the raw text about images, showcasing promising results across a myriad of tasks, without the need for task-specific training.
Moreover, recent scholarly endeavors delve into the comparison of open-source visual SLAM approaches, evaluating algorithms based on factors such as accuracy, computational performance, and robustness. This reflects the ongoing quest to enhance the performance of visual navigation systems and address the specific challenges posed by different scenarios and datasets.

2.3. Geographic Information Positioning Research

Geographic Information Systems (GISs) and positioning technologies are the compass and sextant of modern navigation systems. Recent works, such as GeoCLIP, integrate CLIP-inspired techniques to chart the course for effective worldwide geo-localization. By encoding GPS information and employing hierarchical learning [27,28], GeoCLIP demonstrates a state-of-the-art performance, navigating the challenges associated with the diversity of global landscapes. This underscores the importance of geographic information in refining the accuracy and reliability of visual navigation systems.

3. Materials and Methods

3.1. Overview

A visual fuzzy positioning method in elastic scale space is proposed for the different application requirements of various scenes, based on an in-depth analysis of image information. This method differs from traditional visual positioning methods. In large-scale scenes, high-resolution remote sensing images are used for photogrammetry to generate image maps at different scales. Fine-scale scenes are divided into indoor and outdoor areas, and environmental image data are collected as natural target samples using mobile measuring equipment. The expectation is that both types of data rely solely on the information carried by the image itself and that the positioning requirements can be fulfilled using deep learning algorithms built on large models. The concept of elastic scale space leverages the randomness and lightness of the image, focusing on mining the information value of the image to obtain rough positioning information. The significance of this work lies in its lightweight design, which does not impose strict requirements on the image itself and instead emphasizes model training. The image captured by the terminal’s camera is used to determine whether a specific target is present, and the user’s precise position is then calculated visually. The entire process is shown in Figure 2.

3.2. Constructing an Elastic Scale Spatial Environment

We propose the concept of an elastic scale space, which is a space between the coupling of large-scale and fine-grained scenes, highlighting the arbitrariness and randomness of the environment, as shown in Figure 3.
Traditionally, in large-scale scenes, it is necessary to make large-scale scene image base maps and develop service engines. The research process is as follows:
  • Select an appropriate reference point
Reference points should be clearly defined landmarks; road intersections, building corners, and similar features can be considered as candidate reference points. After a datum point is selected, high-precision surveying and mapping should be carried out on it.
  • Image map of construction environment
Utilize aviation professional equipment (drones and aerial photography planes) for photogrammetry work, collect high-resolution remote sensing images of the environmental field, and produce image maps at scales of 1:1000, 1:500, etc.
  • Unified spatiotemporal benchmark construction
The spatiotemporal baseline of the environmental field is usually determined with the Web Mercator projection. Its core is a coordinate system transformation, mainly projecting the imagery into a plane coordinate system that is consistent with the map projection in general use (a sketch of this transformation is given below).
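To make the projection step concrete, the following is a minimal sketch (not the authors' implementation) of the standard Web Mercator forward transformation from WGS-84 longitude/latitude to plane coordinates; the function name and the latitude clamp are illustrative.

```python
import math

EARTH_RADIUS_M = 6378137.0   # WGS-84 semi-major axis used by Web Mercator
MAX_LAT = 85.05112878        # latitude limit where the projection stays finite

def lonlat_to_webmercator(lon_deg: float, lat_deg: float) -> tuple[float, float]:
    """Project WGS-84 longitude/latitude (degrees) to Web Mercator x, y (meters)."""
    lat_deg = max(min(lat_deg, MAX_LAT), -MAX_LAT)   # clamp toward the poles to avoid infinities
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

# Example: project the approximate coordinates of Beijing (116.39 E, 39.91 N)
print(lonlat_to_webmercator(116.39, 39.91))
```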
For the construction of fine scenes, the following process is required.
  • Target sample collection and data construction
In order to represent environmental information more accurately, the environment is divided into two parts: the indoor environment and the outdoor environment. The outdoor dataset uses satellite positioning technology to collect and store the location information of identifiable targets in the scene. At the same time, a mobile terminal equipped with a high-definition camera captures each target from various angles and collects its image information. Vehicle-mounted or airborne measurement systems are then used to collect local RGBD information around the target, and a data resource lake for the scene is constructed through these technical means. Similarly, in indoor environments, a unified indoor coordinate system must be established; the camera captures each target from various angles, its image information is collected, and the location information of identifiable indoor targets is recorded and saved.
  • Model library construction
According to the requirements, use a deep learning framework to train the collected image information, obtain a proprietary model library, and obtain parameter models that meet the requirements.

3.3. The Concept of Visual Fuzzy Localization

In order to achieve precise positioning, this article proposes a vision-based fuzzy positioning technology that can be integrated with satellite navigation and other techniques. Visual fuzzy localization focuses on mining the semantic information contained in the visual images themselves to categorize or detect input scene photographs and supply positional range information, using deep learning network methods [29] in conjunction with large-scale model structures. The main focus is semantic segmentation, which effectively extracts information from the scene. Based on the recognition results, the marked position data are retrieved from the stored position data. It should be noted that fuzzy positioning results can serve as the input for subsequent precise positioning: the previously acquired fuzzy position information is converted into fine scene data, the target’s local RGBD information is extracted from the stored location data and compared with the scene map according to the recognition results, and a 2D–3D visual solving algorithm is then used to obtain the user’s precise position. The technical procedure is shown in Figure 4.
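The 2D–3D visual solving step is not specified in detail here; a common choice is a Perspective-n-Point (PnP) solver. The sketch below, which assumes OpenCV and uses simulated correspondences with illustrative values, shows how matched 3D points from the stored RGBD data and their 2D projections in the query image could yield the camera (user) position.

```python
import cv2
import numpy as np

# 3D points of a recognized target in the scene's world frame (meters); values are illustrative.
object_points = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0], [1.2, 0.8, 0.0],
                          [0.0, 0.8, 0.0], [0.6, 0.4, 0.5], [0.3, 0.1, 0.2]])

# Assumed calibrated pinhole intrinsics (focal lengths and principal point, in pixels).
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
dist = np.zeros(5)

# Simulate a ground-truth camera pose and project the 3D points to obtain 2D detections.
rvec_true = np.array([[0.1], [-0.2], [0.05]])
tvec_true = np.array([[0.2], [-0.1], [3.0]])
image_points, _ = cv2.projectPoints(object_points, rvec_true, tvec_true, K, dist)

# Recover the pose from the 2D-3D correspondences, as a fuzzy-to-fine positioning step might.
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist)
R, _ = cv2.Rodrigues(rvec)            # rotation from the world frame to the camera frame
camera_position = -R.T @ tvec         # camera (user) position in world coordinates
print("Estimated user position:", camera_position.ravel())
```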

3.4. Multi-Source Image Data Source Matching

Image retrieval and matching algorithms are essential for searching, matching, and displaying location information derived from visual images. These techniques enable the rapid matching of feature data with the corresponding image data. In addition to image encoding and fast image retrieval over large amounts of data, several key technologies still need to be resolved, including fast feature extraction for multi-source image data, unified spatiotemporal benchmarks, and image encoding. The multi-source image data used to build traditional large-scale scenes include aerial surveys, other sensor imagery, and satellite remote sensing photographs. Data integration is based on unifying the spatiotemporal benchmarks and converting the multi-source image data to a common scale. The coordinate system of terrain feature points and other elements in the imagery is taken from the CGCS2000 coordinate system [30]; the image data and feature data are projected onto a plane according to a universal map projection and uniformly projected onto Web Mercator to form a consistent spatiotemporal baseline, after which feature extraction and spatiotemporal matching of the multi-source image data are completed.
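As a point of reference for the feature extraction and matching step, the following is a minimal sketch using SIFT (the local descriptor the paper applies to the scene datasets in Section 3.5) with OpenCV; the file names are placeholders and the ratio-test threshold is a common default, not a value taken from the paper.

```python
import cv2

# Load two overlapping images from different sources (file names are placeholders)
img_a = cv2.imread("aerial_tile.png", cv2.IMREAD_GRAYSCALE)
img_b = cv2.imread("street_view_crop.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp_a, desc_a = sift.detectAndCompute(img_a, None)   # keypoints + 128-D descriptors
kp_b, desc_b = sift.detectAndCompute(img_b, None)

# Brute-force matching with Lowe's ratio test to discard ambiguous matches
matcher = cv2.BFMatcher(cv2.NORM_L2)
raw_matches = matcher.knnMatch(desc_a, desc_b, k=2)
good = [m for m, n in raw_matches if m.distance < 0.75 * n.distance]
print(f"{len(good)} putative correspondences between the two sources")
```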

3.5. Building a Large-Scale Complex Scene Graph Database

This project uses the YFCC100M dataset to obtain metadata containing Geo information and combines it with scene image data such as SUN2010, Places2, and Google Street View to construct a large-scale complex scene graph database [31,32,33]. First, the YFCC100M metadata are processed to extract the entries with Geo labels from the original dataset. The Geo-labeled entries are then transformed with the geopy tool to obtain an image dataset with actual location information. SIFT feature extraction is applied to the scene datasets SUN2010, Places2, and Google Street View to obtain a large database of scene images. After preprocessing, YFCC100M yields a text document database that can serve as a data source for subsequent model training; in particular, this text data contains the Geo information of each image, which is crucial for accurate scene positioning. SUN2010, Places2, and Google Street View, after SIFT feature processing, provide a relatively rich set of scene-based feature datasets and a reliable data source for training scene recognition models. The YFCC100M database was released by Yahoo Flickr in 2014 and consists of 100 million media items generated between 2004 and 2014, including 99.2 million photos and 0.8 million videos [31]. The distributed dataset does not contain the photo or video files themselves; each row of the metadata document describes one photo or video, and of these fields we use the photo/video identifier, longitude, and latitude. Geo information refers to the geographic location recorded at the time of shooting, namely the longitude and latitude, but not all metadata entries contain Geo information, so entries without it must be filtered out. The geopy toolkit is then used to convert longitude and latitude to actual addresses; geopy makes it easy to obtain and parse the geographic coordinates of street addresses, cities, countries, and land parcels worldwide through third-party geocoders and data sources.
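A minimal sketch of the filtering and reverse-geocoding step is shown below. The column names and CSV layout are illustrative (the real YFCC100M distribution is a headerless tab-separated dump), and Nominatim is only one of the third-party geocoders geopy supports.

```python
import csv
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="flsm-demo")   # illustrative user agent

def reverse_geocode(lat: float, lon: float) -> str:
    """Convert a latitude/longitude pair into a textual address."""
    location = geolocator.reverse((lat, lon), language="en", exactly_one=True)
    return location.address if location else ""

with open("yfcc100m_metadata.csv", newline="", encoding="utf-8") as fin, \
     open("geo_labeled_subset.csv", "w", newline="", encoding="utf-8") as fout:
    reader = csv.DictReader(fin)          # assumes columns: photo_id, longitude, latitude, ...
    writer = csv.writer(fout)
    writer.writerow(["photo_id", "latitude", "longitude", "address"])
    for row in reader:
        # Skip entries whose metadata carries no Geo information
        if not row["latitude"] or not row["longitude"]:
            continue
        lat, lon = float(row["latitude"]), float(row["longitude"])
        writer.writerow([row["photo_id"], lat, lon, reverse_geocode(lat, lon)])
        # A real run should rate-limit requests (e.g., with geopy's RateLimiter)
        # to respect the geocoding service's usage policy.
```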

3.6. Data Feature Extraction

The CLIP (Contrastive Language–Image Pre-training) model is used to extract features from the YFCC100M, Im2GPS3k, and BigEarthNet datasets [34,35,36]. The CLIP model consists of two parts, a visual encoder and a text encoder: the visual encoder processes the image information, while the text encoder processes the address information obtained by reverse geocoding. The features extracted by the CLIP model have several advantages. First, because CLIP handles both images and text, it can understand and exploit the association between them, extracting richer and more representative features. Second, the features are highly robust and remain stable under various changes, such as changes in illumination or viewing angle. In addition, they generalize well and can be applied to a wide range of tasks and scenes. Finally, they are highly discriminative and can effectively distinguish different objects and scenes. These properties allow CLIP to perform well in various vision and language tasks. The two encoders map text and images into the same embedding space, and the similarity between a text vector and an image vector is computed to predict whether they match. This design enables the CLIP model to process text and images jointly, which is a major feature of CLIP, the key to its strong performance across tasks, and the basis of our work.
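To illustrate how the two encoders are used in practice, the following is a minimal sketch with the Hugging Face transformers implementation of CLIP; the checkpoint name, query image, and candidate address strings are illustrative and are not the specific models or prompts used in the experiments.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Any CLIP-style checkpoint with paired image/text encoders works here; this one is an example.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

image = Image.open("query_photo.jpg")        # placeholder file name
candidate_addresses = [
    "a photo taken in Beijing, China",
    "a photo taken in Paris, France",
    "a photo taken in Tokyo, Japan",
]

inputs = processor(text=candidate_addresses, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled cosine similarity between the image and each text
probs = outputs.logits_per_image.softmax(dim=-1)
best = int(probs.argmax())
print(candidate_addresses[best], float(probs[0, best]))
```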

3.7. Design of CNN-Based Visual Scene Localization and Recognition

In this technical roadmap, the basic idea is to use deep neural networks to train on complex scene data and obtain a deep learning model, FLsM, which predicts the approximate position and scene type of a captured photo from the image alone. The schematic diagram is shown in Figure 5, which illustrates how image and text information are fused for positioning.
Figure 6 shows the overall framework of the FLsM structure. According to actual needs, images with GNSS labels are trained on a large amount of data to form a complete deep learning pipeline for image localization. The geographic localization problem is transformed into a classification problem: all image data with GNSS labels are quantized into a fixed number of classes, and the GNSS labels are converted into class labels so that each class represents a physical region in the real world. The classification result is then converted back into the GNSS coordinates of the corresponding region.
In this study, in order to obtain a more accurate positioning model, we train with multiple models, including OpenAI’s dual-encoder CLIP model [37]. Thanks to its efficient matching between images and texts, the large model enhances the generalization ability of the visual image positioning model. More than 400 million image–text pairs drawn from Internet data are used for pre-training; they cover a wealth of topics and scenes and provide a wide range of samples for model training. During training of the visual image positioning model, a batch of image–text pairs is selected, the images are encoded with the Image Encoder and the texts with the Text Encoder, and the cosine similarity between the encoded image and text vectors is computed to verify whether each image matches each text. Thanks to this powerful pre-training and matching verification, the experiments show that the positioning accuracy of the model reaches the current state of the art (SOTA) in multiple scenes.
Based on the CLIP architecture, the visual image positioning model consists of two parts: a visual encoder and a text encoder. The visual encoder converts the input image into a fixed-length vector representation and can be either a CNN-based ResNet or a Transformer-based ViT. The text encoder converts the input natural-language text sequence into a fixed-length vector representation and uses the Transformer model. Both encoders are trained to map their inputs into the same embedding space so that similar images and texts lie closer together. The number of parameters differs among the different versions of the CLIP model. To ensure transparency and reproducibility, Table 1 summarizes the datasets used in our experiments, including the dataset names, their sources, and brief descriptions.
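The quantization of GNSS labels into classes is described only at a high level; the sketch below uses a simple fixed-size latitude/longitude grid as one possible realization. The cell size and helper names are illustrative, and finer schemes (e.g., S2 cells or adaptive partitioning) would follow the same pattern.

```python
import numpy as np

CELL_DEG = 1.0  # illustrative cell size: a 1-degree x 1-degree grid over the globe

def gnss_to_class(lat: float, lon: float) -> int:
    """Quantize a GNSS label into a class index identifying one grid cell."""
    row = int((lat + 90.0) // CELL_DEG)    # 0..179 (lat = 90 would need clamping)
    col = int((lon + 180.0) // CELL_DEG)   # 0..359
    return row * int(360 / CELL_DEG) + col

def class_to_gnss(cls: int) -> tuple[float, float]:
    """Map a predicted class back to the center coordinates of its cell."""
    cols = int(360 / CELL_DEG)
    row, col = divmod(cls, cols)
    lat = row * CELL_DEG - 90.0 + CELL_DEG / 2
    lon = col * CELL_DEG - 180.0 + CELL_DEG / 2
    return lat, lon

# Example: build class labels for a batch of geotagged training images
labels = np.array([gnss_to_class(39.91, 116.39), gnss_to_class(48.85, 2.35)])
print(labels, class_to_gnss(int(labels[0])))
```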

4. Results

This section delves into investigating the influence of larger models on accuracy through a multi-model and multi-sample approach, using the Google Landmarks Dataset. Specifically, two CLIP models, StreetCLIP (with 420 million parameters) and MetaCLIP (with 980 million parameters) [38,39], are employed. The experiment adopts a traditional hierarchical search method to facilitate CLIP in deducing the geographical locations of the images. By juxtaposing the performance of these models, particularly highlighting the substantial difference in parameter size between StreetCLIP and MetaCLIP, valuable insights into the impact of model size on accuracy can be gleaned. The methodology unfolds as follows:
  • Step 1: Reverse geocode the latitude and longitude coordinates of the image to obtain textual geographical location information, including country, first-level administrative region, second-level administrative region, address, and detailed address.
  • Step 2: Employ the model to predict the country of the image and compare it with the country information obtained from the textual data.
  • Step 3: Utilize the model to predict the first-level administrative region of the image and compare it with the corresponding information from the textual data.
  • Continue this iterative process until the detailed address is determined. Then, juxtapose it with the actual region of the image to derive accuracy metrics. The experimental findings are summarized in Table 2 below, and a sketch of the hierarchical procedure follows the table.
Table 2. Experimental results of the StreetCLIP and MetaCLIP models.

Model | Country | First-Level Administrative Region | Second-Level Administrative Region | Address | Detailed Address
StreetCLIP | 20.65% | 6.02% | 1.79% | 0.71% | 0.59%
MetaCLIP | 20.45% | 5.99% | 1.93% | 0.95% | 0.89%
Note: The highest accuracy in each column is highlighted (bold green) in the original table.
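The hierarchical search in Steps 1–4 above can be sketched as follows. The gazetteer, the candidate lists, the prompt template, and the zero-shot scoring callable (e.g., the CLIP comparison from Section 3.6) are assumptions for illustration rather than the exact experimental code.

```python
from typing import Callable, Dict, List, Sequence

def hierarchical_locate(image,
                        score: Callable[[object, Sequence[str]], Sequence[float]],
                        gazetteer: Dict[str, List[str]],
                        countries: List[str]) -> List[str]:
    """Narrow the location level by level: country -> admin regions -> address."""
    path: List[str] = []
    candidates = countries
    while candidates:
        # Build one text prompt per candidate, most specific place name first
        prompts = [f"a photo taken in {', '.join([c] + path[::-1])}" for c in candidates]
        scores = score(image, prompts)                       # one similarity per prompt
        best = candidates[max(range(len(scores)), key=scores.__getitem__)]
        path.append(best)
        # The gazetteer maps a region to its sub-regions; an empty list ends the search
        candidates = gazetteer.get(best, [])
    return path  # e.g. ["France", "Ile-de-France", "Paris", ...]

# Accuracy at each level is then the fraction of test images whose predicted region
# matches the reverse-geocoded ground truth at that level.
```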
The experiments show that the positioning accuracy is extremely low and that switching to a larger model does not significantly improve it. Based on this, we change the datasets and select Im2GPS3k, GeoYFCC, and BigEarthNet for experimental verification [40].
The Im2GPS3k dataset is a subset of the original Im2GPS dataset used for testing in the field of photo geolocation estimation and is an important part of the photo geolocation benchmark. The task is to determine the exact latitude and longitude of the place where a photo was taken, which is challenging but widely applicable in computer vision; the dataset contains about 3000 images. GeoYFCC is a geographic subset of YFCC in which each country contributes 20,000–30,000 images, so the geographic distribution is more uniform; it contains about 1 million images. BigEarthNet is a large-scale remote sensing dataset based on Sentinel-2 satellite imagery; it contains 590,326 image patches over Europe, each 120 × 120 pixels with 13 spectral bands, covering 43 land cover/use classes. The geographic distributions of these datasets are shown in Figure 7.
In the small-scale space of street and living-environment scenes, we studied and performed experiments on two datasets. First, a set of comparative experiments was carried out on the Im2GPS3k dataset. The traditional hierarchical search traverses the location through hierarchical queries: it starts from a larger area such as the country, then narrows the scope step by step to the first-level administrative region and the second-level administrative region, and finally obtains the detailed address. The comparative experiment uses a new similarity-based retrieval method with Milvus as the vector database framework.
The specific experimental process is as follows. Each image in the dataset is associated with a unique image ID and latitude–longitude information. The latitude–longitude coordinates are converted into textual location information by reverse geocoding, and the textual location information is encoded by the model into a textual feature vector. These components are abstracted into Milvus entities and stored in the database, and the steps are repeated for every image in the dataset. The dataset is then traversed again: the model extracts an image feature vector from each picture, and the database is queried for the Milvus entity whose textual feature vector has the highest cosine similarity to the image feature vector. The latitude–longitude coordinates of the image and of the retrieved entity are compared to calculate the predicted distance error. Finally, the errors are categorized into different scales (e.g., errors of less than 1 km are counted as within 1 km and errors of less than 25 km as within 25 km, where the 25 km category includes the 1 km one). This process establishes associations between images and geographic location information, and the efficient vector search capabilities of the Milvus database allow the positioning accuracy of the models to be evaluated across different scales.
Comparing the new method with the traditional method, both using StreetCLIP, shows that the new method is clearly better at smaller scales but differs little at larger scales. The new method is then used to test other models for image positioning, including ViTbigCLIP (which performs well on the GeoDE dataset) and EVAplusCLIP (whose parameters reach 5 billion); performance improves further with these stronger models. The experimental results are shown in Table 3, and heat maps of positioning accuracy are shown in Figure 8a. On the GeoYFCC dataset, we use the same method to test StreetCLIP, ViTbigCLIP, EVAplusCLIP, and GeoCLIP (because GeoCLIP has no text encoder, the traditional hierarchical search method is used for this model); the experimental results are shown in Figure 8b,c. For large-scale space based on remote sensing images, the positioning accuracy of the models is tested on the BigEarthNet dataset, as shown in Figure 8d.
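The retrieval-and-evaluation loop can be sketched as follows. For brevity, the sketch replaces the Milvus index with a brute-force NumPy search over L2-normalized vectors; the distance thresholds match those reported in Table 3, while the array names and the random demo data are illustrative.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0
THRESHOLDS_KM = [1, 25, 200, 750, 2500]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between latitude/longitude points."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi, dlmb = np.radians(lat2 - lat1), np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def evaluate(image_vecs, text_vecs, image_coords, text_coords):
    """image_vecs/text_vecs: (N, D) normalized embeddings; *_coords: (N, 2) lat/lon."""
    sims = image_vecs @ text_vecs.T          # cosine similarity matrix
    nearest = sims.argmax(axis=1)            # best-matching text entity per image
    errors = haversine_km(image_coords[:, 0], image_coords[:, 1],
                          text_coords[nearest, 0], text_coords[nearest, 1])
    # Cumulative accuracy: each threshold includes all smaller ones
    return {t: float(np.mean(errors <= t)) for t in THRESHOLDS_KM}

# Example with random placeholder data
rng = np.random.default_rng(0)
vecs = rng.normal(size=(100, 512))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
coords = np.c_[rng.uniform(-60, 60, 100), rng.uniform(-180, 180, 100)]
print(evaluate(vecs, vecs, coords, coords))
```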
The experiments show that the overall accuracy of positioning based on remote sensing images is low and improves only slightly at the 750 km scale; however, most of this dataset's coverage lies within roughly 2500 km, which causes the accuracy to soar at the final 2500 km threshold, so the method is more effective in large-scale space.
Examine the four charts in Figure 9a–d to see how each model performs under various distance settings and datasets. In most cases, the enhanced version of StreetCLIP (designated as “Our method”) demonstrates superior performance, particularly in large-scale spaces where it operates more smoothly. On some datasets, the StreetCLIP model outperforms the ViTbigCLIP and EVAplusCLIP models; however, the enhanced version of StreetCLIP exhibits more consistent performance growth across a number of areas. In every scenario, the original StreetCLIP displays the slowest performance growth. Figure 9e–h shows the accuracy of the models under various distance parameters on the four datasets, with the size of each scatter point representing the variation in the model’s performance. “Our method” (the enhanced StreetCLIP) generally shows rather large scatter points on all datasets, especially at large distances (750 km and 2500 km), which suggests high accuracy. Compared with the other models, “Our method” exhibits a notable improvement in accuracy, particularly on the Im2GPS3k and yfcc_geosubset datasets.

5. Discussion

To investigate visual blur positioning using large models, we utilized conventional street/environmental image and remote sensing image datasets as elastic scale space samples. Experiments were conducted across various large models to analyze their performance. The results indicate that models with larger parameters, such as EVAplusCLIP, exhibit enhanced positioning accuracy on elastic scale space sample datasets, with more stable outcomes aligning with the requirements of visual blur in elastic scale space positioning. The key advantage of EVAplusCLIP lies in its larger model parameters, enabling better data complexity capture and representation, thereby enhancing model generalization. Notably, the output vector dimension of the model significantly impacts its performance, as evidenced by the ViTbigCLIP model having the highest positioning accuracy score on the GeoDE dataset, due to its larger vector dimensions providing more information for improved prediction accuracy. However, optimizing model parameters and vector dimensions alone may not suffice to meet all requirements. To further enhance model accuracy, alternative improvement methods should be considered, such as introducing a vector database to optimize vector retrieval and utilization for improved model accuracy.

6. Conclusions

In this research, we introduce a fuzzy positioning approach for images based on large models in elastic scale space. Our comparative experiments demonstrate that the EVAplusCLIP model achieves a higher positioning accuracy and can effectively serve the image positioning function across various scale spaces. This work is an exploratory research endeavor with several areas open for future improvement. Potential research directions include optimizing model stability through further experiments with increased model parameters, enhancing model performance on specific datasets by expanding the output vector dimension and training on more relevant data, and exploring additional improvement methods, such as vector databases, to further enhance positioning accuracy. These paths present critical avenues for our ongoing in-depth investigation and optimization of this research.

Author Contributions

Conceptualization, W.C.; Methodology, W.C.; Formal analysis, Y.W.; Resources, Y.L.; Writing—original draft, W.C.; Supervision, L.M. and J.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by [China’s Ministry of Science and Technology National Key R&D Program Beidou xing Energy] grant number [E33514060C].

Data Availability Statement

The data presented in this study are available in this article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, X.; Wang, B.; Li, X.; Huang, J.; Lyu, H.; Han, X. Principle and performance of multi-frequency and multi-GNSS PPP-RTK. Satell. Navig. 2022, 3, 7. [Google Scholar] [CrossRef]
  2. Al Hage, J.; Najjar, M.E.B.E. Improved Outdoor Localization Based on Weighted Kullback-Leibler Divergence for Measurements Diagnosis. IEEE Intell. Transp. Syst. Mag. 2018, 12, 41–56. [Google Scholar] [CrossRef]
  3. Li, M.-H.; Jiang, P.; Yu, D.-J.; Sun, J.-H. Position and attitude determination by integrated GPS/SINS/TS for feed support system of FAST. Res. Astron. Astrophys. 2020, 20, 140. [Google Scholar] [CrossRef]
  4. Shi, Q.; Wu, J.; Lin, Z.; Qin, N. Learning a Robust Hybrid Descriptor for Robot Visual Localization. J. Robot. 2022, 2022, 9354909. [Google Scholar] [CrossRef]
  5. Wang, Z.; Cheng, X. Adaptive optimization online IMU self-calibration method for visual-inertial navigation systems. Measurement 2021, 180, 109478. [Google Scholar] [CrossRef]
  6. Cao, M.; Tang, F.; Ji, P.; Ma, F. Improved Real-Time Semantic Segmentation Network Model for Crop Vision Navigation Line Detection. Front. Plant Sci. 2022, 13, 898131. [Google Scholar] [CrossRef]
  7. Critchley-Marrows, J.J.; Wu, X.; Cairns, I.H. An architecture for a visual-based PNT alternative. Acta Astronaut. 2023, 210, 601–609. [Google Scholar] [CrossRef]
  8. Bellamy, B.R. The Robotic Imaginary: The Human and the Price of Dehumanized Labor by Jennifer Rhee. Sci. Fict. Stud. 2019, 46, 655–657. [Google Scholar] [CrossRef]
  9. Cass, S. Ayanna Howard: Robot wrangler. IEEE Spectr. 2005, 42, 21–22. [Google Scholar] [CrossRef]
  10. Carpenter, K.; Wiltsie, N.; Parness, A. Rotary Microspine Rough Surface Mobility. IEEE/ASME Trans. Mechatron. 2015, 21, 2378–2390. [Google Scholar] [CrossRef]
  11. Tang, C.; Ma, W.; Li, B.; Jin, M.; Chen, H. Cephalopod-Inspired Swimming Robot Using Dielectric Elastomer Synthetic Jet Actuator. Adv. Eng. Mater. 2019, 22, 1901130. [Google Scholar] [CrossRef]
  12. Ning, X.; Liu, L. A Two-Mode INS/CNS Navigation Method for Lunar Rovers. IEEE Trans. Instrum. Meas. 2014, 63, 2170–2179. [Google Scholar] [CrossRef]
  13. Ning, X.; Fang, J. A new autonomous celestial navigation method for the lunar rover. Robot. Auton. Syst. 2009, 57, 48–54. [Google Scholar] [CrossRef]
  14. Wang, W.-R.; Ren, X.; Wang, F.-F.; Liu, J.-J.; Li, C.-L. Terrain reconstruction from Chang’e-3 PCAM images. Res. Astron. Astrophys. 2015, 15, 1057–1067. [Google Scholar] [CrossRef]
  15. Sutoh, M.; Wakabayashi, S.; Hoshino, T. Influence of atmosphere on lunar rover performance analysis based on soil parameter identification. J. Terramech. 2017, 74, 13–24. [Google Scholar] [CrossRef]
  16. Choi, I.-S.; Ha, J.-E. Simple method for calibrating omnidirectional stereo with multiple cameras. Opt. Eng. 2011, 50, 43608. [Google Scholar] [CrossRef]
  17. Gamarra, D.F.T.; Pinpin, L.K.; Laschi, C.; Dario, P. Forward Models Applied in Visual Servoing for a Reaching Task in the iCub Humanoid Robot. Appl. Bionics Biomech. 2009, 6, 345–354. [Google Scholar] [CrossRef]
  18. Zhang, M.; Cui, J.; Zhang, F.; Yang, N.; Li, Y.; Li, F.; Deng, Z. Research on evaluation method of stereo vision measurement system based on parameter-driven. Optik 2021, 245, 167737. [Google Scholar] [CrossRef]
  19. Huang, W.; Fajen, B.R.; Fink, J.; Warren, W.H. Visual navigation and obstacle avoidance using a steering potential function. Robot. Auton. Syst. 2006, 54, 288–299. [Google Scholar] [CrossRef]
  20. Bulanon, D.; Burks, T.; Alchanatis, V. Image fusion of visible and thermal images for fruit detection. Biosyst. Eng. 2009, 103, 12–22. [Google Scholar] [CrossRef]
  21. Kuo, B.-W.; Chang, H.-H.; Chen, Y.-C.; Huang, S.-Y. A Light-and-Fast SLAM Algorithm for Robots in Indoor Environments Using Line Segment Map. J. Robot. 2011, 2011, 257852. [Google Scholar] [CrossRef]
  22. Lv, K.; Zhang, Y.; Yu, Y.; Wang, Z.; Min, J. SIIS-SLAM: A Vision SLAM Based on Sequential Image Instance Segmentation. IEEE Access 2022, 11, 17430–17440. [Google Scholar] [CrossRef]
  23. Zhao, X.; Wang, T.; Li, Y.; Zhang, B.; Liu, K.; Liu, D.; Wang, C.; Snoussi, H. Target-Driven Visual Navigation by Using Causal Intervention. IEEE Trans. Intell. Veh. 2023, 9, 1294–1304. [Google Scholar] [CrossRef]
  24. Li, J.; Yin, J.; Deng, L. A robot vision navigation method using deep learning in edge computing environment. EURASIP J. Adv. Signal Process. 2021, 2021, 22. [Google Scholar] [CrossRef]
  25. Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to Prompt for Vision-Language Models. Int. J. Comput. Vis. 2022, 130, 2337–2348. [Google Scholar] [CrossRef]
  26. Xing, Y.; Wu, Q.; Cheng, D.; Zhang, S.; Liang, G.; Wang, P.; Zhang, Y. Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model. IEEE Trans. Multimed. 2023, 26, 2056–2068. [Google Scholar] [CrossRef]
  27. Sun, B.; Liu, G.; Yuan, Y. F3-Net: Multiview Scene Matching for Drone-Based Geo-Localization. IEEE Trans. Geosci. Remote Sens. 2023, 61, 3278257. [Google Scholar] [CrossRef]
  28. Vicente Vivanco, C.; Nayak, G.K.; Shah, M. GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization. arXiv 2023, arXiv:2309.16020. [Google Scholar]
  29. Gao, H.; Zhu, M.; Wang, X.; Li, C.; Xu, S. Lightweight Spatial-Spectral Network Based on 3D-2D Multi-Group Feature Extraction Module for Hyperspectral Image Classification. Int. J. Remote Sens. 2023, 44, 3607–3634. [Google Scholar] [CrossRef]
  30. Cheng, P.; Cheng, Y.; Wang, X.; Wu, S.; Xu, Y. Realization of an Optimal Dynamic Geodetic Reference Frame in China: Methodology and Applications. Engineering 2020, 6, 879–897. [Google Scholar] [CrossRef]
  31. Thomee, B.; Shamma, D.A.; Friedland, G.; Elizalde, B.; Ni, K.; Poland, D.; Borth, D.; Li, L.-J. Yfcc100m: The new data in multimedia research. arXiv 2016, arXiv:1503.01817. [Google Scholar] [CrossRef]
  32. Alsubai, S.; Dutta, A.K.; Alkhayyat, A.H.; Jaber, M.M.; Abbas, A.H.; Kumar, A. Hybrid deep learning with improved Salp swarm optimization based multi-class grape disease classification model. Comput. Electr. Eng. 2023, 108, 108733. [Google Scholar] [CrossRef]
  33. Anguelov, D.; Dulong, C.; Filip, D.; Frueh, C.; Lafon, S.; Lyon, R.; Ogale, A.; Vincent, L.; Weaver, J. Google Street View: Capturing the World at Street Level. Computer 2010, 43, 32–38. [Google Scholar] [CrossRef]
  34. Steven, B.; Ayton, A. Text-to-Image Synthesis with Self-supervision via Contrastive Language-Image Pre-Training (CLIP). Available online: https://www.researchgate.net/publication/369299175_Text-to-Image_Synthesis_with_Self-supervision_via_Contrastive_Language-Image_Pre-training_CLIP (accessed on 13 May 2024).
  35. Vo, N.; Jacobs, N.; Hays, J. Revisiting im2gps in the deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  36. Sumbul, G.; Wall, A.; Kreuziger, T.; Marcelino, F.; Costa, H.; Benevides, P.; Caetano, M.; Demir, B.; Markl, V. BigEarthNet-MM: A large-scale, multimodal, multilabel benchmark archive for remote sensing image classification and retrieval [software and data sets]. IEEE Geosci. Remote Sens. Mag. 2021, 9, 174–180. [Google Scholar] [CrossRef]
  37. Roumeliotis, K.I.; Tselikas, N.D. ChatGPT and Open-AI Models: A Preliminary Review. Futur. Internet 2023, 15, 192. [Google Scholar] [CrossRef]
  38. Haas, L.; Silas, A.; Michal, S. Learning generalized zero-shot learners for open-domain image geolocalization. arXiv 2023, arXiv:2302.00275. [Google Scholar]
  39. Parashar, S.; Lin, Z.; Liu, T.; Dong, X.; Li, Y.; Ramanan, D.; Caverlee, J.; Kong, S. The Neglected Tails of Vision-Language Models. arXiv 2024, arXiv:2401.12425. [Google Scholar]
  40. Dubey, A.; Ramanathan, V.; Pentland, A.; Mahajan, D. Adaptive methods for real-world domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021. [Google Scholar]
Figure 1. Schematic diagram of image fuzzy positioning.
Figure 2. Flow description of image positioning algorithm.
Figure 3. Overall framework of elastic spatial positioning.
Figure 4. Visual fuzzy positioning process based on large model.
Figure 5. Image and text information fusion processing positioning.
Figure 6. Overall framework of the FLsM structure, integrating image and text large models.
Figure 7. Schematic diagram of global distribution of image dataset.
Figure 8. Experimental results under different models.
Figure 9. Relationship between positioning accuracy of different datasets under different models.
Table 1. Summary of datasets.

Dataset Name | Source | Description
Google Landmarks | cvdfoundation/google-landmark | 5 million landmark-labeled images.
BigEarthNet | Technische Universität Berlin | Contains 590,326 image pairs from Sentinel-1 and Sentinel-2.
Im2GPS3k | TIBHannover/GeoEstimation | Comprises 3000 geotagged images that span a variety of scenes and locations worldwide.
GeoYFCC | abhimanyudubey/GeoYFCC | Comprises a total of 1,147,059 images from 1261 categories across 62 countries.
Table 3. Experimental results under different models.

Dataset | Model | 1 km Accuracy | 25 km Accuracy | 200 km Accuracy | 750 km Accuracy | 2500 km Accuracy
BigEarthNet | StreetCLIP | 3.38796 × 10−6 | 0.004997239 | 0.067010432 | 0.177152963 | 0.99427435
BigEarthNet | ViTbigCLIP | 2.20217 × 10−5 | 0.004595766 | 0.071401226 | 0.274355187 | 0.950119087
BigEarthNet | EVAplusCLIP | 1.18579 × 10−5 | 0.004094348 | 0.042068281 | 0.18899896 | 0.956994949
Im2GPS3k | StreetCLIP | 0.171838505 | 0.326659993 | 0.464798131 | 0.635969303 | 0.794127461
Im2GPS3k | ViTbigCLIP | 0.256256256 | 0.450784117 | 0.589255923 | 0.724724725 | 0.851851852
Im2GPS3k | EVAplusCLIP | 0.249249249 | 0.431097764 | 0.544210878 | 0.688688689 | 0.829496163
mix_feature | StreetCLIP | 0.074074074 | 0.212545879 | 0.297297297 | 0.464130797 | 0.650650651
mix_feature | ViTbigCLIP | 0.016016016 | 0.082749416 | 0.122455789 | 0.23023023 | 0.448114781
mix_feature | EVAplusCLIP | 0.048381715 | 0.125792459 | 0.18685352 | 0.294627961 | 0.515181849
Note: The accuracy values represent the proportion of correctly identified geographical locations; the spatial scales are in kilometers. The highest value in each column is highlighted (bold green) in the original table.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Citation: Chen, W.; Miao, L.; Gui, J.; Wang, Y.; Li, Y. FLsM: Fuzzy Localization of Image Scenes Based on Large Models. Electronics 2024, 13, 2106. https://0-doi-org.brum.beds.ac.uk/10.3390/electronics13112106
