Article

Geographic Knowledge Graph Attribute Normalization: Improving the Accuracy by Fusing Optimal Granularity Clustering and Co-Occurrence Analysis

1 School of Geomatics and Urban Spatial Informatics, Beijing University of Civil Engineering and Architecture, Beijing 100044, China
2 Beijing Key Laboratory of Urban Spatial Information Engineering, Beijing 100038, China
3 Information Service Department, National Geomatics Centre of China, Beijing 100830, China
* Author to whom correspondence should be addressed.
ISPRS Int. J. Geo-Inf. 2022, 11(7), 360; https://0-doi-org.brum.beds.ac.uk/10.3390/ijgi11070360
Submission received: 29 March 2022 / Revised: 24 May 2022 / Accepted: 20 June 2022 / Published: 23 June 2022

Abstract

Expansion of the entity attribute information of geographic knowledge graphs is essentially the fusion of the Internet's encyclopedic knowledge. However, encyclopedic knowledge lacks structured attribute information, and synonymy and polysemy are pervasive. These problems reduce the quality of the knowledge graph and cause incomplete and inaccurate semantic retrieval. Therefore, we normalize the attributes of a geographic knowledge graph based on optimal granularity clustering and co-occurrence analysis, using both the structure and the semantic relations of entity attributes to identify synonymy and correlation between attributes. Specifically: (1) We design a classification system for geographic attributes, using a community discovery algorithm to cluster attribute names; the optimal clustering granularity is identified by a labeled target detection algorithm. (2) We complete the fine-grained identification of attribute relations by analyzing the co-occurrence relations of the attributes and by rule inference. (3) Finally, the performance of the system is verified by manual discrimination on the case of "mountain, water, forest, field, lake and grass". The results show the following: (1) The average precision of spatial relations was 0.974 and the average recall was 0.937; the average precision of data relations was 0.977 and the average recall was 0.998. (2) The average F1 of the similarity results is 0.473, the average F1 of the co-occurrence analysis results is 0.735 and the average F1 of the rule-based modification results is 0.934; that is, the final accuracy exceeds 90%. Compared with traditional methods that consider similarity alone, our system improves the accuracy of synonymous attribute recognition and is also able to identify near-synonymous attributes. Integrating optimal granularity clustering and co-occurrence analysis into attribute normalization can greatly improve both processing efficiency and accuracy.

1. Introduction

In recent decades, geographic information services have become one of the most important information services [1], providing fast access to geographic information. With the development of information and communication technology, geographic information is undergoing a great transformation, from single static to multi-source dynamic and from precise structure to fuzzy heterogeneity. In order to achieve a geographic knowledge service system for the public, geographic information services must realize the intelligent transformation from "data-information-knowledge-wisdom" [2]. Geographic knowledge is the product of geographic thinking and reasoning about natural and human phenomena in the physical world [3]. How to express, organize, store and produce geographic knowledge scientifically has always been a core issue for geography scholars [4]. With the rapid development of the "Internet plus", big data, cloud computing and artificial intelligence, knowledge services are increasingly represented by geographic knowledge graphs [5]. Knowledge service refers to the information service of extracting knowledge and information content from various explicit and tacit resources, building a knowledge network, and providing knowledge content or solutions for users; the geographic knowledge graph is one major representation. The KG (Knowledge Graph) [6], which originally developed from the DB (Data Base), contains rich semantic information and has a flexible form as well as great scalability. It can bring complex knowledge fields and knowledge systems into reality by using methods such as data mining, information processing, knowledge measurement and graphics. For instance, geographic knowledge graphs play a great role in the fields of traffic accident inference [7], disaster environment construction [8], information recommendation [9] and geography education [10].
Today, multi-source heterogeneous datasets are usually used to build geographic knowledge graphs. Geospatial data carry location information and topological relations. Combining these data with Internet encyclopedia knowledge bases, such as LinkedGeoData [11], GeoNames Ontology [12], GeoWorldNet [13], YAGO [14], CrowdGeoKG [15] and CONCEPTNET [16], can enrich attribute information. However, these geographic knowledge graphs yield incomplete and inaccurate results in semantic retrieval, probably because the encyclopedic knowledge base is crowdsourced. In particular, it is an aggregate of different sources and lacks structured attribute information. Synonymy and polysemy are common and reduce the quality of geographic knowledge graphs. In order to improve the quality and enhance intelligent services [17], it is necessary to fuse the attribute information of geographic entities, that is, to perform attribute normalization.
Geographic attributes include semantic attributes and spatial attributes. Influenced by the rich vocabulary of geography, researchers often use different words in different scenarios when describing the same spatial relation, which makes language-based queries inconsistent with human language and cognition. Du et al. [18] observed that the semantic relations between spatial features of different geographic ontology types, and the vocabulary used to describe spatial relations, differ. In this paper, we focus on geo-semantic attribute alignment and classify the semantic description vocabulary of spatial relations to facilitate spatial computation and knowledge inference. In this context, geo-semantic attribute alignment refers to the normalization of geographical attributes with the same meaning according to semantics, spatial computing [19] refers to a computing model that uses spatial principles to optimize the performance of distributed computing and knowledge inference [20] is the process of using known knowledge to draw conclusions by deduction.
The remainder of this paper is structured as follows: Section 2 introduces the state of the art, the relevant concepts in this paper and the related techniques used. Section 3 introduces the methodology and the related models in this paper. The experiments and results are discussed in Section 4. Finally, the conclusions are presented in Section 5.

2. Related Work

Existing methods for attribute normalization include unsupervised learning and supervised learning [21]. Unsupervised learning usually achieves normalization from a text morphology perspective by calculating text similarity, or by string-matching vocabulary against existing knowledge base concepts. Gunaratna et al. [22] and Zhang et al. [23] identified synonymous attributes in Linked Open Data (LOD) [24] using the overlap between triples; the latter gives an unsupervised framework for attribute normalization. Ristad et al. [25] calculated text similarity using string edit distance to achieve attribute normalization. Tsuruoka et al. [26] used a logistic regression algorithm to compute string similarity for attribute normalization, which outperformed the rule-matching approach. Liu et al. [27] transformed attribute alignment into a similarity computation problem over attribute functions. Supervised learning tends to treat the matching of descriptive text against a knowledge base as a sequence labeling task: it learns textual features for each representation of a geographic attribute name in order to predict the text classification of the target attribute name and, thus, achieves the normalization of geographic attribute names. For example, Huang et al. [28] proposed a data-driven fine-grained alignment method that integrates the extensions of attributes and definition domains to unify the identification of synonymy, inclusion and correlation relations among attributes. Šmíd et al. [29] calculated the minimum distance between two attributes by comparing aggregated attribute information with the features in the dataset and then realized attribute alignment with the KNN algorithm. This method can achieve attribute alignment without losing important information and can predict the distance between aligned attributes and individual attributes.
Overall, unsupervised learning is easier to implement than supervised learning. However, it relies on knowledge bases and large-scale training data, which can be difficult to obtain. Supervised learning does not rely on a geographic knowledge base, and it can update geographic attribute name features through continuous learning. Supervised learning can therefore normalize geographic attribute names at the semantic level more flexibly.
However, most studies use the semantic similarity of attribute names and attribute values, or use the data types of attribute values, as features for attribute normalization, ignoring the structural features of entity attributes. Entities usually appear in the knowledge graph as triples, and an attribute can appear only once in an entity's attribute set. To address these problems, we propose an attribute normalization method that fuses optimal granularity clustering and co-occurrence analysis. The method first clusters the attributes using similarity and initially identifies the correlations among the attributes. Second, co-occurrence analysis is used to filter out non-synonyms based on structure. Finally, rules are generated based on the characteristics of geographic attributes to complete the identification of synonymous attributes, thus completing the attribute normalization task.
In the following, we introduce the relevant concepts in this paper and the related techniques used.
Attribute normalization. For each category in the web encyclopedia knowledge base, the names of geographical attributes with the same meaning are normalized; that is, attributes with different names but the same meaning are merged. For example, "elevation" and "altitude" are unified as "altitude". Formally, given a set of synonymous attribute names $\{p_1, p_2, p_3, \ldots, p_n\}$, one representative $p_i$ is selected to describe the attribute.
Attribute relation. In this paper, the relation of geographical attributes is defined as follows: (1) Related words refer to word pairs with a high similarity between two attribute names. It is a collective term for synonyms and near-synonyms. (2) Synonyms refer to word pairs with the same meaning, such as “altitude” and “height”. (3) Near-synonyms refer to word pairs with high similarity but different meanings, for example, “east from” and “east to”.
Word vector. The word vector takes the form of a low-dimensional vector of real numbers to represent a word [30]. The core idea is to map words into a vector space, and this vector space retains much of the original semantics. Each value represents a feature with a certain semantic and syntactic interpretation, which is obtained by contextual analysis. For example, (0.543, −0.242, −0.143, 0.435, …, −0.107) represents the word vector for a word and 0.543 represents an explanatory feature of the word vector; its value is derived from the different contributions of each word in the lexicon. Different word vectors are trained from different corpora, and here, we train word vectors for geographical information.
Community. A community is a subgraph containing nodes that are more densely linked to each other than to the rest of the graph; equivalently, a graph has a community structure if the number of links within any subgraph is higher than the number of links between subgraphs [31].
Modularity. The modularity is used to measure whether the division of a community is a relatively good result. This means that the result has a high similarity of nodes inside the community and a low similarity of nodes outside the community [32].
Evaluation Index. A value reflecting the quality of the experimental results. Different experiments have different evaluation indices.
Word similarity calculation:
Traditional attribute normalization mines synonyms only in terms of similarity. The similarity is usually calculated with the word vector method: word vectors are trained using word2vec [33], and the similarity of the obtained word vectors is calculated with the cosine similarity:
$$\mathrm{sim}(word_1, word_2) = \frac{\sum_{k=1}^{n} w_k(word_1) \times w_k(word_2)}{\sqrt{\sum_{k=1}^{n} w_k^2(word_1)} \times \sqrt{\sum_{k=1}^{n} w_k^2(word_2)}} \quad (1)$$
At present, there are three main methods of synonym mining: based on a knowledge base [34], on text similarity [35] and on contextual analysis [36]. Although mining synonyms with a knowledge base is simple, the coverage of any knowledge base is limited, and each domain needs its own knowledge base. The advantage of text-similarity-based synonym mining is that it is simple to calculate, does not require a large corpus and can find a synonym relation even if a word appears only once. Its disadvantage is low accuracy: many wrong synonyms will be mined, especially when the two words are relatively short. This method is therefore suitable for longer texts, especially specialized words and terms. The advantage of synonym mining using contextual relevance is that it can mine a large number of synonyms from the corpus; the disadvantage is that, because the method analyzes word frequency in context, related but non-synonymous words also receive high similarity. For example, the similarity of "east from" and "east to" is 0.83, but they are not synonyms and cannot be aligned. Therefore, we propose a method based on co-occurrence analysis and rule-based reasoning to achieve the precise identification of synonymous attributes within classes.
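As a concrete illustration of Formula (1), the following is a minimal sketch of computing the cosine similarity between two attribute-name vectors. It assumes a word2vec model trained with the gensim library, and the model path is only a placeholder.

```python
import numpy as np
from gensim.models import Word2Vec

# Load a previously trained word2vec model (the path is a placeholder).
model = Word2Vec.load("geo_word2vec.model")

def similarity(word1: str, word2: str) -> float:
    """Cosine similarity between two attribute-name vectors, as in Formula (1)."""
    v1, v2 = model.wv[word1], model.wv[word2]
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# gensim exposes the same computation directly:
# model.wv.similarity("elevation", "altitude")
```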
Community discovery algorithm:
The community discovery algorithm can find highly modular partitions of large networks in a short time and expose the complete hierarchical community structure of the network, yielding different community detection resolutions [32]. The algorithm has two phases. Initially, each node forms its own community. The first phase groups similar nodes into the same community wherever this improves the partition. The second phase re-initializes the graph, collapsing each community into a single node. The first phase is then repeated on the new graph, and the two phases iterate until convergence.
In this paper, we use this method to modularize attribute names, i.e., to cluster attributes while seeking a moderate modularity that achieves the best clustering effect. The method relies on the following model: modularity is the fraction of the network's edges that fall inside communities minus the expected value of that fraction when the network is replaced by a random network with the same community assignment.
$$Q = \frac{1}{2m} \sum_{vw} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \delta(c_v, c_w) \quad (2)$$
where $A_{vw}$ represents the weight of the edge between $v$ and $w$, $k_v = \sum_w A_{vw}$ is the sum of the weights of the edges attached to vertex $v$, $c_v$ is the community to which vertex $v$ is assigned, the $\delta$ function $\delta(i, j)$ is 1 if $i = j$ and 0 otherwise, and $m = \frac{1}{2} \sum_{vw} A_{vw}$ [32].
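To make the procedure concrete, below is an illustrative sketch that builds a small similarity graph over attribute names and runs Louvain community detection. The edge weights are invented toy values, and networkx's Louvain implementation is used as a stand-in for the algorithm of [32].

```python
import networkx as nx

# Toy similarity graph over attribute names; the edge weights are invented
# word-vector similarities above the clustering threshold.
G = nx.Graph()
G.add_weighted_edges_from([
    ("elevation", "altitude", 0.91),
    ("altitude", "height", 0.88),
    ("east from", "east to", 0.83),
])

# Louvain community detection; `resolution` controls the clustering granularity.
communities = nx.community.louvain_communities(G, weight="weight", resolution=1.0, seed=42)

# Modularity Q of the resulting partition, as in Formula (2).
Q = nx.community.modularity(G, communities, weight="weight")
print(communities, round(Q, 3))
```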
This section has introduced the state of the art and briefly described the concepts and methodological foundations used below, so that the reader can more easily follow the methodology in the remainder of the paper.

3. Methods and Models

In this section, we introduce the attribute normalization proposed in this paper as well as two basic problems and their solutions and models.

3.1. Overview

In this paper, the word vectors are trained with the word2vec tool on the data of the encyclopedia knowledge base, and the attribute names are used as a custom dictionary in the process so as to better train the word vectors of geographic attribute names. The similarity between attribute names is calculated using the trained word vectors, and the attribute names are then clustered based on this similarity. The attribute names in each class are queried, in turn, for co-occurrence in the attribute-name set of each entity; if two names co-occur, they are not synonyms and are not normalized. The co-occurrence results are then scored based on frequency, and the remaining related words are filtered based on the rules introduced in Section 3.3.2, yielding the final result. The overall technical flow chart is shown in Figure 1 and comprises four parts: data pre-processing, optimal granularity attribute clustering based on the labeled target detection algorithm, result scoring based on co-occurrence analysis and result optimization based on rule inference. In the data pre-processing part, the data sources are imported into the database for further querying and used as a corpus to train the word vectors. In the clustering part, clustering is based on the similarity calculated from the word vectors. In the scoring part, co-occurrence analysis is performed on the clustering results and the results are scored. Finally, the results are optimized by rule inference.
According to the structure of the web encyclopedia data, the resources in an encyclopedia can be described by a series of triples shaped as 〈word name, word attribute, word attribute content〉, whose three elements correspond to entity, attribute and attribute value. Geographical entities can be classified into different categories based on a specific classification system.
Let 〈e, p, l〉 be such a triple, where $e$ is the corresponding entity, $p$ represents the attribute and $l$ stands for the attribute value. The geographical attribute set of each entity is denoted $E = \{p_1, p_2, p_3, \ldots, p_n\}$. Each category's attribute dataset is denoted $C = \{E_1, E_2, E_3, \ldots, E_n\}$. Clustering is performed separately within each category; each attribute cluster obtained is denoted $G = \{P_1, P_2, P_3, \ldots, P_n\}$, and the set of clusters of each category is denoted $V = \{G_1, G_2, G_3, \ldots, G_n\}$, where $n$ denotes the number of elements in each set.
Specifically, in dataset C, the similarity $sim_{i,j}$ is calculated between attribute pairs such as $(p_1, p_2)$, $(p_1, p_3)$ and $(p_1, p_n)$. Attribute pairs whose similarity exceeds a certain threshold are clustered to obtain dataset G. Then, co-occurrence analysis and rule-based reasoning are performed on the dataset to obtain the synonyms. The specific schematic diagram is shown in Figure 2.
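For readers who prefer code to notation, the following small sketch maps the sets defined above onto concrete Python structures; all attribute names and groupings are invented examples.

```python
# Illustrative layout of the notation above; every name is an invented example.
# E: attribute-name set of one entity; C: all entity sets of one category;
# G: one attribute cluster; V: all clusters of the category.
E1 = {"altitude", "area", "east from"}
E2 = {"elevation", "length", "east to"}
C = [E1, E2]

V = [
    {"altitude", "elevation"},   # G1: candidate synonyms
    {"east from", "east to"},    # G2: related words, not synonyms
]
```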

3.2. Method Modeling: Optimal Granularity Attribute Clustering Based on Labeled Target Detection Algorithm

In this paper, we use Baidu Encyclopedia data as the initial corpus, train the word vectors based on the CBOW (continuous bag-of-words) model [37] of word2vec, use all attribute names as a custom dictionary and use the jieba word segmentation tool for segmentation. Introducing a custom dictionary into word segmentation identifies the attribute names more accurately, which changes the word context of each text in the corpus. This, in turn, changes the position of words in the high-dimensional vector space, which has a large impact on the similarity relations between geographic attribute names. It also adds a large number of words related to geographic attribute names, which orients the constructed word vectors more towards the geographic subject area, laying the foundation for the subsequent identification of geographic attribute name synonyms.
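A minimal sketch of this training setup is shown below; it assumes the jieba and gensim libraries, and the corpus and dictionary file paths are placeholders. The vector dimension of 200 and min_count of 1 anticipate the parameter choices of Section 4.2.1.

```python
import jieba
from gensim.models import Word2Vec

# Register every attribute name as a user-dictionary entry so that jieba
# segments each attribute name as a single token (the path is a placeholder).
jieba.load_userdict("attribute_names.txt")

# Segment the encyclopedia corpus into lists of tokens, one list per line.
with open("baike_corpus.txt", encoding="utf-8") as f:
    sentences = [list(jieba.cut(line.strip())) for line in f if line.strip()]

# CBOW training (sg=0); vector_size=200 and min_count=1 follow Section 4.2.1.
model = Word2Vec(sentences, vector_size=200, window=5, min_count=1, sg=0, workers=4)
model.save("geo_word2vec.model")
```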
The essence of attribute normalization is to find the word pairs with the same meaning in the attribute names and merge them to ensure that the synonymous attributes of entities in each category have the same attribute names. If we can divide all the attribute names into different classes, the synonym search in each class can not only improve the efficiency of the query but also impose a similarity constraint to exclude word pairs with high similarity but different meanings, thereby improving the accuracy.
To this end, we designed the following classification system for geographic attribute data. It is shown in Figure 3.
In the clustering process, we can achieve attribute name clustering by setting a reasonable modularity using a community discovery algorithm, thereby reducing the number of data operations. However, a reasonable choice of the threshold value is the key to a reasonable clustering effect. Adjusting the parameters to obtain different clustered datasets $V_1, V_2, V_3, \ldots, V_n$, the following three cases can occur.
Case 1: As shown in Figure 4a, if the class $V_2$ is too small, $p_r$ and $p_s$, which should belong to the same class, will be divided into different classes.
Case 2: As shown in Figure 4b, this is the standard classification result.
Case 3: As shown in Figure 4c, if the class $V_1$ is too large, $p_i$ and $p_j$, which should not belong to the same class, will be clustered into one class.
We have to choose a standard clustering. For the selection of the threshold value, most existing work adopts an empirical threshold. However, the clustering threshold should differ across data sources, and the accuracy obtained from empirical thresholds is not high because such parameters adapt poorly. Therefore, we propose an optimal granularity attribute clustering algorithm based on a labeled target detection algorithm with manual intervention.
The idea of the algorithm is roughly as follows. We manually select the target sets: let $S_p$ be the positive sample set, $S_n$ the negative sample set, $n$ the number of positive sample pairs and $m$ the number of negative sample pairs. The positive target set is denoted $S_p = \{(p_1, p_2), (p_3, p_4), (p_5, p_6), \ldots, (p_{n-1}, p_n)\}$; the negative target set is denoted $S_n = \{(p_1, p_2), (p_3, p_4), (p_5, p_6), \ldots, (p_{m-1}, p_m)\}$. The parameters are adjusted to obtain the different clustered datasets $V_1, V_2, V_3, \ldots, V_n$, and we search for the attribute pairs of $S_p$ and $S_n$ in each clustered dataset. The evaluation index $N$ starts at 0: whenever a pair from $S_p$ is found within one cluster, $N$ increases by 1; whenever a pair from $S_n$ is found within one cluster, $N$ decreases by 1. The accuracy of the clustering is then given by the formula $\frac{2N}{n+m}$.
The algorithm formula is summarized as:
$$N = \begin{cases} N + 1 & \text{if a pair of } S_p \text{ falls within one cluster of } V_i \\ N - 1 & \text{if a pair of } S_n \text{ falls within one cluster of } V_i \end{cases} \quad (3)$$
$$\mathrm{Accuracy} = \frac{2N}{n+m} \quad (4)$$
The parameter with the highest accuracy is the parameter that achieves the optimal granularity.
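The following is a minimal sketch of this evaluation, under the assumption that each candidate clustering is represented as a list of sets of attribute names; the function and variable names are ours, not from the original implementation.

```python
def granularity_accuracy(clusters, positive_pairs, negative_pairs):
    """Score one clustered dataset V_i against the marked target pairs,
    following Formulas (3) and (4)."""
    def same_cluster(p, q):
        return any(p in c and q in c for c in clusters)

    N = 0
    for p, q in positive_pairs:      # pairs from S_p raise N when co-clustered
        if same_cluster(p, q):
            N += 1
    for p, q in negative_pairs:      # pairs from S_n lower N when co-clustered
        if same_cluster(p, q):
            N -= 1
    return 2 * N / (len(positive_pairs) + len(negative_pairs))

# The clustering parameter whose dataset scores highest gives the optimal granularity:
# best_V = max(candidate_clusterings, key=lambda V: granularity_accuracy(V, S_p, S_n))
```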

3.3. Method Modeling: Accurate Identification of Synonymous Attributes Based on Co-Occurrence Analysis and Rule Reasoning

We use two methods to accomplish the exact identification of synonymous attributes: co-occurrence analysis (Section 3.3.1) and rule inference (Section 3.3.2).

3.3.1. Outcome Scoring Strategy Based on Co-Occurrence Analysis

Word pairs with high similarity computed from the word vectors are related words, and the word pairs obtained after clustering are likewise related words, but not necessarily synonyms (e.g., "length", "width" and "thickness"). Therefore, we need to filter them again. "Co-occurrence" refers to the phenomenon in which the feature items describing an entity appear together; here, the feature items are the names of the attributes contained in an entity. "Co-occurrence analysis" is the study of this phenomenon, which reveals the content associations of information and the knowledge implied by the feature items.
In this paper, we analyze the co-occurring attribute names to find, within each class, the erroneous pairs that are highly similar yet have different meanings. The idea is that if two or more attribute names are used to describe the same entity at the same time, these attribute names express different meanings and are not synonymous. That is, we check whether an attribute pair appears simultaneously in the attribute set of any entity. The specific model is shown in Formula (5).
$$\begin{cases} p_i = p_j, & \{p_i, p_j\} \nsubseteq E \;\; (p_i, p_j \in G) \\ p_i \neq p_j, & \{p_i, p_j\} \subseteq E \;\; (p_i, p_j \in G) \end{cases} \quad (5)$$
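A minimal sketch of this test, assuming each entity's attribute names are stored in a Python set:

```python
def co_occur(p_i, p_j, entity_attribute_sets):
    """True if both attribute names describe some entity at the same time
    (Formula (5)); such pairs are related words rather than synonyms."""
    return any(p_i in E and p_j in E for E in entity_attribute_sets)
```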
As the amount of entity data used in the co-occurrence query and the frequency of the attribute names will have a large impact on the co-occurrence result, a score is given to the result, and the score indicates the likelihood that it is a synonym. The higher the score of an attribute pair, the more likely it is to be a synonym, which provides a better reference for the attribute normalization.
In this paper, the amount of entity data is fixed, so the frequency of occurrence of the attribute names is the most important influence. If both words of a pair have a low attribute-name frequency, their non-co-occurrence may simply be due to that low frequency. When both words have a high frequency but still do not co-occur, they are unlikely to be used together to describe an entity; that is, they have the same meaning and are synonyms. Therefore, the following scoring criteria are given. The difference between the word frequency of each word in the current pair and the average word frequency is calculated first; the final score is the sum of the scores of the two words, assigned according to the value of this difference. As a criterion for the number of occurrences of an attribute, the average word frequency is defined as the ratio of the total number of attribute occurrences to the number of distinct attributes, i.e., the mean number of occurrences per distinct attribute, as shown in Formulas (6)–(8):
$$sub_i = f_i - \frac{\sum_{j=1}^{n} f_j}{n} \quad (6)$$
$$score_i = \begin{cases} 0.5 & sub_i > 20 \\ 0.45 & 0 < sub_i \le 20 \\ 0.4 & sub_i \le 0 \end{cases}, \quad i = 1, 2 \quad (7)$$
$$score = score_1 + score_2 \quad (8)$$
where $f_i$ indicates the frequency of each attribute name in the pair, $f_j$ ranges over the frequencies of all attribute names, $n$ indicates the total number of distinct attributes, $sub_i$ indicates the difference between each attribute name's frequency and the average frequency and $score_i$ indicates the score of each attribute name in the pair. We treat the attribute names of each entity as a set, and all the attribute names in such a set are mutually non-synonymous. After clustering is completed, the words in each class are queried pairwise against each of the above sets to see whether they co-occur; if they do, the two words are not in a synonymous relation. It is therefore possible to effectively exclude words that are highly correlated but not synonymous, using the characteristics of the data structure itself. A concrete example is shown in Figure 5, where the left side shows the sets obtained after clustering and the right side shows the attribute-name sets of the entities. On the right, the words marked in red appear together in the same clustering set, so they are near-synonyms rather than synonyms.
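Below is a minimal sketch of the scoring in Formulas (6)–(8), assuming the attribute-name frequencies are collected in a Counter; the thresholds 20, 0.5, 0.45 and 0.4 are the ones given above.

```python
from collections import Counter

def pair_score(p_i, p_j, freq: Counter) -> float:
    """Score a non-co-occurring attribute pair by word frequency,
    following Formulas (6)-(8); higher scores suggest synonymy."""
    avg = sum(freq.values()) / len(freq)    # mean occurrences per distinct attribute

    def score_one(word):
        sub = freq[word] - avg              # Formula (6)
        if sub > 20:                        # Formula (7)
            return 0.5
        return 0.45 if sub > 0 else 0.4

    return score_one(p_i) + score_one(p_j)  # Formula (8)
```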

3.3.2. Result Optimization Based on Rule Reasoning

As the attributes contained in a single entity are not comprehensive, the co-occurrence method alone can exclude some related words and improve the accuracy of attribute normalization, but it cannot exclude all of them. Therefore, it is necessary to form a rule set from the patterns of commonly misclassified attributes and filter the results through this rule set.
We analyzed the characteristics of the attributes that are easily misclassified in the results of co-occurrence analysis and identified the following four causes:
For attribute names of the language class, the correlation between names is high, but a single entity rarely contains more than one of them, for example, "Arabic" and "Portuguese".
Likewise, for attribute names of the orientation class, the correlation is high, but a single entity rarely contains more than one of them, for example, "east–west long" and "north–south long".
For word pairs containing the same word, some attribute names share a common word and belong to the same superordinate word, so their semantic similarity is also high, such as "resident population" and "household population".
Some words are ambiguous; for example, "main peak" can describe either the name of the main peak or the height of the main peak.
To solve the above problems, we define three rules to build a rule set (shown in Table 1). Language and orientation attributes are classified as near-synonyms. For pairs sharing a common part, the common part is removed and the similarity of the remaining parts is calculated; if that similarity is also high, the pair is judged synonymous. The output of each rule is tested to ensure that it is true and valid.
After the above series of processes, we obtain the synonym groups. Finally, we must choose one attribute name from each group as the standard attribute name of the class; we use frequency of occurrence as the selection indicator, taking the most frequent attribute as the standard name. A sketch of the rule set follows.
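The following is a hedged sketch of the rules in Table 1; the language and orientation word lists, the string handling and the type lookup are all illustrative assumptions, not the authors' implementation.

```python
# Illustrative word lists and type lookup; the authoritative rules are in Table 1.
LANGUAGE_ATTRS = {"Arabic", "Portuguese"}
ORIENTATION_TOKENS = ("east", "west", "north", "south")

def remainder_similarity(p_i, p_j, sim):
    """Rule 2 helper: drop the words the two names share, then compare the rest."""
    rest_i = " ".join(w for w in p_i.split() if w not in p_j.split())
    rest_j = " ".join(w for w in p_j.split() if w not in p_i.split())
    return sim(rest_i, rest_j) if rest_i and rest_j else 0.0

def classify(p_i, p_j, sim, attr_type):
    # Rule 1: language or orientation attributes are near-synonyms.
    if {p_i, p_j} <= LANGUAGE_ATTRS:
        return "near-synonym"
    if any(t in p_i for t in ORIENTATION_TOKENS) and any(t in p_j for t in ORIENTATION_TOKENS):
        return "near-synonym"
    # Rule 3: attributes of different value types cannot be synonyms.
    if attr_type(p_i) != attr_type(p_j):
        return "near-synonym"
    # Rule 2: names sharing a part whose remainders are still similar are synonyms.
    if remainder_similarity(p_i, p_j, sim) > 0.75:
        return "synonym"
    return "undetermined"
```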

4. Experiment and Discussion

For the above principles and methods, we design the experiments and discuss the experimental results. Firstly, Section 4.1 introduces the dataset; Section 4.2 introduces the experimental conditions; and Section 4.3 presents the experimental results and analyzes them.

4.1. Introduction to DataSets

This study is based on data from Baidu Encyclopedia. The classified data of "mountain, water, forest, field, lake, and grass" are used as the original dataset to be normalized, comprising 63,386 triples on mountains, 45,630 triples on water, 3693 triples on forests, 6150 triples on fields, 14,853 triples on lakes and 3971 triples on grass.
For the word vector training dataset, we have two candidates: the Encyclopedia triples dataset and the Encyclopedia "Introduction" triples dataset. The Encyclopedia triples dataset is a large knowledge graph composed of metadata, integrating a large number of web resources; it currently contains 25,455,709 triples and more than 2 million entities. The "Introduction" dataset is a collection of triples constructed from all triple resources through information queries and other techniques; it contains 28,196 triples and 27,686 entities. In order to compare the training effect of a small amount of data with good contextual relevance against a large amount of data with poorer relevance, the model was validated using both the full triples dataset and the dataset restricted to the attribute name "Introduction". The experiments show that a large quantity of triples improves the accuracy of word vector training compared with a small amount of highly context-relevant data. The specific comparison is in Table 2.

4.2. Experimental Condition

4.2.1. Experimental Parameters

(1) Word2vec parameter setting
In this study, we mainly use the CBOW model in word2vec to train the word vectors, which has many parameters to configure. First, there is the vector dimension: we test 100-dimensional and 200-dimensional data. The obtained word vectors occupy 1,364,380,928 bytes and 2,728,761,728 bytes, respectively. Using the obtained word vectors, we calculate the similarity on the "mountain" dataset and manually check the pairs with similarity greater than 0.8. The error rate is 17.1% for the 100-dimensional data and 4.6% for the 200-dimensional data. Although the 200-dimensional word vectors take longer to generate and the data are larger, they outperform the 100-dimensional data, so we use 200 as the word vector dimension. Moreover, words whose frequency is less than the Min Count value are discarded; we set Min Count to 1 to include all attribute names.
(2) Similarity threshold parameter setting
In the similarity calculation, word pairs below a certain value have weak relevance and, thus, no research value, so a threshold needs to be set: word pairs above the threshold are identified as related words. We need a value such that the proportion of near-synonymous word pairs above it is large and the proportion below it is small. Therefore, we vary the threshold value θ over 0.7, 0.75, 0.8, 0.85 and 0.9 and, by manual reading, calculate the proportion of near-synonymous pairs above and below each value. The trends of the precision (P) and recall (R) over threshold values from 0.7 to 0.9 are shown in Figure 6.
When the threshold value is 0.75, P increases compared with 0.7, while R decreases only slightly (Figure 6). We therefore adopt θ = 0.75 as the optimal threshold, the value that keeps both precision (P) and recall (R) large.

4.2.2. Experimental Evaluation Index

The effectiveness of synonym identification is measured by the Precision, Recall and F1 values. TPs (True Positives) indicate that the synonym is correctly predicted, while FPs (False Positives) indicate that the near-synonym is misjudged as a synonym. FNs (False Negatives) indicate that the synonym is judged to be a near-synonym. TNs (True Negatives) indicate that the near-synonym is correctly predicted. Then P, R and F1 are given by:
$$P = \frac{TP}{TP + FP} \quad (9)$$

$$R = \frac{TP}{TP + FN} \quad (10)$$

$$F1 = \frac{2 \times P \times R}{P + R} \times 100\% \quad (11)$$
The larger the F1, the better the balance between precision and recall and the better the result. A minimal helper for these formulas is sketched below.
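This is a direct transcription of Formulas (9)–(11); the function name is ours.

```python
def evaluation_indices(tp: int, fp: int, fn: int):
    """Precision, recall and F1 from the counts defined above (Formulas (9)-(11))."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)
```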

4.3. Experimental Results and Analysis

4.3.1. Clustering Granularity Experiments

To address the problem of clustering granularity, we propose an optimal granularity attribute clustering algorithm based on a labeled target detection algorithm. For the experiments on the "mountain, water, forest, field, lake, grass" data, 100 sample pairs are selected, of which 50 are positive samples and 50 are negative samples. The number of target samples and the accuracy obtained under each parameter are shown in Table 3. In order to show that the algorithm is effective for discriminating clustering granularity, the error rate of each class under each parameter is also obtained by manual reading; the results are shown in Table 3.
From Table 3, for each category of "mountain, water, forest, field, lake, grass", the parameters with the highest accuracy under the labeled target detection algorithm correspond to the parameters with the lowest error rate obtained by manual interpretation. For the geographic data of "mountain, water, forest, field, lake, grass", each category has its own unique attribute names. These unique attribute names are difficult to separate during clustering because they appear in relatively concentrated textual contexts, so their clustering effect is poor. Therefore, when selecting the target samples, more representative data should be chosen, which is conducive to selecting the optimal clustering granularity. The parameter selections of the two methods are consistent, so the algorithm can be effectively used to discriminate clustering granularity. In addition, the method is more universal than the traditional empirical threshold method, as it selects the most appropriate clustering granularity for each dataset.
At this granularity, we manually count the classification results of the geographic attributes and obtain the results in Table 4.
Overall, spatial relations have higher precision but lower recall than data attributes. This is because spatial relations mostly describe two or more entities, while data attributes mostly describe a particular entity itself. Thus, spatial relations are less constrained than data attributes and are harder to distinguish from other categories. For example, "surrounding attractions" is usually classified into the "attractions" class and is not recognized as a topological relation. This is also the difference between spatial relations and common attributes.
For spatial relations, the precision and recall of topological relations are lower than those of the other two types. As above, this is because topological relations impose the weakest constraints among spatial relations; in addition, there are more types of topological relations, with diverse and complex semantic descriptions.

4.3.2. Experimental Results of Exact Identification of Synonymous Attributes

To verify the effectiveness and superiority of the proposed method, we compared the results obtained by using semantic similarity alone, the results after co-occurrence filtering and the results after rule-based modification. We conduct experiments on each category of "mountain, water, forest, field, lake and grass". First, we use the trained word vectors to obtain the similarity between word pairs and filter out the pairs with similarity greater than 0.7. Then, we manually identify the wrong word pairs and the missing word pairs and finally obtain their P, R and F1 values. The results are shown in Table 5, the average indices of the three methods are shown in Table 6 and the resulting graph is shown in Figure 7.
The similarity results in Table 5 and Table 6 are those obtained by the similarity method alone with a threshold value of 0.75, which yields a low precision. To improve the accuracy, the co-occurrence method is proposed in this paper to exclude word pairs with high similarity but different meanings. The precision and recall of synonymy and related-relation recognition for "mountain, water, forest, field, lake and grass" after applying the co-occurrence method are shown in Table 5: the precision is significantly higher and the recall slightly lower. The results are then analyzed and modified to further improve the accuracy of synonymy and correlation recognition. The analysis reveals that precision is limited because the attribute types contained in a single entity are not comprehensive enough, and the semantic similarity of related words is high, so many near-synonymous words remain among the candidate synonyms and cannot be merged into one. The recall is high, and the main remaining source of error is the ambiguity of attribute names. The results indicate that the geographic data of "mountain, water, forest, field, lake, grass" are distinctive.
The location attributes have a strong correlation and similarity alone cannot distinguish them well, so the names of these attributes are modified by rules. The precision and recall of the rule-modified results are shown in Table 5: the precision is significantly higher, so the method is effective for attribute normalization.
We have plotted the results of the three methods in a graph and can observe that their overall trends are similar: precision gradually increases and recall decreases slightly, while F1 clearly improves. Thus, the method in this paper is able to improve the accuracy of attribute alignment and generalizes well.

5. Conclusions

In recent years, knowledge graphs have received increasing attention, with related work spanning geospatial data sharing and inference analysis. In this paper, starting from the attribute characteristics of geographic data and Baidu Encyclopedia, we proposed an encyclopedia attribute normalization process that improves on previous semantics-based attribute normalization methods. First, the semantic similarity of attribute names is used as the basis for clustering attribute names to obtain related words, with an optimal granularity algorithm for attribute clustering introduced in this step. Second, based on the optimal clustering results, co-occurrence analysis exploits the structure of the knowledge graph to exclude near-synonyms from the related words. Then, rules are constructed from the characteristics of geographic data and the results are further optimized by rule-based modification. Finally, the experimental dataset of "mountain, water, forest, field, lake and grass" is used to analyze the results. The experiments indicate that the method effectively improves the precision and recall of synonymous attribute identification and achieves the normalization of encyclopedic geographic attributes. The process thus improves the normality of attribute names and has important implications for knowledge graph application areas such as intelligent search. The method identifies not only synonymous attributes but also near-synonymous attributes, so it can connect statements that do not use exactly the same attribute names. In this paper, the normalization study covered all geographic attribute data of the encyclopedia, without fine-grained categorization of spatial characteristics. Fine-grained alignment of geospatial attributes will be the direction of future research.

Author Contributions

Conceptualization, Chuan Yin; methodology, Chuan Yin and Binyu Zhang; writing—original draft preparation, Chuan Yin; writing—review and editing Binyu Zhang; supervision, Mingyi Du; project administration, Mingyi Du and Wanzeng Liu; English editing, Nana Luo; investigation, Xi Zhai; data curation, Tu Ba. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation (NSFC) of China (Key Project #41930650); Beijing Key Laboratory of Urban Spatial Information Engineering (No. 2020202); and Scientific Research Projects of Beijing Municipal Education Commission—General Projects of Science and Technology Program (Surface Projects) (KM202110016003).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank the website http://www.openkg.cn (accessed on 14 December 2018) for providing the free data.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, D. From Geomatics to Geospatial Intelligent Service Science. Acta Geod. Cartogr. Sin. 2017, 46, 1207–1212. [Google Scholar] [CrossRef]
  2. Rowley, J. The Wisdom Hierarchy: Representations of the DIKW Hierarchy. J. Inf. Sci. 2007, 33, 163–180. [Google Scholar] [CrossRef] [Green Version]
  3. Golledge, R.G. The Nature of Geographic Thought. Ann. Assoc. Am. Geogr. 2002, 92, 1–14. [Google Scholar] [CrossRef]
  4. Stoltman, J.; Lidstone, J.; Kidman, G. The 2016 International Charter on Geographical Education. Int. Res. Geogr. Environ. Educ. 2017, 26, 1–2. [Google Scholar] [CrossRef] [Green Version]
  5. Dong, X.; Gabrilovich, E.; Heitz, G.; Horn, W.; Lao, N.; Murphy, K.; Strohmann, T.; Sun, S.; Zhang, W. Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; pp. 601–610. [Google Scholar] [CrossRef]
  6. Ji, S.; Pan, S.; Cambria, E.; Marttinen, P.; Philip, S.Y. A Survey on Knowledge Graphs: Representation, Acquisition and Applications. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 494–514. [Google Scholar] [CrossRef]
  7. Zhang, N.; Deng, S.; Chen, H.; Chen, X.; Chen, J.; Li, X.; Zhang, Y. Structured Knowledge Base as Prior Knowledge to Improve Urban Data Analysis. ISPRS Int. J. Geo-Inf. 2018, 7, 264. [Google Scholar] [CrossRef] [Green Version]
  8. Zhang, Y.; Zhu, J.; Zhu, Q.; Xie, Y.; Li, W.; Fu, L.; Zhang, J.; Tan, J. The Construction of Personalized Virtual Landslide Disaster Environments Based on Knowledge Graphs and Deep Neural Networks. Int. J. Digit. Earth 2020, 13, 1637–1655. [Google Scholar] [CrossRef]
  9. Sun, K.; Hu, Y.; Song, J.; Zhu, Y. Aligning Geographic Entities from Historical Maps for Building Knowledge Graphs. Int. J. Geogr. Inf. Sci. 2021, 35, 2078–2107. [Google Scholar] [CrossRef]
  10. Shen, Y.; Chen, Z.; Cheng, G.; Qu, Y. CKGG: A Chinese Knowledge Graph for High-School Geography Education and Beyond. In Proceedings of the International Semantic Web Conference, Virtual Event, 24–28 October 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 429–445. [Google Scholar]
  11. Auer, S.; Lehmann, J.; Hellmann, S. LinkedGeoData: Adding a Spatial Dimension to the Web of Data. In Proceedings of the 8th International Semantic Web Conference (ISWC ‘09), the Westfields Conference Center, Washington, DC, USA, 25–29 October 2009; Springer: Berlin/Heidelberg, Germany, 2009; pp. 731–746. [Google Scholar] [CrossRef] [Green Version]
  12. Maltese, V.; Farazi, F. A Semantic Schema for GeoNames; Università Di Trento: Trento, Italy, 2013. [Google Scholar]
  13. Ballatore, A.; Wilson, D.C.; Bertolotto, M. A survey of volunteered open geo-knowledge bases in the semantic web. In Quality Issues in the Management of Web Information; Springer: Berlin/Heidelberg, Germany, 2013; pp. 93–120. [Google Scholar]
  14. Suchanek, F.M.; Kasneci, G.; Weikum, G. Yago: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, Banff, AB, Canada, 8–12 May 2007. [Google Scholar] [CrossRef] [Green Version]
  15. Deng, S. CrowdGeoKG: Crowdsourced Geo-Knowledge Graph. In Proceedings of the China Conference on Knowledge Graph and Semantic Computing, Chengdu, China, 26–29 August 2017. [Google Scholar] [CrossRef]
  16. Speer, R.; Havasi, C. ConceptNet 5: A Large Semantic Network for Relational Knowledge. In The People’s Web Meets NLP; Springer: Berlin/Heidelberg, Germany, 2013; pp. 161–176. [Google Scholar] [CrossRef]
  17. Chen, J.; Liu, W.; Wu, H. Basic Issues and Research Agenda of Geospatial Knowledge Service. Geomat. Inf. Sci. Wuhan Univ. 2019, 44, 38–47. [Google Scholar]
  18. Du, C.; Si, W.; Xu, J. Querying and Reasoning of Spatial Relations Based on Geographic Semantics. J. Geo-Inf. Sci. 2010, 12, 48–55. [Google Scholar] [CrossRef]
  19. Yang, C.; Wu, H.; Huang, Q.; Li, Z.; Jing, L. Using spatial principles to optimize distributed computing for enabling the physical science discoveries. Proc. Natl. Acad. Sci. USA 2011, 108, 5498–5503. [Google Scholar] [CrossRef] [Green Version]
  20. Chen, X.; Jia, S.; Xiang, Y. A review: Knowledge reasoning over knowledge graph. Expert Syst. Appl. 2020, 141, 112948. [Google Scholar] [CrossRef]
  21. Haihong, E.; Cheng, R.; Song, M.; Zhu, P.; Wang, Z. A Joint Embedding Method of Relations and Attributes for Entity Alignment. Int. J. Mach. Learn. Comput. 2020, 10, 605–611. [Google Scholar]
  22. Gunaratna, K.; Thirunarayan, K.; Jain, P.; Sheth, A.; Wijeratne, S. A Statistical and Schema Independent Approach to Identify Equivalent Properties on Linked Data. In Proceedings of the 9th International Conference on Semantic Systems, Graz, Austria, 4–6 September 2013; pp. 33–40. [Google Scholar] [CrossRef] [Green Version]
  23. Zhang, Z.; Gentile, A.L.; Blomqvist, E.; Augenstein, I.; Ciravegna, F. An Unsupervised Data-Driven Method to Discover Equivalent Relations in Large Linked Datasets. Semant. Web 2017, 8, 197–223. [Google Scholar] [CrossRef] [Green Version]
  24. Bauer, F.; Kaltenböck, M. Linked Open Data: The Essentials; Mono/Monochrom: Vienna, Austria, 2011; Volume 710. [Google Scholar]
  25. Ristad, E.S.; Yianilos, P.N. Learning string-edit distance. IEEE Trans. Pattern Anal. Mach.-Intell. 1998, 20, 522–532. [Google Scholar] [CrossRef] [Green Version]
  26. Tsuruoka, Y.; Mcnaught, J.; Tsujii, J.; Ananiadou, S. Learning String Similarity Measures for Gene/Protein Name Dictionary Look-up Using Logistic Regression. Bioinformatics 2007, 23, 2768–2774. [Google Scholar] [CrossRef] [Green Version]
  27. Liu, Y.; Chen, S.-H.; Chen, J.-G.G. Property Alignment of Linked Data Based on Similarity between Functions. Int. J. Database Theory Appl. 2015, 8, 191–206. [Google Scholar] [CrossRef]
  28. Huang, T.; Zhang, W.; Liang, X.; Fu, K. Data-driven method for fine-grained property alignment between Chinese open datasets. J. Southeast Univ. (Nat. Sci. Ed.) 2017, 47, 660–666. [Google Scholar] [CrossRef]
  29. Šmíd, J.; Neruda, R. Comparing Datasets by Attribute Alignment. In Proceedings of the 2014 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), Orlando, FL, USA, 9–12 December 2014; pp. 56–62. [Google Scholar] [CrossRef]
  30. Hinton, G.E. Learning distributed representations of concepts. In Proceedings of the Eighth Conference of the Cognitive Science Society, Amherst, MA, USA, 15–17 August 1986. [Google Scholar]
  31. Newman, M.E.; Girvan, M. Finding and evaluating community structure in networks. Phys. Rev. E 2004, 69, 026113. [Google Scholar] [CrossRef] [Green Version]
  32. Blondel, V.D.; Guillaume, J.; Lambiotte, R.; Lefebvre, E. Fast Unfolding of Communities in Large Networks. J. Stat. Mech. Theory Exp. 2008, 2008, P10008. [Google Scholar] [CrossRef] [Green Version]
  33. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
  34. Chen, Z. An Approach to Measuring Semantic Relatedness of Geographic Terminologies Using a Thesaurus and Lexical Database Sources. ISPRS Int. J. Geo-Inf. 2018, 7, 98. [Google Scholar] [CrossRef] [Green Version]
  35. Zhang, S.; Hu, Y.; Bian, G. Research on String Similarity Algorithm Based on Levenshtein Distance. In Proceedings of the 2017 IEEE 2nd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chongqing, China, 25–26 March 2017; IEEE: New York, NY, USA, 2017; pp. 2247–2251. [Google Scholar]
  36. Ren, X.; Han, J. Automatic Synonym Discovery with Knowledge Bases. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017. [Google Scholar] [CrossRef] [Green Version]
  37. Le, Q.; Mikolov, T. Distributed Representations of Sentences and Documents. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; Volume 32, pp. 1188–1196. [Google Scholar]
Figure 1. Flow chart of attribute normalization technique.
Figure 2. Property alignment schematic.
Figure 3. Classification system for geographic attribute data.
Figure 4. Clustering granularity diagram. Each color represents a class and each circle represents a property node. (a) Too-fine-grained classification, members that should be in A are divided into other classes; (b) standard classification, suitable granularity of classification; (c) too-coarse classification, members that should not be in A are divided into A.
Figure 5. Example of co-occurrence analysis: attribute clusters obtained by clustering (left) and entity attribute-name sets (right).
Figure 6. 0.7–0.9 similarity threshold graph.
Figure 7. Comparison chart of the three methods and average results.
Table 1. Rule set.

Rule 1: if $p_i, p_j \in$ {language attributes} or {east, west, north, south}, then $p_i \neq p_j$ (near-synonyms).
Rule 2: if $p_i$ and $p_j$ share a common part and the similarity of the remaining parts $sim(p_i', p_j') > 0.75$, then $p_i = p_j$, normalized to $p$ (synonyms).
Rule 3: if $type(p_i) \neq type(p_j)$, then $p_i \neq p_j$.
Table 2. Comparison of word vector training corpora.

| Data Source | Number of Triples | Features | Vector Dimension | Results |
| Encyclopedia data, all triples | 25,455,709 | Larger corpus | 100 | 2233 pairs with similarity greater than 0.8; high accuracy rate |
| Encyclopedia data, "Introduction" triples | 28,196 | Strong contextual relevance | 100 | 253,275 pairs with similarity greater than 0.8; poor effect |
Table 3. Comparison of clustering accuracy based on the labeled target detection algorithm.

| Category | Metric | Clustering parameter 1.0 | 0.8 | 0.6 | 0.4 | 0.2 | Selected parameter |
| Mountain | Precision | 0.862 | 0.871 | 0.902 | 0.916 * | 0.895 | 0.4 |
| Mountain | Error rate | 0.060 | 0.056 | 0.041 | 0.036 * | 0.047 | 0.4 |
| Water | Precision | 0.912 | 0.912 | 0.942 | 0.965 * | 0.942 | 0.4 |
| Water | Error rate | 0.056 | 0.056 | 0.043 | 0.026 * | 0.043 | 0.4 |
| Forest | Precision | 0.833 | 0.833 | 0.962 * | 0.926 | 0.882 | 0.6 |
| Forest | Error rate | 0.126 | 0.126 | 0.025 * | 0.063 | 0.088 | 0.6 |
| Field | Precision | 0.897 | 0.897 | 0.966 * | 0.951 | 0.930 | 0.6 |
| Field | Error rate | 0.081 | 0.081 | 0.011 * | 0.023 | 0.069 | 0.6 |
| Lake | Precision | 0.908 | 0.949 * | 0.937 | 0.937 | 0.888 | 0.8 |
| Lake | Error rate | 0.066 | 0.042 * | 0.054 | 0.054 | 0.084 | 0.8 |
| Grass | Precision | 0.934 | 0.934 | 0.934 | 0.962 * | 0.934 | 0.4 |
| Grass | Error rate | 0.054 | 0.054 | 0.054 | 0.027 * | 0.054 | 0.4 |

Note: The precision values indicate the clustering accuracy of the labeled target detection algorithm, and the error rate refers to the clustering error rate from manual discrimination. * denotes the highest precision and the lowest error rate across all parameters.
Table 4. Geographic attributes classification results.

| Attribute | | P | R |
| Spatial Relations | Cardinal Direction Relation | 100% | 96.7% |
| | Topological Relation | 92.4% | 89.4% |
| | Distance Relation | 100% | 95.2% |
| Data Attributes | Metrology | 97.4% | 99.6% |
| | Coordinate | 95.7% | 100% |
| | Time | 100% | 100% |
Table 5. Comparison of the results of the three methods.

| Category | Method | P | R | F1 |
| Mountain | Similarity results | 34.0% | 98.3% | 50.5% |
| Mountain | Co-occurrence analysis results | 49.1% | 98.2% | 65.4% |
| Mountain | Rule-based modification results | 91.1% | 96.3% | 93.6% |
| Water | Similarity results | 29.0% | 98.2% | 44.8% |
| Water | Co-occurrence analysis results | 57.4% | 97.3% | 72.2% |
| Water | Rule-based modification results | 88.5% | 96.7% | 92.4% |
| Forest | Similarity results | 31.0% | 99.3% | 47.2% |
| Forest | Co-occurrence analysis results | 61.2% | 99.2% | 75.6% |
| Forest | Rule-based modification results | 92.7% | 98.1% | 95.3% |
| Field | Similarity results | 32.0% | 98.5% | 48.3% |
| Field | Co-occurrence analysis results | 63.0% | 98.3% | 76.7% |
| Field | Rule-based modification results | 93.3% | 94.9% | 94.1% |
| Lake | Similarity results | 28.0% | 98.7% | 43.6% |
| Lake | Co-occurrence analysis results | 59.4% | 98.6% | 74.1% |
| Lake | Rule-based modification results | 90.5% | 95.7% | 93.0% |
| Grass | Similarity results | 33.0% | 99.2% | 49.5% |
| Grass | Co-occurrence analysis results | 62.8% | 99.0% | 76.8% |
| Grass | Rule-based modification results | 90.1% | 93.9% | 92.0% |
Table 6. Comparison of the average indicators of the three methods.

| Method | P | R | F1 |
| Similarity results | 31.2% | 98.7% | 47.3% |
| Co-occurrence analysis results | 58.8% | 98.4% | 73.5% |
| Rule-based modification results | 91.0% | 95.9% | 93.4% |