Article

Using Machine Learning Language Models to Generate Innovation Knowledge Graphs for Patent Mining

Department of Industrial Engineering and Engineering Management, National Tsing Hua University, Hsinchu 300, Taiwan
* Author to whom correspondence should be addressed.
Submission received: 13 September 2022 / Revised: 24 September 2022 / Accepted: 25 September 2022 / Published: 29 September 2022
(This article belongs to the Special Issue Innovations in Intelligent Machinery and Industry 4.0)

Abstract

To explore and understand the state-of-the-art innovations in any given domain, researchers often need to study many domain patents and synthesize their knowledge content. This study provides a smart patent knowledge graph generation system, adopting a machine learning (ML) natural language modeling approach, to help researchers grasp patent knowledge by generating deep knowledge graphs. This research focuses on converting chemical utility patents, consisting of chemistries and chemical processes, into summarized knowledge graphs. The research methods consist of two parts, i.e., the visualization of the chemical processes described in the patents' most relevant paragraphs and the generation of a knowledge graph for any domain-specific collection of patent texts. The ML language modeling algorithms, including ALBERT for text vectorization, Sentence-BERT for sentence classification, and KeyBERT for keyword extraction, are adopted. These models are trained and tested in a case study using 879 chemical patents in the carbon capture domain. The results demonstrate that the average retention rate of the summary graphs for the five clustered patent sub-sets exceeds 80%. The proposed approach is novel and proven to be reliable for graphical deep knowledge representation.

1. Introduction

Scholars of technology and IP management have observed that the essential knowledge of utility patents consists of their novel and advanced processes, applications, and material formulations [1,2]. Therefore, a system that can provide clear knowledge graphs is critical for researchers to grasp the state-of-the-art innovations in any given technical domain effectively and efficiently. This research proposes an ML language modeling approach to generate deep knowledge graphs for chemical utility patents. We recognize that, for any patent describing chemical-related process flows, it is laborious to read the whole document and sort out the processes [3]. Thus, picking out the relevant paragraphs can shorten the time needed for patent reviews. The purpose of this research is to visualize an entire patent as a knowledge graph. First, the ALBERT model is trained to identify the paragraphs relevant to chemical experiments, and node-arc network graphs representing the materials and reaction processes are then generated. Further, the research also refines the traditional automatic summarization method by generating a summary graph from the automatically summarized text. In this study, the Sentence-BERT [4] model is trained and tested for vectorizing the patent sentences for summarization. The essential sentence vectors are selected using the LexRank algorithm [5]. Finally, the keywords are extracted from the essential sentences (using KeyBERT) as the data nodes in the summary graph [6]. The relationships between the collections of keywords form the links between the keyword nodes to produce the entire summary graph. Using the case study of a carbon capture patent set (with 879 patents) in five sub-clusters, the summary graphs provide researchers with effective, clear, and reliable overviews of the domain-specific patent knowledge.
This paper is organized as follows. Section 2 presents a literature review of the related background and methodologies, e.g., chemical-related text mining, patent-document mining, and summarization methodologies, especially those using ML-based natural language models, in Section 2.1, Section 2.2, Section 2.3, Section 2.4 and Section 2.5. Section 3 depicts the entire process of the proposed methodology so that readers can reapply it to generate knowledge graphs for their own patent domains. Section 4 describes the case study of generating the summary graphs for the carbon capture patent set, with 879 patents in five sub-clusters. The average retention rates are used to measure the effectiveness and reliability of the proposed system. The concluding remarks, our contributions, and future research directions are presented in Section 5.

2. Literature Review

This section of the literature review is divided and presented in several sub-sections. In Section 2.1, we explore the techniques related to chemical text mining. Section 2.2 discusses the key term extraction methods. Text vectorization is discussed in Section 2.3. In Section 2.4 and Section 2.5, the literature related to the methodologies of generating summary graphs automatically is discussed. Section 2.4 discusses how to rank the importance of texts (sentences) to obtain automatic summaries. Finally, in Section 2.5, how to graphically represent textual content is described.

2.1. Chemical Text Mining

The purpose of chemical text mining is to analyze tokenized chemical names and texts so that models can be trained to recognize chemistry-specific phrases [7,8]. Text mining in chemistry typically uses machine learning methods to extract raw information from the text [9]. Related research has already developed chemical taggers for chemical text exploration [10], which analyze the text while classifying the chemical information it contains [11] and can be used to define the relationships with actions, such as heating and stirring. This paper reviews the literature on the widely used ChemDataExtractor tool [12] and adopts it as the chemical text mining tool of this system [13] to analyze English texts in the scientific literature.
First, ChemDataExtractor uses chemical word tagging, chemical substance entity recognition combining conditional random fields and dictionaries, regular phrase analysis [14], and word clustering algorithms to improve the efficiency of unsupervised machine learning. Next, information is extracted from semi-structured datasets, and the potential interdependence of the data between the information extracted from different parts of the text is addressed by using file-level post-processing algorithms [12]. Procedurally, ChemDataExtractor uses tokenization, word clustering, lexical tagging, and chemical substance entity identification to extract the correct chemical information from the text. In the tokenization process, paragraphs of the text are split into sentences, and then sentences are split into individual words and punctuation marks, by using the appropriate natural language processing markers [15]. Word clustering uses the Brown word clustering algorithm to generate word-grouping patterns based on the context in which the words occur, and to improve the efficiency of the lexical tagging and named entity recognition [12]. Lexical tagging is used to mark the lexical nature of a word, based on the function of the word in the text. Chemical substance entity identification [16] uses a chemical noun identifier, based on Conditional Random Fields (CRFs), combined with a tag-based data dictionary identifier to improve the efficiency of the chemical named entity identification by removing the complex and trivial words [17]. Through these methods, the chemical information from documents can be extracted automatically and the time required to build large chemical databases for data analysis can be reduced.
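As an illustration of the pipeline described above, the following minimal sketch shows how ChemDataExtractor's Python API can perform sentence splitting, tokenization, POS tagging, and chemical named entity recognition; the attribute names reflect ChemDataExtractor 1.x and the example sentence is hypothetical, not taken from the case-study patents.

```python
from chemdataextractor.doc import Paragraph

para = Paragraph(
    "4 g of pentaethylenehexamine (PEHA) was dissolved in methanol and the "
    "mixture was stirred at 40 °C for 2 h."
)

# Sentence splitting and tokenization
for sentence in para.sentences:
    print(sentence.tokens)

# CRF + dictionary based chemical named entity recognition
print([span.text for span in para.cems])   # e.g., ['pentaethylenehexamine', 'PEHA', 'methanol']

# Part-of-speech tags used in the lexical tagging step
print(para.pos_tagged_tokens)
```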

2.2. Keyword Extraction

Keyword extraction aims to identify the important words in articles. There are three main approaches to keyword extraction, i.e., the unsupervised, supervised, and semi-supervised methods [18]. Unsupervised keyword extraction can be divided into simple statistical approaches, linguistic approaches, and machine learning approaches [19]. The simple statistical approach is based on the calculation of word frequencies and linguistic data. By calculating the frequency of words and the order of their occurrence in the text, the simple statistical methods select words with relatively high statistics as keywords, regardless of the language and the content of the text domain; common methods include N-gram, TF-IDF, and co-word analysis [20]. The linguistic methods obtain the relevant linguistic features from words, sentences, and documents, and identify the important keywords by combining syntactic and semantic analysis in the lexicography [19]. However, both the simple statistical and linguistic methods have drawbacks. When analyzing domain-specific literature, the simple statistical method tends to overlook important keywords that occur with low frequency in the text, while the linguistic method requires a great deal of manpower and time. For these reasons, the use of machine learning methods for keyword extraction is gradually becoming a research trend.
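As a brief illustration of the statistical approach mentioned above, the following sketch ranks TF-IDF-scored terms with scikit-learn; the two short example texts are placeholders, not patent data from this study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The amine absorbent captures carbon dioxide from the flue gas stream.",
    "A porous polymer membrane separates carbon dioxide from nitrogen.",
]
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(docs)

# Terms with the highest TF-IDF scores in the first document act as its keywords
terms = vectorizer.get_feature_names_out()
scores = tfidf[0].toarray().ravel()
top = scores.argsort()[::-1][:5]
print([(terms[i], round(scores[i], 3)) for i in top])
```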
Machine learning methods can be used to train models by marking keywords in the training set in a supervised or semi-supervised manner. Because of the need for tagging, machine learning methods are often used for domain-specific keyword extraction, including Naïve Bayes, and Support Vector Machine [18]. In the case of unsupervised machine learning methods for keyword extraction, the text is transformed into vectors [21] and then clustered. The word vector with the greatest weight in each cluster can be obtained as the word representing the cluster, according to the relevant clustering algorithm [22].
For patent databases to be helpful in decision making, it is important to provide accurate information, and the information must be presented in a comprehensible format. This is accomplished through keyword extraction, pattern recognition, and pattern analysis [23].

2.3. Text Vectorization

Since text is not a numerical quantity, it is necessary to quantify the text content to perform statistical calculations. Generally, a word count of the text is performed first, and the result is placed into a high-dimensional vector space. In this study, we focus on the vectorization of words and sentences; the technique used for word vectorization is called word embedding, while the technique used for sentence vectorization is called sentence embedding.

2.3.1. Word Embedding

Word embedding is a natural language processing technique where the words in the text are mapped into space and converted into real number vectors. The advantage of word embedding is that it can reduce the dimensionality of word vectors and increase the computational efficiency of the computer [24]. After converting words into vectors, the semantic information can also be processed simultaneously [25], and commonly used methods include Word2Vec [26], GloVe [27], and FastText [15] among others.
Word2Vec is a technique for representing words as vectors using a corpus of text as the input. The word vectors generated by Word2Vec can be used for natural language processing and machine learning applications by observing semantic similarity as proximity between words in the vector space [28]. GloVe is an unsupervised algorithm for computing word vectors, which is trained by building a co-occurrence matrix from the statistics of the words in a large text corpus, so that the resulting vectors exhibit linear relationships in the vector space. GloVe does not rely on the local statistics of a single word but combines the statistics of the whole text [29]. These word vectors assign higher probabilities to words that appear in the context of a sentence, lower probabilities to words that do not belong to its context, and random vectors to words not seen during model training [30]. FastText is a method for learning word features. Its main purpose is to consider the internal structure of words, rather than learning only their surface features. The learning process of FastText can be considered as operating on a three-layer neural network with two layers of weights. The two outer layers have one neuron for each word, while the middle layer contains as many neurons as there are dimensions in the embedding space. This approach can also learn the sub-parts of words to ensure that similar words have similar vector representations, even when they occur in different contexts, which enhances the processing of textual narratives with different tones [31].
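The following minimal sketch contrasts Word2Vec and FastText using the gensim library; the toy corpus and parameter values are illustrative only and are not the training data of this study.

```python
from gensim.models import Word2Vec, FastText

sentences = [
    ["carbon", "dioxide", "is", "absorbed", "by", "the", "amine", "solvent"],
    ["the", "sorbent", "captures", "carbon", "dioxide", "from", "flue", "gas"],
]

# Skip-gram Word2Vec: words sharing contexts end up with nearby vectors
w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)
print(w2v.wv.most_similar("dioxide", topn=3))

# FastText additionally learns character n-grams, so rare or unseen chemical
# terms still receive a vector built from their sub-word parts
ft = FastText(sentences, vector_size=50, window=3, min_count=1, epochs=50)
print(ft.wv["monoethanolamine"].shape)   # vector exists even for an unseen word
```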

2.3.2. Sentence Embedding

Sentence embedding represents the semantics of a sentence as a single numerical vector, which is more useful for understanding the meaning of a sentence than word embedding. Two representative types of sentence embedding [32] are smooth inverse frequency (SIF) and unsupervised smoothed inverse frequency (uSIF). The main logic of smooth inverse frequency [33] is to calculate a weighted average of the word vectors in a sentence and then remove the projection onto the first principal component of the matrix formed by the vectors of all sentences, so as to represent the whole sentence. In contrast, the unsupervised smoothed inverse frequency approach [34] constructs a model by setting the generation rate of a word inversely proportional to the angular distance between the word and the sentence embedding. Unlike SIF, uSIF uses the angular distance in the vector space because the length of a word vector affects the resulting sentence embedding.
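A minimal sketch of the SIF computation described above is given below, assuming pretrained word vectors and unigram word probabilities are supplied as plain Python dictionaries (both inputs are hypothetical, not artifacts of this study).

```python
import numpy as np

def sif_embeddings(sentences, word_vectors, word_prob, a=1e-3):
    """Smooth inverse frequency: weight each word vector by a/(a + p(w)),
    average within a sentence, then remove the projection onto the first
    principal component of the sentence-embedding matrix."""
    embs = []
    for sent in sentences:
        weights = np.array([a / (a + word_prob.get(w, 0.0)) for w in sent])
        vectors = np.array([word_vectors[w] for w in sent])
        embs.append(weights @ vectors / len(sent))
    embs = np.array(embs)
    u, _, _ = np.linalg.svd(embs.T @ embs)     # leading singular vector = first PC
    pc = u[:, 0]
    return embs - embs @ np.outer(pc, pc)

# Toy usage with 3-dimensional hypothetical word vectors and unigram probabilities
wv = {"carbon": np.array([1.0, 0.0, 0.2]), "dioxide": np.array([0.9, 0.1, 0.1]),
      "capture": np.array([0.1, 1.0, 0.3]), "membrane": np.array([0.0, 0.2, 1.0])}
p = {"carbon": 0.01, "dioxide": 0.01, "capture": 0.005, "membrane": 0.002}
print(sif_embeddings([["carbon", "dioxide"], ["capture", "membrane"]], wv, p))
```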

2.4. Automatic Summarization

The purpose of automatic summarization is to produce a summary that represents the whole text. These summaries use fewer words than the original text to concisely convey the message of the entire text [35]. Automatic summarization can be divided into extractive summarization and abstractive summarization. Extractive summarization combines important sentences from a text into a summary, while abstractive summarization restructures the important content into a summary through semantic analysis. Extractive summarization methods can be divided into the lexical chain, clustering, and graph-based methods [36].
A lexical chain is a sequence of semantically related words in a text, where each word is linked to other related words to construct the chain [37]. The lexical chain method involves three steps: extracting sentences from the text, identifying the lexical chains, and extracting the sentences with the strongest lexical chains, which are used to construct the final summary. During the process, the link strength of each lexical chain is calculated, with the length of the chain and the homogeneity index being the two parameters used. The length of a chain is the number of words it contains, and the strength of a chain is its length multiplied by the homogeneity index [36]. The clustering approach groups similar sentences into the same cluster, while sentences belonging to different clusters are less similar. The more sentences a cluster contains, the more information it carries and the more important it is. A summary is then formed by selecting the important sentences in each cluster, based on the location feature [38]. The graph-based approach is founded on graph theory and is used to show the structure of the text and the relevance of the sentences to each other. Each sentence is regarded as a node, and the relationship between sentences is described by line segments [39]. The graph-based method uses the cosine similarity between sentences to compute a weighted feature vector and obtain the central node of the graph, i.e., the core sentence, which usually has a high similarity to most sentences and is extracted as a sentence of the summary [5].

2.5. Network Graph Visualization

An ontology describes knowledge with formal terms that can be used to support human understanding of a domain and its structural elements, such as instances, attributes, and relations [40]. A graph is a collection of data nodes and line segments that represent the relationships between the data nodes. By definition [41], data nodes are called nodes (also known as vertices), and the set formed by all the nodes is defined as V(G). An edge, on the other hand, is a line segment between two data nodes; a pair of nodes represents an edge, and the set formed by all edges is defined as E(G). Depending on whether the edges are directional or not, there are two types of graphs: the directed graph, which represents the sequential relationship between the data nodes, and the undirected graph, which represents the adjacency between the data nodes. The adjacency matrix is a two-dimensional matrix that represents the link weights between the data nodes; for a graph with m nodes, it is an m × m matrix whose element aij represents the relationship weight between node i and node j. An adjacency list is an alternative way to represent the composition of an undirected graph, where all the data nodes are listed in a one-dimensional array, and a linked list is used to represent the order of the data nodes linked to each element of the array.
To increase the readability of graphs, a force-directed graph layout algorithm is used to produce visually easier-to-read symmetric graphs and improve the visualization of the graphs [42]. The spring layout is one of the most common methods and has three main properties. Firstly, the algorithm assumes that the edges of the drawing are elastic ideal springs with constant length and elasticity coefficients. Secondly, the nodes repel each other. Thirdly, the edges of the graph are not of arbitrary length [43]. An organic layout is a variant of the force-directed graph. This algorithm treats nodes as electrons with mutual repulsion to control the layout of the nodes. Considering that there may be multiple disconnected node clusters, the algorithm minimizes the sum of the repulsive forces between the nodes to increase the visibility of each node cluster and to avoid edge restrictions that affect the distribution of the node clusters outside the main cluster [44]. Diagrams of these layout types are shown in Figure 1.
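The concepts above (nodes, edges, adjacency matrix, adjacency list, and the spring layout) can be illustrated with the NetworkX library; the small keyword graph below is only an example, not an output of the study.

```python
import networkx as nx

# Undirected graph: nodes V(G) and edges E(G)
G = nx.Graph()
G.add_edges_from([("amine", "absorption"), ("absorption", "CO2"),
                  ("CO2", "membrane"), ("membrane", "separation")])

# Adjacency matrix (weights between node i and node j) and adjacency list
A = nx.to_numpy_array(G)              # m x m matrix
adj_list = nx.to_dict_of_lists(G)     # node -> list of linked nodes

# Force-directed (spring) layout: edges act as springs, nodes repel each other
pos = nx.spring_layout(G, seed=42)
print(adj_list)
print({n: p.round(2) for n, p in pos.items()})
```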

3. Methodology

The main system architecture of this study can be divided into two parts. The first part generates chemical (experimental) process graphs from the paragraphs automatically identified using a supervised learning approach (described in detail in our previous research) [8]. The second part is the graphical representation of the patent summary. The system first obtains the HTML file of the patent from Google Patents, extracts the descriptive text of the patent from it, and pre-processes the extracted text. After that, the trained paragraph classification model (i.e., an ALBERT model) identifies the experimental paragraphs. The ALBERT model reduces the dimensionality of the embedding by matrix decomposition to achieve parameter reduction and to enhance the efficiency of the model training [45]. The system then splits the sentences and words, treats each sentence as a step, and performs NER (named-entity recognition) [46] on the sentences to obtain the classified nodes. Using ChemDataExtractor (http://chemdataextractor.org/) and part-of-speech (POS) tagging [47], the nodes of the graph are classified into three categories, i.e., substance, unit, and action. Next, the edges are defined by rules: (1) a sentence containing both a substance and an action is treated as one step; (2) units are linked to each substance in written order; (3) substances are linked to the first action in the same step; (4) actions are linked in written order; and (5) the last action in a step is linked to the first substance in the next step [8]. Thus, Cytoscape (https://cytoscape.org/) can generate the process graph for further chemical process editing, if needed. The second part, the summary graph, splits the sentences of the patent text after the patent document is processed and embeds the sentences using the Sentence-BERT model to obtain the sentence vectors. Afterward, the LexRank algorithm picks out the important sentences. Then, KeyBERT is used to select the important keywords from these important sentences as the graph nodes. The links between the nodes are obtained using OpenIE to build a co-occurrence matrix that defines the relations (edges) between the nodes. Finally, Cytoscape is also used to generate the summary graph. The core algorithmic sequences of the chemical (experimental) process graph generation (from the chemically relevant paragraphs) and the patent summary graph generation (from the collected text of the patents) are shown as the system architecture's two parallel modules (in Figure 2).

3.1. Preprocess of Patent Text and Paragraphs

In this study, we mainly used Google Patents to obtain the patent text. First, the patent text was obtained from the Google Patents web pages as an HTML file. We analyzed the description section of the patent text, instead of the abstract section, to analyze the patent content in more detail. Specifically, we extracted all the class attributes of the description section from the HTML file and removed the HTML syntax to obtain the description section of the patent text. Then, the text is broken into smaller units by splitting the sentences and words, and the words are lemmatized so that words with the same meaning can be counted as the same element by the computer. Stop words that are less relevant to the text are then removed by the system.
In this study, ChemDataExtractor was used for the sentence and word splitting [12]. After saving each paragraph of the text into an array, the specific desired paragraph was extracted for sentence and word splitting. Sentence splitting detects the punctuation marks that end the sentences within a paragraph and then retrieves the sentences one by one. For word splitting, the words are taken out as separate units according to the spaces and punctuation marks in the split sentences. In the past, such systems would treat non-text punctuation marks, such as commas, as separate tokens of the sentence being split. However, when dividing a sentence containing a chemical noun, IUPAC nomenclature is often incorrectly split due to its characteristics. Therefore, ChemDataExtractor uses chemical substance entity identification to avoid this situation and allow the chemical nouns to be correctly recognized.
To prevent the computer from treating the same word as different words in the statistical calculation due to different spellings or factors such as tense and singular/plural forms, we need to restore the word forms by lemmatization during the preprocessing. The first step is to convert the first letter of a word at the beginning of a sentence to lowercase, because it is capitalized in English; the second step is to obtain the main part of the word and remove its suffixes. The above is the traditional lemmatization method of text mining. This study makes some adjustments for words that belong to chemical nouns. If a word contains no capital letter, the same method as in the previous paragraph is used; if it contains capital letters, the word may not be restored. The reason is that most elements in the periodic table are written as one uppercase letter (such as H for hydrogen) or an uppercase letter followed by a lowercase letter (such as Ag for silver). Therefore, during pre-processing, if a word contains only one uppercase letter, the system checks whether it matches an element in the periodic table; if the word does not match the periodic table, or if it contains two or more capital letters, the system presumes the word is a chemical material term (e.g., NaCl) or a company name (e.g., BASF). In either case, such words are not subject to lemmatization.
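The rule above can be sketched as follows, using NLTK's WordNet lemmatizer as one illustrative choice; the reduced element-symbol set and the helper function are hypothetical, not the authors' actual implementation.

```python
import re
from nltk.stem import WordNetLemmatizer   # requires the NLTK wordnet corpus

# Excerpt only; a full set of element symbols would be used in practice
ELEMENT_SYMBOLS = {"H", "C", "N", "O", "Na", "Cl", "Ag", "Fe", "Cu", "Zn"}

lemmatizer = WordNetLemmatizer()

def chemical_aware_lemmatize(token):
    """Sketch of the rule above: after sentence-initial lowercasing, any token
    that still contains capital letters is treated as an element symbol (H, Ag),
    a chemical material term (NaCl, PEHA), or a company name (BASF) and is left
    untouched; all other tokens are lemmatized normally."""
    if re.search(r"[A-Z]", token):
        kind = "element" if token in ELEMENT_SYMBOLS else "chemical/company term"
        return token, kind
    return lemmatizer.lemmatize(token), "ordinary word"

for t in ["Ag", "NaCl", "BASF", "mixtures", "solvents"]:
    print(t, "->", chemical_aware_lemmatize(t))
```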
The last step is stop word removal, the main purpose of which is to remove words that are less relevant to the text and reduce their impact on the statistical results. In English, stop words usually refer to words that occur frequently but carry little meaning, such as "in", "on", "a", "the", etc. The high frequency of stop words often leads to their misclassification as important words in the keyword calculation.

3.2. Sentence-BERT for Sentence Embedding

Sentence-BERT [4] generates fixed-size sentence embeddings using a Siamese (twin) network structure. Two different sentences are fed into two BERT encoders whose parameters are shared, and the outputs of the two encoders are pooled separately to obtain two sentence vectors of the same size.
To emphasize the difference of content under similar syntax, 1001 pairs of similar sentences and 1001 pairs of non-similar sentences, 2002 pairs in total, were used in this study for the Sentence-BERT sentence embedding training. A total of 1009 sentences were manually selected from polymer patents and divided into eight categories: 121 sentences for the first category, invention background; 152 sentences for the second category, experimental procedures; 109 sentences for the third category, scientific principles; 119 sentences for the fourth category, invention applications; 166 sentences for the fifth category, material properties; 118 sentences for the sixth category, testing; 82 sentences for the seventh category, selected materials; and 142 sentences for the eighth category, patent citations. Similar pairs are consecutive sentences belonging to the same category, and non-similar pairs are two sentences selected from different categories. To balance the dataset, after selecting the non-similar sentence pairs, the same number of pairs as the similar sentence pairs was randomly selected as the final input training data.
Unlike the traditional method of averaging word embeddings to obtain a sentence vector, Sentence-BERT reduces the influence of sentence length on the embedding and produces more accurate sentence embeddings. Moreover, the sentence vectors produced by Sentence-BERT are better suited to cosine similarity calculations.
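A minimal sketch of Sentence-BERT embedding and cosine similarity scoring with the sentence-transformers library follows; the public pretrained checkpoint and the example sentences are placeholders, whereas the study fine-tunes its own model on the 2002 labeled sentence pairs.

```python
from sentence_transformers import SentenceTransformer, util

# Any pretrained Sentence-BERT checkpoint can be used here as a starting point
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The absorbent liquid is regenerated in the stripping tower at 120 °C.",
    "CO2 is released from the rich amine solution by heating in the regenerator.",
    "The invention relates to a porous polymer network for gas adsorption.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the fixed-size sentence vectors
print(util.cos_sim(embeddings[0], embeddings[1]))  # higher: similar experimental steps
print(util.cos_sim(embeddings[0], embeddings[2]))  # lower: different topic
```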

3.3. Graphical Patent Summarization

After obtaining the sentence vectors, the summary graph is generated. Section 3.3.1 describes how the important sentences are extracted; Section 3.3.2 uses KeyBERT to select the important keywords from these sentences as the graph nodes; the associations between the nodes are obtained by OpenIE in Section 3.3.3; and Section 3.3.4 describes the output of the graphical summary.

3.3.1. Text Summarization

After obtaining the vector of sentences, the quantified sentences can be compared for similarity to obtain the importance of the sentences, and the important sentences can be extracted from them. This study uses the LexRank algorithm [48] to calculate the importance of the sentences, which is a graph-based summarization method that uses the concept of a community network to obtain the importance of the sentences and the relationship between the sentences in an article, with the more important sentences having a higher centrality in the graph. The LexRank algorithm determines the importance of a single sentence, based on the number of other sentences linked to it. The greater the number of sentences linked to a single sentence, the greater the weight of the sentence. The LexRank algorithm is expressed in Equation (1).
$$p(u) = \frac{d}{N} + (1-d)\sum_{v \in \mathrm{adj}[u]} \frac{\text{Cosine similarity}(u,v)}{\sum_{z \in \mathrm{adj}[v]} \text{Cosine similarity}(z,v)}\, p(v) \tag{1}$$
where N represents the total number of sentences, d is the damping factor, adj[x] represents the set of nodes adjacent to node x, and p(x) is the centrality of node x.
The LexRank algorithm uses the inverse document frequency (idf)-modified cosine value to calculate the sentence-to-sentence similarity. However, this approach is still susceptible to the effect of sentence length on the sentence embedding, which affects the final similarity calculation. Thus, Sentence-BERT is used to obtain the sentence vectors, and these vectors are used in the cosine similarity calculation to increase the representativeness of the embedded vectors of the sentences, which are then applied in the subsequent LexRank calculation, i.e., Equation (2).
$$\text{Cosine similarity} = \frac{A \cdot B}{\lVert A \rVert\,\lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} (A_i)^2} \times \sqrt{\sum_{i=1}^{n} (B_i)^2}} \tag{2}$$
where Ai and Bi denote the i-th components of vectors A and B, respectively.
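The following sketch combines Equations (1) and (2): it builds the cosine similarity matrix from sentence vectors (for example, those produced by Sentence-BERT) and iterates the LexRank centrality; the threshold and damping values are illustrative defaults, not the study's exact settings.

```python
import numpy as np

def lexrank_scores(sentence_vectors, damping=0.15, threshold=0.1, iters=100):
    """Minimal sketch: cosine similarities between sentence vectors define the
    graph (Equation (2)) and power iteration yields the centrality p(u) of each
    sentence (Equation (1))."""
    V = np.asarray(sentence_vectors, dtype=float)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    sim = V @ V.T                                     # cosine similarity matrix
    sim[sim < threshold] = 0.0                        # keep only meaningful links
    col_sums = sim.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0] = 1.0
    M = sim / col_sums                                # sim(u, v) / sum_z sim(z, v)
    n = len(V)
    p = np.full(n, 1.0 / n)
    for _ in range(iters):                            # Equation (1)
        p = damping / n + (1 - damping) * M @ p
    return p

# Sentences with the highest p(u) would be kept as the 30% extractive summary:
# scores = lexrank_scores(model.encode(sentences))
# summary_idx = np.argsort(scores)[::-1][: max(1, int(0.3 * len(sentences)))]
```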

3.3.2. Define Nodes

After obtaining the summary text, the text content is then graphically visualized. The first step is to obtain the nodes of the graph, i.e., to find the keywords that represent the whole text. KeyBERT [6] is used as the keyword extraction method, which uses BERT embeddings to construct the keywords most relevant to the text. KeyBERT uses a self-supervised keyword extraction method to extract the contextual features from the text with a BERT model. First, the text is fed to the BERT model and the document is vectorized. Then, the candidate words in the document are embedded with the same BERT model. Third, each word embedding is compared with the document vector in terms of cosine similarity to find the word vectors closest to the text vector as the keywords.
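A minimal KeyBERT sketch is shown below; the backbone checkpoint name and the sample summary sentence are illustrative assumptions rather than the study's fine-tuned model or data.

```python
from keybert import KeyBERT

# KeyBERT with a Sentence-BERT backbone (a fine-tuned model could be substituted)
kw_model = KeyBERT(model="all-MiniLM-L6-v2")

summary_text = (
    "The carbon dioxide absorption material is produced by impregnating a "
    "vinyl chloride resin with pentaethylenehexamine and drying under nitrogen."
)

# Each keyword comes with its cosine similarity to the document vector,
# which later scales the node size in the summary graph
keywords = kw_model.extract_keywords(summary_text, keyphrase_ngram_range=(1, 2),
                                     stop_words="english", top_n=5)
print(keywords)   # e.g., [('carbon dioxide', 0.61), ...]
```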
After obtaining the keywords, a table is constructed to store the keywords and their cosine similarity values, which indicate the importance of the keywords. The value of the cosine similarity falls between 1 and −1; the closer to 1, the more similar (i.e., more important) the two vectors are, while values closer to −1 mean the two are less relevant. By reading this table with the network visualization software, the nodes of the network graph can be given different sizes according to their cosine similarity weights, the corresponding words are mapped to the nodes, and the word size is adjusted according to the cosine similarity. After constructing the keyword list, open information extraction (described in the next section) can be used to obtain the correlations between the nodes, and the element information retrieved from the open information extraction can be mapped to the keyword list and presented as the nodes in the graph.

3.3.3. Define Edges

Open IE (Open Information Extraction) is a method for obtaining inter-word relationships by unsupervised extraction of structured relational triples from text [49]. It is one of the modules of Stanford CoreNLP [50]. First, Open IE splits each input sentence into initial relational triples (subject, relation, and object). After obtaining multiple initial relational triples, it selects the best set of triples. The selection process starts by matching the subject and object of each initial triple with a named entity marked by a known grammar parser [51], which is used as the noun (and thus the most important key term in the relation). Next, the most relevant relations are retained: unimportant relationships are filtered out, and words that do not affect the decision of the relationship, or that cause disagreement between two triples of the same meaning, are removed.
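One possible way to obtain such triples programmatically is through the Stanford CoreNLP OpenIE annotator via the stanza client, sketched below; this assumes a local CoreNLP installation (pointed to by CORENLP_HOME) and is not necessarily the exact interface used in this study.

```python
from stanza.server import CoreNLPClient

text = "The sorbent absorbs carbon dioxide from the flue gas."

# (subject, relation, object) triples later define the edges of the summary graph
with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "lemma",
                               "depparse", "natlog", "openie"],
                   be_quiet=True) as client:
    ann = client.annotate(text)
    for sentence in ann.sentence:
        for triple in sentence.openieTriple:
            print(triple.subject, "|", triple.relation, "|", triple.object)
```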

3.3.4. Graph Generation

In this research, Cytoscape (https://cytoscape.org/, accessed on 1 June 2022) is adopted to enable entities to be presented graphically as the nodes and interactions as the edges between the nodes. Its original application was for the biological domain. The data are integrated with the network using attributes, which map the nodes or edges to specific data values [52]. This study uses Cytoscape as the visualization tool to graphically map-out a chemical process and patent knowledge summary, which we have extracted from patent document text using the novel methodology.
A co-occurrence matrix is a symmetric matrix that represents the number of occurrences of any two words in the same relation tuple [27]. The co-occurrence matrix can be used to obtain the relationships between words and the weights between words, and these weights indicate the importance of the links between the nodes, i.e., the higher the weight, the thicker the link between the nodes [53]. To construct the co-occurrence matrix, the words in the subjects and objects of all the relation tuples are collected and listed as the rows and columns of the matrix. If two different words are present in the same relation tuple, the corresponding cell values in the co-occurrence matrix are incremented by one. The co-occurrence matrix construction and graph generation are shown in Figure 3.
After constructing the co-occurrence matrix, the system extracts the corresponding row values from the co-occurrence matrix as the weight of the node connection relationship. Next, it defines the relationship between these two sets of words in the relation tuples as the edges. Finally, a table that can be used for the input of Cytoscape is generated to present the edges between the nodes and the importance of the edges. The edge construction schema is illustrated in Figure 4.
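A sketch of the co-occurrence matrix construction and of the edge table exported for Cytoscape is shown below; the keyword list and relation triples are small hypothetical examples rather than outputs of the case study.

```python
import itertools
import pandas as pd

keywords = ["carbon dioxide", "absorption", "amine", "regeneration"]   # e.g., from KeyBERT
triples = [("amine", "absorbs", "carbon dioxide"),
           ("absorption", "requires", "regeneration"),
           ("amine", "used in", "absorption")]                          # e.g., from OpenIE

# Symmetric co-occurrence matrix: +1 whenever two keywords appear in the
# subject/object of the same relation triple
cooc = pd.DataFrame(0, index=keywords, columns=keywords)
for subj, _, obj in triples:
    present = [w for w in (subj, obj) if w in keywords]
    for a, b in itertools.permutations(present, 2):
        cooc.loc[a, b] += 1

# Edge table for Cytoscape: source, target, and the weight drawn as line thickness
edges = [(a, b, int(cooc.loc[a, b]))
         for a, b in itertools.combinations(keywords, 2) if cooc.loc[a, b] > 0]
pd.DataFrame(edges, columns=["source", "target", "weight"]).to_csv(
    "summary_edges.csv", index=False)
```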

4. Case Study: Graph Summaries of Carbon Capture Patents

This study uses Carbon Capture and Storage (CCS) as the technical domain for the case study. CCS is an important research and development field, and more than 120 countries have committed to achieving virtually zero carbon dioxide (greenhouse gas) emissions by 2050 [54]. CCS is one of the most promising strategies for reducing greenhouse gas emissions in the environment. It enables the continued use of fossil fuels while controlling CO2 emissions, without compromising the security of the electricity supply. Carbon capture technology starts by capturing the carbon dioxide produced during electricity generation at the power plant. There are three main techniques for capturing carbon from fossil fuel power plants, namely post-combustion capture, pre-combustion fuel decarbonization, and the combustion of fuel in pure oxygen. The post-combustion and pre-combustion processes include carbon capture by absorption and adsorption of the CO2, carbon dioxide capture by separation using CO2-permeable membranes, and gasification and combustion of the fuel syngas/hydrogen. After capturing the CO2, the gas is compressed, transported, and injected. The pressurized supercritical carbon dioxide is transported from the capture point through a network of pipelines. The captured carbon dioxide can then be handled in two ways. The first is to reuse the recovered carbon dioxide, for example, by adding it to a solution to adjust the pH value of the solution. The second is geological storage. There are three main types of underground storage sites: deep brine, depleted oil/gas reservoirs, and deep unrecoverable coal seams [55]. Based on the carbon capture literature, the overall knowledge ontology schema of CCS technologies and key terms is shown in Figure 5.
There are several promising new materials for capturing CO2 from the post-combustion, pre-combustion, and oxyfuel combustion processes. In recent years, many studies have addressed the energy-intensive regeneration steps and the chemical degradation of solvents used as absorbents in flue gas separation. Current trends in the development of carbon capture/absorbent materials include physical/chemical absorbents, membranes, adsorption on solids using pressure or temperature swings, distillation, natural gas hydrate formation, chemical looping combustion using metal oxides, etc. [56].
The synthesis procedure of the material can be quickly presented graphically with appropriate network visualization tools to assist the user in reading and correcting it. The nodes in the procedure graph are shown in three colors representing material, operation, and unit. The entire flowchart is presented as a directed graph, with arrows pointing to the next step or interconnected nodes, according to the sequence of steps. To present the step sequence more clearly, the linking lines and arrows are presented in shades of color, with lighter colors indicating earlier steps and darker colors indicating later steps. This flowchart provides another easy way to present chemical processes. Compared with the P&ID (Piping and Instrumentation Diagram), which requires a relevant background and long training to read, this flowchart is faster and easier to understand for people without the background knowledge.
The summary graph allows researchers to quickly understand the content of the patent. The nodes of the graph show the keywords in the text, and the more important keywords appear as larger nodes. The relationship between the nodes is shown by the thickness of the connecting lines: the stronger the association, the thicker the line. With an appropriate graph layout, the clustering of the nodes in the graph can be observed, which helps researchers grasp the text content.

4.1. Case I US10722838B2

The patent US10722838B2 [57] is related to carbon dioxide absorption materials and a carbon dioxide separation and recovery system. This patent describes the production of solid carbonic acid gas absorption materials. The computer-generated diagram, subsequently organized by experts, is shown in Figure 6. The left side of the figure relates to the addition of the materials pentaethylenehexamine (PEHA) and vinyl chloride resin, and the right side shows nitrogen being passed into the mixture to form the carbon dioxide absorption materials.
As can be seen from Figure 7, the patent can be divided into three main parts, and the left side is the main invention background of the system. The whole system requires a lot of energy to capture carbon dioxide, and water affects the efficiency of recovering carbon dioxide from the absorbent liquid, resulting in the need for more energy. The upper right is the CO2 separation and recovery system with the regeneration tower; this unit contains the liquid absorbent. Below is a description of the hydrocarbon recovery system used primarily in a coal thermal power generation plant to effectively reduce the boiler combustion exhaust gas from the power plant.

4.2. Case II US9663627B2

The patent US9663627B2 [58] is related to carbon dioxide adsorbent resin materials. Figure 8 shows the fabrication of {[(C2)3-C6H3]2[(CH3)3Si-C6H2-CHO]3}n. From the left side of the diagram, we can see that the material is first foamed with DMF and argon gas, and then 2,6-dibromo-4-trimethylsilylbenzaldehyde, 1,3,5-triethynylbenzene, bis(triphenylphosphine) palladium (II) dichloride, and CuI are added. On the right side, the compound is cleaned, purified, and then dried by high-temperature vacuum to obtain the subject material.
As can be seen from Figure 9, this patent mainly illustrates the development of a porous network material for carbon dioxide gas. The lower half of the figure shows that US9663627B2 [58] takes advantage of the property that a polymer network or porous metal-organic network adsorbs more carbon dioxide after amine functionalization. The main method, shown in the upper "method" cluster, is the amine functionalization of the aldehyde groups in the porous network structure to enhance its gas affinity.

4.3. Validation

This study uses k-means clustering [38,59] to cluster the 879 retrieved patents for subsequent investigation and validation of the individual clusters. The k-means method, an algorithm widely used in signal processing, clusters the points in space so that similar instances are grouped into the same cluster. After discussion with domain experts, the 879 patents were divided into five clusters, namely, adsorbent (258 patents), process design (141 patents), process control (40 patents), chemical properties (337 patents), and others (103 patents). Ten patents were then randomly selected from each of these five clusters, and the resulting fifty patents were used as the testing sample set for validation.
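For illustration, the clustering step can be sketched with scikit-learn as follows; the five placeholder texts merely stand in for the description sections of the 879 retrieved patents, and the vectorizer settings are illustrative rather than the study's exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Placeholder corpus; in the study the 879 patent description sections are used
patent_texts = [
    "amine absorbent for carbon dioxide capture",
    "membrane separation process design for flue gas",
    "control system for regeneration tower temperature",
    "porous polymer with high carbon dioxide affinity",
    "transport and storage of compressed carbon dioxide",
]

X = TfidfVectorizer(stop_words="english").fit_transform(patent_texts)
km = KMeans(n_clusters=5, random_state=42, n_init=10).fit(X)
print(km.labels_)     # cluster assignment of each patent
```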
The validation method calculates the retention rate for each cluster. First, the top 50 keywords ranked by normalized term frequency–inverse document frequency (NTF-IDF) [60] were extracted from the patent texts in each cluster as the keyword list. Then, for each patent, the amount of information in the original text was determined by counting the keywords of the keyword list that appear in the full text and multiplying the counts by the NTF-IDF values of the keywords. The amounts of information contained in the summary text and in the summary graph were determined in the same way. The retention rate is obtained by dividing the amount of information in the summary graph (or summary text) by the amount of information in the original text. The larger the retention rate, the more of the information in the text is retained in the summary graph, calculated as follows.
$$\text{retention rate} = \frac{\text{Amount of information in graph}}{\text{Amount of information in text}}$$
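A minimal sketch of this retention-rate computation is given below; the keyword weights and token lists are hypothetical, and keyword occurrence counts are multiplied by their NTF-IDF weights as described above.

```python
from collections import Counter

# Hypothetical NTF-IDF weights of a cluster's top keywords and two token lists
keyword_ntfidf = {"absorbent": 0.9, "amine": 0.7, "regeneration": 0.5, "membrane": 0.4}
original_tokens = ["absorbent", "amine", "amine", "regeneration", "membrane", "boiler"]
graph_tokens = ["absorbent", "amine", "regeneration"]

def information_amount(tokens, weights):
    """Number of matched keywords multiplied by their NTF-IDF values, summed."""
    counts = Counter(tokens)
    return sum(counts[kw] * w for kw, w in weights.items())

rate = information_amount(graph_tokens, keyword_ntfidf) / \
       information_amount(original_tokens, keyword_ntfidf)
print(round(rate, 2))   # 0.66 for this toy example
```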
The following tables (Table 1, Table 2, Table 3, Table 4 and Table 5) show the validation results of each cluster: “Patent” denotes the patent number randomly selected from each cluster for validation; “Origin” denotes the amount of information in the original patent text; “st-text” denotes the amount of information in the original patent text, with the compression rate of 30% by the LexRank summary algorithm; and “graph” denotes the amount of information in the graph. The retention rate is divided into three types: Ratio (st-t/o) is the retention rate of the summary text obtained by the LexRank algorithm to the original patent text; Ratio (g/o) is the retention rate of the graph to the original patent text, and Ratio (g/st-t) is the retention rate of the graph to the summary text obtained by the LexRank algorithm. Table 1 shows the validation results of Cluster 1, “Absorbent.” The retention rate of the summary text to the original text is 89%; the retention rate of the summary graph to the original text is 73%; and the retention rate of the summary graph to the summary text is 78%.
Table 2 shows the validation results of Cluster 2, “Process Design”. The retention rate of the summary text to the original text is 87%; the retention rate of the summary graph to the original text is 78%; and the retention rate of summary graph to the summary text is 89%. The validation results of the Cluster 3, “Process Control”, patent summaries are listed in Table 3. The retention rate of the summary text to the original text is 88%; the retention rate of the summary graph to the original text is 71%; and the retention rate of the summary graph to the summary text is 81%.
Table 4 shows the validation results of Cluster 4, “Chemical Properties”. The retention rate of the summary text to the original text is 88%; the retention rate of the summary graph to the original text is 67%; and the retention rate of the summary graph to the summary text is 76%. Finally, Table 5 shows the validation results of Cluster 5, “Others”. The retention rate of the summary text to the original text is 89%; the retention rate of the summary graph to the original text is 70%; and the retention rate of the summary graph to the summary text is 79%.
From Table 1, Table 2, Table 3, Table 4 and Table 5 above, we can conclude that the average retention rate of the summary text to the original text is near 89%, and the average retention rate of the summary graph to the summary text reaches 81%. In other words, the summary texts extracted by the summarization algorithm retain an average of 89% of the information in the original documents. The summary texts in all the sub-clusters are then converted and presented graphically, adopting our proposed approach, retaining a reliable average level of 81% of the information.

5. Discussion

As demonstrated in the detailed case study (using carbon capture patents as examples), the chemical utility patent summaries are converted into easy-to-read knowledge graphs for R&D engineers to capture the state-of-the-art innovations during patent mining. Patent sets from different domains can be analyzed and represented as knowledge graphs by adopting the proposed systematic methodology. With a good retention rate in the thorough validation process, the novel methodology proposed in this study for summary graph generation has proven to be efficient and reliable. Hence, whenever engineers read complex patent documents, especially in chemical research, the time spent reading and analyzing these complex patent texts can be largely reduced by using patent summary graphs. This study thus accelerates knowledge capturing through knowledge graph generation. On the other hand, for the demonstration of the chemical process graph generation module, to avoid repeating published content, please refer to our sequel research paper for case examples [8]. With complementary results, the chemical process graphs provide detailed visual knowledge representations of individual chemical patents, which support prior art reviews of chemical patents for research innovations.

6. Conclusions

In summary, the concept of graphical representation of patent knowledge can be applied to any given field (not merely in chemical utility patents, as illustrated in this research), such as the designs of other manufacturing processes, e.g., in semiconductors, autonomous vehicles, satellites, and green energy. This study demonstrates that the graphical representation of knowledge approach will bring a different experience compared to the traditional text-reading method. The contributions can be highlighted with three main points. First, the system can be used to shorten the time of reading and understanding patent contents and identify the key relevant patents quickly (by easily viewing graphs, instead of lengthy paragraphs of texts). Second, the system generates knowledge graphs that can be used to reduce the time and cost of organizing the invention procedures of patents, so that the researchers can reproduce similar experiments quickly, based on the novel procedures (e.g., chemical processes) depicted in the graphs. Third, the system can be applied to help R&D engineers find relevant state-of-the-art patents quickly by reviewing their patent knowledge graphs and comparing their graph similarities.
Nonetheless, the patent pools available for training and testing are quite diversified across fields. Demonstrating the validity of the proposed system in different domains still requires careful testing and validation. In our immediate future research, the general applicability to other domains (working with industry R&D teams) will be our main focus.

Author Contributions

Conceptualization, A.J.C.T.; methodology, A.J.C.T. and C.-P.L.; validation, A.J.C.T. and H.-J.L.; data curation, A.J.C.T. and H.-J.L.; writing—original draft, A.J.C.T. and C.-P.L.; writing—review and editing, A.J.C.T. and H.-J.L.; funding acquisition, A.J.C.T.; formal analysis, C.-P.L. and H.-J.L.; investigation, C.-P.L. and H.-J.L.; visualization, H.-J.L.; software, H.-J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research is partially supported by the Ministry of Science and Technology, Taiwan (Grant No.: MOST 111-2221-E-007-050-MY3).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data generated or analyzed during this study are available from the corresponding authors on reasonable request.

Acknowledgments

The authors express their gratitude toward the funding support from the Ministry of Science and Technology, Taiwan (Grant No.: MOST 111-2221-E-007-050-MY3).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Zhang, T.; Sahinidis, N.V.; Rosé, C.P.; Amaran, S.; Shuang, B. Forty years of Computers and Chemical Engineering: Analysis of the field via text mining techniques. Comput. Chem. Eng. 2019, 129, 106511.
2. Akhondi, S.A.; Klenner, A.G.; Tyrchan, C.; Manchala, A.K.; Boppana, K.; Lowe, D.; Zimmermann, M.; Jagarlapudi, S.A.R.P.; Sayle, R.; Kors, J.A.; et al. Annotated Chemical Patent Corpus: A Gold Standard for Text Mining. PLoS ONE 2014, 9, e107477.
3. Ashaari, A.; Ahmad, T.; Awang, S.; Shukor, N. A Graph-Based Dynamic Modeling for Palm Oil Refining Process. Processes 2021, 9, 523.
4. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3973–3983.
5. Mallick, C.; Das, A.K.; Dutta, M.; Das, A.K.; Sarkar, A. Graph-Based Text Summarization Using Modified TextRank. In Soft Computing in Data Analytics; Springer: Berlin, Germany, 2019; pp. 137–146.
6. Sharma, P.; Li, Y. Self-supervised contextual keyword and keyphrase retrieval with self-labelling. Preprints 2019.
7. Kim, E.; Huang, K.; Kononova, O.; Ceder, G.; Olivetti, E. Distilling a Materials Synthesis Ontology. Matter 2019, 1, 8–12.
8. Trappey, A.; Trappey, C.; Liang, C.-P.; Lin, H.-J. IP Analytics and Machine Learning Applied to Create Process Visualization Graphs for Chemical Utility Patents. Processes 2021, 9, 1342.
9. George, J.; Hautier, G. Chemist versus Machine: Traditional Knowledge versus Machine Learning Techniques. Trends Chem. 2020, 3, 86–95.
10. Hawizy, L.; Jessop, D.M.; Adams, N.; Murray-Rust, P. ChemicalTagger: A tool for semantic text-mining in chemistry. J. Cheminform. 2011, 3, 17.
11. Jessop, D.M.; Adams, S.; Willighagen, E.L.; Hawizy, L.; Murray-Rust, P. OSCAR4: A flexible architecture for chemical text-mining. J. Cheminform. 2011, 3, 41.
12. Swain, M.; Cole, J.M. ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature. J. Chem. Inf. Model. 2016, 56, 1894–1904.
13. Gao, X.; Tan, R.; Li, G. Research on Text Mining of Material Science Based on Natural Language Processing. IOP Conf. Ser. Mater. Sci. Eng. 2020, 768.
14. Kim, E.; Huang, K.; Saunders, A.; McCallum, A.; Ceder, G.; Olivetti, E. Materials Synthesis Insights from Scientific Literature via Text Extraction and Machine Learning. Chem. Mater. 2017, 29, 9436–9444.
15. Tao, J.; Brayton, K.A.; Broschat, S.L. Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database. Appl. Sci. 2020, 11, 24.
16. Campos, D.; Matos, S.; Oliveira, J.L. A document processing pipeline for annotating chemical entities in scientific documents. J. Cheminform. 2015, 7, S7.
17. Das, A.; Ganguly, D.; Garain, U. Named Entity Recognition with Word Embeddings and Wikipedia Categories for a Low-Resource Language. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2017, 16, 1–19.
18. Beliga, S. Keyword Extraction: A Review of Methods and Approaches; University of Rijeka, Department of Informatics: Rijeka, Croatia, 2014.
19. Zhang, C.; Wang, H.; Liu, Y.; Wu, D.; Liao, Y.; Wang, B. Automatic keyword extraction from documents using conditional random fields. J. Comput. Inf. Syst. 2008, 4, 1169–1180.
20. Chen, P.-I.; Lin, S.-J. Automatic keyword prediction using Google similarity distance. Expert Syst. Appl. 2010, 37, 1928–1938.
21. Bharti, K.S.; Babu, K.S. Automatic keyword extraction for text summarization: A survey. arXiv 2017, arXiv:1704.03242.
22. Turney, P. Learning to Extract Keyphrases from Text. arXiv 2002.
23. Madani, F.; Weber, C. The evolution of patent mining: Applying bibliometrics analysis and keyword network analysis. World Pat. Inf. 2016, 46, 32–48.
24. Goldberg, Y. A Primer on Neural Network Models for Natural Language Processing. J. Artif. Intell. Res. 2016, 57, 345–420.
25. Bengio, Y. Neural net language models. Scholarpedia 2008, 3, 3881.
26. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781.
27. Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014.
28. Li, S.; Gong, B. Word embedding and text classification based on deep learning methods. MATEC Web Conf. 2021, 336, 06022.
29. Gupta, P.; Roy, I.; Batra, G.; Dubey, A.K. Decoding Emotions in Text Using GloVe Embeddings. In Proceedings of the 2021 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS), Greater Noida, India, 19–20 February 2021; pp. 36–40.
30. Parwita, I.M.M.; Siahaan, D. Classification of Mobile Application Reviews using Word Embedding and Convolutional Neural Network. Lontar Komput. J. Ilm. Teknol. Inf. 2019, 1–8.
31. Santos, I.; Nedjah, N.; Mourelle, L.D.M. Sentiment analysis using convolutional neural network with fastText embeddings. In Proceedings of the 2017 IEEE Latin American Conference on Computational Intelligence (LA-CCI), Arequipa, Peru, 8–10 November 2017; pp. 1–5.
32. Moghadasi, M.N.; Zhuang, Y. Sent2Vec: A New Sentence Embedding Representation with Sentimental Semantic. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020; pp. 4672–4680.
33. Arora, S.; Liang, Y.; Ma, T. A simple but tough-to-beat baseline for sentence embeddings. In Proceedings of the International Conference on Learning Representations (ICLR) 2017 Conference, Palais des Congrès Neptune, Toulon, France, 24–26 April 2017.
34. Arora, S.; Li, Y.; Liang, Y.; Ma, T.; Risteski, A. A Latent Variable Model Approach to PMI-based Word Embeddings. Trans. Assoc. Comput. Linguist. 2016, 4, 385–399.
35. Meena, Y.K.; Gopalani, D. Evolutionary Algorithms for Extractive Automatic Text Summarization. Procedia Comput. Sci. 2015, 48, 244–249.
36. Saranyamol, C.; Sindhu, L. A survey on automatic text summarization. Int. J. Comput. Sci. Inf. Technol. 2014, 5, 7889–7893.
37. Pal, A.R.; Saha, D. An approach to automatic text summarization using WordNet. In Proceedings of the 2014 IEEE International Advance Computing Conference (IACC), Gurgaon, India, 21–22 February 2014.
38. Khazaei, A.; Ghasemzadeh, M. Comparing k-means clusters on parallel Persian-English corpus. J. Artif. Intell. Data Min. 2015, 3, 203–208.
39. Ramesh, A.; Srinivasa, K.; Pramod, N. SentenceRank—A graph based approach to summarize text. In Proceedings of the Fifth International Conference on the Applications of Digital Information and Web Technologies (ICADIWT 2014), Bangalore, India, 17–19 February 2014.
40. Li, G.-K.J.; Trappey, C.V.; Trappey, A.J.; Li, A.A. Ontology-based knowledge representation and semantic topic modeling for intelligent trademark legal precedent research. World Pat. Inf. 2022, 68, 102098.
41. West, D.B. Introduction to Graph Theory; Prentice Hall: Upper Saddle River, NJ, USA, 2001; Volume 2.
42. Fruchterman, T.M.J.; Reingold, E.M. Graph drawing by force-directed placement. Softw. Pract. Exp. 1991, 21, 1129–1164.
43. Kobourov, S.G. Spring embedders and force directed graph drawing algorithms. arXiv 2012, arXiv:1201.3011.
44. Cline, M.S.; Smoot, M.; Cerami, E.; Kuchinsky, A.; Landys, N.; Workman, C.; Christmas, R.; Avila-Campilo, I.; Creech, M.; Gross, B.; et al. Integration of biological networks and gene expression data using Cytoscape. Nat. Protoc. 2007, 2, 2366–2382.
45. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv 2019, arXiv:1909.11942.
46. Li, J.; Sun, A.; Han, J.; Li, C. A Survey on Deep Learning for Named Entity Recognition. IEEE Trans. Knowl. Data Eng. 2020, 34, 50–70.
47. Schmid, H. Part-of-speech tagging with neural networks. arXiv 1994, 1, 172.
48. Erkan, G.; Radev, D.R. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. J. Artif. Intell. Res. 2004, 22, 457–479.
49. Angeli, G.; Premkumar, M.J.J.; Manning, C.D. Leveraging linguistic structure for open domain information extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, 26–31 June 2015.
50. Manning, C.D.; Surdeanu, M.; Bauer, J.; Finkel, J.R.; Bethard, S.; McClosky, D. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, MD, USA, 23–24 June 2014.
51. Schmitz, M.; Soderland, S.; Bart, R.; Etzioni, O. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Jeju Island, Korea, 12–14 June 2012.
  52. Smoot, M.E.; Ono, K.; Ruscheinski, J.; Wang, P.-L.; Ideker, T. Cytoscape 2.8: New features for data integration and network visualization. Bioinformatics 2010, 27, 431–432. [Google Scholar] [CrossRef] [PubMed]
  53. Yang, Y.Y.; Klose, T.; Lippy, J.; Barcelon-Yang, C.S.; Zhang, L. Leveraging text analytics in patent analysis to empower business decisions–A competitive differentiation of kinase assay technology platforms by I2E text mining software. World Pat. Inf. 2014, 39, 24–34. [Google Scholar] [CrossRef]
  54. Maehara, Y.; Kuku, A.; Osabe, Y. Macro analysis of decarbonization-related patent technologies by patent domain-specific BERT. World Pat. Inf. 2022, 69, 102112. [Google Scholar] [CrossRef]
  55. Maroto-Valer, M.M. Developments and Innovation in Carbon Dioxide (CO2) Capture and Storage Technology: Carbon Dioxide (CO2) Storage and Utilisation; Woodhead Publishing, Headquarters: Sawston, UK, 2010. [Google Scholar]
  56. D’Alessandro, D.M.; Smit, B.; Long, J.R. Carbon Dioxide Capture: Prospects for New Materials. Angew. Chem. Int. Ed. 2010, 49, 6058–6082. [Google Scholar] [CrossRef]
  57. Kondo, A.; Kuboki, T.; Suzuki, A.; Udatsu, M.; Watando, H. Carbon Dioxide Absorbent and Carbon Dioxide Separation and Recovery System. US Patent 2020. [Google Scholar]
  58. Eddaoudi, M.; Guillerm, V.; Weselinski, L.; Alkordi, M.H.; Mohideen, M.I.H.; Belmabkhout, Y. Amine functionalized porous network. US Patent 2017. [Google Scholar]
  59. Likas, A.; Vlassis, N.; Verbeek, J.J. The global k-means clustering algorithm. Pattern Recognit. 2003, 36, 451–461. [Google Scholar] [CrossRef] [Green Version]
  60. Salton, G.; Buckley, C. Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 1988, 24, 513–523. [Google Scholar] [CrossRef] [Green Version]
Figure 1. Examples of random, spring, and organic layouts: (a) Random layout; (b) Spring layout; (c) Organic layout.
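To make the layout comparison in Figure 1 concrete, the following minimal sketch draws the same toy graph with a random layout and a spring (force-directed) layout using NetworkX and matplotlib; the keyword graph and its node labels are hypothetical placeholders, not taken from the patent data, and the organic layout (available in dedicated graph tools rather than in NetworkX) is omitted here.

```python
# Minimal sketch (not the study's implementation): contrasting a random layout
# with a spring (force-directed) layout on a small, hypothetical keyword graph.
import networkx as nx
import matplotlib.pyplot as plt

# Hypothetical keyword graph; node names are illustrative only.
G = nx.Graph()
G.add_edges_from([
    ("sorbent", "amine"), ("sorbent", "CO2"), ("amine", "absorption"),
    ("CO2", "absorption"), ("CO2", "regeneration"), ("absorption", "column"),
])

layouts = {
    "Random layout": nx.random_layout(G, seed=42),
    "Spring (force-directed) layout": nx.spring_layout(G, seed=42),
}

fig, axes = plt.subplots(1, len(layouts), figsize=(10, 4))
for ax, (title, pos) in zip(axes, layouts.items()):
    nx.draw(G, pos, ax=ax, with_labels=True, node_size=900, font_size=8)
    ax.set_title(title)
plt.tight_layout()
plt.show()
```

With a fixed seed, the spring layout pulls connected keywords together and pushes unconnected ones apart, which is why force-directed (and organic) layouts are generally easier to read than random placement.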
Figure 2. The system architecture shows the parallel modules of chemical process and patent summary graphs.
Figure 3. Schematic diagram of co-occurrence matrix construction.
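Complementing the schematic in Figure 3, the sketch below shows one straightforward way to build a keyword co-occurrence matrix by counting keyword pairs that appear in the same sentence; the keyword list and sample sentences are hypothetical placeholders, and the actual keyword extraction in the study relies on the trained language models rather than this simple token matching.

```python
# Minimal sketch (illustrative only): counting sentence-level co-occurrences
# of extracted keywords to build a symmetric co-occurrence matrix.
from itertools import combinations

# Hypothetical keywords and sentences (placeholders, not from the patent corpus).
keywords = ["sorbent", "amine", "co2", "absorption"]
sentences = [
    "The amine sorbent captures CO2 in the absorption column.",
    "CO2 absorption is followed by sorbent regeneration.",
]

index = {kw: i for i, kw in enumerate(keywords)}
matrix = [[0] * len(keywords) for _ in keywords]

for sentence in sentences:
    tokens = set(sentence.lower().replace(".", "").split())
    present = [kw for kw in keywords if kw in tokens]
    for a, b in combinations(present, 2):
        i, j = index[a], index[b]
        matrix[i][j] += 1
        matrix[j][i] += 1  # keep the matrix symmetric

for kw, row in zip(keywords, matrix):
    print(f"{kw:>10}: {row}")
```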
Figure 4. Adding edge interactions between nodes.
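The edge-adding step illustrated in Figure 4 can then be read as thresholding the co-occurrence matrix: any keyword pair with a non-zero count becomes a weighted edge between the corresponding nodes. The function below is an assumed construction for illustration (reusing the hypothetical keywords and counts from the previous sketch), not the authors' exact procedure.

```python
# Minimal sketch (assumed construction): turning a co-occurrence matrix into
# a weighted, undirected keyword graph whose edges mirror Figure 4.
import networkx as nx

def cooccurrence_to_graph(keywords, matrix):
    """Add an edge between two keywords whenever their co-occurrence count > 0."""
    G = nx.Graph()
    G.add_nodes_from(keywords)
    for i in range(len(keywords)):
        for j in range(i + 1, len(keywords)):
            if matrix[i][j] > 0:
                G.add_edge(keywords[i], keywords[j], weight=matrix[i][j])
    return G

# Hypothetical inputs (e.g., the output of the previous co-occurrence sketch).
keywords = ["sorbent", "amine", "co2", "absorption"]
matrix = [
    [0, 1, 2, 2],
    [1, 0, 1, 1],
    [2, 1, 0, 2],
    [2, 1, 2, 0],
]

G = cooccurrence_to_graph(keywords, matrix)
print(sorted(G.edges(data="weight")))
```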
Figure 5. The ontology schema of carbon capture and storage technologies.
Figure 6. Process graph of US10722838B2.
Figure 7. Summary graph of US10722838B2.
Figure 8. Process graph of US9663627B2.
Figure 9. Summary graph of US9663627B2.
Table 1. Cluster 1 validation results.

| Patent | Origin (o) | st-Text (st-t) | Graph (g) | Retention Ratio (st-t/o) | Retention Ratio (g/o) | Retention Ratio (g/st-t) |
|---|---|---|---|---|---|---|
| US20110120308A1 | 5.20 | 4.33 | 3.58 | 0.83 | 0.69 | 0.83 |
| US20140004016A1 | 5.01 | 4.07 | 3.43 | 0.81 | 0.68 | 0.84 |
| US20140196481A1 | 5.14 | 4.71 | 2.94 | 0.92 | 0.57 | 0.62 |
| US20160151742A1 | 4.55 | 4.14 | 3.63 | 0.91 | 0.88 | 0.80 |
| US20180162729A1 | 5.68 | 4.75 | 3.73 | 0.84 | 0.78 | 0.66 |
| US20190308137A1 | 5.46 | 4.96 | 3.59 | 0.91 | 0.66 | 0.72 |
| US20190388836A1 | 4.86 | 4.47 | 4.21 | 0.92 | 0.87 | 0.94 |
| US20200001231A1 | 5.03 | 4.64 | 3.59 | 0.92 | 0.71 | 0.77 |
| US20200276730A1 | 3.97 | 3.38 | 3.06 | 0.85 | 0.77 | 0.90 |
| US20200289975A1 | 3.82 | 3.72 | 2.70 | 0.97 | 0.71 | 0.73 |
| Average | | | | 89% | 73% | 78% |
Table 2. Cluster 2 validation results.

| Patent | Origin (o) | st-Text (st-t) | Graph (g) | Retention Ratio (st-t/o) | Retention Ratio (g/o) | Retention Ratio (g/st-t) |
|---|---|---|---|---|---|---|
| US9856755B2 | 6.66 | 4.47 | 4.24 | 0.67 | 0.64 | 0.95 |
| US8983791B2 | 5.80 | 5.07 | 4.80 | 0.87 | 0.83 | 0.95 |
| US10005032B2 | 6.05 | 5.14 | 4.70 | 0.85 | 0.78 | 0.92 |
| US9157353B2 | 5.23 | 4.60 | 4.43 | 0.88 | 0.85 | 0.96 |
| US9737848B2 | 5.45 | 4.88 | 4.54 | 0.90 | 0.83 | 0.93 |
| US10155194B2 | 6.04 | 5.71 | 4.06 | 0.95 | 0.67 | 0.71 |
| US10183865B2 | 5.55 | 4.73 | 4.47 | 0.85 | 0.81 | 0.95 |
| US20140301927A1 | 5.90 | 5.79 | 5.01 | 0.98 | 0.85 | 0.87 |
| US20150246312A1 | 5.59 | 4.95 | 3.97 | 0.89 | 0.71 | 0.80 |
| US20170173522A1 | 5.79 | 5.19 | 4.50 | 0.90 | 0.78 | 0.87 |
| Average | | | | 87% | 78% | 89% |
Table 3. Cluster 3 validation results.

| Patent | Origin (o) | st-Text (st-t) | Graph (g) | Retention Ratio (st-t/o) | Retention Ratio (g/o) | Retention Ratio (g/st-t) |
|---|---|---|---|---|---|---|
| US20080141672A1 | 4.01 | 3.78 | 3.10 | 0.94 | 0.77 | 0.82 |
| US20080196584A1 | 3.89 | 3.35 | 2.94 | 0.86 | 0.76 | 0.88 |
| US20080196585A1 | 3.89 | 3.35 | 2.94 | 0.86 | 0.76 | 0.88 |
| US20080196587A1 | 3.87 | 3.26 | 2.75 | 0.84 | 0.71 | 0.84 |
| US20090277326A1 | 3.81 | 3.27 | 2.48 | 0.86 | 0.65 | 0.76 |
| US20140190351A1 | 3.27 | 3.04 | 2.33 | 0.93 | 0.71 | 0.77 |
| US20160193561A1 | 3.27 | 3.07 | 2.33 | 0.94 | 0.71 | 0.76 |
| US20190282952A1 | 3.57 | 3.16 | 2.31 | 0.89 | 0.65 | 0.73 |
| US20190299151A1 | 2.91 | 2.41 | 2.19 | 0.83 | 0.75 | 0.91 |
| US20200075981A1 | 3.86 | 3.33 | 2.43 | 0.86 | 0.63 | 0.73 |
| Average | | | | 88% | 71% | 81% |
Table 4. Cluster 4 validation results.

| Patent | Origin (o) | st-Text (st-t) | Graph (g) | Retention Ratio (st-t/o) | Retention Ratio (g/o) | Retention Ratio (g/st-t) |
|---|---|---|---|---|---|---|
| US8834822B1 | 5.06 | 5.00 | 3.39 | 0.99 | 0.67 | 0.68 |
| US20110250121A1 | 5.20 | 4.45 | 3.14 | 0.86 | 0.60 | 0.71 |
| US20130121903A1 | 4.43 | 3.90 | 3.19 | 0.88 | 0.72 | 0.82 |
| US20140161697A1 | 4.96 | 4.22 | 3.12 | 0.85 | 0.63 | 0.74 |
| US20150158013A1 | 4.43 | 3.60 | 3.00 | 0.81 | 0.68 | 0.83 |
| US20150238894A1 | 5.25 | 4.71 | 3.14 | 0.90 | 0.60 | 0.67 |
| US20160002044A1 | 5.08 | 4.47 | 3.25 | 0.88 | 0.64 | 0.73 |
| US20160101385A1 | 4.97 | 4.25 | 3.52 | 0.86 | 0.71 | 0.83 |
| US20180214849A | 4.97 | 4.41 | 3.68 | 0.89 | 0.74 | 0.83 |
| US20190358585A1 | 5.25 | 4.58 | 3.61 | 0.87 | 0.69 | 0.79 |
| Average | | | | 88% | 67% | 76% |
Table 5. Cluster 5 validation results.

| Patent | Origin (o) | st-Text (st-t) | Graph (g) | Retention Ratio (st-t/o) | Retention Ratio (g/o) | Retention Ratio (g/st-t) |
|---|---|---|---|---|---|---|
| US20080178733A1 | 5.06 | 4.52 | 4.06 | 0.89 | 0.80 | 0.90 |
| US20090257941A1 | 4.54 | 4.16 | 3.24 | 0.92 | 0.71 | 0.78 |
| US20100282082A1 | 4.96 | 4.29 | 3.19 | 0.86 | 0.64 | 0.74 |
| US20110232286A1 | 4.72 | 4.67 | 3.50 | 0.99 | 0.74 | 0.75 |
| US20150013300A1 | 4.99 | 4.55 | 3.43 | 0.91 | 0.69 | 0.75 |
| US20160214057A | 5.67 | 4.71 | 3.90 | 0.83 | 0.69 | 0.83 |
| US20160346727A1 | 4.10 | 3.52 | 3.06 | 0.86 | 0.75 | 0.87 |
| US20180187887A | 4.99 | 4.65 | 3.43 | 0.93 | 0.69 | 0.74 |
| US20200147544A | 5.26 | 4.32 | 3.49 | 0.82 | 0.66 | 0.81 |
| US20210245092A1 | 5.45 | 4.84 | 3.50 | 0.89 | 0.64 | 0.72 |
| Average | | | | 89% | 70% | 79% |
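As a reading aid for Tables 1–5, each retention ratio is a simple quotient of the three knowledge-content scores reported per patent (here abbreviated o for the original text, st-t for the summarized text, and g for the summary graph, as inferred from the ratio labels), and each cluster average is the mean of the per-patent ratios. A minimal sketch of that arithmetic, using the first row of Table 1 as input, follows.

```python
# Minimal sketch: retention ratios as reported in Tables 1-5.
# o = score of the original text, st_t = summarized text, g = summary graph.

def retention_ratios(o, st_t, g):
    return {
        "st-t/o": round(st_t / o, 2),   # how much the summary text retains of the original
        "g/o": round(g / o, 2),         # how much the graph retains of the original
        "g/st-t": round(g / st_t, 2),   # how much the graph retains of the summary text
    }

# First row of Table 1 (US20110120308A1).
print(retention_ratios(5.20, 4.33, 3.58))
# -> {'st-t/o': 0.83, 'g/o': 0.69, 'g/st-t': 0.83}

# Cluster average over per-patent ratios, e.g., the st-t/o column of Table 1.
st_t_over_o = [0.83, 0.81, 0.92, 0.91, 0.84, 0.91, 0.92, 0.92, 0.85, 0.97]
print(f"{sum(st_t_over_o) / len(st_t_over_o):.0%}")  # -> 89%
```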
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
