Deep Learning and Text Mining: Classifying and Extracting Key Information from Construction Accident Narratives

Li, Jue; Wu, Chang

doi:10.3390/app131910599


by Jue Li * and Chang Wu

School of Traffic and Transportation of Engineering, Changsha University of Science and Technology, Changsha 410114, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(19), 10599; https://doi.org/10.3390/app131910599

Submission received: 31 July 2023 / Revised: 20 August 2023 / Accepted: 25 August 2023 / Published: 22 September 2023

(This article belongs to the Section Computing and Artificial Intelligence)


Abstract

Construction accidents can lead to serious consequences. To reduce the occurrence of such accidents and strengthen on-site safety management, managers must analyze accident report texts in depth and extract valuable information from them. However, accident report texts are usually unstructured or semi-structured. Analyzing them manually requires a lot of time and effort, does not scale to large numbers of accident texts, and may yield key information of poor quality. Therefore, this study proposes a classification method based on natural language processing (NLP) technology. First, we developed a text classification model based on a convolutional neural network (CNN) that can automatically classify accident categories based on accident text features. Next, taking the classified fall accidents as an example, we extracted key information from accident narratives using the term frequency-inverse document frequency (TF-IDF) method and presented it visually using word clouds. The results show that the overall accuracy of the CNN model reaches 84%, which is better than the other three shallow machine-learning models. In addition, eight key accident areas and three accident-prone operations were identified using the TF-IDF algorithm. This study can provide important guidance for project managers and can be used for on-site safety management to help prevent production safety accidents.

Keywords:

deep learning; text mining; TF-IDF algorithm; accident narrative classification; convolutional neural network

1. Introduction

The construction industry is a pillar industry of China’s national economy. In 2021, the total output value of Chinese construction enterprises reached 29.3 trillion CNY, representing a growth of 1.14 times compared to 2012 [1]. Construction accidents remain a persistent problem in the industry. A construction accident is characterized by its suddenness, prevalence, wide impact, and severe consequences. Therefore, ensuring the safety of personnel is of paramount importance. Figure 1 displays the number of incidents and fatalities in China’s housing and municipal engineering production safety accidents over the past five years. From 2016 to 2019, the number of accidents and fatalities showed an upward trend, followed by a slight decrease from 2019 to 2020. The figures nevertheless remained at a high level, indicating that the safety situation in China’s housing and municipal engineering production remains severe. Within these accidents, object strikes, electrocution, collapse of objects, falling, and mechanical injury (known as the “five major hazards” in the construction industry) account for over 85% of the total number of accidents [2]. Therefore, the rapid identification of accident types and the extraction of key information from accident texts are crucial for preventing accidents and managing emergencies.

Construction accidents generate a large amount of accident text, and manually analyzing this text is a time-consuming and inefficient task [3]. In recent years, with the advancement in natural language processing (NLP) techniques, computers have become capable of analyzing vast amounts of text [4,5]. NLP can address the classification of unstructured or semi-structured text by analyzing the semantic meaning, syntactic style, and textual content [6,7,8]. Traditional machine learning methods include SVM (Support Vector Machine), NB (Naive Bayes), KNN (K-Nearest Neighbor), LR (Logistic Regression), and RF (Random Forest). Ye et al. [9] proposed an SVM-NB classification method for text sentiment recognition to obtain more emotion polarities. Huang et al. [10] applied the multi-label classification algorithm (ML-KNN) to social media users for multi-label classification tasks. Jalal et al. [11] introduced an improved random forest algorithm (IRFTC) for text classification. Deep learning is a subset of machine learning that can be applied to computer vision, prediction, and semantic analysis [12]. Deep learning methods include CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), and LSTM (Long Short-Term Memory). Alsaleh et al. [13] proposed a genetic algorithm-based CNN for Arabic text classification. Gu et al. [14] used a Bi-RNN (Bidirectional Recurrent Neural Network) with LSTM and topic models to capture more contextual and semantic information for short text classification.

In a literature search, we found some studies on machine learning methods used in building accident analysis. Shuang et al. [15] used machine learning techniques to identify key combinations of causes for fatal construction site accidents to predict different types of accidents. Choi et al. [16] utilized machine learning to identify potential risks of fatal construction site accidents. Zermane et al. [17] used a random forest classification machine learning model to predict fatal high fall accidents and detect possible factors influencing fatal high fall accidents in the Malaysian construction industry. These studies have validated the reliability and applicability of machine learning methods in identifying and predicting building accidents. Building accidents generate a significant amount of accident text, and managers rely on browsing these texts to understand and control accident information, which requires considerable time and effort. Hence, many machine learning methods have been developed to process accident texts [18]. Chen et al. [19] proposed a text mining approach for the causal classification of accidents using a relation graph convolutional network (R-GCN) and pre-trained BERT (Bidirectional Encoder Representation from Transformers) to automatically explore causal information in accident investigation reports. Zhang et al. [5] analyzed construction accident reports using the term frequency-inverse document frequency (TF-IDF) method and NLP techniques, presenting five baseline models and an ensemble model for accident cause classification. Pan et al. [20] developed a graph-based deep learning method for the automatic classification of accident reports labeled with accident types and injury types. Xu et al. [21] utilized deep learning to automatically classify and predict the causes of accidents through text features and fast extraction of cause information. 
These studies proposed different classification methods and approaches for processing accident report texts that demonstrated good application effectiveness. However, there is still limited research on automatically classifying Chinese building accident narratives using machine learning methods. The automatic classification of accident narratives can help managers quickly understand the types of accidents. Although the automatic classification of accident narratives itself does not generate new knowledge [22], it provides a means to improve management efficiency. By knowing the specific accident types, managers can take appropriate measures. Text classification models such as SVM, CNN, and RNN can improve management efficiency but cannot extract detailed knowledge from text information [23]. Text mining techniques provide a method for obtaining detailed knowledge from textual information. It is the process of extracting previously unknown, understandable, potential, and useful patterns or knowledge from a collection of textual data [24]. Qiu et al. [25] combined text mining techniques with complex networks to explore the causal mechanism of coal mine accidents. Jing et al. [26] developed a chemical accident case text-mining method based on Word2vec and bidirectional LSTM for the correlation analysis and text classification of chemical accident cases. Hu et al. [27] used the Latent Dirichlet Allocation (LDA) model to mine the best safety evidence regarding accident causal topics and causal factors, providing safety decision support. From these studies, it is evident that combining machine learning and text mining methods holds the potential to play a role in handling construction accident texts, and their applicability should be explored.

In this context, this study aims to develop an automated categorization method for construction accident narratives in China, in order to help managers analyze and understand accident texts and to enhance their execution capabilities in terms of on-site safety management. Specifically, we developed a CNN-based text classification model to automatically classify Chinese accident narratives and then analyzed the accident narratives using the TF-IDF algorithm, employing the classified falling accidents as an example to further extract the key information. Finally, we created word clouds to visualize key information, aiding managers in understanding and analyzing accident text information.

The purpose of this paper is to develop a machine learning-based automatic classification method to help managers understand and analyze accident text. Using this approach, managers can directly acquire accident experience knowledge from accident texts, enabling them to prevent accidents and respond effectively in the event of an incident. The main contributions of this research are as follows: (1) the development of a CNN-based text classification model for automatic classification of building accident narratives, (2) the analysis of classified accident narratives using text mining and TF-IDF algorithm to extract key information, and (3) the visualization of key information in accident narratives to enhance the understanding and analysis efficiency of accident text information.

The remaining parts of this paper are organized as follows: Section 2 describes the proposed research methodology in detail. Section 3 presents the obtained results and analyzes them. Section 4 discusses the significance and limitations of the proposed method. Finally, Section 5 concludes the paper.

2. Materials and Methods

To prevent the “Five Major Hazards” accidents in the construction industry, we propose an automatic accident narrative classification method based on deep learning and text mining. This method consists of two stages: predicting the accident category and mining key information from the accident narratives. Before classification, the accident texts undergo preprocessing. This study follows the three steps shown in Figure 2 to achieve this goal. In the first step, relevant accident reports are extracted from a large dataset of construction accident reports. Then, the accident narratives are manually labeled to determine the corresponding accident categories. Next, the text is segmented using the Jieba library. Finally, irrelevant words without practical meaning are removed based on an imported stopword dictionary to reduce noise. In the second step, Word2vec technology converts the text into word vectors [28]. The generated word vectors are then fed into the CNN, whose convolutional kernels extract features from them, to construct a text classification model. In the third step, taking falling accidents as an example, the TF-IDF algorithm is applied for text mining to extract key information from the accident narratives. The key information is visualized using word clouds.

2.1. Data Set

Accident investigation reports are important documents that provide comprehensive narratives of various accidents, including their causes, processes, and rescue situations. These reports are often used for accident statistics and analysis, and important data can be obtained through the summarization and organization of accident investigation reports. In this study, 1727 construction production accident investigation reports were collected from the official websites of emergency management departments (bureaus) in various provinces and cities in China, as well as from the internet (Table 1 and Table 2). These reports cover five types of accidents: object strikes, electrocution, collapse of objects, falling, and mechanical injury. After screening the accident reports, some investigation reports with incomplete descriptions of the accident process were removed, and the extracted valid accident reports were ultimately compiled and saved in “.csv” format.

2.2. Data Preprocessing

Preprocessing consists of three main steps: (1) labeling accident categories, (2) word segmentation, and (3) removing stopwords. A CNN classifier is trained on labeled data. Therefore, a column for the accident category label is added to each accident text, and the labels are assigned manually according to the accident process and the accident type. Since multiple events may occur during an accident, the labeling of cases follows the principle of identifying the first occurrence of loss of control or unintentional behavior [19]. Figure 3 shows an example of an accident process, describing an electrocution accident that occurred to a worker at a construction site.

In this study, the Jieba library in Python is used for word segmentation. Jieba can break down complete sentences into individual words, facilitating subsequent text-processing tasks. Among the three segmentation modes provided by Jieba (precise mode, full mode, and search engine mode), the precise mode segments sentences most accurately and is the most suitable for text analysis. Therefore, the precise mode was adopted in this study. In addition, Jieba supports part-of-speech tagging, which allows words to be filtered by their part of speech. The posseg module was used in this study to obtain the corresponding part of speech for each word. After part-of-speech tagging, words can be directly deleted based on their part of speech, such as names of people or places.

In natural language processing, stopwords refer to words that appear frequently in a text but do not carry much actual meaning. These words are often common vocabulary, such as “in general”, “or”, “however”, “most”, etc. They do not provide much information but can significantly affect the accuracy of the results. To exclude these influential stopwords as much as possible, this study uses the Harbin Institute of Technology stop-word list, Baidu stop-word list, and Sichuan University Machine Intelligence Laboratory stop-word list, consolidating and removing duplicates from all stopwords. Additionally, some domain-specific technical terms, such as “shield tunneling machine”, are included to enhance text analysis.
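The filtering described in this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text is taken to be already segmented and tagged (for instance by Jieba's posseg module, which yields word/flag pairs), the three tiny stop-word sets are hypothetical stand-ins for the real lists, and the function name `filter_tokens` is invented here. The flags `nr` (person name) and `ns` (place name) follow Jieba's tag set.

```python
# Hypothetical stand-ins for the three stop-word lists (HIT, Baidu, SCU);
# the real lists would be read from files, one word per line.
HIT_STOPWORDS = {"的", "了", "在"}
BAIDU_STOPWORDS = {"的", "和", "或"}
SCU_STOPWORDS = {"了", "但是", "在"}

# Consolidate the lists; a set union removes duplicates automatically.
STOPWORDS = HIT_STOPWORDS | BAIDU_STOPWORDS | SCU_STOPWORDS

# Part-of-speech flags to drop: person names (nr) and place names (ns).
DROPPED_FLAGS = {"nr", "ns"}


def filter_tokens(tagged_tokens, stopwords=STOPWORDS, dropped_flags=DROPPED_FLAGS):
    """Keep words that are neither stopwords nor tagged with a dropped flag."""
    return [
        word
        for word, flag in tagged_tokens
        if word not in stopwords and flag not in dropped_flags
    ]


# Hypothetical (word, flag) pairs, in the shape a POS tagger would produce.
tagged = [
    ("张三", "nr"),    # person name: removed by flag
    ("在", "p"),       # stopword: removed by list
    ("长沙", "ns"),    # place name: removed by flag
    ("脚手架", "n"),   # "scaffolding": kept
    ("坠落", "v"),     # "fall": kept
]

print(filter_tokens(tagged))  # ['脚手架', '坠落']
```

Using a set for the merged list keeps membership checks fast even when the consolidated stop-word list contains thousands of entries.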

2.3. Automatic Classification of CNN

2.3.1. Word Embedding

Word embedding is a technique in the field of natural language processing that represents words or phrases as real-valued vectors, enabling computers to recognize them. The goal of this technique is to map semantically similar words to vector spaces where they are close to each other, capturing the associations between vocabulary words.

Word2vec is a model architecture proposed by Mikolov et al. [29] for computing continuous vector representations of words. It is primarily used to transform text information from unstructured to vectorized forms [30]. Two methods of word embedding are One-hot vectors and Word2vec. One-hot vectors are a traditional approach widely used for text quantification [22]. Compared to One-hot vectors, Word2vec has stronger semantic expressiveness. One-hot vectors only represent the occurrence position of words in a vocabulary list without considering the correlations between words, resulting in limited expressive power. On the other hand, Word2vec captures the various types of relationships between words through training and provides more contextual information for similar words. In addition, Word2vec has lower dimensions, which improves computational efficiency when dealing with high-dimensionality data. Finally, the Word2vec algorithm can efficiently handle large-scale corpora.
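The difference in expressive power can be illustrated with cosine similarity. In the sketch below, the one-hot vectors of any two distinct words are orthogonal, so their similarity is always zero, whereas dense vectors can place related words close together. The three-dimensional dense vectors are made up for illustration and are not trained Word2vec embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def one_hot(index, size):
    """Vector of zeros with a single 1 at the word's vocabulary position."""
    return [1.0 if i == index else 0.0 for i in range(size)]


vocab = ["scaffolding", "hoist", "electrocution"]

# One-hot: every pair of distinct words is orthogonal (similarity 0).
sim_one_hot = cosine(one_hot(0, len(vocab)), one_hot(1, len(vocab)))

# Hypothetical dense vectors (invented values, not trained embeddings).
dense = {
    "scaffolding": [0.9, 0.8, 0.1],
    "hoist": [0.8, 0.9, 0.2],
    "electrocution": [0.1, 0.2, 0.9],
}
sim_related = cosine(dense["scaffolding"], dense["hoist"])
sim_unrelated = cosine(dense["scaffolding"], dense["electrocution"])

print(sim_one_hot, round(sim_related, 3), round(sim_unrelated, 3))
```

Note also that a one-hot vector needs one dimension per vocabulary word, while the dense representation has a fixed, much smaller dimension.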

Word2vec includes two models: Continuous Bag of Words (CBOW) and Skip-gram [31]. The CBOW model takes each word in the context as input and attempts to predict the center word. The Skip-gram model learns vectors from the center word and tries to predict the surrounding context words based on these vectors. When dealing with construction accident narrative texts with incomplete context information, the Skip-gram model is more suitable for handling sparse text features. Therefore, this study selects Word2vec to generate the word embedding representations for the CNN model and chooses Skip-gram as the training algorithm.
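The two training objectives differ only in which side of the (context, center) relation is the input. The following sketch builds the training examples each model would see for a short token list. The function names and the sample sentence are hypothetical, and the fixed symmetric window is a simplification of how Word2vec implementations sample context in practice.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs: the center word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """Return (context_words, center) examples: the neighbors predict the center."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


tokens = ["worker", "fell", "from", "scaffolding"]
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Skip-gram produces one training pair per (center, neighbor) combination, so each occurrence of a rare word contributes several updates to its vector, which is one common explanation for its behavior on sparse text.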

2.3.2. CNN-Based Text Classification Model

CNN, as an efficient recognition algorithm, can accurately extract and analyze semantic features from text and image data [32,33], as well as automatically extract and combine these features from the data without human intervention [34]. In the field of text classification, the core idea of CNN is to search for relationships between neighboring words in the text using convolutional kernels, extract useful features, and use these features for text classification. The CNN model consists of five parts: input layer, convolutional layer, pooling layer, fully connected layer, and output layer, as shown in Figure 4.

In the CNN model, the preprocessed text is transformed into a text matrix through word embedding, and then convolutional kernels extract local features from the text matrix, generating a series of two-dimensional feature maps. Next, the pooling layer reduces the dimensionality of each feature map, resulting in a compact text vector representation. The fully connected layer takes the text vector as input and maps it to scores for all possible categories, and finally, the softmax function is used to calculate the probability for each category. Since CNN is a supervised machine learning method, category labels must be added to each incident description. During the classification process, activation functions and dropout techniques need to be added. Activation functions increase the model’s nonlinearity, enabling it to better adapt to complex input data. In the CNN model constructed in this study, ReLU and softmax are used as activation functions, and dropout is introduced to reduce the occurrence of overfitting. The pseudocode of the CNN-based text classification model is shown in Table 3.
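The data flow through these layers can be traced with a toy forward pass in plain Python. This is a minimal sketch and not the model in Table 3: all weights and inputs are made-up numbers, the pooling step is assumed to be global max pooling, and dropout is left out because it is only active during training.

```python
import math


def conv1d_same_relu(seq, filters, biases):
    """Slide each filter over the token sequence ('same' zero padding), then ReLU.

    seq:     T token vectors of dimension D
    filters: F filters, each k rows of D weights (k assumed odd)
    returns: T rows of F activations
    """
    k = len(filters[0])
    dim = len(seq[0])
    pad = [[0.0] * dim for _ in range(k // 2)]
    padded = pad + seq + pad
    out = []
    for t in range(len(seq)):
        window = padded[t:t + k]
        row = []
        for f, b in zip(filters, biases):
            z = b + sum(w * x for w_row, x_row in zip(f, window)
                        for w, x in zip(w_row, x_row))
            row.append(max(0.0, z))  # ReLU
        out.append(row)
    return out


def global_max_pool(feature_maps):
    """Keep the strongest activation of each filter across all positions."""
    return [max(col) for col in zip(*feature_maps)]


def dense(vec, weights, biases):
    """Fully connected layer: one score per category."""
    return [b + sum(w * x for w, x in zip(w_row, vec))
            for w_row, b in zip(weights, biases)]


def softmax(scores):
    """Convert scores to probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy input: 4 tokens with 2-dimensional embeddings (all numbers are made up).
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Two filters of width 3.
filters = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    [[-1.0, 0.0], [0.0, -1.0], [0.0, 0.0]],
]
conv_biases = [0.0, 0.5]
# Dense layer mapping 2 pooled features to 3 categories.
dense_w = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
dense_b = [0.0, 0.0, 0.0]

features = conv1d_same_relu(seq, filters, conv_biases)  # 4 positions x 2 filters
pooled = global_max_pool(features)                      # one value per filter
probs = softmax(dense(pooled, dense_w, dense_b))        # one probability per category
print(pooled, probs)
```

Whatever the length of the input text, pooling leaves one value per filter, which is why the fully connected layer can have a fixed input size.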

2.4. Key Information Extraction and Word Cloud Visualization

Building accident reports contain valuable information, and mastering key information can help prevent accidents more effectively. TF-IDF is a method that ranks the importance of terms based on their frequency of occurrence in a given document [35]. It calculates the importance of a term in the entire corpus by considering its frequency in the text and its frequency across all documents. In simple terms, if a term appears frequently in a specific text but rarely in other texts, its TF-IDF value will be high, indicating its importance. Conversely, if a term appears in multiple texts, its TF-IDF value will be low, indicating its poor distinctiveness and lower importance. Therefore, the keywords extracted by the TF-IDF algorithm have strong representativeness in the text and can effectively distinguish them from other corpora. Thus, the TF-IDF algorithm is used to further evaluate key information, and the calculation formula is shown as Equation (1).

\text{TF-IDF} = \text{TF} \times \text{IDF} = \frac{N_{w}}{N} \times \log \left( \frac{Y}{Y_{w} + 1} \right)

(1)

where N_w is the number of times the word w appears in a given text, and N is the total number of all words in the text. Y is the total number of texts in the text collection. Y_w is the number of texts containing the word w.
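Equation (1) translates directly into code. The sketch below is an illustration on a hypothetical four-document corpus; the function name `tf_idf` is invented here, and the natural logarithm is an assumption, since the text does not state the base.

```python
import math


def tf_idf(word, text, corpus):
    """TF-IDF of `word` in `text` following Equation (1).

    text:   list of tokens (one document)
    corpus: list of documents, each a list of tokens
    """
    n_w = text.count(word)                          # N_w
    n = len(text)                                   # N
    y = len(corpus)                                 # Y
    y_w = sum(1 for doc in corpus if word in doc)   # Y_w
    return (n_w / n) * math.log(y / (y_w + 1))      # natural log is an assumption


# Hypothetical mini-corpus of already segmented accident narratives.
corpus = [
    ["worker", "fell", "scaffolding", "scaffolding"],
    ["worker", "touched", "cable"],
    ["worker", "struck", "brick"],
    ["worker", "crane", "collapsed"],
]
doc = corpus[0]
scores = {w: tf_idf(w, doc, corpus) for w in set(doc)}

for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}: {score:.4f}")
```

With the +1 in the denominator, a word that occurs in every text (here "worker") receives a slightly negative score, so it falls to the bottom of the ranking, while a word concentrated in one text ("scaffolding") ranks first.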

Word cloud visualization is a common technique used for data visualization. It visually represents the frequent occurrence of keywords in a text. This technique helps users quickly and intuitively understand the most important information in the text data. By visualizing with a word cloud, safety managers can quickly grasp the core content of the text data and easily extract key information from construction accidents, enabling them to take appropriate safety-management measures. The generation of a word cloud is based on the TF-IDF values of the keywords. If the TF-IDF value of a keyword is higher, it will occupy a larger proportion and size in the word cloud visualization. The pseudocode for extracting key information using the TF-IDF algorithm and generating word clouds is shown in Table 4.

3. Results

3.1. Configuration

In this study, we developed a CNN model based on Python 3.9 and TensorFlow 2.11.0 in the Anaconda environment. Several key libraries were utilized during the model construction process. Firstly, the Jieba library was used for text segmentation and the removal of stopwords. Additionally, libraries such as pandas, numpy, scikit-learn, gensim, keras 2.1.1, and matplotlib were employed for developing the CNN model. Finally, the Wordcloud library was utilized to generate word cloud visualizations, enabling the presentation of text information in a visually intuitive manner.

3.2. Automatic Category Classification

3.2.1. Parameter Setting

Setting appropriate CNN parameters is crucial for achieving accurate classification models of construction accident narratives. The filter size in the convolutional layers was set to 5 to capture features of different lengths and improve classification accuracy. The maximum length of the text sequences was set to 250. This choice was based on the analysis and understanding of construction accident narratives, ensuring that the model captures sufficient contextual information while limiting the impact of excessively long sequences. A dropout probability of 0.4 was employed to enhance the model’s generalization ability and reduce overfitting to the training data.

In summary, the parameter settings for the CNN model were as follows: maxlen = 250, word vector dimension = 128, minimum word frequency = 3, number of CPU cores = 4, context window size = 4, dropout = 0.4, epochs = 12, batch size = 32, filter size = 5, number of filters = 256, and padding = same. These parameters were chosen to develop an optimal CNN model. In this paper, shallow learning methods (SVM, NB, and KNN) are compared with the CNN model to examine the effectiveness of the developed CNN model.
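As a rough sanity check on these settings, the size of the convolutional layer can be worked out by hand. The sketch below assumes a single one-dimensional convolution with stride 1 applied directly to the embedded sequence, which the text does not state explicitly; the helper names are hypothetical.

```python
def conv1d_output_length(seq_len, kernel_size, padding, stride=1):
    """Output length of a 1D convolution along the token axis."""
    if padding == "same":
        return -(-seq_len // stride)  # ceil(seq_len / stride)
    if padding == "valid":
        return (seq_len - kernel_size) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")


def conv1d_param_count(kernel_size, in_channels, n_filters):
    """Weights (kernel_size x in_channels per filter) plus one bias per filter."""
    return kernel_size * in_channels * n_filters + n_filters


# Values reported in the text.
MAXLEN, EMBED_DIM, FILTER_SIZE, N_FILTERS = 250, 128, 5, 256

out_len = conv1d_output_length(MAXLEN, FILTER_SIZE, "same")
params = conv1d_param_count(FILTER_SIZE, EMBED_DIM, N_FILTERS)
print(out_len, params)  # 250 164096
```

Under these assumptions, "same" padding keeps all 250 positions, whereas "valid" padding would shorten the feature map to 246 positions.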

3.2.2. Evaluation Indicators

The classification performance of the CNN model is evaluated with precision, recall, and F1 score, together with overall accuracy. Accuracy represents the proportion of correctly classified samples out of the total number of samples. Precision measures the proportion of samples predicted as positive that are actually positive. Recall measures the proportion of correctly identified positive samples out of all positive ones. For all of these metrics, higher values indicate better performance. The F1 score is the harmonic mean of precision and recall, so it accounts for both false positives and false negatives and provides a single metric of the model’s overall performance. In practical applications, these metrics are commonly considered together to evaluate the classification performance of CNN models. The formulas for precision, recall, and F1 score are as follows:

Precision = \frac{TP}{(TP + FP)}

(2)

Recall = \frac{TP}{(TP + FN)}

(3)

F1 score = \frac{2 \times Precision \times Recall}{Precision + Recall}

(4)

where TP is the number of true positives, FP is the number of false positives, FN is the number of false negatives, and TN is the number of true negatives.
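For a multi-class task such as this one, Equations (2) to (4) are typically applied per category in a one-vs-rest fashion. The following sketch shows that calculation on made-up labels; the function names are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here, not something the text specifies.

```python
def precision_recall_f1(tp, fp, fn):
    """Equations (2)-(4); returns 0.0 where a denominator would be zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def per_class_counts(y_true, y_pred, label):
    """One-vs-rest TP, FP, FN for a single category."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    return tp, fp, fn


# Made-up labels for illustration only (not results from the study).
y_true = ["falling", "falling", "falling", "electrocution", "electrocution", "collapse"]
y_pred = ["falling", "falling", "electrocution", "falling", "electrocution", "collapse"]

tp, fp, fn = per_class_counts(y_true, y_pred, "falling")
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(tp, fp, fn, round(precision, 3), round(recall, 3), round(f1, 3), round(accuracy, 3))
```

In this toy case one electrocution narrative is predicted as falling and one falling narrative as electrocution, which lowers precision and recall for both categories at once.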

3.2.3. Model Testing and Evaluation

First, this study compared two word-embedding methods, Word2vec and One-hot encoding, in the CNN model. The comparison results in Table 5 summarize the performance of these two methods.

The results in Table 5 indicate that the overall accuracy of the model is expected to be around 82%. Upon careful examination, the following findings were observed:

(1): Word2vec achieves an average accuracy of 84%, showing only a slight difference compared to the One-hot encoding model. In construction accident texts, the lack of complete contextual information often leads to sparse text features. Word2vec training, however, captures various types of relationships between words and provides more contextual information for similar words. Therefore, we expect the Word2vec model to better handle such texts.
(2): From the perspective of the F1 score, the One-hot encoding model performs similarly to Word2vec in the “falling” category. However, in other categories, Word2vec outperforms One-hot encoding.
(3): Compared to the CNN model using One-hot encoding as word embedding, the CNN model using Word2vec as word embedding performs better in classifying construction accident narrative texts.

To further validate the performance of the CNN model, it was compared with shallow machine learning algorithms such as SVM, KNN, and NB, and the results are presented in Figure 5 and Figure 6.

As shown in Figure 5, the CNN model performs well in categories such as “object strikes”, “collapse of objects”, and “falling”. In terms of mechanical injury, KNN is less effective, and the difference in accuracy between CNN and the other two machine learning models is so small as to be negligible. However, in the case of “electrocution”, the performance of the CNN model is not as good as the other three shallow machine learning models. A worker who is electrocuted loses control of the body, so an electrocution accident is prone to end in a fall. When extracting features, the CNN model may therefore pick up features related to fall accidents, leading to confusion between electrocution accidents and falling accidents and affecting its performance in electrocution accident classification. In summary, the CNN model demonstrates higher accuracy in classifying construction accident narrative texts compared to the other three shallow machine learning models.

Figure 6 illustrates the comparison results of different machine learning models in terms of accuracy. The CNN model achieved an overall accuracy of 84%, the NB model achieved an overall accuracy of 82%, the SVM model achieved an overall accuracy of 75%, and the KNN model achieved an overall accuracy of 70%. It is evident from the data that the CNN model has higher accuracy compared to the other three shallow machine learning models, making it more suitable for construction accident text classification tasks.

3.3. Key Information Mining and Word Cloud Visualization

3.3.1. Key Information Mining

To demonstrate the effectiveness of text classification, using classified falling accident texts as an example, this study employed the TF-IDF algorithm to calculate the importance of words. Due to the large number of features, some irrelevant words such as “personnel”, “company”, “operation”, and “construction site” were manually removed. Ultimately, the top 20 representative and weighted accident features were extracted, as shown in Table 6.

Among these keywords, eight common locations leading to falling accidents were identified, namely, “scaffolding”, “elevator”, “roof”, “exterior wall”, “hoist”, “opening”, “floor”, and “canopy”. Management personnel need to strengthen protective measures for these key areas to reduce the occurrence of falling accidents. Firstly, regarding scaffolding, management personnel should ensure proper construction that complies with regulations and conduct regular inspections and maintenance. Employees using scaffolding must undergo relevant training to understand correct assembly and usage methods. Additionally, they should be equipped with safety devices such as safety belts and safety nets to provide additional protection. Secondly, for the management of elevators and hoists, regular inspections of equipment performance should be carried out to ensure their safety and reliability. Employees must adhere to safety regulations while using elevators, including not overloading or pressing buttons indiscriminately. Emergency stop devices and corresponding escape routes should be in place for emergency evacuation during unexpected situations. Thirdly, for roofs and exterior walls, management personnel should ensure the installation and stability of safety barriers or fences to prevent personnel from accidentally falling. When performing work at heights, employees must wear personal protective equipment such as safety belts and slip-resistant shoes and strictly follow relevant operating procedures. Regular safety training and drills should be conducted to enhance employees’ safety awareness and emergency response capabilities. Openings and floors are common hazardous areas for falling accidents. Therefore, management personnel should set up prominent warning signs to remind employees to be cautious and take measures to restrict employees’ activities in these areas, such as installing protective railings and safety nets.

The keyword “unintentionally” suggests that many falling accidents stem from a lack of safety awareness among construction workers. To prevent such incidents, companies should strengthen safety training, and management personnel must enhance on-site safety supervision. Additionally, setting up warning signs is an important way to cultivate safety awareness among construction workers. Furthermore, falling accidents often occur during activities such as “demolition”, “painting”, and “cleaning”. Therefore, management personnel need to strengthen the supervision of workers during these operations to ensure they properly wear safety equipment and prevent accidents. Accident characterization tables for object strikes, electrocution, collapse of objects, and mechanical injury can be viewed in Appendix A, Table A1, Table A2, Table A3 and Table A4.

3.3.2. Word Cloud Visualization

In this study, to provide a clearer illustration of the importance of keywords related to falling accidents, the TF-IDF algorithm was employed to extract the top 20 weighted accident features, which were then visualized in a word cloud analysis, as shown in Figure 7. The word cloud analysis helps to understand the relationships between these features more effectively. In the word cloud, the size and position of the words reflect their importance, and larger words positioned closer to the center indicate greater importance. Notably, the top-ranking keywords in the word cloud analysis include “ground” (1.81 × 10⁻²), “fall” (1.79 × 10⁻²), “scaffolding” (1.67 × 10⁻²), “death” (1.39 × 10⁻²), “unintentionally” (1.16 × 10⁻²), and “elevator” (1.13 × 10⁻²). The values in parentheses represent the weights of these keywords in the word cloud visualization. Word clouds for object strikes, electrocution, collapse of objects, and mechanical injury can be viewed in Appendix B, Figure A1, Figure A2, Figure A3 and Figure A4.
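How weights translate into visual prominence can be illustrated with the six weights quoted above. The sketch maps each weight linearly onto a font-size range; this linear mapping, the size bounds, and the function name `font_sizes` are assumptions made for illustration, since word cloud libraries apply their own scaling and layout rules.

```python
def font_sizes(weights, min_size=12, max_size=60):
    """Map each keyword weight linearly onto [min_size, max_size]."""
    lo, hi = min(weights.values()), max(weights.values())
    span = hi - lo
    sizes = {}
    for word, w in weights.items():
        frac = (w - lo) / span if span else 1.0
        sizes[word] = round(min_size + frac * (max_size - min_size))
    return sizes


# TF-IDF weights reported in the text for falling accidents.
weights = {
    "ground": 1.81e-2,
    "fall": 1.79e-2,
    "scaffolding": 1.67e-2,
    "death": 1.39e-2,
    "unintentionally": 1.16e-2,
    "elevator": 1.13e-2,
}
sizes = font_sizes(weights)
ranked = sorted(sizes, key=sizes.get, reverse=True)
print(ranked)
```

The ordering of the rendered sizes follows the ordering of the TF-IDF weights, so the largest word in such a rendering is the highest-weighted keyword.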

4. Discussion

With the continuous advancement in natural language processing (NLP) technology, machine learning methods have found extensive applications in accident management. Although significant progress has been made in analyzing construction accident texts using machine learning methods in foreign countries, the application of intelligent methods based on machine learning to construction accidents in China is still relatively limited. This is primarily attributed to the linguistic characteristics, the cultural background differences in the Chinese context, and the lack of large-scale annotated Chinese construction accident datasets, which collectively pose challenges to the application of machine learning methods.

To better meet the analytical needs of Chinese construction accident texts and to improve the accuracy and efficiency of accident analysis, this study developed a deep learning-based text classification model. The model uses a convolutional neural network (CNN) to extract text features and automatically recognize accident types. In addition, the approach combines the TF-IDF algorithm from text mining with word cloud visualization to support the extraction and comprehension of crucial accident-related information.

The significance of this research lies in three aspects. First, the results provide empirical evidence that machine learning methods are feasible for processing Chinese accident texts. In particular, the CNN model identifies accident types in Chinese construction accident narratives efficiently and with an overall accuracy of 84%, helping management personnel analyze incident texts and reducing their reliance on manual analysis. Second, the TF-IDF text mining method enables the extraction of essential knowledge from construction accident narratives. Unlike previous research, which primarily focused on identifying key accident causes, this study emphasizes accident-prone areas and operations, providing a new perspective for on-site management personnel. Finally, the combination of deep learning and text mining is applicable not only to construction accidents but also to other accident-prone sectors such as the chemical, power, and transportation industries. These sectors also hold substantial unstructured or semi-structured textual data, including accident reports, descriptions, and records, and they likewise need to extract key information, identify accident types, and uncover latent issues from large volumes of text. Applying deep learning and text mining to these sectors is expected to further improve the analysis and management of accident-related data and to provide more comprehensive and precise support for safety incident management.

It is worth noting that the CNN model proposed in this study is a supervised, single-label classifier: each narrative was assigned only one accident category during data preprocessing. However, a single incident often involves multiple accident types. Future research could therefore develop multi-label classifiers that assign several accident types to one incident, enabling a more complete evaluation when accidents occur. Furthermore, this study covers only five typical accident types in the Chinese construction industry, although the range of actual accident types is far wider. Future work should build more comprehensive classifiers that cover more accident types and assess model performance on them. More diverse data sources would also benefit the model. Future studies could train it on accident texts from other domains to validate its generalization capability and extend it to other accident classification tasks, which may improve its accuracy and adaptability in practical applications.
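The difference between the single-label setup used in this study and a possible multi-label extension can be sketched as follows. This is a minimal illustration and not part of the study: the helper names, the example scores, and the 0.5 threshold are assumptions. A single-label classifier keeps only the highest-scoring category, whereas a multi-label classifier would typically score each category independently and keep every category above a threshold.

```python
# The five accident categories covered by the study.
CATEGORIES = [
    "object strikes",
    "electrocution",
    "collapse of objects",
    "falling",
    "mechanical injury",
]


def one_hot(label):
    """Single-label target: exactly one position is 1."""
    return [1 if category == label else 0 for category in CATEGORIES]


def multi_hot(labels):
    """Multi-label target: one position per applicable category is 1."""
    chosen = set(labels)
    unknown = chosen - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [1 if category in chosen else 0 for category in CATEGORIES]


def decode_single(scores):
    """Pick the single highest-scoring category (arg-max)."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return CATEGORIES[best]


def decode_multi(scores, threshold=0.5):
    """Keep every category whose independent score reaches the threshold."""
    return [c for c, s in zip(CATEGORIES, scores) if s >= threshold]


# Hypothetical narrative: a scaffold collapses and a worker falls with it.
target = multi_hot(["collapse of objects", "falling"])
scores = [0.05, 0.02, 0.81, 0.64, 0.10]  # made-up per-category scores

print(target)
print(decode_single(scores))
print(decode_multi(scores))
```

With the made-up scores above, the single-label decoder reports only the collapse, while the thresholded decoder also recovers the fall.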

In conclusion, future research could emphasize the development of classifiers capable of recognizing multiple accident types and integrating more extensive data sources, thereby further elevating model performance and its potential for wider adoption. This will enhance accuracy and efficiency in accident management, providing more precise and effective support and guidance for safety management in various sectors.

5. Conclusions

Construction accidents generate a large amount of unstructured or semi-structured text. The effective utilization of this data has the potential to provide managers with valuable insights that can enhance the safety management of on-site construction operations. The automatic classification of accident types and extraction of key information from accident narratives are the challenges addressed in this study. Therefore, we proposed combining deep learning and text mining to analyze construction accident texts. The main conclusions of the paper are as follows:

(1) We proposed a CNN-based text classification model that automatically classifies texts related to five types of accidents: electrocution, falling, object strikes, collapse of objects, and mechanical injury. The overall accuracy of the model reaches 84%, which is higher than that of the three shallow machine learning models used for comparison. This indicates that the CNN model is well suited to classifying construction accident narratives.
(2) Taking the classified fall accident texts as an example, we applied the TF-IDF algorithm to extract the top 20 weighted accident features. Analysis of these features revealed eight key accident areas and three accident-prone operations. Presenting this information as a word cloud gives managers a clear guide for on-site safety management and helps prevent similar accidents.
(3) By combining deep learning and text mining, our approach identified accident types quickly and extracted key information from Chinese accident texts. This research can help managers analyze and understand accident narratives, supporting post-accident emergency response and prevention measures. It also offers a new perspective and method for construction safety management.

Author Contributions

Conceptualization, C.W.; methodology, C.W.; software, C.W.; validation, C.W. and J.L.; formal analysis, C.W.; investigation, C.W.; resources, C.W.; data curation, C.W. and J.L.; writing—original draft preparation, C.W.; writing—review and editing, C.W. and J.L.; visualization, C.W. and J.L.; supervision, J.L.; project administration, J.L.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Hunan Province, China (Grant No. 2021JJ30744) and the Research Foundation of Education Bureau of Hunan Province, China (Grant No. 20K011).

Data Availability Statement

The data presented in this study are available on reasonable request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Object strikes accident characteristics table.

Keyword	TF-IDF	Keyword	TF-IDF
Dismantling	1.06 × 10⁻²	Elevator	0.72 × 10⁻²
Falling	1.02 × 10⁻²	Formwork	0.60 × 10⁻²
Head	1.01 × 10⁻²	Dropping	0.58 × 10⁻²
Death	0.99 × 10⁻²	Pipes	0.57 × 10⁻²
Ground	0.98 × 10⁻²	Hopper	0.53 × 10⁻²
Concrete	0.91 × 10⁻²	Exterior walls	0.46 × 10⁻²
Scaffolding	0.90 × 10⁻²	Dislodged	0.45 × 10⁻²
Pump truck	0.88 × 10⁻²	Pouring	0.44 × 10⁻²
Tower crane	0.78 × 10⁻²	Fasteners	0.43 × 10⁻²
Wire rope	0.74 × 10⁻²	Cutting	0.43 × 10⁻²

Table A2. Electrocution accident characteristics table.

Keyword	TF-IDF	Keyword	TF-IDF
Scaffolding	1.45 × 10⁻²	Punching	0.46 × 10⁻²
Electrocution	1.04 × 10⁻²	Switch	0.45 × 10⁻²
Crane	0.75 × 10⁻²	Contact	0.44 × 10⁻²
Wires	0.71 × 10⁻²	Fire protection	0.44 × 10⁻²
Concrete	0.68 × 10⁻²	Demolition	0.43 × 10⁻²
High-voltage lines	0.64 × 10⁻²	Restrooms	0.43 × 10⁻²
Ceiling	0.57 × 10⁻²	Electric shock	0.41 × 10⁻²
Wiring	0.55 × 10⁻²	Lighting	0.40 × 10⁻²
Power supply	0.54 × 10⁻²	Submersible pumps	0.39 × 10⁻²
Ground	0.51 × 10⁻²	Cleaning	0.38 × 10⁻²

Table A3. Collapse of objects accident characteristics table.

Keyword	TF-IDF	Keyword	TF-IDF
Scaffolding	1.62 × 10⁻²	Earthwork	0.67 × 10⁻²
Collapse	1.56 × 10⁻²	Burial	0.66 × 10⁻²
Walls	1.51 × 10⁻²	Clearance	0.65 × 10⁻²
Trench	1.35 × 10⁻²	Excavation	0.65 × 10⁻²
Pouring	1.32 × 10⁻²	Ground	0.60 × 10⁻²
Concrete	1.18 × 10⁻²	Falling	0.55 × 10⁻²
Demolition	1.17 × 10⁻²	Formwork	0.54 × 10⁻²
Pits	1.15 × 10⁻²	Excavator	0.53 × 10⁻²
Pipes	0.90 × 10⁻²	Fence	0.53 × 10⁻²
Collapse	0.75 × 10⁻²	Slope	0.50 × 10⁻²

Table A4. Mechanical injury accident characteristics table.

Keyword	TF-IDF	Keyword	TF-IDF
Mixing	1.08 × 10⁻²	Head rope	0.60 × 10⁻²
Reinforcing	1.06 × 10⁻²	Conveyor belt	0.60 × 10⁻²
Mixer	0.91 × 10⁻²	Belt	0.59 × 10⁻²
Concrete	0.87 × 10⁻²	Pumper	0.57 × 10⁻²
Drill	0.83 × 10⁻²	Switch	0.56 × 10⁻²
Equipment	0.83 × 10⁻²	Power	0.55 × 10⁻²
Death	0.76 × 10⁻²	Head	0.54 × 10⁻²
Winches	0.70 × 10⁻²	Body	0.53 × 10⁻²
Operation	0.69 × 10⁻²	Shutdown	0.53 × 10⁻²
Drill pipe	0.66 × 10⁻²	Scene	0.53 × 10⁻²

Appendix B

Figure A1. Object strikes accident word cloud map of the top 20 words. (a) Chinese word cloud and (b) English word cloud.

Figure A2. Electrocution accident word cloud map of the top 20 words. (a) Chinese word cloud and (b) English word cloud.

Figure A3. Collapse of objects word cloud map of the top 20 words. (a) Chinese word cloud and (b) English word cloud.

Figure A4. Mechanical injury word cloud map of the top 20 words. (a) Chinese word cloud and (b) English word cloud.


Figure 1. Statistics on the number of production safety accidents and deaths in China’s housing and municipal engineering industry from 2016 to 2020.

Figure 2. Workflow of the research.

Figure 3. Sample of accident description and labeling.

Figure 4. The CNN-based classification framework.

Figure 5. Performance comparison of various text classification models in the five categories.

Figure 6. Overall precision comparison of the four models.

Figure 7. Fall accident word cloud map of the top 20 words. (a) Chinese word cloud and (b) English word cloud.

Table 1. Number of accidents of various types in the accident data set.

Type of Accidents	Quantity
Object strikes	221
Electrocution	201
Collapse of objects	374
Falling	771
Mechanical injury	160

Table 2. Statistics on the number of accidents in each region of the dataset.

Regions	Quantity	Regions	Quantity
Beijing	145	Fujian	47
Tianjin	34	Jiangxi	29
Shanghai	126	Shandong	73
Chongqing	72	Henan	62
Shanxi	25	Hubei	92
Jiangsu	307	Hunan	67
Zhejiang	107	Guangdong	137
Anhui	153	Sichuan	125
Guizhou	14	Yunnan	26
Guangxi	73	Shaanxi	13

Table 3. Pseudocode for CNN models.

Pseudocode for CNN-Based Text Classification Models

1: Load CSV file
df = pd.read_csv('accidents.csv')
2: Split dataset into train and test sets
train_data, test_data = splitData(df)
3: Preprocess data
trainLabel, testLabel = preprocessLabels(train_data, test_data)
trainCut, testCut = preprocessText(train_data, test_data)
saveWordData(trainCut + testCut)
4: Train Word2Vec model
w2v_model = trainWord2Vec(loadWordData())
5: Tokenize and pad sequences
tokenizer = Tokenizer()
vocab = tokenizeAndPreprocess(loadWordData(), tokenizer)
trainSeq, testSeq = convertAndPad(tokenizer, trainCut, testCut)
6: Encode labels
trainCate, testCate = encodeOneHot(trainLabel, testLabel)
7: Build and train CNN model
model = buildAndTrainCNNModel(vocab, w2v_model, trainSeq, trainCate)
8: Evaluate and visualize model
evaluateAndVisualize(model, testSeq, testCate)
9: Generate reports
generateReports(model, testSeq, testCate)
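Steps 5 and 6 of the pseudocode can be made concrete with a small, dependency-free sketch. It assumes the texts are already segmented into words; the function names, the reserved ids, `max_len`, and the choice to pad and truncate at the end of each sequence are illustrative assumptions rather than the study's settings, and the example tokens are invented. Unlike the pseudocode, the vocabulary here is built from the training texts only, so that the handling of unseen words is visible.

```python
from collections import Counter

PAD_ID = 0  # id reserved for padding positions
OOV_ID = 1  # id used for words missing from the vocabulary


def build_vocab(tokenized_texts):
    """Assign ids by descending frequency (ties broken alphabetically)."""
    counts = Counter(word for text in tokenized_texts for word in text)
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: idx for idx, word in enumerate(ordered, start=2)}


def to_padded_sequences(tokenized_texts, vocab, max_len):
    """Convert words to ids, then truncate or right-pad to max_len."""
    sequences = []
    for text in tokenized_texts:
        ids = [vocab.get(word, OOV_ID) for word in text][:max_len]
        ids += [PAD_ID] * (max_len - len(ids))
        sequences.append(ids)
    return sequences


def encode_one_hot(labels, classes):
    """Turn each class name into a one-hot vector over `classes`."""
    index = {name: i for i, name in enumerate(classes)}
    return [
        [1 if index[label] == i else 0 for i in range(len(classes))]
        for label in labels
    ]


# Invented, already segmented narratives.
train_cut = [
    ["worker", "fell", "from", "scaffolding"],
    ["worker", "touched", "wire"],
]
test_cut = [["worker", "fell", "into", "pit"]]
classes = ["falling", "electrocution"]

vocab = build_vocab(train_cut)
train_seq = to_padded_sequences(train_cut, vocab, max_len=5)
test_seq = to_padded_sequences(test_cut, vocab, max_len=5)
train_cate = encode_one_hot(["falling", "electrocution"], classes)

print(train_seq)
print(test_seq)
print(train_cate)
```

The fixed-length integer sequences are what an embedding layer initialized from Word2Vec vectors would consume, and the one-hot vectors are the targets for a softmax output layer.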

Table 4. Pseudocode for the TF-IDF algorithm.

Pseudocode for TF-IDF Algorithm to Extract Key Information

1: Load CSV file
df = pd.read_csv('fall.txt')
2: Load stopwords
stopwords = load_stopwords('stopwords.txt')
3: Preprocess descriptions
df['description'] = df['description'].apply(lambda text: preprocess(text, stopwords))
4: Apply TF-IDF algorithm
vectorizer = create_vectorizer(max_features=1000)
tfidf = vectorizer.fit_transform(df['description'])
feature_names = vectorizer.get_feature_names_out()
tfidf_scores = calculate_tfidf_scores(tfidf)
5: Normalize and print top keywords
normalized_tfidf = normalize_scores(tfidf_scores)
print_normalized_keywords(normalized_tfidf)
6: Create and display word cloud
wordcloud = create_wordcloud(normalized_tfidf)
display_wordcloud(wordcloud)
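The scoring and normalization steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the study's implementation: it uses the textbook definitions tf = count / document length and idf = ln(N / df), sums each term's TF-IDF over all documents, and rescales the totals so that they sum to 1. The function names and the tiny example corpus are invented, and library vectorizers often use smoothed variants of idf that give different numbers.

```python
import math
from collections import Counter


def tfidf_keyword_weights(documents, stopwords=frozenset()):
    """Return {term: weight} with weights normalized to sum to 1."""
    # Drop stopwords, then drop documents that became empty.
    docs = [[w for w in doc if w not in stopwords] for doc in documents]
    docs = [doc for doc in docs if doc]
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))

    totals = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[term])
            totals[term] += tf * idf

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {term: score / grand_total for term, score in totals.items()}


def top_keywords(weights, k=20):
    """Highest-weighted terms first; ties are broken alphabetically."""
    return sorted(weights.items(), key=lambda item: (-item[1], item[0]))[:k]


# Invented, already segmented narratives.
narratives = [
    ["the", "worker", "fell", "scaffolding", "ground"],
    ["the", "worker", "fell", "roof", "ground"],
    ["the", "worker", "cleaning", "elevator", "shaft"],
]

weights = tfidf_keyword_weights(narratives, stopwords={"the"})
for term, weight in top_keywords(weights, k=5):
    print(f"{term:12s} {weight:.3f}")
```

In this toy corpus "worker" appears in every narrative, so its idf and therefore its weight is zero, while terms confined to a single narrative rank highest. This is the behavior that lets TF-IDF surface distinctive locations and operations rather than ubiquitous words.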

Table 5. Results of two word-embedding methods.

Label	CNN-Word2vec Precision	CNN-Word2vec Recall	CNN-Word2vec F1 Score	CNN-One-Hot Precision	CNN-One-Hot Recall	CNN-One-Hot F1 Score
Object strikes	0.67	0.66	0.66	0.70	0.49	0.57
Electrocution	0.77	0.83	0.80	0.68	0.60	0.64
Collapse of objects	0.93	0.92	0.93	0.84	0.94	0.89
Falling	0.93	0.90	0.91	0.87	0.95	0.91
Mechanical injury	0.90	1.00	0.95	0.88	0.83	0.86
Average	0.84	0.86	0.85	0.80	0.76	0.77
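The relationship between the per-class rows and the "Average" row of Table 5 can be checked with a short sketch. It assumes that the average is an unweighted (macro) mean over the five categories, which the table does not state explicitly; the function names are illustrative. The values are copied from the CNN-Word2vec columns.

```python
# Per-class (precision, recall, F1) for CNN-Word2vec, copied from Table 5.
word2vec_scores = {
    "object strikes": (0.67, 0.66, 0.66),
    "electrocution": (0.77, 0.83, 0.80),
    "collapse of objects": (0.93, 0.92, 0.93),
    "falling": (0.93, 0.90, 0.91),
    "mechanical injury": (0.90, 1.00, 0.95),
}


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(per_class):
    """Unweighted mean of each metric over all classes, to two decimals."""
    rows = list(per_class.values())
    return tuple(round(sum(column) / len(rows), 2) for column in zip(*rows))


macro = macro_average(word2vec_scores)
print(macro)  # (0.84, 0.86, 0.85)

# F1 recomputed from the rounded precision and recall of each class.
for name, (p, r, f1) in word2vec_scores.items():
    print(f"{name:20s} reported={f1:.2f} recomputed={f1_from(p, r):.3f}")
```

The macro means reproduce the CNN-Word2vec "Average" row. The recomputed F1 values agree with the reported ones only to within the rounding of the inputs, because the table lists precision and recall to two decimals.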

Table 6. Falling accident characteristics table.

Keyword	TF-IDF	Keyword	TF-IDF
Ground	1.81 × 10⁻²	Head	0.59 × 10⁻²
Fall	1.79 × 10⁻²	Painting	0.51 × 10⁻²
Scaffolding	1.67 × 10⁻²	Hoist	0.49 × 10⁻²
Death	1.39 × 10⁻²	Opening	0.43 × 10⁻²
Unintentionally	1.16 × 10⁻²	Floor	0.43 × 10⁻²
Elevator	1.13 × 10⁻²	Cleaning	0.42 × 10⁻²
Demolition	0.92 × 10⁻²	Material	0.42 × 10⁻²
Roof	0.85 × 10⁻²	Canopy	0.42 × 10⁻²
Template	0.78 × 10⁻²	Cement	0.42 × 10⁻²
Exterior wall	0.64 × 10⁻²	Wearing	0.40 × 10⁻²


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
