Article

Case-Based Reasoning and Attribute Features Mining for Posting-Popularity Prediction: A Case Study in the Online Automobile Community

1
School of Economics and Management, Tongji University, Shanghai 200092, China
2
School of Automation, Nanjing University of Science and Technology, Nanjing 210094, China
*
Author to whom correspondence should be addressed.
Submission received: 12 July 2022 / Revised: 3 August 2022 / Accepted: 8 August 2022 / Published: 11 August 2022
(This article belongs to the Section Mathematics and Computer Science)

Abstract

Social media is in a dynamic environment of real-time interaction, and users generate overwhelming and high-dimensional information at all times. A new case-based reasoning (CBR) method combined with attribute features mining for posting-popularity prediction in online communities is explored from the perspective of imitating human knowledge reasoning in artificial intelligence. To improve the quality of algorithms for CBR approach retrieval and extraction and describe high-dimensional network information in the form of the CBR case, the idea of intrinsically interpretable attribute features is proposed. Based on the theory and research of the social network combined with computer technology of data analysis and text mining, useful information could be successfully collected from massive network information, from which the simple information features and covered information features are summarized and extracted to explain the popularity of the online automobile community. We convert complex network information into a set of interpretable attribute features of different data types and construct the CBR approach presentation system of network postings. Moreover, this paper constructs the network posting cases database suitable for the social media network environment. To deal with extreme situations caused by network application scenarios, trimming suggestions and methods for similar posting cases of the network community have been provided. The case study shows that the developed posting popularity prediction method is suitable for the complex social network environment and can effectively support decision makers to fully use the experience and knowledge of historical cases and find an excellent solution to forecasting popularity in the network community.

1. Introduction

With the aid of smart terminals, such as mobile phones and tablets, social entertainment and experience-sharing through online media have become important channels for information dissemination in people’s daily lives [1,2]. According to the 48th report issued by the China Internet Network Information Center, by June 2021 the number of Internet users in China had exceeded 1 billion, the average weekly time spent on the web was 26.9 h per person, and the Internet penetration rate had reached 71.6% [3]. The protective isolation measures and home-office policies implemented to prevent and control the spread of the epidemic further increased people’s use of social media. Social networks represented by WeChat, Blogs, Twitter, and online communities have increased the joy and convenience of daily life for the public and have also become distribution centers for information. In the era of “We Media”, research on information dissemination in social networks is one of the academic hot spots [4,5,6,7,8].
The existing research on the effect and regular patterns of information dissemination in the social network can be roughly divided into three categories: (1) Information dissemination based on social network structure. Mathematical models are proposed based on the structure or distribution, network connection strength, or density of social networks [9,10,11]. Empirical research using Sina Weibo [12] shows that the popularity of content is well reflected in the structural diversity of early adopters. Kumar and Sinha [13] pointed out that the speed and intensity of information dissemination depend on the network topology and the initialization of network parameters; the node is the source of motivation, and information dissemination can gain expansion effects through network propagation [14]. Henry et al. [15] study mathematical models on information diffusion and summarize the influence on information flow by network structure, number, and distribution of linked network communities. (2) State and characteristics of network user groups. Wang and Zhu [16] pointed out that users’ social influence often enhances social media information dissemination inequality, and the corresponding social influence threshold model was developed. Wang et al. [7] explored virtual community reward systems’ influence and operation mechanism on sharing knowledge (explicit and implicit). Riquelme et al. [17] proposed a new centrality metric (MilestonesRank) to identify opinion leaders in social media. Firdaus et al. [18] studied the forwarding behavior and forwarding prediction of social network users and analyzed the information diffusion mechanism of online social networks. Ozer et al. present a multidimensional shape-based time-series clustering algorithm that could uncover meaningful clusters of popularity behaviors in real-world GitHub and Twitter datasets [19]. (3) Information dissemination analysis based on data characteristics. Foroozani and Ebrahimi [20] introduced an information diffusion model of social networks, and the densities of adjacent affected users were simulated from the perspectives of time and space for different types and scales of diffusions. Hnings et al. [21] and Fan et al. [22] studied the influence of different attribute features (such as tweet content information and Twitter users) on the speed and scale of information dissemination. Zhang et al. [23] confirmed the richness and reliability of information sources to positively impact information sharing and transmission. Li et al. [24] took the panel data of Sina Weibo as an example to explore the relationship between content characteristics, information source characteristics, and online interaction behaviors of microblog users.
To sum up, social networks have made revolutionary contributions to information sharing and dissemination. Considering social media’s profound and lasting impact on real society, end-users expect to know about all aspects of social life through the network sooner or later. Extracting useful or valuable information from the huge amount of online data to help anticipate, track, and solve issues (e.g., preventing natural hazards, estimating project funding, optimizing marketing campaigns) has become the critical expectation of researchers.

2. CBR Approach for Postings-Popularity Prediction in the Brand Automobile Community

2.1. CBR Approach and Application in Various Fields

Case-Based Reasoning (CBR) is an important knowledge reasoning technique in the AI field [25], which is used to support decision makers in finding ideal solutions. The CBR methodology simulates human thinking and reasoning: it solves the current problem by referring to historical experience. Specifically, historical cases similar to the new problem are retrieved, and the relevant knowledge of those cases is adopted to solve the new one. Certain principles or thresholds should be satisfied while retrieving and extracting historical cases. In this way, a fast and practical solution for the target case is obtained, which is an important advantage of the CBR approach and has led to successful applications in the fields of crisis prediction or emergency decisions [26,27,28], aided manufacturing [29,30], intelligent diagnosis [31,32,33], and expert intelligence [34,35].
Research on and applications of CBR focus on two points. Firstly, CBR research mainly focuses on the construction of case expression systems and similarity measure algorithms. According to the characteristics of heterogeneous information in the cases with a concealing property, Wu et al. [25] constructed a similar case extraction and amount estimation method based on a multi-dimensional characteristic system. Chang et al. [36] developed a novel CBR framework, including a collaborative filtering mechanism and a semantic-based case retrieval agent. Zhu et al. [28] studied data characteristics of urban floods and proposed a four-layer model based on the case similarity measurement of classification filtration, punctiform similarity, interval similarity, and the entropy weight method. Zhang et al. provided a hybrid similarity algorithm to assist law enforcement officers in finding the hidden property of judgment debtors and analyzed the characteristics of judgment debtors based on the hesitant fuzzy linguistic clustering method [37,38]. Zhang et al. explored the application of hesitant fuzzy linguistic term sets in evaluation information and analyzed the application of the interval type-2 fuzzy TOPSIS (IT2-FTOPSIS) method in risk evaluation [39,40]. He et al. [41] developed a hidden property evaluation model based on the probabilistic linguistic three-way multi-attribute decision-making (PL3W-MADM) method. Cai et al. selected the best performing k-nearest neighbor (KNN) as the evaluation function and developed a similarity calculation method based on normalized Euclidean distance [33]. Jin et al. [42] employed fuzzy similarity measurement (SM), numeric SM, textual SM, and interval SM to calculate the similarities between input variables and corresponding experimental values in visual prostheses research. Secondly, prediction research based on CBR has been applied in various fields. By introducing the concept of context into the CBR system, Zhang et al. [43] developed a method for predicting broadcasting ratings with historical data. Wei et al. [44] proposed a new traffic emission prediction model by combining the CBR with interval-valued intuitionistic fuzzy sets. Hui et al. [45] proposed hybridizing principles of CBR for business failure prediction based on datasets collected from normal economic and financial crisis environments.
It can be seen from the above that the CBR approach has been popularized in many fields, but it has not been widely applied in the field of social networks. At present, no suitable CBR-based method has been offered to deal with the problem of online-information dissemination, and no feasible case feature system and similarity measure calculation for social-media networks have been constructed. However, the methodology of the CBR conforms to human thinking and reasoning logic, which supports decision makers in finding solutions to new problems by extracting a group of similar historical cases. Moreover, the CBR approach has unique advantages in solving complex decision-making problems. According to the needs of practical applications, attribute features of different data types, such as symbols, crisp numbers, interval numbers, and fuzzy language variables, can be handled satisfactorily by the CBR approach. Based on these attribute features, the historical and target cases can be expressed mathematically, and the similarity measure between them can be calculated.
On the basis of these features, the CBR approach is suitable for application and promotion in complex scenes. Therefore, this paper will differ from previous research methods on social network information. From the view of imitating the human reasoning and thinking process, this paper plans to analyze and solve the target object problem by extracting similar historical case sets, from which the particular history knowledge was obtained as a reference [37,44]. We plan to have a case study on the brand automobile discussion site. Each posting in the online automobile community will be considered an independent case. The CBR approach will be explored to analyze and predict the popularity of postings in online communities.

2.2. The Framework of the CBR-Based Popularity Prediction of Community Postings

The brand automobile online community is a typical social medium that shares the complex environment and community characteristics of social media, including: (1) The online automobile community is in an interactive network environment that changes in real time. Massive data arises from user interactions in the network community. As time goes on, the data from online communities is not only overwhelmingly large but also generated in real time, which makes the online information difficult to track and process. (2) The data from the network is generally rich in information and complex in structure. It includes simple features that can be observed directly, but it is also composed of high-dimensional and variable-length content, such as histories of posting messages or comments. (3) There are always some platform mechanisms such as “Recommendation” or “Page Topping”, which are common in social media. By pushing notifications or adjusting messages’ position on the site page, “Recommendation” and “Page Topping” generally mean that a part of the messages will get more exposure in social network communities. These operation mechanisms of the network community allow some messages to gain excess returns, and the social platform usually endorses such messages. Moreover, “network effects” such as the Matthew effect and the information competition in the process of network information dissemination will lead to extreme events and excess returns.
All of the above make network data noisy, hard to quantify, and difficult to extract directly from the Web. It is a challenge to extract valuable information from noisy network data and convert it to a set of quantifiable features or usable attribute features. According to the above challenges, this paper proposes a CBR method for the popularity prediction of community postings, which is suitable for the social network environment, as follows:
(1) Sort out the existing research and theories on information sharing and dissemination in social media. From the perspective of the AI domain imitating human knowledge reasoning, we explore how to use the information of case postings in the network community to solve the problem of the target posting.
(2) Clean up and preprocess the network data packages and collect directly observable information features (hereinafter referred to as simple features) to prepare for the subsequent qualitative and quantitative work.
(3) Build a network case-postings database using statistical sampling combined with the reference milestone method.
(4) Summarize and discuss valuable information about the popularity of postings in the online community.
(5) Extract intrinsically interpretable features from internet information, including simple features and information features that require data analysis and text mining technology (hereinafter referred to as covered features). Convert those features into attribute features or hybrid attribute features of different data types, such as crisp numbers, interval numbers, and fuzzy language variables.
(6) Present the historical network case postings and the target posting as cases.
(7) Calculate the hybrid similarity measure under each attribute feature, represented by different types of data, between the online case postings and the target posting, and then calculate the overall hybrid similarity between cases.
(8) Set the extraction threshold of similar network case postings, provide trimming suggestions for extreme network postings, and establish a reference network case database.
(9) Generate the prediction results.
As shown in Figure 1, the left part of the framework is related to theoretical methods, and the right part of the frame is the specific process. The research framework consists of two parts: the preparation phase and the analysis phase to predict the popularity attention of the target posting.
Exploratory work of this paper: we propose a CBR method for posting-popularity prediction in social networks (hereinafter referred to as CBR & AM). Compared with previous methods, this paper approaches the problem from the perspective of the AI domain, imitating human knowledge reasoning and making full use of the historical case information and experience in the social network to solve the target object’s problem. Empirical research on the online automobile community and the calculation results of examples show that the CBR approach has unique advantages in solving problems related to social network information, especially when dealing with noisy data in social networks. Because the approach supports attribute features of different data types, such as crisp numbers, interval numbers, fuzzy language variables, and crisp symbols, network information with a complex structure can conveniently be represented in the form of cases, and hybrid similarities between cases can be calculated. The hybrid similarity calculation and case formalization of network information are very helpful for retrieving and extracting historical cases with practical reference value from massive network information. The case representation and the concise algorithm effectively support decision makers in finding more reasonable solutions to current problems by fully drawing on historical experience and knowledge.
On the other hand, compared with the existing CBR methods, from the perspective of promoting the application of the CBR approach, this paper provides an idea of how to summarize useful information to deal with target problems from overwhelming and complex information. Furthermore, the interpretable features are mined from high-dimensional data with complex structures and utilized to construct a case-representation system, not just the features suitable for constructing a case system. Specifically, based on the theory and research of social networks, useful information would be successfully distinguished from the huge and noisy online data. Then, meaningful information is converted into a set of quantifiable features (including simple features and covered features) with the help of computer data mining and text analysis technology. Faced with the complex and changeable community environment, a network case-postings database suitable for the dynamic internet environment is constructed based on statistical sampling combined with milestones. Furthermore, we provide a new CBR-extraction rule based on typical case extraction regulations to reduce the special influence caused by “network effects” (mentioned above), which could be more helpful for selecting from network cases. Then, suggestions are offered for how many extreme case postings should be trimmed appropriately. All the above work helps to expand the application scenarios of the CBR approach, especially in the network environment. It greatly improves the algorithm quality and efficiency of the CBR approach.
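Steps (7) to (9) of the framework can be illustrated with a minimal sketch. The paper's actual similarity formulas, weights, threshold, and trimming rule are not given in this section, so everything below is an assumption made for illustration: the distance-based local similarities, the weighted average, the `threshold` and `trim` parameters, and the use of a trimmed mean of attention as the prediction.

```python
from typing import Dict, List, Optional, Tuple, Union

# An attribute value is a crisp number, a symbol, or an interval (lower, upper).
Value = Union[float, str, Tuple[float, float]]
Case = Dict[str, Value]


def local_similarity(a: Value, b: Value, span: float) -> float:
    """Similarity in [0, 1] for one attribute (assumed formulas, not the paper's).

    `span` is the value range of the attribute, used to normalise distances.
    Both values are assumed to have the same data type.
    """
    if isinstance(a, str) or isinstance(b, str):        # symbolic attribute
        return 1.0 if a == b else 0.0
    if isinstance(a, tuple) and isinstance(b, tuple):    # interval attribute
        dist = (abs(a[0] - b[0]) + abs(a[1] - b[1])) / 2.0
    else:                                                # crisp number
        dist = abs(a - b)
    return max(0.0, 1.0 - dist / span) if span > 0 else 1.0


def overall_similarity(target: Case, case: Case,
                       weights: Dict[str, float],
                       spans: Dict[str, float]) -> float:
    """Weighted average of the local similarities over all weighted attributes."""
    total = sum(weights.values())
    return sum(
        w * local_similarity(target[k], case[k], spans.get(k, 1.0))
        for k, w in weights.items()
    ) / total


def predict_attention(target: Case, history: List[Tuple[Case, float]],
                      weights: Dict[str, float], spans: Dict[str, float],
                      threshold: float = 0.8, trim: float = 0.1) -> Optional[float]:
    """Trimmed mean of the attention of historical cases above the threshold."""
    similar = sorted(
        pa for case, pa in history
        if overall_similarity(target, case, weights, spans) >= threshold
    )
    if not similar:
        return None                      # no sufficiently similar case found
    k = int(len(similar) * trim)         # cases trimmed from each extreme
    kept = similar[k:len(similar) - k] or similar
    return sum(kept) / len(kept)


# Hypothetical cases: PN is crisp, MT is an interval, topic is a symbol.
target = {"PN": 10.0, "MT": (1.0, 2.0), "topic": "pickup"}
history = [
    ({"PN": 12.0, "MT": (1.0, 2.0), "topic": "pickup"}, 1000.0),
    ({"PN": 10.0, "MT": (1.5, 2.5), "topic": "pickup"}, 2000.0),
    ({"PN": 90.0, "MT": (8.0, 9.0), "topic": "travel"}, 50000.0),
]
weights = {"PN": 1.0, "MT": 1.0, "topic": 1.0}
spans = {"PN": 100.0, "MT": 10.0}
print(predict_attention(target, history, weights, spans))
```

In this toy run the third case falls below the threshold, so the prediction is the mean attention of the two similar cases. Trimming from both ends is only one possible way to limit the influence of extreme postings.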

2.3. Introduction of Background and Data Mining Analysis

2.3.1. Introduction of Background

The data package of more than 300,000 original records is taken as the research sample from the Magotan-Automobile social community in 2016 (1 January 2016 to 30 December 2016), website: https://club.autohome.com.cn/bbs/forum-c-496-1.html (accessed on 30 July 2022), and as of 30 December, the registered users of the Magotan-Automobile social community were about 100,000. This automobile network community is a Chinese online social community where users communicate, share, and spread information by creating postings, replying to postings, and clicking and browsing postings, similar to Facebook status updates or tweets on Twitter. The sample data package is obtained from individual postings, and each main posting could be regarded as an independent event.
The sample data package was cleaned up and transformed, resulting in a dataset of more than 27,000 posting records, from which we could directly obtain: (1) Simple features, such as user registered name (referred to as Username, abbreviated as US), number of user postings (Posting_number, PN), number of user responses to others (Reply-to_number, RT), number of user postings recommended by the community (Recommendation Postings_number, RE), user registration time (Member Time, MT), the time when the user published the posting (Posting Time, PT), the number of posting hits (Posting-Hit_number, PH), the number of replies to the posting (Posting-Reply_number, PR). (2) High-dimensional and variable length data, e.g., posting title text corpus (Posting-Title Corpus, TC), posting body text corpus (Posting-Body Corpus, BC).
Wu et al. [46] point out that the influence of social networks is related to many factors, which can only be reflected through the interaction between people. Network users make contact by publishing postings, sharing information, offering comments, and other means, and it is generally accepted that influence is linked to the dissemination of opinions. Different forms of social media influence have different manifestations. Hu et al. [5] put forward that users tend to watch videos with a large number of hits and browse postings that everyone likes to browse, which reflects social influence. Then, the information dissemination effect of video media can be measured by the amount of video viewing. For Sina Weibo, the popularity of Weibo is the sum of forwarding volume and replies. For postings from Tianya Forum, the number of replies to a posting is a measure of popularity. Based on the online automobile community background and sample dataset, the more the popularity or obtained attention of a posting in the community, the more clicks and replies to the posting, which means the influence of the posting is greater. Therefore, popularity (hereinafter referred to as attention) is understood as the sum of PH (the number of hits) and PR (the number of replies) of the posting, which can be used to measure the communication influence of a posting in the community; thus, the popularity of a posting in the community is named as Postings_attention (PA), which is defined as follows:
PA = PH + PR
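As a small illustration, the simple features listed above and the definition of PA can be held in one record type. This is a sketch only: the class name, the field names, and the sample values are hypothetical, and the abbreviations in the comments map each field back to the text.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Posting:
    """Hypothetical record for one main posting and its simple features."""
    username: str             # US
    posting_number: int       # PN, number of postings by the user
    reply_to_number: int      # RT, number of replies the user gave to others
    recommended_number: int   # RE, number of the user's recommended postings
    member_time: date         # MT, user registration time
    posting_time: date        # PT, time the posting was published
    hits: int                 # PH, number of posting hits
    replies: int              # PR, number of replies to the posting
    title: str                # TC, posting-title corpus
    body: str                 # BC, posting-body corpus

    @property
    def attention(self) -> int:
        """PA = PH + PR."""
        return self.hits + self.replies


# Made-up values, for illustration only.
example = Posting("user_a", 12, 40, 1, date(2015, 3, 1), date(2016, 5, 20),
                  hits=1520, replies=37, title="...", body="...")
print(example.attention)
```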

2.3.2. Construction of Network Post Cases Database

Based on the reasoning thinking of the CBR approach, with the help of referring to the attention of similar historical postings in the network community, we try to evaluate the amount of attention to the target posting. However, the amount of attention to a posting can vary greatly over time. The vast majority of online postings will generally experience a period of browsing and replying until almost no one pays attention, and only a very few postings can maintain the ability to gain attention over time. Based on Twitter research, F. Riquelme et al. recognize the influence change of “specific topics” understood as “trend topics”; it is important to note that the trending topics do not last forever, nor disappear to never return [17]. The empirical results showed that the influence of topics changes over time; it was found that 73% of the trending topics were only one day long, 15% lasted two days, and 5% three days, but some isolated cases of topics remained intermittent for more than 10 days and even longer [17]. Moreover, D. Henry et al. studied the information dissemination of the media, determining that the mechanism of communities to which a user belongs encourages information to reach volume peaks between 30 and 40-day intervals [15]. Therefore, the feature of Posting Time (PT) that is more than 45 days is taken as a milestone for reference in the paper. The measure of the milestone that we proposed is based on the relatively stable network state of the reference object at that time.
Furthermore, the scale of network data is huge, and only one year of network data is collected as the sample set. There are still tens of thousands of records after the sample dataset is cleaned and preprocessed. Statistical sampling is therefore used to construct the network case database economically and efficiently. To ensure that the sampled records are sufficiently representative, stratified sampling (also known as type sampling) is adopted, which is one of the most common sampling techniques in practical work. Before sampling, the N sampling units of the population are divided into k layers (classes) according to a certain mark, and random sampling is then conducted independently in each layer. Samples drawn in this way are called stratified samples. Stratified sampling makes full use of the information about the population for stratification. Its sampling effect is usually better than that of simple random sampling, and its representativeness is also better, which can improve the accuracy of estimators [47]. To sum up, the process of building a network-posting case database is as follows:
(1) Let P represent the dataset of more than 27,000 posting records from the network community. The records of P are divided into 12 groups according to the month of each record's posting time, then:
$$P = P_{Jan} \cup P_{Feb} \cup \dots \cup P_{Dec}$$
(2) Random sampling was carried out in each group $P_{Jan}, P_{Feb}, \dots, P_{Dec}$. From each group, n = 100 records were sampled, giving a sample dataset of 1200 records in total. The sample dataset, named $P_{year}$, is as follows:
$$P_{year} = P_{Jan} \cup P_{Feb} \cup \dots \cup P_{Dec}, \quad n_{month} = n_{Jan} + n_{Feb} + \dots + n_{Dec} = 1200$$
To meet expectations and effectiveness, the qualified sample size should reach 400 [48]. Ren [49] points out that for community or national research with a single theme, it is recommended to select 400 to 2500 samples.
(3) The postings that have been published for more than 45 days are extracted from $P_{year}$ as the network posting cases database, and the number of postings in the network posting cases database is denoted $n_{45}$, with $n_{45} \le 1200$.
Let $P^{CBR}$ represent the network posting cases database, $P^{CBR} = \{P_1^{CBR}, P_2^{CBR}, \dots, P_n^{CBR}\}$, $n = 1, 2, 3, \dots, n_{45}$, $n \in \mathbb{N}$, where $P_n^{CBR}$ represents the nth posting record.
Let $C$ represent the text corpus set of $P^{CBR}$, $C = \{C_1, C_2, \dots, C_n\}$, $n = 1, 2, 3, \dots, n_{45}$, $n \in \mathbb{N}$, satisfying $C \subseteq P^{CBR}$. The corresponding posting-title text corpus set is $TC$, satisfying $TC \subseteq C$, and the corresponding posting-body text corpus set is $BC$, satisfying $BC \subseteq C$; thus, $C$ is the combination of $TC$ and $BC$.
Let $NW$ represent the total number of entries after the word segmentation of $C$, $NW = \{NW_1, NW_2, \dots, NW_n\}$, $n = 1, 2, 3, \dots, n_{45}$, $n \in \mathbb{N}$, satisfying $NW \ge 0$, $NW \in \mathbb{N}$, where $NW_T$ is the corresponding number of entries after word segmentation of $TC$, and $NW_B$ is the corresponding number of entries after word segmentation of $BC$, then satisfying:
$$NW = NW_T + NW_B$$
Let $NS$ represent the total number of professional terms after word segmentation of $C$, $NS = \{NS_1, NS_2, \dots, NS_n\}$, $n = 1, 2, 3, \dots, n_{45}$, $n \in \mathbb{N}$, satisfying $0 \le NS \le NW$, $NS \in \mathbb{N}$, where $NS_T$ is the corresponding number of professional terms after word segmentation of $TC$, and $NS_B$ is the corresponding number of professional terms after word segmentation of $BC$, then satisfying:
$$NS = NS_T + NS_B$$
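The construction steps (1) to (3) can be sketched as follows. This is an illustrative outline, not the authors' program: the record layout, the fixed random seed, and the choice of 30 December 2016 as the reference date for the 45-day milestone are assumptions, and the data are synthetic.

```python
import random
from collections import defaultdict
from datetime import date
from typing import Dict, List, Tuple

Record = Tuple[str, date]  # (posting id, posting time PT); hypothetical layout


def stratified_sample(records: List[Record], per_month: int,
                      seed: int = 0) -> List[Record]:
    """Group records by posting month and sample each group independently."""
    rng = random.Random(seed)
    groups: Dict[int, List[Record]] = defaultdict(list)
    for rec in records:
        groups[rec[1].month].append(rec)
    sample: List[Record] = []
    for month in sorted(groups):
        group = groups[month]
        # Simple random sampling without replacement inside one stratum.
        sample.extend(rng.sample(group, min(per_month, len(group))))
    return sample


def case_database(sample: List[Record], reference: date,
                  milestone_days: int = 45) -> List[Record]:
    """Keep postings published more than `milestone_days` before `reference`."""
    return [rec for rec in sample if (reference - rec[1]).days > milestone_days]


# Synthetic population: 150 postings per month of 2016.
population = [
    ("p{}-{}".format(m, i), date(2016, m, 1 + i % 28))
    for m in range(1, 13) for i in range(150)
]
p_year = stratified_sample(population, per_month=100)
p_cbr = case_database(p_year, reference=date(2016, 12, 30))
print(len(p_year), len(p_cbr))
```

With 100 records per month the sample has 1200 records, and the milestone filter removes the most recent postings, so the size of the case database cannot exceed 1200.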

2.3.3. Attribute Features Extraction Based on Data Mining Technology

Communication in the online automobile community is mainly in written form rather than face-to-face sharing. Postings in the online community are free-form and highly personalized. The corpus text of postings is characterized by colloquial language, which often contains many unlisted words such as regional dialect, internet slang, or popular social terms. According to the professional background and brand characteristics of the Magotan-Automobile online community, the community postings contain a large number of car brand appellations or abbreviations and professional terms of the automobile industry. Therefore, there is a strong need to correctly identify professional terms and unlisted words. Moreover, some empirical researchers have extracted information features such as theme, emotion, length, and expressions from datasets of social media such as Twitter. The empirical results show that these features are of remarkable value for related research [21,22], and other work explores how to use computer algorithms to extract such information [50,51].
Text mining or “knowledge discovery” refers to the process of extracting useful, meaningful, and important information from unstructured text [52]. Text-mining technology originated in the computer science literature to derive insights from user-generated content. Moreover, advances in computer natural language technology and the maturity of text mining technology have enabled researchers to mine intrinsically interpretable information from the overwhelming user-generated written content on the web [52,53,54]. Even simple language processing or textual analysis programs (e.g., professional word statistics and word count) can conduct objective quantitative research on large corpora of written content, then mined information will be converted into a set of quantifiable attributes or similarity features between postings. After analyzing and mining a large amount of user-generated written information, it is possible to pull insights into users’ thoughts or shed light on a host of psychological processes and know about the users’ knowledge boundaries to a certain extent [55,56,57].
Therefore, this section uses data analysis and text mining technology to analyze the network datasets and provide more approaches and opportunities for studying network information. Based on Python 3.7, this paper develops a data mining program with data processing, text mining, and word statistics functions for the Chinese automobile professional community. Python is open-source software conceived by Guido van Rossum at the end of 1989, and its development is overseen by the Python Software Foundation (Wilmington, DE, USA). Python has a rich standard library and also supports calling various third-party libraries and resources (such as the “Chinese Stop Thesaurus”, the “Automotive Specialty Thesaurus”, etc.) [58].
(1) Preprocessing of the original corpus. Internet text contains a large number of items that carry little meaning and that search engines treat as stop words [59], such as punctuation marks (“//”, “()”, “$”), modal particles, and adversative connectives (“of”, “yet”, “hey”, “still”, “or”). Therefore, before starting the data analysis, the “Chinese Stop Thesaurus” constructed in this paper is used to filter out meaningless words, punctuation marks, and symbol patterns to obtain a plain text corpus dataset, which improves the accuracy and applicability of text word segmentation and semantic analysis.
(2) Text corpus word segmentation processing. In processing Chinese Web text information, such as information retrieval, information extraction, and the building of library and information keywords, word segmentation is required for text information [60,61]. The third-party word segmentation library jieba is called to segment the text. The basic dictionary of jieba has rich entries. Dynamic programming is adopted to find the maximum probability path, which gives the most probable segmentation combination based on word frequency. For unlisted words, it uses an HMM based on the word-forming ability of Chinese characters together with the Viterbi algorithm [62]. Jieba also supports custom dictionaries. We customize the “Magotan Automobile-Profession Thesaurus”, which includes terms related to the automotive industry and disciplines, such as gearbox, wheel hub, airbag, decorative light, and reversing, as well as the brand models, car configurations, and nicknames of the Magotan series that often appear in the context of the Magotan Automobile community, such as Magotan, Magotan B8, b7, B8330 noble, Caesar Gold, Dynaudio. Moreover, we construct the “Auto-community User Thesaurus”, which adds hot words and buzzwords from the online community to the “Magotan Automobile-Profession Thesaurus”, such as UAE Grand Prix, one-day tour, and travel strategy. After loading the customized “Auto-community User Thesaurus” and “Magotan Automobile-Profession Thesaurus”, the program can not only effectively deal with ambiguous words and unlisted words, but also greatly improve the accuracy of word segmentation and recognition efficiency.
(3) Statistics. The program automatically counts the length of the original text corpus, the length after data analysis, and the total number of words in the text corpus after word segmentation. It then calls the “Magotan Automobile-Profession Thesaurus” to identify the automotive and industry-related professional words in the text and counts them. See Table 1, in which the Postings—Title Corpus retains punctuation and spaces and text length is measured in bytes.
(4) Frequency analysis. In this paper, the frequency of entries in the posting text corpus (i.e., Postings—Title Corpus and Postings-Body Corpus) is counted. Please refer to Appendix A for the word frequency analysis results. In Appendix A, the frequency proportion (%) is the share of an entry’s occurrences among the occurrences of all entries; the higher the proportion, the more often the entry appears in the entire corpus. If the entry belongs to the “Magotan Automobile-Profession Thesaurus”, then Y/N = 1; otherwise, Y/N = 0. The top 30 high-frequency words are ranked in descending order, and the word frequency of the last-ranked feature words converges to 1. ① Among the top 30 high-frequency words, “Magotan” alone accounts for 34% of the total number of words in the title entries. The top 30 words account for 76.6% of all words in the title entries and 58.1% of all words in the body entries, that is, more than 50% in both corpora. ② The first-ranked and second-ranked entries are the same in the title and body entries: “Magotan” and “pick up the car”. “Magotan” ranks first, accounting for 34% of the title entries and 11.9% of the body entries. The second-ranked entry, “pick up the car”, is also closely related to “Magotan”, which indicates that topics related to “Magotan” are highly recognizable in the community. Users’ perception of the community theme is consistent with the image the community sets for itself, which also indicates that the information in the community stays closely related to the community topic.
To sum up, the data analysis and text mining process is shown in Figure 2. After traversing each posting, the program filters and segments the text corpus, automatically calculates the text length, counts the number of entries, and extracts the customized professional entries. The postings after program processing are shown in Table 1.
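The four processing steps can be illustrated with a minimal Python sketch. It is not the authors' program: the stop-word set and the professional-term set are tiny hypothetical stand-ins for the paper's thesauri, the function name `analyze_posting` is invented for illustration, and the tokens are supplied already segmented (in the paper this is done with jieba) so that the sketch needs only the standard library. Measuring the "length after data analysis" as the UTF-8 byte length of the retained tokens is likewise an assumption.

```python
from collections import Counter

# Hypothetical miniature stand-ins for the "Chinese Stop Thesaurus" and the
# "Magotan Automobile-Profession Thesaurus"; the real lists are much larger.
STOP_WORDS = {"of", "yet", "hey", "still", "or", "//", "(", ")", "$", ",", "!"}
PROFESSION_TERMS = {"Magotan", "B8", "gearbox", "airbag"}


def analyze_posting(raw_text, tokens):
    """Return (statistics, term frequencies) for one title or body.

    raw_text: the original string, with punctuation and spaces.
    tokens:   the segmented text, e.g. the output of a word segmenter.
    """
    # Step (1): drop whitespace tokens, stop words, punctuation, and symbols.
    kept = [t for t in tokens if t.strip() and t not in STOP_WORDS]
    # Step (3): lengths in bytes (UTF-8 assumed) and word counts.
    stats = {
        "length_original": len(raw_text.encode("utf-8")),
        "length_content": len("".join(kept).encode("utf-8")),
        "n_words": len(kept),
        "n_profession_words": sum(t in PROFESSION_TERMS for t in kept),
    }
    # Step (4): term frequencies over the cleaned tokens.
    return stats, Counter(kept)


raw_title = "Magotan B8, pick up the car!"
title_tokens = ["Magotan", " ", "B8", ",", " ", "pick up the car", "!"]
stats, frequencies = analyze_posting(raw_title, title_tokens)
print(stats)
print(frequencies.most_common(2))
```

Keeping the thesauri as plain sets makes membership tests cheap, which matters when every token of every posting is checked against them.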

2.4. Attribute Features Analysis of the Posting Popularity

As mentioned in the introduction, scholars have carried out extensive theoretical exploration and empirical research on social network information. Combining the existing research results with the data analysis and text mining work in Section 3.2.2, we summarize the information that is valuable for explaining posting popularity in online communities and extract interpretable attribute features from it.
(1) Member since (MS). The registered member duration MS (unit: day) is the time difference between the user registration time (MT) and the sample collection time on 30 December 2016. The earlier a user becomes a member, the more senior the user is, which indicates that the user has been involved in activities of the community for a longer time and gets more chances to join in cognitive processes that can transform into beneficial information. Albert and Thomas [63] point out that people engage in brand communities to connect with like-minded others. Furthermore, previous studies [64,65,66] have indicated that social media affordances significantly influence user behavior and usage, and those with an old account are more likely to rebroadcast information than users with a recent account [15]. Agichtein et al. [50] utilized the length of user registration time as one of the characteristic indicators when establishing the model. Wasko and Faraj [67] believed that the length of time an individual becomes a member of a professional association represents their professional experience in the industry, and the level of personal expertise could be assessed (from novice = 1 to expert = 5) by their expertise score in the area.
To sum up, we evaluate the user’s community experience level according to the length of registered membership (novice members = 0–6 months, junior members = 6–18 months, intermediate members = 18–36 months, senior members = 36–60 months, diamond members = more than 60 months).
(2) Recommended Postings_number (RP). According to the rules of the Magotan automobile online community, a user’s posting that the community judges to meet the standards of essence postings (original content, smooth sentences, and clear logic) is honored as “essence”. High-quality (beneficial) content may be easier to discuss and attract more users [66]. From the perspective of motivation, Lee and Suzuki [6] confirmed that reputation is one of the motivations for information sharing, and valuable contributions can in turn improve users’ reputations in the industry [67]. Henry et al. [15] observe, for instance, that on Twitter, celebrities receive more messages posted by others. Many other relevant studies have adopted similar honors or diamond ratings as research indicators [51,68,69].
The community’s certification standards for “Recommended Postings” require positive performance from members in the community and also set requirements for the content quality and writing of postings. In fact, postings that are recommended as “Recommended Postings” account for a very small percentage of all postings: about 2.59% in the sample set of this research. Other studies report a similar situation. For example, Agarwal et al. [70] counted the influential articles on blog sites in empirical research, and such articles accounted for only 4.1% of all articles. From the perspective of user groups, about 87.8% of users have no “Recommended Postings” at all, and the users who do have them are highly concentrated. Therefore, using the raw number of “Recommended Postings” directly as an attribute feature would have little interpretive value. Following the above method of evaluating users’ community experience, we evaluate the community honor level of users according to the number of “Recommended Postings” they have obtained (ordinary members = 0, Junior Elite = 1–2, Intermediate Elite = 3–5, Senior Elite = 6–19, Diamond Member = 20 or more).
(3) Posting_number (PN). Posting means actively sharing information or communicating, which indicates that a user is strongly willing to participate in community communication. Jonah [14] pointed out that such behaviors promote the acquisition of information and that talking and sharing with others serves a bonding function [71]. Moreover, users who pay more attention to their personal social influence (such as reputation) are more inclined to publish their opinions actively, since posting opinions is an effective way to gain attention, expand influence, and improve status. Lee and Suzuki [6] showed that when the amount of information a user shared a month earlier increases by one unit, the probabilities of information sharing and inquiry increase by 5% and 1%, respectively. Yang et al. [72] and Jonah [14] pointed out that sharing (posting) is one of the factors for measuring opinion leaders, and the source of information is closely related to popularity [5].
(4) Reply-to_number (RT). Replying to others’ postings is a user’s interaction with information posted by others. More replies to others’ postings mean that the user devotes more attention and effort to others. It is a reciprocal behavior that is important in promoting information exchange [6]. Furthermore, users’ active participation in community interaction increases their social presence in the community and heightens their self-enhancement concerns. When users publish postings, they may come to feel that they are writing not just for themselves but online for everyone to see [14]. Replying behavior also reflects the user’s interaction attribute features. Li et al. [4] regarded users’ interactive attribute features or interaction relationships as a reference for discovering network opinion leaders. Other researchers also take the interactive attribute feature as one of the characteristics for building models [5,50,51,72].
(5) Activity_Frequency (AF). As a user’s activity in an online community increases, the user’s connections with others become closer, and the user thus gains a certain influence over others. After summarizing the number of users’ monthly posts, Li et al. [64] point out that the frequency of users’ posting is correlated with their influence on social media. Observations of Twitter show that some user features, such as activity degree, affect message diffusion in terms of volume and speed: messages posted by highly active users spread more quickly than those of other users [15,46]. Ozer et al. [19] discovered patterns of online popularity in a Twitter dataset, in which steady temporal behavior is associated with popularity (the group of persons who show steady periodic activity forms the popular cluster). Adrien et al. [73] considered social interaction frequency to detect the popular topics that members produce. Other researchers use interaction frequency (or activity degree) and user interaction attributes as characteristics of information dissemination models and of algorithms for mining opinion leaders [74].
Based on this research background and the above work, “Activity_Frequency” is understood as the frequency of a user’s interaction activity in the community over the period of registered membership. Activity_Frequency is defined as follows:
AF = (PN + RT)/MS
(6) Length_Corpus (LC). “Rich” information can often get more attention in network information competition. The richness of information is one of the indicators to evaluate the influence of online media [72]. Hu et al. [5] point out that the length of a blog is positively related to its persuasiveness and its popularity. Hnings et al. [21] regarded corpus length as one of the characteristics that significantly influenced Tweet diffusion. The length of a tweet is positively correlated with the information diffusion value. Moreover, from the perspective of perception, when a posting contains a large amount of information (i.e., long length), it can effectively improve contextual certainty, which helps users extract the main point of information and disseminate information. Jeon et al. [68] proved that text length and vocabulary size are positively correlated with the information load and susceptibility, and Korfiatis et al. [75] found that online review length was not only positively correlated but also significantly affected the perceived usefulness of a review. Based on the previous data mining work, the length of the original corpus of the post title is successfully obtained, and the feature is named Length_Title (LT) (unit: Bytes). The length of the original corpus of the post-body is named Length_body (LB) (unit: Bytes).
(7) Richness Performance_Title (RPT). Stefan et al. [76] understand content richness as how much information is available. Content richness is regarded as one standard of information services related to media-for-monitoring or media-for-searching. In the propagation model, high-quality content correlates highly with external links [68]. In practice, useful content has social exchange value. Empirical evidence also suggests that useful information is more likely to be passed on, because sharing it makes the sharer seem smart and helpful [14,56]. Moreover, written content generated by users is a primary resource for analyzing users’ writing styles and abilities. Empirical research on online product reviews shows that whether a review is received by interested buyers is associated with its readability [75]. Agichtein et al. [69] analyzed the text of large community Q&A portals, where web text with too many punctuation marks or irregular spacing is marked as low-quality text, just like common low-quality writing. Wang et al. [77] studied Chinese online written text and found that the word ratio can reflect whether the writing contains spelling errors and grammatical problems, and empirical evidence shows that the level of written expression affects the credibility and authority of opinion leaders.
In the interface layout of the auto-community, postings are arranged and displayed by their titles. Therefore, in addition to carrying the information of the posting, the title also guides users through the community information (similar to a hashtag in a hot search), and the quality of the title content plays an even more critical role in the gains of the posting. Based on the word ratio of Chinese online written text [77] and the understanding of content richness [76], the Richness Performance_Title (RPT) is understood as the proportion of useful information in the original title corpus. The length of the title’s useful information is named length_TitleContent (LTC) (unit: Bytes), and the length of the original title corpus is LT (unit: Bytes) (see Table 1). Thus, “Richness Performance_Title (RPT)” is defined as follows:
RPT = LTC/LT
(8) Posting Topic_Centrality (PTC). Generally, users’ actual knowledge often does not match their expected level. When faced with complex or professional products, people usually tend to accept suggestions from experts. Opinion leaders in network media not only have professional knowledge, but the content they publish is always adapted to their online environment [64,72]. Analyzing the term frequency from the content of messages produced by online members is one of the ways to detect popular topics. The hashtag in generated content is one of the key features affecting information dissemination and building prediction models [73,74]. Furthermore, in automobile-themed communities, auto-related content has a significant influence on the dissemination and popularity of information. On the individual level, one reason people engage in brand communities is to connect with like-minded others. The prevalence of different topics varied with the surroundings, e.g., food was always discussed in restaurants [14,15,22]. After studying the information search [21], it was found that in a given situation, corresponding issues and particular keywords are more resonant and attractive in society as a whole, and the more particular words in media coverage, the more users will search for corresponding content (similar to push-cueing).
According to the data-mining work in Section 3.2.2, 80.0% of the top 30 high-frequency words in the title entries of the postings were identified as professional entries of the Magotan Auto community, and the corresponding proportion for the body entries is 70.0%. Berger and Milkman [56] quantified emotionality as the percentage of words in an article classified as either positive or negative. Based on this research background and the above results, the more entries a posting contains that relate to the Magotan Auto community and the automobile industry, the more relevant information it carries and the more it resonates with the given posting environment. At the level of a single posting, the higher the proportion of entries related to the given background, the closer the posting is to the popular topics in the community, which means that the posting has a better chance of attracting attention and becoming popular in the community. Thus, the Topic_Centrality of the posting title, named “Title Topic_Centrality” (TTC), and the Topic_Centrality of the posting body, named “Body Topic_Centrality” (BTC), are defined as follows:
TTC = NST/NWT
BTC = NSB/NWB
To sum up, through the above theoretical analysis, we select the attribute features that can explain the popularity of online community postings. Some of the attribute features can be obtained directly and are referred to as simple attribute features. The others require quantitative calculation combined with data-mining technology and are referred to as covered attribute features. All attribute features are shown in Table 2 below. The linguistic terms of the fuzzy linguistic variables and their corresponding triangular fuzzy numbers are shown in Table 3 below.
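As a rough illustration of how the level-type and ratio-type attribute features could be computed, consider the following Python sketch. Several points are assumptions rather than statements from the paper: a value that falls exactly on a level boundary is assigned to the higher level, the lower bound of the top honor level (20) is inferred from the 6–19 range of the level below it, the function names are invented, and NST/NWT (and NSB/NWB) are read as the number of professional terms divided by the total number of words in the title (body).

```python
from bisect import bisect_right

# Lower bounds of levels 1..4; level 0 is everything below the first bound.
MS_BOUNDS_MONTHS = [6, 18, 36, 60]  # novice, junior, intermediate, senior, diamond
RP_BOUNDS = [1, 3, 6, 20]           # ordinary, junior, intermediate, senior, top level


def level(bounds, value):
    """Return the 0-based level of value; boundary values go to the higher level."""
    return bisect_right(bounds, value)


def activity_frequency(pn, rt, ms_days):
    """AF = (PN + RT) / MS, with MS in days."""
    return (pn + rt) / ms_days


def ratio(part, whole):
    """Shared form of RPT = LTC/LT, TTC = NST/NWT, and BTC = NSB/NWB."""
    return part / whole if whole else 0.0


ms_level = level(MS_BOUNDS_MONTHS, 24)                   # 24 months of membership
rp_level = level(RP_BOUNDS, 7)                           # 7 recommended postings
af = activity_frequency(pn=30, rt=90, ms_days=600)
rpt = ratio(24, 28)                                      # LTC = 24 bytes, LT = 28 bytes
ttc = ratio(2, 3)                                        # 2 professional terms in 3 words
print(ms_level, rp_level, af, round(rpt, 3), round(ttc, 3))
```

Returning 0.0 for an empty title or body is a design choice of this sketch; the paper does not say how such postings are handled.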

3. Construction of Network Posting Cases Characteristic System

3.1. Case Presentation of Network Posting Cases and the Target Posting

According to Table 2, a set of intrinsically interpretable attributes A1–A10 has been successfully extracted from the network data package. In the CBR-based approach, the network posting cases are presented as “Case = {Network posting cases-description, Popularity of network case postings}”.
Case: According to the above section, let $P^{CBR} = \{P_1^{CBR}, P_2^{CBR}, \ldots, P_n^{CBR}\}$, $n = n_{45}$, and $P^0$ represent the set of network posting cases and the target posting, respectively, where $P_n^{CBR}$ is the $n$th posting case, $n \in N = \{1, 2, \ldots, n_{45}\}$. The popularity of the target posting $P^0$ is unknown and is to be solved by the proposed method.
It should be noted that when the network case posting database was built in Section 2.3.2, a Posting_Time (PT) of more than 45 days was taken as the reference milestone (milestone ≥ 45 days). Therefore, the popularity of the target case is described as how much attention the target posting will have gained after it has been published for more than 45 days.
Network posting cases-description: Let $A = \{A_1, A_2, \ldots, A_m\}$, $a_n = \{a_{n1}, a_{n2}, \ldots, a_{nm}\}$, and $a^0 = \{a_1^0, a_2^0, \ldots, a_m^0\}$ be the collections of attribute features of the network case postings, of the $n$th community case posting, and of the target posting, respectively, where $A_m$, $a_{nm}$, and $a_m^0$ represent the $m$th attribute of the network posting cases, of the $n$th community case posting, and of the target posting, $m \in M = \{1, 2, \ldots, 10\}$. Let $w = \{w_1, w_2, \ldots, w_m\}$ be the weight vector of the attribute features of the network case postings, where $w_m$ is the weight of the $m$th attribute feature.
Moreover, the attribute values $a_m^0$ of the target posting and $a_{nm}$ of the community posting cases can be described with fuzzy linguistic variables or crisp numbers. To distinguish the two data types within the attribute set of the network case postings, let the fuzzy linguistic variable attribute set be $A_l = \{A_1, A_2, \ldots, A_{m_1}\}$ and the crisp number attribute set be $A_d = \{A_{m_1+1}, A_{m_1+2}, \ldots, A_m\}$. The corresponding subscript sets are $M_l = \{1, 2, \ldots, m_1\}$ and $M_d = \{m_1+1, m_1+2, \ldots, m\}$, satisfying $M = M_d \cup M_l$.
The popularity of network case postings: According to Formula (1), the popularity of a posting in the community is defined as Postings_attention (PA), which is a crisp number (milestone ≥ 45 days). Let $y = \{y_1, y_2, \ldots, y_n\}$ be the PA values of the network case postings, where $y_n$ is the PA of the $n$th network case posting, and let $y^0$ be the PA of the target posting, which is to be forecast in this research.
Thus, the representation of the network posting cases P n C B R and the target posting P 0 is shown in Table 4.
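One way to hold such a case in code is sketched below. This is only an illustration: the class name `PostingCase`, its field names, and the sample values are hypothetical, and which of the ten attributes are fuzzy or crisp is determined by Table 2 rather than by this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PostingCase:
    """Case = {description, popularity} for one posting."""
    fuzzy: Tuple[int, ...]            # linguistic-term levels of the fuzzy attributes
    crisp: Tuple[float, ...]          # normalized values of the crisp attributes
    attention: Optional[int] = None   # PA; unknown (None) for the target posting


# Hypothetical values, for illustration only.
historical_case = PostingCase(fuzzy=(2, 1), crisp=(0.40, 0.10, 0.80), attention=1530)
target_posting = PostingCase(fuzzy=(3, 0), crisp=(0.50, 0.20, 0.70))
print(historical_case)
print(target_posting.attention is None)
```

Keeping the two data types in separate fields mirrors the split between the fuzzy and crisp attribute subsets, so each can later be passed to its own difference measure.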

3.2. Extraction of Similar Network Posting Cases

3.2.1. Hybrid Similarity Measure between the Network Posting Cases and the Target Posting

According to Table 2, the values of the attributes (A1–A10) are composed of crisp numbers and fuzzy linguistic variables. Following the hybrid similarity calculation method for attribute feature values introduced by Zhang et al. [37], the hybrid similarity measures between a network case posting and the target posting in this paper are as follows:
(1) Fuzzy linguistic variables
According to the previous definition and Table 3 and Table 4, when the attribute feature values are fuzzy linguistic variables, $A_m \in A_l$. Suppose that $a_{nm_l}$ and $a^0_{m_l}$ are the attribute feature values of the network posting case $P_n^{CBR}$ and the target posting $P^0$, respectively, with corresponding triangular fuzzy numbers $\tilde{a}_{nm_l} = (a^{h}_{nm_l}, a^{u}_{nm_l}, a^{v}_{nm_l})$ and $\tilde{a}^{0}_{m_l} = (a^{0h}_{m_l}, a^{0u}_{m_l}, a^{0v}_{m_l})$. Referring to Formulas (6) and (7) of Zhang et al. [37], the difference degree $\delta(a_{nm_l}, a^0_{m_l})$ between $P_n^{CBR}$ and the target posting $P^0$ is as follows:
$$\delta(a_{nm_l}, a^0_{m_l}) = \frac{1}{\tilde{\Delta}^{\max}_{m_l}}\left(\tilde{a}_{nm_l} - \tilde{a}^{0}_{m_l}\right) = \frac{1}{\tilde{\Delta}^{\max}_{m_l}}\left(\left|a^{h}_{nm_l} - a^{0h}_{m_l}\right|^2 + \left|a^{u}_{nm_l} - a^{0u}_{m_l}\right|^2 + \left|a^{v}_{nm_l} - a^{0v}_{m_l}\right|^2\right) \tag{10}$$
where $\tilde{\Delta}^{\max}_{m_l} = \max\left\{\left|a^{h}_{nm_l} - a^{0h}_{m_l}\right|^2 + \left|a^{u}_{nm_l} - a^{0u}_{m_l}\right|^2 + \left|a^{v}_{nm_l} - a^{0v}_{m_l}\right|^2 \,\middle|\, n \in N\right\}$ and $\delta(a_{nm_l}, a^0_{m_l}) \in [0, 1]$. Under attribute $A_l$, the similarity measure $sim_{m_l}(P^0, P_n^{CBR})$ is:
$$sim_{m_l}(P^0, P_n^{CBR}) = \exp\left[-\delta(a_{nm_l}, a^0_{m_l})\right], \quad n \in N,\ m \in M_l \tag{11}$$
(2) Crisp numbers
According to the previous definition and Table 3 and Table 4, when the attribute feature values are crisp numbers, $A_m \in A_d$. Suppose that $a_{nm_d}$ and $a^0_{m_d}$ are the attribute feature values of the network posting case $P_n^{CBR}$ and the target posting $P^0$, respectively, represented by crisp numbers. Referring to Formulas (2) and (3) of Zhang et al. [37], the difference degree $\delta(a_{nm_d}, a^0_{m_d})$ between $P_n^{CBR}$ and the target posting $P^0$ is as follows:
$$\delta(a_{nm_d}, a^0_{m_d}) = \frac{1}{\Delta^{\max}_{m_d}}\left(a_{nm_d} - a^0_{m_d}\right)^2 \tag{12}$$
where $\Delta^{\max}_{m_d} = \max\left\{\left(a_{nm_d} - a^0_{m_d}\right)^2 \,\middle|\, n \in N\right\}$ and $\delta(a_{nm_d}, a^0_{m_d}) \in [0, 1]$. Under attribute $A_d$, the similarity measure $sim_{m_d}(P^0, P_n^{CBR})$ is:
$$sim_{m_d}(P^0, P_n^{CBR}) = \exp\left[-\delta(a_{nm_d}, a^0_{m_d})\right], \quad n \in N,\ m \in M_d \tag{13}$$
(3) The hybrid similarity measure between the network posting cases and the target posting
According to Equations (10)–(13), the similarity measure $sim_m(P^0, P_n^{CBR})$ of each attribute $A_m$ between the network posting case $P_n^{CBR}$ and the target posting $P^0$ can be obtained for all $m \in M$, with $M = M_d \cup M_l$. Based on the weighted KNN strategy [25], the overall similarity measure between two cases is defined as the weighted sum of the similarity measures of the attribute features.
Suppose that $sim(P^0, P_n^{CBR})$ is the hybrid similarity measure between the network posting case $P_n^{CBR}$ and the target posting $P^0$. The hybrid similarity measure is then calculated as follows:
$$Sim(P^0, P_n^{CBR}) = \sum_{m \in M} sim_m(P^0, P_n^{CBR})\, w_m \tag{14}$$
where $0 \le w_m \le 1$ and $\sum_{m \in M} w_m = 1$. Obviously, $sim(P^0, P_n^{CBR}) \in [0, 1]$, and the larger the value of $sim(P^0, P_n^{CBR})$, the higher the overall similarity between $P_n^{CBR}$ and the target posting $P^0$.
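The hybrid similarity computation described in parts (1)–(3) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' code: the triangular fuzzy numbers in `FUZZY_TERMS` are hypothetical placeholders for the values of Table 3, the difference degree is taken as the sum of squared component differences divided by its maximum over all cases (one reading of the formulas; Zhang et al. [37] should be consulted for the exact form), and returning a similarity of 1 when every case coincides with the target is an added convention to avoid division by zero.

```python
import math

# Hypothetical triangular fuzzy numbers (h, u, v) for five linguistic levels.
FUZZY_TERMS = {
    0: (0.00, 0.00, 0.25),
    1: (0.00, 0.25, 0.50),
    2: (0.25, 0.50, 0.75),
    3: (0.50, 0.75, 1.00),
    4: (0.75, 1.00, 1.00),
}


def fuzzy_gap(a, b):
    """Sum of squared differences between two triangular fuzzy numbers."""
    return sum(abs(x - y) ** 2 for x, y in zip(a, b))


def crisp_gap(a, b):
    """Squared difference between two crisp numbers."""
    return (a - b) ** 2


def attribute_similarities(case_values, target_value, gap):
    """exp(-delta) for every case, where delta = gap / (largest gap over all cases)."""
    gaps = [gap(value, target_value) for value in case_values]
    largest = max(gaps)
    if largest == 0:  # assumption: all cases equal the target on this attribute
        return [1.0] * len(gaps)
    return [math.exp(-g / largest) for g in gaps]


def hybrid_similarity(per_attribute, weights):
    """Weighted sum over attributes; per_attribute[m][n] is attribute m, case n."""
    n_cases = len(per_attribute[0])
    return [sum(w * sims[n] for w, sims in zip(weights, per_attribute))
            for n in range(n_cases)]


# Three historical cases, one fuzzy attribute and one crisp attribute.
fuzzy_sims = attribute_similarities(
    [FUZZY_TERMS[0], FUZZY_TERMS[2], FUZZY_TERMS[4]], FUZZY_TERMS[2], fuzzy_gap)
crisp_sims = attribute_similarities([0.25, 0.5, 1.0], 0.5, crisp_gap)
hybrid = hybrid_similarity([fuzzy_sims, crisp_sims], weights=[0.5, 0.5])
print([round(s, 3) for s in hybrid])
```

Because each difference degree lies in [0, 1], every per-attribute similarity in this sketch lies between exp(-1) and 1, and so does the weighted sum when the weights add up to 1.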

3.2.2. Extracting Rules of Similar Network Posting Cases

(1) Setting similarity threshold between the network posting cases and the target one
According to the hybrid similarity measure between cases, similar community postings can be extracted. For a given network posting case, the larger its hybrid similarity measure with the target posting, the higher its reference value for the target. For the target posting, the lower the similarity value that is accepted, the more historical cases are drawn.
Therefore, it is necessary to set a reasonable hybrid similarity threshold in order to obtain a group of appropriate similar case postings from the large-scale database efficiently. Zhang et al. [37] introduced a similarity threshold based on the simple minority principle, which holds that the few cases with the highest similarity (those in roughly the top third of the similarity range) have a strong reference value. Let $\tau$ be the similarity threshold. Referring to Formula (9) of Zhang et al. [37], $\tau$ is calculated as follows:
$$\tau = Sim^{(+)} - \frac{Sim^{(+)} - Sim^{(-)}}{3} \tag{15}$$
where $Sim^{(+)} = \max\{Sim(P^0, P_n^{CBR}) \mid n \in N\}$ and $Sim^{(-)} = \min\{Sim(P^0, P_n^{CBR}) \mid n \in N\}$.
(2) Extraction of Similar Network Posting Cases
When $sim(P^0, P_n^{CBR}) \ge \tau$, the community posting is highly similar to the target posting and therefore has a high reference value, so it should be extracted as a case posting. On the basis of this principle and Formula (15), all network posting cases $P_n^{CBR}$ whose similarity is greater than or equal to the threshold $\tau$ are extracted, and the set $P^{\tau}$ of the extracted cases is constructed as follows:
$$P^{\tau} = \{P_j^{CBR} \mid j \in N^{\tau}\} \tag{16}$$
where $N^{\tau} = \{n \mid sim(P^0, P_n^{CBR}) \ge \tau,\ n \in N\}$ is the subscript set of similar network posting cases with a worthwhile reference value. Then $P^{\tau} \subseteq P^{CBR}$ and $N^{\tau} \le n$ (where $N^{\tau}$ also denotes the number of extracted cases), and $P_i^{\tau}$ represents the $i$th case of $P^{\tau}$, $i \in N^{\tau}$.
(3) Trimming suggestions of extreme network posting cases
Empirical research on social networks shows that about 4.1% of articles on blog sites obtain the vast majority of attention on the website [70]; information dissemination in social networks is unequal, and there is a Matthew effect in the process of dissemination [16]. Similarly, when analyzing the community sample data package, we find that about 2.59% of the postings in the auto community are recognized by the community and have also gained a lot of attention. This situation may be related to the institutional design of the auto community. Postings recognized by the community are often placed in the most prominent position on the community page or are strongly “recommended” to users by the community, so they have more chances to gain attention.
According to the actual application scenarios, relevant researchers have adopted experts’ opinions as the basis for judging the similarity threshold [25,78]. Combining the research background of the auto community with the sample data, we propose trimming suggestions for extreme online cases. Suppose the number of extreme cases recommended for trimming at one end is $N_e$ (one end: either the cases with extremely high attention or the cases with an extreme lack of attention due to network effects), which is defined as follows:
$$N_e = [\mu N^{\tau}] \tag{17}$$
where $N_e \in \mathbb{N}$, the bracket takes the integer part, that is, $\mathrm{int}(\mu N^{\tau})$, and $\mu \in [0, 1]$ is an empirical parameter.
(4) Formation of referring similar posting cases dataset
In accordance with Table 2 and Table 4, the attention corresponding to $P^{\tau}$ is $y^{\tau}$; thus, the attention corresponding to $P_i^{\tau}$ is $y_i^{\tau}$. Sort the set $P^{\tau}$ again in descending order of $y_i^{\tau}$ to obtain a new set. Let the new set of attention values be $Y^{\max} = \{y^{\tau}_{i_n} \mid 1 \le n \le N^{\tau}\}$, $n \in N$, where $y^{\tau}_{i_n}$ is the $n$th item of $Y^{\max}$, abbreviated as $y(n)$, satisfying $y(n) > y(n+1)$, and the posting corresponding to $y(n)$ is $P(n)$. Let the set $P^{\tau}$ be sorted in the order of the items $P(n)$, so that the $n$th item precedes the $(n+1)$th item. In this way, a new set $P^{rank} = \{P^{\tau}_{i_n} \mid 1 \le n \le N^{\tau}\}$ is constructed, with $P_n^{rank}$ denoting its $n$th item, and the trimming number $N_e$ is given by Formula (17). Suppose the set of similar posting cases used for reference is $P^{\tau}_{refer}$, which is constructed as follows:
$$P^{\tau}_{refer} = \{P_n^{rank} \mid n \in N_{refer}\} \tag{18}$$
where $N_{refer} = \{n \mid N_e + 1 \le n \le N^{\tau} - N_e,\ n \in N\}$ is the subscript set of $P^{\tau}$, sorted by the corresponding values of $Y^{\tau}$, after the extreme postings at both ends have been trimmed.
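The extraction rules in parts (1)–(4) can be summarized in a short Python sketch. The function names, the toy cases, and the value μ = 0.25 are all hypothetical; the paper treats μ as an empirical parameter and this section does not fix its value. Cases are represented here as (identifier, attention) pairs for brevity.

```python
def similarity_threshold(sims):
    """tau = Sim(+) - (Sim(+) - Sim(-)) / 3, the top third of the similarity range."""
    top, bottom = max(sims), min(sims)
    return top - (top - bottom) / 3


def extract_similar(cases, sims, tau):
    """Keep the cases whose hybrid similarity is at least tau."""
    return [case for case, s in zip(cases, sims) if s >= tau]


def trim_extremes(similar_cases, mu):
    """Rank by attention (descending) and drop int(mu * N_tau) cases at each end."""
    ranked = sorted(similar_cases, key=lambda case: case[1], reverse=True)
    n_e = int(mu * len(ranked))
    return ranked[n_e:len(ranked) - n_e]


# Hypothetical (identifier, attention) pairs with their hybrid similarities.
cases = [("p1", 120), ("p2", 5000), ("p3", 300), ("p4", 80), ("p5", 50), ("p6", 40)]
sims = [0.99, 0.95, 0.90, 0.88, 0.70, 0.60]

tau = similarity_threshold(sims)
similar = extract_similar(cases, sims, tau)
reference = trim_extremes(similar, mu=0.25)
print(round(tau, 3), [c[0] for c in similar], [c[0] for c in reference])
```

Note that trimming both ends leaves an empty reference set once 2·N_e reaches the number of extracted cases, so in practice μ has to stay well below 0.5.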

3.3. Generation of Forecast Results for Popularity

Although the information in the network community changes in real time, the reference milestone (≥45 days) has been adopted so that the chosen community postings can be considered to be in a relatively stable online state (i.e., the numbers of clicks and replies of the posting are relatively stable). Isolated network postings that continue to receive additional gains due to special situations are trimmed as extreme cases. Feng et al. [78] studied project cost estimation based on the CBR method and estimated the cost of the experimental case by taking the average cost of the three most similar historical cases as the fuzzy reasoning prediction value. Referring to this method and Equations (16)–(18), suppose the attention corresponding to $P_n^{rank}$ is $Y_n^{rank}$. The attention prediction of the target posting is then calculated as follows:
$$y^0 = \left[\frac{\sum_{n \in N_{refer}} Y_n^{rank}}{N^{\tau} - 2N_e}\right] \tag{19}$$
where $y^0 \in \mathbb{N}$ and the bracket denotes rounding to the nearest integer, $\mathrm{round}(\cdot)$.
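In code, the forecast reduces to the rounded mean attention of the reference cases that remain after trimming, since their number equals the size of the extracted set minus two times the one-end trimming number. The sketch below assumes the reference attention values have already been selected; the function name `forecast_attention` and the sample values are hypothetical.

```python
def forecast_attention(reference_attention):
    """Rounded mean attention of the trimmed reference cases."""
    if not reference_attention:
        raise ValueError("no reference cases left after trimming")
    # Note: Python's round() sends exact halves to the nearest even integer.
    return round(sum(reference_attention) / len(reference_attention))


predicted = forecast_attention([300, 120, 95])
print(predicted)
```

Raising an error for an empty reference set is a choice made in this sketch; the paper does not discuss that situation.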
To sum up, the analysis process of popularity prediction based on the CBR approach is as follows:
Step 1: Referring to Table 2, Table 3 and Table 4, present the network posting cases $P_n^{CBR}$ and the target posting $P^0$ as cases whose attribute values are fuzzy linguistic variables and crisp numbers.
Step 2: According to Table 3 and Formulas (10)–(13), calculate the similarity measure $sim_m(P^0, P_n^{CBR})$ of each attribute $A_m$ between the network posting cases and the target posting.
Step 3: According to Formula (14), calculate the hybrid similarity measure $sim(P^0, P_n^{CBR})$ between the network posting cases and the target posting.
Step 4: According to Formulas (15) and (16), set the similarity threshold $\tau$ between the network posting cases and the target one, determine the extraction rules for similar network case postings, and form the similar case posting database $P^{\tau}$.
Step 5: According to Formulas (17) and (18), propose trimming suggestions for extreme online postings: clarify the number $N_e$ of extreme online postings trimmed at one end, determine the trimming rules for extreme postings, and form the set of similar posting cases $P^{\tau}_{refer}$, which has more reference significance for the social network environment, from the similar case posting database $P^{\tau}$ of the previous step.
Step 6: In accordance with Formula (19), take the mean value of the attention obtained by the postings in $P^{\tau}_{refer}$ as the expected forecast of the popularity of $P^0$ (milestone ≥ 45 days).

4. Case Study

Step (1): According to Section 2.3.2, a sample database P of the network auto-community is constructed, and the number of sample records of database P is 1200. Based on the reference milestone (≥45 days), the postings published for more than 45 days were extracted from database P to form a new database, which is named the network posting cases PCBR. The number of samples of PCBR is recorded as n45, in this example n45 = 487. The descriptive statistics of database P and the network posting cases PCBR are shown in Table 5 below.
In Table 5, the maximum PA value in database P is 545,372, the minimum is 49, and the average is 3803.68, which shows that extreme cases exist in the network situation. Moreover, the minimum PA values of database P and of the network posting cases PCBR are 49 and 68, respectively. These values are very close, indicating that 45 days is a suitable reference milestone for a posting to reach a stable network state.
Based on the work of Section 2.3, a set of intrinsically interpretable attributes A1–A10 has been successfully extracted. Their meanings, data types, and properties are shown in Table 2, and the linguistic terms of the fuzzy linguistic variables are shown in Table 3. In light of the theoretical and empirical research results, no attribute is distinguished from the others in explaining attention. Therefore, the weight vector of the network case posting attributes is set as $w = \{0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1\}$. Let the target posting be P01, let the attributes Am be normalized, and let the variable Y represent the PA (Postings_attention) of a posting. The attribute information of the network posting cases PnCBR and of the target posting P01, together with Y, is shown in Table 6.
Step (2): According to Table 3 and Formulas (10)–(13), the similarity measure $sim_M(P_0, P_n^{CBR})$ of each attribute between the network posting cases PCBRn and the target posting P01 can be calculated. Using Formula (14), the hybrid similarity measure sim(P0, PCBRn) between PCBRn and the target posting P01 is then obtained. Next, a new database of network postings, named Pτ, is created by sorting the cases in descending order of sim(PCBRn, P01). The information of Piτ, the value of sim(Piτ, P01), and the corresponding value of Y are shown in Table 7.
Step (3): According to Equations (15) and (16), the similarity threshold τ between the network posting cases and the target posting is calculated; in this example, τ = 0.870.
$$\tau = Sim^{(+)} - \frac{Sim^{(+)} - Sim^{(-)}}{3} = 0.993 - \frac{0.993 - 0.624}{3} = 0.870$$
As shown in Table 7, the maximum value of the hybrid similarity measure sim(Piτ, P01) between Piτ and the target posting P01 is sim(Pτ1, P01) = 0.993, and the minimum is sim(Pτ487, P01) = 0.624.
Guided by the extraction rules and the similarity threshold τ, all network postings Piτ with sim(Piτ, P01) greater than or equal to τ are extracted to form a similar case posting database. Posting case Pτ375 has sim(Pτ375, P01) = 0.870, so the posting cases Pτ1–Pτ375 all meet the extraction requirements of this example. These cases form a similar case postings database with vital reference significance, denoted $P_i^{\tau} = \{P_i^{\tau} \mid 1 \le i \le 375\},\ i \in N$, where Nτ = 375 in this example according to Equation (16).
Figure 3 shows the frequency distribution of the value of sim(Piτ, P01) between the network posting cases PCBR of 487 records and P01, and the value of sim(Piτ, P01) ranges from 0.993 to 0.624. The y-axis is the corresponding number of posting cases for each interval value of sim(Piτ, P01). As shown in Figure 3, there are 375 posting cases that satisfy the conditions of extracting rules and the similarity threshold τ.
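Step (3) can be sketched in a few lines. This assumes the threshold has the form used in the worked example (the maximum similarity minus one-third of the similarity range); the function names and the five toy cases are hypothetical.

```python
def similarity_threshold(similarities):
    """tau = Sim(+) - (Sim(+) - Sim(-)) / 3, following the worked example."""
    sim_max = max(similarities)
    sim_min = min(similarities)
    return sim_max - (sim_max - sim_min) / 3


def extract_similar_cases(cases, tau):
    """Return (case_id, similarity) pairs with similarity >= tau, most similar first."""
    ranked = sorted(cases, key=lambda case: case[1], reverse=True)
    return [case for case in ranked if case[1] >= tau]


# Extreme similarity values reported in Table 7 of the case study.
tau = similarity_threshold([0.993, 0.624])
print(f"tau = {tau:.3f}")

# Hypothetical toy cases, used only to show the extraction rule.
toy_cases = [("c4", 0.70), ("c1", 0.993), ("c3", 0.88), ("c5", 0.624), ("c2", 0.90)]
similar = extract_similar_cases(toy_cases, similarity_threshold([s for _, s in toy_cases]))
similar_ids = [case_id for case_id, _ in similar]
print(similar_ids)
```

Because the threshold is defined on the similarity range rather than on the case count, the number of extracted cases depends on how the similarities are distributed (375 of 487 in this example).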
Step (4): From step (3), the similar case posting database $P_i^{\tau} = \{P_i^{\tau} \mid 1 \le i \le 375\}$ of this example has been obtained. Using Equation (17), the number of extreme cases trimmed from each end is Ne = int(μNτ). Based on the data mining analysis of the research sample and the relevant empirical research mentioned above, μ = 2%, so Ne = int(0.02 × 375) = int(7.5) = 7 in this example. Using Equation (18) and following the trimming suggestions for extreme online cases, the set $P_i^{\tau} = \{P_i^{\tau} \mid 1 \le i \le 375\}$ is sorted in descending order of the corresponding value of Y, forming a new set $P_n^{rank}$ (n = 375). Because Ne = 7, the cases recommended for trimming at one end are $P_1^{rank}, P_2^{rank}, \ldots, P_{N_e}^{rank}$, and those at the other end are $P_{375}^{rank}, P_{374}^{rank}, \ldots, P_{375-N_e+1}^{rank}$. The attributes of the trimmed posting cases, the corresponding values of sim(Pnrank, P01), and the corresponding values of Y are shown in Table 8.
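The two-sided trimming of Step (4) can be sketched as below. This is an illustration under stated assumptions: `int` is taken to mean truncation, the function name is hypothetical, and the 375 attention values are synthetic placeholders rather than the study's data.

```python
def trim_extreme_cases(cases, mu=0.02):
    """Sort (case_id, Y) pairs by Y in descending order and split off both ends.

    Returns (kept, trimmed_top, trimmed_bottom), where each trimmed end holds
    N_e = int(mu * N_tau) cases.
    """
    ranked = sorted(cases, key=lambda case: case[1], reverse=True)
    n_e = int(mu * len(ranked))  # truncation, e.g. int(0.02 * 375) = int(7.5) = 7
    if n_e == 0:
        return ranked, [], []
    return ranked[n_e:-n_e], ranked[:n_e], ranked[-n_e:]


# Synthetic placeholder data: 375 cases with distinct attention values.
synthetic = [(f"p{i}", 1000 - i) for i in range(375)]
kept, top, bottom = trim_extreme_cases(synthetic, mu=0.02)
print(len(kept), len(top), len(bottom))
```

With Nτ = 375 and μ = 2%, seven cases are removed from each end, which leaves the 361 cases ranked 8 to 368.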
As shown in Table 8, the hybrid similarity measure sim(Pnrank, P01) between the trimmed posting cases and the target posting P01 ranges from 0.872 to 0.974, so all of these cases meet the extraction rule (≥ τ = 0.870). However, the largest corresponding value of Y among the trimmed posting cases (hereinafter Y(rankn)) is Y(rank1) = 545,372, while the smallest is Y(rank375) = 68, so there is a tremendous distance between Y(rank1) and Y(rank375). Moreover, the values of Y(rank1)–Y(rank7) are markedly larger than those of the remaining cases, whereas the values of Y(rank369)–Y(rank375) are markedly smaller.
In Figure 4, the x-axis is the hybrid similarity measure sim(Pnrank, P01) between each posting case of Pnrank and P01 before trimming, and the y-axis (value = Y/1000) is the corresponding Y of Pnrank. The distribution of the posting cases along the x-axis shows that the hybrid similarity measures between Prank375 and P01 and between Prank1 and P01 both meet the extraction rules, indicating that both Prank375 and Prank1 are similar to P01. However, Y(rank1) is not only far from Y(rank375) but also lies isolated in Figure 4, far from the other posting cases. Similarly, Y(rank2) = 52,342 is also relatively isolated.
Table 8 and Figure 4 clearly show the Matthew effect caused by information competition and the community incentive effect created by the network community institution, collectively referred to as network effects. Network effects not only exist in social media but also affect the earnings (here, the attention gained) of network postings. They can impose two significant and opposite effects on earnings at the same time, even when the online postings are similar in their attribute features. Therefore, when applying the CBR method in the network environment, the influence of the network and the background situation on the cases should not be ignored.
Figure 5 shows the posting cases after the extreme cases at both ends are eliminated. The maximum is Y(rank8) = 20,441, with a corresponding sim(Prank8, P01) of 0.925. In addition, Y(rank117) = 1297, and its corresponding sim(Prank117, P01) is 0.993, the largest of all hybrid similarity measures. The gap between Y(rank8) and Y(rank117) in Figure 5 is much smaller than the gap between Y(rank1) and Y(rank375) in Figure 4. Therefore, once the extreme cases at both ends of the set Pnrank have been trimmed, the distribution of the blue dots in Figure 5 is more uniform and more concentrated than in Figure 4. If a red regression line is drawn through the concentration area of the blue dots, the dots are distributed evenly above and below it.
To sum up, when analyzing and evaluating cases of the social network, it is very reasonable and necessary to trim extreme cases which satisfy the case extraction rules but are impacted by the network effect. After the similar network posting cases Pτ are trimmed, the referring similar posting cases Pτrefer are formed according to Formula (19).
Step (5): Using Formula (19), the calculation of the attention as a forecast result of the target posting P01 is as follows:
$$y_1^0 = \mathrm{round}\left(\bar{y}_1^0\right) = \left[\frac{\sum_{i=8}^{368} Y_i^{rank}}{375 - 2 \times 7}\right] = 1513$$
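The forecast of Step (5) is a rounded trimmed mean: the Ne largest and Ne smallest attention values are dropped and the remainder is averaged. The sketch below assumes exactly that reading of Formula (19); the function name and the eight toy values are hypothetical. Python's built-in `round` rounds exact halves to the nearest even integer, which may differ from the rounding convention used in the paper.

```python
def forecast_attention(y_values, n_e):
    """Rounded mean of Y after dropping the n_e largest and n_e smallest values."""
    ranked = sorted(y_values, reverse=True)
    kept = ranked[n_e:len(ranked) - n_e]
    if not kept:
        raise ValueError("no cases left after trimming")
    return round(sum(kept) / len(kept))


# Hypothetical toy attention values (not the study's data).
toy_y = [9000, 5000, 1600, 1500, 1400, 1300, 100, 50]
forecast = forecast_attention(toy_y, n_e=2)
print(forecast)

# With 375 similar cases and N_e = 7, the mean is taken over 375 - 2 * 7 cases.
cases_in_mean = 375 - 2 * 7
print(cases_in_mean)
```

In the case study, the same computation over the cases ranked 8 to 368 gives the reported forecast of 1513 for P01.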
Following Steps (1)–(5) above, we created 10 examples in all, P01–P010 (including P01), which are shown in Table 9. Here, features A1–A10 are the attributes of the example target postings, τ(i) is the similarity threshold between the network posting cases and the target postings P01–P010, Y0 is the forecast attention obtained by the target posting, and PA is the actual attention gained when the example target posting is in a relatively stable network state (milestone ≥ 45 days). Error is the percentage error between the predicted result and the actual attention obtained by the example posting: Error = (PA − Y0)/PA.
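The error measure defined above is straightforward to compute. In this small sketch the function name and the sample numbers are hypothetical and are not taken from Table 9.

```python
def percentage_error(actual_pa, forecast_y0):
    """Error = (PA - Y0) / PA, returned as a fraction (0.05 means 5%)."""
    if actual_pa == 0:
        raise ValueError("actual attention PA must be non-zero")
    return (actual_pa - forecast_y0) / actual_pa


# Hypothetical values: an actual attention of 2000 and a forecast of 1900.
error = percentage_error(2000, 1900)
print(f"{error:.2%}")
```

A positive error means the forecast falls below the actual attention, and a negative error means it overshoots.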
As shown in Table 9, the attributes A1–A10 of the example postings P01–P010 differ, which indicates that these 10 target postings are independent examples. According to Equation (15), the similarity threshold τ is set on the principle of a simple majority: τ lies one-third of the observed similarity range below the maximum similarity, so the historical cases in the top third of that range are extracted. Therefore, the larger the value of τ, the higher the similarity between the extracted cases and the target posting, and the greater the reference significance of the extracted cases for the target. In Table 9, the τ value of P05 is τ(5) = 0.885, and the corresponding percentage error of 1.02% indicates that, for the example target posting P05, the cases extracted from the network posting cases PCBR have more reference significance; thus, the prediction error is smaller. By contrast, τ(9) = 0.813 and the corresponding percentage error is −3.16%, which indicates that the network postings extracted from PCBR are less similar to example P09 than to example P05, so the prediction error is somewhat larger. Nevertheless, the prediction error of all 10 examples is within ±5% of the attention actually obtained by the target postings of the auto community.

5. Conclusions

To address problems of information dissemination on social media, we propose CBR & AM (a method of Case-Based Reasoning and Attribute Features Mining for Popularity Prediction in Social Network), developed against the research background of the automobile network community. The case postings of vital reference significance are carefully identified from the massive information in the online community. Drawing on the knowledge of historical case postings of the auto community, we estimate the number of public views that target postings might gain once they have reached a relatively stable network state. This paper explores a new method to deal with noisy network information and predict the popularity of social network information, which provides an effective way to implement early management of network expectations. We have carried out the following work.
(1) Construction of a network posting cases database. The auto community data package is preprocessed, sufficient data samples are collected, and basic indicators and simple features are extracted. By adopting a sampling method based on mathematical statistics and setting reference milestones for extracting optimal cases, the network posting cases database is constructed to suit the social network environment.
(2) Proposal of intrinsically interpretable attribute features of network postings. Based on existing theoretical exploration and empirical research on social network information, the idea of attribute features that can explain postings’ popularity is proposed. Combined with computer data mining and text analysis technology, valuable information about the popularity of postings in the community is extracted, and the related attribute features are mined. These include directly observable features, called simple features, and features that require data analysis and text mining technology, called hidden features.
(3) Construction of the CBR case-presentation system. Based on interpretable attribute features of different data types, a CBR presentation of the network case postings and the target posting is provided. Combined with related research on the CBR approach and the research background, this work offers a hybrid similarity measure between the network posting cases and the target posting, as well as the extraction rules for similar network posting cases.
(4) Trimming suggestions for extreme cases and formation of the referring similar posting-cases dataset. Drawing on existing empirical research on social networks and on the data analysis of the auto community dataset, and according to the actual earnings of postings in the auto community, the view is proposed that network effects have an expected impact on information earnings (earnings here refer to the popularity gained by postings). Trimming suggestions and corresponding parameters are offered to deal with extreme cases caused by network effects. Finally, the referring similar posting cases base is formed, which may bring practical reference value for applying CBR to the network environment.
(5) Case study and demonstration. One case is used to demonstrate the entire process of the CBR & AM method, from establishing the network posting cases database to generating the prediction result for the target object. Moreover, 10 independent cases are used to compare the forecast results with the actual popularity obtained by each case. In the demonstration and analysis, detailed data and charts are used to show the rationality of the reference milestone (≥45 days), the existence of network effects, the two significant influences imposed by network effects, and the comparison of extreme network cases before and after trimming. In this way, the applicability of the proposed CBR & AM method in the network environment and the advantage of the algorithm are verified.
The application value of this research is as follows. The number of views obtained by online posts commonly reflects the attitudes of the public. In a network community, it can represent the interest of community users in particular issues. On shopping websites, it can reflect consumers’ willingness to buy products. On news websites, it can indicate the appeal of news to the public. In practice, it is used as feedback on the operating performance of websites or businesses. Meanwhile, the expansion effect of the network often leads the public to follow trends, which may ultimately result in the Internet celebrity phenomenon. Thus, this research can offer intelligent reference suggestions for both consumers and enterprises. Moreover, from the perspective of users, based on the characteristics and environmental background of users’ published content, this research can infer whether the content has the potential to be recommended, which helps identify potential opinion leaders or Internet celebrities in the network. This is also meaningful for characterizing the portraits of network users.
The limitations of this paper and directions for further improvement are as follows. First, if conditions allow, the sample data collected for the construction of the network case base should be updated dynamically during application so that the case samples stay synchronized with changes in social media over time. Second, the attribute features adopted for CBR presentation are those that can explain postings’ popularity in the community. In the future, the influence of these features on the popularity of postings will be explored further in order to differentiate the weights of the attribute features, which will help move the treatment of the problem from a primarily qualitative analysis to a quantitative one. Third, due to the widespread existence of the network effect and social media “recommendation” institutions, some network postings reap excess earnings, while others fall into the opposite extreme (they are quickly submerged in information competition). Thus, the attention prediction in this research is not applicable to forecasting such special cases, but the prediction results can be used to evaluate whether postings have a certain potential, such as being worthy of recommendation, or are judged unable to arouse others’ interest. Finally, some empirical parameters applied in this research, such as the value of μ, are derived from the analysis of the sample data. Once the social network environment changes, these parameters may need corresponding adjustments. In the future, we will try to formulate algorithms or rules that do not rely on empirical parameters.

Author Contributions

Conceptualization, T.Z.; methodology, T.Z.; software, T.Z.; formal analysis, T.Z.; investigation, T.Z.; writing—original draft preparation, T.Z.; writing—review and editing, T.Z. and Z.Z.; supervision, J.L. and Z.Z.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Frequency analysis of entries of the posting text corpus (top 30 high frequency words).
No. | Title corpus entry (Chinese-English) | Frequency proportion (%) | Professional term (1 = Y, 0 = N) | Body corpus entry (Chinese-English) | Frequency proportion (%) | Professional term (1 = Y, 0 = N)
1 | Magotan | 34.00% | 1 | Magotan | 11.90% | 1
2 | pick up the car | 6.00% | 1 | pick up the car | 3.10% | 1
3 | navigate | 2.80% | 1 | Volkswagen | 2.90% | 0
4 | engine oil | 2.50% | 1 | price | 2.50% | 0
5 | price | 2.20% | 0 | engine | 2.20% | 1
6 | hub | 2.10% | 1 | Passat | 2.10% | 0
7 | headlights | 2.00% | 1 | sale | 2.00% | 0
8 | maintain | 2.00% | 1 | engine oil | 2.00% | 1
9 | car mate | 1.80% | 0 | navigate | 2.00% | 1
10 | Volkswagen | 1.80% | 0 | maintain | 1.90% | 1
11 | tyre | 1.70% | 1 | fuel consumption | 1.80% | 1
12 | new car | 1.60% | 0 | interior | 1.70% | 1
13 | engine | 1.60% | 1 | car model | 1.70% | 0
14 | back a car | 1.60% | 1 | test drive | 1.60% | 1
15 | fuel consumption | 1.40% | 1 | headlights | 1.60% | 1
16 | refit a car | 1.10% | 1 | buy a car | 1.60% | 1
17 | steering wheel | 0.90% | 1 | car mate | 1.60% | 0
18 | gearbox | 0.90% | 1 | car | 1.50% | 1
19 | Passat | 0.80% | 0 | new car | 1.40% | 0
20 | air-conditioner | 0.80% | 1 | tyre | 1.30% | 1
21 | caesar gold | 0.80% | 1 | back a car | 1.20% | 1
22 | dashboard | 0.80% | 1 | seat | 1.10% | 1
23 | rearview mirror | 0.80% | 1 | gearbox | 1.10% | 1
24 | Magotanb8 | 0.70% | 1 | hub | 1.00% | 1
25 | key | 0.70% | 0 | air-conditioner | 1.00% | 1
26 | drive recorder | 0.70% | 1 | drive | 0.90% | 1
27 | tire pressure | 0.60% | 1 | FAW-Volkswagen | 0.90% | 0
28 | car | 0.60% | 1 | steering wheel | 0.80% | 1
29 | seat | 0.60% | 1 | trunk | 0.80% | 1
30 | buy a car | 0.60% | 1 | vehicle | 0.80% | 0
Total | | 76.60% | 24 (80.0%) | | 58.10% | 21 (70.0%)

References

  1. Liao, S.-H.; Widowati, R.; Hsieh, Y.-C. Investigating online social media users’ behaviors for social commerce recommendations. Technol. Soc. 2021, 66, 101655. [Google Scholar] [CrossRef]
  2. Komori, M.; Miura, A.; Matsumura, N.; Hiraishi, K.; Maeda, K. Spread of Risk Information Through Microblogs: Twitter Users with More Mutual Connections Relay News That is More Dreadful1. Jpn. Psychol. Res. 2019, 63, 1–12. [Google Scholar] [CrossRef]
  3. Website Web Information Office. CNNIC Released the 48th Statistical Report on China’s Internet Development. Available online: http://www.cnnic.cn/gywm/xwzx/rdxw/20172017_7084/202109/t20210923_71551.htm (accessed on 7 January 2022).
  4. Li, M.; Wang, X.; Gao, K.; Zhang, S. A Survey on Information Diffusion in Online Social Networks: Models and Methods. Information 2017, 8, 118. [Google Scholar] [CrossRef]
  5. Hu, Y.; Hu, C.; Fu, S.; Huang, J. Survey on Popularity Evolution Analysis and Prediction. J. Electron. Inf. Technol. 2017, 39, 805–816. [Google Scholar] [CrossRef]
  6. Lee, G.; Suzuki, A. Motivation for information exchange in a virtual community of practice: Evidence from a Facebook group for shrimp farmers. World Dev. 2019, 125, 104698. [Google Scholar] [CrossRef]
  7. Wang, N.; Yin, J.; Ma, Z.; Liao, M. The influence mechanism of rewards on knowledge sharing behaviors in virtual communities. J. Knowl. Manag. 2021, 26, 485–505. [Google Scholar] [CrossRef]
  8. Dong, L.; Huang, L.; Hou, J.; Liu, Y. Continuous content contribution in virtual community: The role of status-standing on motivational mechanisms. Decis. Support Syst. 2020, 132, 113283. [Google Scholar] [CrossRef]
  9. Zhang, J.; Yu, P.S. Broad Learning: An Emerging Area in Social Network Analysis. ACM SIGKDD Explor. Newsl. 2018, 20, 24–50. [Google Scholar] [CrossRef]
  10. Jordan, T.; Alves, O.C.P.; De Wilde, P.; De Lima-Neto, F.B. Link-prediction to tackle the boundary specification problem in social network surveys. PLoS ONE 2017, 12, e0176094. [Google Scholar] [CrossRef]
  11. Goldenberg, J.; Libai, B.; Muller, E. Talk of the Network: A Complex Systems Look at the Underlying Process of Word-of-Mouth. Mark. Lett. 2001, 12, 211–223. [Google Scholar] [CrossRef]
  12. Bao, P.; Shen, W.H.; Huang, J.; Cheng, X.Q. Popularity prediction in microblogging network: A case study on sina weibo. In Proceedings of the 22nd International Conference on World Wide Web Companion, International World Wide Web Conferences Steering Committee, Rio de Janeiro, Brazil, 13–17 May 2013. [Google Scholar]
  13. Kumar, P.; Sinha, A. Information diffusion modeling and analysis for socially interacting networks. Soc. Netw. Anal. Min. 2021, 11, 1–18. [Google Scholar] [CrossRef] [PubMed]
  14. Berger, J. Word of mouth and interpersonal communication: A review and directions for future research. J. Consum. Psychol. 2014, 24, 586–607. [Google Scholar] [CrossRef]
  15. Henry, D.; Stattner, E.; Collard, M. Social media, diffusion under influence of parameters: Survey and perspectives. Procedia Comput. Sci. 2017, 109, 376–383. [Google Scholar] [CrossRef]
  16. Wang, C.-J.; Zhu, J.J. Jumping over the network threshold of information diffusion: Testing the threshold hypothesis of social influence. Internet Res. 2021, 31, 1677–1694. [Google Scholar] [CrossRef]
  17. Riquelme, F.; Gonzalez-Cantergiani, P.; Hans, D.; Villarroel, R.; Munoz, R. Identifying Opinion Leaders on Social Networks Through Milestones Definition. IEEE Access 2019, 7, 75670–75677. [Google Scholar] [CrossRef]
  18. Firdaus, S.N.; Ding, C.; Sadeghian, A. Retweet: A popular information diffusion mechanism—A survey paper. Online Soc. Netw. Media 2018, 6, 26–40. [Google Scholar] [CrossRef]
  19. Ozer, M.; Sapienza, A.; Abeliuk, A.; Muric, G.; Ferrara, E. Discovering patterns of online popularity from time series. Expert Syst. Appl. 2020, 151, 113337. [Google Scholar] [CrossRef]
  20. Foroozani, A.; Ebrahimi, M. Nonlinear anomalous information diffusion model in social networks. Commun. Nonlinear Sci. Numer. Simul. 2021, 103, 106019. [Google Scholar] [CrossRef]
  21. Hönings, H.; Knapp, D.; Nguyễn, B.C.; Richter, D.; Williams, K.; Dorsch, I.; Fietkiewicz, K.J. Health information diffusion on Twitter: The content and design of WHO tweets matter. Health Inf. Libr. J. 2021, 39, 22–35. [Google Scholar] [CrossRef]
  22. Fan, C.; Jiang, Y.; Yang, Y.; Zhang, C.; Mostafavi, A. Crowd or Hubs: Information diffusion patterns in online social networks in disasters. Int. J. Disaster Risk Reduct. 2020, 46, 101498. [Google Scholar] [CrossRef]
  23. Zhang, M.; Lin, W.; Ma, Z.; Yang, J.; Zhang, Y. Users’ health information sharing intention in strong ties social media: Context of emerging markets. Libr. Hi Tech 2021. ahead of print. [Google Scholar] [CrossRef]
  24. Li, K.; Zhou, C.; Yu, X. Exploring the differences of users’ interaction behaviors on microblog: The moderating role of microblogger’s effort. Telemat. Inform. 2020, 59, 101553. [Google Scholar] [CrossRef]
  25. Wu, S.; Lin, J.; Huang, D.; Zhang, Z. Similar Cases Extraction and Amount Estimation of Person Subjected to Execution Concealing Property Based on Similarity of Heterogeneous Information. Chin. J. Manag. Sci. 2021, 1, 1–12. [Google Scholar] [CrossRef]
  26. Qi, J.; Hu, J.; Peng, Y.H.; Ren, Q. Electrical evoked potentials prediction model in visual prostheses based on support vector regression with multiple weights. Appl. Soft Comput. 2011, 11, 5230–5242. [Google Scholar] [CrossRef]
  27. Wang, L.; Guo, Z.; Zhang, Y.; Shang, Y.; Zhang, L. An emergency supplies demand prediction model based on intuitionistic fuzzy case reasoning. J. China Univ. Min. Technol. 2015, 44, 775–780. [Google Scholar] [CrossRef]
  28. Zhu, X.; Fan, Y.; Gao, J. A Case Similarity Calculation Model Based on the Urban Flooding Case with Stratified Data Characteristics. J. Syst. Sci. Inf. 2018, 6, 134–151. [Google Scholar] [CrossRef]
  29. Qin, Y.; Lu, W.; Qi, Q.; Liu, X.; Huang, M.; Scott, P.J.; Jiang, X. Towards an ontology-supported case-based reasoning approach for computer-aided tolerance specification. Knowl. Based Syst. 2018, 141, 129–147. [Google Scholar] [CrossRef]
  30. Zhang, G.-b.; Zhang, G.-m.; Liu, Y.-j.; Huang, W.-y. Combustion optimization of power plant boilers based on data mining case reasoning. J. Eng. Therm. Energy Power 2021, 36, 114–121. [Google Scholar] [CrossRef]
  31. Zhao, H.; Liu, J.; Dong, W.; Sun, X.; Ji, Y. An improved case-based reasoning method and its application on fault diagnosis of Tennessee Eastman process. Neurocomputing 2017, 249, 266–276. [Google Scholar] [CrossRef]
  32. Gu, D.; Liang, C.; Zhao, H. A case-based reasoning system based on weighted heterogeneous value distance metric for breast cancer diagnosis. Artif. Intell. Med. 2017, 77, 31–47. [Google Scholar] [CrossRef]
  33. Cai, H.; Zhang, X.; Zhang, Y.; Wang, Z.; Hu, B. A Case-Based Reasoning Model for Depression Based on Three-Electrode EEG Data. IEEE Trans. Affect. Comput. 2018, 11, 383–392. [Google Scholar] [CrossRef]
  34. Louati, A.; Louati, H.; Li, Z. Deep learning and case-based reasoning for predictive and adaptive traffic emergency management. J. Supercomput. 2020, 77, 4389–4418. [Google Scholar] [CrossRef]
  35. Zhang, Z.; Xing, Z.; Qin, Y. Intuitionistic Fuzzy FMEA Approach for Key Component Identification of Rail Bogie. In International Conference on Electrical and Information Technologies for Rail Transportation; Springer: Singapore, 2021; pp. 460–466. [Google Scholar]
  36. Chang, J.W.; Lee, M.C.; I Wang, T. Integrating a semantic-based retrieval agent into case-based reasoning systems: A case study of an online bookstore. Comput. Ind. 2016, 78, 29–42. [Google Scholar] [CrossRef]
  37. Zhang, H.; Zhang, Z.; Zhou, L.; Wu, S. Case-Based Reasoning for Hidden Property Analysis of Judgment Debtors. Mathematics 2021, 9, 1559. [Google Scholar] [CrossRef]
  38. Zhang, H.; Zhang, Z. Characteristic Analysis of Judgment Debtors Based on Hesitant Fuzzy Linguistic Clustering Method. IEEE Access 2021, 9, 119147–119157. [Google Scholar] [CrossRef]
  39. Zhang, Z.; Li, J.; Sun, Y.; Lin, J. Novel Distance and Similarity Measures on Hesitant Fuzzy Linguistic Term Sets and Their Application in Clustering Analysis. IEEE Access 2019, 7, 100231–100242. [Google Scholar] [CrossRef]
  40. Zhang, Z.; Zhao, X.; Qin, Y.; Si, H.; Zhou, L. Interval type-2 fuzzy TOPSIS approach with utility theory for subway station operational risk evaluation. J. Ambient Intell. Humaniz. Comput. 2021, 1–15. [Google Scholar] [CrossRef]
  41. He, J.; Zhang, H.; Zhang, Z.; Zhang, J. Probabilistic Linguistic Three-Way Multi-Attibute Decision Making for Hidden Property Evaluation of Judgment Debtor. J. Math. 2021, 2021, 1–16. [Google Scholar] [CrossRef]
  42. Qi, J.; Hu, J.; Peng, Y.; Ren, Q.; Wang, W.; Zhan, Z. Integration of similarity measurement and dynamic SVM for electrically evoked potentials prediction in visual prostheses research. Expert Syst. Appl. 2010, 38, 5044–5060. [Google Scholar] [CrossRef]
  43. Zhang, T.; Weng, K.; Zhang, Q.; Zhang, Y. Audience rating predication before broadcasting based on context case-based reasoning. J. Ind. Eng. Eng. Manag. 2020, 34, 156–164. [Google Scholar] [CrossRef]
  44. Wei, M.; Dai, Q. A prediction model for traffic emission based on interval-valued intuitionistic fuzzy sets and case-based reasoning theory. J. Intell. Fuzzy Syst. 2016, 31, 3039–3046. [Google Scholar] [CrossRef]
  45. Li, H.; Adeli, H.; Sun, J.; Han, J.-G. Hybridizing principles of TOPSIS with case-based reasoning for business failure prediction. Comput. Oper. Res. 2011, 38, 409–419. [Google Scholar] [CrossRef]
  46. Wu, X.D.; Li, Y.; Li, L. Influence Analysis of Online Social Networks. Chin. J. Comput. 2014, 37, 735–752. [Google Scholar] [CrossRef]
  47. Gong, C.Z.; Liu, W.; Guo, J.; Liu, B.; Liu, Z. Principles of Statistics, 2nd ed.; China Machine Press: Beijing, China, 2017; p. 309. [Google Scholar]
  48. Mueller, R.O. Structural equation modeling: Back to basics. Struct. Equ. Model. A Multidiscip. J. 1997, 4, 353–369. [Google Scholar] [CrossRef]
  49. Ren, L. Survey Experiment: A New Technique of Causal Study, 1st ed.; Chongqing University Press: Chongqing, China, 2018; pp. 62–120. [Google Scholar]
  50. Agichtein, E.; Liu, Y.; Bian, J. Modeling information-seeker satisfaction in community question answering. ACM Trans. Knowl. Discov. Data 2009, 3, 1–27. [Google Scholar] [CrossRef]
  51. Li, C.; Chao, W.; Chen, X.; Li, Z. Quality Evaluation and Prediction for Question and Answer in Chinese Community Question Answering. Comput. Sci. 2011, 38, 230–236. [Google Scholar] [CrossRef]
  52. Netzer, O.; Feldman, R.; Goldenberg, J.; Fresko, M. Mine Your Own Business: Market-Structure Surveillance Through Text Mining. Mark. Sci. 2012, 31, 521–543. [Google Scholar] [CrossRef]
  53. Zhou, L.; Tang, L.; Zhang, Z. Extracting and ranking product features in consumer reviews based on evidence theory. J. Ambient Intell. Humaniz. Comput. 2022, 1–11. [Google Scholar] [CrossRef]
  54. Zhou, L.; Zhang, Z.; Zhao, L.; Yang, P. Attention-based BiLSTM models for personality recognition from user-generated content. Inf. Sci. 2022, 596, 460–471. [Google Scholar] [CrossRef]
  55. Sun, X.; Ni, R. Chinese cruisers’ product cognition, emotional expression and brand image perception: A web content analysis. Geogr. Res-Aust. 2018, 37, 1159–1180. Available online: http://en.cnki.com.cn/Article_en/CJFDTotal-DLYJ201806009.htm (accessed on 30 July 2022).
  56. Berger, J.; Milkman, K.L. What Makes Online Content Viral? J. Mark. Res. 2012, 49, 192–205. [Google Scholar] [CrossRef]
  57. Zhang, Z.; Guo, J.; Zhang, H.; Zhou, L.; Wang, M. Product selection based on sentiment analysis of online reviews: An intuitionistic fuzzy TODIM method. Complex Intell. Syst. 2022, 8, 3349–3362. [Google Scholar] [CrossRef]
  58. Song, T.; Huang, T.; Li, X. Python Language: An Ideal Choice for the Teaching Reform of Programming Course. China Univ. Teach. 2016, 2, 42–47. [Google Scholar] [CrossRef]
  59. Huo, S.; Zhang, M.; Liu, Y.Q.; Ma, S.P. New Words Discovery in Microblog Content. PR AI 2014, 27, 141–145. [Google Scholar] [CrossRef]
  60. Hong, C.-M.; Chen, C.-M.; Chiu, C.-Y. Automatic extraction of new words based on Google News corpora for supporting lexicon-based Chinese word segmentation systems. Expert Syst. Appl. 2009, 36, 3641–3651. [Google Scholar] [CrossRef]
  61. Jia, Y.; Liu, L.; Chen, H.; Sun, Y. A Chinese unknown word recognition method for micro-blog short text based on improved FP-growth. Pattern Anal. Appl. 2019, 23, 1011–1020. [Google Scholar] [CrossRef]
  62. Yu, C.; Cao, L.; Yin, W.; Zhang, Z.; Zheng, Y. Automatic word segmentation on Lizu spoken annotation corpus. Appl. Res. Comput. 2017, 34, 1325–1328. [Google Scholar] [CrossRef]
  63. Muniz, A.M., Jr.; O’Guinn, T.C. Brand Community. J. Consum. Res. 2001, 27, 412–432. [Google Scholar] [CrossRef]
  64. Li, Y.; Ma, S.; Zhang, Y.; Huang, R. Kinshuk An improved mix framework for opinion leader identification in online learning communities. Knowl. Based Syst. 2013, 43, 43–51. [Google Scholar] [CrossRef]
  65. Lee, C.S.; Ma, L. News sharing in social media: The effect of gratifications and prior experience. Comput. Hum. Behav. 2012, 28, 331–339. [Google Scholar] [CrossRef]
  66. Nordin, S.; Rizal, A.R.A.; Zolkepli, I.A. Innovation Diffusion: The Influence of Social Media Affordances on Complexity Reduction for Decision Making. Front. Psychol. 2021, 12, 705245. [Google Scholar] [CrossRef]
  67. Wasko, M.M.; Faraj, S. Why Should I Share? Examining Social Capital and Knowledge Contribution in Electronic Networks of Practice. MIS Q. 2005, 29, 35–57. [Google Scholar] [CrossRef]
  68. Jeon, J.; Croft, W.B.; Lee, J.H.; Park, S. A framework to predict the quality of answers with non-textual features. In Proceedings of the the 29th Annual International ACM SIGIR Conference, Seattle, WA, USA, 6–11 August 2006; ACM: Seattle, WA, USA, 2006; pp. 228–235. [Google Scholar]
  69. Agichtein, E.; Castillo, C.; Donato, D.; Gionis, A.; Mishne, G. Finding high-quality content in social media. In Proceedings of the International Conference on Web Search and Web Data Mining, Palo Alto, CA, USA, 11–12 February 2008; pp. 183–194. [Google Scholar]
  70. Agarwal, N.; Liu, H.; Lei, T.; Yu, P.S. Identifying the influential bloggers in a community. In Proceedings of the 2008 International Conference on Web Search and Data Mining, Palo Alto, CA, USA, 11–12 February 2008; pp. 207–218. [Google Scholar]
  71. Wetzer, I.M.; Zeelenberg, M.; Pieters, R. “Never eat in that restaurant, I did!”: Exploring why people engage in negative word-of-mouth communication. Psychol. Mark. 2007, 24, 661–680. [Google Scholar] [CrossRef]
  72. Yang, C.; Wang, T.; Ye, S. Research on Evaluation Index System of Microblog Opinion Leaders:A Media Influence Perspective. J. Intell. 2014, 33, 178–183. [Google Scholar] [CrossRef]
  73. Guille, A.; Hacid, H.; Favre, C.; Zighed, D.A. Information diffusion in online social networks: A Survey. ACM SIGMOD Rec. 2013, 42, 17–28. [Google Scholar] [CrossRef]
  74. Wang, L.; Wang, Y.; Wang, D.A.; Xu, X.L. A Survey of Information Diffusion Prediction in Online Social Networks. Netinfo Secur. 2015, 5, 47–55. [Google Scholar] [CrossRef]
  75. Korfiatis, N.; Rodríguez, D.; Sicilia, M.N. The Impact of Readability on the Usefulness of Online Product Reviews: A Case Study on an Online Bookstore. In World Summit on Knowledge Society 2008: Emerging Technologies and Information Systems for the Knowledge Society; Springer: Berlin/Heidelberg, Germany, 2008; pp. 423–432. [Google Scholar]
  76. Geiß, S.; Leidecker, M.; Roessing, T. The interplay between media-for-monitoring and media-for-searching: How news media trigger searches and edits in Wikipedia. New Media Soc. 2015, 18, 2740–2759. [Google Scholar] [CrossRef]
  77. Wang, W.; Ji, Y.; Wang, H.; Zheng, L. Evaluating Chinese Answers’ Quality in the Community QA System: A Case Study of Zhihu. Libr. Inf. Serv. 2017, 61, 36–44. [Google Scholar] [CrossRef]
  78. Feng, W.; Cao, Y.; Ren, H. The Study on the Case-based Reasoning Method of the Cost-estimation in Civil En-gineering. China Civ. Eng. J. 2003, 36, 51–56. [Google Scholar] [CrossRef]
Figure 1. Flowchart of CBR for popularity prediction analysis of network community postings.
Figure 2. Data mining workflow.
Figure 3. Frequency distribution of hybrid similarity measure.
Figure 4. Popularity of initial posting cases and the hybrid similarity measure.
Figure 5. Popularity of trimmed posting cases and the hybrid similarity measure.
Table 1. Word segmentation and statistics of Title Corpus.
| No. | Postings—Title Corpus (PTC) | Length (LT) | Word Segmentation Processing | NWT | Length (LTC) | Professional Words | NST | Length |
|---|---|---|---|---|---|---|---|---|
| 1 (Chinese-English) | Finally waiting for you, the 2017 type Magotan 330 is ahead of the Caesar King pick up car record. | 43 | 2017 type Magotan 330 ahead Caesar-King pick-up-car-record | 7 | 29 | Magotan Caesar-King | 2 | 10 |
| 2 (Chinese-English) | A good place for weekend vacation, a record of one-day trip to Tongli ancient town in Suzhou | 36 | weekend vacation good-place record Suzhou Tongli ancient-town one-day-trip | 8 | 32 | | 0 | 0 |
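The counts in Table 1 can be reproduced from the segmented tokens. The following is a minimal sketch, not the authors' pipeline: it takes the English glosses of the segmented titles and the professional-word lists as given in Table 1 and counts the words (NWT) and the professional words (NST). The `share` value (NST divided by NWT) and the names `TITLES` and `title_stats` are illustrative assumptions and are not defined in the table. The length columns presumably count characters of the original Chinese text, so they are not recomputed here.

```python
# Segmented tokens and professional words copied from Table 1
# (English glosses of the Chinese titles).
TITLES = {
    1: {
        "tokens": ["2017", "type", "Magotan", "330", "ahead",
                   "Caesar-King", "pick-up-car-record"],
        "professional": ["Magotan", "Caesar-King"],
    },
    2: {
        "tokens": ["weekend", "vacation", "good-place", "record",
                   "Suzhou", "Tongli", "ancient-town", "one-day-trip"],
        "professional": [],
    },
}


def title_stats(tokens, professional):
    """Return (NWT, NST, share) for one segmented title."""
    vocab = set(professional)
    nwt = len(tokens)                                   # number of words in the title
    nst = sum(1 for token in tokens if token in vocab)  # number of professional words
    share = nst / nwt if nwt else 0.0                   # hypothetical ratio, not from Table 1
    return nwt, nst, share


for no, row in TITLES.items():
    print(no, title_stats(row["tokens"], row["professional"]))
```

The NWT and NST values returned for both titles match the corresponding columns of Table 1.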
Table 2. The attribute information of the posting popularity.
| Attributes | Name (Abbreviation) | Meanings of Attributes | Data Type of Attributes | Property |
|---|---|---|---|---|
| A1 | Member Since (MS) | Experience level | Fuzzy linguistic variable | covered |
| A2 | Recommended Postings_number (RP) | Honor level | Fuzzy linguistic variable | covered |
| A3 | Posting_number (PN) | Sharing behavior | Crisp number | simple |
| A4 | Reply-to_number (RT) | Reciprocal behavior | Crisp number | simple |
| A5 | Activity_Frequency (AF) | Interaction-activity degree | Crisp number | covered |
| A6 | Length_Title (LT) | Amount of posting title's information | Crisp number | covered |
| A7 | Length_Body (LB) | Amount of posting body's information | Crisp number | covered |
| A8 | RichnessPerformance_Title (RPT) | Title-corpus's useful information | Crisp number | covered |
| A9 | Title_Topic Centrality (TTC) | Consistency between title and community-theme | Crisp number | covered |
| A10 | Body_Topic Centrality (BTC) | Consistency between body and community-theme | Crisp number | covered |
Table 3. Linguistic terms of fuzzy linguistic variables and their corresponding triangular fuzzy numbers.
| Linguistic terms | S0 | S1 | S2 | S3 | S4 |
|---|---|---|---|---|---|
| Member Since (MS) | Novice Member | Junior Member | Intermediate Member | Senior Member | Diamond Member |
| Recommended Postings_number (RP) | Regular Member | Junior Elite | Intermediate Elite | Senior Elite | Diamond Elite |
| Corresponding triangular fuzzy number | (0, 0, 0.25) | (0, 0.25, 0.5) | (0.25, 0.5, 0.75) | (0.5, 0.75, 1) | (0.75, 1, 1) |
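The mapping in Table 3 can be written down directly as a lookup from linguistic term to triangular fuzzy number. The sketch below also includes a similarity between two terms, defined as one minus the absolute difference of the peak (middle) values. This rule is an assumption made for illustration, and the paper's own measure for fuzzy linguistic attributes may differ. It does, however, reproduce the A1 and A2 similarities of 0.750 listed in Table 7 for the case with Y = 545,372 (terms S3 and S1 against the target's S2 and S0). The names `TFN` and `term_similarity` are hypothetical.

```python
# Triangular fuzzy numbers (left, peak, right) for the linguistic terms of Table 3.
TFN = {
    "S0": (0.0, 0.0, 0.25),
    "S1": (0.0, 0.25, 0.5),
    "S2": (0.25, 0.5, 0.75),
    "S3": (0.5, 0.75, 1.0),
    "S4": (0.75, 1.0, 1.0),
}


def term_similarity(term_a, term_b):
    """Similarity of two linguistic terms in [0, 1].

    Assumption: 1 minus the absolute difference between the peak values
    of the two triangular fuzzy numbers. Not taken from the paper.
    """
    peak_a = TFN[term_a][1]
    peak_b = TFN[term_b][1]
    return 1.0 - abs(peak_a - peak_b)


# Target posting P01 has A1 = S2 and A2 = S0; compare with a case that has S3 and S1.
print(term_similarity("S3", "S2"), term_similarity("S1", "S0"))
```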
Table 4. Representation of the network posting cases PnCBR and the target posting P0.
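| Postings | Attributes of Network Posting Cases—Description: A^l | | | A^d | | | Postings_Attention |
|---|---|---|---|---|---|---|---|
| P^CBR | A_1 | … | A_{m1} | A_{m1+1} | … | A_m | Y |
| P_1^CBR | a_{1,1} | … | a_{1,m1} | a_{1,m1+1} | … | a_{1,m} | y_1 |
| P_2^CBR | a_{2,1} | … | a_{2,m1} | a_{2,m1+1} | … | a_{2,m} | y_2 |
| … | … | … | … | … | … | … | … |
| P_n^CBR | a_{n,1} | … | a_{n,m1} | a_{n,m1+1} | … | a_{n,m} | y_n |
| P_0 | a_{0,1} | … | a_{0,m1} | a_{0,m1+1} | … | a_{0,m} | y_0 |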
Table 5. Database P of the network auto community and the network posting cases PCBR.

| Posting Database | Variable (Abbreviation) | Number of Records | Min | Max | Mean |
|---|---|---|---|---|---|
| P | Posting Time (PT) (unit: day) | 1200 | 3 | 2596 | 83.94 |
| P | Postings_Attention (PA) | 1200 | 49 | 545,372 | 3803.68 |
| PCBR (milestone ≥ 45 days) | Posting Time (PT) (unit: day) | 487 | 45 | 2596 | 163.53 |
| PCBR (milestone ≥ 45 days) | Postings_Attention (PA) | 487 | 68 | 545,372 | 8103.23 |
Table 6. Attribute value Am and Y of postings.
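| Posting | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PCBR1 | S3 | S1 | 0.031 | 0.000 | 0.046 | 0.712 | 0.345 | 0.200 | 0.000 | 0.000 | 545,372 |
| PCBR2 | S4 | S1 | 0.009 | 0.006 | 0.004 | 0.442 | 0.491 | 0.200 | 0.093 | 0.104 | 334,475 |
| PCBR3 | S2 | S1 | 0.029 | 0.011 | 0.050 | 0.981 | 0.745 | 0.182 | 1.000 | 0.059 | 325,525 |
| … | | | | | | | | | | | |
| PCBR163 | S1 | S0 | 0.068 | 0.062 | 0.260 | 0.462 | 0.145 | 0.000 | 0.010 | 0.111 | 1562 |
| PCBR164 | S1 | S0 | 0.006 | 0.021 | 0.023 | 0.308 | 0.218 | 0.667 | 0.000 | 0.000 | 1545 |
| … | | | | | | | | | | | |
| PCBR485 | S3 | S0 | 0.006 | 0.007 | 0.006 | 0.077 | 0.182 | 0.000 | 0.017 | 0.083 | 83 |
| PCBR486 | S3 | S0 | 0.195 | 0.157 | 0.160 | 0.077 | 0.182 | 0.500 | 0.000 | 0.000 | 79 |
| PCBR487 | S2 | S0 | 0.001 | 0.001 | 0.001 | 0.308 | 0.255 | 0.200 | 0.001 | 0.000 | 68 |
| P01 | S2 | S0 | 0.004 | 0.006 | 0.008 | 0.192 | 0.255 | 0.250 | 0.010 | 0.053 | - |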
Table 7. Attribute similarity measure, hybrid similarity measure, and Y between Piτ and P01.
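| Pτi | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | Sim(Pτi, P01) | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pτ1 | 1.000 | 1.000 | 0.997 | 0.994 | 0.995 | 1.000 | 1.000 | 1.000 | 0.996 | 0.944 | 0.993 | 1297 |
| Pτ2 | 1.000 | 1.000 | 0.995 | 0.997 | 0.993 | 0.954 | 0.955 | 1.000 | 0.999 | 0.980 | 0.987 | 871 |
| Pτ3 | 1.000 | 1.000 | 0.997 | 1.000 | 0.997 | 1.000 | 0.955 | 0.920 | 0.956 | 0.977 | 0.980 | 376 |
| Pτ4 | 1.000 | 1.000 | 0.996 | 0.996 | 0.993 | 0.867 | 1.000 | 0.951 | 0.991 | 0.944 | 0.974 | 68 |
| Pτ5 | 1.000 | 1.000 | 0.997 | 0.999 | 0.995 | 0.954 | 0.911 | 0.920 | 0.990 | 0.944 | 0.971 | 746 |
| … | | | | | | | | | | | | |
| Pτ360 | 0.500 | 1.000 | 0.996 | 0.994 | 0.999 | 0.652 | 0.911 | 0.951 | 0.992 | 0.736 | 0.873 | 1000 |
| Pτ361 | 0.750 | 0.750 | 0.973 | 0.994 | 0.962 | 0.526 | 0.890 | 0.951 | 0.990 | 0.944 | 0.873 | 545,372 |
| Pτ362 | 0.750 | 1.000 | 0.859 | 0.812 | 0.859 | 0.683 | 0.911 | 0.920 | 0.990 | 0.944 | 0.873 | 1908 |
| … | | | | | | | | | | | | |
| Pτ372 | 0.750 | 1.000 | 0.989 | 0.940 | 0.998 | 0.445 | 0.722 | 0.920 | 0.990 | 0.944 | 0.870 | 2702 |
| Pτ373 | 0.750 | 1.000 | 0.999 | 0.996 | 0.980 | 0.788 | 0.830 | 0.472 | 0.998 | 0.883 | 0.870 | 2704 |
| Pτ374 | 0.750 | 1.000 | 0.947 | 0.976 | 0.955 | 0.954 | 0.793 | 0.779 | 0.994 | 0.551 | 0.870 | 1852 |
| Pτ375 | 0.500 | 1.000 | 0.996 | 1.000 | 0.982 | 0.592 | 0.793 | 0.973 | 0.998 | 0.863 | 0.870 | 451 |
| … | | | | | | | | | | | | |
| Pτ487 | 0.500 | 0.000 | 0.368 | 0.879 | 0.752 | 0.501 | 0.522 | 0.779 | 0.990 | 0.944 | 0.624 | 5639 |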
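For the rows of Table 7 checked here, the hybrid similarity Sim(Pτi, P01) agrees, up to rounding, with the unweighted mean of the ten attribute similarities. The sketch below illustrates that reading. Equal weights are an inference from the tabulated values and not a statement of the weighting scheme used in the paper, and the names `TABLE7_ROWS` and `hybrid_similarity` are hypothetical.

```python
# Attribute similarities (A1..A10) and the reported hybrid similarity,
# copied from three rows of Table 7.
TABLE7_ROWS = {
    "Ptau1": ([1.000, 1.000, 0.997, 0.994, 0.995,
               1.000, 1.000, 1.000, 0.996, 0.944], 0.993),
    "Ptau2": ([1.000, 1.000, 0.995, 0.997, 0.993,
               0.954, 0.955, 1.000, 0.999, 0.980], 0.987),
    "Ptau361": ([0.750, 0.750, 0.973, 0.994, 0.962,
                 0.526, 0.890, 0.951, 0.990, 0.944], 0.873),
}


def hybrid_similarity(attribute_sims, weights=None):
    """Weighted sum of attribute similarities.

    Assumption: equal weights (1/m each) when no weights are given.
    """
    if weights is None:
        weights = [1.0 / len(attribute_sims)] * len(attribute_sims)
    return sum(w * s for w, s in zip(weights, attribute_sims))


for name, (sims, reported) in TABLE7_ROWS.items():
    computed = hybrid_similarity(sims)
    print(f"{name}: computed {computed:.4f}, reported {reported:.3f}")
```

Passing an explicit `weights` list allows a different attribute weighting to be tried against the same rows.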
Table 8. Attributes, hybrid similarity measure, and Y of the trimmed posting cases.
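| Prankn | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | Sim(Prankn, P01) | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Prank1 | S3 | S1 | 0.031 | 0.000 | 0.046 | 0.712 | 0.345 | 0.200 | 0.000 | 0.000 | 0.873 | 545,372 |
| Prank2 | S3 | S1 | 0.028 | 0.014 | 0.023 | 0.404 | 0.418 | 0.167 | 0.109 | 0.017 | 0.882 | 52,342 |
| Prank3 | S1 | S0 | 0.002 | 0.000 | 0.017 | 0.462 | 0.364 | 0.333 | 0.206 | 0.099 | 0.901 | 34,897 |
| Prank4 | S0 | S0 | 0.002 | 0.000 | 0.049 | 0.154 | 0.000 | 0.000 | 0.137 | 0.064 | 0.877 | 28,524 |
| Prank5 | S3 | S0 | 0.129 | 0.240 | 0.101 | 0.538 | 0.327 | 0.200 | 0.000 | 0.000 | 0.877 | 28,126 |
| Prank6 | S4 | S1 | 0.017 | 0.028 | 0.009 | 0.423 | 0.364 | 0.200 | 0.000 | 0.000 | 0.872 | 28,032 |
| Prank7 | S3 | S0 | 0.031 | 0.014 | 0.031 | 0.538 | 0.364 | 0.200 | 0.006 | 0.182 | 0.903 | 24,159 |
| … | | | | | | | | | | | | |
| Prank369 | S1 | S0 | 0.003 | 0.014 | 0.010 | 0.192 | 0.218 | 0.000 | 0.003 | 0.250 | 0.927 | 94 |
| Prank370 | S1 | S0 | 0.006 | 0.030 | 0.019 | 0.154 | 0.218 | 0.000 | 0.002 | 0.000 | 0.933 | 93 |
| Prank371 | S2 | S0 | 0.049 | 0.241 | 0.080 | 0.192 | 0.218 | 0.333 | 0.004 | 0.286 | 0.931 | 91 |
| Prank372 | S2 | S0 | 0.001 | 0.001 | 0.001 | 0.404 | 0.345 | 0.333 | 0.001 | 0.000 | 0.950 | 84 |
| Prank373 | S3 | S0 | 0.006 | 0.007 | 0.006 | 0.077 | 0.182 | 0.000 | 0.017 | 0.083 | 0.926 | 83 |
| Prank374 | S3 | S0 | 0.195 | 0.157 | 0.160 | 0.077 | 0.182 | 0.500 | 0.000 | 0.000 | 0.878 | 79 |
| Prank375 | S2 | S0 | 0.001 | 0.001 | 0.001 | 0.308 | 0.255 | 0.200 | 0.001 | 0.000 | 0.974 | 68 |
| P01 | S2 | S0 | 0.004 | 0.006 | 0.008 | 0.192 | 0.255 | 0.250 | 0.010 | 0.053 | 1.000 | - |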
Table 9. Information of 10 example target postings and errors of prediction.
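| P0i | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | τ(i) | Y0 | PA | Error (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P01 | S2 | S0 | 0.004 | 0.006 | 0.008 | 0.192 | 0.255 | 0.250 | 0.010 | 0.053 | 0.873 | 1513 | 1538 | 1.63% |
| P02 | S1 | S0 | 0.005 | 0.000 | 0.044 | 0.385 | 0.436 | 0.400 | 0.020 | 0.081 | 0.845 | 2495 | 2431 | −2.63% |
| P03 | S3 | S0 | 0.005 | 0.019 | 0.004 | 0.250 | 0.273 | 0.000 | 0.003 | 0.000 | 0.829 | 5280 | 5131 | −2.90% |
| P04 | S1 | S0 | 0.002 | 0.004 | 0.012 | 0.269 | 0.182 | 0.000 | 0.004 | 0.000 | 0.840 | 1929 | 1983 | 2.72% |
| P05 | S3 | S1 | 0.007 | 0.024 | 0.007 | 0.462 | 0.509 | 0.143 | 0.143 | 0.017 | 0.885 | 3001 | 3032 | 1.02% |
| P06 | S1 | S0 | 0.026 | 0.006 | 0.173 | 0.500 | 0.436 | 0.286 | 0.105 | 0.071 | 0.835 | 2916 | 3008 | 3.06% |
| P07 | S1 | S0 | 0.004 | 0.033 | 0.027 | 0.192 | 0.218 | 0.000 | 0.010 | 0.083 | 0.836 | 895 | 854 | −4.80% |
| P08 | S4 | S1 | 0.060 | 0.069 | 0.026 | 0.077 | 0.218 | 0.000 | 0.008 | 0.167 | 0.830 | 1189 | 1157 | −2.77% |
| P09 | S4 | S4 | 0.451 | 0.506 | 0.179 | 0.442 | 0.473 | 0.500 | 0.399 | 0.122 | 0.813 | 6506 | 6307 | −3.16% |
| P010 | S4 | S1 | 0.007 | 0.010 | 0.003 | 0.673 | 0.527 | 0.667 | 0.014 | 0.263 | 0.829 | 3814 | 3938 | 3.15% |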
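The Error (%) column of Table 9 is consistent with the signed relative difference (PA − Y0) / PA × 100, computed from the Y0 and PA columns. The sketch below recomputes the column on that assumption, along with the mean absolute error over the ten example postings. The formula is inferred from the tabulated values, and the names `TABLE9` and `relative_error_percent` are hypothetical.

```python
# (Y0, PA, reported error in %) for the ten target postings of Table 9.
TABLE9 = [
    (1513, 1538, 1.63),
    (2495, 2431, -2.63),
    (5280, 5131, -2.90),
    (1929, 1983, 2.72),
    (3001, 3032, 1.02),
    (2916, 3008, 3.06),
    (895, 854, -4.80),
    (1189, 1157, -2.77),
    (6506, 6307, -3.16),
    (3814, 3938, 3.15),
]


def relative_error_percent(y0, pa):
    """Signed relative error in percent, assuming (PA - Y0) / PA * 100."""
    return (pa - y0) / pa * 100.0


errors = [relative_error_percent(y0, pa) for y0, pa, _ in TABLE9]
mean_abs_error = sum(abs(e) for e in errors) / len(errors)

for (y0, pa, reported), computed in zip(TABLE9, errors):
    print(f"Y0={y0:5d}  PA={pa:5d}  computed={computed:6.2f}%  reported={reported:6.2f}%")
print(f"mean absolute error: {mean_abs_error:.2f}%")
```

On this reading, every recomputed value rounds to the figure reported in the table, and all ten absolute errors stay below 5%.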
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Zhao, T.; Lin, J.; Zhang, Z. Case-Based Reasoning and Attribute Features Mining for Posting-Popularity Prediction: A Case Study in the Online Automobile Community. Mathematics 2022, 10, 2868. https://0-doi-org.brum.beds.ac.uk/10.3390/math10162868