Article

Analysis of the Trends in Biochemical Research Using Latent Dirichlet Allocation (LDA)

1 College of Business Administration, Incheon National University, 119, Academy-ro, Yeonsu-gu, Incheon 22012, Korea
2 Department of Applied Chemistry, Kyung Hee University, 1732, Deogyeong-daero, Giheung-gu, Yongin-si, Gyeonggi-do 130-701, Korea
* Authors to whom correspondence should be addressed.
Submission received: 15 April 2019 / Revised: 11 June 2019 / Accepted: 13 June 2019 / Published: 18 June 2019

Abstract

Biochemistry has been broadly defined as “chemistry of molecules included or related to living systems”, but it is becoming increasingly hard to distinguish from other related fields. The targets of its studies evolve rapidly; topics emerge, disappear, combine, or resurface with a fresh viewpoint. Methodologies for biochemistry have become extremely diversified, thanks particularly to those adopted from molecular biology, synthetic chemistry, and biophysics. Therefore, this paper adopts topic modeling, a text mining technique, to identify the research topics in the field of biochemistry over the past twenty years and to quantitatively analyze the changes in its trends. The results of the topic modeling analysis will provide a helpful tool for researchers, journal editors, publishers, and funding agencies to understand the connections among the diverse sub-fields in biochemical research and to see how research topics branch out and integrate with other fields.

1. Introduction

Biochemistry is the study of the structure, composition, and chemical reactions of substances in living systems and includes the sciences of molecular biology, immunochemistry, and neurochemistry, as well as bioinorganic, bioorganic, and biophysical chemistry [1].
Biochemistry has been broadly defined as “chemistry of molecules included or related to living systems”, but it is becoming increasingly hard to distinguish from other related fields. The targets of its studies evolve rapidly; topics emerge, disappear, combine, or resurface with a fresh viewpoint. Methodologies for biochemistry have become extremely diversified, thanks particularly to those adopted from molecular biology, synthetic chemistry, and biophysics. Some sub-fields that are now regarded as lying within biochemistry used to be considered otherwise (e.g., nuclear magnetic resonance spectroscopy and mass spectrometry for biological systems) [2,3,4,5].
Like other research fields these days, biochemistry tends to focus on, and sometimes adjust itself to, a few high-impact themes, as shown by recent explosions of interest in gene-editing technologies, liquid–liquid phase separation, cryo-electron microscopy, and synthetic biology. Comparable “hot” topics in the past, however, were sometimes short-lived, as they matured rapidly or turned out to be less influential than they initially seemed. This is presumably because access to research publications is becoming easier and faster for the worldwide community, so a few good papers published in high-profile journals have far greater ramifications than they used to [6,7,8,9,10].
The broad scope, rapidly changing interests, and fast transition of research topics in biochemistry make it an interesting field for trend studies. So far, the analysis of research trends has mostly been conducted using qualitative methodologies such as literature review, expert evaluation, and the Delphi method [11,12]. However, such qualitative techniques tend to require enormous time and cost to extract significant results from large amounts of data, and they carry the possibility of bias, as the subjective values or opinions of the scholars involved may be reflected in the study. Moreover, a completely unbiased and objective evaluation of a field that encompasses a broad scope of research topics over multiple decades can be a formidable task for even the top experts in the field. Particularly for biochemical research, whose trends are constantly shifting, a quantitative long-term trend analysis, as opposed to an intuitive and popularity-driven one, could provide a more objective and unbiased interpretation of the changes in research trends [13].
Therefore, this paper adopts topic modeling, a text mining technique, to identify the research topics in the field of biochemistry over the past twenty years and quantitatively analyze the changes in its trends. Topic modeling enables us to not only specify research topics so far touched upon by scholars in biochemistry but to also extract the keywords used in relation to the topics for a more in-depth analysis. Thus, the results of the topic modeling analysis obtained through this study will provide a helpful tool for researchers, journal editors, publishers, and funding agencies to understand the connections among the diverse sub-fields in biochemical research and even see how the research topics branch out and integrate with other fields. Also, for scholars and students of academic fields outside of biochemistry, this study will present an effective starting point for approaching biochemical research.
Section 2 summarizes the existing literature that analyzes research trends using topic modeling, and Section 3 explains the technique and its application. Section 4 describes the research data collected for this study and how it was preprocessed for our purposes. The analysis results are summarized in Section 5. Section 6 presents the conclusions of the research.

2. Literature Review

Topic modeling is an algorithm for locating topics from a large, unstructured collection of texts, and it is a model that infers topics by clustering words with similar meanings [14,15,16]. Because of this feature, topic modeling has been widely used to analyze topics and trends. Grimmer [17] analyzed the agendas of U.S. senators emphasized in their press releases, using topic modeling to examine how lawmakers inform their voters about their work. Mann et al. [18] demonstrated that topic modeling can be applied to measure the impact of research papers by applying topical n-grams (TNG) on 300,000 papers in the field of computer science.
With regard to understanding trends over time, Griffiths and Steyvers [15] used topic modeling to extract topics from the abstracts of papers published between 1991 and 2001 in the Proceedings of the National Academy of Sciences of the United States of America (PNAS), then identified the cold and hot topics by period [17]. Newman and Block [19] applied topic modeling to understand early American society and its publishing culture. Their study extracted topics from the text of newspapers published in the 18th century and analyzed how the topics changed over time. Gerrish and Blei [20] used the dynamic topic model to identify changes in topical content over time in a corpus of academic research and to measure the influence of individual documents. Wang et al. [16] analyzed 17,000 studies published in Science, and Sun and Yin [13] analyzed the research trends in the transportation sector using topic modeling over time and by country.
As such, topic modeling is being applied to existing literature or, notably, to bibliographic data such as research abstracts, as a way to identify the research trends in diverse fields of study. Insofar as the present paper applies topic modeling to biochemical research, it may seem that what is attempted here lacks novelty. At the same time, the frequent use of topic modeling across fields in the existing literature underscores the importance of identifying research trends. In particular, the blurry boundaries, the speedy evolution of topics, and the openness to convergence studies that are characteristic of biochemistry reinforce the contributions of the present study.

3. Topic Modeling

Topic modeling is a text mining technique for discovering abstract ‘subjects’ in a set of documents. A document is generally written on one topic, so the words related to that topic appear more often than other words in the document. For example, a document about dogs would contain the words “dog” and “bone” more often, while a document about cats would more often contain the words “cat” and “meow.” A topic model, roughly speaking, groups “dog” and “bone” under one topic, and “cat” and “meow” under another. Like K-means clustering, topic modeling requires the number of topics to be set in advance, and the subject label of each group of words is assigned afterwards.
Latent Dirichlet allocation (LDA), a representative topic modeling technique, is a generative probabilistic model that finds potentially meaningful topics in a collection of documents [21]. Assuming that a word can be grouped under more than one topic, LDA estimates the probability that each word is included in each topic and extracts, for every topic, the set of words with the highest probabilities. That is, LDA finds the latent topics corresponding to the words in any given document. A schematic of the LDA technique is shown in Figure 1.
LDA’s algorithm finds the latent topics of a document by inferring hidden variables from the variables observed in the document, where the observed variables are the words (Wd,n). The model has the hyperparameters α and η. The hidden variables θd, Zd,n, and βk cannot be observed directly in the document but can be inferred through the LDA model. In the LDA model, Zd,n is generated from θd, the ratio of topics by document, which follows a Dirichlet prior determined by α. Likewise, βk, the probability that a word will be generated by topic k, follows a Dirichlet prior shaped by η. The word Wd,n is thus determined by Zd,n, the topic of each word, and βk, the word-by-topic ratio. For a single document with N words (dropping the document index d), the model can be expressed as follows:
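$$p(z_1, \ldots, z_N) = \int p(\theta) \left( \prod_{n=1}^{N} p(z_n \mid \theta) \right) d\theta$$

$$p(\mathbf{w}, \mathbf{z}) = \int p(\theta) \left( \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n) \right) d\theta$$

$$p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}$$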
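The generative story behind these equations can be illustrated with a small simulation. The following is only a sketch under assumed toy settings (a four-word vocabulary, two topics, symmetric priors); the function names `sample_dirichlet` and `generate_corpus` are hypothetical and are not part of the study's implementation.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(num_docs, doc_len, vocab, num_topics, alpha, eta, seed=0):
    """Generate toy documents by following the LDA generative process."""
    rng = random.Random(seed)
    # beta_k: one word distribution per topic, drawn from a symmetric Dirichlet(eta)
    beta = [sample_dirichlet([eta] * len(vocab), rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # theta_d: topic proportions of this document, drawn from Dirichlet(alpha)
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_len):
            # z_{d,n}: topic of the n-th word, drawn from theta_d
            z = rng.choices(range(num_topics), weights=theta)[0]
            # w_{d,n}: the observed word, drawn from the word distribution of topic z
            words.append(rng.choices(vocab, weights=beta[z])[0])
        corpus.append((theta, words))
    return beta, corpus

# Assumed toy settings, for illustration only.
vocab = ["dog", "bone", "cat", "meow"]
beta, corpus = generate_corpus(num_docs=3, doc_len=8, vocab=vocab,
                               num_topics=2, alpha=0.5, eta=0.5)
```

Inference in LDA runs this process in reverse: only the words are observed, and θ, z, and β are estimated from them.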
Other than LDA, topic modeling algorithms include latent semantic analysis (LSA), probabilistic LSA (pLSA), and Dirichlet multinomial regression (DMR). LSA builds semantic spaces based on a corpus and compares the similarities between words, sentences, paragraphs, and documents to form word clusters [22,23,24,25]. While LSA measures similarities and creates clusters based on the frequency with which words are used in a document, pLSA models the probability that a specific word will appear in a document. LDA is a Bayesian version of pLSA that uses the Dirichlet distribution as a conjugate prior. Thus, while pLSA uses only the document-term matrix as input without considering the distribution of topics in a document, LDA considers both the distribution of topics by document and the distribution of terms by topic [26]. DMR expands on LDA by assuming that the hyperparameter α depends on the document’s metadata (author, year, department, country) [27,28].
At present, LDA is the most popular topic modeling algorithm used by scholars. Several studies compare LSA and LDA, but there has been no definite conclusion on whether one method is clearly superior to the other [27,29]. Because LSA is based on term frequency, its advantage is that it produces intuitive results. On the other hand, the strength of LDA is that, because it is a probability-based model, it can reveal hidden connections that cannot be found by looking only at frequency. As this study attempts to redefine the topics in biochemistry and analyze their trends, we utilized the LDA model. Also, because it is difficult to specify the metadata of bibliographical information (e.g., the ‘geographical area’ of a paper can be difficult to define when it has several authors with different affiliations), the DMR model was considered unsuitable for our research.

4. Data Collection and Preprocessing

Among the research papers provided by the American Chemical Society from 1999 to 2018, this study analyzed 26,422 biochemical papers from 52 journals that fall under the subject of “general chemistry” in the American Chemical Society’s research database (ACS Publications, https://0-pubs-acs-org.brum.beds.ac.uk/). The amounts of data collected by journal and by year are given in Table 1 and Figure 2, respectively. The journals Biochemistry, Journal of Physical Chemistry B, Journal of the American Chemical Society, and Langmuir, which published the largest numbers of relevant papers, have impact factors of 2.938, 3.146, 14.357, and 3.789, respectively. As can be seen from Figure 2, the number of published papers on biochemistry increased steadily from 1999, peaked in 2012, and then declined slightly. The significant drop in the number of papers published in 2018 arises because the ACS database had not been fully updated after June 2018 and is unrelated to the trends in biochemistry research. Only the articles clearly categorized as biochemistry research in the ACS database were used as the primary data for this study.

Data Preprocessing

The abstracts of the 26,422 papers published in the field of biochemistry were collected and tokenized into words. Words that appear more than 10,000 times (typically verbs such as ‘is’, ‘have’, and ‘be’) and unnecessary stopwords, including special characters such as punctuation marks, were removed from the data prior to analysis. Then, only the words corresponding to nouns were retained for the analysis.
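The first two preprocessing steps can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the authors' pipeline: the stopword list and the names `tokenize` and `preprocess` are assumptions, and the final noun-filtering step is omitted because it requires a part-of-speech tagger that the paper does not specify.

```python
import re
from collections import Counter

# Hypothetical stopword list; the paper does not publish the one it used.
STOPWORDS = {"is", "have", "be", "the", "of", "and", "a", "in", "to"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (drops punctuation and digits)."""
    return re.findall(r"[a-z]+", text.lower())

def preprocess(abstracts, max_count=10_000):
    """Tokenize abstracts, then drop stopwords and words above the corpus-wide frequency cap."""
    tokenized = [tokenize(a) for a in abstracts]
    counts = Counter(tok for doc in tokenized for tok in doc)
    return [
        [t for t in doc if t not in STOPWORDS and counts[t] <= max_count]
        for doc in tokenized
    ]

# Toy abstracts for illustration; max_count is lowered so the cap is easy to see.
abstracts = [
    "Amyloid fibril formation is linked to disease.",
    "The ion channel binds the ligand; the receptor is activated.",
]
docs = preprocess(abstracts, max_count=2)
```

With the default `max_count=10_000`, the cap mirrors the threshold stated above; any word whose total count across all abstracts exceeds it is discarded.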

5. Results

5.1. Defining the Topics in Biochemical Research

The topic modeling using the preprocessed data proceeded as follows. The number of topics was set to 15, and each topic was then given a name based on the words assigned to it; the resulting topics and their descriptions are shown in Table 2. The words assigned to each topic are visualized as word clouds in Figure 3, and the word-generation probabilities of each topic are summarized in Table 3. Figure 4 shows the share of each topic in the total research data.
Research fields and topics are assigned intuitively, based on the composition of the words assigned to each topic. For example, the words assigned to the first topic are “aggreg, fibril, amyloid, format, diseas”, from which the research topic “Aberrant protein aggregation and diseases” can be inferred. The second topic, “Ion channels and receptors”, was inferred from the words “channel, ligand, receptor, affin, complex” assigned to it.
Although the topics showed little difference in their shares of the total research data, the topics that accounted for the highest shares were “5. Protein conformation (computational studies), 9. Regulation of cellular functions, 10. Biochemistry of lipids”, while those that took up comparatively lower shares were “3. Protein folding, 11. Development of chromophores for biochemistry, 14. Redox chemistry of cytoskeletal dynamics”.
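As an illustration of how topic words and topic shares can be read off a fitted model, the sketch below ranks the words of each topic by probability and averages the per-document topic proportions. The matrices are toy values, the function names are hypothetical, and averaging the document-topic proportions is an assumed way of computing the shares; the paper does not state how the ratios in Figure 4 were obtained.

```python
def top_words(topic_word_probs, vocab, n=5):
    """Return the n most probable words for each topic."""
    result = []
    for probs in topic_word_probs:
        ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
        result.append([word for word, _ in ranked[:n]])
    return result

def topic_shares(doc_topic_probs):
    """Average the per-document topic proportions to get each topic's share of the corpus."""
    num_docs = len(doc_topic_probs)
    num_topics = len(doc_topic_probs[0])
    return [sum(doc[k] for doc in doc_topic_probs) / num_docs for k in range(num_topics)]

# Toy values for illustration only; they are not taken from Table 3 or Figure 4.
vocab = ["aggreg", "fibril", "amyloid", "channel", "ligand", "receptor"]
topic_word_probs = [
    [0.40, 0.30, 0.20, 0.04, 0.03, 0.03],
    [0.05, 0.03, 0.02, 0.40, 0.30, 0.20],
]
doc_topic_probs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]

labels = top_words(topic_word_probs, vocab, n=3)
shares = topic_shares(doc_topic_probs)
```

A human reader then names each topic from its top words, as was done for the fifteen topics in this study.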

5.2. Analysis of Yearly Trends

Using the topic modeling analysis results for the research conducted in the field of biochemistry from 1999 to 2018, the changes in research trends over time were identified. As research in 2019 is ongoing, this year was excluded from the trend analysis. The overall trend was determined by plotting the ratio of research papers by topic for all years on a graph (Figure 5), and then the data was analyzed quantitatively using the linear regression model (Table 4).
Sun and Yin [13] collected data on transportation research from 1991 to 2015 for topic modeling analysis and, to analyze the trends in research, defined the index $r_k$ using $\theta_k^{t}$, the ratio of topic k in year t, following the equation below. Topics whose $r_k$ value is less than 1 were classified as hot topics, and those above 1 as cold topics.
$$r_k = \frac{\sum_{t=1991}^{1995} \theta_k^{t}}{\sum_{t=2011}^{2015} \theta_k^{t}}$$
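A short numerical sketch of this index may help. The yearly shares below are made-up toy values, not data from Sun and Yin, and the helper name `r_index` is hypothetical.

```python
def r_index(theta_by_year, early_years, late_years):
    """Ratio of a topic's summed share in the early window to its summed share in the late window."""
    early = sum(theta_by_year[y] for y in early_years)
    late = sum(theta_by_year[y] for y in late_years)
    return early / late

# Toy yearly shares for one topic (illustrative values only).
theta = {1991: 0.02, 1992: 0.02, 1993: 0.03, 1994: 0.03, 1995: 0.04,
         2011: 0.06, 2012: 0.07, 2013: 0.07, 2014: 0.08, 2015: 0.08}

r = r_index(theta, range(1991, 1996), range(2011, 2016))
# A value below 1 means the topic grew between the two windows.
label = "hot" if r < 1 else "cold"
```

Because only the first and last five years enter the ratio, everything that happens in the intervening years is ignored.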
However, this method of analysis is limited in reflecting the overall trends, as it relies on a simple arithmetic summary such as $r_k$. As such, in this study, a linear regression model was constructed following the method proposed by Griffiths and Steyvers [15], and hot and cold topics were classified based on the significance of the regression coefficient. The independent variable was the year, covering the 20 years from 1999 to 2018, and the dependent variable was the share of each topic in that year. Topics whose regression coefficients met the significance threshold were considered significant, and whether they were hot or cold was judged by the sign of the regression coefficient: a positive coefficient (+) indicates a hot topic, and a negative coefficient (−) indicates a cold topic.
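The regression-based classification can be sketched as follows. This is a minimal illustration with synthetic yearly shares; the function names are hypothetical, and the threshold is an assumption (the two-sided 5% critical value of Student's t with 18 degrees of freedom), not the significance level used in the study.

```python
import math

def trend_regression(years, shares):
    """Ordinary least squares of topic share on year; returns the slope and its t-statistic."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(shares) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, shares))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(years, shares))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    t_stat = slope / std_err if std_err > 0 else math.copysign(math.inf, slope)
    return slope, t_stat

# Assumed threshold: two-sided 5% critical value of Student's t with 18 degrees of freedom.
T_CRITICAL = 2.101

def classify(slope, t_stat, t_critical=T_CRITICAL):
    """Label a topic hot or cold by the sign of a significant slope."""
    if abs(t_stat) < t_critical:
        return "not significant"
    return "hot" if slope > 0 else "cold"

# Synthetic yearly shares for three toy topics over 1999-2018.
years = list(range(1999, 2019))
noise = [0.0005 * (-1) ** i for i in range(20)]
rising = [0.05 + 0.001 * i + e for i, e in enumerate(noise)]
falling = [0.08 - 0.001 * i + e for i, e in enumerate(noise)]
flat = [0.05 + e for e in noise]

results = {name: classify(*trend_regression(years, series))
           for name, series in [("rising", rising), ("falling", falling), ("flat", flat)]}
```

Unlike the two-window ratio, the slope uses all twenty yearly values, so a topic that peaks mid-period is less likely to be mislabeled.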
According to the trend analysis based on linear regression, “0. Aberrant protein aggregation and diseases, 5. Protein conformation (computational studies), 9. Regulation of cellular functions, 10. Biochemistry of lipids” were the topics that showed consistent upward trends. “3. Protein conformation (NMR studies), 4. Various helix dimers, 12. Biochemistry of nucleic acids, 13. Biochemistry of Heme complexes, 14. Redox chemistry of cytoskeletal dynamics” were the topics that exhibited clear downward trends throughout the whole period examined in this research.
These trends indicate that biochemical research has gradually broadened its scope from its past focus on understanding specific biomolecules (proteins, nucleic acids, etc.) to examining all proteins or lipids, and then to explaining globally occurring phenomena. Interestingly, research on lipids has become even more active, which may reflect the recent spotlight on autophagy and lipid droplets. Also, the decline in research on the biochemistry of nucleic acids, despite the large number of papers published on genome editing technology, can be attributed to the research focus on using genome editing to control cell or protein functions rather than on nucleic acids themselves. Also notable is that, in the case of protein conformational dynamics, computational study (topic 5) is a hot topic, while NMR study (topic 3) is a cold topic. One reason may be the development of computational power, in terms of both hardware and software, which has enabled the analysis of areas that were out of reach of nuclear magnetic resonance (NMR) technology.

5.3. Analysis of Topics by Period

Topic modeling creates word clusters from a given document data set based on probability, then attributes topics to each cluster afterwards. Thus, naturally, a change in the data set results in a change in the word clusters. The topics listed in Table 2 are based on the full data set collected for this study for the twenty year period from 1999 to 2018, which means that dividing this data set by period and applying topic modeling to each periodic set individually will extract topics that are specific to the period, as opposed to the general topics found for the twenty year timeframe. To do this, we divided the data set into four specific timeframes and performed topic modeling separately on each of the four data sets (Table 5). The four timeframes were from 1999 to 2006, 2007 to 2011, 2012 to 2017, and the one-year timeframe of 2018 to identify the latest topics in biochemical research. The number of topics was set to five.
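The period split described above amounts to grouping the abstracts by publication year before fitting a separate model to each group. The sketch below shows only the grouping step; the record layout (year, abstract) and the name `split_by_period` are assumptions made for illustration.

```python
# The four timeframes used in the study, as inclusive year ranges.
PERIODS = [(1999, 2006), (2007, 2011), (2012, 2017), (2018, 2018)]

def split_by_period(papers, periods=PERIODS):
    """Group (year, abstract) records into the given inclusive year ranges."""
    groups = {period: [] for period in periods}
    for year, abstract in papers:
        for start, end in periods:
            if start <= year <= end:
                groups[(start, end)].append(abstract)
                break  # periods do not overlap, so stop at the first match
    return groups

# Toy records for illustration; a 2019 paper falls outside every period and is dropped.
papers = [(1999, "a"), (2006, "b"), (2007, "c"), (2015, "d"), (2018, "e"), (2019, "f")]
groups = split_by_period(papers)
```

Each group would then be passed to the topic model on its own, here with the number of topics set to five.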
The results showed that, regardless of the period, a large amount of research has been consistently conducted on topics such as “membran, lipid” and “DNA.” On the other hand, “heme” appeared as one of the major topics before 2011 but was not seen thereafter, while “fibril” and “dynam, simul” emerged as popular topics from 2007. All of these topics are among the fifteen major topics extracted for the full twenty-year period; however, their inclusion in the major topics differed by topic and period in a way that matched the trends shown in Figure 5. Also, the topics extracted for 2018 confirmed the appearance of “water, hydrogen, charg” as new topics. The emergence of “energy production in biosystems” as a new topic around 2018 can be tied to the wider implementation of policies in many countries toward accelerating the development of biomass energy.

6. Discussion and Conclusions

Since its establishment as a separate field of study, biochemistry has continuously expanded its research scope, and the related industries have shown steady growth that is expected to continue [10]. Today, the biochemical industry attracts not only existing biotech and chemical firms but also new challengers, including global companies such as Coca-Cola, IKEA, Dell, and LEGO, which are preparing to try their hands at the research and development of biochemical products. Meanwhile, countries have been actively supporting the research and development of biochemistry through national policies. The United States (US) plans to replace 30% of its current oil consumption with green carbon by 2030 and is supporting the wider use of bio-derived products by expanding its BioPreferred Program to 97 items and 10,000 types. The European Union, which accounts for 60% of the global bioplastics market, has been developing the biochemical industry as one of its six leading industries, and in Japan, national efforts are being made to promote the biochemical industry through the country’s “Biomass Japan Comprehensive Strategy” [30]. This close connection between research and industry has driven biochemistry to develop at greater speed and to embrace new areas and topics. For this reason, applying topic modeling, a technique often used in other fields, to the analysis of biochemical research trends can contribute to the existing literature [2,3,4,5].
This study used topic modeling, a text mining technique, based on LDA to define the research topics in biochemical research over the past twenty years and quantitatively analyze their trends. The abstracts of 26,422 papers published in 52 journals from 1999 to 2018 were collected through the American Chemical Society and used as the data for analysis. Based on the results, we identified the fifteen major topics of biochemistry over the past 20 years and, using linear regression, analyzed the amount of research conducted over time. Further, the research data was divided into four periods to repeat the topic modeling analysis for the specific timeframes to see which topics decreased or increased in weight over time and to pinpoint newly-emerging topics.
Our analysis results were in line with the recent trends of the biochemical industry. Recent movements in the industry show that its attention is pointed at biofiber and biofuels. Examples include the 1,3-PDO production plant with a capacity of 450,000 tons per year constructed by DuPont and Tate & Lyle to produce PTT (polytrimethylene terephthalate) for use as fiber material in the Sorona brand, and the joint venture for propylene glycol (PG) production established by ADM (Archer Daniels Midland) and Cargill. The latest trends revealed by our study list “energy production in biosystems” and “aberrant protein aggregation and diseases” as two of the top five topics, indicating that our analysis closely reflects the latest industrial interests. Therefore, applying the analysis method used in this study to continuously updated data will provide a helpful decision-making tool for practitioners and researchers in industry and academia.
Most researchers, of course, do not seek to follow the trend of the moment, nor should they. The goal of this study is not to argue that researchers should follow academic trends but to contribute to future research by sharing information on trends and opening up possibilities for new and diverse research. It is important to study classic topics to gain an understanding of the fundamentals of a field, but it is equally necessary to bring together different research fields, broaden boundaries, and uncover new topics as trends and technologies change, thereby breathing new vitality into traditional research topics.
In addition to the ACS papers used in this study, there exist a number of journals, including Nature, that publish articles that can be categorized under the field of biochemistry. However, we used only the papers clearly categorized as biochemistry by the ACS to avoid subjective judgment about what research belongs to the field. This decision is both a limitation and a contribution of this study. We put aside research articles whose classification under biochemistry was unspecified; however, this also means that the topics extracted by our analysis can serve as a base point from which further trend analyses can be performed with additional data. In particular, further studies may expand the present study by conducting a more in-depth analysis of sub-categories to uncover more specific trends in the wide scope of biochemical research.

Author Contributions

C.K. collected the data, and H.J.K. conceived and designed the experiments. K.K. provided biochemical knowledge. All authors performed the experiments and wrote the paper. All authors read and approved the final manuscript.

Funding

This work was funded by Incheon National University “Program on Research Workforce Post-Doctor Recruitment Support Project 2019”.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. American Chemical Society. Available online: www.acs.org (accessed on 8 April 2019).
  2. McKendry, P. Energy production from biomass (part 1): Overview of biomass. Bioresour. Technol. 2002, 83, 37–46.
  3. McKendry, P. Energy production from biomass (part 2): Conversion technologies. Bioresour. Technol. 2002, 83, 47–54.
  4. McKendry, P. Energy production from biomass (part 3): Gasification technologies. Bioresour. Technol. 2002, 83, 55–63.
  5. Galambos, L.; Takashi, H.; Vera, Z. (Eds.) The Global Chemical Industry in the Age of the Petrochemical Revolution; Cambridge University Press: Cambridge, UK, 2007.
  6. Bull, A.T.; Holt, G.; Malcolm, D.L. Biotechnology: International Trends and Perspectives; OCDE: Paris, France, 1982.
  7. Demirbas, A. Combustion characteristics of different biomass fuels. Prog. Energy Combust. Sci. 2004, 30, 219–230.
  8. Fetterhoff, T.J.; Voelkel, D. Managing open innovation in biotechnology. Res. Technol. Manag. 2006, 49, 14–18.
  9. Kinch, M.S. The rise (and decline?) of biotechnology. Drug Discov. Today 2014, 19, 1686–1690.
  10. Jian, Z.; Zhao, Z.-Y. Green building research–current status and future agenda: A review. Renew. Sustain. Energy Rev. 2014, 30, 271–281.
  11. William, L.B.; Mickelsen, J.F. An analysis of prior Delphi applications and some observations on its future applicability. Technol. Forecast. Soc. Chang. 1977, 10, 103–110.
  12. Clark, K.R.; Neal, T.A.; Johnson, T.E. Creation of an innovative laser incident reporting form for improved trend analysis using the Delphi technique. Mil. Med. 2006, 171, 894–899.
  13. Sun, L.; Yin, Y. Discovering themes and trends in transportation research using topic modeling. Transp. Res. Part C Emerg. Technol. 2017, 77, 49–66.
  14. Steyvers, M.; Griffiths, T. Probabilistic topic models. Handb. Latent Semant. Anal. 2007, 427, 424–440.
  15. Griffiths, T.L.; Steyvers, M. Finding scientific topics. Proc. Natl. Acad. Sci. USA 2004, 101, 5228–5235.
  16. Wang, C.; Blei, D.; Heckerman, D. Continuous time dynamic topic models. arXiv 2012, arXiv:1206.3298.
  17. Grimmer, J. A Bayesian hierarchical topic model for political texts: Measuring expressed agendas in Senate press releases. Political Anal. 2010, 18, 1–35.
  18. Mann, G.S.; Mimno, D.; McCallum, A. Bibliometric impact measures leveraging topic analysis. In Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, Chapel Hill, NC, USA, 11–15 June 2006.
  19. Newman, D.J.; Block, S. Probabilistic topic decomposition of an eighteenth-century American newspaper. J. Am. Soc. Inf. Sci. Technol. 2006, 57, 753–767.
  20. Gerrish, S.; Blei, D.M. A Language-based Approach to Measuring Scholarly Impact. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010.
  21. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
  22. Deerwester, S.; Dumais, S.T.; Furnas, G.W.; Landauer, T.K.; Harshman, R. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 1990, 41, 391–407.
  23. Foltz, P.W.; Kintsch, W.; Landauer, T.K. The measurement of textual coherence with latent semantic analysis. Discourse Process. 1998, 25, 285–307.
  24. Landauer, T.K.; Dumais, S.T. A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychol. Rev. 1997, 104, 211.
  25. Landauer, T.K.; Foltz, P.W.; Laham, D. An introduction to latent semantic analysis. Discourse Process. 1998, 25, 259–284.
  26. Liu, L.; Tang, L.; Dong, W.; Yao, S.; Zhou, W. An overview of topic modeling and its current applications in bioinformatics. SpringerPlus 2016, 5, 1608.
  27. Lee, S.; Song, J.; Kim, Y. An empirical comparison of four text mining methods. J. Comput. Inf. Syst. 2010, 51, 1–10.
  28. Bergamaschi, S.; Po, L. Comparing LDA and LSA topic models for content-based movie recommendation systems. In Proceedings of the International Conference on Web Information Systems and Technologies, Barcelona, Spain, 3–5 April 2014.
  29. Cvitanic, T.; Lee, B.; Song, H.I.; Fu, K.; Rosen, D. LDA v. LSA: A comparison of two computational text analysis tools for the functional categorization of patents. In Proceedings of the International Conference on Case-Based Reasoning, Atlanta, GA, USA, 31 October–2 November 2016.
  30. Lee, M. Bio-Based Chemicals Industry Trends. In Bio Economy Brief; Korea Biotechnology Industry Organization: Seoul, Korea, 2018; pp. 2–3.
Figure 1. Schematic of the topic modeling algorithm. K: number of topics; α: Dirichlet prior on the per-document topic distribution, the parameter that determines θ; η: Dirichlet prior on the per-topic word distribution, the parameter that determines β; θ_d: topic proportions of document d; β_k: probability that word w is generated by topic k; Z_{d,n}: topic assigned to the nth word in document d (index); W_{d,n}: the nth word in document d (the variable observed in the document, index).
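The variables named in Figure 1 correspond to the standard LDA generative process described by Blei et al. [21]. The sketch below samples a toy corpus from that process to show how α, η, θ_d, β_k, Z_{d,n}, and W_{d,n} relate. It is an illustration only: the function names, the symmetric priors, the fixed document length, and the four-word vocabulary are assumptions made here, not details taken from the study.

```python
import random


def sample_dirichlet(concentrations, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(c, 1.0) for c in concentrations]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw one index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_topics, vocab, num_docs, doc_len, alpha, eta, seed=0):
    """Sample documents from the LDA generative process (hypothetical helper).

    alpha and eta are treated as symmetric Dirichlet priors (an assumption).
    """
    rng = random.Random(seed)
    # beta_k: one word distribution per topic, drawn from Dirichlet(eta)
    beta = [sample_dirichlet([eta] * len(vocab), rng) for _ in range(num_topics)]
    thetas, docs = [], []
    for _ in range(num_docs):
        # theta_d: topic proportions of this document, drawn from Dirichlet(alpha)
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_len):
            z = sample_index(theta, rng)    # Z_{d,n}: topic of the nth word
            w = sample_index(beta[z], rng)  # W_{d,n}: observed word index
            words.append(vocab[w])
        thetas.append(theta)
        docs.append(words)
    return beta, thetas, docs


beta, thetas, docs = generate_corpus(
    num_topics=3,
    vocab=["dna", "lipid", "heme", "fold"],
    num_docs=5,
    doc_len=20,
    alpha=0.5,
    eta=0.5,
    seed=1,
)
print(docs[0])
```

Fitting LDA to real abstracts runs this process in reverse: only the words W_{d,n} are observed, and θ, β, and Z are inferred from them.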
Figure 2. Number of biochemical papers published by year.
Figure 3. Word cloud by topic.
Figure 4. Ratio by topic.
Figure 5. Trends in research topics by year.
Table 1. Number of biochemical papers published by journal.

| Journal | No. | Journal | No. |
| --- | --- | --- | --- |
| Biochemistry | 11,218 | ACS Chem. Neurosci. | 113 |
| J. Phys. Chem. B | 5172 | J. Med. Chem. | 108 |
| J. Am. Chem. Soc. | 4915 | Org. Lett. | 86 |
| Langmuir | 2196 | ACS Appl. Mater. Interfaces | 54 |
| Biomacromolecules | 662 | ACS Synth. Biol. | 48 |
| J. Chem. Theory Comput. | 528 | Chem. Mater. | 43 |
| J. Phys. Chem. A | 528 | Mol. Pharmaceutics | 42 |
| Inorg. Chem. | 512 | ACS Cent. Sci. | 34 |
| J. Phys. Chem. Lett. | 426 | J. Nat. Prod. | 34 |
| ACS Chem. Biol. | 338 | J. Chem. Inf. Comput. Sci. | 31 |
| Acc. Chem. Res. | 320 | Environ. Sci. Technol. | 22 |
| J. Agric. Food Chem. | 282 | ACS Biomater. Sci. Eng. | 19 |
| Nano Lett. | 252 | Ind. Eng. Chem. Res. | 19 |
| J. Phys. Chem. C | 246 | ACS Med. Chem. Lett. | 18 |
| Chem. Rev. | 241 | ACS Macro Lett. | 15 |
| J. Proteome Res. | 233 | ACS Sustainable Chem. Eng. | 8 |
| Bioconjugate Chem. | 231 | J. Chem. Eng. Data | 8 |
| ACS Nano | 205 | ACS Catal. | 6 |
| J. Chem. Inf. Model. | 203 | ACS Infect. Dis. | 6 |
| Macromolecules | 191 | ACS Sens. | 3 |
| Anal. Chem. | 186 | ACS Appl. Bio Mater. | 2 |
| Chem. Res. Toxicol. | 164 | ACS Comb. Sci. | 2 |
| Crystal Growth and Design | 140 | J. Chem. Educ. | 2 |
| J. Org. Chem. | 125 | J. Comb. Chem. | 2 |
| ACS Omega | 122 | ACS Earth Space Chem. | 1 |
| J. Phys. Chem. | 120 | Org. Process Res. Dev. | 1 |
Table 2. Topic titles and descriptions.

| # | Title | Descriptions |
| --- | --- | --- |
| 0 | Aberrant protein aggregation and diseases | Amyloid formation; Amyloidosis |
| 1 | Ion channels and receptors | Ion channel complexes; Ligands; Drug design |
| 2 | Protein folding | Kinetics and thermodynamics of protein folding; Protein denaturation |
| 3 | Protein conformation (NMR studies) | NMR spectroscopy for conformational dynamics |
| 4 | Various helix dimers | Dimeric transmembrane domain; Dimerizing transcription factors; Zinc-finger domain |
| 5 | Protein conformation (computational studies) | Molecular dynamics simulations on protein dynamics |
| 6 | Membrane transporters | Membrane fusion proteins; Mechanisms of membrane transportation |
| 7 | Calculation of electron transfer | Biological electron transfer; Redox pairs |
| 8 | Self-assembly of biomolecules | Self-assembled biopolymers; DNA origami; Crystallization |
| 9 | Regulation of cellular functions | Signaling pathways; Phosphorylation |
| 10 | Biochemistry of lipids | Lipid biochemistry; Vesicle formation; Cholesterols |
| 11 | Development of chromophores for biochemistry | Fluorescence sensors; Biological chromophores |
| 12 | Biochemistry of nucleic acids | DNA; RNA; DNA secondary structures |
| 13 | Biochemistry of Heme complexes | Hemoglobin; Heme complexes |
| 14 | Redox chemistry of cytoskeletal dynamics | Kinetics of redox enzymes; Oxidation of actins |
14Redox chemistry of cytoskeletal dynamicsKinetics of redox enzymes; Oxidation of actins
Table 3. Probability distribution of words by topic. Each cell gives a stemmed word followed by its probability within the topic.

| Aberrant Protein Aggregation and Diseases | Ion Channels and Receptors | Protein Folding | Protein Conformation (NMR Studies) | Various Helix Dimers | Protein Conformation (Computational Studies) | Membrane Transporters | Calculation of Electron Transfer |
| --- | --- | --- | --- | --- | --- | --- | --- |
| aggreg 0.028 | channel 0.028 | fold 0.026 | nmr 0.023 | helix 0.012 | simul 0.024 | cell 0.013 | energi 0.018 |
| fibril 0.024 | ligand 0.022 | unfold 0.016 | conform 0.018 | dimer 0.011 | dynam 0.023 | transport 0.013 | electron 0.015 |
| amyloid 0.016 | ca2 0.015 | stabil 0.015 | spectroscopi 0.011 | sequenc 0.011 | model 0.018 | membran 0.011 | calcul 0.014 |
| form 0.015 | receptor 0.012 | temperatur 0.014 | spectra 0.011 | region 0.010 | molecular 0.016 | function 0.007 | charg 0.013 |
| 0.014 | site 0.010 | increas 0.011 | dynam 0.010 | helic 0.010 | water 0.013 | coli 0.006 | transfer 0.012 |
| format 0.014 | affin 0.009 | chang 0.010 | shift 0.009 | amino 0.009 | conform 0.010 | express 0.006 | group 0.011 |
| diseas 0.011 | complex 0.009 | transit 0.010 | reson 0.009 | mutat 0.009 | experiment 0.010 | complex 0.006 | hydrogen 0.010 |
| oligom 0.009 | select 0.009 | effect 0.009 | relax 0.008 | two 0.009 | energi 0.010 | substrat 0.006 | proton 0.009 |
| studi 0.008 | cation 0.008 | denatur 0.009 | backbon 0.008 | stabil 0.008 | forc 0.008 | studi 0.006 | base 0.009 |
| monom 0.007 | na 0.008 | kinet 0.008 | measur 0.008 | form 0.008 | differ 0.007 | fusion 0.005 | effect 0.008 |
| β-sheet 0.007 | mg2 0.007 | solut 0.008 | indic 0.007 | conform 0.008 | molecul 0.007 | suggest 0.005 | pair 0.007 |
| assembl 0.007 | bound 0.006 | conform 0.007 | observ 0.007 | c-termin 0.008 | mechan 0.007 | result 0.005 | model 0.007 |
| dimer 0.006 | pore 0.006 | nativ 0.007 | solut 0.007 | posit 0.007 | result 0.007 | human 0.005 | result 0.007 |
| show 0.006 | none 0.006 | thermodynam 0.007 | chain 0.007 | chain 0.007 | studi 0.006 | show 0.004 | mechan 0.006 |
| result 0.005 | studi 0.006 | studi 0.007 | data 0.007 | n-termin 0.007 | time 0.006 | escherichia 0.004 | studi 0.006 |
| associ 0.005 | transport 0.006 | mol 0.007 | chang 0.007 | loop 0.007 | calcul 0.005 | two 0.004 | complex 0.006 |
| suggest 0.005 | conduct 0.005 | result 0.006 | amid 0.007 | fold 0.007 | md 0.005 | apo 0.004 | mol 0.006 |
| mechan 0.005 | differ 0.005 | rate 0.006 | chemic 0.007 | result 0.006 | system 0.005 | transloc 0.004 | experiment 0.006 |
| conform 0.005 | show 0.005 | depend 0.006 | 15n 0.007 | motif 0.005 | free 0.005 | ribosom 0.004 | function 0.005 |
| oligomer 0.005 | two 0.005 | measur 0.005 | label 0.006 | show 0.005 | field 0.005 | site 0.004 | densiti 0.005 |

| Self-Assembly of Biomolecules | Regulation of Cellular Functions | Biochemistry of Lipids | Development of Chromophores for Biochemistry | Biochemistry of Nucleic Acids | Biochemistry of Heme Complexes | Redox Chemistry of Cytoskeletal Dynamics |
| --- | --- | --- | --- | --- | --- | --- |
| dna 0.009 | function 0.013 | membran 0.045 | fluoresc 0.020 | dna 0.064 | heme 0.022 | reaction 0.022 |
| adsorpt 0.009 | cell 0.012 | lipid 0.042 | chromophor 0.014 | rna 0.020 | ii 0.017 | oxid 0.018 |
| solut 0.008 | regul 0.008 | bilay 0.025 | excit 0.012 | base 0.019 | ligand 0.014 | rate 0.014 |
| self-assembl 0.008 | complex 0.008 | phase 0.014 | proton 0.010 | sequenc 0.015 | metal 0.014 | radic 0.009 |
| assembl 0.008 | signal 0.007 | vesicl 0.012 | chang 0.009 | pair 0.014 | complex 0.010 | actin 0.008 |
| form 0.007 | phosphoryl 0.007 | cholesterol 0.009 | absorpt 0.009 | duplex 0.013 | iron 0.010 | product 0.008 |
| polym 0.007 | role 0.006 | phospholipid 0.009 | nm 0.008 | strand 0.008 | site 0.010 | format 0.008 |
| charg 0.006 | studi 0.006 | studi 0.007 | rhodopsin 0.007 | form 0.007 | cytochrom 0.010 | form 0.007 |
| molecul 0.006 | target 0.006 | monolay 0.007 | observ 0.006 | stabil 0.007 | fe 0.009 | kinet 0.007 |
| complex 0.006 | specif 0.006 | chain 0.006 | light 0.006 | loop 0.007 | cu 0.009 | complex 0.007 |
| crystal 0.006 | biolog 0.006 | increas 0.006 | time 0.006 | complex 0.006 | coordin 0.009 | re 0.006 |
| result 0.006 | mechan 0.005 | interfac 0.006 | retin 0.006 | contain 0.006 | co 0.008 | atp 0.006 |
| microscopi 0.006 | cellular 0.005 | effect 0.006 | differ 0.006 | two 0.006 | copper 0.007 | constant 0.006 |
| studi 0.005 | import 0.005 | result 0.006 | intermedi 0.005 | g-quadruplex 0.005 | two 0.007 | cluster 0.006 |
| forc 0.005 | provid 0.005 | water 0.005 | band 0.005 | result 0.005 | electron 0.007 | cross-link 0.006 |
| format 0.005 | receptor 0.005 | differ 0.005 | spectroscopi 0.005 | studi 0.005 | redox 0.006 | mechan 0.006 |
| adsorb 0.005 | identifi 0.004 | liposom 0.004 | form 0.005 | oligonucleotid 0.005 | iii 0.005 | hydrolysi 0.005 |
| poli 0.005 | process 0.004 | form 0.004 | decay 0.005 | site 0.005 | form 0.005 | presenc 0.005 |
| nm 0.005 | modul 0.004 | hydrophob 0.004 | spectra 0.005 | groov 0.005 | histidin 0.005 | dissoci 0.005 |
| film 0.005 | demonstr 0.004 | show 0.004 | base 0.005 | nucleotid 0.005 | oxid 0.005 | result 0.005 |
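Table 4. Hot/Cold topics.

| Topic No. | Coefficient | p-Value | Hot/Cold |
| --- | --- | --- | --- |
| 0 | 5.525 | 0.0000 | hot |
| 1 | −2.538 | 0.0055 | - |
| 2 | −0.552 | 0.4843 | - |
| 3 | −2.838 | 0.0000 | cold |
| 4 | −3.934 | 0.0000 | cold |
| 5 | 11.023 | 0.0000 | hot |
| 6 | −2.133 | 0.0072 | - |
| 7 | −0.672 | 0.5599 | - |
| 8 | 0.860 | 0.5855 | - |
| 9 | 7.128 | 0.0000 | hot |
| 10 | 2.361 | 0.0029 | hot |
| 11 | −0.495 | 0.4312 | - |
| 12 | −2.733 | 0.0041 | cold |
| 13 | −3.083 | 0.0012 | cold |
| 14 | −2.704 | 0.0000 | cold |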
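The Hot/Cold column in Table 4 can be read as a rule on the sign of the coefficient and the size of the p-value. The sketch below applies such a rule to the values copied from the table. The cutoff of 0.005 is an assumption chosen only because it reproduces the labels shown; the cutoff actually used is not stated in this part of the article, and `classify_trend` is a hypothetical helper name.

```python
# (topic number, coefficient, p-value) copied from Table 4
TABLE_4 = [
    (0, 5.525, 0.0000), (1, -2.538, 0.0055), (2, -0.552, 0.4843),
    (3, -2.838, 0.0000), (4, -3.934, 0.0000), (5, 11.023, 0.0000),
    (6, -2.133, 0.0072), (7, -0.672, 0.5599), (8, 0.860, 0.5855),
    (9, 7.128, 0.0000), (10, 2.361, 0.0029), (11, -0.495, 0.4312),
    (12, -2.733, 0.0041), (13, -3.083, 0.0012), (14, -2.704, 0.0000),
]

# Assumed significance cutoff; it is consistent with Table 4 but not confirmed.
P_CUTOFF = 0.005


def classify_trend(coefficient, p_value, cutoff=P_CUTOFF):
    """Label a topic 'hot', 'cold', or '-' from its trend coefficient and p-value."""
    if p_value >= cutoff:
        return "-"  # trend not significant under the assumed cutoff
    return "hot" if coefficient > 0 else "cold"


labels = {topic: classify_trend(coef, p) for topic, coef, p in TABLE_4}
hot_topics = sorted(t for t, label in labels.items() if label == "hot")
cold_topics = sorted(t for t, label in labels.items() if label == "cold")
print("hot:", hot_topics)
print("cold:", cold_topics)
```

Under this assumed cutoff, topics 1 and 6 stay unlabeled even though their coefficients are negative, because their p-values (0.0055 and 0.0072) fall just above it.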
Table 5. Topics and Words by Timeframe.

| Year | Title | Words |
| --- | --- | --- |
| 1999~2006 | Biochemistry of Heme complexes | heme, rate, oxid, reaction, electron |
| | Biochemistry of lipids | lipid, membran, bilay, phase, water |
| | Biochemistry of nucleic acids | dna, base, structure, energi, complex |
| | Membrane transporters | cell, complex, site, function, receptor |
| | Protein conformation (NMR studies) | structur, fold, conform, nmr, helix |
| 2007~2011 | Membrane transporters | cell, function, complex, receptor |
| | Aberrant protein aggregation and diseases | form, fibril, aggreg, complex, heme |
| | Biochemistry of nucleic acids | dna, dynam, conform, simul, fold |
| | Biochemistry of lipids | membran, lipid, bilay, water, phase |
| | Calculation of electron transfer | base, proton, electron, energi, transfer |
| 2012~2017 | Calculation of electron transfer | electron, proton, reaction, transfer, complex |
| | Biochemistry of nucleic acids | dna, rna, base, sequenc, complex |
| | Aberrant protein aggregation and diseases | aggreg, fibril, conform, cell, amyloid |
| | Biochemistry of lipids | membran, lipid, bilay, phase |
| | Protein conformation (computational studies) | dynam, simul, conform, water, fold |
| 2018 | Ion channels and receptors | cell, function, receptor, rna, ligand |
| | Biochemistry of lipids | membran, lipid, bilay, phase, system |
| | Biochemistry of nucleic acids | dynam, conform, dna, fold, simul |
| | Energy production in biosystems | water, energi, hydrogen, chain, charg |
| | Aberrant protein aggregation and diseases | aggreg, form, fibril, cell, diseas |

Share and Cite

MDPI and ACS Style

Kang, H.J.; Kim, C.; Kang, K. Analysis of the Trends in Biochemical Research Using Latent Dirichlet Allocation (LDA). Processes 2019, 7, 379. https://0-doi-org.brum.beds.ac.uk/10.3390/pr7060379

AMA Style

Kang HJ, Kim C, Kang K. Analysis of the Trends in Biochemical Research Using Latent Dirichlet Allocation (LDA). Processes. 2019; 7(6):379. https://0-doi-org.brum.beds.ac.uk/10.3390/pr7060379

Chicago/Turabian Style

Kang, Hee Jay, Changhee Kim, and Kyungtae Kang. 2019. "Analysis of the Trends in Biochemical Research Using Latent Dirichlet Allocation (LDA)" Processes 7, no. 6: 379. https://0-doi-org.brum.beds.ac.uk/10.3390/pr7060379

