Biological Knowledge Discovery from Big Data

A special issue of Algorithms (ISSN 1999-4893). This special issue belongs to the section "Databases and Data Structures".

Deadline for manuscript submissions: closed (10 February 2021)

Special Issue Editors

Institute for Informatics and Telematics (IIT), CNR, I-56124 Pisa, Italy
Interests: algorithms & data structures; bioinformatics & computational biology; machine learning; scalable data mining
1. The Laboratory of Technologies of Information and Communication, and Electrical Engineering (LaTICE), National Higher School of Engineers of Tunis (ENSIT), University of Tunis, Tunisia
2. Faculty of Economic Sciences and Management of Tunis (FSEGT), University of Tunis-El Manar, Tunisia
Interests: algorithmics; bioinformatics; knowledge discovery and data mining

Special Issue Information

Dear Colleagues,

In recent years, biological technologies have developed rapidly and produce ever more biological data, i.e., data related to biological macromolecules (DNA, RNA, and proteins). The rise of next-generation sequencing (NGS) technologies, also known as high-throughput sequencing technologies, has contributed substantially to this deluge of data. In general, these data are big, heterogeneous, complex, and distributed worldwide across databases. Analyzing biological big data is a challenging task, not only because of its complexity and its many correlated factors, but also because our understanding of the underlying biological mechanisms keeps evolving. Classical approaches to biological data analysis are no longer efficient and yield only a very limited amount of information compared with the numerous and complex biological mechanisms under study. Hence the need to adopt new computational tools and develop new in silico high-performance approaches that support the analysis of biological big data and thereby help us understand the correlations between structures and functional patterns in biological macromolecules, on the one hand, and genetic and biochemical mechanisms, on the other. Biological Knowledge Discovery from Big Data (BIOKDD) is a response to these new trends.

Researchers are encouraged to submit original research contributions in all major areas, which include, but are not limited to:

Data Preprocessing: biological big data storage, representation and management (e.g., data warehouses, databases, sequences, trees, graphs, biological networks and pathways), biological big data cleaning (e.g., error removal, redundant data removal, completion of missing data), feature extraction (e.g., motifs, subgraphs), feature selection (e.g., filter approaches, wrapper approaches, hybrid approaches, embedded approaches).

Data Mining: biological big data regression (e.g., regression of biological sequences), biological big data clustering/biclustering (e.g., microarray data biclustering, clustering/biclustering of biological sequences), biological big data classification (e.g., classification of biological sequences), association rule learning from biological big data, text mining and its application to biological sequences, web mining and its application to biological big data, parallel, cloud and grid computing for scalable biological big data mining.

Data Postprocessing: biological nuggets of knowledge filtering, biological nuggets of knowledge representation and visualization, biological nuggets of knowledge evaluation (e.g., calculation of the classification error rate, evaluation of the association rules via numerical indicators or measurements of interest), biological nuggets of knowledge integration.

The topics of interest to this Special Issue cover the scope of the International Workshop on Biological Knowledge Discovery from Big Data (BIOKDD 2019) (http://www.dexa.org/biokdd2019). Extended versions of selected papers presented at BIOKDD 2019 are invited.

Dr. Davide Verzotto
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Algorithms is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Published Papers (4 papers)


Research

10 pages, 1283 KiB  
Article
MultiKOC: Multi-One-Class Classifier Based K-Means Clustering
by Loai Abdallah, Murad Badarna, Waleed Khalifa and Malik Yousef
Algorithms 2021, 14(5), 134; https://0-doi-org.brum.beds.ac.uk/10.3390/a14050134 - 23 Apr 2021
Abstract
In the computational biology community, many biological problems are treated as multi-one-class classification problems. Examples include the classification of multiple tumor types, protein fold recognition, and the molecular classification of multiple cancer types. In all of these cases, appropriately characterized real-world negative cases or outliers are impractical to obtain, and the positive cases might consist of different clusters, which in turn might degrade accuracy. In this paper, we present a novel algorithm named MultiKOC (multi-one-class classifiers based on K-means) to deal with this problem. The main idea is to run a clustering algorithm over the positive samples to capture the hidden subdata of the given positive data, and then to build a one-class classifier for each cluster's examples separately; in other words, to train the OC classifier on each piece of subdata. For a given new sample, the generated classifiers are applied. If it is rejected by all of those classifiers, the sample is considered negative; otherwise, it is positive. The results of MultiKOC are compared with the traditional one-class, multi-one-class, ensemble one-class, and two-class methods, yielding a significant improvement over the one-class methods and performance similar to the two-class methods.
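The cluster-then-classify idea in this abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `SphereOneClass` model (accept anything within a slightly enlarged training radius of the cluster centroid), the `slack` parameter, and all function names are assumptions made for the example; the paper's actual one-class classifier is not specified in the abstract.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the non-empty clusters as lists of points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return [cl for cl in clusters if cl]


class SphereOneClass:
    """Hypothetical stand-in one-class model: a sphere around the cluster centroid."""

    def __init__(self, samples, slack=1.1):
        self.center = tuple(sum(c) / len(samples) for c in zip(*samples))
        self.radius = slack * max(math.dist(s, self.center) for s in samples)

    def accepts(self, x):
        return math.dist(x, self.center) <= self.radius


def train_multi_one_class(positives, k):
    """Cluster the positive samples, then fit one one-class model per cluster."""
    return [SphereOneClass(cluster) for cluster in kmeans(positives, k)]


def predict(models, x):
    """Positive if any per-cluster model accepts x; negative only if all reject it."""
    return any(m.accepts(x) for m in models)


# Two well-separated groups of positive samples (toy data).
positives = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
models = train_multi_one_class(positives, k=2)
```

With this toy data, a point between the two groups is rejected by both per-cluster models, whereas a single sphere fitted to all positives would have to cover the gap between them.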
(This article belongs to the Special Issue Biological Knowledge Discovery from Big Data)

15 pages, 3549 KiB  
Article
Classification of Precursor MicroRNAs from Different Species Based on K-mer Distance Features
by Malik Yousef and Jens Allmer
Algorithms 2021, 14(5), 132; https://0-doi-org.brum.beds.ac.uk/10.3390/a14050132 - 22 Apr 2021
Abstract
MicroRNAs (miRNAs) are short RNA sequences that are actively involved in gene regulation. These regulators on the post-transcriptional level have been discovered in virtually all eukaryotic organisms. Additionally, miRNAs seem to exist in viruses and might also be produced in microbial pathogens. Initially, transcribed RNA is cleaved by Drosha, producing precursor miRNAs. We have previously shown that it is possible to distinguish between microRNA precursors of different clades by representing the sequences in a k-mer feature space. The k-mer representation considers the frequency of a k-mer in the given sequence. We further hypothesized that the relationship between k-mers (e.g., distance between k-mers) could be useful for classification. Three different distance-based features were created, tested, and compared. The three feature sets were entitled inter k-mer distance, k-mer location distance, and k-mer first–last distance. Here, we show that classification performance above 80% (depending on the evolutionary distance) is possible with a combination of distance-based and regular k-mer features. With these novel features, classification at closer evolutionary distances is better than using k-mers alone. Combining the features leads to accurate classification for larger evolutionary distances. For example, categorizing Homo sapiens versus Brassicaceae leads to an accuracy of 93%. When considering average accuracy, the novel distance-based features lead to an overall increase in effectiveness. In contrast, secondary-structure-based features did not lead to any effective separation among clades in this study. With this line of research, we support the differentiation between true and false miRNAs detected from next-generation sequencing data, provide an additional viewpoint for confirming miRNAs when the species of origin is known, and open up a new strategy for analyzing miRNA evolution.
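As a rough illustration of the feature types named in this abstract, the sketch below computes k-mer frequencies over overlapping windows and one possible distance-based feature. The abstract does not define the three distance features, so `first_last_distance` is only an assumed reading of "k-mer first–last distance" (the gap between the first and last start position of each k-mer); the function names are hypothetical.

```python
from collections import Counter


def kmer_frequencies(seq, k):
    """Relative frequency of each k-mer over all overlapping windows of seq."""
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def first_last_distance(seq, k):
    """Assumed feature: last start position minus first start position per k-mer.

    A k-mer that occurs only once gets distance 0.
    """
    first, last = {}, {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        first.setdefault(kmer, i)  # remember only the first occurrence
        last[kmer] = i             # overwritten until the last occurrence
    return {kmer: last[kmer] - first[kmer] for kmer in first}


freqs = kmer_frequencies("ACGAC", 2)
gaps = first_last_distance("ACGAC", 2)
```

In a classifier, each sequence would typically be turned into a fixed-length vector by looking up these values for every possible k-mer over the alphabet and using 0 for absent ones.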

13 pages, 597 KiB  
Article
Molecular Subtyping and Outlier Detection in Human Disease Using the Paraclique Algorithm
by Ronald D. Hagan and Michael A. Langston
Algorithms 2021, 14(2), 63; https://0-doi-org.brum.beds.ac.uk/10.3390/a14020063 - 19 Feb 2021
Abstract
Recent discoveries of distinct molecular subtypes have led to remarkable advances in treatment for a variety of diseases. While subtyping via unsupervised clustering has received a great deal of interest, most methods rely on basic statistical or machine learning methods. At the same time, techniques based on graph clustering, particularly clique-based strategies, have been successfully used to identify disease biomarkers and gene networks. A graph theoretical approach based on the paraclique algorithm is described that can easily be employed to identify putative disease subtypes and serve as an aid in outlier detection as well. The feasibility and potential effectiveness of this method are demonstrated on publicly available gene co-expression data derived from patient samples covering twelve different disease families.

15 pages, 1372 KiB  
Article
An Investigation of Alternatives to Transform Protein Sequence Databases to a Columnar Index Schema
by Roman Zoun, Kay Schallert, David Broneske, Ivayla Trifonova, Xiao Chen, Robert Heyer, Dirk Benndorf and Gunter Saake
Algorithms 2021, 14(2), 59; https://0-doi-org.brum.beds.ac.uk/10.3390/a14020059 - 11 Feb 2021
Abstract
Mass spectrometers enable identifying proteins in biological samples, leading to biomarkers for biological process parameters and diseases. However, bioinformatic evaluation of the mass spectrometer data needs a standardized workflow and a system that stores the protein sequences. Due to their standardization and maturity, relational systems are a great fit for storing protein sequences. Hence, in this work, we present a schema for distributed column-based database management systems using a column-oriented index to store sequence data. In order to achieve a high storage performance, it was necessary to choose a well-performing strategy for transforming the protein sequence data from the FASTA format to the new schema. Therefore, we applied an in-memory map, HDDmap, database engine, and extended radix tree and evaluated their performance. The results show that our proposed extended radix tree performs best regarding memory consumption and runtime. Hence, the radix tree is a suitable data structure for transforming protein sequences into the indexed schema.
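For readers unfamiliar with the data structure, a plain radix tree (compressed trie) over sequence strings might look like the sketch below. It shows only the textbook structure, in which shared prefixes are stored once on an edge; the authors' extended radix tree and their FASTA-to-schema transformation are not described in the abstract and are not reproduced here. All names are hypothetical.

```python
class RadixNode:
    def __init__(self):
        self.children = {}     # edge label (non-empty str) -> RadixNode
        self.terminal = False  # True if a stored sequence ends at this node


def insert(root, word):
    """Insert word, splitting an edge where word diverges from its label."""
    node = root
    while True:
        for label, child in list(node.children.items()):
            # Length of the common prefix of the edge label and the remaining word.
            n = 0
            while n < min(len(label), len(word)) and label[n] == word[n]:
                n += 1
            if n == 0:
                continue
            if n < len(label):
                # Split the edge: keep the shared prefix, push the rest down.
                mid = RadixNode()
                mid.children[label[n:]] = child
                del node.children[label]
                node.children[label[:n]] = mid
                child = mid
            node, word = child, word[n:]
            break
        else:
            # No edge shares a prefix with the remaining word.
            if word:
                leaf = RadixNode()
                leaf.terminal = True
                node.children[word] = leaf
            else:
                node.terminal = True
            return
        if not word:
            node.terminal = True
            return


def contains(root, word):
    """Return True if exactly this word was inserted."""
    node = root
    while word:
        for label, child in node.children.items():
            if word.startswith(label):
                node, word = child, word[len(label):]
                break
        else:
            return False
    return node.terminal


root = RadixNode()
for peptide in ["MKVL", "MKAL", "MK", "GAVL"]:  # toy sequences, not real data
    insert(root, peptide)
```

Here the prefix "MK" is stored on a single edge shared by three of the toy sequences, which is the property that makes radix trees attractive for large sets of sequences with common prefixes.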