Big Data Research, Development, and Applications––Big Data 2018

A special issue of Information (ISSN 2078-2489). This special issue belongs to the section "Information Applications".

Deadline for manuscript submissions: closed (1 September 2019) | Viewed by 36443

Special Issue Editor


Guest Editor
College of Computing & Informatics, Drexel University, Philadelphia, PA 19104, USA
Interests: data mining; big data; bioinformatics; rough sets

Special Issue Information

Dear Colleagues,

The IEEE Big Data conference series started in 2013 and has established itself as the top-tier research conference in big data.

The 2018 IEEE International Conference on Big Data (IEEE Big Data 2018) continues the success of the previous IEEE Big Data conferences. It provides a leading forum for disseminating the latest results in big data research, development, and applications.

Researchers presenting at Big Data 2018 are encouraged to submit an extended version of their work to this Special Issue of the journal Information, with a minimum of 50% new content.

Prof. Dr. Xiaohua Tony Hu
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once registered, proceed to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Information is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Published Papers (8 papers)


Research

14 pages, 447 KiB  
Article
Ensemble Classification through Random Projections for Single-Cell RNA-Seq Data
by Aristidis G. Vrahatis, Sotiris K. Tasoulis, Spiros V. Georgakopoulos and Vassilis P. Plagianakos
Information 2020, 11(11), 502; https://0-doi-org.brum.beds.ac.uk/10.3390/info11110502 - 28 Oct 2020
Cited by 6 | Viewed by 2147
Abstract
Nowadays, biomedical data are generated at an exponential rate, creating datasets for analysis with ultra-high dimensionality and complexity. An indicative example is the emerging single-cell RNA-sequencing (scRNA-seq) technology, which isolates and measures individual cells. The analysis of scRNA-seq data constitutes a major challenge because of its ultra-high dimensionality and complexity. In this direction, we study the generalization of MRPV, a recently published ensemble classification algorithm that combines multiple ultra-low-dimensional randomly projected spaces with a voting scheme, and expose its ability to enhance the performance of base classifiers. We show empirically that a reliable ensemble classification technique can be designed using randomly projected subspaces with an extremely small, fixed number of dimensions, without following the restrictions of the classical random projection method. MRPV can therefore perform classification tasks efficiently and rapidly, even for data with extremely high dimensionality. Furthermore, through an experimental analysis on six scRNA-seq datasets, we provide evidence that the most critical advantage of MRPV is the dramatic reduction in data dimensionality, which allows the use of computationally demanding classifiers that would otherwise be impractical in real-life applications. The scalability, simplicity, and capabilities of the proposed framework make it a useful tool for single-cell RNA-seq data, which are characterized by ultra-high dimensionality. A MATLAB implementation of MRPV is available on GitHub.
(This article belongs to the Special Issue Big Data Research, Development, and Applications––Big Data 2018)
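A minimal sketch of the general technique described in the abstract (project the high-dimensional data into several independent, very low-dimensional random subspaces, train a base classifier on each, and combine predictions by majority vote). This is not the authors' MATLAB MRPV implementation; the class name, dimensions, and base classifier below are illustrative assumptions.

```python
# Illustrative sketch of an ensemble over random projections with majority voting.
# NOT the authors' MRPV code (released in MATLAB); it only shows the general idea.
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.tree import DecisionTreeClassifier

def _majority_vote(votes):
    # votes: array of shape (n_members, n_samples) with predicted labels
    out = np.empty(votes.shape[1], dtype=votes.dtype)
    for j in range(votes.shape[1]):
        vals, counts = np.unique(votes[:, j], return_counts=True)
        out[j] = vals[np.argmax(counts)]
    return out

class RandomProjectionEnsemble:
    def __init__(self, n_projections=10, n_components=5, base_cls=DecisionTreeClassifier, random_state=0):
        self.n_projections = n_projections   # number of random subspaces
        self.n_components = n_components     # fixed, very low target dimensionality
        self.base_cls = base_cls
        self.rng = np.random.RandomState(random_state)

    def fit(self, X, y):
        self.members_ = []
        for _ in range(self.n_projections):
            proj = GaussianRandomProjection(n_components=self.n_components,
                                            random_state=self.rng.randint(1 << 30))
            Xp = proj.fit_transform(X)        # ultra-low-dimensional view of the data
            self.members_.append((proj, self.base_cls().fit(Xp, y)))
        return self

    def predict(self, X):
        votes = np.array([clf.predict(proj.transform(X)) for proj, clf in self.members_])
        return _majority_vote(votes)          # majority vote across subspaces

# Usage on synthetic high-dimensional data standing in for an scRNA-seq matrix:
X = np.random.rand(200, 5000)
y = np.random.randint(0, 3, size=200)
print(RandomProjectionEnsemble().fit(X, y).predict(X[:5]))
```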

23 pages, 1571 KiB  
Article
Real-Time Tweet Analytics Using Hybrid Hashtags on Twitter Big Data Streams
by Vibhuti Gupta and Rattikorn Hewett
Information 2020, 11(7), 341; https://0-doi-org.brum.beds.ac.uk/10.3390/info11070341 - 30 Jun 2020
Cited by 13 | Viewed by 5786
Abstract
Twitter is a microblogging platform that generates large volumes of data at high velocity. This daily generation of unbounded and continuous data leads to Big Data streams that often require real-time, distributed, and fully automated processing. Hashtags, hyperlinked words in tweets, are widely used for tweet topic classification, retrieval, and clustering. They are also widely used for analyzing tweet sentiments, where emotions can be classified without context. However, despite this wide usage, general tweet topic classification using hashtags is challenging due to their evolving nature, lack of context, slang, abbreviations, and non-standardized expression by users. Most existing approaches that utilize hashtags for tweet topic classification focus on extracting hashtag concepts from external lexicon resources to derive semantics. However, due to the rapid evolution and non-standardized expression of hashtags, the majority of these lexicon resources either lack hashtag words in their knowledge bases or use multiple resources at once to derive semantics, which makes them unscalable. Along with scalable and automated techniques for tweet topic classification using hashtags, real-time analytics approaches are also required to handle the huge and dynamic flows of textual streams generated by Twitter. To address these problems, this paper first presents a novel semi-automated technique that derives semantically relevant hashtags using a domain-specific knowledge base of topic concepts and combines them with the existing tweet-based hashtags to produce hybrid hashtags. Further, to deal with the speed and volume of Big Data streams of tweets, we present an online approach that updates the preprocessing and learning model incrementally in a real-time streaming environment using the distributed framework Apache Storm. Finally, to fully exploit the performance advantages of both batch and stream environments, we propose a comprehensive Hybrid Hashtag-based Tweet topic classification (HHTC) framework that combines batch and online mechanisms in the most effective way. Extensive experimental evaluations on a large volume of Twitter data show that the batch and online mechanisms, along with their combination in the proposed framework, are scalable and efficient, and provide effective tweet topic classification using hashtags.
(This article belongs to the Special Issue Big Data Research, Development, and Applications––Big Data 2018)
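A simplified stand-in for the online component described above: incremental (mini-batch) learning over a stream of tweets with hashed text features. It does not reproduce the HHTC framework or its Apache Storm topology; the topic labels and the `enrich_with_hybrid_hashtags` placeholder are assumptions for illustration only.

```python
# Incremental learning over a toy tweet stream; a simplified stand-in for the
# online part of hashtag-based tweet topic classification (not the paper's HHTC
# framework or its Apache Storm topology).
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

CLASSES = ["sports", "politics", "technology"]        # assumed topic labels
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = SGDClassifier()                                  # supports partial_fit

def enrich_with_hybrid_hashtags(tweet_text):
    # Hypothetical placeholder: in the paper, tweet hashtags are combined with
    # semantically related concepts from a domain-specific knowledge base.
    return tweet_text

def process_minibatch(tweets, labels):
    """Update the model incrementally with one mini-batch from the stream."""
    texts = [enrich_with_hybrid_hashtags(t) for t in tweets]
    X = vectorizer.transform(texts)                    # stateless, stream-friendly
    clf.partial_fit(X, labels, classes=CLASSES)

# Usage with a toy "stream" of mini-batches:
stream = [
    (["#NBA finals tonight", "#election debate live"], ["sports", "politics"]),
    (["new #AI chip announced", "#WorldCup qualifiers"], ["technology", "sports"]),
]
for tweets, labels in stream:
    process_minibatch(tweets, labels)
print(clf.predict(vectorizer.transform(["#NBA game recap"])))
```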

26 pages, 353 KiB  
Article
Dramatically Reducing Search for High Utility Sequential Patterns by Maintaining Candidate Lists
by Scott Buffett
Information 2020, 11(1), 44; https://0-doi-org.brum.beds.ac.uk/10.3390/info11010044 - 15 Jan 2020
Viewed by 2138
Abstract
A ubiquitous challenge throughout all areas of data mining, particularly in the mining of frequent patterns in large databases, is the necessity to reduce the time and space required to perform the search. The extent of this reduction proportionally facilitates the ability to identify patterns of interest. High utility sequential pattern mining (HUSPM) seeks to identify frequent patterns that are (1) sequential in nature and (2) hold a significant magnitude of utility in a sequence database, by considering the aspect of item value or importance. While traditional sequential pattern mining relies on the downward closure property to significantly reduce the required search space, with HUSPM this property does not hold. To address this drawback, an approach is proposed that establishes a tight upper bound on the utility of future candidate sequential patterns by maintaining a list of items that are deemed potential candidates for concatenation. Such candidates are provably the only items that are ever needed for any extension of a given sequential pattern or its descendants in the search tree. This list is then exploited to further tighten the upper bound on the utilities of descendant patterns. An extension of this work is then proposed that significantly reduces the computational cost of updating database utilities each time a candidate item is removed from the list, resulting in a massive reduction in the number of candidate sequential patterns that need to be generated in the search. Sequential pattern mining methods implementing these new techniques for bound reduction and further candidate list reduction are demonstrated via the introduction of the CRUSP and CRUSPPivot algorithms, respectively. Validation of the techniques was conducted on six public datasets. Tests show that use of the CRUSP algorithm results in a significant reduction in the overall number of candidate sequential patterns that need to be considered, and subsequently a significant reduction in run time, when compared to the current state of the art in bounding techniques. When employing the CRUSPPivot algorithm, the further reduction in the size of the search space was dramatic, while the reduction in run time ranged from moderate to dramatic, depending on the dataset. Demonstrating the practical significance of the work, experiments showed that the time required for one particularly complex dataset was reduced from many hours to less than one minute.
(This article belongs to the Special Issue Big Data Research, Development, and Applications––Big Data 2018)
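A toy illustration of the general principle of bounding and pruning in HUSPM, using the classic sequence-weighted utility (SWU) upper bound: an item whose SWU falls below the minimum utility threshold can never appear in any high utility sequential pattern and can be dropped from the candidate list. The paper's CRUSP and CRUSPPivot algorithms maintain much tighter, per-pattern candidate lists; the data below are made up for illustration.

```python
# Candidate pruning with the sequence-weighted utility (SWU) upper bound.
# Illustrates the general pruning principle only, not the CRUSP/CRUSPPivot bounds.
from collections import defaultdict

# Each sequence is a list of (item, utility) events; utilities are illustrative.
sequence_db = [
    [("a", 5), ("b", 2), ("c", 1)],
    [("b", 4), ("c", 3)],
    [("a", 1), ("c", 2), ("d", 6)],
]

def swu_per_item(db):
    """SWU(i) = sum of the total utility of every sequence that contains i."""
    swu = defaultdict(int)
    for seq in db:
        seq_utility = sum(u for _, u in seq)
        for item in {i for i, _ in seq}:
            swu[item] += seq_utility
    return dict(swu)

def prune_items(db, min_utility):
    """Items whose SWU is below min_utility cannot occur in any high utility
    sequential pattern, so they are removed from the candidate list."""
    swu = swu_per_item(db)
    return {i for i, s in swu.items() if s >= min_utility}

print(swu_per_item(sequence_db))      # {'a': 17, 'b': 15, 'c': 24, 'd': 9}
print(prune_items(sequence_db, 10))   # {'a', 'b', 'c'}  ('d' is pruned)
```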

22 pages, 1788 KiB  
Article
The Construction of the Past: Towards a Theory for Knowing the Past
by Kenneth Thibodeau
Information 2019, 10(11), 332; https://0-doi-org.brum.beds.ac.uk/10.3390/info10110332 - 28 Oct 2019
Cited by 11 | Viewed by 3604
Abstract
This paper presents Constructed Past Theory, an epistemological theory about how we come to know things that happened or existed in the past. The theory is expounded both in text and in a formal model comprising UML class diagrams. The ideas presented here have been developed over a half century of experience as a practitioner in the management of information and automated systems in the US government and as a researcher in several collaborations, notably the four international and multidisciplinary InterPARES projects. This work is part of a broader initiative that provides a conceptual framework for reformulating the concepts and theories of archival science in order to enable a new discipline whose assertions are empirically and, wherever possible, quantitatively testable. The new discipline, called archival engineering, is intended to provide an appropriate, coherent foundation for the development of systems and applications for managing, preserving, and providing access to digital information, development that is necessitated by the exponential growth and explosive diversification of data recorded in digital form and the use of digital data in an ever-increasing variety of domains. Both the text and the model are an initial exposition of the theory, which requires and invites further development.
(This article belongs to the Special Issue Big Data Research, Development, and Applications––Big Data 2018)

16 pages, 388 KiB  
Article
Impact of Information Sharing and Forecast Combination on Fast-Moving-Consumer-Goods Demand Forecast Accuracy
by Dazhi Yang and Allan N. Zhang
Information 2019, 10(8), 260; https://0-doi-org.brum.beds.ac.uk/10.3390/info10080260 - 16 Aug 2019
Cited by 8 | Viewed by 4698
Abstract
This article empirically demonstrates the impacts of truthfully sharing forecast information and using forecast combinations in a fast-moving-consumer-goods (FMCG) supply chain. Although it is known a priori that sharing information improves the overall efficiency of a supply chain, information such as pricing or promotional strategy is often kept proprietary for competitive reasons. In this regard, it is shown that simply sharing the retail-level forecasts, which does not reveal the exact business strategy due to the effect of omni-channel sales, yields nearly all the benefits of sharing all pertinent information that influences FMCG demand. In addition, various forecast combination methods are used to further stabilize the forecasts in situations where multiple forecasting models are used during operation. In other words, it is shown that combining forecasts is less risky than "betting" on any single component model.
(This article belongs to the Special Issue Big Data Research, Development, and Applications––Big Data 2018)
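A small sketch of the forecast-combination point made in the abstract: an equal-weight average of two imperfect component forecasts tends to be less risky than committing to either one. The synthetic weekly demand series and the two naive component models below are stand-ins, not the paper's FMCG data or forecasting methods.

```python
# Equal-weight forecast combination on a synthetic demand series.
# Not the paper's data or models; for illustration of the general point only.
import numpy as np

rng = np.random.default_rng(42)
t = np.arange(104)                                   # ~2 years of weekly demand
demand = 100 + 0.3 * t + 15 * np.sin(2 * np.pi * t / 52) + rng.normal(0, 5, t.size)

horizon = 12
train, test = demand[:-horizon], demand[-horizon:]

# Two deliberately imperfect component forecasts:
f_naive = np.repeat(train[-1], horizon)              # last-value (naive) forecast
coef = np.polyfit(np.arange(train.size), train, 1)   # linear trend, ignores seasonality
f_trend = np.polyval(coef, np.arange(train.size, train.size + horizon))

f_combo = 0.5 * f_naive + 0.5 * f_trend              # equal-weight combination

def mae(forecast):
    return np.mean(np.abs(test - forecast))

print(f"naive: {mae(f_naive):.2f}  trend: {mae(f_trend):.2f}  combo: {mae(f_combo):.2f}")
```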

15 pages, 1760 KiB  
Article
Aggregation of Linked Data in the Cultural Heritage Domain: A Case Study in the Europeana Network
by Nuno Freire, René Voorburg, Roland Cornelissen, Sjors de Valk, Enno Meijers and Antoine Isaac
Information 2019, 10(8), 252; https://0-doi-org.brum.beds.ac.uk/10.3390/info10080252 - 30 Jul 2019
Cited by 23 | Viewed by 5622
Abstract
Online cultural heritage resources are widely available through digital libraries maintained by numerous organizations. In order to improve discoverability in cultural heritage, the typical approach is metadata aggregation, a method in which centralized efforts such as Europeana improve discoverability by collecting resource metadata. The redefinition of the traditional data models for cultural heritage resources into data models based on semantic technology has been a major activity of the cultural heritage community. Yet linked data may bring new innovation opportunities for cultural heritage metadata aggregation. We present the outcomes of a case study that we conducted within the Europeana cultural heritage network. In this study, the National Library of The Netherlands took on the role of data provider, while the Dutch Digital Heritage Network acted as an intermediary aggregator that aggregates datasets and provides them to Europeana, the central aggregator. We identified and analyzed the requirements for an aggregation solution for linked data, guided by the current aggregation practices of the Europeana network. These requirements guided the definition of a workflow that fulfils the same functional requirements as the existing one. The workflow was put into practice within this study and led to the development of software applications for administrating datasets, crawling the web of data, harvesting linked data, data analysis, and data integration. We present our analysis of the study outcomes and assess the effort necessary, in terms of technology adoption, to establish a linked data approach from the point of view of both data providers and aggregators. We also present the expertise requirements we identified for cultural heritage data analysts, as well as the supporting tools that had to be designed specifically for semantic data.
(This article belongs to the Special Issue Big Data Research, Development, and Applications––Big Data 2018)
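A minimal sketch of one step in a linked data aggregation workflow of the kind described above: harvesting an RDF resource from the web of data and extracting descriptive metadata with SPARQL. This is not the software developed in the study; the example URI and the choice of Dublin Core properties are illustrative assumptions.

```python
# Harvesting an RDF resource and querying it with SPARQL using rdflib.
# Illustrative only; not the aggregation software developed in the study.
from rdflib import Graph

EXAMPLE_RESOURCE = "https://example.org/heritage/object/123"   # hypothetical URI

def harvest_metadata(resource_uri):
    g = Graph()
    # rdflib negotiates and parses the RDF serialization (RDF/XML, Turtle, JSON-LD, ...)
    g.parse(resource_uri)
    query = """
        PREFIX dc: <http://purl.org/dc/elements/1.1/>
        SELECT ?title ?creator WHERE {
            ?s dc:title ?title .
            OPTIONAL { ?s dc:creator ?creator }
        }
    """
    return [(str(row.title), str(row.creator) if row.creator else None)
            for row in g.query(query)]

if __name__ == "__main__":
    for title, creator in harvest_metadata(EXAMPLE_RESOURCE):
        print(title, "|", creator)
```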

17 pages, 5061 KiB  
Article
Hadoop Performance Analysis Model with Deep Data Locality
by Sungchul Lee, Ju-Yeon Jo and Yoohwan Kim
Information 2019, 10(7), 222; https://0-doi-org.brum.beds.ac.uk/10.3390/info10070222 - 27 Jun 2019
Cited by 6 | Viewed by 7487
Abstract
Background: Hadoop has become the base framework for big data systems, built on the simple principle that moving computation is cheaper than moving data. Hadoop increases data locality in the Hadoop Distributed File System (HDFS) to improve the performance of the system: network traffic among the nodes of a big data system is reduced when more tasks read their data locally on the machine where they run. Previous research increased data locality in one of the MapReduce stages to improve Hadoop performance. However, there is currently no mathematical performance model for data locality in Hadoop. Methods: This study developed a Hadoop performance analysis model with data locality for analyzing the entire MapReduce process. The paper explains the data locality concept in the map stage and the shuffle stage, and shows how to apply the performance analysis model to improve the Hadoop system by establishing deep data locality. Results: The benefit of deep data locality for Hadoop performance was demonstrated through three tests: a simulation-based test, a cloud test, and a physical test. According to these tests, the authors improved Hadoop system performance by over 34% by using deep data locality. Conclusions: Deep data locality improved Hadoop performance by reducing data movement in HDFS.
(This article belongs to the Special Issue Big Data Research, Development, and Applications––Big Data 2018)
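A back-of-the-envelope illustration of why higher data locality shortens a MapReduce job: node-local map input is read from disk, while the remainder must be pulled over the slower network. This is not the paper's performance analysis model; the bandwidth figures, slot count, and formula below are rough assumptions for illustration.

```python
# Toy estimate of map-phase time as a function of the fraction of data-local tasks.
# All parameters are illustrative assumptions, not the paper's analysis model.
def map_phase_time(total_gb, locality, disk_gbps=0.5, net_gbps=0.1, map_slots=40):
    """Rough map-phase time (seconds) for a given fraction of data-local tasks."""
    local_gb = total_gb * locality
    remote_gb = total_gb * (1 - locality)
    read_time = local_gb / disk_gbps + remote_gb / net_gbps   # aggregate I/O time
    return read_time / map_slots                              # spread over parallel slots

for locality in (0.5, 0.7, 0.9, 1.0):
    print(f"locality={locality:.0%}: ~{map_phase_time(1000, locality):.0f} s")
```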

20 pages, 4230 KiB  
Article
Performance Comparing and Analysis for Slot Allocation Model
by ZhiJian Ye, YanWei Li, JingTing Bai and XinXin Zheng
Information 2019, 10(6), 188; https://0-doi-org.brum.beds.ac.uk/10.3390/info10060188 - 31 May 2019
Cited by 1 | Viewed by 4426
Abstract
The purpose of this study is to ascertain whether implementation difficulty can be used in a slot allocation model as a new mechanism for slightly weakening grandfather rights. To this end, a linear integer programming model is designed to compare and analyze displacement, implementation difficulty, and priority under different weights. Test results show that implementation difficulty can be significantly reduced, without causing excessive displacement or disrupting existing priorities, by appropriate weight setting while the declared capacity is cleared. In addition, whether or not the movements are listed in order of descending priority has a great impact on displacement and implementation difficulty within the slot allocation model. Capacity is also a key factor affecting displacement and implementation difficulty. The study proposes a new mechanism for slightly weakening grandfather rights, which can help decision makers to improve slot allocation policies.
(This article belongs to the Special Issue Big Data Research, Development, and Applications––Big Data 2018)
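A toy integer programming sketch of the slot allocation setting: assign each requested movement to a slot while minimizing total displacement from the requested time, subject to declared capacity. It omits the paper's implementation-difficulty and priority-weighting terms; the slot counts, capacities, and flight requests are made-up data, and PuLP is used only as a convenient solver interface.

```python
# Toy slot allocation as a linear integer program with PuLP.
# Omits the paper's implementation difficulty and priority terms; data are assumptions.
import pulp

slots = list(range(6))                      # e.g. six hourly slots
capacity = 2                                # declared capacity per slot
requests = {"FL1": 0, "FL2": 0, "FL3": 0, "FL4": 2, "FL5": 2}   # requested slot index

prob = pulp.LpProblem("slot_allocation", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (requests, slots), cat="Binary")  # x[f][s] = 1 if flight f gets slot s

# Objective: total displacement (number of slots moved away from the requested time).
prob += pulp.lpSum(abs(s - requests[f]) * x[f][s] for f in requests for s in slots)

# Each movement gets exactly one slot; each slot respects the declared capacity.
for f in requests:
    prob += pulp.lpSum(x[f][s] for s in slots) == 1
for s in slots:
    prob += pulp.lpSum(x[f][s] for f in requests) <= capacity

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for f in requests:
    assigned = next(s for s in slots if pulp.value(x[f][s]) > 0.5)
    print(f, "requested", requests[f], "-> assigned", assigned)
```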
