Big Data Management and Analysis with Distributed or Cloud Computing

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (31 October 2022) | Viewed by 8530

Special Issue Editors


Prof. Dr. Hyuk-Yoon Kwon
Guest Editor
Department of Industrial Engineering, Seoul National University of Science and Technology, 232 Gongneung-Ro, Nowon-Gu, Seoul 01811, Korea
Interests: big data management; data science; distributed computing; cloud computing; databases; machine learning; web crawler

Dr. Kisung Lee
Guest Editor
Division of Computer Science and Engineering, Louisiana State University, Baton Rouge, LA 70803, USA
Interests: scaling big data analytics; cloud computing; distributed computing systems; mobile computing and spatial data management; distributed data-intensive systems; social network analytics

Special Issue Information

Dear Colleagues,

The volume of data from a variety of sources is exploding at an increasingly rapid rate. As a result, it is essential to collect data from multiple sources and to manage such massive amounts of data in distributed environments, whether on-premise data servers or cloud services such as Amazon AWS, MS Azure, or Google GCP. To address big data challenges, distributed computing techniques have been developed on top of general-purpose big data frameworks providing distributed data ingestion (e.g., Apache Flume), distributed file systems (e.g., HDFS), MapReduce-like computation models (e.g., Apache Spark), and large-scale stream processing (e.g., Apache Kafka). Recently, container-orchestration frameworks (e.g., Kubernetes) have also been widely applied to provide big data services.
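To make this stack concrete, the following is a minimal sketch (not drawn from any paper in this Special Issue) of a Spark job that reads log files already ingested into HDFS and aggregates them in a MapReduce style; the HDFS path and log layout are assumptions for illustration only.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-log-aggregation").getOrCreate()

    # Read raw log lines from an (assumed) HDFS directory populated by a
    # collector such as Apache Flume.
    logs = spark.sparkContext.textFile("hdfs:///data/ingested/logs/*")

    # Map each line to (status_code, 1) and reduce by key -- the classic
    # MapReduce pattern expressed as Spark RDD operations.
    status_counts = (
        logs.map(lambda line: (line.split(" ")[-1], 1))
            .reduceByKey(lambda a, b: a + b)
    )

    for status, count in status_counts.collect():
        print(status, count)

    spark.stop()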

Given the necessity of distributed or cloud computing, traditional approaches to the ingestion, management, and analysis of various data types, such as geo-spatial data, images, and log data, need to be adapted to distributed or cloud environments. Machine learning and deep learning techniques likewise need to consider scalability across multiple distributed or partitioned models. Federated learning is a representative example, in which a single model is distributed into multiple local models that are then aggregated into a global model.
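As a rough illustration of the federated learning idea just described, the sketch below performs FedAvg-style aggregation: each client trains a local linear model on its own data and the server averages the resulting parameters, weighted by client data size. All data, model shapes, and hyperparameters are synthetic assumptions, not tied to any specific framework.

    import numpy as np

    def local_update(global_w, X, y, lr=0.1, epochs=5):
        """One client's local training: a few gradient steps of linear regression."""
        w = global_w.copy()
        for _ in range(epochs):
            grad = 2 * X.T @ (X @ w - y) / len(y)
            w -= lr * grad
        return w

    def federated_average(local_ws, sizes):
        """Server-side aggregation: size-weighted average of local parameters."""
        total = sum(sizes)
        return sum(w * (n / total) for w, n in zip(local_ws, sizes))

    rng = np.random.default_rng(0)
    true_w = np.array([1.0, -2.0, 0.5])               # synthetic ground truth
    clients = []
    for _ in range(4):                                # four hypothetical clients
        X = rng.normal(size=(50, 3))
        clients.append((X, X @ true_w + 0.1 * rng.normal(size=50)))

    global_w = np.zeros(3)
    for _ in range(10):                               # communication rounds
        local_ws = [local_update(global_w, X, y) for X, y in clients]
        global_w = federated_average(local_ws, [len(y) for _, y in clients])
    print(global_w)                                   # approaches true_w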

In this Special Issue, we focus on big data management and analysis that require distributed or cloud computing. Topics of interest include, but are not limited to, the following:

  • Big data management and analysis in distributed or cloud environments
  • Data sciences with distributed or cloud computing
  • Distributed data collection and ingestion
  • Machine learning or deep learning-based data analysis with distributed computing
  • Federated learning for distributed computing
  • Multi-modal data analysis
  • Distributed container-based computing
  • Distributed sensor data integration and analysis

Prof. Dr. Hyuk-Yoon Kwon
Dr. Kisung Lee
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Big data management and analysis
  • Data sciences
  • Distributed and cloud computing
  • Federated learning
  • Multi-modal data analysis
  • Distributed data ingestion
  • Distributed containers

Published Papers (4 papers)

Research

20 pages, 8464 KiB  
Article
A SqueeSAR Spatially Adaptive Filtering Algorithm Based on Hadoop Distributed Cluster Environment
by Yongning Li, Weiwei Song, Baoxuan Jin, Xiaoqing Zuo, Yongfa Li and Kai Chen
Appl. Sci. 2023, 13(3), 1869; https://doi.org/10.3390/app13031869 - 31 Jan 2023
Cited by 1 | Viewed by 1243
Abstract
Multi-temporal interferometric synthetic aperture radar (MT-InSAR) techniques analyze a study area using a time series of SAR images and can reach millimeter-level surface subsidence accuracy. To effectively acquire subsidence information in low-coherence areas without obvious features, such as non-urban areas, the MT-InSAR technique SqueeSAR improves the density of subsidence points in the study area by fusing distributed scatterers (DS). However, SqueeSAR filters the DS points individually during spatially adaptive filtering, which requires significant computer memory, leads to low processing efficiency, and poses great challenges for large-area InSAR processing. We propose a spatially adaptive filtering parallelization strategy based on the Spark distributed computing engine in a Hadoop distributed cluster environment, which splits the DS pixel data across different computing nodes for parallel processing and effectively improves the filtering algorithm's performance. To evaluate the effectiveness and accuracy of the proposed method, we conducted a performance evaluation and accuracy verification in and around the main city of Kunming using original Sentinel-1A SLC data provided by ESA. Parallel computation on a YARN cluster comprising three computing nodes improved the performance of the filtering algorithm by a factor of 2.15 without affecting the filtering accuracy.
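As a conceptual sketch of the parallelization pattern this abstract describes (not the authors' implementation), the snippet below distributes per-pixel DS time series across Spark partitions and applies a placeholder filter to each pixel independently; the synthetic data and the trivial filter are assumptions.

    import numpy as np
    from pyspark.sql import SparkSession

    def adaptive_filter(pixel_series):
        # Placeholder for the spatially adaptive filtering of one DS pixel's
        # interferometric time series (the real algorithm is far more involved).
        return float(np.median(pixel_series))

    spark = SparkSession.builder.appName("ds-filtering-sketch").getOrCreate()

    # Synthetic stand-in for DS pixel data: 10,000 pixels x 30 acquisitions.
    pixels = [np.random.rand(30) for _ in range(10_000)]

    # Split the pixels across partitions so each worker filters its slice in parallel.
    filtered = (
        spark.sparkContext
             .parallelize(pixels, numSlices=48)
             .map(adaptive_filter)
             .collect()
    )

    print(len(filtered))
    spark.stop()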

23 pages, 5785 KiB  
Article
SAT-Hadoop-Processor: A Distributed Remote Sensing Big Data Processing Software for Earth Observation Applications
by Badr-Eddine Boudriki Semlali and Felix Freitag
Appl. Sci. 2021, 11(22), 10610; https://doi.org/10.3390/app112210610 - 11 Nov 2021
Cited by 10 | Viewed by 1997
Abstract
Nowadays, several environmental applications take advantage of remote sensing techniques. A considerable volume of this remote sensing data arrives in near real-time. Such data are diverse and are provided with high velocity and variety; their pre-processing requires large computing capacities, and a fast execution time is critical. This paper proposes new distributed software for remote sensing data pre-processing and ingestion using cloud computing technology, specifically OpenStack. The developed software discarded 86% of the unneeded daily files and removed around 20% of the erroneous and inaccurate datasets. Parallel processing reduced the total execution time by 90%. Finally, the software efficiently processed and integrated data into the Hadoop storage system, notably HDFS, HBase, and Hive.
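The snippet below is a highly simplified sketch (not the SAT-Hadoop-Processor itself) of the pre-processing pattern this abstract describes: discard files that are not needed, drop files that fail a sanity check, and ingest the rest into HDFS. The directory names, file-naming convention, and validity check are illustrative assumptions.

    import subprocess
    from pathlib import Path

    RAW_DIR = Path("/data/raw_satellite")   # assumed local staging directory
    HDFS_DIR = "/satellite/ingested"        # assumed HDFS target directory

    def is_needed(path: Path) -> bool:
        # Keep only product files of interest (assumed naming convention).
        return path.suffix == ".nc" and "L2" in path.name

    def is_valid(path: Path) -> bool:
        # Crude size check standing in for real quality control.
        return path.stat().st_size > 1024

    for f in RAW_DIR.iterdir():
        if not (is_needed(f) and is_valid(f)):
            continue  # discard unneeded or erroneous files
        # Ingest each accepted file into HDFS via the standard CLI.
        subprocess.run(["hdfs", "dfs", "-put", "-f", str(f), HDFS_DIR], check=True)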

19 pages, 788 KiB  
Article
Optimization Techniques for a Distributed In-Memory Computing Platform by Leveraging SSD
by June Choi, Jaehyun Lee, Jik-Soo Kim and Jaehwan Lee
Appl. Sci. 2021, 11(18), 8476; https://doi.org/10.3390/app11188476 - 13 Sep 2021
Cited by 1 | Viewed by 1778
Abstract
In this paper, we present several optimization strategies that can improve the overall performance of the distributed in-memory computing system Apache Spark. Despite its distributed memory management capability for iterative jobs and intermediate data, Spark suffers significant performance degradation when the available amount of main memory (DRAM, typically used for data caching) is limited. To address this problem, we leverage an SSD (solid-state drive) to supplement the lack of main memory bandwidth. Specifically, we present an effective optimization methodology for Apache Spark by collectively investigating the effects of changing the capacity fraction ratios of the shuffle and storage spaces in the Spark JVM heap configuration and applying different RDD caching policies (e.g., SSD-backed memory caching). Our extensive experimental results show that the proposed optimization techniques can improve overall performance by up to 42%.
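As a rough sketch of the kind of knobs this abstract refers to (the values below are illustrative, and the configuration keys correspond to Spark's current unified memory model rather than necessarily the exact settings studied in the paper), the following configures the heap fractions that split execution/shuffle versus storage space, points spill files at an assumed SSD mount, and caches an RDD with a policy that spills to disk.

    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = (
        SparkSession.builder
        .appName("spark-ssd-tuning-sketch")
        .config("spark.local.dir", "/mnt/ssd/spark-tmp")  # shuffle/spill on an assumed SSD mount
        .config("spark.memory.fraction", "0.6")           # execution + storage share of the heap
        .config("spark.memory.storageFraction", "0.5")    # portion of that protected for caching
        .getOrCreate()
    )

    rdd = spark.sparkContext.parallelize(range(10_000_000), numSlices=64)

    # MEMORY_AND_DISK keeps partitions in memory while space allows and spills
    # the rest to spark.local.dir (the SSD) instead of recomputing them.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    print(rdd.map(lambda x: x * x).sum())

    spark.stop()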

24 pages, 1008 KiB  
Article
SPARQL2Flink: Evaluation of SPARQL Queries on Apache Flink
by Oscar Ceballos, Carlos Alberto Ramírez Restrepo, María Constanza Pabón, Andres M. Castillo and Oscar Corcho
Appl. Sci. 2021, 11(15), 7033; https://doi.org/10.3390/app11157033 - 30 Jul 2021
Cited by 1 | Viewed by 2222
Abstract
Existing SPARQL query engines and triple stores are continuously improved to handle more massive datasets. Several approaches have been developed in this context, proposing the storage and querying of RDF data in a distributed fashion, mainly using the MapReduce programming model and Hadoop-based ecosystems. New trends in big data technologies have also emerged (e.g., Apache Spark, Apache Flink); they use distributed in-memory processing and promise to deliver higher data processing performance. In this paper, we present a formal interpretation of some PACT transformations implemented in the Apache Flink DataSet API. We use this formalization to provide a mapping to translate a SPARQL query into a Flink program. The mapping was implemented in a prototype used to determine the correctness and performance of the solution. The source code of the project is available on GitHub under the MIT license.
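The toy snippet below (independent of the SPARQL2Flink implementation) illustrates the underlying idea in plain Python: each SPARQL triple pattern becomes a filter over the triple set, and variables shared between patterns become join keys, i.e., the kind of filter/join/map dataflow operators that Flink's DataSet API provides. The tiny RDF graph and query are made up for illustration.

    # A triple store as a plain list of (subject, predicate, object) tuples.
    triples = [
        ("alice", "knows", "bob"),
        ("bob", "knows", "carol"),
        ("alice", "worksAt", "acme"),
    ]

    def match(pattern):
        """Filter: evaluate one triple pattern, returning variable bindings."""
        results = []
        for triple in triples:
            binding = {}
            for term, value in zip(pattern, triple):
                if term.startswith("?"):
                    binding[term] = value
                elif term != value:
                    break
            else:
                results.append(binding)
        return results

    def join(left, right):
        """Join: merge bindings that agree on their shared variables."""
        return [{**a, **b} for a in left for b in right
                if all(a[k] == b[k] for k in a.keys() & b.keys())]

    # SELECT ?x ?y WHERE { ?x knows ?y . ?x worksAt acme }
    print(join(match(("?x", "knows", "?y")), match(("?x", "worksAt", "acme"))))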
