Big Data Management and Analysis with Distributed or Cloud Computing

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (31 October 2022) | Viewed by 8530

Special Issue Editors


Prof. Dr. Hyuk-Yoon Kwon
Guest Editor
Department of Industrial Engineering, Seoul National University of Science and Technology, 232 Gongneung-Ro, Nowon-Gu, Seoul 01811, Korea
Interests: big data management; data science; distributed computing; cloud computing; databases; machine learning; web crawler

Dr. Kisung Lee
Guest Editor
Division of Computer Science and Engineering, Louisiana State University, Baton Rouge, LA 70803, USA
Interests: scaling big data analytics; cloud computing; distributed computing systems; mobile computing and spatial data management; distributed data-intensive systems; social network analytics

Special Issue Information

Dear Colleagues,

The volume of data from a variety of sources is exploding at an increasingly rapid rate. As a result, it is essential to collect data from multiple sources and to manage such massive amounts of data in distributed environments, whether on-premise data servers or cloud services such as Amazon AWS, MS Azure, or Google GCP. To address big data challenges, distributed computing techniques have been developed on top of general-purpose big data frameworks providing distributed data ingestion (e.g., Apache Flume), distributed file systems (e.g., HDFS), MapReduce-like computation models (e.g., Apache Spark), and large-scale stream processing (e.g., Apache Kafka). Recently, container-orchestration frameworks (e.g., Kubernetes) have also been widely applied to provide big data services.
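To make this stack concrete, the following is a minimal sketch (not drawn from any paper in this Special Issue) of a Spark job that reads log files already ingested into HDFS and aggregates them in a MapReduce style; the HDFS path and log layout are assumptions for illustration only.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-log-aggregation").getOrCreate()

    # Read raw log lines from an (assumed) HDFS directory populated by a
    # collector such as Apache Flume.
    logs = spark.sparkContext.textFile("hdfs:///data/ingested/logs/*")

    # Map each line to (status_code, 1) and reduce by key -- the classic
    # MapReduce pattern expressed as Spark RDD operations.
    status_counts = (
        logs.map(lambda line: (line.split(" ")[-1], 1))
            .reduceByKey(lambda a, b: a + b)
    )

    for status, count in status_counts.collect():
        print(status, count)

    spark.stop()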

Given the necessity of distributed or cloud computing, traditional approaches to the ingestion, management, and analysis of various data types, such as geo-spatial data, images, and log data, need to be adapted to distributed or cloud environments. Machine learning and deep learning techniques likewise need to consider scalability across multiple distributed or partitioned models. Federated learning is a representative example, in which a single model is distributed into multiple local models that are then aggregated into a global model.
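As a rough illustration of the federated learning idea just described, the sketch below performs FedAvg-style aggregation: each client trains a local linear model on its own data and the server averages the resulting parameters, weighted by client data size. All data, model shapes, and hyperparameters are synthetic assumptions, not tied to any specific framework.

    import numpy as np

    def local_update(global_w, X, y, lr=0.1, epochs=5):
        """One client's local training: a few gradient steps of linear regression."""
        w = global_w.copy()
        for _ in range(epochs):
            grad = 2 * X.T @ (X @ w - y) / len(y)
            w -= lr * grad
        return w

    def federated_average(local_ws, sizes):
        """Server-side aggregation: size-weighted average of local parameters."""
        total = sum(sizes)
        return sum(w * (n / total) for w, n in zip(local_ws, sizes))

    rng = np.random.default_rng(0)
    true_w = np.array([1.0, -2.0, 0.5])               # synthetic ground truth
    clients = []
    for _ in range(4):                                # four hypothetical clients
        X = rng.normal(size=(50, 3))
        clients.append((X, X @ true_w + 0.1 * rng.normal(size=50)))

    global_w = np.zeros(3)
    for _ in range(10):                               # communication rounds
        local_ws = [local_update(global_w, X, y) for X, y in clients]
        global_w = federated_average(local_ws, [len(y) for _, y in clients])
    print(global_w)                                   # approaches true_w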

In this Special Issue, we focus on big data management and analysis that require distributed or cloud computing. Topics of interest include, but are not limited to, the following:

  • Big data management and analysis in distributed or cloud environments
  • Data sciences with distributed or cloud computing
  • Distributed data collection and ingestion
  • Machine learning or deep learning-based data analysis with distributed computing
  • Federated learning for distributed computing
  • Multi-modal data analysis
  • Distributed container-based computing
  • Distributed sensor data integration and analysis

Prof. Dr. Hyuk-Yoon Kwon
Dr. Kisung Lee
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Big data management and analysis
  • Data sciences
  • Distributed and cloud computing
  • Federated learning
  • Multi-modal data analysis
  • Distributed data ingestion
  • Distributed containers

Published Papers (4 papers)

Research

20 pages, 8464 KiB  
Article
A SqueeSAR Spatially Adaptive Filtering Algorithm Based on Hadoop Distributed Cluster Environment
by Yongning Li, Weiwei Song, Baoxuan Jin, Xiaoqing Zuo, Yongfa Li and Kai Chen
Appl. Sci. 2023, 13(3), 1869; https://doi.org/10.3390/app13031869 - 31 Jan 2023
Cited by 1 | Viewed by 1243
Abstract
Multi-temporal interferometric synthetic aperture radar (MT-InSAR) techniques analyze a study area using a time series of SAR images and can reach millimeter-level surface subsidence accuracy. To effectively acquire subsidence information in low-coherence areas without obvious features, such as non-urban areas, the MT-InSAR technique SqueeSAR improves the density of subsidence points in the study area by fusing distributed scatterers (DS). However, SqueeSAR filters the DS points individually during spatially adaptive filtering, which requires significant computer memory, leads to low processing efficiency, and poses great challenges for large-area InSAR processing. We propose a spatially adaptive filtering parallelization strategy based on the Spark distributed computing engine in a Hadoop distributed cluster environment, which splits the DS pixel data across different computing nodes for parallel processing and effectively improves the filtering algorithm's performance. To evaluate the effectiveness and accuracy of the proposed method, we conducted a performance evaluation and accuracy verification in and around the main city of Kunming using original Sentinel-1A SLC data provided by ESA. Parallel computation on a YARN cluster comprising three computing nodes improved the performance of the filtering algorithm by a factor of 2.15 without affecting the filtering accuracy.
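As a conceptual sketch of the parallelization pattern this abstract describes (not the authors' implementation), the snippet below distributes per-pixel DS time series across Spark partitions and applies a placeholder filter to each pixel independently; the synthetic data and the trivial filter are assumptions.

    import numpy as np
    from pyspark.sql import SparkSession

    def adaptive_filter(pixel_series):
        # Placeholder for the spatially adaptive filtering of one DS pixel's
        # interferometric time series (the real algorithm is far more involved).
        return float(np.median(pixel_series))

    spark = SparkSession.builder.appName("ds-filtering-sketch").getOrCreate()

    # Synthetic stand-in for DS pixel data: 10,000 pixels x 30 acquisitions.
    pixels = [np.random.rand(30) for _ in range(10_000)]

    # Split the pixels across partitions so each worker filters its slice in parallel.
    filtered = (
        spark.sparkContext
             .parallelize(pixels, numSlices=48)
             .map(adaptive_filter)
             .collect()
    )

    print(len(filtered))
    spark.stop()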

23 pages, 5785 KiB  
Article
SAT-Hadoop-Processor: A Distributed Remote Sensing Big Data Processing Software for Earth Observation Applications
by Badr-Eddine Boudriki Semlali and Felix Freitag
Appl. Sci. 2021, 11(22), 10610; https://doi.org/10.3390/app112210610 - 11 Nov 2021
Cited by 10 | Viewed by 1997
Abstract
Nowadays, several environmental applications take advantage of remote sensing techniques. A considerable volume of this remote sensing data arrives in near real-time. Such data are diverse and are provided with high velocity and variety; their pre-processing requires large computing capacities, and a fast execution time is critical. This paper proposes new distributed software for remote sensing data pre-processing and ingestion using cloud computing technology, specifically OpenStack. The developed software discarded 86% of the unneeded daily files and removed around 20% of the erroneous and inaccurate datasets. Parallel processing reduced the total execution time by 90%. Finally, the software efficiently processed and integrated data into the Hadoop storage system, notably HDFS, HBase, and Hive.
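The snippet below is a highly simplified sketch (not the SAT-Hadoop-Processor itself) of the pre-processing pattern this abstract describes: discard files that are not needed, drop files that fail a sanity check, and ingest the rest into HDFS. The directory names, file-naming convention, and validity check are illustrative assumptions.

    import subprocess
    from pathlib import Path

    RAW_DIR = Path("/data/raw_satellite")   # assumed local staging directory
    HDFS_DIR = "/satellite/ingested"        # assumed HDFS target directory

    def is_needed(path: Path) -> bool:
        # Keep only product files of interest (assumed naming convention).
        return path.suffix == ".nc" and "L2" in path.name

    def is_valid(path: Path) -> bool:
        # Crude size check standing in for real quality control.
        return path.stat().st_size > 1024

    for f in RAW_DIR.iterdir():
        if not (is_needed(f) and is_valid(f)):
            continue  # discard unneeded or erroneous files
        # Ingest each accepted file into HDFS via the standard CLI.
        subprocess.run(["hdfs", "dfs", "-put", "-f", str(f), HDFS_DIR], check=True)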

19 pages, 788 KiB  
Article
Optimization Techniques for a Distributed In-Memory Computing Platform by Leveraging SSD
by June Choi, Jaehyun Lee, Jik-Soo Kim and Jaehwan Lee
Appl. Sci. 2021, 11(18), 8476; https://doi.org/10.3390/app11188476 - 13 Sep 2021
Cited by 1 | Viewed by 1778
Abstract
In this paper, we present several optimization strategies that can improve the overall performance of the distributed in-memory computing system Apache Spark. Despite its distributed memory management capability for iterative jobs and intermediate data, Spark suffers significant performance degradation when the available amount of main memory (DRAM, typically used for data caching) is limited. To address this problem, we leverage an SSD (solid-state drive) to supplement the lack of main memory bandwidth. Specifically, we present an effective optimization methodology for Apache Spark by collectively investigating the effects of changing the capacity fraction ratios of the shuffle and storage spaces in the Spark JVM heap configuration and applying different RDD caching policies (e.g., SSD-backed memory caching). Our extensive experimental results show that the proposed optimization techniques can improve overall performance by up to 42%.
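As a rough sketch of the kind of knobs this abstract refers to (the values below are illustrative, and the configuration keys correspond to Spark's current unified memory model rather than necessarily the exact settings studied in the paper), the following configures the heap fractions that split execution/shuffle versus storage space, points spill files at an assumed SSD mount, and caches an RDD with a policy that spills to disk.

    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = (
        SparkSession.builder
        .appName("spark-ssd-tuning-sketch")
        .config("spark.local.dir", "/mnt/ssd/spark-tmp")  # shuffle/spill on an assumed SSD mount
        .config("spark.memory.fraction", "0.6")           # execution + storage share of the heap
        .config("spark.memory.storageFraction", "0.5")    # portion of that protected for caching
        .getOrCreate()
    )

    rdd = spark.sparkContext.parallelize(range(10_000_000), numSlices=64)

    # MEMORY_AND_DISK keeps partitions in memory while space allows and spills
    # the rest to spark.local.dir (the SSD) instead of recomputing them.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    print(rdd.map(lambda x: x * x).sum())

    spark.stop()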

24 pages, 1008 KiB  
Article
SPARQL2Flink: Evaluation of SPARQL Queries on Apache Flink
by Oscar Ceballos, Carlos Alberto Ramírez Restrepo, María Constanza Pabón, Andres M. Castillo and Oscar Corcho
Appl. Sci. 2021, 11(15), 7033; https://doi.org/10.3390/app11157033 - 30 Jul 2021
Cited by 1 | Viewed by 2222
Abstract
Existing SPARQL query engines and triple stores are continuously improved to handle more massive datasets. Several approaches have been developed in this context, proposing the storage and querying of RDF data in a distributed fashion, mainly using the MapReduce programming model and Hadoop-based ecosystems. New trends in big data technologies have also emerged (e.g., Apache Spark, Apache Flink); they use distributed in-memory processing and promise to deliver higher data processing performance. In this paper, we present a formal interpretation of some PACT transformations implemented in the Apache Flink DataSet API. We use this formalization to provide a mapping to translate a SPARQL query into a Flink program. The mapping was implemented in a prototype used to determine the correctness and performance of the solution. The source code of the project is available on GitHub under the MIT license.
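The toy snippet below (independent of the SPARQL2Flink implementation) illustrates the underlying idea in plain Python: each SPARQL triple pattern becomes a filter over the triple set, and variables shared between patterns become join keys, i.e., the kind of filter/join/map dataflow operators that Flink's DataSet API provides. The tiny RDF graph and query are made up for illustration.

    # A triple store as a plain list of (subject, predicate, object) tuples.
    triples = [
        ("alice", "knows", "bob"),
        ("bob", "knows", "carol"),
        ("alice", "worksAt", "acme"),
    ]

    def match(pattern):
        """Filter: evaluate one triple pattern, returning variable bindings."""
        results = []
        for triple in triples:
            binding = {}
            for term, value in zip(pattern, triple):
                if term.startswith("?"):
                    binding[term] = value
                elif term != value:
                    break
            else:
                results.append(binding)
        return results

    def join(left, right):
        """Join: merge bindings that agree on their shared variables."""
        return [{**a, **b} for a in left for b in right
                if all(a[k] == b[k] for k in a.keys() & b.keys())]

    # SELECT ?x ?y WHERE { ?x knows ?y . ?x worksAt acme }
    print(join(match(("?x", "knows", "?y")), match(("?x", "worksAt", "acme"))))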
