Article
Peer-Review Record

HI-Sky: Hash Index-Based Skyline Query Processing

by Jong-Hyeok Choi 1, Fei Hao 2,3 and Aziz Nasridinov 1,*
Reviewer 1: Anonymous
Reviewer 2:
Submission received: 6 January 2020 / Revised: 26 February 2020 / Accepted: 27 February 2020 / Published: 2 March 2020
(This article belongs to the Special Issue Big Data Analysis and Visualization)

Round 1

Reviewer 1 Report

In this paper, the authors propose an algorithm that can efficiently retrieve skyline results from high-volume and high-dimensional data sets. However, the authors do not describe real-world applications with these characteristics (i.e., high-volume and high-dimensional data sets). The reviewer suggests that the authors give some real-world applications to show that their design is widely applicable to today's business environment.

 

In Section 3, the authors mention that the proposed method allows frequent changes to the data set. However, the authors use only static data sets in their experiments. The authors could provide experiments to show that the proposed method works well when data items are frequently inserted into or deleted from the data sets.

 

Are there any real-world skyline applications that must deal with dynamic data sets (i.e., data sets whose items change frequently)?

 

In Figure 2, what happens if a data point falls on the boundary between cells? For example, does p = (0.5, 0.25) belong to GLADs 4, 5, 8 or 9?
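
The boundary question can be made concrete with a minimal sketch. It assumes a uniform grid over [0, 1]^d with half-open cells and a row-major scalar key. The function names, the 4 x 4 grid, and the numbering scheme are all hypothetical and are not taken from the paper.

```python
def cell_coords(point, cells_per_dim):
    """Map a point in [0, 1]^d to integer cell coordinates (hypothetical scheme)."""
    coords = []
    for x in point:
        if not 0.0 <= x <= 1.0:
            raise ValueError(f"coordinate {x} is outside [0, 1]")
        # Half-open cells [k/n, (k+1)/n): a value exactly on a boundary goes
        # to the higher-indexed cell. The clamp puts x == 1.0 in the last cell.
        coords.append(min(int(x * cells_per_dim), cells_per_dim - 1))
    return tuple(coords)


def cell_key(coords, cells_per_dim):
    """Flatten cell coordinates into one scalar key (row-major; an assumption)."""
    key = 0
    for c in coords:
        key = key * cells_per_dim + c
    return key


p = (0.5, 0.25)
coords = cell_coords(p, cells_per_dim=4)
print(coords, cell_key(coords, cells_per_dim=4))
```

Under these assumptions the point lands in exactly one cell, with coordinates (2, 1). The scalar key depends entirely on the assumed numbering and need not match the GLAD labels in Figure 2; the point is only that a half-open convention removes the ambiguity.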

 

The meaning of "GLAD b is not dominated by GLAD a" is unclear (page 7, line 314). Since a and b are hash keys (i.e., scalar values), there is no dominance relationship between scalar values. 
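
One possible reading, sketched below under stated assumptions, is that dominance is defined on the cells that the keys identify rather than on the scalar keys themselves. The sketch assumes smaller values are preferred, half-open grid cells, and a row-major key encoding. These choices and the function names are hypothetical and may differ from the definition the paper intends.

```python
def decode_key(key, cells_per_dim, dims):
    """Recover integer cell coordinates from a row-major scalar key (assumed encoding)."""
    coords = []
    for _ in range(dims):
        coords.append(key % cells_per_dim)
        key //= cells_per_dim
    return tuple(reversed(coords))


def cell_dominates(key_a, key_b, cells_per_dim, dims):
    """Return True if every point in cell a dominates every point in cell b.

    With half-open cells and minimization, this holds when cell a has a
    strictly smaller index than cell b in every dimension: any point in a is
    then strictly smaller than any point in b in every coordinate.
    """
    a = decode_key(key_a, cells_per_dim, dims)
    b = decode_key(key_b, cells_per_dim, dims)
    return all(ai < bi for ai, bi in zip(a, b))


# 4 x 4 grid: key 0 is cell (0, 0), key 1 is cell (0, 1), key 5 is cell (1, 1).
print(cell_dominates(0, 5, cells_per_dim=4, dims=2))
print(cell_dominates(1, 5, cells_per_dim=4, dims=2))
```

Comparing the raw keys (0 < 5, 1 < 5) would not distinguish these two cases, which is why the comparison is done on decoded coordinates.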

 

The meaning of HIGP in Algorithm HI-Sky is unclear.

 

The performance results in [1] show that Z-SKY performs better than SFS across various data dimensionalities and data cardinalities. However, the experiments in this paper show different results: SFS performs better than Z-SKY in most cases. Why?

 

[1] Ken C. K. Lee, Wang-Chien Lee, Baihua Zheng, Huajing Li and Yuan Tian, "Z-SKY: an efficient skyline query processing framework based on Z-order," The VLDB Journal (2010) 19:333–362.

Author Response

Please refer to the attached file.

Author Response File: Author Response.pdf

Reviewer 2 Report

The authors propose an efficient algorithm for performing skyline queries in a distributed context. In particular, the proposed method involves the construction of an index based on a hash grid. The paper is well written, and the description of the proposed methodology is supported by a formal description in pseudo-code, a running example, and an extensive set of experiments comparing its performance with that of other available techniques. I really appreciate your work, but some issues have to be fixed before the paper can be accepted for publication:

- From the introduction it is not clear which execution environment is considered: do you refer to a distributed one? Is your technique applicable to a MapReduce framework, like Hadoop or Spark? Or do you refer to a distributed database system?

- Sect. 2.3: in the related work section, you mention the implementation of skyline algorithms in MapReduce. I would also like to see a reference to the skyline implementation provided by SpatialHadoop [1], a spatial extension of Apache Hadoop. This framework not only provides an implementation of the skyline for spatial data, but also allows it to be combined with various kinds of indexes [2]. Moreover, for this kind of system a work has been published that proposes a technique to determine the best kind of index to apply based on the characteristics of the given dataset [3].

[1] "SpatialHadoop: A MapReduce Framework for Spatial Data" by Eldawy et al. In Proceedings of the IEEE International Conference on Data Engineering, ICDE 2015

[2] "Spatial Partitioning Techniques in SpatialHadoop" by Eldawy et al. In Proceedings of the International Conference on Very Large Databases, VLDB 2015

[3] "Detecting Skewness of Big Spatial Data in SpatialHadoop" by Belussi et al. In Proceedings of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, SIGSPATIAL 2018

- Sect. 3.1: from your description on page 4, it seems that your technique is able to deal with dynamic datasets, namely data can be added to or deleted from the dataset. However, it is not completely clear whether your index scales well under intensive update operations.

- Sect. 3.1: another question concerns the treatment of skewed data. Is your index able to deal efficiently with non-uniformly distributed datasets? For highly clustered datasets, the uniform partitioning used by your grid index may not be appropriate. See again the work in [3].
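
The skew concern can be illustrated with a small sketch that measures how evenly a uniform grid is occupied. It assumes points in [0, 1]^d and the same number of cells per dimension. The function names, the two summary statistics, and the synthetic data are hypothetical choices for illustration and do not come from the paper or from [3].

```python
from collections import Counter
import random


def cell_occupancy(points, cells_per_dim):
    """Count how many points fall into each cell of a uniform grid on [0, 1]^d."""
    counts = Counter()
    for p in points:
        # Half-open cells; the clamp keeps x == 1.0 inside the last cell.
        coords = tuple(min(int(x * cells_per_dim), cells_per_dim - 1) for x in p)
        counts[coords] += 1
    return counts


def occupancy_summary(points, cells_per_dim):
    """Summarize grid usage: share of non-empty cells and share held by the fullest cell."""
    if not points:
        raise ValueError("points must be non-empty")
    dims = len(points[0])
    counts = cell_occupancy(points, cells_per_dim)
    return {
        "occupied_fraction": len(counts) / cells_per_dim ** dims,
        "max_cell_share": max(counts.values()) / len(points),
    }


def clamp01(x):
    """Clip a value into [0, 1] so synthetic Gaussian samples stay on the grid."""
    return min(max(x, 0.0), 1.0)


if __name__ == "__main__":
    rng = random.Random(0)
    uniform = [(rng.random(), rng.random()) for _ in range(10_000)]
    clustered = [
        (clamp01(rng.gauss(0.5, 0.02)), clamp01(rng.gauss(0.5, 0.02)))
        for _ in range(10_000)
    ]
    print("uniform  ", occupancy_summary(uniform, cells_per_dim=8))
    print("clustered", occupancy_summary(clustered, cells_per_dim=8))
```

A low occupied fraction together with a high maximum cell share indicates that a few cells hold most of the data, in which case a uniform grid gives little pruning power within those cells.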

Minor comments:

- page 5, line 193: [0,1]^2 -> does 2 stand for d, the number of dimensions?

- the resolution of Fig. 3 is not very high, the characters are not clearly readable.

- page 10, line 385: "Fig. 3(b)" -> is it really subfigure (b)?

- pages 11-12: place the caption of the algorithm on the same page as the pseudo-code.

Author Response

Please refer to the attached file.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

The authors have addressed all my concerns. 

Author Response

We greatly appreciate your time reviewing our manuscript.

Reviewer 2 Report

I am satisfied with the work done by the authors; they have addressed all my comments. However, regarding the treatment of skewed datasets, which is (correctly) left as future work, I encourage the authors to also add a sentence to the conclusion mentioning this future extension, referring to [3] for inspiration on how to extend the work based on the dataset distribution.

[3] "Detecting Skewness of Big Spatial Data in SpatialHadoop" by Belussi et al. In Proceedings of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, SIGSPATIAL 2018

Author Response

Summary of Revision

Reviewer #2

We deeply appreciate the reviewer’s careful comments. The revision is explained below. The changed parts of the paper are highlighted in red.

 

Comment 1: I am satisfied with the work done by the authors; they have addressed all my comments. However, regarding the treatment of skewed datasets, which is (correctly) left as future work, I encourage the authors to also add a sentence to the conclusion mentioning this future extension, referring to [1] for inspiration on how to extend the work based on the dataset distribution.

 

Answer 1: Thank you for your comments. We have added a description of how to extend the proposed method in line with the paper [1] recommended by the reviewer. Please refer to the conclusion (page 21).

 

“In future work, we are planning to apply a method like box-counting [35], to make HI-Sky choose the best np by itself even if the datasets are skewed. In addition, we are planning to utilize a system like SpatialHadoop [24-25] to extend HI-Sky into a distributed environment.”

 

References

  1. Belussi, A.; Migliorini, S.; Eldawy, A. Detecting skewness of big spatial data in SpatialHadoop. Proceedings of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Association for Computing Machinery: New York, NY, USA, 2018; pp 432-435.