Article
Peer-Review Record

Analysis and Visualization of Coastal Ocean Model Data in the Cloud

by Richard P. Signell 1,* and Dharhas Pothina 2
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 6 March 2019 / Revised: 29 March 2019 / Accepted: 1 April 2019 / Published: 19 April 2019

Round 1

Reviewer 1 Report

In the article, the authors show an example usage of a Cloud-based software stack for coastal ocean modeling. After stating the current limitations of the “traditional” approaches for scientific analysis, they make the case for Cloud-based data analysis. After a detailed description of the various software in their stack, they show examples of such analysis using a finite element storm surge model (ADCIRC) and a primitive equations regional ocean model (COAWST).


Major remarks:


Abstract and Introduction share some very similar language.

The discussion would benefit from a deeper treatment of the limitations of the technology. Still, only a very limited number of datasets are served on the cloud, and the software stack is in constant change (which demonstrates the vitality of the community), …

The conclusion is very short, some restructuring of the discussion/conclusions could be useful.

The authors are using model results for storm surge and 3D ocean forecasts as their examples. Providing a little more context about the datasets would help make the case for why cloud-based computing is a great asset for science.

No analysis for COAWST: the analysis part of the paper is limited to computing the max of the first dataset.

Related to this: can we have an example showing, say, the temperature anomaly between the forecast and the climatological mean for that day, computed using groupby?
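The kind of anomaly calculation suggested here could be sketched in xarray roughly as follows. The dataset below is synthetic and the variable names are illustrative, not taken from the paper; the split-apply-combine pattern with `groupby` is the point.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic multi-year daily "temperature" record (names are illustrative)
times = pd.date_range("2015-01-01", periods=4 * 365, freq="D")
ds = xr.Dataset(
    {"temp": ("time", 20 + 5 * np.sin(2 * np.pi * np.arange(4 * 365) / 365.25))},
    coords={"time": times},
)

# Daily climatology: mean over all years for each day of the year
climatology = ds.temp.groupby("time.dayofyear").mean("time")

# Anomaly: each day's value minus the climatological mean for that day
anomaly = ds.temp.groupby("time.dayofyear") - climatology
```

By construction the anomaly averages to zero within each day-of-year group; the same two lines apply unchanged to a forecast dataset opened from cloud storage.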


Structure:


In the framework description, I would split the list into projects (pangeo, pyviz, earthsim) and python packages. The pangeo description could include details on how the computing environment is built (k8s cluster, conda environment, jupyterhub), and then tailored for a specific use case (geophysical fluid dynamics)


Minor remarks:


L14-15/L36-37: Although the HPC-to-desktop workflow may be the most common one, some groups or institutions have more centralized ways of working, where visualization servers sit alongside (and share filesystems with) the supercomputers. I would try to be more general.


L24-25: what about “a cloud-optimized file format for N-dimensional arrays.”


L26: 10 million grid points is pretty standard for an ocean model. I would let go of the “massive”


L29: I would keep the technical details of the ADCIRC run for the main text and also add details of the COAWST model later.


L35: I would say the challenge lies in the ever-increasing volume of data produced. Current systems can deal with 1 TB simulations; however, the PB scale for model output is quickly coming.


L62: replace ecosystem by community


L65-68: looks very similar to lines 30-31


L89-90: can you elaborate on this point?


L104: “chunks” is defined here but used earlier


L103: Maybe define out-of-memory, explaining how building a dask graph allows one to perform operations on a dataset that doesn’t fit into memory.
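The out-of-memory idea asked about here can be illustrated with a small dask sketch. The sizes below are deliberately tiny so the example runs anywhere; the mechanism is identical for arrays larger than RAM.

```python
import dask.array as da

# A chunked array: only the task graph exists so far, no data in memory
x = da.ones((10_000, 1_000), chunks=(1_000, 1_000))  # 10 chunks

# Arithmetic and reductions just extend the graph (lazy evaluation)
total = (x * 2).sum()

# compute() executes the graph chunk by chunk; peak memory stays near the
# size of one chunk, which is how datasets larger than RAM are handled
result = total.compute()
```

Nothing is read or allocated until `compute()` is called, and even then only a few chunks are resident at once, so the full array never needs to fit in memory.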


L110: and coordinates


L112: technically NetCDF4 = HDF5. I would rephrase to “local and remote datasets”


L113: I would mention that this type of parallelization is fine-grained and very different from MPI, just to make sure the reader doesn’t mix up paradigms


L168: I wouldn’t use agnostic in this context, what about: and works on all major Cloud providers


L173: same remark on agnostic


L182: The Pangeo community…


L196: the text says 15s when the screenshot says 24.5s. Rather than raw numbers, speedups can be more interesting to see, for example in a benchmark against a desktop computer.


L202-203: Is there a reason to use this particular chunking? It might be useful to say that the unit of data access in this case is the chunk, and if one needs only the first time step, all 10 will be read.
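The chunk-as-unit-of-access point might be illustrated like this, using a synthetic `(time, node)` variable with hypothetical names:

```python
import numpy as np
import xarray as xr

# Synthetic (time, node) variable, chunked 10 time steps per chunk
ds = xr.Dataset(
    {"zeta": (("time", "node"), np.zeros((100, 50)))}
).chunk({"time": 10})

# Reading one time step still pulls the whole 10-step chunk containing it:
# the chunk, not the single step, is the unit of I/O
first = ds.zeta.isel(time=0).compute()
```

With a cloud object store behind this, each chunk is one object, so a single-time-step query pays for a full 10-step read; chunking should therefore match the expected access pattern.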


Figure 5: The country names don’t display well; I’d remove them and trust the reader’s rudiments of geography.


Figures 5 and 6: Can you add a date and maybe explain the signals we’re seeing? How different is the max water level from the Sea Surface Height? Is it a tidal signal shown in the Bay of Fundy?


Figure 6: the remainder of “Houston” past the “H” disappeared.


L247: Can we have more details on the COAWST model grid?


L272: unless the data is produced locally


L273: I’d say: On the Cloud, once a dataset is made available, any user can run analysis on it using the extensive cloud computing resources.


L293: Maybe mention that one advantage of cloud-based computing is the flexibility it allows in spending on resources, which can be very advantageous for scientists whose computing needs vary greatly by project


L296: any computer connected to the internet.


L315: maybe mention that researchers might soon not have the choice, given the cost and inefficiency of storing dark copies of ever-growing datasets. For example, one year of precipitation in ERA-Interim was around 500 MB, compared to 17 GB in ERA5.

 

L316: remove and



Author Response

Reviewer #1:


In the article, the authors show an example usage of a Cloud-based software stack for coastal ocean modeling. After stating the current limitations of the “traditional” approaches for scientific analysis, they make the case for Cloud-based data analysis. After a detailed description of the various software in their stack, they show examples of such analysis using a finite element storm surge model (ADCIRC) and a primitive equations regional ocean model (COAWST).


Major remarks:


Abstract and Introduction share some very similar language.


Indeed some of the same concepts are repeated, but this is intentional.  As other reviewers did not object, no changes were made.


The discussion would benefit from a deeper treatment of the limitations of the technology. Still, only a very limited number of datasets are served on the cloud, and the software stack is in constant change (which demonstrates the vitality of the community), …


The challenges of the rapidly evolving software stack and of moving data to the Cloud have been added to the discussion.


The conclusion is very short, some restructuring of the discussion/conclusions could be useful.


The conclusion has been lengthened and is now not quite as short.


The authors are using model results for storm surge and 3D ocean forecasts as their examples. Providing a little more context about the datasets would help make the case for why cloud-based computing is a great asset for science.


No analysis for COAWST: the analysis part of the paper is limited to computing the max of the first dataset.


Related to this: can we have an example showing, say, the temperature anomaly between the forecast and the climatological mean for that day, computed using groupby?


The two examples using ADCIRC and COAWST results are representative only, chosen carefully to illustrate the functionality of the framework, namely parallel computation, out-of-memory handling, and visualization of both trimesh and quadmesh grids. We don’t see how the suggested additional example would demonstrate any additional functionality.



Structure:


In the framework description, I would split the list into projects (pangeo, pyviz, earthsim) and python packages. The pangeo description could include details on how the computing environment is built (k8s cluster, conda environment, jupyterhub), and then tailored for a specific use case (geophysical fluid dynamics)


As other reviewers did not suggest this, and we don’t feel this would improve the manuscript, we did not undertake this particular restructuring.



Minor remarks:




L14-15/L36-37: Although the HPC-to-desktop workflow may be the most common one, some groups or institutions have more centralized ways of working, where visualization servers sit alongside (and share filesystems with) the supercomputers. I would try to be more general.


We don’t think it’s necessary to cover all current workflows in the abstract. The focus is on a new workflow on the Cloud.


L24-25: what about “a cloud-optimized file format for N-dimensional arrays.”


Thanks, done.



L26: 10 million grid points is pretty standard for an ocean model. I would let go of the “massive”


Changed to “large”.   


L29: I would keep the technical details of the ADCIRC run for the main text and also add details of the COAWST model later.


I would like to keep these details of the ADCIRC run in the abstract since some people only read the abstract and it quantitatively conveys the size of the data, which is important.  

L35: I would say the challenge lies in the ever-increasing volume of data produced. Current systems can deal with 1 TB simulations; however, the PB scale for model output is quickly coming.


We believe it’s also challenging to deal with 1 TB simulations effectively.


L62: replace ecosystem by community


It actually is the ecosystem, not the community.



L65-68: looks very similar to lines 30-31


I think this is okay.  30-31 are in the abstract.


L89-90: can you elaborate on this point?


Done.


L104: “chunks” is defined here but used earlier


Sentence rewritten.


L103: Maybe define out-of-memory, explaining how building a dask graph allows one to perform operations on a dataset that doesn’t fit into memory.


Done.


L110: and coordinates


Yes, added.


L112: technically NetCDF4 = HDF5. I would rephrase to “local and remote datasets”


Done


L113: I would mention that this type of parallelization is fine-grained and very different from MPI, just to make sure the reader doesn’t mix up paradigms


Done.


L168: I wouldn’t use agnostic in this context, what about: and works on all major Cloud providers


Done.


L173: same remark on agnostic


Here we are quoting, so we can’t change it.


L182: The Pangeo community…


Done.


L196: the text says 15s when the screenshot says 24.5s. Rather than raw numbers, speedups can be more interesting to see, for example in a benchmark against a desktop computer.


Fixed.


L202-203: Is there a reason to use this particular chunking? It might be useful to say that the unit of data access in this case is the chunk, and if one needs only the first time step, all 10 will be read.


Added information about the chunk size selection.


Figure 5: The country names don’t display well; I’d remove them and trust the reader’s rudiments of geography.


This is an actual screen shot of the tool, so would not be appropriate to modify.


Figures 5 and 6: Can you add a date and maybe explain the signals we’re seeing? How different is the max water level from the Sea Surface Height? Is it a tidal signal shown in the Bay of Fundy?


Added some text that in fact it *is* the tidal signal shown in the Bay of Fundy.



Figure 6: the remainder of “Houston” past the “H” disappeared.


This is an actual screen shot of the tool, so would not be appropriate to modify.



L247: Can we have more details on the COAWST model grid?


Added.


L272: unless the data is produced locally


Added.


L273: I’d say: On the Cloud, once a dataset is made available, any user can run analysis on it using the extensive cloud computing resources.


Changed.


L293: Maybe mention that one advantage of cloud-based computing is the flexibility it allows in spending on resources, which can be very advantageous for scientists whose computing needs vary greatly by project


Added.


L296: any computer connected to the internet.


Changed.


L315: maybe mention that researchers might soon not have the choice, given the cost and inefficiency of storing dark copies of ever-growing datasets. For example, one year of precipitation in ERA-Interim was around 500 MB, compared to 17 GB in ERA5.


Not added.


L316: remove and


Done.

Reviewer 2 Report

Review of “Analysis and Visualization of Coastal Ocean Model Data in the Cloud” by Signell and Pothina.

 

This paper describes the collection of tools that are now available to process and visualize large geophysical datasets on the cloud. The authors present two example applications using coastal modeling data sets.

 

The topic of storing and analyzing data on the cloud is of great importance to modelers and observationalists alike.  Many of these tools have come to maturity just in the last year or two, and most scientists, including myself, are struggling to keep up with the bewildering array of new tools, how they connect, and what each is used for. I really like this article, because it gives a brief description of each tool and how it fits into a full system of data analysis on the cloud. It also provides a high-level overview of the needs and benefits of this suite of tools.

 

The writing is very good, background and references are sufficiently detailed, and example data sets are well presented.

 

Because of the emphasis on reproducibility, accessibility, and data sharing of cloud-based analysis, it is great that the authors provided a way for readers to conveniently reproduce the results shown in the paper. I was a little confused by references 34 and 35, as they are general websites, not specific containers or codes:

https://github.com/reproducible-notebooks

https://mybinder.org

Perhaps you could add a few more paragraphs or an appendix on the exact steps to take for us novices.  For example, I clicked around and went here:

https://hub.binder.pangeo.io/user/reproducible-no-roms_dashboards-sjep2nhm/notebooks/COAWST_Dashboard.ipynb

and clicked cell/run all. It appears to have worked. Did I just run the page on the cloud? In a container? Was I using all of the things in refs 9 through 18? One of the ironies of this being so easy is that I don’t actually know what was “done” or what components it used on that sample jupyter page. A bit more explanation would go a long way, but a tutorial-style explanation may fit an appendix better.

 

Thank you for writing such a timely and relevant article. It will be useful to scientists who are working to keep their computing knowledge and skills up to date. I took a training from Ryan Abernathey et al. at AGU last December, and was amazed at how much progress had been made on these libraries for cloud data analysis. I then tried to show my colleagues at work the following week, and was not able to get it to work, despite having years of experience with big data analysis from models using the older set of tools. So I am a member of the target audience for this publication.

 

 


Author Response

Reviewer #2


All of this reviewer’s issues are about adding more tutorial content. The problem with a tutorial in a paper (whether in the body or in an appendix) is that it quickly gets out of date. To address this concern, I have added a link to the Pangeo tutorials, which will ensure that as the tutorials change, readers of this paper will have access to the latest, up-to-date information.


Reviewer 3 Report

The authors described the Pangeo and related open-source framework for big data analysis in the cloud, which is very encouraging. In the geoscience community, we are collecting more and more data, which is too large to analyze using a laptop or local server. Cloud storage and analysis is definitely the right way forward. The manuscript is well organized and two applications were given to demonstrate the powerful framework. A few comments are as follows.

1) Instructions or examples about how to install this framework in the cloud, on HPC, or on a local server are needed. Because of the benefits of this powerful framework as demonstrated by the authors, readers may want to try and deploy this framework themselves. It would be very beneficial to have some instructions or examples for installing and deploying this framework.

2) Line 93-94, Page 3. Zarr chunk size is a key parameter affecting file reading and writing speed. Could the authors say more about the rule of thumb for determining chunk size?
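A commonly cited rule of thumb (not necessarily the one the authors used) is to target chunks of very roughly 50–200 MB: large enough to amortize per-request overhead against a cloud object store, small enough to spread work across many dask workers. A back-of-the-envelope sketch, with all numbers purely illustrative:

```python
# Back-of-the-envelope chunk sizing for a (time, node) variable,
# aiming for roughly 100 MB per chunk (a commonly cited target)
bytes_per_value = 8            # float64
n_nodes = 9_000_000            # illustrative spatial grid size
target_chunk_bytes = 100e6

# Chunk over the full spatial dimension, and pick the number of time
# steps that brings one chunk close to the target size
time_steps_per_chunk = max(1, int(target_chunk_bytes // (n_nodes * bytes_per_value)))
chunk_mb = time_steps_per_chunk * n_nodes * bytes_per_value / 1e6
print(time_steps_per_chunk, f"{chunk_mb:.0f} MB")
```

The same arithmetic works in reverse if the grid is small and many time steps fit in one chunk; the guiding question is which dimension users will slice along most often.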

3) Because Binder is also mentioned in the manuscript, it would be better if the authors could spend a few sentences introducing Binder. Readers could then better understand how to try the examples in the manuscript themselves.


Author Response


The authors described the Pangeo and related open-source framework for big data analysis in the cloud, which is very encouraging. In the geoscience community, we are collecting more and more data, which is too large to analyze using a laptop or local server. Cloud storage and analysis is definitely the right way forward. The manuscript is well organized and two applications were given to demonstrate the powerful framework. A few comments are as follows.


1) Instructions or examples about how to install this framework in the cloud, on HPC, or on a local server are needed. Because of the benefits of this powerful framework as demonstrated by the authors, readers may want to try and deploy this framework themselves. It would be very beneficial to have some instructions or examples for installing and deploying this framework.


Reviewer #2 also asked about this.  I have added a link to the Pangeo tutorials, where interested folks may find much more information about how to set up the framework.  


2) Line 93-94, Page 3. Zarr chunk size is a key parameter affecting file reading and writing speed. Could the authors say more about the rule of thumb for determining chunk size?


Reviewer #1 also asked about this.   I have added information about the rationale for this particular chunk size.



3) Because Binder is also mentioned in the manuscript, it would be better if the authors could spend a few sentences introducing Binder. Readers could then better understand how to try the examples in the manuscript themselves.


Reviewer #2 also asked about this.  I have added a link to the Pangeo tutorials, where interested folks may find much more information about how to run these workflows on Binder.

