5.2. Data Provider Interactions
Initial interactions with data providers occur primarily through the ORNL DAAC’s website on data management and data archival [5]. The website contains tutorials, training, and reference material on data management [12]. In addition, the webpages provide an overview of data management planning and preparation and offer practical methods for successfully sharing and archiving data. Information on these pages serves as background and help files for the data activities associated with SAuS, including the steps providers need to take to initiate data ingest through this workflow.
Stage 1 (Inquire and Submit):
The ORNL DAAC receives notification of a data provider’s interest in archiving a data set through an archival interest form (Inquire). The form is prominently displayed on the ORNL DAAC homepage. The archival interest form captures preliminary information about the data set, such as the funding source, data set title, and data set description.
The details submitted by a potential data provider are added to a database at the ORNL DAAC archive center. When an archival interest form is submitted, the SAuS system triggers an email to the data coordinator and the ORNL DAAC chief scientist. A decision about archiving is made based on archive policy and data set priority, as approved and endorsed by EOSDIS and the ORNL DAAC User Working Group [14]. At the ORNL DAAC, data sets from the NASA Terrestrial Ecology and related programs (Carbon Monitoring Systems (CMS), Interdisciplinary Science (IDS), Carbon Cycle Science, etc.) are accepted and processed in the order in which they are received. This order may be adjusted based on the condition and quality of the data and documentation when received and on how quickly investigators respond to questions. If a candidate data set does not fall within the purview of the ORNL DAAC archive, the data provider is notified and possible alternative data archives are suggested. Non-NASA-sponsored data sets that are directly relevant to the terrestrial ecology community must be approved by EOSDIS and the User Working Group prior to archival, based on the importance of the data set for the community, data set size and condition, and resource availability.
Stage 2 (Quality Review, Document):
If the data set is selected for archive, the data set moves into the “active” ingest phase. Through the SAuS publication dashboard, the ORNL DAAC ingest coordinator triggers an automated email to the data provider. The SAuS publication dashboard is a Drupal-based content management system that provides a graphical user interface for tracking and moving data sets through the ingest workflow. A description of the dashboard is provided later. When the automated email is triggered, automated scripts create various staging areas for the data sets. A unique data set ID based on a Universally Unique Identifier (UUID) is created for ingest in SAuS. The UUID is used to create a data upload directory and to add relevant information about the data set into the ingest workflow tracking database. An email address based on the UUID enables tracking of all email messages associated with the data set. The initial email to the investigator triggered through the dashboard contains four key elements.
Information to get a user account on the ORNL DAAC data publication system;
Link to answer a short questionnaire about the data set;
Link to upload data files;
Link to notify the ORNL DAAC system that all the above steps are complete.
The user account allows authenticated access to the data publication system and provides an added level of security to the ingest workflow. The short questionnaire gathers preliminary information about the data set to assist with quality assessment of the data files and to build data set documentation. The ORNL DAAC data set ingest questionnaire was designed from user community input and is aimed at maximizing the information (metadata) collected from the data provider in a reasonable amount of time (~30 min) to expedite the data publication process while retaining ingest quality. A summary of the questionnaire is provided in Table 1. In addition to answering the questions, the data provider submits the completed data or model products, including description documents and supplemental files, using the UUID-specific upload area. After the files and the answers have been uploaded, the data provider verifies the completion of the steps, which closes the ingest submission stage. The SAuS system provides mechanisms to send reminders and can snapshot the answers and the file upload summary information for provenance and record keeping.
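The staging step described above (a UUID that keys an upload directory, a tracking-database record, and a per-data-set email address) can be illustrated with a minimal sketch. This is not the SAuS implementation: the function name `stage_data_set`, the `ingest` table layout, the `example.org` mail domain, and the use of SQLite are all assumptions made for illustration.

```python
import sqlite3
import tempfile
import uuid
from pathlib import Path


def stage_data_set(title, upload_root, db, mail_domain="example.org"):
    """Create the per-data-set staging artifacts, all keyed by one fresh UUID."""
    ingest_id = str(uuid.uuid4())                   # unique ingest identifier
    upload_dir = Path(upload_root) / ingest_id      # UUID-specific upload area
    upload_dir.mkdir(parents=True)
    tracking_email = f"{ingest_id}@{mail_domain}"   # hypothetical address scheme
    db.execute(
        "INSERT INTO ingest (ingest_id, title, upload_dir, email, status) "
        "VALUES (?, ?, ?, ?, ?)",
        (ingest_id, title, str(upload_dir), tracking_email, "active"),
    )
    db.commit()
    return ingest_id, upload_dir, tracking_email


# Hypothetical tracking database; an in-memory SQLite table stands in for it.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE ingest (ingest_id TEXT PRIMARY KEY, title TEXT, "
    "upload_dir TEXT, email TEXT, status TEXT)"
)
upload_root = tempfile.mkdtemp()
ingest_id, upload_dir, tracking_email = stage_data_set(
    "Example data set", upload_root, db
)
```

Because every artifact is derived from the same identifier, the upload area, the email thread, and the database record for a data set can be located from any one of them.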
5.3. ORNL DAAC Curation
After the data provider interaction phase has been closed out, the data set moves into the curation phase. Through the SAuS ingest dashboard, data Quality Assessment (QA) and documentation assignments for the data set are made. During the curation phase, all information collected about the data set is copied and moved to a QA area. The ingest UUID identification is maintained through the curation phase. When the files are moved into the QA area, they are piped through a metadata script that extracts file-level metadata using open source software such as the Geospatial Data Abstraction Library (GDAL), netCDF Operators (NCO), etc. This file-level metadata is used as a starting point for QA and for building the metadata required for data search, subsetting, visualization, and dissemination interfaces. The file-level metadata extracted includes information such as spatial and temporal extent, file size, file type, variable definitions, and associated characteristics of the data files.
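As a rough illustration of the file-level metadata step, the sketch below walks a QA directory and records the format-independent fields mentioned above (file name, size, type) plus a checksum. It uses only the Python standard library and is an assumption-laden stand-in, not the ORNL DAAC script: the function names are hypothetical, and the spatial, temporal, and variable information would in practice come from tools such as GDAL or NCO rather than from this code.

```python
import hashlib
import mimetypes
import tempfile
from pathlib import Path


def file_level_metadata(path):
    """Collect basic, format-independent metadata for one file."""
    path = Path(path)
    data = path.read_bytes()   # fine for a sketch; large files should be read in chunks
    guessed_type, _ = mimetypes.guess_type(path.name)
    return {
        "name": path.name,
        "size_bytes": len(data),
        "file_type": guessed_type or "unknown",
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def scan_qa_area(qa_dir):
    """Return one metadata record per file under qa_dir, in sorted path order."""
    qa_dir = Path(qa_dir)
    return [file_level_metadata(p) for p in sorted(qa_dir.rglob("*")) if p.is_file()]


# Tiny demonstration with a temporary QA area and one made-up file.
qa_dir = Path(tempfile.mkdtemp())
(qa_dir / "sites.csv").write_bytes(b"site,lat,lon\nA,35.9,-84.3\n")
records = scan_qa_area(qa_dir)
```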
QA staff use the information provided through the dashboard, supplemental information collected from the data provider, and the metadata extracted through the script to perform QA checks on the data files. During the QA phase, the integrity of the data files (checksums, projection, etc.) is verified, and the internal and external organizational aspects of the data files (directory structure, file naming conventions, parameter conventions, etc.) are checked to ensure that the data files are consistent with the documentation provided.
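Two of the checks named above, checksum verification and file naming conventions, lend themselves to automation. The following sketch assumes a provider-supplied manifest mapping file names to SHA-256 digests and a made-up naming rule (lowercase letters, digits, and underscores with a single extension); neither the manifest format nor the rule is taken from the ORNL DAAC checklist.

```python
import hashlib
import re
import tempfile
from pathlib import Path

# Hypothetical naming convention, used only for illustration.
NAME_PATTERN = re.compile(r"^[a-z0-9_]+\.[a-z0-9]+$")


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large data files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def qa_check(data_dir, manifest):
    """Return (file name, problem) pairs; an empty list means every check passed."""
    problems = []
    data_dir = Path(data_dir)
    for name, expected in manifest.items():
        path = data_dir / name
        if not path.is_file():
            problems.append((name, "missing file"))
            continue
        if sha256_of(path) != expected:
            problems.append((name, "checksum mismatch"))
        if not NAME_PATTERN.match(name):
            problems.append((name, "name violates convention"))
    return problems


# Demonstration with made-up files: one clean, one bad, one never uploaded.
data_dir = Path(tempfile.mkdtemp())
(data_dir / "biomass_2000.csv").write_bytes(b"plot,agb\n1,12.5\n")
(data_dir / "Biomass Final.csv").write_bytes(b"plot,agb\n2,9.1\n")
manifest = {
    "biomass_2000.csv": hashlib.sha256(b"plot,agb\n1,12.5\n").hexdigest(),
    "Biomass Final.csv": "0" * 64,   # deliberately wrong checksum
    "readme.txt": "0" * 64,          # listed in the manifest but not uploaded
}
problems = qa_check(data_dir, manifest)
```

Returning a list of problems, rather than stopping at the first failure, matches the way issues are collected and then raised with the data provider in a single exchange.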
QA staff also evaluate the appropriateness of the file format and make any file format conversions needed to ensure wider usage of the data files. For example, a binary data file may be converted into a Climate and Forecast (CF) convention-compliant netCDF file. The non-proprietary netCDF format and the CF convention ensure that the data are readable many years into the future and allow the data files to be used with a wide variety of data analysis tools. A standards-based file format also allows the files to be readily accessible through web services and other data access, visualization, and subsetting mechanisms, thereby broadening the use of the data files to other disciplines. The data values are never altered during the QA steps and file format conversion process. During the QA process, the spatial, temporal, and scientific integrity of the data files is also evaluated. For example, the QA staff member checks whether the data files have the same temporal and spatial extent and resolution as described in the documentation supplied by the data provider. In some instances, the ORNL DAAC has received files for a smaller region when the documentation indicates that the data set is global. In addition, the variables described in the documentation are cross-checked against the contents of the data files. For example, in some cases, the data files received may have been scaled but the documentation does not describe the scaling. The QA staff member identifies such issues to ensure the integrity of the data files and, if necessary, confirms them with the data provider.
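The cross-check between documentation and file contents can be sketched as a comparison of two small records: what the provider documented and what was extracted from the files. The `Description` class, its field names, and the sample values below are illustrative assumptions; the sample mirrors the case mentioned above in which documentation claims global coverage but the files cover a smaller region.

```python
from dataclasses import dataclass, field


@dataclass
class Description:
    """Extent and variables, either as documented or as extracted from the files."""
    west: float
    south: float
    east: float
    north: float
    variables: set = field(default_factory=set)


def compare(documented, extracted, tolerance=1e-6):
    """List discrepancies between the documentation and the file contents."""
    issues = []
    for edge in ("west", "south", "east", "north"):
        doc_value, file_value = getattr(documented, edge), getattr(extracted, edge)
        if abs(doc_value - file_value) > tolerance:
            issues.append(f"{edge}: documented {doc_value}, files contain {file_value}")
    for name in sorted(documented.variables - extracted.variables):
        issues.append(f"variable '{name}' is documented but absent from the files")
    for name in sorted(extracted.variables - documented.variables):
        issues.append(f"variable '{name}' is in the files but not documented")
    return issues


# Made-up example: documentation says global, files cover a regional subset.
documented = Description(-180.0, -90.0, 180.0, 90.0, {"tmax", "tmin", "prcp"})
extracted = Description(-101.8, 32.5, -85.2, 35.05, {"tmax", "tmin", "srad"})
issues = compare(documented, extracted)
```

A check like this only flags candidates for follow-up; deciding whether the files or the documentation are wrong still requires confirmation from the data provider.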
Major issues with the data files are identified during this stage. A detailed QA checklist is provided on the ORNL DAAC website [16]. If there are any unresolved questions or if there are any issues with the data files identified through the QA process, the interaction with the data provider is reopened, and email communications are initiated and tracked to resolve the issues. The speed at which the curation progresses depends on the responsiveness of the data provider and the integrity and completeness of the data files.
In addition to the QA, ORNL DAAC staff also prepare metadata for discovery and compile comprehensive documentation that is relevant for future users; we use the 20-year rule [17], a time far enough into the future to be useful for preparing documentation for both sharing and archiving data. Compiling descriptive data set documentation for future users is a time-consuming but critical curation process. During the documentation phase, verification is performed to ensure that the documentation matches the files received. During curation, ORNL DAAC staff evaluate whether the data set and its contents are clearly described and whether the geospatial and temporal information is complete. Other key information about the data files, such as the data file parameters, units, research methodology, etc., is added to the documentation. If the data set contains data about field stations, information about those field sites (site name, geographic place name, geographic coordinates, elevation, and climate, biosphere, and soil characteristics) is added to the documentation. Calibration information, algorithms, and data quality information are added to the documentation as well. The documentation staff also build a comprehensive reference list that allows users to link the data files to the published research articles that were used to conduct the research and create the data files. Any data use or access policy information is added to the documentation as well.
One of the key benefits of the SAuS system is the ability to centrally manage the documentation and metadata workflow during the curation phase. The documentation is compiled, edited, and approved by several data archive staff. An online metadata editor provided by the SAuS system allows the documentation and metadata to be shared and edited through a common centralized web-based system. The centralized web-based system eliminates duplication of document versions residing on individuals’ systems and also reduces the need for paper printouts, allowing the finalized documentation to be edited and approved directly within the online ingest system. To keep all of this information synchronized, to facilitate consistency, and to eliminate redundancy, the SAuS metadata editor provides views of the metadata XML files, the documentation HTML pages, and the database table view of the data file records. The editor integrates the information across the three views to allow seamless access and eliminate duplication, thereby making the process more efficient. Before SAuS, if the description of a data file had to be updated, for example, the information had to be changed by hand in three places: the documentation HTML, the XML record, and the relational database that powers the archive web interface. Changes therefore took more time and could lead to inconsistent content. The centralized automated SAuS system facilitates integration and speeds up the process of documentation creation. Figure 7 shows a screenshot of the metadata editor.
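The benefit of integrating the three views can be illustrated by generating all of them from one authoritative record, so that a single edit propagates everywhere. The sketch below is only a conceptual model of that idea: the record fields, the `dataset` table, and the function names are assumptions, and the actual SAuS editor is a web-based system whose internals are not described here.

```python
import html
import sqlite3
import xml.etree.ElementTree as ET


def to_xml(record):
    """Render the metadata XML view of the record."""
    root = ET.Element("dataset", id=record["id"])
    for key in ("title", "description"):
        ET.SubElement(root, key).text = record[key]
    return ET.tostring(root, encoding="unicode")


def to_html(record):
    """Render the documentation HTML view of the record."""
    return (f"<h1>{html.escape(record['title'])}</h1>"
            f"<p>{html.escape(record['description'])}</p>")


def to_database(record, db):
    """Write the database table view of the record."""
    db.execute("INSERT OR REPLACE INTO dataset VALUES (:id, :title, :description)", record)
    db.commit()


def publish_views(record, db):
    """Regenerate all three views from the single authoritative record."""
    to_database(record, db)
    return to_xml(record), to_html(record)


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE dataset (id TEXT PRIMARY KEY, title TEXT, description TEXT)")
record = {"id": "example-1", "title": "Soil & Litter Carbon", "description": "First draft."}
publish_views(record, db)
record["description"] = "Revised description."    # one edit in one place...
xml_view, html_view = publish_views(record, db)    # ...reaches every view
```

Compared with the pre-SAuS situation of editing three artifacts by hand, the views here cannot drift apart because none of them is edited directly.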
After the quality of the data files is verified and the documentation is compiled, the ORNL DAAC generates a data set citation that includes the familiar elements of a citation: authors, year, title, and digital object identifier (DOI). The DOI for the data set will remain fixed, but the location (URL) of the data set may change. The DOI replaces the UUID, which is used only during internal curation. An example citation is provided below.
Thornton, P.E., M.M. Thornton, B.W. Mayer, N. Wilhelmi, Y. Wei, R. Devarakonda, and R.B. Cook. 2014. Daymet: Daily Surface Weather Data on a 1-km Grid for North America, Version 2. ORNL DAAC, Oak Ridge, Tennessee, USA. Accessed August 25, 2015. Time Period: 1980-01-01 to 1985-12-31. Spatial range: N=35.05, S=32.50, W=-101.80, E=-85.20. http://dx.doi.org/10.3334/ORNLDAAC/1219
The citation acknowledges the researchers who provided the data products. The SAuS workflow provides the metadata needed to register the data set with the DOI registry using EzID [18]. The metadata used in the workflow also facilitates the creation of a data set landing page. The data set landing page is a web location, reached by a client through resolution of the DOI, that shows access information for the data set. The SAuS system also facilitates the creation of the data file and metadata distribution package for user access.
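Since the citation is assembled from metadata already held in the workflow, its construction can be sketched as a simple formatting function. The metadata keys, the function names, and the `http://dx.doi.org/` resolver prefix are assumptions for illustration; the optional time period and spatial range that appear in citations for user-selected subsets are omitted to keep the sketch short.

```python
def join_authors(authors):
    """Join pre-formatted author names with commas and a final 'and'."""
    if len(authors) == 1:
        return authors[0]
    if len(authors) == 2:
        return f"{authors[0]} and {authors[1]}"
    return ", ".join(authors[:-1]) + ", and " + authors[-1]


def format_citation(meta, accessed, resolver="http://dx.doi.org/"):
    """Assemble a data set citation from workflow metadata."""
    parts = [
        f"{join_authors(meta['authors'])}.",
        f"{meta['year']}.",
        f"{meta['title']}.",
        f"{meta['publisher']}.",
        f"Accessed {accessed}.",
        f"{resolver}{meta['doi']}",   # the DOI stays fixed even if the data set URL changes
    ]
    return " ".join(parts)


# Values taken from the example citation in the text.
daymet = {
    "authors": ["Thornton, P.E.", "M.M. Thornton", "B.W. Mayer", "N. Wilhelmi",
                "Y. Wei", "R. Devarakonda", "R.B. Cook"],
    "year": 2014,
    "title": "Daymet: Daily Surface Weather Data on a 1-km Grid for North America, Version 2",
    "publisher": "ORNL DAAC, Oak Ridge, Tennessee, USA",
    "doi": "10.3334/ORNLDAAC/1219",
}
citation = format_citation(daymet, accessed="August 25, 2015")
```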
5.5. Stage 3 (Publish): Publication and Post-Publication Activities
When the data set is formally released, the ORNL DAAC distributes the metadata to the NASA EOSDIS clearinghouse and other relevant data catalogues. If applicable, the ORNL DAAC also provides tools to explore, access, and extract data. These tools include web services such as the Open Geospatial Consortium (OGC) Web Map Service (WMS) and Web Coverage Service (WCS), OPeNDAP, and other REST/SOAP-based web services. The standardization of the workflow through the SAuS system simplifies the integration of the data files into these tools. The ORNL DAAC also advertises the data through email, social media, and the ORNL DAAC website. An automated script prepares an email message to the data authors, congratulating them on publishing the data product and encouraging them to add the citation to their curriculum vitae; a DAAC staff member sends this message.
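To give a sense of how a client might use one of these standards-based services, the sketch below builds an OGC WMS 1.3.0 GetMap request URL. The query parameters are those defined by the WMS standard; the endpoint `https://example.org/wms` and the layer name are placeholders, not real ORNL DAAC service addresses, and the bounding box reuses the spatial range from the example citation.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def wms_getmap_url(endpoint, layer, bbox, width=800, height=400):
    """Build a WMS 1.3.0 GetMap URL.

    bbox is (west, south, east, north) in degrees; CRS:84 is used because
    it keeps longitude/latitude axis order.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",                       # empty string requests the default style
        "CRS": "CRS:84",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return f"{endpoint}?{urlencode(params)}"


# Placeholder endpoint and layer name; only the parameter names are standard.
url = wms_getmap_url("https://example.org/wms", "example_layer",
                     (-101.80, 32.50, -85.20, 35.05))
query = parse_qs(urlparse(url).query, keep_blank_values=True)   # round-trip check
```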
The ORNL DAAC also provides long-term data stewardship for the data set. The archive provides secure long-term storage of the data files and acts as a buffer between the data users and the data contributors to address any questions about the data set. To provide long-term storage, the ORNL DAAC continuously refreshes its hardware to prevent bit rot and other unintended changes to the data files caused by storage hardware issues. The ORNL DAAC creates and tests back-up copies often to guard against data loss. The ORNL DAAC also maintains at least three copies of the data: the original, an on-site but external backup, and an off-site backup in case of a disaster. In addition, the ORNL DAAC updates documentation for data sets based on any new information collected. The ORNL DAAC collects and provides data download and citation statistics to gauge the impact of the data sets. Data citations were implemented in 1998 at the ORNL DAAC to provide credit to the data authors, give an estimate of the scientific impact of the ORNL DAAC, and enable readers to access the data used in an article. The ORNL DAAC also added Digital Object Identifiers to its data holdings in 2007 to provide more legitimacy to data citations. The data citations and DOIs facilitate the identification of the use of data products in the literature. The ORNL DAAC has integrated data product citations throughout its data workflow and has incorporated data citation metrics to gauge the scientific impact of a data set and to allow data users to understand the various applications of a particular data set.
The accompanying figure illustrates an example webpage listing all publications that have used a particular data set from the ORNL DAAC, in this case the data set “NACP Aboveground Biomass and Carbon Baseline Data, V.2 (NBCD 2000), U.S.A., 2000” (doi: 10.3334/ORNLDAAC/1161; http://daac.ornl.gov/cgi-bin/show_pubs.pl?ds=1161).