Back up your data

To avoid accidental loss of data you should:

  • Back up your data at regular intervals
    • When you complete your data collection activity
    • After you make edits to your data
  • Streaming data should be backed up at regularly scheduled points in the collection process
    • High-value data should be backed up daily or more often
    • Automation simplifies frequent backups
  • Backup strategies (e.g., full, incremental, differential) should be optimized for the data collection process
  • Create at least two copies of your data
  • Place one copy at an “off-site” and “trusted” location
    • Commercial storage facility
    • Campus file-server
    • Cloud file-server (e.g., Amazon S3, Carbonite)
  • Use a reliable device when making backups
    • External USB hard drive (avoid “light-weight” devices such as floppy disks and USB stick-drives, and avoid network drives that are only intermittently accessible)
    • Managed network drive
    • Managed cloud file-server (e.g., Amazon S3, Carbonite)
  • Ensure backup copies are identical to the original copy (a verification sketch follows this list)
    • Perform differential checks
    • Perform a checksum comparison
  • Document all procedures to ensure a successful recovery from a backup copy
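
To make the checksum comparison concrete, here is a minimal Python sketch that verifies a backup copy against the original by comparing SHA-256 checksums. The file paths are hypothetical placeholders for the files used in your own backup procedure.

    import hashlib
    from pathlib import Path

    def sha256_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
        """Compute the SHA-256 checksum of a file, reading it in 1 MB chunks."""
        digest = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def verify_backup(original: Path, backup: Path) -> bool:
        """Return True if the backup copy is bit-for-bit identical to the original."""
        return sha256_checksum(original) == sha256_checksum(backup)

    # Hypothetical paths; substitute the files from your own backup procedure.
    if verify_backup(Path("field_data.csv"), Path("/mnt/backup/field_data.csv")):
        print("Backup verified: checksums match.")
    else:
        print("WARNING: backup copy differs from the original.")
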
Additional information: Boyle’s Laws in a Networked World: How the future of science lies in understanding our past
Decide what data to preserve

The process of science generates a variety of products that are worthy of preservation. Researchers should consider all elements of the scientific process in deciding what to preserve:

  • Raw data
  • Tables and databases of raw or cleaned observation records and measurements
  • Intermediate products, such as partly summarized or coded data that are the input to the next step in an analysis
  • Documentation of the protocols used
  • Software or algorithms developed to prepare data (cleaning scripts) or perform analyses
  • Results of an analysis, which can themselves be starting points or ingredients in future analyses, e.g. distribution maps, population trends, mean measurements
  • Any data sets obtained from others that were used in data processing
  • Multimedia, whether documenting procedures or serving as standalone data

When deciding on what data products to preserve, researchers should consider the costs of preserving data:

  • Raw data are usually worth preserving
  • Consider space requirements when deciding on whether to preserve data
  • If data can be easily or automatically re-created from the raw data, consider not preserving them; conversely, data that have undergone quality control and analysis may be worth preserving because reproducing them could be costly
  • Algorithms and software source code cost very little to preserve
  • Results of analyses may be particularly valuable for future discovery and cost very little to preserve

Researchers should consider the following goals and benefits of preservation:

  • Enabling re-analysis of the same products to determine whether the same conclusions are reached
  • Enabling re-use of the products for new analysis and discovery
  • Enabling restoration of original products if working datasets are lost
Additional information: From Rolling Deck to Repository (R2R): Lessons Learned in Managing Data for the US Research Fleet
Identify data sensitivity

Steps for identifying the sensitivity of data and determining the appropriate security or privacy level are:

  • Determine whether the data raise any confidentiality concerns
    • Could an unauthorized individual use the information to do limited, serious, or severe harm to individuals, assets, or an organization’s operations as a result of data disclosure?
    • Would unauthorized disclosure or dissemination of elements of the data violate laws, executive orders, or agency regulations (e.g., HIPAA or privacy laws)?
  • Determine whether the data raise any integrity concerns
    • What would be the impact of unauthorized modification or destruction of the data?
    • Would it reduce public confidence in the originating organization?
    • Would it create confusion or controversy in the user community?
    • Could a potentially life-threatening decision be made based on the data or an analysis of the data?
  • Determine whether there are any availability concerns about the data
    • Is the information time-critical? Will another individual or system rely on the data to make a time-sensitive decision (e.g., sensor data for earthquakes, floods, etc.)?
  • Document the data concerns identified and determine the overall sensitivity (Low, Moderate, High); a simple way to record this assessment is sketched after this list
    • Low criticality would result in a limited adverse effect to an organization as a result of the loss of confidentiality, integrity, or availability of the data. It might mean degradation in mission capability or result in minor harm to individuals.
    • Moderate criticality would result in a serious adverse effect to an organization as a result of the loss of confidentiality, integrity, or availability of the data. It might mean a significant degradation in mission capability or result in significant harm to individuals that does not involve loss of life or serious life-threatening injuries.
    • High criticality would result in a severe or catastrophic adverse effect as a result of the loss of confidentiality, integrity, or availability of the data. It might cause a severe degradation in or loss of mission capability or result in severe or catastrophic harm to individuals involving loss of life or serious life-threatening injuries.
  • Develop data access and dissemination policies and procedures based on sensitivity of the data and need-to-know.
  • Develop data protection policies, procedures and mechanisms based on sensitivity of the data.
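
As a minimal sketch of how the outcome of such a review might be documented, the following Python snippet records a Low/Moderate/High rating for each concern and derives the overall sensitivity as the highest rating assigned. The dataset name, field names, and ratings are illustrative assumptions, not part of any prescribed standard.

    from dataclasses import dataclass

    # Sensitivity levels ordered from least to most sensitive, mirroring the categories above.
    LEVELS = ("Low", "Moderate", "High")

    @dataclass
    class SensitivityAssessment:
        """Record the outcome of a data-sensitivity review for one dataset."""
        dataset: str
        confidentiality: str  # one of "Low", "Moderate", "High"
        integrity: str
        availability: str
        notes: str = ""

        def overall(self) -> str:
            """Overall sensitivity is the highest level assigned to any single concern."""
            return max(
                (self.confidentiality, self.integrity, self.availability),
                key=LEVELS.index,
            )

    # Hypothetical example: a survey dataset containing personal identifiers.
    assessment = SensitivityAssessment(
        dataset="household_survey_2023",
        confidentiality="High",    # disclosure would violate privacy regulations
        integrity="Moderate",      # unauthorized modification could mislead users
        availability="Low",        # not time-critical
        notes="Restrict access; share only de-identified derivatives.",
    )
    print(assessment.overall())  # prints "High"
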
Identify data with long-term value

As part of the data life cycle, research data will be contributed to a repository to support preservation and discovery. A research project may generate many different iterations of the same dataset - for example, the raw data from the instruments, as well as datasets which already include computational transformations of the data.

In order to focus resources and attention on these core datasets, the project team should define these core data assets as early in the process as possible, preferably at the conceptual stage and in the data management plan. It may be helpful to speak with your local data archivist or librarian in order to determine which datasets (or iterations of datasets) should be considered core, and which datasets should be discarded. These core datasets will be the basis for publications, and require thorough documentation and description.

  • Only datasets that have significant long-term value should be contributed to a repository, which requires deciding which datasets need to be kept.
  • If data cannot be recreated, or are costly to reproduce, they should be saved.
  • Four different categories of potential data to save are observational, experimental, simulation, and derived (or compiled).
  • Your funder or institution may have requirements and policies governing contribution to repositories.

Given the amount of data produced by scientific research, keeping everything is neither practical nor economically feasible.

Store data with appropriate precision

Data should not be recorded with higher precision than they were collected at (e.g., if a device collects data to 2 decimal places, an Excel file should not present them to 5 decimal places). If the system stores data at higher precision, care needs to be taken when exporting to ASCII. For example, calculations in Excel are performed at the highest precision the system supports, which is unrelated to the precision of the original data.
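
As an illustration, the following Python sketch writes values to an ASCII (CSV) file at the precision they were collected at (two decimal places here, matching the example above). The file name, column names, and readings are hypothetical.

    import csv

    # Hypothetical readings; in practice these might come from a spreadsheet or
    # database that stores values at full double precision.
    readings = [("2024-05-01", 12.340000000000001), ("2024-05-02", 12.417893215)]

    COLLECTED_DECIMALS = 2  # the precision at which the instrument actually recorded data

    with open("readings_export.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["date", "temperature_c"])
        for date, value in readings:
            # Format each value at the collected precision so the exported file
            # does not imply more accuracy than the instrument provided.
            writer.writerow([date, f"{value:.{COLLECTED_DECIMALS}f}"])
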

Additional information: The data flood: Implications for data stewardship and the culture of discovery