Data management checklist

Preamble

This checklist is intended to help you create a data management plan for your project. The goal of the data management plan is to facilitate the practical implementation of EMBL’s open science policy and thereby promote the transparency, auditability and reproducibility of a research project. The plan should describe how digital traceability of the items that contribute to a project's output (i.e. reagents, instruments, software and data) will be implemented so that, ideally, there is an uninterrupted and unambiguous chain of information from reagents to instruments to data to analysis output.

Points to consider

1- What is the project?

  • What is the scope?
  • Will there be different types of data?
  • Who are the stakeholders?
  • Will EMBL facilities be involved?

2- What data will be produced and used?

  • Where and how will it be stored?
  • What is the data content and structure?
  • Which open formats will be used? If an open format is not used, why not, and will all stakeholders be able to use the chosen format?
  • Which open standards will be used? If an open standard is not used, why not, and will all stakeholders be able to use the chosen standard?
  • Have the storage requirements been identified?
      • Disk space?
      • Databases?
      • Cloud?
  • Have access requirements been identified?
      • Will external collaborators need access?
      • Does the storage type meet the requirements of the expected access patterns?
      • Permissions?
      • Bandwidth?
  • What is the backup strategy?
      • What will be backed up?
      • How often?
      • On which media?
      • Incremental, differential or full backup?
  • How will data duplication be avoided to ensure all stakeholders work with the same data?
  • How will reuse of existing data be identified, documented and included in the data management plan?
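
As an illustration of the duplication question above, unintended copies of a file can be found by comparing content checksums rather than file names. The following is a minimal sketch, assuming the data lives on a locally mounted filesystem; the function names `file_sha256` and `find_duplicates` are hypothetical and not part of any EMBL tooling.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Map each digest shared by two or more files under root to their paths."""
    by_digest = defaultdict(list)
    # Sorting makes the order of paths in each group deterministic.
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_digest[file_sha256(path)].append(str(path))
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}
```

Calling `find_duplicates("/path/to/project")` (a placeholder path) returns an empty dictionary when every file is unique. The same digests could also be stored with a backup to verify later that a restored file is identical to the original.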

3- What information is needed for the data to be read and interpreted?

  • Does reading the data require specific software?
  • How will the data be documented?
  • Which metadata will be associated with the data and how will it be linked?
  • Which types of identifiers will be used to identify biological and other entities?
  • Which ontologies or controlled vocabularies will be used?
  • Where can the data documentation be found?
  • How can different types of metadata be accessed?
  • Will a laboratory information management system be used?
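
One possible answer to the question of how metadata is linked to the data is a "sidecar" file stored next to each data file. The sketch below is only an illustration under stated assumptions: the `.meta.json` suffix, the record layout and the function names are hypothetical, and whole files are read into memory, which suits small files only.

```python
import hashlib
import json
from pathlib import Path


def write_sidecar(data_path, metadata):
    """Write '<data file name>.meta.json' next to the data file; return its path."""
    data_path = Path(data_path)
    record = {
        "file_name": data_path.name,
        "size_bytes": data_path.stat().st_size,
        # The checksum ties the metadata to one exact version of the data.
        "sha256": hashlib.sha256(data_path.read_bytes()).hexdigest(),
        "metadata": metadata,
    }
    sidecar_path = data_path.with_name(data_path.name + ".meta.json")
    sidecar_path.write_text(
        json.dumps(record, indent=2, sort_keys=True), encoding="utf-8"
    )
    return sidecar_path


def sidecar_matches(data_path):
    """Return True if the data file still has the checksum stored in its sidecar."""
    data_path = Path(data_path)
    sidecar_path = data_path.with_name(data_path.name + ".meta.json")
    record = json.loads(sidecar_path.read_text(encoding="utf-8"))
    return record["sha256"] == hashlib.sha256(data_path.read_bytes()).hexdigest()
```

The `metadata` argument is where identifiers and ontology terms would go, for example `{"organism": "NCBITaxon:9606"}` for a human sample; the key name here is an assumption, not a prescribed schema.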

4- Which procedures will be used to create, process and quality control the data and metadata?

  • How will data and files be named and organised?
  • Will a naming scheme be implemented so that names and abbreviations are consistent, unambiguous and self-explanatory?
  • Where will this scheme be documented?
  • How will changes be tracked and propagated?
  • How will metadata and provenance be preserved?
  • How will derived data be updated?
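
A naming scheme is easier to keep consistent when it can be checked automatically. The sketch below assumes a purely hypothetical scheme, `<project>_<sample>_<YYYYMMDD>_v<NN>.<ext>`; the pattern and the function name `parse_name` are illustrative and would need to be adapted to whatever scheme the project actually documents.

```python
import re
from datetime import datetime

# Hypothetical scheme: project_S###_YYYYMMDD_vNN.ext, lower case, no spaces.
NAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)"
    r"_(?P<sample>S\d{3})"
    r"_(?P<date>\d{8})"
    r"_v(?P<version>\d{2})"
    r"\.(?P<ext>[a-z0-9]+)$"
)


def parse_name(filename):
    """Return the name's fields as a dict, or None if it breaks the scheme."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        return None
    fields = match.groupdict()
    try:
        # Reject strings that look like dates but are not real calendar days.
        datetime.strptime(fields["date"], "%Y%m%d")
    except ValueError:
        return None
    return fields
```

A name such as `embryo_S001_20240131_v01.tif` is accepted and split into its fields, while `Embryo final.tif` is rejected. Running such a check when files are created catches inconsistent names before they propagate into derived data.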

5- How will data processing operations be documented so that they can be reproduced?

  • How will manual data processing interventions be documented?
  • How will configuration parameters be documented?
  • How will different analysis versions be documented and tracked?
  • Will a scientific workflow management system be used?
  • Will open source software be used?
  • Will the different pieces of software used be interoperable?
  • Will original software be produced?
  • If so, how will it be made available?
  • Is there a software management plan?
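
To illustrate how configuration parameters might be documented for reproducibility, each processing step can write a small record of its settings and software environment alongside its output. This is a minimal sketch using only the Python standard library; the record fields and the function names `build_run_record` and `save_run_record` are assumptions, and a workflow management system would normally capture this information in its own way.

```python
import json
import platform
import sys
from datetime import datetime, timezone


def build_run_record(step_name, parameters):
    """Collect the parameters and environment details of one processing step."""
    return {
        "step": step_name,
        "parameters": parameters,
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "command_line": list(sys.argv),
        "recorded_at_utc": datetime.now(timezone.utc).isoformat(),
    }


def save_run_record(record, path):
    """Write the record as human-readable JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)
```

Keeping one such record per output file, under version control or next to the data, makes it possible to see later which parameters produced which result. The parameters must be JSON-serialisable (numbers, strings, lists, dictionaries) for `save_run_record` to succeed.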

6- Are there any security or access control requirements?

  • Who will access the data, how and for what purpose?
  • How will security and access be managed?

7- How will intellectual property (IP) be managed?

  • Who owns the data?
  • How will the data be licensed for reuse?
  • Are there any restrictions on the reuse of third-party data?
  • Will data sharing be postponed or restricted, e.g. to publish or seek patents?

8- What happens to the data at the end of the project?

  • Which public repositories will the data be released to?
  • What happens to data not linked to a publication?
  • Will software and proper documentation be made available to reproduce derived data from deposited primary data?
  • Who is the contact person?

9- Who is responsible for which part of data management?

  • Who maintains the data management plan and informs the project’s stakeholders?
  • Is there a procedure to check compliance with the data management plan?
  • Who is responsible for which tasks and resources (e.g. storage, backup, data processing, data deposition)?
  • Who is responsible for updates and for propagating the resulting changes?