Data management guide

This version of the data management guide was updated 11 May 2022.

This guide is intended to inform parts of your project data management plan. You can use it to answer some of the questions in the data management checklist.

1- Keep track of all processing steps and data used to create a figure.

There should be an uninterrupted and unambiguous chain from reagents to raw data to each figure of a published paper. Anybody should be able to follow this chain, not just the people who’ve been involved in the work. Disambiguate reagents by always associating them with a database identifier (either from an internal database or from a supplier’s catalog).

2- Create a project folder with sub-folders for experiments and data sets.

Check the project organisation page for suggestions on how to organise and manage a project folder.

3- Name files and folders in a consistent way.
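One way to keep names consistent is to generate them from a fixed pattern rather than typing them by hand. A minimal sketch (the pattern and the helper function are examples, not a prescribed standard):

```python
from datetime import date

def make_filename(project, experiment, sample, extension):
    """Build a file name following a fixed, parseable pattern:
    <project>_<experiment>_<YYYYMMDD>_<sample>.<ext>
    The pattern itself is illustrative; pick one and stick to it."""
    today = date.today().strftime("%Y%m%d")
    # Avoid spaces and mixed separators so names sort and parse predictably.
    parts = [p.replace(" ", "-").lower() for p in (project, experiment, sample)]
    return f"{parts[0]}_{parts[1]}_{today}_{parts[2]}.{extension}"

print(make_filename("Chromosome condensation", "primary screen", "Plate1", "csv"))
```

Names built this way sort chronologically within a project and can be split back into their components with a single `split("_")`.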

4- Document data.

The data documentation file should describe the data set and how it was produced.

Template files are available for images and arrays (i.e. tabular and higher-dimensional data). The templates are in the YAML format, which is both human- and computer-readable and writable. For a quick introduction, see the Wikipedia article. Once the metadata have been entered, verify that the file is still valid YAML using a YAML validator (e.g. an online one).
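Validation can also be done locally with a YAML library, for example PyYAML in Python. A minimal sketch (the metadata fields shown are invented for illustration, not taken from the official templates):

```python
import yaml  # PyYAML; install with: pip install pyyaml

# An illustrative metadata fragment, as might be entered in a template file.
metadata = """
sample: plate1_well001
channels:
  - Cy3
  - EGFP
"""

try:
    record = yaml.safe_load(metadata)
    print("Valid YAML:", record["channels"])
except yaml.YAMLError as exc:
    print("Invalid YAML:", exc)
```

`yaml.safe_load` raises a `YAMLError` on malformed input, so the same few lines serve as a quick validity check for any metadata file.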

5- Register your project and its data in a data management system.

At EMBL, use either the Data Management app or STOCKS.

STOCKS is a more comprehensive system with electronic lab notebook and LIMS functionalities, allowing you to link data sets to samples, reagents and protocols.

The DM app is a lightweight data registration system with minimal support for metadata.

6- Document data processing steps.

If you can, use a workflow management system (e.g. Snakemake, Nextflow, Galaxy on the EMBL server or a public server, targets, …); among other things, it will document the processing steps for you.
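Workflow managers document each step as a rule declaring its inputs, outputs and command, so the processing chain is recorded by the workflow itself. A minimal Snakemake sketch (the file names and the `count_cells` command are invented for illustration):

```
# Snakefile (illustrative)
rule count_cells:
    input:
        "images/plate1_replicate1/well001/W001-P001-Z000-T0000-s1234-Cy3.tif"
    output:
        "analysis/feature_extraction/plate1_replicate1/well001/W001-P001-Z000-T0000-s1234-Cy3.txt"
    shell:
        "count_cells {input} > {output}"  # hypothetical command
```

Running the workflow then leaves a machine-readable record of exactly which inputs and commands produced each output file.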

7- Don’t keep unused/unusable data.

8- Prepare tabular data for processing and re-use.

If you don’t need or want a server, consider using SQLite: a serverless relational database management system that is configuration-free and portable. An SQLite database is a single file that can grow up to 281 TB or the file size limit of the file system used, whichever is smaller.
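SQLite support is built into Python's standard library, so no installation is needed. A minimal sketch (the table and values are invented for illustration):

```python
import sqlite3

# An in-memory database for demonstration; pass a file path
# (e.g. "screen.db") instead to get a persistent single-file database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE measurements (well TEXT, channel TEXT, intensity REAL)")
con.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [("well001", "Cy3", 812.5), ("well001", "EGFP", 430.0)],
)
con.commit()

# Query it back with ordinary SQL.
rows = list(con.execute("SELECT well, channel, intensity FROM measurements"))
for row in rows:
    print(row)
con.close()
```

The resulting `.db` file can be opened from R, Python or any SQLite-aware tool, which makes it a convenient exchange format for tabular data.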

9- High-throughput data.

Organise the data hierarchically by plate and well, as this is the structure most analysis software can use, e.g.:

▽ Chromosome_condensation_project
    ▽ primary_screen
        ▽ images
            ▽ plate1_replicate1
                ▽ well001
                    W001-P001-Z000-T0000-s1234-Cy3.tif
                    W001-P001-Z000-T0000-s1234-EGFP.tif
                ▷ well002
            ▷ plate1_replicate2
            ▷ plate2_replicate1
        ▽ analysis
            ▷ configuration_files
            ▷ segmentation
            ▽ feature_extraction
                ▽ plate1_replicate1
                    ▽ well001
                        W001-P001-Z000-T0000-s1234-Cy3.txt
                        W001-P001-Z000-T0000-s1234-EGFP.txt
                    ▷ well002
                ▷ plate1_replicate2
                ▷ plate2_replicate1
            ▷ code
    ▷ validation_screen
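A hierarchy like this can be created programmatically rather than by hand. A minimal sketch using Python's pathlib (the plate and well names are examples):

```python
from pathlib import Path

# Create the plate/well hierarchy for a screen (paths are illustrative).
root = Path("Chromosome_condensation_project") / "primary_screen" / "images"
for plate in ("plate1_replicate1", "plate1_replicate2"):
    for well in ("well001", "well002"):
        (root / plate / well).mkdir(parents=True, exist_ok=True)

# List the plate/well folders that were created.
print(sorted(str(p.relative_to(root)) for p in root.glob("*/*")))
```

Creating the folders from a script guarantees that every plate and well follows the same naming scheme, which is exactly what downstream analysis software relies on.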

10- Ensure file integrity.

Errors can happen when copying or moving files, so copies should be checked against the originals by comparing checksums of the two files. A simple way of doing this is to copy files using rsync, then repeat the rsync command with the -c (compare checksums) and -n (dry run) options: any file rsync reports would be re-copied, i.e. it differs between the two locations. Avoid copying files using drag-and-drop.
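The same check can be scripted directly, for example with Python's hashlib. A minimal sketch (the file names and contents are invented for illustration):

```python
import hashlib
from pathlib import Path

def sha256sum(path):
    """Return the SHA-256 checksum of a file, reading it in chunks
    so that large data files do not have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo with two small files standing in for an original and its copy.
Path("original.dat").write_bytes(b"raw microscope data")
Path("copy.dat").write_bytes(b"raw microscope data")
print(sha256sum("original.dat") == sha256sum("copy.dat"))  # True if the copy is intact
```

The same checksums can be stored alongside the data (e.g. in a `checksums.txt` file) so that integrity can be re-verified at any later time.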