Data management guide
This version of the data management guide was updated 11 May
2022.
This guide is intended to inform parts of your project data
management plan. You can use it to answer some of the questions in the
data management
checklist.
1- Keep track of all processing steps and data used to create
a figure.
There should be an uninterrupted and unambiguous chain from reagents
to raw data to each figure of a published paper. Anybody should be able
to follow this chain, not just the people who’ve been involved in the
work. Disambiguate reagents by always associating them with a database
identifier (either from an internal database or from a supplier’s
catalog).
2- Create a project folder with sub-folders for experiments
and data sets.
- Organize files consistently by source or experiment/analysis
type.
- Separate primary data, analysis results and code.
- Create a sub-directory for each data set.
- Place the data management plan in the project folder.
- Keep a plain text documentation file (e.g. README.txt) and specific
metadata files in each sub-folder.
Check the project organisation page for some suggestions on how to organize and manage a project folder.
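For instance, a minimal layout following these points might look like this (the names are illustrative, not prescribed):

    my_project/
        data_management_plan.txt
        dataset_01/
            README.txt
            metadata.yaml
            raw/
            analysis/
        dataset_02/
        code/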
3- Name files and folders in a consistent way.
- Define conventions at the beginning of the project and write them
down in the data management plan so that you can refer to them later and
stick to them.
- When working with collaborators, share the data management plan and make sure everyone follows the conventions.
- Each name should be unique, indicative of content and
machine-readable.
- Essential rules for naming files and folders are:
- hard-code essential information in file names in a consistent way, i.e. always in the same order and in a standardized form, e.g. use the ISO format for dates (yyyy-mm-dd) and official gene symbols (e.g. RANBP2, not NUP358).
- use only ASCII alphanumeric characters, i.e. 0-9, a-z and A-Z, and separate meaningful fields/information using underscores. Do not use spaces or hyphens.
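As an illustration, a small helper can enforce such a convention programmatically; the fields and their order below are assumptions for the example, not a prescribed scheme:

    import datetime
    import re

    def build_name(date: datetime.date, gene: str, assay: str, ext: str) -> str:
        """Build a file name of the form <yyyy-mm-dd>_<GENE>_<assay>.<ext>."""
        # The ISO date is the one field that legitimately contains hyphens;
        # every other field must be strictly ASCII alphanumeric.
        for field in (gene, assay, ext):
            if not re.fullmatch(r"[0-9A-Za-z]+", field):
                raise ValueError(f"invalid characters in field: {field!r}")
        return f"{date.isoformat()}_{gene}_{assay}.{ext}"

    # e.g. '2022-05-11_RANBP2_imaging.tif'
    print(build_name(datetime.date(2022, 5, 11), "RANBP2", "imaging", "tif"))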
4- Document data.
The data documentation file should describe:
- how the data was created,
- the naming conventions (or refer to the data management plan),
- the data type, format and structure (i.e. dimensions of the data and
what each dimension represents),
- measurement units,
- coding schemes used, e.g. the code used for missing data.
Template files are available for images
and arrays
(i.e. tabular and higher-dimensional data). The templates are in the YAML format, which is both human- and computer-readable and writable. For a quick introduction, see the Wikipedia article. Once the metadata have been entered, verify that the file is still valid YAML using a YAML validator (e.g. online).
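If you prefer to check files locally rather than with an online validator, a few lines of Python with the PyYAML package (an assumption; any YAML parser works) do the same job:

    import sys
    import yaml  # PyYAML: pip install pyyaml

    def validate_yaml(path):
        """Try to parse the file and report the parser's error if it fails."""
        with open(path) as fh:
            try:
                yaml.safe_load(fh)
                print(f"{path}: valid YAML")
            except yaml.YAMLError as err:
                sys.exit(f"{path}: invalid YAML\n{err}")

    validate_yaml("image_metadata.yaml")  # hypothetical metadata file name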
5- Register your project and its data in a data management
system.
At EMBL, use either the Data Management
app or STOCKS.
STOCKS is a more comprehensive system with electronic lab notebook and LIMS functionalities, allowing you to link data sets to samples, reagents and protocols.
The DM app is a lightweight data registration system with minimal
support for metadata.
6- Document data processing steps.
- Record software and parameter choices, including code/software
versions (use GitLab for your own code).
- Record the exact command line used.
- Record pre/post-processing steps.
- Use a code notebook (e.g. RStudio, Jupyter …).
If you can, use a workflow management system (e.g. snakemake, nextflow, Galaxy on the EMBL or a public server, targets …); it will take care (among other things) of documenting the processing steps for you.
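Even without a workflow manager, a thin wrapper around each processing step can record the exact command line and when it was run. A minimal sketch (the log file name and record fields are assumptions):

    import datetime
    import json
    import subprocess

    def run_and_log(cmd, log_path="processing_log.jsonl"):
        """Run a command and append the exact command line and outcome to a log."""
        started = datetime.datetime.now().isoformat(timespec="seconds")
        result = subprocess.run(cmd, capture_output=True, text=True)
        with open(log_path, "a") as fh:
            fh.write(json.dumps({
                "started": started,
                "command": " ".join(cmd),  # the exact command line used
                "returncode": result.returncode,
            }) + "\n")
        result.check_returncode()  # fail loudly if the step failed

    # e.g. capture the tool version alongside the actual processing steps
    run_and_log(["samtools", "--version"])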
7- Don’t keep unused/unusable data.
- Archive primary data on tape as soon as possible after acquisition.
You can use the DM app for this.
- Delete data from the file system if you’re not sure you’re going to use them; you can restore them from tape using the DM app.
8- Prepare tabular data for processing and
re-use.
- Store data as tab-delimited plain text files (do not use spreadsheet software like Excel for storing/saving data).
- Always use row and column headers, placing them in the first column and the first row, respectively.
- Use rows for records (e.g. samples, genes, objects …) and columns
for variables (e.g. measurements, features …)
- Headers must be unique, unambiguous and self-explanatory, and consist only of alphanumeric characters and underscores (no spaces or hyphens).
- Each row should contain as much of the relevant data as
possible.
- Each cell should contain only one item or type of data.
- Use one file for each processing/analysis step, don’t mix different
data in the same file.
- Indicate missing data by leaving the corresponding cell empty or
using the missing data code defined in the accompanying documentation
file.
- Include relevant metadata in the accompanying documentation
file.
- Do quality control and consistency checks by inspecting a small random sample of the data and computing some summary statistics, e.g. do the rows/columns add up to the expected values? (See the sketch after this list.)
- Consider using a relational database:
- if you start repeating the same information across several files
(i.e. changing one piece of information requires updating multiple
files),
- if accessing specific subsets of the data is difficult (e.g. it
requires combining multiple files),
- if the amount of data becomes unmanageable,
- if multiple people or processes need access to the data at the same
time.
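A sketch of the quality control checks mentioned above, using pandas (an assumption; base R or any other table library works equally well) on a hypothetical tab-delimited file:

    import pandas as pd

    # Empty cells become NaN, pandas' marker for missing data.
    df = pd.read_csv("measurements.tsv", sep="\t")  # hypothetical file name

    print(df.sample(5))     # inspect a small random sample of rows
    print(df.describe())    # summary statistics for each numeric column
    print(df.isna().sum())  # number of missing values per column

    # Headers must be unique and contain only alphanumerics and underscores.
    assert df.columns.is_unique, "duplicate column headers"
    assert all(c.replace("_", "").isalnum() for c in df.columns), "bad header"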
If you don’t need or want a server, consider using SQLite: it is a serverless relational database management system that is configuration-free and portable. An SQLite database is a single file that can grow up to 281 TB or the file size limit of the file system used, whichever is smaller.
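Getting started needs nothing beyond Python's standard library; a minimal sketch (the table and column names are invented for the example):

    import sqlite3

    # An SQLite database is a single file; connect() creates it if needed.
    con = sqlite3.connect("project.db")
    con.execute("CREATE TABLE IF NOT EXISTS measurements"
                " (sample_id TEXT, gene TEXT, intensity REAL)")
    con.execute("INSERT INTO measurements VALUES (?, ?, ?)",
                ("well001", "RANBP2", 0.42))
    con.commit()

    # Specific subsets can now be queried without combining multiple files.
    for row in con.execute("SELECT * FROM measurements WHERE gene = ?",
                           ("RANBP2",)):
        print(row)
    con.close()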
9- High-throughput data.
Organize the data hierarchically by plate and well, as this is the structure most analysis software can use, e.g.:
Chromosome_condensation_project/
    primary_screen/
        images/
            plate1_replicate1/
                well001/
                    W001-P001-Z000-T0000-s1234-Cy3.tif
                    W001-P001-Z000-T0000-s1234-EGFP.tif
                well002/
            plate1_replicate2/
            plate2_replicate1/
        analysis/
            configuration_files/
            segmentation/
            feature_extraction/
                plate1_replicate1/
                    well001/
                        W001-P001-Z000-T0000-s1234-Cy3.txt
                        W001-P001-Z000-T0000-s1234-EGFP.txt
                    well002/
                plate1_replicate2/
                plate2_replicate1/
        code/
    validation_screen/
- It is imperative to define naming and other conventions at the start of the project (see point 3 above). Respecting these conventions is extremely important, as the volume of data usually prevents easy detection and fixing of problems.
- Automate all data writing steps. Never write or edit file names or
data by hand. If there’s an issue, fix the code causing the
problem.
- It is strongly recommended to build a database to capture
relationships between different entities (e.g. association between
reagents, images and proteins/genes) and associated metadata (e.g. what
passed which quality control step, what data is derived from what…). The
time invested doing this at the beginning will be largely repaid by the
end of the project. Without a database, long-running projects always run
into time-consuming data issues towards the end.
- All code used in the project should get its input parameters from a configuration file (see the sketch after this list).
- Code used in the project should read from, and write to, the project
database. This keeps the database up-to-date in a timely manner and
ensures collaborators are all working with the same data.
- Keep configuration files and code under version control to keep
track of the different iterations of the data analysis.
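As an illustration of reading parameters from a configuration file, the sketch below uses Python's standard configparser; the file, section and key names are assumptions:

    import configparser

    # analysis.ini, kept under version control next to the code:
    #
    #   [segmentation]
    #   threshold = 0.5
    #   min_object_size = 20

    config = configparser.ConfigParser()
    config.read("analysis.ini")

    threshold = config.getfloat("segmentation", "threshold")
    min_size = config.getint("segmentation", "min_object_size")
    print(f"segmenting with threshold={threshold}, min_object_size={min_size}")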
10- Ensure file integrity.
Errors can happen when copying or moving files, so copies should be checked against the originals. This is done by comparing checksums of the two files. A simple way of doing this is to copy files using rsync and then repeat the rsync command with the -n (dry-run) and -c (checksum) options to compare the two copies. Avoid copying files by drag-and-drop.
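Where rsync is not available, or copies were made by other means, checksums can also be compared directly; a minimal sketch using Python's standard hashlib (the paths are hypothetical):

    import hashlib

    def sha256sum(path):
        """Compute the SHA-256 checksum of a file, reading it in chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    original = "data/raw/plate1_replicate1.tif"  # hypothetical paths
    copy = "/archive/raw/plate1_replicate1.tif"
    assert sha256sum(original) == sha256sum(copy), "checksum mismatch"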