Data management guide
This version of the data management guide was updated 11 May
2022.
This guide is intended to inform parts of your project data
management plan. You can use it to answer some of the questions in the
data management
checklist.
1- Keep track of all processing steps and data used to create
a figure.
There should be an uninterrupted and unambiguous chain from reagents
to raw data to each figure of a published paper. Anybody should be able
to follow this chain, not just the people who’ve been involved in the
work. Disambiguate reagents by always associating them with a database
identifier (either from an internal database or from a supplier’s
catalog).
2- Create a project folder with sub-folders for experiments
and data sets.
- Organize files consistently by source or experiment/analysis
type.
- Separate primary data, analysis results and code.
- Create a sub-directory for each data set.
- Place the data management plan in the project folder.
- Keep a plain text documentation file (e.g. README.txt) and specific
metadata files in each sub-folder.
Check the project organisation page for some suggestions on how to organize and manage a project folder.
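For instance, a minimal layout following these points might look like this (the names are illustrative, not prescribed):

    my_project/
        data_management_plan.txt
        dataset_01/
            README.txt
            metadata.yaml
            raw/
            analysis/
        dataset_02/
        code/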
3- Name files and folders in a consistent way.
- Define conventions at the beginning of the project and write them
down in the data management plan so that you can refer to them later and
stick to them.
- When working with collaborators, share the data management plan and make sure everyone follows the conventions.
- Each name should be unique, indicative of content and
machine-readable.
- Essential rules for naming files and folders are:
- hard-code essential information in file names in a consistent way, i.e. always in the same order and in a standardized form, e.g. use the ISO format for dates (yyyy-mm-dd) and official gene symbols (e.g. RANBP2, not NUP358).
- use only ASCII alphanumeric characters, i.e. 0-9, a-z and A-Z, and separate meaningful fields/information using underscores. Do not use spaces or hyphens.
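As an illustration, a small helper can enforce such a convention programmatically; the fields and their order below are assumptions for the example, not a prescribed scheme:

    import datetime
    import re

    def build_name(date: datetime.date, gene: str, assay: str, ext: str) -> str:
        """Build a file name of the form <yyyy-mm-dd>_<GENE>_<assay>.<ext>."""
        # The ISO date is the one field that legitimately contains hyphens;
        # every other field must be strictly ASCII alphanumeric.
        for field in (gene, assay, ext):
            if not re.fullmatch(r"[0-9A-Za-z]+", field):
                raise ValueError(f"invalid characters in field: {field!r}")
        return f"{date.isoformat()}_{gene}_{assay}.{ext}"

    # e.g. '2022-05-11_RANBP2_imaging.tif'
    print(build_name(datetime.date(2022, 5, 11), "RANBP2", "imaging", "tif"))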
4- Document data.
The data documentation file should describe:
- how the data was created,
- the naming conventions (or refer to the data management plan),
- the data type, format and structure (i.e. dimensions of the data and
what each dimension represents),
- measurement units,
- coding schemes used, e.g. the code used for missing data.
Template files are available for images
and arrays
(i.e. tabular and higher-dimensional data). The templates are in the YAML format, which is both human- and computer-readable and writable. For a quick introduction, see the Wikipedia article. Once the metadata have been entered, verify that the file is still valid YAML using a YAML validator (e.g. online).
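If you prefer to check files locally rather than with an online validator, a few lines of Python with the PyYAML package (an assumption; any YAML parser works) do the same job:

    import sys
    import yaml  # PyYAML: pip install pyyaml

    def validate_yaml(path):
        """Try to parse the file and report the parser's error if it fails."""
        with open(path) as fh:
            try:
                yaml.safe_load(fh)
                print(f"{path}: valid YAML")
            except yaml.YAMLError as err:
                sys.exit(f"{path}: invalid YAML\n{err}")

    validate_yaml("image_metadata.yaml")  # hypothetical metadata file name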
5- Register your project and its data in a data management
system.
At EMBL, use either the Data Management
app or STOCKS.
STOCKS is a more comprehensive system with electronic lab notebook and LIMS functionalities, allowing you to link data sets to samples, reagents and protocols.
The DM app is a lightweight data registration system with minimal
support for metadata.
6- Document data processing steps.
- Record software and parameter choices, including code/software
versions (use GitLab for your own code).
- Record the exact command line used.
- Record pre/post-processing steps.
- Use a code notebook (e.g. RStudio, Jupyter …).
If you can, use a workflow management system (e.g. snakemake, nextflow, Galaxy on the EMBL or a public server, targets …); it will take care (among other things) of documenting the processing steps for you.
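Even without a workflow manager, a thin wrapper around each processing step can record the exact command line and when it was run. A minimal sketch (the log file name and record fields are assumptions):

    import datetime
    import json
    import subprocess

    def run_and_log(cmd, log_path="processing_log.jsonl"):
        """Run a command and append the exact command line and outcome to a log."""
        started = datetime.datetime.now().isoformat(timespec="seconds")
        result = subprocess.run(cmd, capture_output=True, text=True)
        with open(log_path, "a") as fh:
            fh.write(json.dumps({
                "started": started,
                "command": " ".join(cmd),  # the exact command line used
                "returncode": result.returncode,
            }) + "\n")
        result.check_returncode()  # fail loudly if the step failed

    # e.g. capture the tool version alongside the actual processing steps
    run_and_log(["samtools", "--version"])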
7- Don’t keep unused/unusable data.
- Archive primary data on tape as soon as possible after acquisition.
You can use the DM app for this.
- Delete data from the file system if you’re not sure you’re going to use them; you can restore them from tape using the DM app.
8- Prepare tabular data for processing and
re-use.
- Store data as tab-delimited plain text files (do not use spreadsheet software like Excel for storing/saving data).
- Always use row and column headers, placing them in the first column and the first row, respectively.
- Use rows for records (e.g. samples, genes, objects …) and columns
for variables (e.g. measurements, features …)
- Headers must be unique, unambiguous and self-explanatory, and consist only of alphanumeric characters and underscores (no spaces or hyphens).
- Each row should contain as much of the relevant data as
possible.
- Each cell should contain only one item or type of data.
- Use one file for each processing/analysis step, don’t mix different
data in the same file.
- Indicate missing data by leaving the corresponding cell empty or
using the missing data code defined in the accompanying documentation
file.
- Include relevant metadata in the accompanying documentation
file.
- Do quality control and consistency checks by inspecting a small random sample of the data and computing some summary statistics, e.g. do the rows/columns add up to the expected values? (See the sketch after this list.)
- Consider using a relational database:
- if you start repeating the same information across several files
(i.e. changing one piece of information requires updating multiple
files),
- if accessing specific subsets of the data is difficult (e.g. it
requires combining multiple files),
- if the amount of data becomes unmanageable,
- if multiple people or processes need access to the data at the same
time.
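A sketch of the quality control checks mentioned above, using pandas (an assumption; base R or any other table library works equally well) on a hypothetical tab-delimited file:

    import pandas as pd

    # Empty cells become NaN, pandas' marker for missing data.
    df = pd.read_csv("measurements.tsv", sep="\t")  # hypothetical file name

    print(df.sample(5))     # inspect a small random sample of rows
    print(df.describe())    # summary statistics for each numeric column
    print(df.isna().sum())  # number of missing values per column

    # Headers must be unique and contain only alphanumerics and underscores.
    assert df.columns.is_unique, "duplicate column headers"
    assert all(c.replace("_", "").isalnum() for c in df.columns), "bad header"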
If you don’t need or want a server, consider using SQLite: it is a serverless relational database management system that is configuration-free and portable. An SQLite database is a single file that can grow up to 281 TB or the file size limit of the file system used, whichever is smaller.
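Getting started needs nothing beyond Python's standard library; a minimal sketch (the table and column names are invented for the example):

    import sqlite3

    # An SQLite database is a single file; connect() creates it if needed.
    con = sqlite3.connect("project.db")
    con.execute("CREATE TABLE IF NOT EXISTS measurements"
                " (sample_id TEXT, gene TEXT, intensity REAL)")
    con.execute("INSERT INTO measurements VALUES (?, ?, ?)",
                ("well001", "RANBP2", 0.42))
    con.commit()

    # Specific subsets can now be queried without combining multiple files.
    for row in con.execute("SELECT * FROM measurements WHERE gene = ?",
                           ("RANBP2",)):
        print(row)
    con.close()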
9- High-throughput data.
Organize the data hierarchically by plate and well, as this is the structure most analysis software can use, e.g.:
Chromosome_condensation_project/
    primary_screen/
        images/
            plate1_replicate1/
                well001/
                    W001-P001-Z000-T0000-s1234-Cy3.tif
                    W001-P001-Z000-T0000-s1234-EGFP.tif
                well002/
            plate1_replicate2/
            plate2_replicate1/
        analysis/
            configuration_files/
            segmentation/
            feature_extraction/
                plate1_replicate1/
                    well001/
                        W001-P001-Z000-T0000-s1234-Cy3.txt
                        W001-P001-Z000-T0000-s1234-EGFP.txt
                    well002/
                plate1_replicate2/
                plate2_replicate1/
        code/
    validation_screen/
- It is imperative to define naming and other conventions at the start of the project (see point 3 above). Respecting these conventions is extremely important, as the volume of data usually prevents easy detection and fixing of problems.
- Automate all data writing steps. Never write or edit file names or
data by hand. If there’s an issue, fix the code causing the
problem.
- It is strongly recommended to build a database to capture
relationships between different entities (e.g. association between
reagents, images and proteins/genes) and associated metadata (e.g. what
passed which quality control step, what data is derived from what…). The
time invested doing this at the beginning will be largely repaid by the
end of the project. Without a database, long-running projects always run
into time-consuming data issues towards the end.
- All code used in the project should get its input parameters from a configuration file (see the sketch after this list).
- Code used in the project should read from, and write to, the project
database. This keeps the database up-to-date in a timely manner and
ensures collaborators are all working with the same data.
- Keep configuration files and code under version control to keep
track of the different iterations of the data analysis.
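As an illustration of reading parameters from a configuration file, the sketch below uses Python's standard configparser; the file, section and key names are assumptions:

    import configparser

    # analysis.ini, kept under version control next to the code:
    #
    #   [segmentation]
    #   threshold = 0.5
    #   min_object_size = 20

    config = configparser.ConfigParser()
    config.read("analysis.ini")

    threshold = config.getfloat("segmentation", "threshold")
    min_size = config.getint("segmentation", "min_object_size")
    print(f"segmenting with threshold={threshold}, min_object_size={min_size}")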
10- Ensure file integrity.
Errors can happen when copying or moving files, so copies should be checked against the originals. This is done by comparing checksums of the two files. A simple way of doing this is to copy files using rsync and then repeat the rsync command with the -n (dry-run) and -c (checksum) options to compare the two copies. Avoid copying files by drag-and-drop.
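Where rsync is not available, or copies were made by other means, checksums can also be compared directly; a minimal sketch using Python's standard hashlib (the paths are hypothetical):

    import hashlib

    def sha256sum(path):
        """Compute the SHA-256 checksum of a file, reading it in chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    original = "data/raw/plate1_replicate1.tif"  # hypothetical paths
    copy = "/archive/raw/plate1_replicate1.tif"
    assert sha256sum(original) == sha256sum(copy), "checksum mismatch"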