Project organisation

Introduction

This page presents one of several ways to organise a computational project.

The organisation of a project should make sharing and collaborative work easy and prepare the project for eventual public release. This document suggests a standard way to organise data and code for your project and can be used as part of your data management plan.

To keep track of all the changes as the project evolves, the project’s content is managed in a git repository hosted on EMBL’s GitLab instance. Using GitLab also gives the opportunity to use additional tools for project management.

You can create your own project structure from the available template with:

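# Copy the template into a new directory named after your project
git clone https://git.embl.de/heriche/project-template.git ProjectName
cd ProjectName
# Point the repository at its new location on GitLab, then push to it
git remote set-url origin git@git.embl.de:<namespace>/ProjectName.git
git push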

Replace ProjectName above with the name of your project and <namespace> with the group or user name under which you want the project to live. Project names should use only alphanumeric characters, _ (underscore) and - (hyphen); avoid spaces and other special characters.
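The naming rule above can be checked before creating the repository. The following is a minimal sketch: the function name is_valid_project_name is hypothetical, and it only encodes the rule stated on this page (letters, digits, underscore and hyphen), not any additional restrictions GitLab itself may apply.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) if the name contains only
# letters, digits, underscores and hyphens; fails (status 1) otherwise.
is_valid_project_name() {
    case "$1" in
        "" | *[!A-Za-z0-9_-]*) return 1 ;;   # empty, or contains a forbidden character
        *) return 0 ;;
    esac
}

# Try the check on a few example names (all made up for illustration)
for name in "ProjectName" "my-project_2" "my project" "proj@embl"; do
    if is_valid_project_name "$name"; then
        echo "ok:       $name"
    else
        echo "rejected: $name"
    fi
done
```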

Project structure

The suggested repository structure is as follows:

  • .gitignore: This file contains a list of glob patterns matching names of files and directories that shouldn’t be included in the repository, such as temporary files or files that can contain secrets. The template already contains some common patterns, but make sure to customise it for your project.
  • data/: This directory is where your code should be looking for its input data. It can be used in various ways:
      • To store a small amount of data (less than 100 MB at EMBL), in particular test data.
      • To track larger data files with git lfs (see https://git-lfs.com/, https://gitlab.ub.uni-giessen.de/jlugitlab/git-lfs-howto and https://docs.gitlab.com/ee/topics/git/lfs/).
      • To store configuration files pointing to data stored elsewhere (e.g. an S3 bucket). Make sure not to store credentials here.
  • output/: This directory is where your code should write its output. The same rules apply as for the data directory. When some output is used as input data for another step of the analysis, you may decide to put it instead under the data directory. A good rule of thumb is to use the output directory to store end results, i.e. files that are not going to be processed further, regardless of the step in the workflow, such as plots and summary tables.
  • documentation/: This directory is where project documentation should live unless you prefer to use the wiki associated with the repository in GitLab (see https://docs.gitlab.com/ee/user/project/wiki/). This should include the project’s data management plan.
  • workflow/: This directory is where your data processing code should reside. If multiple steps are involved that use different pieces of code, use a subdirectory for each step. You can also have separate repositories for each step and link them here using git submodules (see https://git-scm.com/book/en/v2/Git-Tools-Submodules and https://gist.github.com/gitaarik/8735255).
  • README.md: This plain text file, written using markdown syntax, is the landing page of the project. As such, it should provide a description of the project and how to reproducibly run the analysis. It should be kept short with links to the documentation for details.
  • LICENSE: It is important to explicitly provide a license for the project because, without one, nobody can use, copy or distribute the content without asking the author(s) for permission. As the appropriate license varies from project to project, the template doesn’t provide this file, so make sure to add a suitable one.
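For reference, the layout described above could also be created by hand, roughly as follows. This is a minimal sketch and not the content of the EMBL template: the .gitkeep placeholder files, the ignore patterns and the README text are illustrative assumptions. It overwrites .gitignore and README.md if they already exist, so it is best run in an empty location.

```shell
#!/bin/sh
set -eu

project=ProjectName   # placeholder name, as in the clone command

# Top-level directories of the suggested structure
mkdir -p "$project/data" "$project/output" "$project/documentation" "$project/workflow"

# Git does not track empty directories; an empty placeholder file
# (named .gitkeep here by convention) keeps them in the repository.
for dir in data output documentation workflow; do
    : > "$project/$dir/.gitkeep"
done

# .gitignore holds one pattern per line (example patterns only)
cat > "$project/.gitignore" <<'EOF'
# temporary and editor backup files
*.tmp
*~
# files that may contain secrets
.env
*.pem
EOF

# Minimal landing page; replace the text with a real description
printf '# %s\n\nShort description and how to run the analysis.\n' "$project" > "$project/README.md"

# With Git LFS installed, larger data files could be tracked with, e.g.:
#   git lfs track "data/*.h5"     (the pattern is a hypothetical example)

# Show what was created
find "$project" | sort
```

A LICENSE file is deliberately left out of the sketch, since the choice of license depends on the project.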

Leveraging git and GitLab tools for project management

  • Use branches for testing different methods or workflows.
  • Keep track of versions by using tags. Use semantic versioning when possible, i.e. with tags of the form vX.Y.Z where X is incremented for changes that break compatibility with earlier versions, Y when new functionality is added in a backward-compatible way and Z for backward-compatible bug fixes. Make sure to add a meaningful message to the tag.
  • Use the issue tracker to keep track of problems and ideas and discuss them with collaborators.
  • Use an issue board (see https://docs.gitlab.com/ee/user/project/issue_board.html) to organize and visualize progress (with Kanban style cards).
  • Integrate multiple repositories with git submodules.
  • Use continuous integration and deployment to build and deploy software.
  • Use the GitLab wiki associated with the repository to document the project.
  • Create a project website using GitLab Pages (see https://bio-it.embl.de/gitlab-pages/).
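The vX.Y.Z tagging scheme mentioned in the list above can be illustrated with a small helper that computes the next tag from the current one. This is a sketch under stated assumptions: the function name next_version, the example version numbers and the tag message are all hypothetical, and the function expects a well-formed tag without pre-release suffixes.

```shell
#!/bin/sh
# Hypothetical helper: bump a tag of the form vX.Y.Z.
#   $1 = current tag (e.g. v1.4.2)
#   $2 = kind of change: major | minor | patch
# Note: plain POSIX sh has no local variables, so the names below are global.
next_version() {
    version=${1#v}            # strip the leading "v"  -> X.Y.Z
    major=${version%%.*}      # text before the first dot -> X
    rest=${version#*.}        # text after the first dot  -> Y.Z
    minor=${rest%%.*}         # Y
    patch=${rest#*.}          # Z
    case "$2" in
        major) major=$((major + 1)); minor=0; patch=0 ;;
        minor) minor=$((minor + 1)); patch=0 ;;
        patch) patch=$((patch + 1)) ;;
        *) echo "unknown change type: $2" >&2; return 1 ;;
    esac
    echo "v$major.$minor.$patch"
}

next_version v1.4.2 patch   # prints v1.4.3
next_version v1.4.2 minor   # prints v1.5.0
next_version v1.4.2 major   # prints v2.0.0

# An annotated tag carrying a message could then be created and pushed with:
#   git tag -a v1.5.0 -m "Add normalisation step"
#   git push origin v1.5.0
```

Resetting the lower-order numbers to zero on a major or minor bump follows the semantic versioning convention.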