Data & project organization

Directory, folder, and file naming

One of the first things to consider is how you want to organize your data. There are a number of questions to work through:

  • Are there file naming conventions for your discipline?
  • What directory structure and file naming conventions will you use?
  • How will you handle version control? Record every change to a file, no matter how small
    • Consider version control software, if applicable
    • Discard obsolete versions, but never the raw copy

Keep the following best practices in mind:

  • Be consistent with how you name directories, folders, and files
    • Always include the same information
    • Retain the order of information
  • Be descriptive so others can understand what file names mean
  • Keep track of versions (and be consistent!)
  • Use application-specific codes for file extensions, such as .mov, .tif, .wrl
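
One way to apply these practices is to generate file names from a single function, so that every name carries the same information in the same order. The following is a minimal sketch, not a prescribed standard: the field order (project, description, date, version), the ISO 8601 date, the zero-padded version number, and the function name build_filename are all assumptions chosen for illustration.

```python
import re
from datetime import date


def build_filename(project: str, description: str, when: date,
                   version: int, extension: str) -> str:
    """Build a file name with a fixed set of fields in a fixed order.

    The convention used here (project_description_date_vNN.ext) is an
    illustrative assumption; adapt it to your discipline's conventions.
    """
    def slug(text: str) -> str:
        # Lowercase and replace runs of other characters with a hyphen
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    ext = extension.lower().lstrip(".")
    return f"{slug(project)}_{slug(description)}_{when.isoformat()}_v{version:02d}.{ext}"


if __name__ == "__main__":
    print(build_filename("Soil Survey", "Site A moisture", date(2024, 3, 15), 2, ".CSV"))
```

Because the date is written year-first and the version is zero-padded, files named this way also sort chronologically in a file browser.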

It is important to track changes in your data files, especially if more than one person is involved in the research.

The following are some free file renaming applications if you need to revise your naming system (endorsement not implied):

Suggested folder structures 

It is generally recommended to keep data, code, and outputs separate to avoid confusion. Although the exact structure can vary depending on the project, here are two approaches to organizing files.

Basic projects
One or two data processing scripts, simple pipeline with few steps

.  
├── InputData     Folder   Containing data that will be processed
├── OutputData    Folder   Containing data that has been processed
├── Figures       Folder   Containing figures or tables summarizing the results
├── Code          Folder   The scripts or programs to do the analysis
├── LICENSE       File     Explaining the terms under which data/code is being made available
└── README.txt    File     Documenting the analysis, and (ideally) the purpose of each file
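
The basic layout above can be created by hand, but a short script keeps it identical across projects. This is a minimal sketch using only the Python standard library; the function name create_basic_project and the project name in the usage example are hypothetical, and the LICENSE and README.txt files are created empty for you to fill in.

```python
from pathlib import Path

# Folder and file names taken from the basic layout described above
BASIC_FOLDERS = ["InputData", "OutputData", "Figures", "Code"]
BASIC_FILES = ["LICENSE", "README.txt"]


def create_basic_project(root) -> Path:
    """Create the basic project skeleton under root and return its path.

    Existing folders and files are left untouched, so the function is
    safe to run more than once.
    """
    root = Path(root)
    for folder in BASIC_FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    for name in BASIC_FILES:
        (root / name).touch(exist_ok=True)
    return root


if __name__ == "__main__":
    # "my_project" is a placeholder name
    create_basic_project("my_project")
```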

Advanced projects
Many kinds of input data, documentation, and code.

This structure can be generated and auto-populated with the Reproducible Science template for Cookiecutter.

.  
├── AUTHORS.md           File     List of people that contributed to the project (Markdown format)
├── LICENSE              File     Plain text file explaining the usage terms/license of the data/code (CC-BY, MIT, GNU, etc.)
├── README.md            File     Readme file (Markdown format)
├── bin                  Folder   Your compiled model code can be stored here (not tracked by git)
├── config               Folder   Configuration files, e.g., for doxygen or for your model if needed
├── data                 Folder   Data for this project
│   ├── external         Folder   Data from third party sources
│   ├── interim          Folder   Intermediate data that has been transformed
│   ├── processed        Folder   The final, canonical data sets for modeling
│   └── raw              Folder   The original, immutable data dump
├── docs                 Folder   Documentation, e.g., doxygen or scientific papers (not tracked by git)
├── notebooks            Folder   IPython or R notebooks
├── reports              Folder   Manuscript source, e.g., LaTeX, Markdown, etc., or any project reports
│   └── figures          Folder   Figures for the manuscript or reports
└── src                  Folder   Source code for this project
    ├── data             Folder   Scripts and programs to process data
    ├── external         Folder   Any external source code, e.g., other projects, or external libraries
    ├── models           Folder   Source code for your own model
    ├── tools            Folder   Any helper scripts go here
    └── visualization    Folder   Visualization scripts, e.g., matplotlib, ggplot2 related
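
The layout describes data/raw as the original, immutable data dump. One possible way to guard against accidental edits, offered here as an assumption and not something the template itself prescribes, is to remove write permission from the raw files once they are in place. The function name protect_raw_data is hypothetical. On Windows, os.chmod can only toggle the read-only flag, so the effect is coarser there.

```python
import stat
from pathlib import Path


def protect_raw_data(raw_dir) -> list:
    """Remove write permission from every file under raw_dir.

    Returns the list of files whose permissions were changed. This
    discourages accidental edits; it is not a substitute for backups.
    """
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    protected = []
    for path in sorted(Path(raw_dir).rglob("*")):
        if path.is_file():
            # Keep all existing permission bits except the write bits
            path.chmod(path.stat().st_mode & ~write_bits)
            protected.append(path)
    return protected


if __name__ == "__main__":
    # Assumes the script is run from the project root
    for changed in protect_raw_data("data/raw"):
        print(f"read-only: {changed}")
```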

File formats 

The file format largely determines whether others will be able to use your data in the future. You need to plan for software and hardware obsolescence, since technology continually changes. How will others use your data if the software used to produce it is no longer available? You may want to consider migrating your files to a format with the characteristics listed below, keeping a copy in the original format.

Formats most likely to be accessible in the future are:

  • Non-proprietary, not tied to a specific software product
  • Unencrypted
  • Uncompressed
  • Common, used by the research community
  • Standard representation, such as ASCII, Unicode
  • Open, documented standard

Examples of preferred formats:

  • PDF/A, not Word
  • Plain-text CSV, not Excel
  • MPEG-4, not QuickTime
  • XML, CSV, or RDF, not MS Access database
  • HDF, not MATLAB binary arrays/matrices
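
As an illustration of migrating data from a database into an open, plain-text format, the sketch below exports one table to a UTF-8 CSV file using only the Python standard library. SQLite is used here purely as a stand-in, because reading an MS Access file would require a third-party driver; the function name export_table_to_csv and the table and file names in the usage example are hypothetical.

```python
import csv
import sqlite3


def export_table_to_csv(connection, table: str, csv_path: str) -> int:
    """Write every row of table to csv_path and return the row count.

    The table name is inserted into the SQL text directly, so it must
    come from a trusted source (your own code, not user input).
    """
    cursor = connection.execute(f'SELECT * FROM "{table}"')
    header = [column[0] for column in cursor.description]
    rows = cursor.fetchall()
    # newline="" lets the csv module control line endings itself
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)


if __name__ == "__main__":
    # In-memory database with made-up sample values, for demonstration only
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE samples (site TEXT, moisture REAL)")
    conn.executemany("INSERT INTO samples VALUES (?, ?)", [("A", 0.31), ("B", 0.27)])
    print(export_table_to_csv(conn, "samples", "samples.csv"), "rows exported")
```

Keep the original database file alongside the exported CSV, since a flat file does not preserve relationships between tables or column types.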

Project organization and management 

In addition to applying file and folder organization best practices, an overall project strategy should consider other aspects that contribute to successful projects, publications, and hand-offs. A solid strategy also helps avoid errors due to mix-ups and enhances research reproducibility. The tools mentioned in the table are for informational purposes and endorsement is not implied.

Project aspect | Example tools | Notes
Overall organization & collaboration
  • OSF
  • Confluence
  • SharePoint
  • Slack, Teams
  • GitHub, Google Drive, Box, etc.

OSF can connect many services together. Examples of how the OSF can be used to organize a lab:

Traditional project management
  • Asana
  • Trello
  • Basecamp, Freedcamp
  • Teamwork.com
  • Notion.so
Calendars, to-do lists, Kanban boards, Gantt charts, etc. Open source offerings exist; however, many of the commonly used tools are paid or freemium.
Tracking bugs, issues, time
Some tools, like Taiga, are geared towards Agile development but can be used for any kind of project.
Standard Operating Procedures (SOPs)
  • Standard word processors or spreadsheets
  • OSF
  • Protocols.io

Establishing SOPs for projects or research groups is an important step to maintain organization and ensure clean hand-offs, research reproducibility, and project archiving. Aspects include:

  • Establishing file naming conventions
  • Styles and practices for software development (e.g., all functions must be documented, files must be checked into version control, etc.)
  • Standardizing workflows for data collection and processing
  • Establishing and enforcing backups
  • Establishing roles (e.g., who will be responsible for what)

Ideally, SOPs form part of the implementation of a data management plan. See an example of SOPs from a UA researcher on the OSF.

Experiment tracking, organization

Electronic Lab Notebooks (ELNs)

  • RSpace
  • LabArchives
  • OSF
ELNs can be a useful tool to manage projects and labs. There are many ELNs on the market, ranging from open source to cloud-hosted. The Harvard Medical School maintains a comprehensive list comparing more than 50 features across 27 ELNs. The OSF can also be used as an ELN. See this template from Johns Hopkins University.