Why share data and code
- Required by some journal publishers (such as Nature) and funding agencies (the National Science Foundation, the National Institutes of Health, etc.). Other funding agencies are expected to follow suit and require researchers to share data produced during the course of their research projects (see the OMB Policy Memorandum on sharing data).
- Data can be reused to answer new questions, opening up new interpretations and discoveries.
- Sharing data may lead to sharing research processes, workflows, and tools, which enhances the potential for replicating research results.
- Makes your articles and papers more useful and citable by others, increasing their value!
How and where to share your data and code
Posting data on a web page is useful for increasing visibility and for presentation, but it is not recommended as the main strategy for data sharing. Instead, deposit into a trusted repository and refer to, showcase, or cite the deposit in any web pages or other media.
- Recommended: deposit into a recognized data repository, such as the University of Arizona's Research Data Repository (ReDATA) or another appropriate disciplinary data repository.
- Submit data/code along with your article as supplementary material, if the journal allows for it.
Posting code on GitHub is an accepted way of sharing code. However, to enhance code citability and to ensure that the exact version of the code is preserved alongside the data it is associated with, depositing the code into a data repository is recommended. Many data repositories, including ReDATA, support GitHub integration.
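As an illustration, a tagged release of a repository can be captured as a single archive file suitable for deposit alongside a dataset. Below is a minimal sketch in Python; the owner, repository name, and tag are hypothetical placeholders for your own project:

```python
import urllib.request

# Hypothetical repository coordinates; substitute your own.
OWNER = "example-lab"
REPO = "analysis-code"
TAG = "v1.0.0"

# GitHub serves a source snapshot of any tag at this URL pattern.
url = f"https://github.com/{OWNER}/{REPO}/archive/refs/tags/{TAG}.tar.gz"

# Save the snapshot locally so it can be deposited with the dataset.
archive_name = f"{REPO}-{TAG}.tar.gz"
urllib.request.urlretrieve(url, archive_name)
print(f"Saved {archive_name} for deposit with the associated data.")
```

Repository integrations (such as ReDATA's GitHub integration) can automate this capture step; the sketch simply shows what is being preserved: a fixed, citable snapshot of the code.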
Preparing your data for sharing
When sharing your data and code:
- Bundle your data together in a systematic way by following best practices for organizing files and structuring code so others can easily understand and use the data and/or code (see the sketch after this list). See Data organization and Software/Code best practices.
- Include enough information so that others can understand and reuse the dataset. See Data documentation & metadata.
- Follow best practices to include enough information in readme files or elsewhere to make it possible to cite the dataset. See Citing data & code.
- Follow best practices to ensure the confidentiality of any human participants.
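For instance, a self-contained bundle might separate data, code, and documentation at the top level, with a readme at the root. Below is a minimal sketch in Python; the folder names and readme contents are illustrative conventions, not requirements of ReDATA or any particular repository:

```python
from pathlib import Path

# Illustrative top-level layout for a deposit bundle.
bundle = Path("my_deposit")
for folder in ["data/raw", "data/processed", "code", "docs"]:
    (bundle / folder).mkdir(parents=True, exist_ok=True)

# A readme at the root tells others what the bundle contains,
# how the files relate to one another, and how to cite the dataset.
(bundle / "README.md").write_text(
    "# Project title\n\n"
    "## Contents\n"
    "- data/raw: original measurements\n"
    "- data/processed: cleaned files used in the analysis\n"
    "- code: scripts that turn raw data into results\n"
    "- docs: protocols, codebooks, and other documentation\n\n"
    "## How to cite\n"
    "Author(s), Year, Title, Repository, DOI.\n"
)
```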
Since practices for data preparation vary depending on the data's characteristics, the Data Curation Network has prepared primers with recommended practices for preparing a wide variety of file formats for sharing.
Archiving
[Archiving is] activity that ensures that data are properly selected, stored, and can be accessed, and for which logical and physical integrity are maintained over time, including security and authenticity.
Data (and code) archiving involves ensuring that the data is clean, documented, organized, and as self-contained as possible. Data can be archived in various ways:
- On local hard disk drives
- On long-term tape drives
- In the cloud
- In dedicated data repositories
A rule of thumb when archiving a project: document and organize materials such that if you were to give the archive bundle to a colleague, they would be able to understand it without asking you questions.
Placing data in a dedicated repository is the preferred method of archiving publicly releasable datasets and code (e.g., data associated with published articles), since it allows for data reuse and citation and enables research reproducibility. See the Data repositories page for how to get started finding one.
Although the risk of data loss is low in dedicated data repositories (namely, those with explicit commitments regarding data storage and how long data will remain available), it is always advisable, where possible, to retain an offline copy of the data. This is especially true for general-purpose cloud storage, which, unlike a dedicated data repository, allows unintended modification and deletion of files.
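One way to detect unintended changes is to record a checksum manifest when the archive bundle is created and re-verify each copy against it later. Below is a minimal sketch in Python; the bundle directory name is hypothetical:

```python
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Record a manifest of every file in the bundle at archiving time.
bundle = Path("my_deposit")  # hypothetical bundle directory
manifest = {
    str(p.relative_to(bundle)): sha256(p)
    for p in sorted(bundle.rglob("*"))
    if p.is_file()
}

# Later, recompute checksums on any copy (offline or cloud) and
# compare against the manifest to catch modified or missing files.
for name, expected in manifest.items():
    actual = sha256(bundle / name)
    assert actual == expected, f"{name} has changed since archiving"
```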
Data retention
Various guidelines govern data retention requirements, depending on the kind of data and its use (e.g., financial information, data used for patent applications, etc.). Unless otherwise specified, the following guidance applies to research data in general:
Each investigator should treat data properly to ensure authenticity, reproducibility, and validity and to meet the requirements of relevant grants and other agreements concerning the retention of data. Primary data should be retained for ten (10) years, if there is no requirement in the Award Document, to ensure that any questions raised in published results can be answered.
See the UA Research, Innovation, & Impact page for more information on research data retention. See Records & Archives for help with official U of A records retention and destruction policies and procedures.
Where to archive your large datasets at U of A
Refer to the options listed in Storage, backups, & security. For larger datasets, there are four recommended no-cost options: OneDrive, Tier 2 AWS storage, R-DAS, and ReDATA. Depending on the reason for archiving, specific needs, and affiliation (student, faculty, or staff), one may be a better fit than another. Tier 2 and R-DAS storage are limited to those with PI accounts on the HPC (those PIs can sponsor others to join their group).
| | ReDATA | OneDrive | Tier 2 AWS | R-DAS |
|---|---|---|---|---|
| Main purpose | Data publication to meet funder/journal data sharing requirements | General purpose cloud storage | Large storage space for data not undergoing analysis (cloud storage in Amazon S3) | General purpose storage accessible from your personal machine via a network share. |
| Usage examples | Public archiving, getting a DOI, preserving curated and final data for journal articles, dissertations, etc. | Collaboration space, private archiving, project backups | Private archiving, project backups, transferring data to HPC for analysis | Private archiving, project backups |
| Eligibility | All individuals with a valid NetID and library privileges | All individuals with a valid NetID | Only faculty may request an account (can sponsor others to join group) | Only faculty may request an account (can sponsor others to join group) |
| Stewardship | Libraries assume full responsibility for long-term data availability. Meets journal and funder data sharing requirements. Curated by ReDATA staff to prepare data for sharing. | None. Users manage their own data. Versioning available but otherwise not backed up. | Data is automatically moved to Amazon Glacier after a period of time (at no cost), but users must otherwise manage their own data. | None. Users manage their own data. Not backed up. |
| Storage quota | 2 GB standard for faculty & staff. Additional storage available on request. | 1 TB (DCC and emeritus 100 GB) | Free up to 1 TB. Overage subject to standard AWS S3 billing. | 5 TB per faculty group |