How to share data?

Open Data
Author

Esther Plomp

Published

December 27, 2023

Where and how should you share data? See also the post "How to share data/code underlying the thesis".

Data not available upon request

Disadvantages of having to maintain data yourself

  • Grant funding stops when the project ends, even though your data will have to be stored and maintained for several years afterwards

  • You may switch institutes, careers or retire at some point

  • You’re not a data manager or ICT expert, so the data may get lost for lack of a proper backup structure

FAIR data

To share the research data in accordance with the FAIR principles, you need to select a data repository.

When selecting a repository, researchers should consider the following:

  • Where are similar datasets preserved?

  • How long will the data be kept?

  • Who manages the repository? An institution? A commercial provider?

  • What are the costs or file size requirements associated with using the repository? How will those costs be paid? Are the file size requirements sufficient for your research project?

  • What are the access and use policies for the repository? Are you able to select a license?

Where: Discipline specific repository

Ideally, your discipline has a data repository available that you can use. There are several benefits to using such a discipline-specific repository:

  • They apply relevant metadata standards and discipline-specific requirements when you upload the data

  • They facilitate aggregated datasets, as the shared datasets are structured alike

  • They provide access to the data in community-used formats

  • Data is more findable, since similar types of data are available in the same repository

  • Generally, there is some type of quality assurance during the upload process as the repository is more closely managed by experts in the field

You can use the following resources to find a discipline-specific repository:

Where: General repository

Realistically, not every discipline has such a specific data repository available. In this case, you can always make use of more generalist repositories, such as Zenodo and 4TU.ResearchData. These repositories have less strict requirements when uploading the data and accept a wide range of formats.

Figure: First page of a Zenodo fact sheet

Where: 4TU.ResearchData

As a TU Delft researcher you can always make use of 4TU.ResearchData. See the 4TU.ResearchData post for more information about how to share your data/code via 4TU.ResearchData.

How 1: Gather all data/code needed for reanalysis/validation

  • Consider all data/code needed for your data to be reusable or for someone to validate or replicate your analyses.

  • Processed data: Sharing the processed data underlying the publication is the minimum requirement from TU Delft. Including raw data can be valuable to ensure that no details are lost and the research is more reproducible.

  • External resources: link to other datasets, code repositories and publications.

  • Review files for errors such as missing data, misnamed files/variables, incorrectly formatted values, and corrupted files (you can use tools such as OpenRefine or Frictionless)
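Part of this review can be automated. The sketch below is a minimal illustration, not a replacement for tools such as OpenRefine or Frictionless: it scans CSV text for rows with missing values or an unexpected number of columns. The function name `find_csv_problems`, the list of missing-value markers, and the sample data are assumptions made for this example.

```python
import csv
import io


def find_csv_problems(text, missing_markers=("", "NA", "NaN", "null")):
    """Return a list of (line_number, message) for suspicious rows in CSV text.

    The default missing_markers are an assumption; adjust them to the
    conventions used in your own dataset.
    """
    problems = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return [(1, "file is empty")]
    for row in reader:
        line_no = reader.line_num
        # A row with too few or too many fields usually signals a formatting error.
        if len(row) != len(header):
            problems.append(
                (line_no, f"expected {len(header)} columns, found {len(row)}")
            )
            continue
        for name, value in zip(header, row):
            if value.strip() in missing_markers:
                problems.append((line_no, f"missing value in column '{name}'"))
    return problems


# Hypothetical sample data: line 3 has a missing value, line 4 an extra field.
sample = "sample_id,temperature_c\nA1,21.5\nA2,\nA3,19.0,extra\n"
for line_no, message in find_csv_problems(sample):
    print(f"line {line_no}: {message}")
```

To check a real file, read its contents first (for example with `open(path, encoding="utf-8").read()`) and pass the text to the function.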

How 2: Use standard or open file formats

Using standard or open data formats ensures longer-term usability of data.

  • Standard data formats: “widely accepted” formats such as .xls and .xlsx

  • Open data formats: formats that are free to use, such as .csv and .tab

How 3: Preferred (sustainable) file formats

The following file formats are recommended by 4TU.ResearchData.

  • Text: Plain text, XML, HTML, PDF (PDF/A-1), JSON, PDB (Protein Data Bank), XYZ (all formats should be encoded in UTF-8)

  • Spreadsheets: CSV (Comma-separated values), Tab-delimited values, PDF (PDF/A-1)

  • Images: JPEG, TIFF, PNG, SVG

  • Geospatial: GML (Geographical Mark-up Language), KML (Keyhole Mark-up Language), ESRI Shapefile, Geo-referenced TIFF

  • Numerical: NetCDF, CSV, JSON

  • Video: No sustainable format established

  • Audio: Waveform Audio File Format (WAVE)

  • Databases: Delimited Flat File w/DDL

  • Archives: ZIP, TAR, GZIP, 7Z

See also DANS file format recommendations.
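Before uploading, it can help to list the files that are not yet in one of the preferred formats. The following is a rough sketch under stated assumptions: the mapping from the formats above to file extensions (for example `.nc` for NetCDF or `.tsv` for tab-delimited values) is my own approximation and not an official 4TU.ResearchData list, and the function name `files_needing_conversion` is hypothetical.

```python
from pathlib import PurePath

# Assumed extension mapping for the formats listed above; extend as needed.
# Note: an extension check cannot tell whether a .pdf is actually PDF/A-1
# or whether a text file is encoded in UTF-8.
PREFERRED_EXTENSIONS = {
    ".txt", ".xml", ".html", ".pdf", ".json", ".pdb", ".xyz",  # text
    ".csv", ".tsv", ".tab",                                    # spreadsheets
    ".jpg", ".jpeg", ".tif", ".tiff", ".png", ".svg",          # images
    ".gml", ".kml", ".shp",                                    # geospatial
    ".nc",                                                     # numerical (NetCDF)
    ".wav",                                                    # audio
    ".zip", ".tar", ".gz", ".7z",                              # archives
}


def files_needing_conversion(filenames):
    """Return the file names whose extension is not in the preferred set."""
    flagged = []
    for name in filenames:
        suffix = PurePath(name).suffix.lower()
        if suffix not in PREFERRED_EXTENSIONS:
            flagged.append(name)
    return flagged


# Hypothetical file names for illustration.
example_files = [
    "data/measurements.csv",
    "data/results.xlsx",
    "figures/plot.PNG",
    "notes.docx",
]
print(files_needing_conversion(example_files))
```

Files without an extension (such as a plain `README`) are flagged as well, so treat the result as a list to review by hand rather than a verdict.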

How 4: Are you working with personal/confidential data?

Not all data can be publicly shared via data repositories (see the post on what data should be shared). This can be the case if you manage personal/confidential data. Personal/confidential data can contain:

  • Personal information which can allow the identification of living individuals (have you set up an HREC application?)

  • Commercially-confidential information (something you might want to patent, or data belonging to a third party)

  • Information related to national security and export control regulations.

How 5: Organise your files and provide documentation

Use a clear folder structure and file naming convention to share the data/code. You can document your data via a README file.
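A README can be drafted from a few basic facts about the dataset. The sketch below shows one possible plain-text layout; the section names, the function name `build_readme`, and all example values are assumptions for illustration, not a template prescribed by TU Delft or 4TU.ResearchData.

```python
def build_readme(title, authors, description, files, license_name):
    """Return README text with a short description of each shared file.

    `files` maps a relative file path to a one-line summary of its contents.
    """
    lines = [
        f"Title: {title}",
        f"Authors: {', '.join(authors)}",
        f"License: {license_name}",
        "",
        "Description",
        description,
        "",
        "Files",
    ]
    # Sort the paths so the listing follows the folder structure.
    for path, summary in sorted(files.items()):
        lines.append(f"- {path}: {summary}")
    return "\n".join(lines) + "\n"


# Hypothetical dataset details for illustration.
readme_text = build_readme(
    title="Data underlying the thesis chapter on soil moisture",
    authors=["A. Researcher"],
    description="Processed measurements and analysis code for chapter 3.",
    files={
        "data/processed/moisture.csv": "hourly soil moisture per sensor",
        "code/analysis.py": "script that reproduces the figures",
    },
    license_name="CC BY 4.0",
)
print(readme_text)
```

Writing the result to a `README.txt` file at the top of the folder keeps the documentation next to the data it describes.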

TU Delft requirements

The TU Delft Research Data Framework Policy requires all PhD students who started on or after 1 January 2019 to deposit the research data (and code) supporting their theses before they can graduate.

For most research done at TU Delft it is possible to make the data and code underpinning research findings available in a data repository. Typically, this is done at the same time as publishing the related papers, theses or reports. TU Delft has a dedicated data repository, 4TU.ResearchData, where all TU Delft researchers can deposit up to 1TB of data per year (per researcher) free of charge.
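To see whether a dataset fits within the 1 TB yearly allowance, you can total the size of the folder you plan to deposit. This is a small sketch with assumptions: it treats 1 TB as 10^12 bytes (the repository may count differently), and the function names are hypothetical.

```python
import os

ONE_TB_BYTES = 10**12  # assumption: decimal terabyte


def total_size_bytes(root):
    """Sum the sizes of all regular files below `root`, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def fits_yearly_allowance(size_bytes, already_deposited_bytes=0,
                          limit_bytes=ONE_TB_BYTES):
    """Return True if this deposit plus earlier deposits stays within the limit."""
    return size_bytes + already_deposited_bytes <= limit_bytes


# Hypothetical numbers: a 250 GB dataset after 800 GB deposited earlier this year.
print(fits_yearly_allowance(250 * 10**9, already_deposited_bytes=800 * 10**9))
```

For a real check, pass the result of `total_size_bytes("path/to/your/dataset")` as the first argument.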

Linking research outputs

Whether you use a discipline-specific repository or a generalist one, it is important to link all your outputs so that they are clearly identifiable and findable. You can do this by providing links or persistent identifiers in the metadata fields, in the README file, or in the data availability statement of the corresponding article (also cite your research outputs in the reference list!). See The Turing Way for more information about how to link your research outputs.

Example:

Data availability: The training and test datasets are available on Zenodo [65]. Source data are provided with this paper.

Code availability: The Python codes for training a model, using a pre-trained model, and for transfer learning on a pre-trained model are available on GitHub [66] at: https://github.com/kibb/LINA/.

Data and code availability

Examples

More information