Spreadsheets

Research Data Management
Author

Esther Plomp

Published

November 4, 2022

Spreadsheets are great for certain purposes, but there are some things you need to keep in mind!

Spreadsheets are great for:

  • Data entry

  • Tables for publications

  • Generate summary statistics

  • Figures

Spreadsheets can be not so great:

  • Manual data entry is liable to mistakes (copy pasting wrong cells, accidental deletions)

  • It is not always machine readable. Software cannot process information that relies on context, notes in the margin, spatial layout or field formatting (bold/italics/colours)

  • Excel tries to convert everything into a date. Even human genes. (Watch a short talk on ‘Spreadsheet Problems In Genomic Research’ for more fun!)

Anakin/Padme meme: Padme is asking Anakin (which has the excel logo on his face): 'so if I type 12.5 into a cell you won't treat it as a date, right?'

Set up a spreadsheet

  • Be consistent

  • Write dates like YYYYMMDD (consider to split info)

  • Do not leave empty values (use NA)

    • It is difficult for empty cells to know if a value is missing, and blanks can be confusing when spaces/tabs are used as delimiters

    • Whatever null value you use, use it consistently and indicate it in the README

  • Put as few information possible in a single cell and one observation per row

  • Create a data dictionary that describes the spreadsheet and any cleaning steps you took

  • Leave the raw data alone!

  • Do not use formatting (colours, font, bolding)

  • Use data validation to avoid errors (OpenRefine)

  • Avoid empty cells (not clear whether data was not collected or whether nothing was measured - use undefined, NA, 0, or null instead).

See the extended version, based on Broman and Woo 2018 & Carpentries Spreadsheets for Ecologists or Social Scientists).

Tidy Data

Tidy data is a standard way of mapping the meaning of a dataset to its structure, according to Hadley Wickham. In tidy data, each variable forms a colum, each observation forms a row, each cell is a single measurement.

Data validation

  • Excel support page on data validation
  • Check manually whether your data is consistent, complete and correct:
    • If a column should contain only numeric values or characters, check that there are no non-numeric values or non-character
    • Check for consistency in names, unit of measurements, data type and so on
    • Check if there are any empty cells and replace them with your chosen null value (see above)

Accessibility (Data Curation Primer)

Comma- or Tab-Separated Value (CSV/TSV) formats are not only best for preservation, but for accessibility as well. For more information:

Tools to work with tabular/spreadsheet data

More information