Skip to Main Content

Cleaning Data with OpenRefine: First Steps

A guide to using the OpenRefine program to organize messy datasets

Understanding Data Cleaning

Data Cleaning is one of the initial steps in analysis. It is critically important for fields with large amounts of data -- such as business, healthcare, or research -- to get the data into a usable state before further analysis and conclusions are drawn. Messy or faulty data can lead to costly or damaging mistakes.

 

Conceptualize

Identify the granularity of your dataset - Evaluate the types of values in each column. Are you working with text, numbers, or prices?

Identify critical and non-critical columns - Determine if there are irrelevant columns that can be removed based on the scope of what you are trying to achieve.

Understand definitions - If you are working with a dataset provided from somewhere else, make sure you understand the column headers and terminology. Look for how different columns relate to each other. As you continue cleaning and working with the data, you will begin to understand more.

 

Delineate Solvable and Unsolvable Issues

Solvable issues include formatting, consistency, and duplicate data. A quick eyeball overview can bring some of these issues to your attention, but OpenRefine also has functions to help you discover these errors.

 

Unsolvable issues that OpenRefine cannot help with would be:

  • Missing/blank cells
  • Illogical dates that don't match
  • Symbols/text that don't relate to each other

 

Depending on the prevalence or severity of these issues, they can make cleaning arduous or even impossible.

If a column contains over 70% of missing/nonsense data, it is likely unusable. You should find a different source or exclude this column from your final dataset. Context matters, so make sure to document the percentage of data that you are excluding from your analysis.