Cleaning Data with OpenRefine: First Steps
Understanding Data Cleaning
Data Cleaning is one of the initial steps in analysis. It is critically important for fields with large amounts of data -- such as business, healthcare, or research -- to get the data into a usable state before further analysis and conclusions are drawn. Messy or faulty data can lead to costly or damaging mistakes.
Conceptualize
Identify the granularity of your dataset - Evaluate the types of values in each column. Are you working with text, numbers, or prices?
Identify critical and non-critical columns - Determine if there are irrelevant columns that can be removed based on the scope of what you are trying to achieve.
Understand definitions - If you are working with a dataset provided from somewhere else, make sure you understand the column headers and terminology. Look for how different columns relate to each other. As you continue cleaning and working with the data, you will begin to understand more.
Delineate Solvable and Unsolvable Issues
Solvable issues include formatting, consistency, and duplicate data. A quick eyeball overview can bring some of these issues to your attention, but OpenRefine also has functions to help you discover these errors.
Unsolvable issues that OpenRefine cannot help with would be:
- Missing/blank cells
- Illogical dates that don't match
- Symbols/text that don't relate to each other
Depending on the prevalence or severity of these issues, they can make cleaning arduous or even impossible.
If a column contains over 70% of missing/nonsense data, it is likely unusable. You should find a different source or exclude this column from your final dataset. Context matters, so make sure to document the percentage of data that you are excluding from your analysis.