Cleaning Data with OpenRefine: Home
What is OpenRefine?
OpenRefine is a powerful, free, open-source desktop tool for cleaning messy data. Think of it as a simple, yet powerful cousin to Microsoft Excel -- ideal for correcting inconsistencies, normalizing entries, and preparing large and unwieldy datasets for analysis.
This guide was created using OpenRefine version 3.9.3.
What is Messy Data?
Messy data is data (usually in the form of a spreadsheet) that is incomplete, inconsistent, or contains typos and mistakes. If you use and transfer spreadsheets in your work, you may have had the misfortune of downloading a spreadsheet that contains some of these errors:
- Irrelevant data - columns or rows containing data that is not useful or pertinent
- Mistakes - typos or incorrect values, which can lead to outliers or inaccurate analysis
- Inconsistent formatting - dates in various formats (January 1, 2001 vs. 1/1/01)
- Missing values - blank or null cells
Messy data can undermine the integrity of your records, make your methods difficult to replicate, obscure trends, and make your data non-compliant with regulatory requirements and standards.
What Does it Mean to "Clean" Data?
Data Cleaning is the process of replacing, modifying, organizing, and correcting corrupted or inaccurate records in a dataset or spreadsheet. This process ensures that records are consistent and prepared for further analysis such as data visualization, inclusion in research, or general record-keeping. Every organization and individual benefits from clean and reliable data.
What Can You Do with OpenRefine?
OpenRefine is used to explore, clean, and link large sets of data. It can help you:
- Get an overview of a large dataset
- Split data into more granular categories
- Fix inconsistent categories or formats in a spreadsheet
- Remove duplicate records from survey results
- Clean author names from an exported citation list
- Reconcile with external sources
- Export clean data to spreadsheets or databases
Why Not Just Use Excel?
Depending on the size of the sheet and number of rows, the time to manually search and correct errors would be costly and tedious. To organize the data in Excel would require various formulas and possibly VBA code. OpenRefine offers deeper, more user-friendly functions in dropdown menus to accomplish the same organization.