Skip to Main Content

Cleaning Data with OpenRefine: Basic Functions

A guide to using the OpenRefine program to organize messy datasets

Basic Functions

In order to get started, you must first import existing data into OpenRefine via the Create Project option.

  • This Computer - This is the most-used option. Browse your local computer for files.
  • Web Addresses (URLs) - You can pull online data from URLs.
  • Clipboard - You can paste in data here.
  • Database - You can connect to SQL databases.
  • Google Data - You can import a Google spreadsheet from Google Drive. For more detailed instructions, please access the official documentation.
    • By URL - If you publish and turn on link-sharing from your Google Drive, you can paste in the shareable URL to OpenRefine.
    • By File - You can grant OpenRefine access to all files in your Google Drive.

 

OpenRefine supports the following file formats:

  • Excel spreadsheet (XLS or XLSX)
  • Comma-separated values (CSV) or Text-separated values (TSV)
  • Text files
  • OpenDocument spreadsheet (ODS)
  • Fixed-width columns
  • JSON
  • XML
  • PC-Axis (PX)
  • MARC
  • RDF data (JSON-LD, N3, N-Triples, Turtle, RDF/XML)
  • Wikitext

 

Remember, OpenRefine does not modify your original source data; it creates a separate project file that is then exported.

OpenRefine automatically categorizes each cell of data into one of these formats based on its contents:

 

Text

Name, Role, Category, Location

Numeric

3.14

Date

2019-12-23T00:00:00Z

Boolean

A True/False value

Blank

A cell value of 0 length

Null

Absence of value

 

If data is incorrectly entered, you may encounter formatting issues. Format changes can be accomplished two ways.

 


 

Column Level

Generally, all data in a column will match and formatting can be applied on a columnar basis. Select the arrow next to the column header you would like to change. On the dropdown menu, go to Edit cells > Common transforms > Number / Date / Text

 

 


 

Cell Level

If there are cell level corrections that need to be made, simply drag the mouse over the cell in question and click on the blue Edit button that appears. In the popup box, click the dropdown menu for Data Type and select the correct format.

 

The Undo/Redo tab shows a history of actions you have taken and allows you to reverse or redo changes if you don't like the results. Every step or action is reversible. They appear in sequential order whenever you make changes such as renaming, splitting, or editing cells. Simply click on the step you would like to return to and the following step is reversed. 

 

 

This list of actions can also be exported and reapplied to other datasets. This could be useful if you are working with a recurring set of data on a weekly or monthly basis that will need to be ordered the same. Extracting the steps and then reapplying them to a different dataset takes the work out of repetitive data cleaning. 

To do this, click the Extract button at the top of the side panel and on the popup window, select Export.

 

 

To apply the same steps again, click on the Apply button next to Extract. On the popup window, copy/paste the saved steps into the box and click Perform operations.

 

Once you are finished cleaning your dataset, you are ready to save it and move on.

Navigate to the top right corner and click on Export.

The list includes several format options, many of them for advanced or specific purposes. For most general purposes, Excel 2007+ (.xslx) is the most commonly used format that is usable within Microsoft Excel and other spreadsheet programs. 

 

 

Your file will be exported in the chosen format to your Downloads folder.