Skip to Main Content

Cleaning Data with OpenRefine: Tutorial

A guide to using the OpenRefine program to organize messy datasets

Walkthrough of Different Functions

This tutorial will provide you with a sample spreadsheet of messy and disorganized data and walk you through the steps needed to transform it into a state where it can be meaningfully analyzed. This will demonstrate many of the base functions and capabilities of OpenRefine.

Tutorial

Once you have the messydataset file downloaded, the next step is to import it into OpenRefine and begin working with it.

Navigate to Create Project and select Get Data from This Computer.

 

 

Click on Choose Files and locate the file. Click Next.

OpenRefine will show a preview of the data to make sure it has interpreted it correctly. The program can work with a variety of file types beyond simple spreadsheets. There are different options and checkboxes, but leave everything as is and click on Create Project in the top right corner.

This is the main working interface. Notice that not all of the rows are showing. If you are working with a large file with thousands of rows, OpenRefine hides some of the rows to improve performance. Since this sheet is fairly small, you can click on the numbers in the top row to show more rows. Clicking on number 50 should show the complete number of rows.

 


One of the columns has long name that is obscuring the rest of the columns. You can rename columns to make their names for conciseness or clarity.

 


By clicking the down arrow in the column header, you will see a dropdown menu. Select Edit Column and then Rename this column. Based on the context, this column can be renamed simply to Salary.

 

 

Now the file is imported and all columns and rows are viewable and ready to be organized.

If you try sorting the name column, you will get some unexpected results in that some names are out of order. This is because some cells have leading or trailing whitespace before the text. This can happen when data is manually entered and spaces are accidentally inserted. This is easily fixable within OpenRefine without having to manually search and edit each cell.

Click on the arrow for the All column to access the dropdown menu. Go to Edit all columns > Trim leading and trailing whitespace.

 

 

The issue is now corrected and the columns can be sorted alphabetically.

Some of the data in various cells is inconsistently spelled. OpenRefine offers a way to quickly identify and fix this.

Select the down arrow in the column you want to analyze and select Facet. For this example, select the column "What industry do you work in?"

 

 

In the left sidebar, you should see a list of results. These are the unique values in the column. If you don't see your results, navigate to the top of the sidebar and make sure you have the Facets/Filters tab selected.

 

 

Notice a few irregular or similar results. The categories of Healthcare and Nonprofit have been incorrectly split. OpenRefine has a function to regroup these categories.

Click on the down arrow next to the column header, then Edit cells > Cluster and edit.

 

 

A popup box will appear with several different options.

 

 

Change Method from Key Collision to Nearest Neighbor.

Click on Cluster in the center of the box. The program will analyze the text in the columns for similar results.

 

 

For this example dataset, you should receive two results:

"Healthcare" and "Health care"
"Nonprofits" and "Nonprofit"

Click on the check box next to each and replace "Health care" with "Healthcare" and "Nonprofits" with "Nonprofit".

Click on the button that says Merge selected and close. This will perform the clustering action.

Now, when you check the facet results, there should be no misspelled or duplicate data.

Upon scanning the column, you will notice that the date formatting is inconsistent. The majority of cells contain only the year of birth, yet a few are input in the M/D/YYYY format, which is unnecessary.

First, we will have to standardize the format. Click on the down arrow next to the column header for Birthdate and choose Edit cells > Common transforms > To text.

 

 

Next, click on the down arrow next to the column header for Birthdate and choose Edit Column > Add column based on this column. This will create a new column based on the data that is in the selected column.

 

 

On this screen, you will see a few options.

 

 

 

For New column name, type in Date of Birth, since two columns cannot be named with the same name.

GREL is the coding language that OpenRefine uses. It is used for more advanced functions or functions that are not available through the dropdown menus. Beginners can mostly get by with copy/pasting and not memorizing code. In the expression box, delete the text that says value and replace with this:

 

value.toDate().datePart("years")

 

You will now see the preview below update, showing that OpenRefine will copy the original Birthdate column, but change the date to only display the year and not the full date. Click OK.

 

A new column Date of Birth has been created, based on the previous Birthdate column, and this new column has consistent formatting for the individuals' birth year only.

 

 

Now, it is okay to delete the original Birthdate column, leaving only the new Date of Birth column with the data in matching format.

A common use for the split function is separating a singular Name column into First and Last Name columns.

Click on the down arrow next to Name column. On the dropdown menu, go to Edit column > Split into several columns.

 

 

Here, you will instruct the program how you want the columns to be split. In this case, there is a single space between the first and last names. In the box labeled Separator, press the space button once. Then click OK.

 

 

The Name column has now been split into Name 1 and Name 2 columns.

In this dataset, there is a column for City, a column for State, and a column for Country. Let's say we are not interested in breaking it down and would like to consolidate two of these, the city and state.

 

 

Click on the down arrow next to the column "What city do you work in" and on the dropdown menu, go to Edit column > Join columns.

 

 

On this popup box, you can select which columns you would like to join together. Select the two for city and state. Since the columns follow sequentially, we can easily merge them. But if we merge them with no separator, it will run the words together. To avoid this, in the options box for separator, type a comma followed by a single space. Click the box for "Write results in a new column" and type in "Location." Finally, check the box next to Delete joined columns. Now, click OK.

 

 

Now, the secondary column has been successfully merged with the first into a new column with correct formatting and the original columns have been deleted.

 

Once the main organization has been done, there are a few additional cleaning steps that can be taken to make sure data is properly arranged.

  • Rename Columns -- If you are merging or splitting columns, the original headings may no longer be accurate. Take the time to rename the headings to clarify the newly organized data. For example, change the "Name 1" and "Name 2" column headings to "First Name" and "Last Name."
  • Remove unnecessary columns -- If you did not do this at the beginning, take some time to see if there are any redundant or unnecessary columns that do not contain useful information. For example, in the column "What country do you work in," every result is some form of "United States." Therefore, this is redundant. Delete any columns that are not pertinent to your end goals.    
  • Reorder columns -- It is possible to move columns around to more helpful orders. Names are typically sorted alphabetically by last name, so it makes sense to move the last name in front of the first name. You can do this by clicking on the down arrow and clicking Edit column > Move column left.
  • Fix capitalization -- If you have noticed that certain proper names or titles are not capitalized, there is a quick fix for this. On the All column, click the down arrow. This will present you with functions that will affect every cell in the dataset. Click on Edit all columns > To titlecase. This will capitalize the first letter of every word in each cell.
    • Note: This action will automatically transform all cells into text, so uncheck the box next to any columns containing numbers.

Once you are finished with cleaning your dataset, you are ready to save it and move on.

Navigate to the top right corner and click on Export.

You will be given a list of options. For most general purposes, select Excel 2007+ (.xslx)

 

 

Your file will be exported in the chosen format to your Downloads folder.

Now the data is much cleaner and organized and ready to be input into other programs such as Excel, PowerQuery, or Tableau.

Template