
Cleaning Data with OpenRefine: Removing Duplicate Data

A guide to using the OpenRefine program to organize messy datasets

Removing Duplicate Data

If your dataset contains duplicate values, remove them before further analysis so that the same record is not counted more than once.

Make sure that the data grid is in Row mode.

 

 

Click the down arrow in the header of the column that contains duplicates and select Sort.

 

 

Choose how to sort the data depending on the values. In this case, we will select Text and A-Z.

 

 

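A Sort button has now appeared above the data grid. Click on it and select Reorder rows permanently. Until you do this, the sort only changes how the rows are displayed, and the next step would still work on the original row order.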

 

 

Click on the column header arrow again. On the dropdown menu, go to Edit cells > Blank down. This function compares each cell in the column with the cell directly above it. If the two cells hold the same value, the lower cell is “blanked out”: its value is removed, while the rest of that row is left unchanged.

(In this example, one cell has been blanked out because the data for Goodfellas was entered twice.)
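To make the effect of Blank down concrete, here is a minimal Python sketch of the same idea. It is a simplified model for illustration, not OpenRefine's own implementation. The function name blank_down and every sample title other than Goodfellas are assumptions made up for this example.

```python
def blank_down(values):
    """Blank each cell that repeats the value of the cell above it.

    Simplified model: None stands for a blank cell, and a run of
    identical values keeps only its first cell.
    """
    result = []
    previous = None  # the value that a repeat would have to match
    for value in values:
        if value is not None and value == previous:
            result.append(None)      # repeat of the cell above: blank it
        else:
            result.append(value)     # first occurrence (or already blank)
            previous = value
    return result


# Hypothetical sorted column; only "Goodfellas" comes from the guide.
titles = ["Casablanca", "Goodfellas", "Goodfellas", "Jaws"]
print(blank_down(titles))
# ['Casablanca', 'Goodfellas', None, 'Jaws']
```

The sketch also shows why the column has to be sorted first: a value is only compared with the cell above it, so duplicates that are not next to each other would go undetected.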

 

 

Click the down arrow next to the column header, go to Facet > Customized facets > Facet by blank.

 

 

From the facet window in the left pane, select the true option. Only the rows with blanked cells are now displayed.

 

Click the down arrow next to the All column header, go to Edit rows > Remove all matching rows.

 

 

All rows containing duplicate data have now been removed. Close the facet in the left pane to see the remaining rows.
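For readers who want to check the result outside OpenRefine, the whole procedure (sort, blank down, facet by blank, remove matching rows) can be approximated in a few lines of Python. This is a rough sketch under stated assumptions, not a description of how OpenRefine works internally. The function name remove_duplicates, the column names, and the sample rows are all hypothetical.

```python
def remove_duplicates(rows, column):
    """Keep one row per distinct value in `column`.

    Mirrors the steps above: sort by the column, treat a cell that
    repeats the one above it as blanked, then drop every row whose
    cell is blank.
    """
    # Sort as text, A-Z, with blank cells last (assumed ordering).
    ordered = sorted(rows, key=lambda row: (not row[column], row[column] or ""))

    kept = []
    previous = None
    for row in ordered:
        value = row[column]
        # A row matches the blank facet if its cell was empty to begin
        # with, or if Blank down would have emptied it as a repeat.
        matches_blank_facet = (not value) or value == previous
        if not matches_blank_facet:
            kept.append(row)
            previous = value
    return kept


# Hypothetical sample data for illustration only.
films = [
    {"title": "Jaws", "year": 1975},
    {"title": "Goodfellas", "year": 1990},
    {"title": "Casablanca", "year": 1942},
    {"title": "Goodfellas", "year": 1990},
    {"title": None, "year": 1999},
]
print([row["title"] for row in remove_duplicates(films, "title")])
# ['Casablanca', 'Goodfellas', 'Jaws']
```

Note that the row with no title is removed as well. The same happens in OpenRefine: Facet by blank matches every blank cell in the column, including cells that were already empty before Blank down was applied, so check for such rows before removing the matches.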