Cleaning Data with OpenRefine: Clustering
Clustering
The cluster function uses several types of algorithms to analyze your data and help identify inconsistencies and errors. By using fuzzy matching on the values of a column, clustering determines if cell values are similar enough to be possible matches. For example, "Louisiana" and "louisiana" likely refer to the same location and just differ by capitalization, and "Gödel" and "Godel" probably refer to the same person.
Click the down arrow next to the header of the column you would like to cluster, then select Edit cells > Cluster and edit.
A popup box will appear with several options.
OpenRefine offers two methods of clustering: Key Collision and Nearest Neighbor. Key Collision groups values that reduce to the same key, while Nearest Neighbor compares values using a distance measure and can be tuned to catch looser matches. If you do not get satisfying results with one, try the other.
There are several options for Keying Function. These largely involve dealing with phonetic analysis or text in languages other than English and are beyond the scope of this introduction. The OpenRefine user documentation goes into more detail concerning these methods.
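To make the idea of a keying function more concrete, here is a minimal Python sketch of key collision clustering. It is not OpenRefine's own code: the functions fingerprint and key_collision_clusters are hypothetical names, and the normalization steps (lowercasing, removing accents and punctuation, sorting the words) are assumptions loosely modeled on a fingerprint-style key. Values that produce the same key are treated as one cluster.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Hypothetical keying function: reduce a cell value to a normalized key."""
    # Trim surrounding whitespace and ignore capitalization.
    text = value.strip().lower()
    # Decompose accented characters, then drop the combining marks (ö -> o).
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split into words, drop duplicates, sort, and rejoin.
    return " ".join(sorted(set(text.split())))


def key_collision_clusters(values):
    """Group distinct values whose keys collide; keep groups of two or more."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(members) for members in groups.values() if len(members) > 1]


# Illustrative column using the examples from the introduction.
column = ["Louisiana", "louisiana", "Gödel", "Godel", "Texas"]
clusters = key_collision_clusters(column)
for cluster in clusters:
    print(cluster)
```

In this sketch, "Louisiana" and "louisiana" share one key and "Gödel" and "Godel" share another, so each pair is reported as a cluster, while "Texas" has no match and is left out.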
The program will analyze the text in the column for similar values.
In the example below, the Nearest Neighbor cluster search has given two results:
"Healthcare" and "Health care"
"Nonprofits" and "Nonprofit"
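The two results above can be approximated with a short Python sketch of distance-based clustering. This is an illustration under stated assumptions, not OpenRefine's implementation: it uses Levenshtein edit distance as the distance measure, compares values case-insensitively, and the function names and the radius parameter are hypothetical. Values that are within radius edits of a member of an existing group join that group.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def nearest_neighbor_clusters(values, radius=1):
    """Group values linked by edit distances no greater than `radius`."""
    clusters = []
    for value in sorted(set(values)):
        # Find every existing cluster that holds a close neighbor of value.
        matches = [
            c for c in clusters
            if any(levenshtein(value.lower(), m.lower()) <= radius for m in c)
        ]
        merged = [value]
        for c in matches:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(merged)
    # Only groups with two or more values are worth reviewing.
    return [sorted(c) for c in clusters if len(c) > 1]


# Illustrative column containing the values from the example.
column = ["Healthcare", "Health care", "Nonprofits", "Nonprofit", "Education"]
clusters = nearest_neighbor_clusters(column, radius=1)
for cluster in clusters:
    print(cluster)
```

A larger radius finds looser matches but also raises the chance of grouping values that are genuinely different, which is why each proposed cluster should be reviewed before merging.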
You can choose whether you want to merge the values by checking the Merge? box.
In the New cell value box, you can type the value you want all of the merged cells to have.
When you have selected all the clusters you want to merge, click the button that says Merge selected and close. This will replace the values in each selected cluster with its new cell value and close the popup box.
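The effect of the merge can be pictured as a simple find-and-replace across the column. The Python sketch below is only an illustration: merge_clusters and the selected mapping are hypothetical names, with each key standing in for a New cell value and each list for the values in a cluster whose Merge? box was checked.

```python
def merge_clusters(column, merges):
    """Return a new column with each clustered value replaced.

    `merges` maps a new cell value to the original values it should replace.
    """
    replacement = {old: new for new, olds in merges.items() for old in olds}
    # Cells that belong to no selected cluster are left unchanged.
    return [replacement.get(cell, cell) for cell in column]


column = ["Healthcare", "Health care", "Nonprofits", "Nonprofit", "Education"]
selected = {
    "Healthcare": ["Healthcare", "Health care"],
    "Nonprofit": ["Nonprofits", "Nonprofit"],
}
cleaned = merge_clusters(column, selected)
print(cleaned)
```

Only the cells in the selected clusters change; "Education" was not part of any cluster and keeps its original value.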
Getting a firm grasp of your data at the outset is important. For example, you would not want to try clustering a salary column, as two separate people can have the same or nearly the same salary, and those values are not variant spellings of a single entry. Clustering is useful for grouping and regrouping different sets of descriptors and for name disambiguation.