Using OpenRefine to gain insights into, cluster, clean and enrich messy data

Imagine the following scenario: you receive a file (Excel, CSV, text, XML, …) containing a long list of customer, vendor or project data, and you need to structure and clean it before you can use it for analytics, reporting or other processing steps. There are lots of duplicate entries, names are spelled in different ways, everything is a big mess, and a manual clean-up would cost you a few hours of your precious time…

Solution

OpenRefine (formerly Google Refine) is a free, open source application that lets you explore data (generate insights), clean and transform it with powerful scripting features, and reconcile or match it against data from web services or databases such as Freebase [1]. Because you can extend your dataset with all kinds of data available through web services, the possibilities are nearly endless. In addition to the core OpenRefine product, a growing list of extensions and plugins is available [2].

OpenRefine is a web application that you use entirely through your browser. You might say: Yes, but I am worried about uploading my sensitive data to the cloud! No worries: even though OpenRefine is a web application, it runs completely on your local machine. The software bundle contains a web server, and you access OpenRefine on a localhost (127.0.0.1) port.

Installation
The installation is pretty straightforward:
1. Download the ZIP file from [2] to your hard drive.
2. Unzip it to a folder of your choice and run google-refine.exe (the executable name still refers to Google as the origin of the project; the naming will probably change in the future).
3. Open http://127.0.0.1:3333/ in your browser.
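
If the page does not load, it can help to check whether anything is listening on that port at all. The following is a minimal sketch, assuming the address 127.0.0.1 and port 3333 from step 3; the function name is_listening is made up for illustration and is not part of OpenRefine.

```python
import socket


def is_listening(host: str = "127.0.0.1", port: int = 3333, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing accepts connections on the given address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 3333 is the port from step 3 above; adjust it if your setup differs.
    print("OpenRefine reachable:", is_listening())
```

This only tells you that some process accepts TCP connections on the port, not that the process is OpenRefine.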

[Screenshot: OpenRefine start page]

What to use OpenRefine for?

One feature I really like is the ability to cluster data using techniques such as key collision or nearest neighbour. As you can see in the screenshot below, the algorithm groups all entries that obviously belong to the same company but are written in different ways into the same cluster. In addition, OpenRefine reveals properties of the data by providing statistical information on its distribution.

[Screenshot: clustering of Contracts.csv in Google Refine]
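
To give a feeling for how key collision clustering works, here is a minimal Python sketch. It is a simplified approximation of the idea, not OpenRefine's actual code: every value is reduced to a normalised key (lowercase, ASCII only, punctuation removed, tokens sorted and de-duplicated), and values that end up with the same key land in the same cluster. The function names and the company names are invented for illustration.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Reduce a string to a normalised key so that spelling variants collide."""
    # Decompose accented characters and drop whatever is not plain ASCII.
    text = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    # Lowercase and remove punctuation (anything that is not a word char or whitespace).
    text = re.sub(r"[^\w\s]", "", text.lower())
    # Sort the unique tokens so that word order no longer matters.
    return " ".join(sorted(set(text.split())))


def cluster_by_key(values):
    """Group values by fingerprint; keep only groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [members for members in groups.values() if len(set(members)) > 1]


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the screenshot.
    names = ["Acme Inc.", "ACME INC", "Inc. Acme", "Müller GmbH", "Muller GmbH", "Globex Corp"]
    for cluster in cluster_by_key(names):
        print(cluster)
```

Nearest neighbour methods take a different route: they compare values pairwise using a distance measure such as edit distance, which also catches typos that a key like this misses, at the price of more computation.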

I could go into more detail on what is possible with OpenRefine and compile a tutorial, but I'd rather point you to the very helpful and informative official OpenRefine videos. An extensive tutorial might be something for a second blog post on this topic :)


Explore data

Clean and transform data

Reconcile / Match data

I would be very interested to hear whether and how you use OpenRefine in your daily work, and whether you like this kind of presentation of useful data tools.

Links

[1] http://www.freebase.com/

[2] http://openrefine.org/

1 Response to “Using OpenRefine to gain insights into, cluster, clean and enrich messy data”


  1. 1 Oliver T.

    Looks promising and I will give it a try in my current project.
    It involves several web services that provide ‘near identical’ data but are segregated business-wise and have fallen victim to ‘entropic growth’.
    Thanks for the tip!
