BY KAREN H.
OpenRefine is a well-loved tool among information professionals for cleaning “messy” data, mostly tabular data (Excel, CSV, TSV), but also record data in serializations like XML. Do you have values in an Excel spreadsheet with unwanted whitespace? Or multiple spellings for the same term? Then OpenRefine might be just the tool for you. OpenRefine is flexible enough to handle script-writing or the writing of regular expressions to batch alter values any way you choose. And scripting can be used for other purposes, too, including calling outside APIs to align new data with what you have.
For my linked data fellowship project with METRO, I wanted to determine relatively straightforward ways to reconcile entities associated with resources. This can be useful for librarians and archivists, for example, who have not conformed their data to outside authorities or vocabularies and are not currently using systems that provide this type of automated look-up. But it can also be useful for those who already have reconciled data, but who want to double-check their assigned entity reconciliations for correctness or check if entities that previously did not have authority records do now, since new records are constantly being created. OpenRefine offers the possibility of reconciling terms much faster than manual look-ups, while also not requiring coding skills.
For my initial run, I wanted to use OpenRefine to reconcile a list of names I had acquired from subject and associated name metadata assigned to document resources that pertain to the artist Martin Wong on the digital archive artasiamerica. I acquired the list by webscraping the site with Python and Beautiful Soup (for more on this see an earlier blogpost: “Using Beautiful Soup with Python for Webscraping”), but reconciling names using OpenRefine can be performed with any list of names you have, whether a stand-alone column in a spreadsheet or a column within a larger spreadsheet, like a data export from a repository.
There are several different ways to reconcile names using OpenRefine, and at different times, they are in varying degrees of actually working. But don’t be discouraged! There’s an extremely active community around this open-source tool, so more often than not things do get repaired, replaced, and/or further developed.
Creating a project in OpenRefine
If you don’t have OpenRefine yet, but are interested in trying this browser-based tool or in following this tutorial, you can download it here: http://openrefine.org/download.html. Once you’ve downloaded OpenRefine, create a project. Launch the application. It should open a new window in your browser. Make sure the “Create Project” side tab is active under “Get data from”. By default, the file location is set to “From this Computer”. Choose your file that has the names you would like to reconcile (XLS, CSV, etc.) using the button provided and click “Next”. In this second panel, you can change the name of your project at the top. OpenRefine shows a preview of your data. Make sure it looks correct. For example, if your file does not use column headings and simply starts the first row with data, make sure “Parse Next” is not selected in the items at the bottom. Once your data looks correct, click “Create Project”.
When you have your data in the OpenRefine workspace, there are several methods you can employ to reconcile your data.
METHOD #1 for Reconciling Names with OpenRefine: Using the reconcile function
If you’ve just downloaded OpenRefine for the first time and are using the most recent version 2.7-rc2 or have just upgraded to 2.7-rc2, reconciling is very simple. A reconciliation service for Wikidata comes installed by default (replacing Freebase in earlier versions, which hadn’t worked for me for a while). If you’re not using 2.7-rc2, instructions are included in the directions below on how to add Wikidata as a a service.
A couple of general notes about these directions before getting started: When reconciling, I prefer to keep the original column and use a separate column for storing the reconciled names. And at this point in time (it could change!), the reconcile function in Wikidata returns only names rather than the full URI, and I like to have the URI as well in a separate column, since I will use those URIs later to build RDF triples. Following the instructions below, then, you will have three columns at the end: 1) your original column, 2) a column showing the name it reconciled to, and 3) a column with the URI for that entity. For my example, I will reconcile to Wikidata IDs, but information will also be provided on how to reconcile to Library of Congress’ Name Authority File (LCNAF).
- To create a new column with your values that will be translated into reconciled entities, click the triangle next to your column name and go to “Edit column”>”Add column based on this column”. In the window that appears, create a name for the new column, like “Wikidata entity”. Keep the settings as they are, and click “OK”. You should see a column identical to your original column.
- To reconcile your names, click on the triangle next to your new column’s name and select “Reconcile”>”Start reconciling” from the pull-down menu. A window should open. If you’re using OpenRefine 2.7-rc2, you should see “Wikidata Reconciliation for OpenRefine (en)” as the default service. If you don’t see it in your version of OpenRefine (because you’re using an earlier version, for example), you can add it manually by clicking on “Add Standard Service” in the lower left-hand corner and copying and pasting the following URL in the field: https://tools.wmflabs.org/openrefine-wikidata/en/api. Then press “Add Service”. A wheel will turn that says “Working” and then you should see “Wikidata Reconciliation for OpenRefine (en)” has been added to the list of services to choose from. Make sure that service is selected. For some reason when I added the service, I was not able to select the property “human” the first time, but human seems to be the default, so it worked anyway. You can choose to change the other settings, as appropriate (I used the default), and then click “Start Reconciling”. On one computer, I had difficulty seeing the button to start reconciling, because it was off-screen and the window would not scroll. If this happens to you, try using your browser functions to zoom out until you see it. OpenRefine shows its progress while it reconciles. Depending on how many names you are reconciling, this could take a little while.
- Once it’s done, if you used the default settings, your list will probably reflect the three different match levels of reconciliation: Those with high confidence in match; those with a selection of matches; and those with no match.
- Next you will need to double-check that the service has reconciled your names correctly. The pop-up window that opens when you click on a name may have a description included that suffices to know the reconciliation is correct. If it is incorrect, you can go through the other choices offered to see if your person is among them. In my case, many did not reconcile to the correct person, and many did not reconcile to anyone at all using Wikidata.
- Once you have completed your list, you will now want to extract the URI associated with that name into a third column. Your new column clearly has that information, since the suggestions you clicked through are linked to Wikidata pages, but OpenRefine does not display the URI. Click on the triangle for your new column and go to “Edit column”>”Add column based on this column”. In the window that appears, in the “New column name” field type “Wikidata URI” (or whatever makes sense to you). Make sure the language is set to “General Refine Expression Language (GREL)”. In the field “Expression” type this exact phrase:
The preview should show you well-formulated URIs for the entities that you were able to reconcile and indicate that nothing was retrieved for those that were not reconciled. Click “OK”.
- Now that you have your names reconciled, using the “Export” button on the upper right, you can export your project data in a variety of formats for further use.
If you’d like to reconcile your names using other services, you might be able to find one online with a Google search. For example, I found a service created by Christina Harlow that reconciles to Library of Congress Subject Headings and the Library of Congress Name Authority File. If you would like to use LoC for reconciliation instead of Wikidata, follow the instructions above, but in Step 2, instead of adding the URL “https://tools.wmflabs.org/openrefine-wikidata/en/ap” under “Add Standard Service”, add the following URL: http://lc-reconcile.cmh2166.webfactional.com/.
METHOD #2 for Reconciling Names with OpenRefine: Copying and Pasting a Reconciliation Script
When I first started reconciling this list of names with OpenRefine at the beginning of 2017, the method that worked best (again, functions on OpenRefine can really be in various stages of “working”) was a script that I found on a github repository by Matt Carruthers that reconciles names with the Library of Congress Name Authority File (LCNAF): https://github.com/mcarruthers/LCNAF-Named-Entity-Reconciliation. His repository has three different scripts: one for corporate names, one for personal names, and one for generic names, which queries for both, but which he states is less accurate.
As per the author’s instructions, once you load your data file into OpenRefine, the only change you need to make is to call your target column of names “Name” (use the triangle next to the column name and then use Edit column>Rename this column). Then, to reconcile, under the “Undo/Redo” tab, click the “Apply” button and paste his JSON code in the panel that opens.
You can begin reconciling by hitting the button “Perform Operations”. One important caveat: This script does not directly query the Library of Congress’ databases to reconcile. It instead uses VIAF, the Virtual International Authority File, to return the LCNAF identifier.2
The script retains your original list, while adding a column for the human-readable name it was reconciled to as it appears in the Library of Congress Name Authority File and the LCNAF URI for that entity. This process is not 100% accurate, but because the script not only provides the URI for the name record, but also the human-readable name corresponding to the URI, it is often easy to see at a glance whether the name has been reconciled correctly. For example, the artist “Lin, Lin” was reconciled to “Stalin, Joseph, 1878-1953”, which is clearly incorrect. In some instances, it was necessary to double-check the LCNAF record, like “Fung, Ming Chip” to “Feng, Mingqiu, 1951-“ with the URI “http://id.loc.gov/authorities/names/nr2001043290” does translate to the same person, with “Fung, Ming-Chip” found among the variant spellings.
Again, after you run your reconciliation, you can export the results using the “Export” button in the upper right-hand corner
Because almost all the individuals in my list are artists, and because artasiamerica (the source of this list) uses Getty’s Union List of Artists Name (ULAN), AND because VIAF’s cluster record does include ULAN, I decided to edit Matt Carruthers’ original script to reconcile to ULAN IDs when available. That altered script can be copied from here, if you’d like to reconcile to ULAN: https://github.com/kllhwang/Named-Entity-Reconciliation-with-OpenRefine/blob/master/PersonalNamesReconcile-ULAN.txt.
His script can be altered to query the IDs from any VIAF contributor by changing: 1) each instance of the contributor abbreviation on VIAF to the new contributor’s (e.g., “LC” for Library of Congress; “JPG” for ULAN), 2) instances of column names (although this is actually optional, but it may get confusing if you don’t), and 3) changing the expression at Line 176 according to the contributor ‘s abbreviation length+1, for example:
- ULAN (“JPG” = 3 letters) change line to: “expression”: “grel:substring(value, 4)”,
- Library of Congress (“LC” = 2 letters) change line to: “expression”: “grel:substring(value, 3)”,
A full list of VIAF contributors can be found at the bottom of the VIAF home page. A quick way to see the abbreviation is to look at the linked URL to that contributor’s page. The abbreviation is at the end, before the “.html”.
- The string to be added was updated on March 22, 2017 from “https://www.wikidata.org/wiki/” to “http://www.wikidata.org/entity/” thanks to a reader’s feedback.
- VIAF started as a project to create cluster records from national authority files around the world and now other sources, as well, like Wikidata and ISNI. The records are updated periodically from the original sources. In general, VIAF is a great resource for looking up identifiers and URIs for entities from different sources, but can also be used for looking up the aggregated information from those authority files. Another project I worked on, for example, used VIAF (among other databases) to query gender data for musicians and others to add a new navigable gender view for the Linked Jazz network graph. You can read a blogpost about that at https://linkedjazz.org/enriching-the-linked-jazz-name-list-with-gender-information/.