Duplicate Detection Tool: how to reduce digitization costs and improve discoverability for your users

New York Botanical Garden, LuEsther T. Mertz Library

Written by slynch on Fri, 03/04/2016 – 15:31

Our proposal is for a tool that would tackle a major challenge for digital libraries: identifying duplicate bibliographic records between potential contributors and existing records in a digital repository. We expect that any union catalog or digital repository that aggregates content from multiple sources struggles with this problem, and the tool could be generalized to identify unique, potential additions to DPLA, HathiTrust, or Digital Culture of Metropolitan New York.

The Biodiversity Heritage Library (BHL) is a consortium of natural history and botanical libraries that cooperate to digitize the legacy literature of biodiversity held in their collections and to make that literature available for open access in a digital repository (http://www.biodiversitylibrary.org/). BHL and its contributors, including both the New York Botanical Garden and the American Museum of Natural History, need a tool that can compare two sets of bibliographic metadata. Such a tool would support a critical part of the BHL workflow: determining whether a potential contributor's records duplicate content that is already in BHL or is in the process of being digitized for BHL. For example, the Wildlife Conservation Society (WCS) is willing to contribute digitized copies of items in its collection to BHL. Unfortunately, the only way to determine which public domain material (assuming a cut-off of 1922) held by WCS is lacking in BHL is to perform a manual search, which requires considerable staff time.

We envision a tool that accepts an uploaded .csv file containing standard bibliographic metadata, such as OCLC #, ISSN, ISBN, title, and date of publication, compares the file against the BHL corpus, and reports which titles represent unique, potential additions to BHL and which would duplicate existing content. Other functional requirements of the tool include the ability to: 1) perform fuzzy matching, in order to account for differences in how data values are recorded in fields such as title, author, or publisher, and 2) process a fairly large number of records (a minimum of 200,000). A sketch of the matching logic we have in mind follows below.
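To make the idea concrete, here is a minimal sketch, in Python, of the two-pass matching we envision: an exact pass on shared identifiers, followed by a fuzzy pass on titles. The column names (oclc, issn, isbn, title, date), the 0.9 similarity threshold, and the block-by-year shortcut are illustrative assumptions for this sketch, not a specification of the tool.

```python
import csv
import difflib

# Identifier columns to try for exact matching. These header names
# (and all column names below) are assumptions about the uploaded
# .csv, not a fixed BHL schema.
ID_FIELDS = ("oclc", "issn", "isbn")

def normalize(text):
    """Lowercase and drop punctuation so 'The Auk.' and 'the auk' compare equal."""
    cleaned = "".join(c for c in (text or "").lower() if c.isalnum() or c.isspace())
    return " ".join(cleaned.split())

def title_similarity(a, b):
    """Fuzzy match score in [0, 1] using the standard library's SequenceMatcher."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def load_csv(path):
    """Read bibliographic records from a .csv file into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def classify(candidates, corpus, threshold=0.9):
    """Label each candidate record 'duplicate' or 'unique' against the corpus.

    Pass 1: exact match on any shared identifier (cheap dictionary lookup).
    Pass 2: fuzzy title match, blocked by publication year so that a
    200k-record file is not compared against the entire corpus.
    """
    by_id = {}
    by_year = {}
    for rec in corpus:
        for field in ID_FIELDS:
            if rec.get(field):
                by_id[(field, rec[field].strip())] = rec
        by_year.setdefault(rec.get("date", "").strip(), []).append(rec)

    results = []
    for rec in candidates:
        match = None
        for field in ID_FIELDS:
            if rec.get(field) and (field, rec[field].strip()) in by_id:
                match = by_id[(field, rec[field].strip())]
                break
        if match is None:
            for other in by_year.get(rec.get("date", "").strip(), []):
                if title_similarity(rec.get("title", ""), other.get("title", "")) >= threshold:
                    match = other
                    break
        results.append((rec, "duplicate" if match else "unique"))
    return results
```

In practice the fuzzy pass would need more forgiving blocking (for example, a tolerance window around the publication year, or blocking on author or publisher as well), and a dedicated string-matching library could replace SequenceMatcher for speed. The per-row report of "unique" versus "duplicate" is the output the tool would return for an uploaded file.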

Benefits resulting from a duplicate detection tool include: 1) making the best use of limited digitization funds, so that money is spent only on digitizing truly unique material; 2) improving the user experience by minimizing the number of duplicate entries returned in response to a search request; and 3) saving the staff time now spent identifying duplicates manually.