Distilling nlp4arc 💦

Notes by Katie M.

This past week I traveled to the University of North Carolina for nlp4arc, an intimate symposium marking the start of BitCurator NLP, an Andrew W. Mellon Foundation-funded project aimed at developing a suite of natural language processing tools for archives. The meeting opened with 11 presentations by educators and archivists who shared their experiences building and applying NLP to analyze digital collections. The second half was planned as more of an ‘unconference,’ with group-selected topics discussed in smaller circles. Unfortunately, midway through, the university announced that Chapel Hill’s water supply was being shut off immediately due to a county-wide water emergency, forcing us to evacuate while discussing things like the frozen NYPL in The Day After Tomorrow and “preppers.”

Despite this interruption, we had enough time to review active and closed projects and walk away with ideas that should be considered or incorporated into future software. Here are my personal takeaways:

Your name is a small part of your identity
Daniel Pitti took a more theoretical approach in his talk, focusing on the challenge of identity and, in the context of NLP tools, the limitations of a ‘name’ entity. He described the makeup of an individual as part physical person (what we see when we people-watch) and many parts social person (work-self, hobbies-self, friend-self, etc.), none of which are represented by a name.

“To form a ‘reliable’ identity we must triangulate across multiple sources providing mutually corroborating facts and contexts assembling fragments into a constellation that ‘identifies’ that person.”

Are we looking for questions or answers?
This point was raised by attendee Stephanie Haas, a UNC professor with over 20 years of NLP research experience. When the conversation circled around the responsibility of an archivist versus that of a researcher, she responded by questioning our expectations of natural language processing. Effective platforms may expose new lines of inquiry through dynamic arrangement, but we may never find an application that lets us touch a document just once.

Communities sustain projects
(This practical advice is a point I continue to revisit.)
Our final presentation was delivered by Carl Wilson, tech lead of OpenPreservation.org. He mentioned a number of projects that he described as fascinating and complex but ultimately unsuccessful. Many projects mentioned over the course of the morning shared a common thread of frustration at being unable to sustain the work, citing issues like technical challenges, lack of funding, and low use. Yet Wilson made the point that when communities care, anything is sustainable. If a user community is too exclusive, it resists the kind of expansion and care that arises through community-formed documentation, bug reports, feature requests, etc.

On that note, I was left considering how BitCurator NLP is currently at the stage that holds the most potential: the beginning. At the next symposium, maybe the conversation will be interdisciplinary, inviting voices from outside archives and academia to discuss their experiences, and more diverse as well (ten of eleven nlp4arc speakers were male). This is an opportunity to develop a platform that will be accessible to a community of users, not just select experts.

Tools mentioned:
veraPDF
ArchExtract
Voyant
Stanford NLP
CMU Sphinx
ePADD
