Distilling nlp4arc 💩

Notes by Katie M.

This past week I traveled to the University of North Carolina for nlp4arc, an intimate symposium marking the start of BitCurator NLP, an Andrew W. Mellon-funded project aimed at developing a suite of natural language processing tools for archives. The meeting opened with 11 presentations by educators and archivists who shared their experiences building and applying NLP to analyze digital collections. The second half was scheduled to be more of an ‘unconference,’ with group-selected topics of interest to be discussed in smaller circles. Unfortunately, midway through, the university announced that Chapel Hill’s water supply was being shut off immediately due to a county-wide water emergency, forcing us to evacuate while discussing things like the frozen NYPL in The Day After Tomorrow, and “preppers.”

Despite this interruption, we had enough time to review active and closed projects, and walk away with ideas that should be considered or incorporated into future software. Here were my personal takeaways:

Your name is a small part of your identity
Daniel Pitti chose a more theoretical approach to his talk, focusing on the challenge of identity and, in the context of NLP tools, the limitations of a ‘name’ entity. He described the makeup of an individual as being part physical person (what we see when we people-watch) and many parts social person (work-self, hobbies-self, friend-self, etc.), none of which is represented by a name.

“To form a “reliable” identity we must triangulate across multiple sources providing mutually corroborating facts and contexts assembling fragments into a constellation that “identifies” that person.”

Are we looking for questions or answers?
This point was expressed by attendee Stephanie Haas, a UNC professor with over 20 years of NLP research and experience. When conversation circled around the responsibility of an archivist versus that of a researcher, she responded by questioning our expectations of natural language processing. Effective platforms may expose new lines of inquiry through dynamic arrangement, but we may never find an application that will allow us to touch a document just once.

Communities sustain projects
(this practical advice is a point I continue to revisit)
Our final presentation was delivered by Carl Wilson, tech lead of OpenPreservation.org. He mentioned a number of projects that he described as fascinating and complex but ultimately unsuccessful. Many projects mentioned over the course of the morning shared a common thread of frustration at being unable to sustain the work, citing issues like tech challenges, lack of funding and low use. Yet Wilson made the point that when communities care, anything is sustainable. If a user community is too exclusive, it resists the kind of expansion and care that arises through community-formed documentation, bug reports, feature requests, etc.

On that note, I was left considering how BitCurator NLP is currently at the stage that holds the most potential: the beginning. At the next symposium maybe the conversation will be interdisciplinary, inviting non-archivist/academic voices to discuss their experiences (and more diverse as well: ten of eleven nlp4arc speakers were male). This is an opportunity to develop a platform that will be accessible to a community of users, not just select experts.

Tools mentioned:
veraPDF
ArchExtract
Voyant
Stanford NLP
CMU Sphinx
ePADD

 

Email Mis/Management 📂

(NOTES by KATIE M.)

Seemingly reckless email use had a major impact on this year’s presidential campaign—this was framed as both a matter of secrecy and irresponsible record keeping—but a central issue is the sticky nature of the format. As a new technology, email in its founding years was disregarded as an informal communication mode until significant legal cases in the early 2000s raised the status of information transmitted in this form. Advancements in the format have only increased the impact of mismanagement as messages more easily proliferate and storage costs drop. Let’s review some recent instances where email records have made headlines.

2012
The Account of Richard Windsor

EPA Administrator Lisa Jackson becomes the target of an audit after she is discovered to have a private account under the nom de plume “Richard Windsor,” a combination of her former residence and family dog’s name. In defense, “assigning a secondary email account to the administrator at EPA is not new to this administration. The intent, the agency says, is for the administrator to have a manageable email account in addition to the one that is openly available to the public.” http://politi.co/2gPAY3i

2013
The IRS Gets Personal

Lois Lerner, an IRS official, is asked to turn over six years of official correspondence after she is suspected of having used a personal email address to discuss the targeting of specific political groups interested in tax-exempt status, namely those related to the Tea Party movement. http://bit.ly/2gc9vca

2013
Sekisui Medical America v. Hart

Sekisui fails to provide evidentiary documents during a financial dispute with a client, claiming “during that period, the business unit’s HR director deleted the relevant emails because they were cluttering the company’s servers.” This results in a renewed awareness of the need to redefine rules governing electronic discovery, producing a number of legal and social conclusions.

  • Legally, there isn’t a difference between negligent destruction to save space or to hide evidence. As Judge Shira Scheindlin states, “The law does not require a showing of malice to establish intentionality with respect to the spoliation of evidence.” http://bit.ly/2gcKrlm
  • Additionally, while organizations retain responsibility for their electronic records, legal sanctions regarding electronic discovery need to be updated once more to fit with the evolution and proliferation of email. http://bit.ly/2gN8x7i
  • Emails need to remain in their original format. Sekisui’s practice of printing important emails for their archive instead of retaining the electronic form contributed to a significant loss of data. This was supported in court by referencing a tool newly developed by the MIT Media Lab to visualize a person’s professional and personal history using compressed email metadata: https://immersion.media.mit.edu/.

2014-2016
hdr22@clintonemail.com

In 2009, Hillary Rodham Clinton becomes secretary of state and begins using hdr22@clintonemail.com for personal email. After a committee is formed in 2014 to investigate the 2012 Benghazi attacks, the State Department requests all of her emails, and private and public reviews of her correspondence ensue. One year later, the investigation is continued by the FBI, impacting HDR’s presidential campaign in 2016. http://nyti.ms/1TNFlMg

2014-2016

A federal investigation into Bridgegate uses NJ Gov. Chris Christie’s personal and government email accounts for evidence. Two and a half years later, messages from a private account between Christie and his wife contribute further information about his involvement in the matter.  http://bit.ly/2g0XoRy

These are extraordinary examples of unexceptional issues related to records management: blended personal and professional correspondence, reactions to the cost of space, and unclear legal retention periods. Legal aspects may similarly inform policies within arts organizations but collecting for institutional memory requires separate attention to long-term preservation and access. Over the past few months, I’ve been considering the retention policies surrounding electronic records in place at BAM and SRGM in an effort to understand how they may or may not be compatible with the everyday habits of staff. It has exposed how email, as a digital format, exists on the periphery of electronic records management in the broad category of ‘correspondence.’ In a way, this neglects the complexity of a mode that allows a thick web of overlapping data to form through attachments, forwards, personal talk, and professional updates. I thought it might be helpful to review recommendations provided to government staff during email management training.

“Digital curation is not simply a matter for those charged with care of resources at the end of their active lives, for the term ‘digital curation’ refers to the ongoing management of digital materials for both current AND future use. Curation issues are relevant from day one of the records life-cycle, from creation through to curation and including re-use of the data.”
-Maureen Pennock (2006)

(Texas State Library and Archives Commission recommendations used during records management training for state agencies and local governments)

1. Is it a record?

As Geof Huth notes in his recommendations for digital appraisal, in the context of an institution addressing its own records, appraisal is usually an extension of records scheduling. Using the example of an arts institution, a retention schedule may suggest the following record categories for a valued creative department:

  • Collection Object Files & Artist Files
  • Exhibition Files
  • Correspondence: significant, routine, and related to other records (*including email)
  • Education Programming
  • Research Files

Organizing messages and evaluating the significance of content delivered through email is easier when reviewed in relationship to each record category, rather than as an extension of ‘correspondence.’ For example, receiving an email with a link to download images of artworks may fall under ‘Research’ rather than ‘Correspondence,’ and can be deleted after extracting the files. This approach relies on communication and collaboration between all participating departments to curate content accordingly.

  • Collection Object Files & Artist Files: Information on artworks and artists in collection, including documentation on the installation of an artwork. For permanent collection works, includes treatment reports and incident reports related to artworks. Also includes artist interviews.
  • Exhibition Files: Information collected and created by conservation related to exhibitions including traveling, foundation and affiliate exhibitions. For loans, includes treatment reports and incident reports related to artworks.
  • Correspondence: Correspondence that documents important activities, events, operations, policy changes, etc. Correspondence with artists (email, audio/video recordings).
  • Education Programming: Programming: Includes documentation [correspondence, contracts, planning, lectures, surveys] on programming such as symposiums, and conversations with contemporary artists.
    Visual Materials: Includes photographs, videos, and/or films of programs, residencies, publicity, etc. as well as photo permissions.
  • Research Files: Information collected and created by curatorial related to collections, provenance, and other topics.

2. Is it related to your job? 

  • Personal mail may seem like the most obvious non-work related message type. A simple solution would be to make a list of friends and family who you correspond with regularly and filter those messages into their own folder.
  • CCs are often courtesy copies shared to keep different parties up to date on a project, but a copy does not need to be maintained. Unless you go on to contribute to the conversation in a notable way, CCs can be deleted.
  • Unsolicited messages, even when work-related, can be deleted. This includes newsletters, office updates, PR announcements, etc.

3. Are you the custodian?

  • The person who wrote and sent the email can be considered the custodian of the record copy. 
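
As a rough illustration, the three questions above could be sketched as a small triage function using Python's standard `email` package. This is only a sketch under stated assumptions, not a tool used at BAM or SRGM: the sender list, the outcome labels, and the choice to treat mailing-list headers as a sign of unsolicited mail are all hypothetical.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical values for illustration only; a real list would be
# maintained by each staff member.
PERSONAL_SENDERS = {"sibling@example.com"}

# Standard mailing-list headers, used here as an assumed signal that a
# message is a newsletter or other bulk mail.
BULK_HEADERS = ("List-Unsubscribe", "List-Id")


def addresses(msg, header):
    """Return the lowercased email addresses found in one header field."""
    pairs = getaddresses(msg.get_all(header, []))
    return {addr.lower() for _, addr in pairs if addr}


def triage(raw_message, my_address):
    """Suggest a handling label for one message (labels are hypothetical)."""
    msg = message_from_string(raw_message)
    me = my_address.lower()
    sender = addresses(msg, "From")

    # 2. Is it related to your job? Personal mail goes to its own folder.
    if sender & PERSONAL_SENDERS:
        return "personal-folder"
    # Unsolicited bulk mail can be deleted even when work-related.
    if any(msg.get(h) for h in BULK_HEADERS):
        return "delete-unsolicited"
    # 3. Are you the custodian? The sender holds the record copy.
    if me in sender:
        return "keep-record-copy"
    # Courtesy copies can usually go unless you later contribute.
    if me in addresses(msg, "Cc") and me not in addresses(msg, "To"):
        return "cc-courtesy-copy"
    # 1. Is it a record? Everything else needs a human decision
    # against the retention schedule.
    return "review"
```

A function like this can only suggest a label. Deciding whether a message is a record, and which record category it belongs to, still depends on the retention schedule and on the judgment of the staff member.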

These recommendations are meant to be applied at the creation stage to influence routine maintenance of email correspondence. Additionally, desktop and cell phone applications offer private access to the creator, which leads to the perception of email as a data source for personal reference only, unlike departmental records, which are usually saved to a shared drive. Training can alter this view by making connections between these records and the development of an organization’s history. Tips for curating email have contributed to my review of two email accounts provided by BAM and SRGM, the first containing a valuable record of a project’s development, and the other a broader reflection of the institution formed over the course of a director’s career. Over the next posts, I will be exploring what it has been like to use ePADD to appraise records while considering a digital curation (and risk management) framework, and how conversations with staff and other archivists have continued to reveal the complex qualities of this format.


Huth, G. (2016). Appraisal and acquisition strategies. Chicago: Society of American Archivists.

Pennock, M. (2006). Preservation of e-mail messages. In S. Ross & M. Day (Eds.), DCC Digital Curation Manual. Retrieved from http://www.dcc.ac.uk/resource/curation-manual/chapters/curating-e-mails

Texas State Library and Archives Commission (2013). Email management [PowerPoint slides]. Retrieved from https://slrmtraining.tsl.texas.gov/index.php?

A Look at Institutional Email

Hello!

In this first post, I’d like to give a bit of background about my proposal, and the logic guiding which tools, software, and research I’ve chosen to discuss in the posts to follow.

SIP AIP DIP Anxiety

My project developed out of pitches submitted by two archives engaged in recent efforts to strengthen (and simplify) their born-digital workflows: the Brooklyn Academy of Music (BAM) Hamm Archives and the Solomon R. Guggenheim Museum (SRGM). As a performing arts venue and an art museum, these two institutions operate on similar cycles of exhibitions/performances, during which high-value institutional records are created regularly by programming and curatorial staff. For both, institutional email preservation was highlighted as an area in need of attention. Like many of their contemporaries, they have only informal or very broad record retention policies for email, and internal education about organization hasn’t been consistent. Although both have similar goals for realistic incorporation of email into record management schedules, and options for access, they each offer very different examples of accounts and staff management. This suggested an opportunity to consider a cross-organization framework for email archiving.
