Artificial Intelligence and Archives

—Rebecca Bayeck and Azure Stewart

“Artificial Intelligence and Archives” was the inaugural webinar of the series on Emerging Technologies, Big Data & Archives, organized by CLIR postdocs Rebecca Y. Bayeck of the Schomburg Center for Research in Black Culture and Azure Stewart of New York University. With the emergence of new technologies and big data, the processing and preservation of data have changed and will continue to change. As in other domains (e.g., health, video games), artificial intelligence (AI) is increasingly reshaping the way we process, interact with, and think about archives. Consequently, in the age of big data, archives are not just “a collection of historical records relating to a place, organization, or family” (Cambridge Dictionary Online). Today, archives also include all types of digital data—including social media data—and algorithms. Archivists are therefore called on to preserve and process data as they are being created, which requires understanding AI languages, processes, and practices for the creation and protection of data/records now for the future.

In this webinar, our speaker Dr. Anthea Seles, from the International Council on Archives (ICA), discussed AI in archival spaces: its uses, its applications, and the role archivists should play to become critical voices in AI discussions. Two hours were not enough to address all the questions raised by the 280 attendees. As a follow-up to the webinar, we have thematically organized and addressed the unanswered questions and present them here.

Artificial Intelligence in Archives

How much has AI penetrated archives in the developing world?

I would say [this has been] limited, if at all. I think the main issue is that these technologies are being applied in the assessment of development initiatives like Sustainable Development Goals (SDGs). Increasingly there are many projects focusing on artificial intelligence and human rights, for example the University of Essex Human Rights, Big Data and Technology Project, and it is becoming a concern for organisations like Amnesty International.

Who already has the best AI for archives today, according to ICA regulation, that we can adopt?

There is no commercial provider that works specifically on archival questions. I think you can use off-the-shelf eDiscovery software, but you need to have a basic understanding of what the technology is doing in order to measure your precision and recall. 
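
As an illustration of the two measures mentioned above, here is a minimal Python sketch. It assumes you have a set of records the tool flagged and a set an archivist judged relevant; the function name and the document identifiers are hypothetical and do not come from any particular eDiscovery product.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items.

    precision = share of retrieved items that are actually relevant
    recall    = share of relevant items that were actually retrieved
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical identifiers: what the tool returned vs. what an archivist
# marked as relevant in a manually reviewed sample.
tool_hits = {"doc1", "doc2", "doc3", "doc4"}
archivist_relevant = {"doc2", "doc3", "doc5"}

precision, recall = precision_recall(tool_hits, archivist_relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up sample, half of what the tool returned is relevant, and it found two of the three relevant records. In practice the relevant set would come from a manually reviewed sample of the collection.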

Artificial Intelligence Tools

Will governments and big corporations use artificial intelligence as a tool to centralize information in future?

Potentially. I think there is some thinking about this coming out of the records management community, but I still believe it is about balancing the strengths of the tool with the continuing need for human intervention. The question is, when will the human be needed? And what can the tool be trusted to do with minimum supervision? How do we ensure a continuous feedback loop to identify records of long-term value as information creation changes? 

What tools were you using for the file analysis and visualization in this presentation?

The screenshots are only example photos; they are not from any of the tools we used. We looked at several eDiscovery tools with different algorithms (e.g., Latent Semantic Indexing, Latent Dirichlet Allocation). These are bog-standard machine learning applications that have been around for a while, and we chose to go down that road to see what we could get in off-the-shelf commercial software packages.

So, is there a way to write a script to avoid metadata corruption and alteration?

There are tools now you can use that will preserve the integrity of the metadata when you move material from one system or file to another. I think for historical metadata alteration/corruption it is a question of how we explain this to users and how this might affect different access methods like visualisation. 
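
To give a rough sense of what such a script might look like, here is a minimal Python sketch using only the standard library. It is not one of the tools referred to above: it copies a file with `shutil.copy2`, which attempts to preserve file-system metadata such as timestamps, and compares SHA-256 checksums before and after the move. The function names and file names are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_with_fixity_check(source, destination):
    """Copy a file and confirm the copy is bit-for-bit identical.

    shutil.copy2 copies the content and tries to carry over file metadata
    (e.g. modification time); how much survives depends on the platform.
    """
    before = sha256_of(source)
    shutil.copy2(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise ValueError(f"Checksum mismatch after copying {source}")
    return before


# Hypothetical demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "record.txt"
    source.write_bytes(b"example record")
    destination = Path(workdir) / "record_copy.txt"
    checksum = copy_with_fixity_check(source, destination)
    print("Copy verified, SHA-256:", checksum)
```

A check like this only covers the file's content and basic file-system attributes. Metadata embedded in a records system or held in a database would need to be exported and verified separately.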

Will the International Council on Archives provide training on artificial intelligence and machine learning?

Not yet, but I’m open to suggestions. [We are] currently speaking with different stakeholders and maybe we can hold a hackathon at the Abu Dhabi Congress. 

Access to Archives

Will the course Managing Digital Archives be accessible online?

The Managing Digital Archives course is organized by the ICA and will be accessible online in fall 2020. Please check the ICA website or social media channels (Twitter and Facebook) for more information.

What are some of the practices in the UK National Archives and government on managing structured data as records? How does the UK identify, capture, manage, and apply retention and disposition to data (both transactional applications and analytical ones)?

There are no published policies on the identification of datasets that I can see, and I would suggest you contact either the record copying or the UK government web archive records unit to see if anything more substantive has been developed.

What is your suggestion for keeping physical records for posterity and authentication?

Records should always be maintained in the format in which they are created. The belief in scanning paper records and destroying them in order to save space and save on storage costs is a false economy. The level at which you should be scanning that material and the amount of metadata that should be captured to maintain it over time is very high. Also, you need to take into account computer storage costs, and whether you can afford the costs of digital preservation software, which all begins to add up. One must also take into account the active management of these authentic digital surrogates by digital preservation specialists. Furthermore, if you have a paper management problem and you don’t take that into account when you move into the digital environment, you are then transferring an analog integrity issue into a digital integrity/authenticity issue. Digital will not solve integrity issues; in my opinion it will magnify them.

Artificial Intelligence and Society

In Brazil, we are concerned with the problem of the spread and political use of misinformation (fake news). How can archivists with algorithm training provide reliable research insights to fight against this historical problem?

At this point, I couldn’t honestly provide you with an answer but this is something we could explore in the near future with different partners and collaborators.


Rebecca Y. Bayeck is a Postdoctoral Fellow in Data Curation for African American and African Studies at the New York Public Library’s Schomburg Center for Research in Black Culture. She holds a dual PhD in Learning Design and Technology and Comparative International Education from Pennsylvania State University.

Azure Stewart is a Postdoctoral Fellow at New York University Libraries, where she works at the intersection of engineering education, librarianship, and research. She holds a PhD in Education with a specialization in Teaching and Learning from the University of California, Santa Barbara.