
Issues in digitizing newsprint: A day in the life of a digital librarian

We are currently working on the 1990s decade of the Daily Kent Stater, which will go live in the next few weeks. I wanted to write a post about what the work for this project entails, and how many different processes are involved in getting a single page posted online.

This process involves three outside vendors once we have prepped the materials for scanning in-house. We use one vendor to create high-resolution scans, another to encode the words on each page for a searchable database, and a third to host the content. Before we publish the content online, I do a spot check of the original scans from the first vendor, as well as a review of the decade on a test site from the hosting vendor. I look for any known issues that we noted during the preparation phase (such as torn pages, mis-numbered issues, inserted materials, etc.). This is a tedious process, but it does ensure the quality of the scans online. While I am not able to check every page of the decade, I use tools such as Adobe Photoshop and Adobe Bridge to help expedite this work. With Adobe Bridge, for example, I can quickly verify that the first vendor has followed the scanning benchmarks. Ideally, if I find something during the spot check, I can make a correction before the files are loaded to the current online collection.
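Since checking every page is not feasible, one way to keep a spot check even-handed is to draw a small random sample from each issue. The sketch below is only an illustration, not the workflow described above: the folder layout (one folder per issue, one TIFF per page), the file names, and the `pick_spot_check` helper are all hypothetical.

```python
import random
from collections import defaultdict
from pathlib import PurePosixPath


def pick_spot_check(paths, per_issue=2, seed=1990):
    """Pick up to `per_issue` random page scans from each issue folder."""
    by_issue = defaultdict(list)
    for p in paths:
        # Assumption: the parent folder name identifies the issue.
        by_issue[PurePosixPath(p).parent.as_posix()].append(p)

    rng = random.Random(seed)  # fixed seed so the same sample can be revisited
    sample = {}
    for issue in sorted(by_issue):
        pages = sorted(by_issue[issue])
        k = min(per_issue, len(pages))
        sample[issue] = sorted(rng.sample(pages, k))
    return sample


# Hypothetical file list: one folder per issue, one TIFF per page.
scans = [
    "1990-01-16/page_01.tif",
    "1990-01-16/page_02.tif",
    "1990-01-16/page_03.tif",
    "1990-01-16/page_04.tif",
    "1990-01-17/page_01.tif",
]

spot_check = pick_spot_check(scans, per_issue=2)
for issue, pages in spot_check.items():
    print(issue, pages)
```

Pages already flagged during the preparation phase would still be checked by hand regardless of whether they land in the random sample.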

As a personal note, this is the type of work that I normally save for the beginning or end of a work day (and preferably with a very large coffee in hand). I'll tackle one semester at a time, and spend anywhere from 15 minutes to an hour reviewing the image files, METS/ALTO XML files and the individual issue PDF files. One problem I found in the most recent decade was with the raw scans. We had noted in a project spreadsheet that some unrelated content had been glued to some cover pages. These were smaller pieces of paper noting a skipped issue due to campus closure or another reason, and they had a seam of glue running along the gutter.

The fragile state of newsprint made these difficult to remove without extensive time and patience, so we wanted to see if the vendor could get a clean capture by moving the top glued page as far out of the frame as possible. For the most part, the gutter was wide enough that the text was captured in full. However, a small handful of pages had minimal gutters, and when I examined the raw scans I found that we did have some text loss. For these pages, we then worked to remove as much of the glued paper as possible with a little water, Q-tips, and of course patience. We were able to remove enough of the top glued paper without damaging the original print. I then re-scanned these pages in-house with our Bookeye 2 overhead scanning unit, and worked with the hosting vendor to replace them on the test site before moving them over to the live website. I've included a few pictures below of the process.
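The text loss here was found by looking at the raw scans, but the ALTO files could in principle help narrow down which pages to look at first. The sketch below is a rough heuristic under stated assumptions, not part of the process described above: it flags any `TextLine` whose `HPOS` falls within a small fraction of the page `WIDTH` from the left edge, on the guess that text starting that close to the edge may have been clipped. The function name, the 2% threshold, and the sample XML (including its text) are made up for illustration.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace so the check works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def lines_near_left_edge(alto_xml, margin_fraction=0.02):
    """Return (HPOS, text) for lines that start very close to the left edge."""
    root = ET.fromstring(alto_xml)
    flagged = []
    for page in root.iter():
        if local(page.tag) != "Page" or page.get("WIDTH") is None:
            continue
        # Assumption: a line starting within this distance is suspicious.
        limit = float(page.get("WIDTH")) * margin_fraction
        for line in page.iter():
            if local(line.tag) != "TextLine" or line.get("HPOS") is None:
                continue
            hpos = float(line.get("HPOS"))
            if hpos < limit:
                words = [
                    s.get("CONTENT", "")
                    for s in line.iter()
                    if local(s.tag) == "String"
                ]
                flagged.append((hpos, " ".join(words)))
    return flagged


# Made-up sample: the first line starts almost at the page edge.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="5000" HEIGHT="7000">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine HPOS="40" VPOS="900" WIDTH="1200" HEIGHT="60">
            <String CONTENT="ampus"/>
            <String CONTENT="closed"/>
          </TextLine>
          <TextLine HPOS="400" VPOS="980" WIDTH="1200" HEIGHT="60">
            <String CONTENT="Classes"/>
            <String CONTENT="resume"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""

suspects = lines_near_left_edge(SAMPLE)
print(suspects)
```

A flagged line is only a hint that the page deserves a closer look; the scan itself still has to be checked by eye.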


One sample page with a top glued page on the issue

A closer look. The text along the left is cut off as a result of the glued page



A tedious chore to say the least, but just one part of the puzzle that is digital newspaper archives.
