Issues in digitizing newsprint: A day in the life of a digital librarian

We are currently working on the 1990s decade of the Daily Kent Stater, which will be live in the next few weeks. I wanted to write a post about what the work entails for this project, and how many different processes are involved in getting a single page posted online.

Once we have prepped the materials for scanning in-house, the process involves three outside vendors. One vendor creates the high-resolution scans, a second encodes the words on each page for a searchable database, and a third hosts the content. Before we publish the content online, I spot-check the original scans from the first vendor and review the decade on a test site from the hosting vendor. I look for any known issues that we noted during the preparation phase (such as torn pages, mis-numbered issues, inserted materials, etc.). This is a tedious process, but it does ensure the quality of the scans online. While I am not able to check every page of the decade, I use tools such as Adobe Photoshop and Adobe Bridge to help expedite this work. With Adobe Bridge, for example, I can quickly verify that the first vendor has followed the scanning benchmarks. Ideally, if I find something during the spot check, I can make a correction before the files are loaded into the current online collection.
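For anyone who would rather script part of a spot check, here is a minimal sketch in Python of pulling a repeatable random sample of scans to review. It is not the workflow described above, which relies on Adobe Bridge and Photoshop. The function name, the `*.tif` file pattern, and the 5 percent sample size are all assumptions made for illustration.

```python
import random
from pathlib import Path


def pick_spot_check_sample(scan_dir, fraction=0.05, seed=1990, pattern="*.tif"):
    """Return a sorted, repeatable random sample of scan files to review.

    All defaults are hypothetical: adjust the pattern to the vendor's
    file format and the fraction to the time available.
    """
    files = sorted(Path(scan_dir).rglob(pattern))
    if not files:
        return []
    # Always review at least one file, and never ask for more than exist.
    count = min(len(files), max(1, round(len(files) * fraction)))
    # A fixed seed means the same pages are picked on every run,
    # so a review can be paused and resumed.
    rng = random.Random(seed)
    return sorted(rng.sample(files, count))
```

Calling `pick_spot_check_sample("scans/1990-spring")` (a made-up folder name) would return the list of files to open for one semester. Pages already flagged in the project spreadsheet would still need to be checked by hand, since a random sample can easily miss them.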

As a personal note, this is the type of work that I normally save for the beginning or end of a work day (preferably with a very large coffee in hand). I'll tackle one semester at a time and spend anywhere from 15 minutes to an hour reviewing the image files, METS/ALTO XML files, and the individual issue PDF files. One problem I found while reviewing the most recent decade was with the raw scans. We had noted in a project spreadsheet that some unrelated content had been glued to some cover pages. These were smaller pieces of paper carrying a note about a skipped issue due to a campus closure or other reason, and they had a seam of glue running along the gutter.

The fragile state of the newsprint made these difficult to remove without extensive time and patience, so we wanted to see if the vendor could get a clean capture by moving the glued top page as far out of the frame as possible. For the most part, the gutter was wide enough that the text was captured in full. However, a small handful of pages had minimal gutters, and on examining the raw scans I found that we did have some text loss. For these pages, we worked to remove as much of the glued paper as possible with a little water, Q-tips, and of course patience. We were able to remove enough of it without damaging the original print. I then re-scanned these pages in-house with our Bookeye 2 overhead scanning unit and worked with the hosting vendor to replace them on the test site before we move them over to the live website. I've included a few pictures below of the process.

One sample page with a top glued page on the issue

A closer look. 
The text along the left is cut off as a result of the glued page
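Text loss like this can sometimes be flagged from the ALTO files as well as by eye. The sketch below, in Python, lists text lines that start very close to the left edge of the page, which may mean the beginning of the line was cut off. It assumes the ALTO file records a `WIDTH` on each `Page` and an `HPOS` on each `TextLine` in the same unit. The function names and the 2 percent margin are hypothetical, and this is not a check that was part of the review described in this post.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, since it differs between ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def lines_near_left_edge(alto_xml, margin_fraction=0.02):
    """Return (line ID, HPOS, text) for lines starting near the left edge.

    alto_xml is the content of one ALTO file (bytes read from disk work).
    margin_fraction is an assumed threshold, not a published benchmark.
    """
    root = ET.fromstring(alto_xml)
    flagged = []
    for page in root.iter():
        if local_name(page.tag) != "Page":
            continue
        page_width = float(page.get("WIDTH", 0))
        if page_width <= 0:
            continue  # no usable page width recorded, so skip this page
        threshold = page_width * margin_fraction
        for line in page.iter():
            if local_name(line.tag) != "TextLine":
                continue
            hpos = line.get("HPOS")
            if hpos is None or float(hpos) > threshold:
                continue
            words = [
                s.get("CONTENT", "")
                for s in line.iter()
                if local_name(s.tag) == "String"
            ]
            flagged.append((line.get("ID"), float(hpos), " ".join(words)))
    return flagged
```

A flagged line is only a hint to go and look at the page image. Headlines, rules, and ads often run close to the edge without any text being lost.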

A tedious chore to say the least, but just one part of the puzzle that is digital newspaper archives.

