
Issues in digitizing newsprint: A day in the life of a digital librarian

We are currently working on the 1990s decade of the Daily Kent Stater, which will be live in the next few weeks. I wanted to write a post about what the work entails for this project, and how many different processes are involved in getting a single page posted online.

This process involves three outside vendors once we have prepped the materials in-house for scanning. We use one vendor to create high-resolution scans, another to encode the words on each page for a searchable database, and a third to host the content. Before we publish the content online, I spot-check the original scans from the first vendor and review the decade on a test site from the hosting vendor. I look for any known issues that we noted during the preparation phase (such as torn pages, mis-numbered issues, and inserted materials). This is a tedious process, but it does ensure the quality of the scans online. While I am not able to check every page of the decade, I use tools such as Adobe Photoshop and Adobe Bridge to expedite this work. With Adobe Bridge, for example, I can quickly verify that the first vendor has followed the scanning benchmarks. Ideally, if I find something during the spot check, I can make a correction before the files are loaded to the current online collection.
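For readers who would rather script part of a spot check like this, here is a minimal Python sketch. It is not the tool used on this project (that was Adobe Bridge and Photoshop); it only illustrates the idea of reviewing every page flagged during prep plus a random sample of the rest. The function name, file names, and sample size are all hypothetical.

```python
import random


def build_spot_check(all_pages, flagged_pages, sample_size, seed=0):
    """Return the pages to review: every flagged page plus a random sample of the rest."""
    # Ignore flags that refer to pages outside this batch.
    flagged = set(flagged_pages) & set(all_pages)
    remaining = sorted(set(all_pages) - flagged)
    # A fixed seed means the same review list can be regenerated later.
    rng = random.Random(seed)
    k = min(sample_size, len(remaining))
    sampled = rng.sample(remaining, k)
    return sorted(flagged | set(sampled))


# Hypothetical file names, for illustration only.
pages = [f"1990-09-05_p{n:02d}.tif" for n in range(1, 13)]
flagged = ["1990-09-05_p01.tif"]  # e.g. a cover page noted in the prep spreadsheet
review_list = build_spot_check(pages, flagged, sample_size=3)
print(review_list)
```

Always including the flagged pages mirrors the known-issues check described above, while the random sample stands in for the pages that cannot all be checked by hand.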

As a personal note, this is the type of work that I normally save for the beginning or end of a work day (preferably with a very large coffee in hand). I'll tackle one semester at a time, and spend anywhere from 15 minutes to an hour reviewing the image files, METS/ALTO XML files, and the individual issue PDF files. One problem I found in the most recent decade was with the raw scans. We had noted in a project spreadsheet that some unrelated content had been glued to some cover pages. These were smaller pieces of paper noting a skipped issue due to campus closure or another reason, and each had a seam of glue running along the gutter.

The fragile state of the newsprint made these difficult to remove without extensive time and patience, so we wanted to see if the vendor could get a clean capture by moving the top glued page as far out of the frame as possible. For the most part, the gutter was wide enough that the text was captured in full. However, a small handful had minimal gutters, and on examining the raw scans I found that we did have some text loss. For these pages, we worked to remove as much of the glued paper as possible with a little water, Q-tips, and of course patience. We were able to remove enough of the top glued paper without damaging the original print. I then re-scanned these pages in-house with our Bookeye 2 overhead scanning unit and worked with the hosting vendor to replace them on the test site before moving them over to the live website. I've included a few pictures below of the process.

One sample page with a top glued page on the issue

A closer look. 
The text along the left is cut off as a result of the glued page

A tedious chore to say the least, but just one part of the puzzle that is digital newspaper archives.

