
Issues in digitizing newsprint: A day in the life of a digital librarian

We are currently working on the 1990s decade of the Daily Kent Stater, which will be live in the next few weeks. I wanted to write a post about what the work entails for this project, and how many different processes are involved in getting a single page posted online.

This process involves three outside vendors once we have prepped the materials for scanning in-house. We use one vendor to create high-resolution scans, another to encode the words on each page for a searchable database, and a third to host the content. Before we publish the content online, I do a spot check of the original scans from the first vendor, as well as a review of the decade on a test site from the hosting vendor. I look for any known issues that we noted during the preparation phase (such as torn pages, mis-numbered issues, inserted materials, etc.). This is a tedious process, but it does ensure the quality of the scans online. While I am not able to check every page of the decade, I have used tools such as Adobe Photoshop and Adobe Bridge to help expedite this work. With Adobe Bridge, for example, I can quickly verify that the first vendor has followed the scanning benchmarks. Ideally, if I find something during the spot check, I can make a correction before the pages are loaded to the current online collection.
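For readers who like to see the idea in code, here is a minimal sketch of a spot check against scanning benchmarks. It is not how I actually do the review (that happens in Adobe Bridge and Photoshop); it assumes the technical details of each scan are already available as simple records, and the benchmark values, the `ScanInfo` class, and the function names are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class ScanInfo:
    """Technical details for one scanned page (hypothetical record layout)."""
    filename: str
    dpi: int
    bit_depth: int
    file_format: str


# Hypothetical benchmark values for illustration only,
# not the project's actual scanning specification.
BENCHMARK = {"min_dpi": 400, "bit_depth": 8, "file_format": "TIFF"}


def check_scan(scan, benchmark=BENCHMARK):
    """Return a list of problems found for one scan (empty if it passes)."""
    problems = []
    if scan.dpi < benchmark["min_dpi"]:
        problems.append(
            f"resolution {scan.dpi} dpi is below {benchmark['min_dpi']} dpi"
        )
    if scan.bit_depth != benchmark["bit_depth"]:
        problems.append(
            f"bit depth {scan.bit_depth} does not match {benchmark['bit_depth']}"
        )
    if scan.file_format.upper() != benchmark["file_format"]:
        problems.append(
            f"format {scan.file_format} is not {benchmark['file_format']}"
        )
    return problems


def spot_check(scans, sample_size, seed=None):
    """Check a random sample of scans and return only those with problems."""
    rng = random.Random(seed)
    sample = rng.sample(scans, min(sample_size, len(scans)))
    report = {}
    for scan in sample:
        problems = check_scan(scan)
        if problems:
            report[scan.filename] = problems
    return report


if __name__ == "__main__":
    pages = [
        ScanInfo("issue_001_p1.tif", 400, 8, "TIFF"),
        ScanInfo("issue_001_p2.tif", 300, 8, "TIFF"),
    ]
    for name, problems in spot_check(pages, sample_size=2).items():
        print(name, "->", "; ".join(problems))
```

Sampling keeps the review manageable for a whole decade, at the cost of possibly missing a bad page, which is the same trade-off as a manual spot check.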

As a personal note, this is the type of work that I normally save for the beginning or end of a work day (and preferably with a very large coffee in hand). I'll tackle one semester at a time, and spend anywhere from 15 minutes to an hour reviewing the image files, METS/ALTO XML files, and the individual issue PDF files. One problem I found in the most recent decade involved the raw scans. We had noted in a project spreadsheet that some unrelated content had been glued to some cover pages. These were smaller pieces of paper noting a skipped issue due to a campus closure or other reason, and each had a seam of glue running along the gutter.
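Since ALTO files record where each line of recognized text sits on the page, a rough script could point to pages worth a closer look when something covers the gutter. The sketch below is only an illustration under assumptions, not part of our actual workflow: it flags text lines that begin very close to the left edge of the page. The `margin` threshold is a made-up value in whatever measurement unit the ALTO file uses, the function name is hypothetical, and the sample XML is heavily simplified.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace so the check works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def lines_near_left_edge(alto_xml, margin=50.0):
    """Return (HPOS, text) for each TextLine starting within `margin` of the left edge.

    `margin` is a hypothetical threshold in the file's own measurement unit.
    """
    root = ET.fromstring(alto_xml)
    flagged = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        hpos = elem.get("HPOS")
        if hpos is None:
            continue
        if float(hpos) <= margin:
            # Join the recognized words on this line for easier review.
            words = [
                child.get("CONTENT", "")
                for child in elem
                if local_name(child.tag) == "String"
            ]
            flagged.append((float(hpos), " ".join(words)))
    return flagged


# Simplified sample for illustration; real ALTO files carry many more attributes.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="3000" HEIGHT="4500">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine HPOS="12" VPOS="300" WIDTH="900" HEIGHT="40">
            <String CONTENT="ampus"/>
            <SP/>
            <String CONTENT="closed"/>
          </TextLine>
          <TextLine HPOS="220" VPOS="360" WIDTH="900" HEIGHT="40">
            <String CONTENT="Daily"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""

if __name__ == "__main__":
    for hpos, text in lines_near_left_edge(SAMPLE):
        print(f"HPOS {hpos}: {text}")
```

A flagged line is only a hint. Text that starts near the edge may be perfectly fine, so the raw scan still needs to be examined by eye.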

The fragile state of the newsprint made these difficult to remove without extensive time and patience, so we wanted to see if the vendor could get a clean capture by moving the top glued page as far out of the frame as possible. For the most part, the gutter was wide enough that the text was captured in full. However, a small handful of pages had minimal gutters, and when I examined the raw scans I found that we did have some text loss. For these pages, we worked to remove as much of the glued paper as possible with a little water, Q-tips, and of course patience. We were able to remove enough of it without damaging the original print. I then re-scanned these pages in-house with our Bookeye 2 overhead scanning unit, and worked with the hosting vendor to replace them on the test site before we moved them over to the live website. I've included a few pictures below of the process.

One sample page with a top glued page on the issue

A closer look: the text along the left is cut off as a result of the glued page

A tedious chore to say the least, but just one part of the puzzle that is digital newspaper archives.


Popular posts from this blog

New image viewer in place for Omeka content

In anticipation of a large number of textual documents being added to our online digital archive in the next year, we've added a new image viewer that allows for much more interaction with the digital items. Now a user can zoom in and navigate through an image or document. Here is an example:

http://omeka.library.kent.edu/special-collections/items/show/1458

One feature that we are excited to have in place with this new viewer is a slideshow/scroll view of items with multiple pages or images. Thanks to Project Mirador!

Check back for more additions in the coming months.

Digital Scholarship

Digital Scholarship is a term that could probably be defined in a dozen different ways if you asked a group of people to define it. It's a somewhat elusive concept, but for me, it's finding new connections between ideas that were previously unknown, using tools and techniques that were not previously available, with a little expertise from programmers, digital librarians and an array of other folks. What does this look like?

I've put together a few samples below. Some make use of data analysis, while others apply newer metadata applications to bring an idea to a new level of understanding and research. This goes beyond a digital repository that simply provides access to material, and allows a whole new level of interpretation and use.

Here is a list of some of my favorites:

Belfast Group Poetry Networks: showing an interactive network of a group of writers, and how we can now make new mappings of the connections within the group

Linked Jazz: a research project that uses L…

New video capability to our digital collections

This week, we added a handful of videos from the May 4 Collection to our digital repository. They represent just a small portion of the out-of-copyright video collection that the library is able to share openly, and were transferred from VHS over the last year. Please take some time and give it a look:

http://omeka.library.kent.edu/special-collections/kent-state-shootings-digital-archive/

We are looking forward to adding video content from the May 4 Oral History Collection down the road as well, so check back for more updates!