We are currently working on the 1990 decade of the Daily Kent Stater, which will be live in the next few weeks. I wanted to write a post about what the work for this project entails, and how many different processes go into getting a single page posted online.
This process involves three outside vendors once we have prepped the materials for scanning in-house. We use one vendor to create high-resolution scans, another to encode the words on each page for a searchable database, and a third to host the content. Before we publish the content online, I do a spot check of the original scans from the first vendor, as well as a review of the decade on a test site from the hosting vendor. I look for any known issues that we noted during the preparation phase (such as torn pages, mis-numbered issues, inserted materials, etc.). This is a tedious process, but it ensures the quality of the scans online. While I am not able to check every page of the decade, I use tools such as Adobe Photoshop and Adobe Bridge to help expedite the work. With Adobe Bridge, for example, I can quickly verify that the first vendor has followed our scanning benchmarks. Ideally, if I find something during the spot check, I can make a correction before the files are loaded into the live online collection.
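For anyone curious what that benchmark check looks like, here is a rough sketch of the same idea as a small Python script using Pillow. The resolution value, color mode, and folder path are placeholders I made up for illustration, not the actual vendor specifications, and Bridge remains the quicker way to eyeball a whole batch.

```python
# A minimal sketch of the kind of spot check Adobe Bridge handles for me:
# flag any master TIFF whose resolution or color mode misses the benchmark.
# The benchmark values (400 dpi, 8-bit grayscale) and the folder layout are
# assumptions for this example, not our vendor's actual spec.
from pathlib import Path
from PIL import Image

EXPECTED_DPI = 400        # assumed scanning benchmark
EXPECTED_MODE = "L"       # assumed 8-bit grayscale masters

def check_scans(folder: str) -> None:
    """Print a line for every TIFF that falls short of the benchmark."""
    for tiff in sorted(Path(folder).glob("*.tif")):
        with Image.open(tiff) as img:
            dpi = float(img.info.get("dpi", (0, 0))[0])
            if dpi < EXPECTED_DPI or img.mode != EXPECTED_MODE:
                print(f"CHECK {tiff.name}: {dpi} dpi, mode {img.mode}")

check_scans("1990/kent_stater/spring_semester")
```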
As a personal note, this is the type of work that I normally save for the beginning or end of a work day (and preferably with a very large coffee in hand). I'll tackle one semester at a time, and spend anywhere from 15 minutes to an hour reviewing the image files, the METS/ALTO XML files, and the individual issue PDF files. One problem I found during the most recent decade involved the raw scans. We had noted in a project spreadsheet that some unrelated content had been glued to some cover pages. These were smaller pieces of paper noting a skipped issue due to a campus closure or some other reason, and each had a seam of glue running along the gutter.
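To give a sense of what that per-semester review covers, here is a rough sketch of the completeness check involved, written as a short Python script. The folder layout and file extensions are assumptions for the sake of the example; the real deliverables follow the vendor's own naming conventions.

```python
# A rough sketch of the per-issue sanity check I otherwise do by hand:
# confirm each issue folder has page images, a matching number of ALTO XML
# files, and an issue-level PDF. Paths and extensions here are assumptions.
from pathlib import Path

def review_issue(issue_dir: Path) -> None:
    tiffs = sorted(issue_dir.glob("*.tif"))
    altos = sorted(issue_dir.glob("*.xml"))
    pdfs = list(issue_dir.glob("*.pdf"))
    if len(tiffs) != len(altos):
        print(f"{issue_dir.name}: {len(tiffs)} images vs {len(altos)} XML files")
    if not pdfs:
        print(f"{issue_dir.name}: missing issue PDF")

for issue in sorted(Path("1990/fall_semester").iterdir()):
    if issue.is_dir():
        review_issue(issue)
```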
The fragile state of newsprint made these difficult to remove without extensive time and patience, so we wanted to see if the vendor could get a clean capture by moving the glued-on page as far out of the frame as possible. For the most part, the gutter was wide enough that the text was captured in full. However, a small handful of pages had minimal gutters, and after examining the raw scans I found that we did have some text loss. For these pages, we worked to remove as much of the glued-on paper as possible with a little water, Q-tips, and of course patience. We were able to remove enough of it without damaging the original print, and I then re-scanned these pages in-house with our Bookeye 2 overhead scanning unit. I then worked with the hosting vendor to replace the pages on the test site before they move over to the live website. I've included a few pictures below of the process.
A sample page with a smaller page glued on top of the issue
A closer look.
The text along the left is cut off as a result of the glued page
A tedious chore to say the least, but just one part of the puzzle that is digital newspaper archives.