| A TEXT CREATION PARTNERSHIP Companion |
⇐ Return to main (index) page.
Although humans can read scanned pages (just as we do paper pages), a computer understands these pages as no more than image files with black and white areas. In order to search, index, analyze, or rearrange these books, we need to provide the computer with electronic text as well as page images. Like most things, this text can either be generated by hand or by machine. The automated creation of electronic text is generally known as optical character recognition (OCR). In this process, software attempts to “read” the page images and map each chunk of black pixels it sees to the correct character or symbol. OCR technology is used in large scale digitization projects like Google Books and HathiTrust, as well as in common software such as Adobe Acrobat. The full-text search functions that are offered in the ECCO and Evans databases rely on OCR technology. OCR software is improving all the time. It works very well on modern books, but the older the book, the more the software struggles. For books printed before 1700, and for images that are blurry, spotty, or have other quality issues, it fails almost entirely, which is why the stand-alone EEBO database doesn’t include any searchable electronic text at all. That’s where the TCP comes in: we worked with vendors who manually keyed in the letters they saw on the page. In many cases, each page was transcribed by more than one person, and the results were compared against one another, to generate electronic text that is 99.995% accurate. Keying is the greatest expense in the TCP’s budget. However, done by a person trained to identify the features of early modern texts, it is actually more cost effective than sorting through and correcting poor-quality OCR.



