| A TEXT CREATION PARTNERSHIP Companion |
Most of the TCP's policies for transcription and encoding can be pieced together from the TCP documentation page. The documents listed and linked there are admittedly ill-organized and incomplete (needing to be supplemented by ad-hoc decisions made in response to particular textual difficulties), and were never intended for public eyes. With those caveats, they remain valuable guides to both the philosophy and the practice of the TCP editors. Among the most useful:
Simply put, EEBO is a commercial product published by ProQuest LLC, and available to libraries for purchase or license. EEBO-TCP is a project based at the Universities of Michigan and Oxford, and supported by more than 150 libraries around the world.
EEBO consists of the complete digitized page images and bibliographic metadata (catalog records) for more than 135,000 early English books listed in Pollard & Redgrave’s Short-Title Catalogue (1475-1640) and Wing’s Short-Title Catalogue (1641-1700) and their revised editions, as well as the Thomason Tracts (1640-1661) collection and the Early English Books Tract Supplement. The TCP transcriptions are just that -- though searchable by bibliographic fields, their primary data is text without any page images. The images from which they were keyed belong to ProQuest. And they represent a subset -- a substantial subset, but still a subset -- of the complete set of books available through EEBO proper.
EEBO and EEBO-TCP remain closely bound, since it is on EEBO images that the TCP texts are based: with certain exceptions (musical notation, images, most mathematics, most non-Latin alphabets, hand-written additions, some artifacts of print), TCP captured the full text of each transcribed work in EEBO -- by intention, almost every unique monographic English-language title. This was done by manually keying the full text of each work and adding markup to indicate the structure of the text (chapter divisions, tables, lists, etc.). The result is an accurate transcription of each work, which can be fully searched, or used as the basis of a new project. EEBO-TCP has produced about 64,000 texts, of which all but the last 4,000 are searchable online. The EEBO-TCP text files were delivered back to ProQuest and indexed in EEBO, so users at subscribing institutions can seamlessly perform full-text searches and view transcriptions directly within the EEBO platform, although the texts can also be accessed in other ways, including TCP's own search sites hosted by the University of Michigan Library. EEBO-TCP is administered by the University of Michigan Library. During its most productive years, it employed full teams of editors at Michigan and Oxford, plus a few ancillary sites, and turned out a new text roughly every twenty minutes. Only a single editor, at Michigan, remains active today (2026).
The initial EEBO-TCP project began in 1999. Its goal was to key and encode 25,000 selected works from the EEBO corpus. This effort was completed in 2009, with the support of nearly 150 library partners. The 25,000 texts produced by this effort are called “Phase I.” This set of texts was released to the public on January 1, 2015. Since then, anyone has been able to search and view these texts online at the Michigan TCP site, or download them in bulk for individual use and re-use.
At the encouragement of the project advisory board, and with the promise of another round of support from many libraries, in 2008 the TCP decided to continue the work of EEBO-TCP in a second phase. As described more fully elsewhere on this site, EEBO Phase II adopted the audacious goal of keying and encoding at least one edition of each unique monographic English-language work (with principled exceptions) represented in EEBO. Our guess -- and it was only a guess -- was that this would require converting roughly 44,000 additional texts, contingent, of course, on obtaining the requisite funds. Our estimate of how many texts it would take to achieve comprehensive coverage has largely been borne out (perhaps 45,000 was more accurate?), while our fund-raising fell only slightly short. As eventually funded, EEBO-TCP Phase II was able to produce nearly 40,000 texts. All texts belonging to Phase II (both prospective texts and those already released) were opened to the public on 1 August 2020.
So ultimately, the entire EEBO-TCP corpus (Phase I and Phase II together) consists of about 65,000 works. The distinction between Phase I and Phase II reflected a temporary arrangement, now long obsolete. There is no difference between the two phases now.
The cost of keying and encoding a book depends on how long the book is and how difficult it is to capture and edit the text. A book might be particularly challenging due to the difficulty of the font, the quality of the image (as preserved, or as captured on microfilm and digital scan), or simply the presence of unusual and complex textual features, such as large tables or genealogical charts. A work might consist of a single broadsheet, or thousands of pages. Our vendors charged a flat fee by the character (technically, by the kilobyte) of data captured. The costs of review and editing, which were done in-house at Michigan and Oxford, were measured in time, typically by counting how many books could be reviewed in a month. On average, we estimate that it cost $200-$250 to key, encode, and review a “typical” work. The cost of a very large work could easily have been in the thousands of dollars.
A research library paid $60,000 to become a partner, so at $200-$250 per book, each library that joined supported the conversion of roughly 240-300 new books.
All of the texts are now freely available. Any notices you may find saying otherwise are now obsolete. Indeed, it was always part of the mission of the TCP to ensure that the text files we produced would ultimately be freely available to the public. The date on which restrictions on sharing and distributing the texts expired depended on when the project was completed.
Variant: "May I do xx with your texts?" The answer is simple: we impose no restrictions whatever, and (so far as the TCP is concerned) you may do anything with them that you like: you may translate them, edit them, revise them, illustrate them, perform them, or re-publish them, with or without attribution. We regard the texts themselves as being in the public domain, and expressly disclaim any 'light' or 'thin' copyright that may be thought to adhere to our capture and encoding decisions. Nor is access and use of the texts governed by any license terms. We do suggest (not require) that courtesy and scholarly good practice dictate that you indicate who made the texts and where you got them. But that's up to you.
This is complicated, but, in general, no, if you can't see the page images now, the odds are that you are not entitled to. TCP keyed its texts from images supplied by commercial companies: ProQuest for EEBO, Gale-Cengage for ECCO, and Readex/Newsbank for Evans. In no case did we have the right to redistribute those images (aside from limited fair-use-based clauses that apply to all subscribing libraries). Nor do we even have copies of those images; so far as access to those digital facsimiles is concerned, Michigan and Oxford are in no different situation than any other subscribing library. Access to the underlying page images therefore remains restricted to customers of the companies in question (ProQuest, Gale, and Readex/Newsbank), and only they can answer questions about access to their products. In the early stages of the TCP project, this was not a problem, since the universities paying for the text, and obtaining first access to it, were all also subscribers to the image databases. Now that the text is free to everyone, there is inevitably a divide between those who can see only the text, and those who can see both text and image.
So, the first question is, do you belong to a subscribing institution? If so, you are entitled to view the page images, and any obstacles are practical ones. E.g., is our system for 'pulling down' images from the companies working? Often it is not (only the EEBO process works reliably). Then, are you accessing the TCP site from a URL identified with your institution? If not, you may need to use a campus computer, use an institution-supported VPN, or use the proxied link to the TCP site that may, or may not, be in your library's catalog. If you are entitled to view the images but cannot do so, it is in many cases best to go straight to the corporate sites (ProQuest, Gale, or Readex) and view the images there. If you have trouble locating the book on those sites that corresponds with our text, we can probably help with that. E.g., we can supply the ProQuest ID for any text you find in EEBO-TCP, which should take you directly to the correct image set.
Finally, if you are not affiliated with a subscribing institution, as perhaps most of our users are not, then none of this will help: you will be driven in that case to third-party sources when seeking facsimiles of the original book. Note that many books have been digitized and the images published on free sites in the past decades (especially on Internet Archive); many owning libraries have digitized books in their keeping themselves, and posted the results; some have ended up in HathiTrust, and it is always worth looking. For a few books, Gale and ProQuest especially have published copies of their scanned images in the form of paper books, and sold them on Amazon at modest cost. We have frequently been able to point people to such external sources, or to holding libraries that may in some cases be able to provide digitized copies on request.
If you mean, 'a pdf facsimile of the original book', then this is basically question 12 above, q.v.
If you mean, 'a pdf version of the TCP transcription,' then this is just a matter of how best to generate a readable pdf from an XML source, and where to get the 'styling' (display) information from. You have many choices. For example:
Once you've accepted our styling (simple and garish as it is, designed for the use of in-house editors more than readers), you can either use your browser's 'print to pdf' function again, or use external XML-to-PDF software. We have had good luck with the app called "wkhtmltopdf", which, though no longer under development, can still be obtained from https://wkhtmltopdf.org/. The syntax for conversion is very simple, the process is very quick, and the command-line form allows you to batch-convert any number of files. For a single file, use
wkhtmltopdf --allow ./STYLESHEET.css ./FILEIN.xml FILEOUT.pdf
Or run it from a (Windows) batch file like this:
FOR %%a IN (*.xml) DO "C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf" --allow ./STYLESHEET.css %%a ./pdf/%%a.pdf
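The same loop can be scripted portably; here is a minimal sketch in Python, assuming wkhtmltopdf is installed and on your PATH, and reusing the placeholder name STYLESHEET.css from the commands above (substitute your own stylesheet and directories).

```python
# Sketch of a cross-platform batch converter equivalent to the Windows
# batch file above. Assumes wkhtmltopdf is on the PATH; STYLESHEET.css
# stands in for whatever CSS file you are actually using.
import subprocess
from pathlib import Path

def build_command(xml_file: Path, stylesheet: Path, out_dir: Path) -> list[str]:
    """Return the wkhtmltopdf command line for one file."""
    return [
        "wkhtmltopdf",
        "--allow", str(stylesheet),     # permit loading the local stylesheet
        str(xml_file),
        str(out_dir / (xml_file.name + ".pdf")),
    ]

def convert_all(src_dir: Path, stylesheet: Path, out_dir: Path) -> None:
    """Convert every *.xml file in src_dir to a PDF in out_dir."""
    out_dir.mkdir(exist_ok=True)
    for xml_file in sorted(src_dir.glob("*.xml")):
        subprocess.run(build_command(xml_file, stylesheet, out_dir), check=True)
```

Like the batch file, this names each output after the full input filename (e.g. A00001.xml.pdf), so the correspondence between source and PDF stays obvious.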
During one test, starting with an XML file of 476 kB, we were especially intrigued by the different file sizes of the pdfs produced by different engines: Firefox native, 965 kB; Foxit pdf writer (run by MS Edge), 72,976 kB; Microsoft pdf writer, 49,152 kB; Chrome pdf writer, 2,329 kB; wkhtmltopdf, 696 kB. Your results may differ.
We have run the wkhtmltopdf process on all of the Evans-TCP files, with seeming success, and can readily do the same on any requested file or batch of files.
NOTE: styling XML with CSS has one major problem: CSS notoriously cannot handle tables. If the file depends on tables (and many do), you are best served by going the HTML route -- either using our online HTML, or converting the XML to HTML on your own.
We have in the past supplied subsets of the TCP files in some version of 'plain-text' format -- but only after discussing with the requester what exactly is meant by 'plain-text' in the particular instance. We can certainly do that again, or work with the requester to get it done, but hesitate to provide a one-size-fits-all 'plain text' without prior consultation, and without prior consideration of the intended purpose. The transcribed books are, after all, books, not just texts, with all the messiness that that implies. What should we do especially with all the textual objects that interrupt the main text flow? Marginal notes? indications of illegibility? inline illustrations? Or text that appears within illustrations? Likewise, what should we do with textual information stored in attribute values, such as page numbers? What to do with tables, nested lists, and numeric data? With attached catalog records (metadata)? With unusual characters, symbologies, and character encodings? With end-of-line hyphens? Once these questions are addressed, we can generally modify our generic 'stripmarkup' script and produce the appropriate format--or at least provide enough information to enable the user to do so in a more nuanced way.
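To make the point concrete, here is a deliberately naive sketch of what a markup-stripping step might look like, using only the Python standard library. This is not the TCP's actual 'stripmarkup' script: it simply concatenates all text content in document order, and thereby dodges every hard question raised above (marginal notes, tables, attribute values, end-of-line hyphens would each need a policy decision of their own).

```python
# Naive markup stripper: concatenate all text nodes of an XML document.
# Hypothetical simplification -- real TCP/TEI files involve namespaces,
# interruptive structures (notes, figures), and attribute-borne data
# that this sketch ignores entirely.
import xml.etree.ElementTree as ET

def strip_markup(xml_string: str) -> str:
    """Return the whitespace-normalized text content of an XML document."""
    root = ET.fromstring(xml_string)
    # itertext() walks the tree in document order, yielding text nodes.
    return " ".join(chunk.strip() for chunk in root.itertext() if chunk.strip())
```

A marginal note, for instance, would be dumped mid-sentence into the output by this approach, which is exactly why we prefer to discuss the intended purpose before producing 'plain text'.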
We have attempted to do so for subsets in the past. This poses questions similar to those attending 'plain-text' files, since to arrive at a word count requires an understanding of what constitutes a word. Do partially illegible strings count? Abbreviations? numbers? tabular data? Counting the words in one sample of about 700 EEBO-TCP files suggests that we can expect approximately 6.89 bytes per word: that is, dividing the file size by 6.89 will yield an approximate word count for any given file or set of files, at least in the EEBO corpus. The entire EEBO corpus on that basis contains about 1.6 billion words. On the same basis, ECCO-TCP contains 138 million and Evans 100 million words.
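The byte-based estimate described above is easy to apply yourself; here is a short sketch, with the caveat that the 6.89 bytes-per-word figure is an empirical average from one EEBO-TCP sample and will be less reliable for individual files or for the other corpora.

```python
# Estimate a TCP file's word count from its size on disk, using the
# ~6.89 bytes-per-word average observed in a ~700-file EEBO-TCP sample.
import os

BYTES_PER_WORD = 6.89  # empirical average; an assumption for other corpora

def estimated_words(path: str) -> int:
    """Approximate word count: file size in bytes divided by 6.89."""
    return round(os.path.getsize(path) / BYTES_PER_WORD)
```

Summing this estimate over a directory of files gives the same kind of corpus-level figure quoted above (about 1.6 billion words for EEBO as a whole).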
Yes. The metadata for the TCP texts comes in four overlapping or mutually interacting forms, and exemplifies the general TCP philosophy of focusing on text capture and markup, leaving the cataloging to others (to catalogers, in fact), and leveraging existing resources rather than creating everything afresh -- just as we did with the underlying images.
We hope to make the MARC for ECCO-TCP available here soon.