A TEXT CREATION PARTNERSHIP
Companion


Frequently asked questions

  1. Was my institution a TCP partner? Does it matter?
  2. May I contribute to the TCP corpus?
  3. Where can I access the TCP digital collections?
  4. How can I contact the TCP?
  5. Can I download the raw files?
  6. Where can I consult the documentation?
  7. What is the difference between Early English Books Online (EEBO) and EEBO-TCP?
  8. What is the difference between EEBO-TCP Phase I and Phase II?
  9. How much does it cost to key and encode a single TCP text?
  10. When were the texts made freely available?
  11. Are there restrictions on my use of the TCP texts?
  12. May I see or receive page images of the original book?
  13. May I have a pdf copy (of a given book)? Or html?
  14. Why don’t you use OCR?
  15. A work that I am interested in hasn’t been converted yet. When will you do it?
  16. Why does TCP (for the most part) only include one edition of a work?
  17. Do you want to know about errors?
  18. Can you supply plain-text versions of the files?
  19. Can you supply word-counts for the TCP corpora?
  20. Do the TCP files come with bibliographic metadata (catalog records or similar)?

1. Was my institution a TCP partner? Does it matter?

The TCP consisted of four overlapping partnerships — four groups of institutions that contributed to the production of four separate batches of texts. Those institutions received in return our thanks, the satisfaction of having added to the world's knowledge, perpetual ownership of the texts, and exclusive use of them for several years. For all four of the partnerships — Evans, ECCO, and EEBO Phases 1 and 2 — the period of exclusive use ended long ago, and the texts are free for anyone to access and use. For practical purposes, therefore, it does not matter whether or not one belongs to a partner institution.

2. May I contribute text to the TCP corpus?

Yes, individuals and individual projects are welcome to contribute transcriptions they have made to the TCP collection, and some have done so. For example, TCP has received texts from the Hamlet Quartos Project at Oxford and the Lexicons of Early Modern English project at Toronto, as well as some volunteer efforts. In order to "fit in," text must of course meet the basic TCP criteria of accuracy, fidelity to the source (no modernization), and freedom from legal restrictions on distribution, access, and use. If you have a text or collection of texts that you are willing to add to the corpus, feel free to contact us at tcp-info@umich.edu and we can talk about accuracy, fidelity, and markup as needed.

3. Where can I access the TCP digital collections?

The University of Michigan hosts the texts for online search at the following addresses. Note: these are the TCP home sites. Since the texts were intended for re-purposing and unconstrained re-use, TCP has contributed the same texts, in their original form or as modified, alone or in combination with other texts, to other sites such as the ProQuest EEBO site, the Oxford Text Archive (now moved to the Literary and Linguistic Data Service), the Early Print site, the English Corpora site, and the (now defunct) JISC Historic Texts portal.

4. How can I contact the TCP?

To contact the TCP team, please email us at tcp-info@umich.edu

5. Can I download the raw files?

Yes. The raw transcripts for EEBO Phases 1 and 2, ECCO, and Evans are all available for bulk download as zipped files for those wishing to do text mining or similar projects. These are substantial file transfers and should not be undertaken casually! For those who viewed the download arrangement long ago, note that the files have been reorganized: the old arrangement by date-of-release has been done away with (since that no longer matters), and the files are now arranged strictly by ID number. Public downloads are available from these Dropbox.com folders: If, for some reason, these links fail, a backup copy is mirrored on Box.com: Most of the files are available in three forms:
  1. TCP (P4) XML.  This, the version that we generally recommend,  uses UTF-8 character encoding and TEI bibliographic headers based on MARC catalog records.  Its character inventory, converted from SGML SDATA character entities, consists mostly of corresponding single or composite Unicode characters, where those exist and are widely supported by fonts. The few exceptions are converted instead to text strings within {curly braces}.  Though the TCP schema includes customizations, some of which anticipate developments in TEI P5, this is still better thought of as an essentially TEI P4 version.
  2. SGML. The original SGML files, as produced by the TCP keyers and editors, also remain available. These files use 7-bit character encoding with named ('mnemonic') SDATA character entities and minimal headers, but are otherwise very similar to TEI P3. Only users of tools that cannot accept multi-byte character encodings, or those who desire utter losslessness, are likely to want to look at this version.
  3. P5 XML. For those who use TEI-based tools or need compatibility with other TEI corpora, we also make available (thanks especially to Sebastian Rahtz and Lou Burnard at Oxford) an XML version conformant to TEI P5, largely but not completely lossless relative to the underlying SGML. This version features TEI headers (again, based for the most part on the underlying MARC) and UTF-8 character encoding, solving the character issues by heavy use of the TEI <g> element. The native home for most of the P5 files (EEBO-1, Evans, and ECCO) is on GitHub, but we provide a snapshot of the GitHub version for convenience. The P5 version of EEBO-2, since it was restricted at the time of its creation, is not currently available on GitHub, a public platform; we hope to change that.
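The <g> mechanism in the P5 version can be flattened to plain Unicode with a few lines of code. The sketch below is illustrative only: real TCP P5 files declare each glyph in the TEI header's <charDecl>, and the reference names and substitution values used here are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Invented mapping for this sketch; real files declare their glyphs
# in the TEI header's <charDecl>, keyed by the <g> ref attribute.
GLYPH_MAP = {
    "#longs": "\u017f",   # long s
    "#EOLhyphen": "",     # soft end-of-line hyphen, silently dropped
}

def resolve_glyphs(xml_string):
    """Flatten a TEI fragment to plain text, substituting <g> references."""
    root = ET.fromstring(xml_string)
    parts = []
    for elem in root.iter():
        tag = elem.tag.rsplit("}", 1)[-1]  # ignore any namespace prefix
        if tag == "g":
            parts.append(GLYPH_MAP.get(elem.get("ref", ""), "?"))
        elif elem.text:
            parts.append(elem.text)
        if elem.tail:
            parts.append(elem.tail)
    return re.sub(r"\s+", " ", "".join(parts)).strip()
```

For example, `resolve_glyphs('<p>bu<g ref="#longs"/>ie</p>')` yields the word with a long s restored. A production converter would read the mapping out of the file's own <charDecl> rather than hard-coding it.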

6. Where can I consult the documentation?

Most of the TCP's policies for transcription and encoding can be pieced together from the TCP documentation page. The documents listed and linked there are admittedly ill-organized and incomplete (needing to be supplemented by ad hoc decisions made in response to particular textual difficulties), and were never intended for public eyes. With those caveats, they remain valuable guides to both the philosophy and the practice of the TCP editors. Among the most useful:

7. What is the difference between Early English Books Online (EEBO) and EEBO-TCP?

Simply put, EEBO is a commercial product published by ProQuest LLC, and available to libraries for purchase or license. EEBO-TCP is a project based at the Universities of Michigan and Oxford, and supported by more than 150 libraries around the world.

EEBO consists of the complete digitized page images and bibliographic metadata (catalog records) for more than 135,000 early English books listed in Pollard & Redgrave’s Short-Title Catalogue (1475-1640) and Wing’s Short-Title Catalogue (1641-1700) and their revised editions, as well as the Thomason Tracts (1640-1661) collection and the Early English Books Tract Supplement. The TCP transcriptions are just that -- though searchable by bibliographic fields, their primary data is text without any page images. The images from which they were keyed belong to ProQuest. And they represent a subset -- a substantial subset, but still a subset -- of the complete set of books available through EEBO proper.

EEBO and EEBO-TCP remain closely bound, since it is on EEBO images that the TCP texts are based: with certain exceptions (musical notation, images, most mathematics, most non-Latin alphabets, hand-written additions, some artifacts of print), TCP captured the full text of each transcribed work in EEBO -- by intention, almost every unique monographic English-language title. This was done by manually keying the full text of each work and adding markup to indicate the structure of the text (chapter divisions, tables, lists, etc.). The result is an accurate transcription of each work, which can be fully searched, or used as the basis of a new project. EEBO-TCP has produced about 64,000 texts, of which all but the last 4,000 are searchable online. The EEBO-TCP text files were delivered back to ProQuest and indexed in EEBO, so users at subscribing institutions can seamlessly perform full-text searches and view transcriptions directly within the EEBO platform, although the texts can also be accessed in other ways, including TCP's own search sites hosted by the University of Michigan Library. EEBO-TCP is administered by the University of Michigan Library. During its most productive years, it employed full teams of editors at Michigan and Oxford, plus a few ancillary sites, and turned out a new text roughly every twenty minutes. Only a single editor, at Michigan, remains active today (2026).

8. What is the difference between EEBO-TCP Phase I and Phase II?

The initial EEBO-TCP project began in 1999. Its goal was to key and encode 25,000 selected works from the EEBO corpus. This effort was completed in 2009, with the support of nearly 150 library partners. The 25,000 texts produced by this effort are called “Phase I.” This set of texts was released to the public on January 1, 2015. Anyone could and can search and view these texts online at the Michigan TCP site, or can download them in bulk for individual use and re-use.

With the encouragement of the project advisory board, and with the promise of another round of support from many libraries, in 2008 the TCP decided to continue the work of EEBO-TCP in a second phase. As described more fully elsewhere on this site, EEBO Phase II adopted the audacious goal of keying and encoding at least one edition of each unique monographic English-language work (with principled exceptions) represented in EEBO. Our guess -- and it was only a guess -- was that this would require converting roughly 44,000 additional texts, contingent, of course, on obtaining the requisite funds. Our estimate of how many texts it would take to achieve comprehensive coverage has largely been borne out (perhaps 45,000 was more accurate?), while our fund-raising fell only slightly short. As eventually funded, EEBO-TCP Phase II was able to produce nearly 40,000 texts. All texts belonging to Phase 2 (both prospective texts and those already released) were opened to the public on 1 August 2020.

So ultimately, the entire EEBO-TCP corpus (Phase I and Phase II together) consists of about 65,000 works. The distinction between Phase I and Phase II reflected a temporary arrangement, now long obsolete; there is no practical difference between the two phases today.

9. How much does it cost to key and encode a single TCP text?

The cost of keying and encoding a book depends on how long the book is and how difficult it is to capture and edit the text. A book might be particularly challenging due to the difficulty of the font, the quality of the image (as preserved, or as captured on microfilm and digital scan), or simply the presence of unusual and complex textual features, such as large tables or genealogical charts. A work might consist of a single broadsheet, or thousands of pages. Our vendors charged a flat fee by the character (technically, by the kilobyte) of data captured. The costs of review and editing, which were done in-house at Michigan and Oxford, are measured in time, typically by counting how many books could be reviewed in a month. On average, we estimate that it cost $200-$250 to key, encode, and review a “typical” work. The cost of a very large work could easily run into the thousands of dollars.

A research library paid $60,000 to become a partner, so each library that joined supported the conversion of 250-300 new books.

10. When were the texts made freely available?

All of the texts are now freely available; any notices you may find saying otherwise are obsolete. Indeed, it was always part of the mission of the TCP to ensure that the text files we produced would ultimately be freely available to the public. The date on which restrictions on sharing and distributing the texts were lifted depended on when each part of the project was completed.

11. Are there any restrictions on my use of the TCP texts?

Variant: "May I do xx with your texts?" The answer is simple: we impose no restrictions whatever, and (so far as the TCP is concerned) you may do anything with them that you like: you may translate them, edit them, revise them, illustrate them, perform them, or re-publish them, with or without attribution. We regard the texts themselves as being in the public domain, and expressly disclaim any 'light' or 'thin' copyright that may be thought to adhere to our capture and encoding decisions. Nor is access and use of the texts governed by any license terms. We do suggest (not require) that courtesy and scholarly good practice dictate that you indicate who made the texts and where you got them. But that's up to you.

12. May I see the original book? Or images of its pages?

This is complicated, but, in general, no, if you can't see the page images now, the odds are that you are not entitled to. TCP keyed its texts from images supplied by commercial companies: ProQuest for EEBO, Gale-Cengage for ECCO, and Readex/Newsbank for Evans. In no case did we have the right to redistribute those images (aside from limited fair-use-based clauses that apply to all subscribing libraries). Nor do we even have copies of those images; so far as access to those digital facsimiles is concerned, Michigan and Oxford are in no different situation than any other subscribing library. Access to the underlying page images therefore remains restricted to customers of the companies in question (ProQuest, Gale, and Readex/Newsbank), and only they can answer questions about access to their products. In the early stages of the TCP project, this was not a problem, since the universities paying for the text, and obtaining first access to it, were all also subscribers to the image databases. Now that the text is free to everyone, there is inevitably a divide between those who can see only the text, and those who can see both text and image.

So, the first question is: do you belong to a subscribing institution? If so, you are entitled to view the page images, and any obstacles are practical ones. E.g., is our system for 'pulling down' images from the companies working? Often it is not (only the EEBO process works reliably). Then, are you accessing the TCP site from a URL identified with your institution? If not, you may need to use a campus computer, or an institutional VPN, or the proxied link to the TCP site that may, or may not, be in your library's catalog. If you are entitled to view the images but cannot do so, it is in many cases best to go straight to the corporate sites (ProQuest, Gale, or Readex) and view the images there. If you have trouble locating the book on those sites that corresponds with our text, we can probably help with that. E.g., we can supply the ProQuest ID for any text you find in EEBO-TCP, which should take you directly to the correct image set.

Finally, if you are not affiliated with a subscribing institution, as most of our users perhaps are not, then none of this will help: you will be driven in that case to third-party sources when seeking facsimiles of the original book. Note that many books have been digitized and the images published on free sites in the past decades (especially on the Internet Archive); many owning libraries have themselves digitized books in their keeping and posted the results; some have ended up in HathiTrust, and it is always worth looking. For a few books, Gale and ProQuest especially have published copies of their scanned images in the form of paper books, and sold them on Amazon at modest cost. We have frequently been able to point people to such external sources, or to holding libraries that may in some cases be able to provide digitized copies on request.

13. May I have a pdf copy (of a given book)? Or html?

If you mean, 'a pdf facsimile of the original book', then this is basically question 12 above, q.v.

If you mean, 'a pdf version of the TCP transcription,' then this is just a matter of how best to generate a readable pdf from an XML source, and where to get the 'styling' (display) information from. You have many choices. For example:

  1. HTML & browser. If you are reading the text on the online TCP site, go to the top page (the 'table of contents' page) for the item, scroll down to the button that says 'view entire text', click that (which displays the entire text in basic HTML), then use your browser to 'print to pdf' the resultant display. This is often not the most efficient way (browsers add varying amounts of file overhead to such operations), and may not be the most agreeable (if you dislike our online display), but it is the quickest way to get a pdf from one TCP file. The styling you get is based on that of our HTML, and cannot easily be changed without serious poking around in the code.
  2. XML + CSS. Obtain the raw XML of the file (request a file or group of files, or download the whole database; see question 5 above). The 'P4' or 'TCP XML' version of these files references a simple, crude CSS stylesheet that suffices to make them readable in any net-connected browser. Since the CSS is separate from the file, you are free to modify it as you please to get the display you like (just change the reference at the head of the file to point to a local copy rather than the copy of the file ("pfs.css") hosted here). Or write your own CSS or XSLT from scratch.

    Once you've settled on a styling (whether our own, simple and garish as it is, designed for the use of in-house editors more than readers, or one of your own devising), you can either use your browser's 'print to pdf' function again, or use external XML-to-PDF software. We have had good luck with the app called "wkhtmltopdf", which, though no longer under development, can still be obtained from https://wkhtmltopdf.org/. The syntax for conversion is very simple, the process is very quick, and the command-line form allows you to batch-convert any number of files. For a single file, use

    wkhtmltopdf --allow ./STYLESHEET.css ./FILEIN.xml FILEOUT.pdf

    Or run it from a (Windows) batch file like this:

    FOR %%a IN (*.xml) DO "C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf" --allow ./STYLESHEET.css %%a ./pdf/%%a.pdf

    During one test, we were especially intrigued by the different file sizes of the pdfs produced by different engines, starting with an XML file of 476 kB: Firefox native: 965 kB; Foxit pdf writer (run by MS Edge): 72,976 kB; Microsoft pdf writer: 49,152 kB; Chrome pdf writer: 2,329 kB; wkhtmltopdf: 696 kB. Your results may differ.

We have run the wkhtmltopdf process on all of the Evans-TCP files, with seeming success, and can readily do the same on any requested file or batch of files.
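On other platforms, the same batch conversion can be scripted, for instance in Python. This is only a sketch: it mirrors the Windows batch-file invocation above, the stylesheet and output-folder names are placeholders, and it assumes a wkhtmltopdf binary on the PATH.

```python
import subprocess
from pathlib import Path

def wkhtmltopdf_cmd(xml_file, css="STYLESHEET.css", outdir="pdf"):
    """Build the wkhtmltopdf argument list for one TCP XML file,
    mirroring the batch-file example above (names are placeholders)."""
    xml_file = Path(xml_file)
    out = Path(outdir) / (xml_file.name + ".pdf")
    return ["wkhtmltopdf", "--allow", css, str(xml_file), str(out)]

def convert_all(folder="."):
    """Batch-convert every .xml file in a folder; equivalent in spirit
    to the Windows FOR loop shown above."""
    Path("pdf").mkdir(exist_ok=True)
    for xml_file in sorted(Path(folder).glob("*.xml")):
        subprocess.run(wkhtmltopdf_cmd(xml_file), check=True)
```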

NOTE: styling XML with CSS has one major problem: CSS notoriously cannot handle tables. If the file depends on tables (and many do), you are best served by going the HTML route -- either using our online HTML, or converting the XML to HTML on your own.

14. Why don’t you use OCR?

Because of the irregularity and difficulty of early printing, the ambiguity of some character glyphs (we calculated at one point that a character resembling the letter 'z' could mean 14 different things, probably an undercount), the challenges of structural complexity, and the variable quality of the microfilm-based images from which we were working, optical character recognition was judged not reliable enough to “read” the EEBO images and produce an accurate electronic text. The review and correction of the text produced would have been so expensive and labor-intensive that it was more efficient simply to key the work from scratch. This may no longer be true: TCP assessed the technology available twenty-five years ago (2000); there has certainly been a great deal of interest over the past several years, in Europe and in North America, in improving OCR for older works (a number of research projects investigating this have used TCP texts as a “ground truth” against which to compare their results); and who knows what AI can do? See further our somewhat fuller discussion of the OCR vs. keying question.

15. A work that I am interested in hasn’t been converted yet. When will you do it?

We used to say that users (especially those affiliated with partner libraries) were welcome to request works from EEBO that had not yet been keyed, and that their requests would go to the top of the queue. Since active transcription has ceased,  and there is no longer a queue, asking for a particular work is not likely to receive a satisfactory response. But users are still encouraged to ask about works they fail to find in the EEBO-TCP corpus (via email to tcp-info@umich.edu), since it is possible that the work is available in a related edition, or is lagging in the pipeline, or that an alternative source can be found. If none of these things is true, information that a work is in demand is valuable in itself and may inform future digitization efforts, for which, thank you.

16. Why does TCP (for the most part) only include one edition of a work?

We recognize that each edition of a work is unique, that one cannot stand in for others, and that for many scholarly purposes, there is value in examining closely the differences between editions. However, given limited funding, our first priority was always to capture as many different works and as great a variety of text as we could, usually focusing on the first edition of each work. Simply put, for every book that we chose to convert, a different book did not get converted: duplication, even partial duplication, has its costs. However, we have keyed additional editions where there is sufficient justification for doing so, and a user has made a case for it.

17. I found an error in a transcription.

We are very grateful to those who report errors to us, and will incorporate corrections into our next release of the texts. Unfortunately we don’t yet offer a way to report (or correct) errors within the interface itself. Please get in touch at tcp-info@umich.edu. But bear in mind that some things that look like errors may in fact be irregularities, or even outright printer's errors, in the original book. And most of those we are committed to preserving, not editing away.

18. Can you supply plain-text versions of the files?

We have in the past supplied subsets of the TCP files in some version of 'plain-text' format -- but only after discussing with the requester what exactly 'plain-text' means in the particular instance. We can certainly do that again, or work with the requester to get it done, but we hesitate to provide a one-size-fits-all 'plain text' without prior consultation, and without consideration of the intended purpose. The transcribed books are, after all, books, not just texts, with all the messiness that that implies. What, especially, should we do with all the textual objects that interrupt the main text flow: marginal notes? Indications of illegibility? Inline illustrations? Text that appears within illustrations? Likewise, what should we do with textual information stored in attribute values, such as page numbers? With tables, nested lists, and numeric data? With attached catalog records (metadata)? With unusual characters, symbologies, and character encodings? With end-of-line hyphens? Once these questions are addressed, we can generally modify our generic 'stripmarkup' script and produce the appropriate format -- or at least provide enough information to enable the user to do so in a more nuanced way.
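As an illustration of the kinds of choices involved, the sketch below (hypothetical; the real 'stripmarkup' script is more elaborate) drops marginal notes and the like during flattening and rejoins words split by end-of-line hyphens. The element names in the drop list, and the assumption that a soft hyphen appears as '-' before a newline, are both example policies, not TCP's actual ones.

```python
import re
import xml.etree.ElementTree as ET

def flatten(elem, drop=("note", "figDesc")):
    """Recursively collect text, skipping dropped elements entirely
    (but keeping the text that follows them)."""
    if elem.tag.rsplit("}", 1)[-1] in drop:
        return elem.tail or ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(flatten(child, drop))
    parts.append(elem.tail or "")
    return "".join(parts)

def to_plain_text(xml_string):
    text = flatten(ET.fromstring(xml_string))
    # Rejoin words hyphenated across line breaks (one possible policy).
    return re.sub(r"-\s*\n\s*", "", text)
```

A different requester might instead want notes kept and relocated, or hyphens preserved; the point is that each of these is a decision, not a default.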

19. Can you supply word-counts for the TCP corpora?

We have attempted to do so for subsets in the past. This poses questions similar to those attending 'plain-text' files, since arriving at a word count requires an understanding of what constitutes a word. Do partially illegible strings count? Abbreviations? Numbers? Tabular data? Counting the words in one sample of about 700 EEBO-TCP files suggests that we can expect approximately 6.89 bytes per word: that is, dividing the file size by 6.89 will yield an approximate word count for any given file or set of files, at least in the EEBO corpus. The entire EEBO corpus on that basis contains about 1.6 billion words. On the same basis, ECCO-TCP contains 138 million words and Evans 100 million.
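The rule of thumb above reduces to one line of code, shown here only to make the arithmetic concrete (the 6.89 figure comes from the EEBO-TCP sample described above and may not transfer to other corpora):

```python
def approx_word_count(n_bytes, bytes_per_word=6.89):
    """Rough word count from raw file size, using the ~6.89 bytes/word
    ratio measured on a ~700-file EEBO-TCP sample."""
    return round(n_bytes / bytes_per_word)
```

So a 476 kB file, like the one in the PDF test under question 13, would come out to roughly 69,000 words.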

20. Do the TCP texts come with metadata?

Yes. The metadata for the TCP texts comes in four overlapping or mutually interacting forms, and exemplifies the general TCP philosophy of focusing on text capture and markup, leaving the cataloging to others (to catalogers, in fact), and leveraging existing resources rather than creating everything afresh -- just as we did with the underlying images.

  1. All of the files contain metadata in the markup itself. This is the place to look for indications that a text, or a portion of a text, is in a particular language, or belongs to a certain text type. All of the files also contain an abbreviated header containing a minimal set of identifiers, such as the TCP ID number, STC number, etc.

  2. The production process for all of the files was registered in a project-specific 'dat' file -- a simple SGML-encoded text file, with the record for each item on its own line. This serves in effect as a newline-delimited database, but one that can be searched in quite complex ways by any grep-type search, or any text editor. These files began as lists (with minimal bibliographic and identifying information) of all the items that we could select for transcription; as items were selected, keyed, proof-read, edited, and placed online, each step was recorded in its own field.
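Because each record occupies one line, any line-oriented tool can query the register. In Python, a grep-style filter is a one-line comprehension; the record layout in the example is invented for illustration (real records carry TCP IDs, STC numbers, and per-step production fields).

```python
import re

def matching_records(dat_lines, pattern):
    """grep-style filter over a newline-delimited 'dat' register:
    return every record line matching the regular expression."""
    rx = re.compile(pattern)
    return [line for line in dat_lines if rx.search(line)]

# Hypothetical records, one per line, as in the real 'dat' files:
sample = [
    "<REC><ID>A00001</ID><KEYED>2001-03</KEYED></REC>",
    "<REC><ID>A00002</ID></REC>",
]
```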

  3. Bibliographic information was collected from both public sources and the image-providing companies in the form of library MARC records. These were then usually heavily modified in-house in order to provide a description of the keyed text (as well as of the print original and image intermediaries), and in some cases a description of the editing process as well, the latter information derived from (2) above. We store these in MarcEdit text format (*.mrk or *.mtxt) to allow ease of editing with common tools. They should be transformable to binary MARC or to MARCXML with the MARCMaker function in MarcEdit, or modifiable with any text editor in their current form.

  4. File headers are attached to the P4 and P5 versions of the XML files (the SGML versions contain merely the minimal identifier header), based on the MARC records described under (3) above and generated from the MARC by XSLT via a MARCXML intermediary. These headers do not extract all of the information from the MARC -- they are somewhat less comprehensive -- nor is all the information in the headers displayed online on the TCP site. In some cases you may need to go back from the online display to the underlying headers, or from the headers to the underlying MARC, or even from the MARC back to the *dat file.

We hope to make the MARC for ECCO-TCP available here soon.