A Renaissance knowledge base (RKB), a computer-searchable library of kernel electronic texts, is being developed at Toronto to assist in the study of the English Renaissance, a period conventionally beginning with the victory of Henry Tudor over Richard III at Bosworth Field in 1485 and ending in the 1640s with the closing of the London theatres, the execution of Charles I and the establishment of the English Commonwealth. RKB serves students of Renaissance drama, literature and history, especially in English studies, and this intended audience determines how it should be tagged. Unlike a linguistic corpus, for example, RKB need not carry morphological, part-of-speech and syntactic tags, but its literary-historical purpose requires that all places, names and titles be encoded. If we think of literary history as a quasi-grammatical system, then people, places and titles are its 'parts of speech', and events and dates are the syntax that binds them into sequences. Because a text archive or repository normally includes any texts, tagged or not, the phrase "text database" better describes RKB: databases, after all, are structured and selective.
There are well over 30,000 different works in the Short-Title Catalogue (STC) of printed books from 1475 to 1640, and many more manuscripts surviving from this period, catalogued by major libraries, the Historical Manuscripts Commission and government record offices. Putting even 5% of this material into machine-readable form would be an imposing task, but a much smaller reference collection of seminal books of the period would still go far to advance research.
How will the RKB be used? Students and researchers will be able to generate interactive concordances of any word-form, word, group of words, tag or phrase, and so follow a hypertextual trail from one work to others. Such citation lists can help a researcher trace the history of a topic, understand word meaning and determine streams of influence. There will also be stylistic applications of the data, but the main purpose will probably be to gloss or comment on works by major authors.
The RKB kernel[2] consists of the following:
Because the kernel consists of full texts, not samples, the database structure must support an adequate description of early books and manuscripts, especially features such as contractions and brevigraphs, textual variants, and marginalia. The first set of works is heavily figurative; the second defines or uses the English language less figuratively and so provides an essential background for understanding the play on meaning so typical of poetry and drama. Such prose texts, abounding in references to people, places, titles and events, also form an important part of the RKB kernel.[4]
Among the most important of these prose works are the dictionaries of the period. Because the first English table of words, Robert Cawdrey's A table alphabeticall, appeared only in 1604 and restricted itself to 2,543 hard words, English scholars must recover contemporary information about word meaning either from definitions appearing within books or from small glossaries such as Cawdrey's, many of which have been analyzed by Gabriele Stein and the late Jürgen Schäfer (1989). Yet a third source of lexicographical information exists: the bilingual dictionaries that served Englishmen of the Renaissance as reference books for understanding French, Italian, Latin, Spanish and other languages.
I have chosen five such dictionaries for the kernel:
There are larger dictionaries than these, such as Thomas Cooper's Latin-English one, and there may be better ones for some purposes, but the five kernel dictionaries all use substantial English sentences in explaining or translating foreign-language terms, all influenced contemporary English readers, and all contributed much to The Oxford English Dictionary. They provide an ideal semantic background for the language of major English authors.
One difficulty that has prevented scholars from using these dictionaries thoroughly is their organization: Cotgrave's entries (though not Palsgrave's) are listed alphabetically by the foreign-language lemma. A text-retrieval program searching the five books, structured as one electronic text database, could extract every foreign-language lemma, with its entry, in whose explanation a queried English word is employed. Since several of these dictionaries cite and translate foreign-language phrases and proverbial expressions, the database also yields thousands of sentence pairs in which English words are used and translated in context.
The data structure imposed on the five dictionaries depends on their function in the RKB. In this paper I describe a tagging system for only two of the books in the RKB, the dictionaries by Palsgrave and Cotgrave. Both have now been fully entered into machine-readable form, and their tagging (especially that of Palsgrave) has been in progress for several years.
The RKB markup system derives from the preliminary guidelines issued by the Text Encoding Initiative (TEI) in 1990, which implement a subset of the Standard Generalized Markup Language (SGML) published by the International Organization for Standardization (ISO) (Sperberg-McQueen & Burnard, 1990).[5] The TEI implementation of SGML in the draft P1 guidelines is now being revised by dozens of scholars in many fields; a final version will be published in 1992. No other markup method promises to meet scholarly needs; in fact no competing markup system exists. No guidelines have ever been published for the COCOA markup (named after the earliest Oxford mainframe concordance program) employed by the Oxford Concordance Program,[6] TACT and other text-retrieval software. These programs lack a formalism able to express the complexities of literary texts.
The TEI markup scheme is independent of any commercial or shareware text-retrieval system, and even of SGML text-editors. That is a strength, not a weakness. Unlike procedural markup, such as that used in word-processing programs like WordPerfect or scholarly text-formatting systems like TeX, SGML tags do not directly call for a procedure to be followed (e.g., italicizing a title). TEI markup is descriptive: it normally describes the function of a piece of text rather than indicating what should be done with it, although within SGML tags can be created to indicate what TEI calls "rendition", that is, the appearance of the text on the page. Unlike the non-procedural markup employed by popular text-retrieval and analysis software now in use, TEI tagging handles text in all languages. Like TACT markup but unlike WordCruncher markup,[7] TEI can tag kinds of text that turn up randomly or intermittently, such as speeches, speech prefixes and stage directions. Unlike TACT markup but somewhat like WordCruncher markup, TEI expects every text's structure to have a "grammar" that can be parsed: a hierarchical structure must be assigned to the text, even if it is a simple two-level one that in effect recognizes only a lattice-like structure.
Any software has limitations, but an academic markup scheme should have none: it should be able to reflect the complexities of texts without being compromised by local implementations of those texts for specific retrieval, analysis or editorial programs. TEI produces what is called an interchange format, that is, a set of tagging guidelines suitable for passing electronic texts from one local system to another.
What is the point of encoding a text with SGML when most local software cannot handle that protocol? How individuals answer this question will depend on what matters most to them: the usability of the electronic text by others (i.e., considerations of 'scholarly publication') or personal convenience. In my view, texts tagged for scholarly reasons will often contain more information than any local software can process. Editing 'down' hurts nothing, while editing 'up' is impossible. At Toronto, TEI format will be transformed automatically into a form suitable for processing with TACT, a local text-retrieval system.
If an electronic text belongs to one person alone and will never be used by others, then the markup chosen for it need only meet the specifications of the software he or she uses. Yet that circumstance now almost never arises where scholarly editions are concerned. For twenty years editors have used COCOA or WordCruncher syntax for tagging but have chosen their tag types and tag tokens privately (sometimes with indecipherable abbreviations), without reference to any general consensus about what would work best for scholars. TEI seeks that interdisciplinary consensus: a tagging syntax independent of the life-cycle of specific pieces of hardware and software.
COCOA tags normally have three parts: (1) delimiter characters (the diamond brackets, or some other symbols not found in the text), (2) a variable or type name, and (3) a value or token name. By convention, the variable or type name may be dropped if the delimiters themselves can carry that meaning. For example,

    <AUTHOR John Palsgrave>

and

    <<John Palsgrave>>
amount to the same thing. In the first form, the single diamond brackets are the delimiters that separate the variable or type AUTHOR and the value or token John Palsgrave from the text. The variable or type may take any form: other tags could be TITLE, DATE, PUBLISHER, etc. The value or token following it, John Palsgrave, may change, say, to other authors such as Cotgrave, Florio or Thomas. In the second form, the double diamond brackets are understood to stand for <AUTHOR > rather than <TITLE > or <DATE >; any other tags must then use different and unique delimiters.
COCOA tags of a given type, like <AUTHOR John Palsgrave>, hold until another tag of the same type occurs. That is, every word in the text following <AUTHOR John Palsgrave> would be tagged as being written by Palsgrave until a subsequent <AUTHOR > tag occurred. The span of such COCOA tags, then, is indefinite.
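The span behaviour can be sketched as follows; the dictionary lines are simplified illustrations, not exact quotations:

    <AUTHOR John Palsgrave>
    I Lye at a siege byfore a towne ...
    <AUTHOR Randle Cotgrave>
    Abbaisser. To abase, or humble ...

Every word after the first tag carries the value John Palsgrave until the second <AUTHOR > tag resets it to Randle Cotgrave; nothing in a COCOA file marks where the first author's material was meant to end.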
For my purposes, SGML/TEI tags differ from COCOA tags in five principal ways.[8]
By permitting closing tags, SGML, unlike COCOA-style tagging, employs the text itself to tag the text, whereas the value or the token in a COCOA tag must always be added to the text by the editor. Another way of putting this is that SGML recognizes the difference between tags authorized by the text, and ones created by the editor from scratch. SGML tagging is textually 'conservative'.
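A sketch of the contrast, with a hypothetical title tag:

    COCOA:  <TITLE Lesclarcissement>           (value typed in by the editor)
    SGML:   <title>Lesclarcissement</title>    (value is the text itself)

In the COCOA form the token Lesclarcissement must be added to the tag even when the same word stands in the text; in the SGML form the opening and closing tags merely bracket words already present.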
COCOA-style tags in a text have no structure as a group; they can go anywhere, in any sequence, and are relevant only to the words that follow. Often such tags are never explained within an electronic text. In contrast, SGML-tags are normally declared in a "document-type definition" (DTD) at the start of the text. This declaration associates those tags in hierarchies, trees or groups of tags. A DTD normally expects that a text is structured like a Ukrainian doll or a Chinese box, one smaller piece or tag division completely inside another. 'Higher' or 'outer' structural units never overlap with 'lower' or 'inner' units. For example, a DTD for a novel might tell us that paragraphs fall always inside chapters, and chapters always inside books, i.e. that a paragraph is never split between two chapters. Or a DTD for a play might tell us that speeches always occur inside scenes, and scenes inside acts; a speech that carries on over two scenes would break the hierarchy. Texts that have flat 'lattice' structures, or several different structures, may also be represented within this formalism.
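A minimal DTD fragment for the play example might read as follows; the element names and content models are my own illustration, not prescribed by TEI:

    <!ELEMENT play   - - (act+)    >
    <!ELEMENT act    - - (scene+)  >
    <!ELEMENT scene  - - (speech+) >
    <!ELEMENT speech - - (#PCDATA) >

An SGML parser validating a text against these declarations would reject any speech that straddled a scene boundary -- exactly the Chinese-box discipline just described.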
Figure 1 lists 53 tags employed in my encoding. These fall into four groups: entities (special graphic figures, brevigraphs, contractions and one diacritic, the soft hyphen); bibliographical tags (which mark page, line and column breaks and tag non-text blocks on the page); lexical tags (which identify the functions and language of symbols, words, phrases and sentences, mostly within the dictionary entries themselves); and tags for the functional structure of the books as wholes.[9]
I use SGML entities to tag the ambiguous accented letters or brevigraphs of pre-modern printing, because each does not necessarily stand for the same set of letters every time and so cannot be handled as an extension of the basic English alphabet. The e-macron, for example, might represent en or em or even ett (as in sēd, stē and lrē). For this reason contracted letters have no fixed place in a collating sequence; yet neither are they diacritics (marks that are part of a word but do not affect its alphabetical sorting). COCOA-tagging programs such as TACT have difficulty representing these.
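In SGML each such brevigraph is declared once as an entity and referred to wherever it occurs; a sketch, with an entity name of my own choosing:

    <!ENTITY emacr SDATA "[e-macron]" >

    ... s&emacr;d ...

The reference &emacr; records only what stands on the page; which letters it abbreviates in a given word is supplied separately, by the <expan> tag discussed below.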
Bibliographical tags separate extra-textual information from the body of the text while ensuring that the information remains accurately recorded. By identifying all extra-textual blocks on the page, I can ensure that searches of the main body of the text will not pick up matches in running titles, catchwords, marginalia and the like. Note that for convenience's sake I also normalize page number, signature and foliation in the COCOA-style <page.break> tag. I employ five SGML tag attributes in this section: three in the <page.break> tag and two in the <ornament> tag.
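The <page.break> example below uses the attribute values cited later in this paper; the <ornament> attribute names and values are my guess at the kind of information involved:

    <page.break no="23" sig="C1v" fol="11v">
    <ornament type="woodcut" place="head">

Because running titles, catchwords and the like are bracketed by their own tags, a search confined to the text body never strays into them.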
The lexical tags call for some formal explanation, on three points.
First, with TEI/SGML tagging it is possible to associate a given language globally with a given tag. For instance, the head-lemmas and sample quotations in Cotgrave's dictionary are always French, the meanings and translations always English. There is no need to tag this aspect of font or "rendition" each time; the DTD can assign language to tags automatically. On the other hand, where foreign-language words are cited within Palsgrave's discussion of French grammar -- prose uniformly written in English -- they have to be marked as foreign each time.[10]
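In SGML terms the global association can be declared once in the DTD with a fixed attribute, while the intermittent case needs a tag at each occurrence. A sketch, with element names of my own choosing and invented prose:

    <!ATTLIST hdlemma  lang CDATA #FIXED "fr" >

    ... as whan the frenche tonge sayth <foreign lang="fr">parler</foreign> ...

With the #FIXED declaration no language value ever needs typing inside Cotgrave's entries; in Palsgrave's English prose each cited French word is wrapped individually.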
Second, the method for tagging a textual variant (which may be an editorial correction) involves a triplet of linked tags: <var>, which encloses the variant of the immediately preceding text; <rdg>, which specifies the variant reading; and <wit>, which states the source or witness of the reading. A complication occurs when the variant covers more than one word of the text or involves a deletion: anchor tags must be placed before and after the words for which the variant exists, and 'start' and 'end' attributes must be added to the <var> tag. Without these delimiting tags the match between the variant and the text varied from cannot be made. Again, COCOA-tagging software has difficulty with this problem, although it is endemic to critical editions.
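A sketch of the multi-word case, using the tags and attributes just described; the id values, the variant reading and the witness are invented:

    ... I <anchor id="v1">Lye at a siege<anchor id="v2"> byfore a towne ...
    <var start="v1" end="v2"><rdg>Lye at siege</rdg><wit>corrected state</wit></var>

The two anchors delimit exactly the words for which the variant exists; without them a match between a four-word lemma and a three-word variant could not be made.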
Third, contractions and brevigraphs are expanded between the <expan> and </expan> tags. The attribute type gives the kind of abbreviation, orig gives the entity name of the abbreviated form, and the text itself contains the expansion. Inquiries on the text, then, need not take account of the various abbreviated forms of a word, although the original reading and the editor's choice of expansion remain always available.[11]
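Taking the e-macron entity declared above, the contracted form sēd, expanded to send, might be encoded thus (the type value is illustrative):

    s<expan type="brevigraph" orig="emacr">en</expan>d

A search for send now matches this word directly, while orig preserves the entity name of the printed contraction and type records the kind of abbreviation involved.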
Facsimile and tagged forms of two pages in Palsgrave's Lesclarcissement, one from the table of substantives and another from the table of verbs, may be seen in Figure 2, Figure 3, Figure 4 and Figure 5. Tags on these pages are printed in non-proportional Courier (the text is in proportional Times) to show how much bulk encoding adds to a file; were words also tagged for morphology, part-of-speech and lemmatized form, the Courier would swamp the page. Minimization techniques within TEI/SGML can reduce the number of tags displayed, but the suppressed tags remain available, like formatting codes hidden until the user activates WordPerfect's Reveal Codes function. What matters is that the tags be available; whether to display them is a matter of choice. Since software can use tags without displaying them, the ratio of tags to text is not a concern.
Palsgrave's dictionary, barely three generations younger than printing itself, evades the kind of hierarchical structure TEI/SGML anticipates. Its table of contents, for instance, follows the first of its three books, a lengthy introduction. Its 'dictionaries' appear in successive tables and take several forms, schemas for which appear in Figure 6. An entry for substantives looks straightforward, but an entry for a verb places the lemma by which the entry is apparently alphabetized inside the opening sentence or phrase (generally within the first two words) and permits recursion (repetition of parts of the entry) at a minimum of three points in the regular sequence. The structure looks more like a maze than a hierarchy. Consider the entry beginning "I Lye at a siege byfore a towne": two translations follow, connected by or, and are succeeded by a note about the conjugation of what may be one or both of the verbs tenir and assieger.
In contrast, Cotgrave's dictionary has a conventional hierarchical structure more suitable to a TEI-SGML Document Type Definition. Within the dictionary proper, head-lemmas always follow alphabetical letter headings, and phrase-lemmas always come after a related head-lemma. Because head-lemma entries regularly overlap page and column boundaries, in TEI/SGML terms Cotgrave's book offers two overlapping hierarchies: letter / head-lemma / phrase, and part / page / column. The structure of the head-lemma -- see Figure 7, Figure 8 and Figure 9 for a facsimile page, its tagged form and a general schema -- has repeated sequences of tagged fields, as well as multiple paths. Perhaps a rigidly consistent system of rules exists, but only a thorough computer-aided study of the sequence of tagged fields in all entries would show what those rules are. The "meaning" tag especially is a grab-bag of explanations of word-sense, synonyms, commentary on grammatical and even historical topics, and straight translation.
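The overlap of the two hierarchies can be sketched schematically; the logical tag names here are illustrative, and <column.break> merely imitates the naming style of <page.break>:

    <hdlemma>Abbaisser</hdlemma> ... opening of the entry ...
    <column.break>
    ... remainder of the same entry ...

The entry belongs to the hierarchy letter / head-lemma / phrase, the column to part / page / column; because each unit may interrupt the other, neither hierarchy can simply nest inside its rival.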
I found the TEI/SGML guidelines a useful yardstick against which to work out the structure of both dictionaries. Both books resist a fixed structure, although their entries are remarkably regular. The "cross-reference" sequence in Cotgrave -- as (or another directive word) followed by another head-lemma -- appears in several places, as if it were a called procedure. Cotgrave's treatment of phrases as sub-entries under the head-lemmas resembles his use of a sample quotation and its translation after the "meaning" section of a main entry.
SGML editors already exist for most platforms. Text-retrieval systems such as Pat, from Open Text Systems at Waterloo, can already manage the new encoding format under UNIX; and the Oxford English Dictionary, which Frank Tompa and his colleagues have structured into a database, carries SGML-like tags.
While existing software like the Oxford Concordance Program and TACT will not easily be rewritten to interpret the new TEI/SGML interchange format, it should be possible for the Text Encoding Initiative itself, or for software developers, to write independent programs that transform TEI-SGML-encoded texts into formats suitable for local processing. Translating syntactic differences between the interchange and local formats should not be difficult. For instance, tags like <sig>...</sig> could be replaced by a COCOA-style tag such as <texttype sig>...<texttype main>, and the attributes of a TEI-SGML tag could be rewritten as separate tags: <page.break no="23" sig="C1v" fol="11v"> could be translated into <page 23> <sig C1v> <fol 11v>.
Other differences may not disappear so quickly. Simple variant readings are readily encoded in COCOA format, but not ones that replace one sequence of words with another shorter or longer sequence: to match variant and lemma, the variant reading must be 'anchored' to the exact phrase in the main text. Anchoring other pieces of text, such as marginal glosses, poses a similar problem. Contractions and brevigraphs, which belong neither to the collating sequence nor among the diacritics, may also resist easy translation. Textual editors and students of the language will be keen to recover from these dictionaries all occurrences of a contraction, to see in how many different ways it has been expanded.
A concordance program that can retrieve words, word groups, word-patterns and tags, either by themselves or in combination, should be able to extract most of what matters in these dictionaries: proper names and titles, inflections or parts-of-speech, labels or citations of a source language, and all examples of a given head-lemma (in its entry, in explicit cross-references and in phrases or sentences used under other head-lemmas).
Ideally, software should also be able to retrieve words, word groups, word patterns and tags not just with the immediate context but also with nearby chunks of text explicitly associated with a given tag. For the purposes of RKB, for example, it would be desirable to recall all translations containing specified words, followed by the sample quotations they translate; or a listing of all occurrences of a group of English words together with both the French head-lemma after which they appear and the immediate context. Because some English words fall many lines after their head-lemmas (e.g., in one of many phrases subsumed under the head-lemma), this context selection is not easily done by most text-retrieval systems: they need to be able to retrieve entire passages encoded by a single tag.
Tagging and text-retrieval requirements for Tudor bilingual dictionaries vary according to two things: the nature of the texts themselves, and the purposes to which they are put. First, any scholarly system must be able to identify textual phenomena whatever they are. Because these text features cannot all be distinguished unambiguously, but must sometimes be recovered by interpretation, it is difficult to anticipate everything that might be tagged: I can tag Cotgrave's head-lemmas and phrasal lemmas, for instance, but satisfactorily dissecting the 'meaning' tag demands lengthy analysis. Second, different scholars put texts to different uses. To analyse the texts automatically for collocational patterns, as I am doing in a separate research project, texts must be lemmatized, and at a minimum each word in such a file has to carry three tags: its word-form, its part-of-speech and its inflectional form (sketched below). Multiply the size of the file, then, threefold. The English lemmatizing system I am designing with Michael Stairs may well overload existing software. Scholars searching for names, on the other hand, can be satisfied with far fewer tags. Versatile text-retrieval software, then, has to work well with both densely- and sparsely-tagged texts.
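As a sketch of the density involved, a single lemmatized word might carry its three pieces of information as attributes; the names and values are my own illustration, not a published scheme:

    <w lemma="lie" pos="verb" infl="1sg.pres">Lye</w>

Wrapping every word of a long dictionary in markup of this bulk accounts for the threefold growth in file size mentioned above.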
Since the early 1950s and until recently, anyone using computers in the humanities has had to cut the cloth, the text, according to the tailor, that is, the computer platform and the software available for it. In the past five years this situation has changed: software has become increasingly general-purpose. Encouraged by this turn of events, we should insist on a tagging system suited to our texts rather than trimming those texts, and restricting their encoding, to fit the available software. "If you build it, he will come", says the voice in Field of Dreams. If we define our tags according to the needs of our texts, the software to analyze them will be written.
[1] I wish to thank Douglas Kibbee for many helpful criticisms of an earlier draft of this paper. I also wish to express my gratitude to the Social Sciences and Humanities Research Council of Canada for its support of this research, as well as to the Humanities and Social Sciences Committee of the University of Toronto, which supported text-entry of Palsgrave's text.
[2] Some of these texts have been deposited in the Oxford Text Archive.
[3] Chadwyck-Healey has promised a full-text collection of 4500 volumes of poetry by major authors up to the 20th century, and publishers such as Oxford University Press have begun a series of electronic texts.
[4] From the scholar's viewpoint, RKB might be characterized principally by its second set of texts, because commercial publishers are unlikely to issue those works individually. Unlike works by major authors, the reference texts have seldom been published in modern critical editions amenable to optical scanning.
[5] This publication has been sponsored by the Association for Computers and the Humanities, the Association for Literary and Linguistic Computing, and the Association for Computational Linguistics. The full-text databases published by Chadwyck-Healey, for instance, will employ TEI encoding. For an introduction to SGML, see Goldfarb, 1990.
[6] Micro-OCP is distributed by Oxford Electronic Publishing, Oxford University Press, Walton Street, Oxford OX2 6DP, UK; and in North America by OUP, 200 Madison Ave., New York, NY 10016, USA.
[7] TACT and WordCruncher are interactive text-retrieval programs.
[8] I emphasize that SGML is a much more complex system than I am able to describe here. I touch only on certain characteristics obvious to the scholar doing the tagging.
[9] Not all these tags and their attributes appear in the samples. The list of tag attributes is also incomplete (e.g., the dialectal character of a translational equivalent tagged by <m>...</m>).
[10] One can also assign a language attribute to each head lemma in the event that directionality changes within the text (e.g., English-to-French, French-to-English).
[11] For most purposes, expansions may be represented simply within square brackets.