The Dictionarius of Firmin Le Ver (DLV) is a very large Latin-French dictionary compiled at the Carthusian house of St. Honoré at Thuison, near Abbeville in Northern France, in the first half of the fifteenth century. The text is preserved in Paris, Bibl.Nat. nouv.acq.fr. 1120, where it takes up 467 of the manuscript's 478 folios. It contains a total of 12,800 headwords, plus 37,700 sub-headwords, in all a text of 540,000 words, which gives an electronic file of 4.5 Mb.
In 1986 William Edwards and I set out to produce a critical edition of the DLV, a project now virtually complete. Although we have always intended to use the text base of the dictionary for a variety of purposes, the production of the edition has remained paramount; we might have proceeded differently if we had had to consider the DLV only as a textbase for analysis and exploitation. Certainly we might have marked the text more than we did, but at the beginning we were unaware of many features of the text that would later capture our attention.
The text was entered in WordPerfect (initially 4.0 and 4.1, later 4.2, now converted to 5.1) and set out in a manner that aimed at representing, as best we could, the layout of the dictionary entries on the manuscript page. The entries in the DLV are better termed macro-entries: most are made up of a headword, marked in the manuscript by a coloured initial capital and followed by one or more sub-headwords, set at the left margin and beginning with a capital in the ordinary brown ink of the text. The definitions, which are in Latin and French, are also in ordinary ink, though for the first few folios a hand, not Le Ver's, has underlined the French. This practice is found elsewhere in bilingual dictionaries. We entered the text respecting the line length of the manuscript column and used the WordPerfect codes on a colour screen to emulate some features of the layout. Bold capitals were used for the headwords and bold with a single initial capital for the sub-heads, while underlining, now italics, set off the French. The Latin of the rest of the definitional and attributional material, by far the bulk of the text, was in regular type. The printed text looks like this:[1]
ABISSUS ab *a, quod est sine, et *bissus componitur
Abissus .ssi abisme profunditas aquarum    f
impenetrabilis vel spelunca aquarum latitantium unde fontes et flumina procedunt, scilicet pelagus
Abissus eciam dicitur profunditas scripturarum
ABITIO .tionis .i. recessio In ¶Abeo, abis dicitur
ABIUGO .gas ex *ab et *iugo .gas componitur    act
Abiugo .gas media correpta .i. a iugo separare, dissociare, abgregare desjoindre, separer
Abiugatus .a .um desjoins, separés, divisés    o
ABIUNGO ex *ab et *iungo componitur    act
Abiungo .gis .xi .ctum desjoindre longe iungere, separare, segregare, dividere
Abiunctus .a .um desjoint separatus, semotus
Abiunctim adverbium separeement, desjointement
ABIURO .ras ex *ab et *iuro .ras componitur
Abiuro .ras .ratum .i. periurio negare    act
.i. deneer, nier par mentir, par parjuremens
Abiuratus .ta .tum niés par mentir
Abiuratio .tionis .i. rei credite abne-    f
gatio, periuratio, inficiatio deniemens
Abiurtio .tionis idem, per sincopam    f
ABLACTO .ctas componitur de *ab et *lacto .ctas    act
Ablacto .ctas .ctatum ensevrer, sicome enfant on oste de la mamelle .i. a la-
cte removere, extrahere et separare
Ablactatus .a .um .i. ensevrés a lacte ex-    o
tractus, semotus, separatus a mamilla
Ablactatio .tionis ensevremens    f
ABLATUS .ta .tum osteis remotus,    o
separatus, semotus ab *aufero .fers dicitur
Ablativus .a .um qui aufert qui oste, ostans    o
Ablativus .tivi quidam casus ablatis    m
Ablatio .tionis ostemens semotio    f
During the period it took to enter the text, and beyond, we continued to work on various ancillary studies: questions of sources and transmission, the status of French in the dictionary, and the nature and function of metalanguage (Merrilees, 1988, 1990, 1991; Merrilees & Edwards, 1989). This last has led to further work on the structure of the dictionary entry, an extension of the visual aspect noted above (Merrilees, 1992). The main components of a dictionary entry are the lemma (headword or sub-headword) and the definition, but around these two poles there can be various markers and much additional information about the lemma, its attributes and function. In the DLV this additional information is distributed in three locations and we have found that each position appears to privilege certain kinds of information.
The three positions are post-lemmatic, post-definitional, and marginal (the right margin); their functions are:
Concording programs, such as WordCruncher, can easily pick up the metalinguistic vocabulary under analysis, but they are not well suited to dealing with the component elements of a dictionary article as these stand in relation to one another. Nonetheless we have had useful output from WordCruncher, which William Edwards describes here, and from an indexing and concording program that David Megginson reports on later in this paper.
WordCruncher promotes itself accurately and concisely as text indexing and retrieval software: a two-stage process in which data are retrieved from pre-indexed DOS text files. In fact, the program operates as two interrelated but distinct components, namely indexing and viewing. In short, it processes any DOS text file, creating a word-frequency list which the user can then manipulate to gain random access to the text file and to retrieve data. The program's strength is the ease with which designated character-strings, suffixes and prefixes, or various combinations of words or letters can be searched; in our preliminary work of transcribing, entering and checking the dictionary text, that searching capacity was invaluable. However, as we moved to an analysis of the structure of our text, as prepared, WordCruncher had its limitations, though we should point out that such limitations relate as much to our application as to the program itself. For example, WordCruncher can list all occurrences of a particular word, but cannot identify the most frequent word in the post-lemmatic position; it can list all French occurrences of the suffix -iet but cannot identify the frequency of French in the definitional position; it can provide all examples where words 1 and 2 are followed within a designated number of spaces by word 3 and/or 4, but the program cannot identify the schematic structure of an entry. In the initial stages this was less a concern to us than the capacity to have rapid access to the textbase.
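The kind of positional query that lay beyond a concording program can be illustrated with a minimal sketch in Python. It assumes, hypothetically, that each sub-headword article has been split into separate fields for the three positions described above. The `SubEntry` record, its field names and the way the sample values are divided among the fields are illustrative assumptions, not the markup used in our files.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SubEntry:
    """Hypothetical record for one sub-headword article (not our actual markup)."""
    lemma: str
    post_lemmatic: list = field(default_factory=list)
    definition: list = field(default_factory=list)
    marginal: list = field(default_factory=list)


def most_frequent(entries, position):
    """Return (item, count) for the commonest item in one position, or None."""
    counts = Counter()
    for entry in entries:
        counts.update(getattr(entry, position))
    return counts.most_common(1)[0] if counts else None


# Values taken from the printed sample; their assignment to fields is assumed.
entries = [
    SubEntry("Abiugatus", [".a", ".um"], ["desjoins", "separés", "divisés"], ["o"]),
    SubEntry("Abiuratio", [".tionis"], ["deniemens"], ["f"]),
    SubEntry("Abiurtio", [".tionis"], ["idem"], ["f"]),
]

print(most_frequent(entries, "post_lemmatic"))
print(most_frequent(entries, "marginal"))
```

Once the positions are explicit fields, a question such as "what is the most frequent item in the post-lemmatic position" becomes a simple count; the difficulty lies in the marking of the text, not in the query.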
WordCruncher will triangulate a given reference, provided the user has prepared (pre-indexed) the text according to a three-tiered system. The generation of publishable indices in a variety of formats -- book index, key-word-in-context, key-word-in-line -- is apparently the controlling principle of pre-indexing. We found that, for our purposes, a basic, 'untreated' text file in DOS format, with a unique filename, will provide access and data retrieval as satisfactory as one that has been given a more sophisticated pre-indexing treatment; additional preparatory marking, or adapting the text file to the hierarchical structure suggested by the program, yields limited further returns. Even if we had marked our text more than was done, the structure of the dictionary article could only have been partially captured by the three-level hierarchy.
With whatever level of sophistication the text is prepared, the program produces a word-frequency list, which is then used to mark the text, and through which both the program and the user access the file. The levels of indexing, or lack of them, do not in any way affect this random accessibility, nor the power of recall: lists of specified citations, modified or not, are available to the screen, as a DOS file, or can be sent to the printer.
As Brian Merrilees has shown, our medieval compiler greatly anticipated our task, simplifying the need for the detailed marking of our text, using principles of lay-out as pre-indexing tools which we have chosen to reproduce. Alphabetically arranged dictionaries, after all, are already largely pre-indexed.
The principal benefit of concording Le Ver's text is to provide random access to the French imbedded within the Latin text. It was hoped initially that WordCruncher's three levels of referencing could be used to provide a ready reference for each French word as follows: the designated French citation, in its extended, French context; its 'book' (dictionary and letter); together with its Latin referents: Latin headword and Latin sub-headword. However, extensive marking and preparation of the text to achieve this end yielded minimal advantages over the use of a largely unmarked text. For the Le Ver text the practical limitations of WordCruncher prevented a useful application of the three available levels of reference. As we progressed in our analysis, it became clear that a dictionary entry structure as we have described it above would have required a different kind of software. However, it was through WordCruncher's search results that we came to a fuller understanding of Le Ver's methods of compilation. For example, it was our searches of metalinguistic terms that confirmed the importance of the link between information and its location.
The preparatory efforts that proved the most useful, curiously, were not with respect to the preparation of the text file, but rather to the manipulation of the Character Sequence file, which provides five built-in default options for organising the generated word list: four default language sequences -- English, French, German, Spanish -- and a fifth, user-specific, personally tailored and modified sequence to reflect editorial practice; the dictionary of the text file's vocabulary, its list of unique words, can be here adapted to any hierarchical sequence or equivalence, as the user decides. Every ASCII character can be assigned, by the user, to one of seven types: Upper case, Lower case, Delimiter, DelimLower (marks the end of a word and is a separate word in itself, and can be searched as such), Hyphen, Apostrophe, or Ignore; the text will be indexed and the word-frequency file sorted accordingly.
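The idea of a user-defined character sequence can be sketched in a few lines of Python. This is not WordCruncher's file format: the `BASE_ORDER`, `EQUIVALENTS` and `IGNORED` tables below are hypothetical choices made for illustration, with accented letters treated as equivalent to their base letter and editorial symbols assigned to an "ignore" class.

```python
# Hypothetical sorting sequence and equivalences (illustrative editorial choices).
BASE_ORDER = "abcdefghijklmnopqrstuvwxyz"
EQUIVALENTS = {"é": "e", "è": "e", "ê": "e", "à": "a", "â": "a",
               "ç": "c", "î": "i", "ô": "o", "ù": "u", "û": "u"}
IGNORED = {"*", "¶", "."}  # characters that play no part in sorting


def sort_key(word):
    """Translate a word into a list of positions in the user-defined sequence."""
    key = []
    for ch in word.lower():
        if ch in IGNORED:
            continue
        ch = EQUIVALENTS.get(ch, ch)
        # Characters outside the sequence sort after all known letters.
        key.append(BASE_ORDER.index(ch) if ch in BASE_ORDER else len(BASE_ORDER))
    return key


words = ["separés", "semotus", "ensevrés", "*ab", "ensevrer", "abisme"]
print(sorted(words, key=sort_key))
```

With these tables the marked form `*ab` files under a rather than ahead of the whole alphabet, which is the kind of adjustment to editorial practice that the fifth, user-specific sequence allows.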
The most immediate and useful product of a crunched text is this word-frequency file -- a list of all words found in the text, with the frequency of usage of each, sorted according to the designated character sequence. In addition to being an integral point of access by the program to the text, under the View option, this file can be manipulated as a generic word-processor file in its own right: it can be sorted by frequency, suffix, etc., and it is particularly useful as a proof-reading device, when special attention can be paid to single-frequency occurrences.
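A word-frequency list of this kind, and the list of single occurrences used for proof-reading, can be approximated with a short Python sketch. The function names are hypothetical and the tokenising rule (runs of letters, lower-cased) is a simplifying assumption, not the behaviour of WordCruncher.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count every run of letters in the text, ignoring case."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def single_occurrences(freqs):
    """Words found only once: candidates for checking, not necessarily errors."""
    return sorted(word for word, n in freqs.items() if n == 1)


# Two sub-entries from the printed sample, with the line-break hyphen resolved.
sample = ("Abiuratio .tionis .i. rei credite abnegatio "
          "Abiurtio .tionis idem per sincopam")
freqs = word_frequencies(sample)
print(freqs["tionis"])
print(single_occurrences(freqs))
```

The list only flags candidates. Abiurtio, for instance, appears among the single occurrences, but the text itself identifies it as a syncopated form (per sincopam), so the editor confirms it rather than corrects it.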
In this project our main purpose, which was to provide a traditional (i.e. printed) edition of a medieval manuscript text, was obviously somewhat at odds with the preparation of a text for electronic manipulation. Nonetheless, WordCruncher proved to be a powerful and useful tool, but within prescribed parameters.
Electronic concording programs like WordCruncher create interactive concordances: the user decides what information to retrieve while using them. Printed concordances like the Microfiche Concordance to Old English Literature, on the other hand, are static concordances: the editors decide how to organise the information, and the user can access it only in that way. Interactive concordances are very useful tools within research projects like the Dictionarius Le Ver, but when we want to share our work with other scholars, they are still unsuitable for several reasons.
The first problem is distribution. An interactive concordance requires access to a computer, and if there is software bundled with it, it requires access to a specific type of computer. Scholars cannot simply pull the concordance off a library or bookstore shelf and browse through it, or bring it with them on research trips.
The second problem is the lack of standards among computers. Nearly all computers can exchange simple digits and Latin text using the ASCII or EBCDIC standards, but there is no universally accepted method for exchanging even something as simple as é or a 4-byte machine word (long integer), much less a complex binary file structure like the one used by WordCruncher. Today, an interactive concordance must be bound not only to a single computer, but to a single software package.
When it comes to publishing, static concordances avoid nearly all of the problems of interactive concordances. When they are printed on paper, they require no special technology to use, and they can follow standards of typesetting and book-binding which are already well established. Printed concordances also take advantage of the existing distribution system of book sellers and libraries to reach the largest possible audience, and are easy to bring into research facilities for field work.
Furthermore, looking up a single, complete word in a printed (paper) concordance can be as fast as looking it up in an interactive concordance on a computer. However, there are several serious disadvantages to printed static concordances.
First of all, static concordances always limit the user's choices in ways that interactive concordances do not. If a concordance is in alphabetical order, the user can find all words beginning with b grouped together, but not all words containing or, for example. Static concordances also allow only one way to access each citation: you can find all of the citations containing et and all of the citations containing on, but not the citations which contain both.
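The two kinds of query that a printed alphabetical list cannot answer are easy to state for an interactive index. The following Python sketch is only an illustration: the function names are hypothetical, the index is a plain mapping from each word to the set of citations containing it, and the sample citations are the two French passages quoted for putain later in this paper.

```python
from collections import defaultdict


def build_index(citations):
    """Map each word to the set of citation numbers in which it occurs."""
    index = defaultdict(set)
    for number, text in citations.items():
        for word in text.lower().split():
            index[word].add(number)
    return index


def words_containing(index, fragment):
    """All indexed words containing a given string, wherever it falls."""
    return sorted(word for word in index if fragment in word)


def citations_with_all(index, *words):
    """Citations that contain every one of the given words."""
    sets = [index.get(word, set()) for word in words]
    return set.intersection(*sets) if sets else set()


citations = {
    1: "bastard fil de putain publique de bordel",
    2: "putain qui couche aveuc chescun ribaude",
}
index = build_index(citations)
print(words_containing(index, "or"))
print(citations_with_all(index, "putain", "qui"))
```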
The second problem stems partly from the first. Concordances are very long, and become even longer when one tries to provide more options for the user. Even a simple, alphabetical concordance can be considerably longer than the original text. For example, if you are concording a 200-page text where the average citation is 40 words long, the concordance will be over 8,000 pages long in the same type size. If you add another type of listing, such as reverse spelling, the concordance will be over 16,000 pages long, and so on. Electronic interactive concordances can generate this information as required -- the average user will never need most of it -- but a static concordance must contain it all explicitly.
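The estimate follows from a simple multiplication, sketched here in Python. The function name is hypothetical, and the calculation assumes that every word of the text is a keyword and that each citation is printed in full; keyword headings come on top of this figure, which is why the totals are given as "over" 8,000 and 16,000 pages.

```python
def concordance_pages(text_pages, citation_words, listings=1):
    """Each word of the text heads one citation of `citation_words` words,
    so each listing multiplies the length of the text by that number."""
    return text_pages * citation_words * listings


print(concordance_pages(200, 40))              # one alphabetical listing
print(concordance_pages(200, 40, listings=2))  # plus a reverse-spelling listing
```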
It will usually not be possible to publish an 8,000- or 16,000-page concordance printed on paper. The best alternative is microfiche, as the Dictionary of Old English project has done with its concordance. However, now the users are tied to a microfiche reader, and have already lost one of the greatest advantages of the printed static concordance -- its portability and freedom from technological constraints -- without gaining any of the advantages of interactive concordances. The only remaining advantage is that microfiche readers are more commonly available in libraries than computers. The rest of this paper will explore the options which we have considered at the Dictionarius Le Ver project to generate concise, useful static concordances for publication.
Usually, concordances show keywords in context, either with a fixed number of words on either side or within an entire quotation. The simplest way to generate a smaller concordance is to omit the context altogether. Here is a sample French concordance of an early draft of the Dictionarius Le Ver M section without context:
punir:
1 multo
punis:
1 multo (multatus)
punition:
1 multo (multatio)
pur:
1 merum (merum)
puree:
1 merula (merula)
purement:
1 merax (meraciter)
2 merosus (merose)
3 merus (mere)
purgier:
1 mucus (muco)
purgiés:
1 mucus (mucatus)
purifiés:
1 merax
purs:
1 merax
2 merosus (merosus)
3 merus
putain:
1 manzer
2 multicuba
puterie:
1 meretrix (meretricatio)
A lexicon like the Dictionarius Le Ver is ideal for this sort of concordance. A non-contextual concordance of a novel, for example, would have to list only page numbers, and would be difficult to use because a page contains so many different words. The Dictionarius Le Ver is organised hierarchically by headword and sub-headword, and each sub-headword passage contains only a small amount of French. Furthermore, unlike such references as "page 38" or "Act 3, scene 5", the headword and sub-headword still give a fair bit of useful information about a word's context. Still, in this concordance, we are considering including the surrounding French for more context. In the case of putain, for example, the entries would look like this:
putain:
1 manzer bastard fil de putain publique de bordel
2 multicuba putain qui couche aveuc chescun ribaude
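Both forms of the concordance, with and without the surrounding French, can be produced from the same data. The Python sketch below is an illustration only: it assumes, hypothetically, that the French has already been extracted into records of headword, optional sub-headword and French passage, and the function names are our own invention for this example. The concordances described in this paper were generated with Unix shell tools, not with this code.

```python
from collections import defaultdict


def build_concordance(records):
    """Map each French word to the passages (with their Latin referents) containing it."""
    concordance = defaultdict(list)
    for headword, sub_headword, french in records:
        for word in french.split():
            reference = (headword, sub_headword, french)
            if reference not in concordance[word]:  # a word may recur in one passage
                concordance[word].append(reference)
    return concordance


def format_entry(word, references, context=False):
    """Lay out one keyword in the style of the samples: numbered Latin referents."""
    lines = [f"{word}:"]
    for number, (headword, sub_headword, french) in enumerate(references, 1):
        line = f"{number} {headword}"
        if sub_headword:
            line += f" ({sub_headword})"
        if context:
            line += f" {french}"
        lines.append(line)
    return "\n".join(lines)


records = [
    ("manzer", None, "bastard fil de putain publique de bordel"),
    ("multicuba", None, "putain qui couche aveuc chescun ribaude"),
]
concordance = build_concordance(records)
print(format_entry("putain", concordance["putain"]))
print(format_entry("putain", concordance["putain"], context=True))
```

Keeping the French passage in each reference means that the choice between the short and the contextual listing is made only at the formatting stage.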
Since our first concordance is considerably smaller than the original text, we are able to list other types of information. For example, Le Ver often includes etymologies in his entries, usually Latin or Greek. Since these are all unambiguously marked in the text, we can concord them separately, and study the use of etymology throughout the lexicon. Here is an extract from the concordance of etymons from the same M section:
cedo:
1 matricida
2 morticinus
3 morticinus (morticína)
4 muricida
centaurus:
1 monocentaurus
ceros:
1 monoceros
colera:
1 melan (melancolia)
In this case, the headword and sub-headword alone will often be all the context required, as with colera, in the melancolia sub-entry under the headword MELAN. The etymon concordance is very short, but it still presents a single type of information well.
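Because the etymons are marked, extracting them is a mechanical operation. The Python sketch below assumes that an asterisk before a word marks an etymon, as the printed sample of the A section suggests (for example `ex *ab et *iugo`); that reading of the asterisk, the function name and the record layout are assumptions made for illustration.

```python
import re
from collections import defaultdict

ETYMON = re.compile(r"\*(\w+)")  # assumed convention: *word marks an etymon


def etymon_concordance(entries):
    """Map each marked etymon to the headwords under which it is cited."""
    concordance = defaultdict(list)
    for headword, text in entries:
        for etymon in ETYMON.findall(text):
            if headword not in concordance[etymon]:
                concordance[etymon].append(headword)
    return dict(sorted(concordance.items()))


# Headword lines from the printed sample of the A section.
entries = [
    ("abiugo", "ex *ab et *iugo .gas componitur"),
    ("abiungo", "ex *ab et *iungo componitur"),
    ("abiuro", "ex *ab et *iuro .ras componitur"),
]
for etymon, headwords in etymon_concordance(entries).items():
    print(f"{etymon}:")
    for number, headword in enumerate(headwords, 1):
        print(f"{number} {headword}")
```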
Fortunately for us, Le Ver produced his lexicon in fairly good alphabetical order. However, while the headwords are fully ordered and there are many cross-references, it is sometimes difficult to find where a sub-headword is defined within a headword article. Again, we have marked the sub-headwords in the electronic text, so it is a simple matter to concord them. The final sample concordance is a list of sub-headwords with their corresponding headwords:
emembris:
1 membrum
emembro:
1 membrum
emendo:
1 mando
2 menda
emensus:
1 metior
emergo:
1 mergo
emeritio:
1 meritus
emeritus:
1 meritus
This concordance will be short, but it can be very useful, both for finding sub-headword articles within the lexicon and for studying the structure of the headword articles themselves.
The Dictionarius Le Ver project can also produce short concordances of Latin words, cross-references within the lexicon, cited forms and even marginalia, since we have marked all of these in our text. Rather than producing one large, awkward printed concordance with extensive context, we are concentrating on small, easy-to-use lists. Without the context, the user will have to make frequent reference to the text itself, but the printed (or microfiche) concordances will permit use of the text in many different ways, and in many different places.
We have generated these concordances using standard Unix shell tools, with all of the files in plain ASCII format. This is one of our best defenses against obsolete technology, since an ASCII text file is usually easy to convert to any format. The concordance files themselves are also plain ASCII, although for the sake of this paper we have converted them to proper foreign characters and added boldface and italics.
One day computers will be more standardised and more easily available. The Text Encoding Initiative (TEI), headed by Lou Burnard and Michael Sperberg-McQueen, is working to establish standards which will allow different computers to exchange all types of textual information. Once the new standards are in place and there are programs on the market using them, publishing an electronic interactive concordance will be simple and cheap. Until then, however, static concordances will remain the best option.
Throughout this part of the paper I have been careful to specify printed static concordances. It is also possible to release the text of static concordances in an electronic format, using plain text escape sequences for foreign characters like é. Users will be able to take advantage of their own software (for example, a word processor with macros) to generate new types of concordances from it. There are already good distribution systems in place for electronic texts (as opposed to binary files), such as the Oxford Text Archive and the Usenet computer network. Perhaps this is the best compromise we can make for now -- releasing a static concordance, both as a printed text (on paper or microfiche) and as electronic text for further work by other computer-literate scholars.
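What such escape sequences might look like can be shown with a small Python sketch. The scheme in the `ESCAPES` table (a backslash, an accent mark and the base letter) is invented for this illustration and is not one we have adopted; the sketch also assumes that the text itself contains no backslashes.

```python
# Hypothetical escape scheme: backslash + accent mark + base letter.
ESCAPES = {"é": "\\'e", "è": "\\`e", "ê": "\\^e", "ç": "\\,c"}


def to_ascii(text):
    """Replace each accented character with its plain ASCII escape sequence."""
    return text.translate(str.maketrans(ESCAPES))


def from_ascii(text):
    """Restore the accented characters from their escape sequences."""
    for character, escape in ESCAPES.items():
        text = text.replace(escape, character)
    return text


sample = "desjoins, separés, divisés"
encoded = to_ascii(sample)
print(encoded)
print(from_ascii(encoded) == sample)
```

Since the encoded file contains only ASCII characters, it can pass through any of the exchange channels mentioned above and be restored by the recipient with an equally simple table.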
[1] Editorial note. Web formatting and display limitations lead us to prefer non-proportional characters in place of the original article's proportional characters.