Corpus Linguistics beyond Google: the WebCorp Linguist’s Search Engine

Keywords

linguistics, language use / linguistique, search engine, utilisation de langue

How to Cite

Renouf, A. (2009). Corpus Linguistics beyond Google: the WebCorp Linguist’s Search Engine. Digital Studies/le Champ Numérique, 1(1). DOI: http://doi.org/10.16995/dscn.138


1. Introduction

This paper is offered in tribute to Ian Lancashire as a pioneer in the field of Computing in the Humanities, a driving force who has through his vision and industry made an enormous contribution to each of the inter-dependent areas of text collection, archiving, annotation and access, thereby opening up to the world new fields of study together with a wealth of otherwise inaccessible primary and secondary reference material. In particular, we wish to record our appreciation of Ian's achievement and service in creating a stable of reverse bi-lingual early dictionary resources. These inadvertent English language inventories uniquely reveal contemporary meanings which clarify early quotation and reveal some fundamental present-day misinterpretations. What we linguists share with and appreciate in Ian as literary historian is his methodological unorthodoxy: we have both spent much of our respective careers turning data and process on their heads in the search for greater understanding of the facts of the language.

Computing in the Humanities tends to focus on the domain of "worked" or "analysed" data. This encompasses literary prose, drama and poetry which, though intrinsically 'primary' in nature, has been painstakingly drafted and re-drafted by the writer; reference material, such as bibliographies and inventories; and secondary works, such as annotated editions or dictionaries which have emerged from the two-stage process of data analysis and encoding. In contrast, Corpus Linguistics, a discipline or perhaps a tradition which avails itself heavily of much of the panoply of tools, methods and structures of Computing in the Humanities, sets itself apart by virtue of its traditional focus on "raw" data, and, in the front line, raw data of the non-literary variety. To the extent that this distinction is true (there being samples of literary work incorporated in many of the major English text corpora), it is because corpus-based researchers more usually teach language than literature, and their focus has thus primarily been on the conventions of non-literary speech and writing.

Corpus linguistics began in earnest almost 30 years ago, as technology allowed the digitisation and processing of text in more useful amounts. The early years of fascination simply with the written word incarnate in electronic (e.g. concordanced) form, and the firm belief in the representativeness and thus reliability of a small corpus sample (such as the pioneering BROWN corpus; see Francis and Kucera) to support grandiose statements about the state of the language, have given way to an awareness of the need for larger textual resources and a growing realisation that language use differs fundamentally between one topic and another, one genre and another, and one domain and another. This is leading to a more differentiated approach to text study and incidentally to a new interest in literary text, seen now rather in terms of a range of distinct sub-types of text, each with its own linguistic conventions and characteristics.

At the extreme other end of corpus linguistic activity lies the recent attempt, by individual linguists, to bypass the delays and other constraints entailed in the construction of traditional and ever larger text corpora, and to supplement these resources with immediate access to linguistic usage retrieved from the World Wide Web. Like Lancashire's EMEDD and LEME databases, the web is corpus-like to the extent that it contains stretches and fragments of authentic language use; in particular, the subset of authentic language which is written primarily with the intention of communicating ideas and information. However, like EMEDD and LEME, the web departs from the conventional notion of corpus in many ways, the most controversial for corpus linguists being the fact that the relevant data within it has not been designed to form a balanced whole, and thus cannot be subject to an exhaustive study that can provide objectively reliable, quantified results about the kind of language in question. To which the reply is that neither the Lancashire lexicographic databases nor the web pretend to be anything other than serendipitous data supplements to areas of linguistic sparseness, namely to early English use and to the most recent English use.

Notwithstanding its hybrid and often perverse language use, and the unorthodox nature of its texts and genres, incursions into the web were attempted from the late 90s by enterprising individual linguists, driven by curiosity to seek attestation of hypotheses about usage not represented in corpora elsewhere (e.g. Bergh et al; Brekke, "When 'Empiry' Strikes Back" and "From BNC to the Cybercorpus"). These linguists were commonly historical language experts, searching for supposedly rare or possibly obsolete usage, but there were seekers purely of the newest coinages as well. The necessary conditions for a methodical and reliable search of the web being largely absent, linguists nevertheless assailed this universe of organisational confusion and linguistic heterogeneity with such search tools as came to hand. Faute de mieux, these have been the commercial search engines, since they at least provide access to particular "pages" or texts containing a specified search term. However, the results are less than ideal, both in content and format, for linguistic research purposes.

The potential of the web, and the need for a purpose-built search mechanism, were clear in the mid-nineties, and finally, in 1998, our RDUES team had time to create a very basic search system capable of extracting instances of linguistic usage from the Web. This was installed as a feedback tool at <http://www.webcorp.org.uk/>. Taking account of user comments, the tool was iteratively developed to extract data from web text, and to analyse, format and present this to linguistic researchers and other users in a manner closely resembling the output generated from conventional "raw" text corpora by conventional software. It was not a major conceptual leap. We were early web users, and had been developing innovative linguistic software tools for handling conventional corpora for over eighteen years. The WebCorp project proper was born in 2000, funded for the first four years by the UK Engineering and Physical Sciences Research Council, and it ran its multi-disciplinary course, an iterative and combined effort on the part of linguists, software engineers and lexical statisticians (see Renouf, "WebCorp: providing"; Kehoe and Renouf, "WebCorp: Applying the Web"; Renouf and Mesquiriz, "The Accidental Corpus").

2. A WebCorp First Generation System

The first generation WebCorp system is represented in Figure 1, where the six main search stages are itemised and labelled. Basically, WebCorp is the medium through which user search requests are processed. It interprets the user search terms for the commercial search engine, which finds the URLs for apparently relevant texts. WebCorp then goes to the web texts themselves and passes them through a series of processes of analysis and formatting, to a state where they can be presented back to the user.

Figure 1: WebCorp - diagram of the first generation system

Because the purpose of WebCorp is primarily to supplement and complement the information available to linguistic researchers, it is particularly useful in providing:

  • new words and phrases
  • new uses of sub-word morphemes, words and phrases
  • rare and possibly obsolete words and uses
  • rare and possibly obsolete constructions
  • lexical and phrasal productivity
  • lexical and phrasal creativity

The front-end, which allows users to specify their linguistic requirements in a series of simple search choices, is shown in its original version in Figure 2.[1]

Figure 2: Original WebCorp user interface

The user interface allows users a choice of search engine, with Google generally being the most useful. Values for a series of other parameters can also be specified:

  • case sensitivity - allowing output for NATO and nato to be conflated or separate
  • output format - allowing HTML or plain text output; contexts numbered or not
  • web address - allowing URL display to be interspersed with the textual examples
  • concordance span - specifying sentence format or context length in words
  • number of concordance lines - specifying 5, 10, 50, or unlimited number of lines
  • site domain - optionally specifying domain search, e.g. .uk; www.nytimes.com
  • newspaper domain - specifying particular newspaper site group, e.g. uk tabloids
  • textual domain - specifying particular Open Directory category
  • word filter - specifying words to appear or not on same web pages as search term
  • time span - specifying date of last modification, or time span between given dates
  • collocational analysis - requesting external/internal collocational profile for term
  • number of concordance lines per text page - limiting examples from one writer
  • exclusion of non-textual information - e.g. link text, e-mail addresses
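
Several of these parameters (case sensitivity, concordance span, number of concordance lines) can be illustrated with a minimal key-word-in-context sketch. It is not the WebCorp implementation: the function name `concordance`, its parameters and the bracketed output format are assumptions made purely for illustration.

```python
import re

def concordance(text, term, span=4, case_sensitive=False, max_lines=None):
    """Return key-word-in-context lines: `span` words either side of `term`."""
    tokens = re.findall(r"\S+", text)

    def norm(word):
        # Strip surrounding punctuation; fold case unless the search is case-sensitive.
        word = word.strip(".,;:!?\"'()")
        return word if case_sensitive else word.lower()

    target = norm(term)
    lines = []
    for i, tok in enumerate(tokens):
        if norm(tok) == target:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append(f"{left} [{tok}] {right}".strip())
            if max_lines is not None and len(lines) >= max_lines:
                break
    return lines
```

With case sensitivity switched off, a search for NATO in this sketch returns contexts for both NATO and nato; with it switched on, only the upper-case form is returned.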

Some examples of the kind of information which can be gleaned by applying some of these in the first-generation WebCorp tool will now be presented.

2.1 WebCorp Output: Exploring New Words and Phrases

One of the benefits of WebCorp is, as noted earlier, its capacity to provide access to neologistic activity in text. The phrase wardrobe malfunction, for instance, quickly entered the 7th edition of Collins English Dictionary far in advance of its appearance in designed corpora. This phrase is a euphemism coined on February 1, 2004.[2] The contexts found in UK journalism on the Web in a search in 2005 include those shown in Figure 3:


1. Jackson, revealing her breast in an apparent "wardrobe malfunction".

2. The high spot of the American Football calendar was overshadowed by a "wardrobe malfunction".

3. Failing to get their stories straight about the 'wardrobe malfunction', both Timberlake and Jackson apologised.

4. there's lots of potential in that nice new phrase, "wardrobe malfunction".

5. Britney fell pregnant after a wardrobe malfunction, a lowboy fell on her.

6. A wardrobe malfunction prevented me from leaving the house on Tuesday

7. The only wardrobe malfunction that officials have to worry about is that the sexagenarian's cardigan might get caught on a mike stand and give him an attack of lumbago.


Figure 3: WebCorp output for "wardrobe malfunction" (domain: UK broadsheets)

The web-based data retrieved in Figure 3 demonstrates, through the frequent appearance of the phrase within inverted commas, that it is acknowledged as new and possibly quoted, and also that there is skepticism that the malfunction was accidental and not a publicity stunt. There is the first metalinguistic reference to the term in 4. It is also clear that there is much potential for humour, not just in the particular choice of euphemism, but through the allusive application of the term to quite other contexts and events, as in 5-7 above. These stages are all typical of the treatment received by new coinages (see Renouf, "Shall We Hors-d'Oeuvres?" and "Tracing lexical productivity").

2.2 WebCorp Output: Exploring Rare, Possibly Obsolete Words

The lead time involved in creating designed corpora means that they cannot capture all the latest coinages. Equally, the constraints on their size mean that they probably cannot yield all the elements of vocabulary sought by the user. A case in point is the rare or obsolescent word, the existence and current usage of which might well interest a linguist. On the other hand, taking a sample old-fashioned term, the word repugn, and turning to web-based text, we do find some instances. None emerge from the UK broadsheet newspaper sites, but some do when the filtering requirement is relaxed to allow any texts with a URL ending ".uk" to be searched, as seen in Figure 4. From the UK-wide Web text search, we find that repugn is not so much old-fashioned as largely restricted to Earlier English contexts (e.g. 1-11). The modern-day exceptions are its use in literary critical rhetoric (13), its parodying in weblogs and on newsboards (14), and in dictionary citations (15).


1. "those things which seem to … repugn most manifestly against God's word.

(Ridley, 1550, quoted by Bishop of London 4.11.2005)

2. should such still be…as may repugn against the Royal interest,

(1858-65 Thomas Carlyle: History of Friedrich II of Prussia V 13)

3. desiring you in this one request not to repugn the setting-forth of your own proper studies, (1573 Earl of Oxford letter to Shakespeare)

4. I be not compelled to the thing which my conscience doth repugn or strive against." (1570 John Fox's Acts and Monuments)

5. For it is a sin to withstand and to repugn against his Lord like the sin of idolatry

(1483 transl. William Caxton, First Edition. The History of Saul. Lives of the Saints compiled by Jacobus de Voragine, Archbishop of Genoa, 1275)

6. Yea, to repugn against his voice is as evil as the sin of soothsaying

(1562 Cranmer, Ridley et al, Homily 7: 'Against swearing and perjury')

7. let the others repugn as they will (1837 Thomas Carlyle The French Revolution)

8. not a few have known how to repugn with apt checks the bites of others,

(1903: English translation J.M. Rigg of The Decameron by Giovanni Boccaccio)

9. Nature, in horrible throes, will repugn against such substitution

(1850: Thomas Carlyle Latter Day Pamphlets: No. II. MODEL PRISONS)

10. the thing which my conscience doth repugn or strive against.

(1757, 1758: The Journal Of John Woolman Chapter V )

11. I do marvel greatly that ever any man should repugn or speak against the scripture (1531 William Tyndale A Pathway Into The Holy Scripture)

12. you're aiming to repugn the presumptions of your interlocuter.

(10.09.2003: Rex: ISCID Forums: Topic: Cosmogony, Holography and Causality)

13. Olson's Kingfishers was a deliberate effort to repugn the hopelessness of The Wasteland. (March 2004 as Global Voices Radio)

14. a force field to repel, repulse and repugn any or all of the assaults on your good time (May 25, 2005 Rob Thurman weblog)

15. [Verbs] resist; not submit; repugn, reluct, reluctate, withstand; stand up against,

(http://thesaurus.reference.com/roget/)


Figure 4: WebCorp output for repugn (domain:.uk)

Those characteristics are specific to the base form repugn, as a separate search carried out for repugned showed.[3] The latter inflection was found, in August 2006, to be used on web pornographic sites and chatrooms, with only a couple of instances of earlier English, and one instance of current English use.[4]

Another rare term, usedn't, is found not in UK broadsheets but only in a looser .uk search, and there only 7 times, six of which are cited metalinguistically in dictionary entries or discussion lists, and once in a quote from P.G. Wodehouse's (1922) "Bertie Changes His Mind" in Carry On, Jeeves.

  • "Makes a fellow feel a bit of an ass, what? I shouldn't wonder if they usedn't to stare at you from time to time, too, eh?"

2.3 WebCorp Output: Exploring Contiguous Variable Phrases

It is possible, using Google, to search for lexical patterns, but very difficult if one is particularly interested in the variability and creativity of phrases. With WebCorp, it is possible to use a filtering option to suppress key terms in the phrase, thereby forcing creative alternatives to present themselves. To illustrate this, let us take some quotations from Shakespeare's oeuvre which are now integral to everyday English, in order to see whether and how they are used in UK-based web text.

The first quotation: Shall I compare thee to a summer's day? Thou art more lovely and more temperate… originates in Sonnet 18. The search string is reduced to its likely key elements "compare thee to", and the lexical terms summer's or Summer's are suppressed, in order to encourage the lexical variation presented in Figure 5:


1. Shall I compare thee to a Sony Walkman, thou art more compact and more -

2. Shakespeare's famous cheese sonnet: Shall I compare thee to a Dairylea?
Thou are more creamy and more temperate

3. Shall I compare thee to Olivier? Thou art more luvvy and intemperate

4. Shall we compare thee to a perfect pair? Thou art more unlike and more obstinate

5. Shall I compare thee to a broken dream? Thou art so ugly in seasons all

6. Shall I compare thee to a full ashtray? Thou art more smelly and do nauseate

7. Shall I compare thee to a pint of Shires? Thou'rt more full-bodied and more sweet

8. Shall I compare thee to a feathered friend? Or to a roasting fowl

9. Iraq as the birthplace of civilisation, 'Shall I compare thee to old Sumer's day?'

10. Shall I compare thee to a day in Slough?

11. 'Shall I compare thee to a lettuce leaf?', William Shakesalad

12. Shall I compare thee to a European lab?

13. Shall I compare thee to a Nazi regime?

14. Shall I compare thee to a Minkowski-Ricardo-Leontief-Metzler matrix?

15. Shall I compare thee to the best in the business - or to the budget?

16. Shall I compare thee to a 52 inch Plasma with High-Definition optics

17. Shall I compare thee to a box of chocolates?

18. Shall I Compare Thee to a God-like Don? He was the perfect God-like Don,

19. 'Shall I compare thee to a worn PA?'

20. Shall I compare thee to an OS2-free Intel box

21. Shall I Compare Thee To A Pressure Wave?

22. Shall I compare thee to a Panda called Steve?

23. Should I compare thee to a scouser child?

24. Might I Compare thee to a...... Barbie doll

25. Why, I did but compare thee to some of the birds that are of the brisker sort


Figure 5: WebCorp output for "compare thee to"; suppression of words summer's and Summer's; domain: .uk

One could make much of the data in Figure 5 in a fuller analysis, but to give a brief sense of what is uniquely revealed here, we see that lines 1-7 re-work the first two lines in their entirety. They indicate that the phrase is basically recognisable by the key signal "compare thee to." Lines 8 onwards focus just on the first line; lines 23-25 depart from the modal choice of shall. In all cases, the variants are obviously tailored lexically and referentially for a particular context, and all seek to evoke a humorous or ironic response. Interestingly, and consistent with our observation elsewhere in the area of phrasal creativity (Renouf, 2007b), several of these instances function as titles, headlines and opening sentence starts in news articles, since they are eye-catching and appealing to the target reader.

A second Shakespearean quotation taken to demonstrate WebCorp's ability to reveal a range of phrasal creativity is the proverb brevity is the soul of wit, from Hamlet. This time, the search takes place without the suppression of key words, while the word wit was not specified in the search pattern. The results appear in Figure 6:


1. There's just one problem: it shouldn't exist. Brevity is the soul of Twenty20.

2. To those who say 'brevity is the soul of witt' I say 'not in a revieww'.

3. 'Brevity is the soul of lingerie' Dorothy Parker

4. suddenly I discover that brevity is the soul of wanting more.

5. brevity is the soul of detection

6. brevity is the soul of this particular erotic art

7. You evidently think that brevity is the soul of widowhood

8. since brevity is the soul of lexicography, suffice it to examine the key examples

9. Since brevity is the soul of wit, I will be brief:

10. If brevity is the soul of wit, then I'm a complete dumbass

11. If brevity is the soul of wit, then Cervantes is up there with Proust!

12. If brevity is the soul of wit, then the short story is very droll indeed

13. If brevity is the soul of wit, these are clever tunes

14. If brevity is the soul of wit, it also makes for a powerful ingredient in books


Figure 6: WebCorp output for "brevity is the soul of"; domain: .uk.

In Figure 6, we see what is a fairly predictable range of lexical variation in wordplay. What is interesting in lines 10-14 is the fact that the pattern is presented as a premise to which a conclusion is added, and thus departs from the original text source.[5]

2.4 WebCorp Output: Exploring Discontinuous Variable Phrases

A function of WebCorp which is certainly not provided by Google is the facility to search for discontinuous phrasal patterns. WebCorp supports this by accepting a "wildcard" (*) in the place of each word or suffix in the phrase. Figure 7 shows the output yielded for the phrase from Hamlet: "to be or not to be". Here, the key lexical item, be, is suppressed to encourage maximum creativity.


1. Risk factors in the 21st century. Diagnosis - to biopsy or not to biopsy

2. to hdparm or not to hdparm?

3. Monday 6 February 2006. Badgers: to cull or not to cull?

4. Walters, S.A. End-user document delivery services: to mediate or not to mediate

5. Conservation News, 52: Moore, S J. Bones - to degrease or not to degrease?

6. Skills News round-up Brief exchange: to legislate or not to legislate?


Figure 7: WebCorp output for to * or not to; suppression of be; domain: .uk

The results shown in Figure 7 are relatively sparse. As in Figure 5, they show that the creative variants function as titles or text-openers; also that the statement's completion, that is the question, is often omitted. In lines 1 and 2, the substitute verb is actually a vogueish noun-verb conversion, which brings a sense of modernity lacking in be.

From Romeo and Juliet, we see in Figure 8 the output generated by the quotation "What's in a name? A rose by any other name would smell as sweet." In order to encourage some creativity, this time the search string was rendered in reduced form, as "a * by any other name would," and the central node rose was suppressed. The textual domain was specified as any URL ending in .uk.


1. What's in a name? That which we call a hose, by any other name would smell of feet.

2. What's in a name? That which we call a spammer by any other name would still be scum.

3. That which we call a DVR by any other name would still smell like a commercial killer?

4. "What's in a name? That which we call a fart by any other name would smell as sweet".

5. a fart by any other name would sound as sweet

6. a Getz by any other name would smell sweeter and could probably increase sales by 15%,

7. Besides, a bar by any other name would smell as sweet, or something like that.

8. But a fool, by any other name, would get you another set?

9. But a tax by any other name would smell as foul.

10. the Brussels sprout? A sprout by any other name would be more attractive


Figure 8: WebCorp output for "a * by any other name would"; suppression of rose; domain: .uk

A point of interest in the output in Figure 8 is that the core residual words in the creative play, those which are presumably thought by the writer to guarantee recognition of the allusion, seem not to be the lexical term rose, but the string "by any other name would." Further searches show that the string "by any other" alone is insufficient to allude to the source phrase except in one case.[6]
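
The combination of a one-word wildcard with a suppressed term, as used in the searches of this section, can be approximated with a regular expression. The following is a minimal sketch under stated assumptions: the helper name `wildcard_to_regex` is hypothetical, the wildcard is handled at whole-word level only, and nothing here is taken from the WebCorp code itself.

```python
import re

def wildcard_to_regex(pattern, suppressed=()):
    """Compile a phrase pattern in which each '*' stands for exactly one word.

    Words listed in `suppressed` are not allowed to fill a wildcard slot.
    """
    parts = []
    for tok in pattern.split():
        if tok == "*":
            if suppressed:
                banned = "|".join(re.escape(w) for w in suppressed)
                # Negative lookahead rejects a suppressed word in the slot.
                parts.append(rf"(?!(?:{banned})\b)(\w[\w'-]*)")
            else:
                parts.append(r"(\w[\w'-]*)")
        else:
            parts.append(re.escape(tok))
    # Allow any run of spaces or punctuation between the words of the phrase.
    return re.compile(r"\b" + r"\W+".join(parts) + r"\b", re.IGNORECASE)
```

For example, `wildcard_to_regex("to * or not to", suppressed=("be",))` matches "to cull or not to" but skips "to be or not to". Because punctuation is tolerated between words, a line such as "a fool, by any other name, would" is also matched by the pattern "a * by any other name would".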


2.5 WebCorp Output: Exploring "Internal" Collocates

As demonstrated in Section 2.4 above, the use of a wildcard symbol to replace a word in a search phrase allows the retrieval of discontinuous phrases. Within the output, a range of lexical variants can occur in that slot. In order to have an overview of these, we have contrived the notion of "internal collocates" (Renouf, "WebCorp: providing"), and the team has developed a means of calculating and displaying these words. An example of this is shown in Figure 9. The phrase from Hamlet: "The lady doth protest too much, methinks" was submitted in reduced form as the search string "lady doth * too much," in which the word "protest" had further been suppressed.

Word         Total   Position 1
profess      3       3
complain     3       3
assume       2       2
protesting   2       2
protect      2       2
bitch        2       2

Figure 9: Top Internal Collocates by Wildcard Position (excluding stopwords) for search term "lady doth * too much;" suppressed "-protest"
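
A count of this kind can be sketched as a frequency tally over the words that fill the wildcard slot. The function name `internal_collocates` and the restriction to a single wildcard are assumptions made for illustration; stopword filtering, which the figure applies, is left out for brevity.

```python
import re
from collections import Counter

def internal_collocates(texts, pattern, suppressed=()):
    """Tally the words found in the single '*' slot of `pattern`."""
    banned = {w.lower() for w in suppressed}
    parts = [r"(\w[\w'-]*)" if tok == "*" else re.escape(tok)
             for tok in pattern.split()]
    regex = re.compile(r"\b" + r"\W+".join(parts) + r"\b", re.IGNORECASE)
    counts = Counter()
    for text in texts:
        for match in regex.finditer(text):
            word = match.group(1).lower()
            if word not in banned:  # drop the suppressed original word
                counts[word] += 1
    # Most frequent slot-fillers first.
    return counts.most_common()
```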

2.6 WebCorp Output: Exploring "External" Collocates

The WebCorp tool can produce collocational profiles of the words occurring within a window of 4 spaces to left and right of the search term. A sample is presented in Figure 10 for the term wireless, obsolescent in reference to the radio, but recently active in relation to the internet. This output is derived simply from the totality of words present on those web pages that happened to be consulted by the tool in a single search, and so falls short of the statistically reliable output that comes from a large, designed corpus of a known size. However, the frequencies from this web search do allow the user to see what the predominant patterns are in simple frequency terms, and the inventory of words explains the reason for the renewed appearance of wireless in today's text.[7]

Word         Total   Left counts (L4-L1)   Right counts (R1-R4)   Left Total   Right Total
internet     29      2                     26, 1                  2            27
Cable        22      22                    -                      22           0
access       21      3                     4, 14                  3            18
networking   17      -                     16, 1                  0            17
minutes      14      14                    -                      14           0
economy      14      -                     14                     0            14
Times        13      2, 1                  10                     3            10
technology   12      -                     7, 3, 1, 1             0            12
network      11      2                     7, 2                   2            9
Online       11      -                     10, 1                  0            11
site         10      4, 1, 1, 4            -                      10           0
Filed        8       -                     1, 7                   0            8

Non-empty cells are listed in column order; empty positions are not shown.

Key Phrases: Cable wireless; minutes wireless; site wireless; wireless internet; wireless networking; wireless economy; wireless technology; wireless network; wireless access

Figure 10: Top external collocates of "wireless" (grammatical items excluded)
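
The positional profile behind such a table can be sketched as follows. This is an illustrative approximation only: the function names, the simple tokenisation and the stopword handling are assumptions, and the counts come from whatever texts are passed in, just as the WebCorp figures come from whichever pages a single search happened to consult.

```python
import re
from collections import Counter, defaultdict

def external_collocates(texts, node, window=4, stopwords=frozenset()):
    """Count words by position (L4..L1, R1..R4) around each occurrence of `node`."""
    node = node.lower()
    table = defaultdict(Counter)  # word -> Counter of position labels
    for text in texts:
        tokens = re.findall(r"[\w'-]+", text.lower())
        for i, tok in enumerate(tokens):
            if tok != node:
                continue
            for offset in range(-window, window + 1):
                j = i + offset
                if offset == 0 or not 0 <= j < len(tokens):
                    continue
                word = tokens[j]
                if word in stopwords or word == node:
                    continue
                label = f"L{-offset}" if offset < 0 else f"R{offset}"
                table[word][label] += 1
    return table

def summarise(table):
    """Return (word, total, left total, right total) rows, most frequent first."""
    rows = []
    for word, positions in table.items():
        left = sum(n for label, n in positions.items() if label.startswith("L"))
        right = sum(n for label, n in positions.items() if label.startswith("R"))
        rows.append((word, left + right, left, right))
    return sorted(rows, key=lambda row: (-row[1], row[0]))
```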

2.7 WebCorp Output: Time Series Graphs

The renaissance of the term wireless is confirmed and interpreted by a time series graph produced by our APRIL software (Renouf et al, "Monitoring Lexical Innovation"), also available on our website, which shows the user that this term enjoys a surprising degree of use, and that it takes a steep upward turn from late 1999 through 2000, declining again from 2001.

Key: Dotted Line = Frequency per million words
Solid Line = Moving Average

Figure 11: Time Series Graph for wireless, 1989-2005
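
The two quantities named in the key to the graph can be computed as in the sketch below. The smoothing actually used by the APRIL software is not specified here, so the trailing window and its default length of three periods are assumptions, as are the function names.

```python
def per_million(count, corpus_tokens):
    """Normalise a raw count to a frequency per million words."""
    return count * 1_000_000 / corpus_tokens

def moving_average(values, window=3):
    """Trailing moving average; early points average over what is available."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1):i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

Normalising by the size of each time slice matters because the amount of text collected varies from period to period; raw counts alone would partly reflect the volume of text rather than the use of the word.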

2.8 Moving on from the WebCorp First-Generation System

The move from dealing with the conventional finite corpus to treating the open-ended web as a corpus brought with it new information but also new problems. These problems involved the inherent chaos of the web and ranged from the aberrant definition of a text, to the miscellany of information on what passed for a "page," to the absence of reliable punctuation or dating, to the need for multiple language identification. Through the years 1999-2004, we progressively dealt with these matters, resolving them either by elegant means or by simple-minded heuristics. Over those years, many linguists took to using WebCorp instead of Google,[8] and the vast quantities of user-feedback elicited from our web-based prototype WebCorp tool helped to identify the major snags and shortcomings experienced in the system by our thousands of regular users.

But the era of the Web as Corpus was set to move on. Several remaining shortcomings were ineradicable within the architecture as it stood. These included the inability to apply a full inventory of functions to optimise variable pattern search, the limitations on analytical processing requiring linguistic pre-annotation, and on the post-processing of output to produce statistical information. All these problems could be laid at the door of the commercial search engines on which the WebCorp system piggy-backed. As we said, in the early days, the only route to web texts was via the commercial search engines. This happily allowed WebCorp to conduct simple searches on the URLs containing the search term specified. However, as time has gone on, the interests of search engine companies and linguist-users have diverged. The former have firmly in their sights their commercial users, typically non-linguists who do not require sophisticated pattern search or frequency information. Meanwhile, linguists have become increasingly sophisticated in corpus-based searching, both on raw and analytically encoded and tagged text. A third interest group has also appeared: classroom teachers with a desire for simultaneous class use, engendered by the increasing availability of computer clusters.


3. WebCorpLSE - the Second-Generation WebCorp Linguist's Search Engine

The mismatch between user desire and search-engine limitation outlined above points to a solution, the exposition of which will be the topic of the remainder of this paper. The idea is to create an independent search engine, into which are incorporated the tools and functionality to support an improved Web text linguistic search system - improved in depth, delicacy, versatility, speed, and scale.

The new linguistic search engine is being created within our research unit. Stages in the software-technical work involved are catalogued in Kehoe and Gee, "New corpora from the Web." Of course, a search engine is not "linguistic," per se, but a series of software modules which go to specified "pages" on the web, and extract the associated strings of bytes. A model of the new system has been drawn up by Kehoe as follows (where WWW = world wide web; D/B = database; s/w = software; GUI = graphical user interface):

Figure 12: Architecture of the WebCorp Linguist's Search Engine

The components of the new linguistic search engine system are as follows:

• web crawler

• parser / tokeniser

• indexer

• WebCorp tools

• WebCorp user front-end

• more, also off-line, linguistic processing tools

The components which make the search engine "linguistic" are 3 and 4 in Figure 12; they comprise a series of existing corpus-analytical tools which process and post-process the text extracts which have been retrieved by the crawler at stage 1, and tokenised and indexed at stage 2.
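
The tokenising and indexing stages can be pictured with a toy positional index. This is a sketch of the general technique only, not of the WebCorpLSE architecture: the function names are hypothetical, the crawler is replaced by a dictionary of already-downloaded pages, and the tokeniser is deliberately crude.

```python
import re
from collections import defaultdict

def tokenise(text):
    """Split text into lower-cased word tokens."""
    return re.findall(r"[\w'-]+", text.lower())

def build_index(pages):
    """Map each token to the (url, position) pairs where it occurs."""
    index = defaultdict(list)
    for url, text in pages.items():
        for pos, tok in enumerate(tokenise(text)):
            index[tok].append((url, pos))
    return index

def find_phrase(index, phrase):
    """Return (url, start position) for each contiguous occurrence of `phrase`."""
    words = tokenise(phrase)
    if not words:
        return []
    later = [set(index.get(w, ())) for w in words[1:]]
    hits = []
    for url, pos in index.get(words[0], ()):
        # Every later word must appear at the next position on the same page.
        if all((url, pos + k + 1) in s for k, s in enumerate(later)):
            hits.append((url, pos))
    return hits
```

Holding word positions in a local index is what makes possible the pattern searches and frequency counts that cannot be obtained by passing queries through a commercial search engine.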

4. New Information Provided by the New WebCorpLSE System

In this section, we shall illustrate some of the benefits that are being provided through the increased functionality and capacity of the new WebCorpLSE system.

4.1 Word-initial Wildcard Search

The productivity of English news text is rather stable in terms of individual derivational morphemes. Rankings for the composition of new words accumulated over 17 years show that most affixes continue to contribute an approximately constant number of new words to the evolving lexicon, with some affixes, notably the older ones like un- and in-, holding prime position. However, there are some vogue affixes which stand out. These include the loosely synonymous -fest and -athon.

With first-generation WebCorp, we can observe the behaviour of prefixes, since it is only at the end of a word that a wildcard can currently be placed - in itself already an achievement, given the absence of wildcards in commercial search engine retrieval. With second-generation WebCorp, we can offer word-initial wildcard search, so that suffixal behaviour may also be examined. Taking the vogue suffix -fest, a Germanic calque, we find the following productivity in Figure 13:


1. Public gore- fest as heart surgery broadcast live

2. the resulting stare- fest feels consistent rather than monotonous

3. four-day geek fest in the Catalan capital for the latest in mobile telephony.

4. last year's multiplayer online zombie- fest set in a T-Virus-riddled Raccoon City.

5. It 's hard to convey what the England captain added to the lunchtime banter-fest

6. The previous run had seen the quirkfest pre-empted and rescheduled

7. 'Cheaper By The Dozen 2' is the sequel to a 2003 kinderchucklefest

8. this gruesome ya-ya sisterhood hugfest starring Toni Collette and Cameron Diaz

9. Henry embraced his abuser during the Soccer Against Racism swankfest

10. a trend established by the public blubfest surrounding the death of Princess Diana


Figure 13: WebCorpLSE output for *fest, domain: UK broadsheets
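One common way of supporting a word-initial wildcard such as *fest is to store each word form reversed, so that a suffix search becomes a prefix search over a sorted list. The sketch below assumes this technique for illustration only; the paper does not state how WebCorpLSE implements the feature, and the function names and the small word list are hypothetical.

```python
import bisect


def build_reversed_lexicon(words):
    """Return a sorted list of reversed word forms."""
    return sorted({w[::-1] for w in words})


def suffix_search(reversed_lexicon, suffix):
    """Return all words ending in `suffix`, in alphabetical order."""
    key = suffix[::-1]
    # All reversed forms starting with `key` are contiguous in the sorted list.
    start = bisect.bisect_left(reversed_lexicon, key)
    matches = []
    for rev in reversed_lexicon[start:]:
        if not rev.startswith(key):
            break
        matches.append(rev[::-1])
    return sorted(matches)


lexicon = build_reversed_lexicon(
    ["quirkfest", "hugfest", "swankfest", "blubfest", "festival", "manifest", "broth"]
)
hits = suffix_search(lexicon, "fest")
```

In this example the purely formal match also returns manifest, which suggests that output of this kind still needs linguistic filtering before it can be read as evidence of suffixal productivity.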

4.2 Any Number of Words Specified in the Wildcard Position

In a discontinuous phrasal search, the full range of phrasal variability and creativity should ideally be captured in a single query. The new wildcard sets a maximum number of words for the variable slot by the simple coding *2, meaning "up to two words may occur in the asterisked slot."


1 wd

1. the play simply proves that too many kooks spoil the broth.

2. the old aphorisms hold true: too many vats spoil the broth.

3. Do too many cooks spoil the broth, or many hands make light work?

4. Too Many Crooks Spoil the Broth

5. Too many toadies spoil the interview

2 wds

1. Do too many peer reviewers spoil a scientific paper?

2. Too many Chelski players spoil the broth

3. what's that old saying... too many French chefs spoil the onion soup

4. too many time wasters spoil ebay for all of us


Figure 14: WebCorpLSE output for "too many *2 spoil", domain .uk
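The behaviour of the *2 wildcard can be approximated by translating the query into a regular expression. The following is a sketch under stated assumptions, not the WebCorpLSE query parser: the function name `compile_query` and the definition of a word are ours, and we assume that *N stands for between one and N words, as the "1 wd" and "2 wds" groupings in Figure 14 suggest.

```python
import re

# Assumed definition of a word: letters, digits, apostrophes and hyphens.
WORD = r"[\w'-]+"


def compile_query(query):
    """Compile a query such as 'too many *2 spoil' into a regex.

    A bare '*' stands for exactly one word; '*N' stands for one to N words.
    """
    parts = []
    for term in query.split():
        m = re.fullmatch(r"\*(\d*)", term)
        if m:
            n = int(m.group(1) or 1)
            # One word, optionally followed by up to n - 1 further words.
            parts.append(rf"{WORD}(?:\s+{WORD}){{0,{n - 1}}}")
        else:
            parts.append(re.escape(term))
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)


pattern = compile_query("too many *2 spoil")
match = pattern.search("Too many Chelski players spoil the broth")
```

Here `match.group(0)` is the span from "Too" to "spoil", while a sentence with three or more words in the variable slot would not be retrieved.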

4.3 Improved Search by Date

The text-dating information on the Web is currently inadequate for diachronic study. There is a dating-provision option, but it is either disregarded or filled with ambiguous information. What can almost be guaranteed is that it does not reveal the date of composition, which is the vital piece of information for linguistic researchers. The Semantic Web initiative will hopefully contribute to an improvement in that respect. Meanwhile, we have developed algorithms which search, match and rank conflicting dates in and around the text. With the greater control over the texts downloaded, still further, perhaps language internal, clues are capable of exploitation. We present chronologically-ordered information in Figure 15 for the recent term phishing, which emerged in and around 2002-2003.

01/01/2003: of such a fraudulent email campaign - a phenomenon known as phishing - which increased by 400 per cent over the Christmas period

21/07/2003: the latest form of e-mail scam called "brand spoofing," "carding" or "Phishing." The official-looking messages tell recipients

01/01/2004: Comments » --> June 24th, 2004 Phishing Attacks Prove Popular

01/01/2004: Passwords for Protection against Real-Time Phishing Attacks. SHA1 Collisions can be

01/01/2004: Simple steps to avoid being phished. Phishing, pronounced as "fishing", is

05/06/2004: May 6, 2004. The Cost of Phishing Hits $1.2 Billion

Figure 15: WebCorpLSE chronological output for phishing, domain .uk
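The idea of ranking conflicting date evidence can be illustrated by a deliberately simple scoring scheme. This is not the algorithm used in WebCorpLSE, which is not specified here: the evidence sources, their weights, the function name `rank_dates` and the sample candidates are all hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical trust weights per evidence source; not taken from the paper.
SOURCE_WEIGHT = {"body_text": 3, "url": 2, "meta_tag": 1, "http_header": 0.5}


def rank_dates(candidates):
    """Rank candidate dates by summed source weight, best first.

    `candidates` is a list of (source, date) pairs. A date supported by
    several sources accumulates their weights; ties go to the earlier date.
    """
    scores = Counter()
    for source, d in candidates:
        scores[d] += SOURCE_WEIGHT.get(source, 0)
    return sorted(scores, key=lambda d: (-scores[d], d))


candidates = [
    ("http_header", date(2009, 3, 1)),  # server date, assumed to reflect retrieval
    ("meta_tag", date(2004, 1, 1)),     # placeholder-style metadata
    ("body_text", date(2004, 6, 24)),   # date printed in the text itself
    ("url", date(2004, 6, 24)),         # date embedded in the page address
]
ranked = rank_dates(candidates)
```

Under these assumed weights, the date attested both in the body text and in the address outranks the metadata and header dates.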

4.4 Specification of Sentence Position and Punctuation

Here, let us take the dual transitive phrasal verb put up with. We are now able to specify a search for this phrasal verb in clause or sentence-final position, followed by a full-stop or equivalent. The search addresses an abiding concern of linguists: preposition stranding. That is to say, we seek only cases in which the preposition with is positioned at the end of a clause or sentence. The results for this search appear in Figure 16.


1. she's sympathetic to me, considering all the infidelities she's had to put up with.

2. those you fancy and even disagreeable folks you're obliged to put up with.

3. The kind of stuff some have to put up with.

4. we reckon they already have enough to put up with.

5. this was more of an insult than I could put up with.

6. Rather, they are something we put up with.

7. this battle is another sign of how much Israel has to put up with.

8. taking the piss to see how much the fans would put up with.

9. the crowning glory was the sledging Bush had to put up with.

10. The dog told him what he had to put up with.

11. it's no more than what many people have to put up with.

12. 'There's no way I would have put up with what he's put up with.


Figure 16: WebCorpLSE output for "put up with"

This output represents a crucial improvement in the kind of information which is available over and above what was previously accessible using commercial search engines.
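A positional constraint of this kind can be sketched with a regular expression that accepts the phrase only when sentence-final punctuation follows. The function name `final_position_pattern` and the set of terminators are assumptions for illustration; the sketch works on raw strings and is not the WebCorpLSE implementation.

```python
import re


def final_position_pattern(phrase, terminators=".!?"):
    """Match `phrase` only when followed by sentence-final punctuation."""
    words = r"\s+".join(re.escape(w) for w in phrase.split())
    # The lookahead checks for the punctuation mark without consuming it.
    return re.compile(
        rf"\b{words}(?=\s*[{re.escape(terminators)}])", re.IGNORECASE
    )


pattern = final_position_pattern("put up with")
stranded = pattern.findall(
    "There's no way I would have put up with what he's put up with."
)
```

Only the second occurrence in the example sentence is returned, since the first is followed by its object rather than by a full stop.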

4.5 Lexico-Grammatical Specification

It is clear from the above that such a lexical search also lends itself to a lexico-grammatical specification. This is possible with the new WebCorpLSE, where grammatical tagging is optionally available. This renders the search string in Figure 14, "too many *2 spoil," as [too many NP spoil], and the search string in Figure 16 as something along the lines of: [NP + VB + put up with]. The replacement of real-time post-processing by pre-tagging of downloaded web text allows such output to be produced more efficiently.
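A search of the form [too many NP spoil] over pre-tagged text might be sketched as follows. We assume, for illustration only, that tokens carry Penn Treebank-style tags and that a noun phrase can be approximated as optional adjectives followed by one or more nouns; the hand-tagged example, the `NP` pattern and the function names are hypothetical and do not represent the WebCorpLSE tagset or query syntax.

```python
import re


def to_tagged_string(tagged_tokens):
    """Encode (word, tag) pairs as a 'word/TAG word/TAG ...' string."""
    return " ".join(f"{word.lower()}/{tag}" for word, tag in tagged_tokens)


# Crude noun phrase: any adjectives (JJ), then one or more nouns (NN or NNS).
NP = r"(?:\S+/JJ )*(?:\S+/NNS? )+"
QUERY = re.compile(rf"too/\S+ many/\S+ ({NP})spoil/\S+")


def find_np(tagged_tokens):
    """Return the words filling the NP slot, or None if there is no match."""
    m = QUERY.search(to_tagged_string(tagged_tokens))
    if m is None:
        return None
    return [item.rsplit("/", 1)[0] for item in m.group(1).split()]


# Hand-tagged example; a real system would take tagger output instead.
sentence = [
    ("too", "RB"), ("many", "JJ"), ("French", "JJ"), ("chefs", "NNS"),
    ("spoil", "VBP"), ("the", "DT"), ("onion", "NN"), ("soup", "NN"),
]
slot = find_np(sentence)
```

Unlike the word-count wildcard, the grammatical slot places no fixed limit on length but excludes fillers that are not nominal.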

5. Concluding Remarks

This paper, dedicated to Ian Lancashire, has attempted to show that our work on WebCorp and the WebCorp Linguist's Search Engine (WebCorpLSE) shares some of the significance of his work with EMEDD and LEME, in that each represents an innovative move in computing technology: an original adaptation, an augmentation and a development of existing technology. In each case, the unorthodox approach taken has transformed it into an easily accessible and dynamic e-resource.

It is perhaps useful to cite some of the observations made by other contributors to this volume in order to highlight some of the particular purpose and achievement of WebCorpLSE. Willard McCarty refers to "the operations of finding what is already to be found," and this accurately characterises the aim and function of WebCorpLSE, if one interprets it to mean that it provides the means of discovering the linguistic treasures buried in the Web. Russon Wooldridge says that CALL (computer-aided language learning) has changed quite dramatically in the last few years, due mainly to three factors: universal access to the multiple and varied language resources of the web; the economic and logistical cost of local-network teaching software; and "the inappropriateness of fixed here-and-now lab solutions to the need for independent learning." Wooldridge speaks of language learning rather than research, but these same three factors have also motivated the development of WebCorpLSE. John Bradley, meanwhile, refers to the recent findings in part of the TAPoR project, where there was "a sense of disappointment that what was currently available (in the way of textual criticism and computing tools) seemed to be largely unsuitable for the kind of work the scholar wanted to do." It is precisely this shortcoming which WebCorpLSE seeks to address. William Winder says that "a tagged text is reassuring in that it gives the impression that textual properties can be reduced to a Boolean algebra of classes, but something is missing, and it may be precisely what makes texts tick." Winder also states that grammars only exist as "reflections of the language they are destined to describe." We concur. It is this way of thinking that has led to our building into WebCorpLSE the flexibility to move between tagged data and raw text.

With these observations, we draw to a close, in the hope that the paper has demonstrated some of the important ways in which WebCorpLSE contributes to the field of corpus linguistics and, like Lancashire's EMEDD and LEME, advances the scale and delicacy of humanities-related textual search technology.



Works Cited

Bergh, G., A. Seppaenen, and J. Trotta. "Language Corpora and the Internet: A joint linguistic resource." Explorations in Corpus Linguistics. Ed. A. Renouf. Amsterdam: Rodopi, 1998. 41-56. Print.

Brekke, M. "When 'Empiry' Strikes Back: A Corporal Confrontation." Norway: Norwegian School of Economics, 1999. Print.

─── . "From BNC to the Cybercorpus: A Quantum Leap into Chaos?" Corpora Galore. Amsterdam: Rodopi, 2000. 227-47. Print.

Francis, W. N. and H. Kucera. Brown Corpus Manual of Information. Providence, Rhode Island: Brown U, 1964. Print; rpt. 1971 and 1979.

Kehoe, A. and A. Renouf. "WebCorp: Applying the Web to Linguistics and Linguistics to the Web." Proceedings of 11th International World Wide Web Conference. Honolulu, Hawaii, 7-11 May 2002. Web. <http://www2002.org/CDROM/poster/67/>.

─── . and M. Gee. "New corpora from the web: making web text more 'text-like.'" Towards Multimedia in Corpus Studies. Ed. P. Pahta, I. Taavitsainen, T. Nevalainen, and J. Tyrkkö. U of Helsinki, 2007. Web.

Renouf, A.J. "WebCorp: providing a renewable data source for corpus linguists." Extending the scope of corpus-based research: new applications, new challenges. Ed. S. Granger and S. Petch-Tyson. Atlanta, GA: Rodopi, 2002. Print.

─── , Kehoe, A., and D. Mezquiriz. "The Accidental Corpus: issues involved in extracting linguistic information from the Web." Proceedings of 21st ICAME Conference, University of Gothenburg, May 22-26 2002. Ed. K. Aijmer and B. Altenberg. Atlanta, GA: Rodopi. 2004. Print. 404-19.

─── . "Shall we Hors-d'Oeuvres? Uses and Misuses of Gallicisms in English." Syntaxe, Lexique et Lexique-Grammaire: Hommage à Maurice Gross. Ed. Eric Laporte, Christian Leclère, Mireille Piot et Max Silberztein. Lingvisticae Investigationes Supplementa, 2004. Print.

─── , Kehoe, A., and J. Banerjee. "The WebCorp Search Engine: A holistic approach to web text search." Corpus Linguistics, Vol 1:1, Proceedings of CL2005. Ed. P. Danielsson and M. Wagenmakers. University of Birmingham, 2006. Web. <http://www.corpus.bham.ac.uk/PCLC/#webcorpus>

─── , Kehoe, A., and J. Banerjee. "WebCorp: an integrated system for web text search." Corpus Linguistics and the Web. Ed. M. Hundt, N. Nesselhauf, and C. Biewer. New York: Rodopi, 2007. 47-68. Print.

─── . "Tracing lexical productivity and creativity in the British media: The Chavs and the Chav-nots." Lexical Creativity, Texts and Contexts. Ed. J. Munat. Amsterdam: John Benjamins, 2007. 61-89. Print.

─── , Pacey, M., Kehoe, A., and P. Davies. "Monitoring Lexical Innovation in Journalistic Text Across Time." Forthcoming.



[1] The updated version can be seen at http://www.webcorp.org.uk/wcadvanced.html

[2] by singer Justin Timberlake's publicist.

[3] In spring 2009, in contrast, it is used largely meta-linguistically, in some strange parodies on biblical use, with a few current regular uses though within an awkward array of lexico-syntax not indicative of settled use. The reviewer of this paper did not find the same profile of provenance for repugned in Web text searches in 2006 and 2009, so I have provided the update from WebCorp output. This change is useful in that it points up what regular users of the Web will know: that the same profile does not occur from one hour to another, much less three years later, endorsing our later argument for the greater stability which WebCorpLSE can provide.

[4] It would appear from comparable US data that it is still used in current American English.

[5] possibly confused with or influenced by another Shakespeare quotation: 'if music be the food of love, play on'.

[6] The string "by any other" is sufficient to allude to the source phrase in one case only: "The sponsors of the free card turn out to be two hitherto respectable travel bodies: the no-frills airline, Buzz, and Maison de la France - the French tourist office by any other nom ". There are zero instances of "a rose by any other * would" where name is suppressed; of "a * by any other * would", where name and rose are suppressed; or of "by any other * would" where name is suppressed.

[7] Minutes wireless is not a phrase, but two words which meet either side of an absent full-stop, in the sequence "Data delayed by at least 20 minutes Wireless economy Times Online "

[8] sample references:

(2003) Kübler, 'Using WebCorp in the classroom for building specialized dictionaries' in Aijmer & Altenberg (eds.) Advances in Corpus Linguistics, Rodopi.

(2004) Rocha 'Anaphoric demonstratives: dealing with the hard cases', in Branco, McEnery, Mitkov. (eds.). Anaphora Processing: Linguistic, Computational and Cognitive Modelling.

(2005) Kurjian. Distinguishing similar constructions using a corpus-based investigation

(2005) Lüdeling, Evert, Baroni. Using Web Data for Linguistic Purposes.

(2006) Schmied, J. 'New ways of analysing ESL on the WWW with WebCorp and WebPhraseCount' in Renouf & Kehoe The Changing Face of Corpus Linguistics, Rodopi.

(2006) Mair, C. 'Tracking ongoing grammatical change and recent diversification in present-day standard English: the complementary role of small and large corpora' in Renouf & Kehoe.

(2004) Burrough-Boenisch, J. Righting English that's gone Dutch (2nd Ed.), Voorburg: Kemper Conseil.

(2006) Maher, A. 'WebCorp as a Translation Resource' in Caduceus, newsletter of American Translators' Association Medical Division

Authors

Antoinette Renouf (Birmingham City University)

Licence

Creative Commons Attribution 4.0
