“Fawcett”: A Toolkit to Begin an Historical Semantic Web

Keywords

semantic web, research tools, history, accessibility / web sémantique, outils de recherches, histoire, accessibilité

How to Cite

Robertson, B. G. (2009). “Fawcett”: A Toolkit to Begin an Historical Semantic Web. Digital Studies/le Champ Numérique, 1(2). DOI: http://doi.org/10.16995/dscn.112


Introduction

The Web of 2009 tantalizes the user who approaches it with an interest in history. Hughes’ and Greengrass’ recently-published Virtual Representation of the Past makes clear the variety of newly developed resources that are available. Nevertheless, the ever-increasing array of historical source material, archival records, inscriptions, and reports now published on the Web, including online topic-based websites and online journals, is all too often just beyond the grasp of the non-specialist scholar. Even today, professional historical research on the Web is dependent on the researcher knowing the right websites ahead of time, rather than on simple queries of the Web as a whole. Some component of this global information network should be capable of supporting a query searching for all online historical evidence relating to a given time and place. For instance, it should be possible to enter the query “1767 AD” in order to bring the user in contact with the newspaper transcriptions from that year provided by Costa’s Geography of Slavery, with the pertinent proceedings of the Old Bailey courthouse in London published online by Emsley et al. (2009), and with the large number of remaining online sources associated with that year. It should be possible to receive highly relevant results from such queries instead of the largely irrelevant results often generated by a Google search. Such a service would, moreover, make online historical research more useful and more pertinent to the interested layperson. Imagine, for instance, a family visiting Brittany to trace their family roots. What texts, artefacts and scholarly discussions exist, they might ask, that pertain to Brittany at the time when their ancestors came to North America? The online summaries of a local tourist bureau or Wikipedia articles are unlikely to suffice, but if the online digital contributions of local and national museums and archives were made available, such a family could more effectively plan their personal historical journey of discovery.

There is a second audience that would potentially be interested in using a unified entry point for historical research: the growing number of people who willingly contribute their expertise online through wikis, blogs, and discussion fora (Sunstein 149-164). Such a group, provided with a general outline or index and simple digital tools to associate secondary materials to this index, would become a powerful adjunct to the core of professional researchers for a given topic, filling in helpful ancillary materials such as references to journal articles, on-line discussions, and references in the popular media. In short, a usable, global overview of historical resources on the Web might become a sort of seed crystal, whose introduction facilitates the seemingly spontaneous creation of a much larger matrix of interconnected information.

The requirements for such a system are demanding, but this paper will show that a combination of web 2.0 programming techniques and Semantic Web technologies and standards will go a long way toward meeting this challenge. First, a common schema specifying the basics of historical information is required. Moreover, individual projects must be able to provide their data in this schema, or related ones, without being required to use them internally. For the application of such a schema to be possible, researchers will require a means to translate the metadata governing documents from one schema to another, ideally through a method that requires nothing from the document’s original authors. The first part of this paper, therefore, shows how, with slight modifications, the applications and standards of the Resource Description Framework (RDF) support these requirements.

Second, historians need to consider the kinds of tools that will effectively enable users to browse thousands, and even hundreds of thousands, of historical events. Supporting such a task not only entails questions relating to user interface design, but it also raises problems pertaining to the volume of data possibly being exchanged. In order to illustrate these issues, the second part of this paper introduces the Fawcett Toolkit, a computer application built upon web 2.0 techniques and current standards relating to the markup of humanities documents. Published under a free and open source licence, and built upon similarly licensed software, the Fawcett Toolkit represents the current state of the Historical Event Markup and Linking Project, ongoing research into markup languages designed to aggregate historical event data and represent them as computer-generated visualizations such as maps, timelines, and animations (Robertson 1051-1052; Robertson).

Each advance in work such as this brings into view new opportunities and new technical and social impediments. The third part of this paper outlines two such issues. First, it demonstrates that historical data acquired online require developers to undertake a more careful approach to the selection and transmission of text expressed in languages other than English. Second, I suggest that one implication stemming from projects such as the Fawcett Toolkit is the following: if the digital historical community wishes to support the powerful and precise aggregation of historical content on-line, it must make open licensing of raw data a fundamental requirement for academic publishing. Only then will it be possible to build a network of historical information that enables specialist and non-specialist users to access the full array of content and analyses created by digital historians.


1.0 Schemas for Historical Events

A reasonable consensus has emerged regarding the data types and relationships that are needed to categorize the past online. The Heml project provides XML schemas to outline the past as a series of events; they were first devised in 2001 and revised in 2003 (Robertson). In these schemas, “events” minimally comprise a textual label, such as the “Battle of Actium,” and a temporal label that can be resolved to machine-readable formats. Optionally, an event may comprise participants (who in turn might have roles in the event) and locales. Finally, these schemas connect the event model with evidence, either online or in print. The Heml XML Schema was designed to provide a missing component within the context of a much more comprehensive set of specifications for XML, the Text Encoding Initiative's (TEI's) P4 Guidelines. The P4 Guidelines were published in The TEI Consortium: guidelines for electronic text encoding and interchange and included every other component required to compose a digital historical commentary (TEI Consortium, 2002). They did not, however, offer the capability to encode historical events, a practice that makes it easier to generate timelines, maps, and other guides for the reader.

TEI's P5 specification fills this gap (TEI Consortium, 2007). It also includes event tags, though the exact modalities of their use are slightly different from those of Heml. At present, TEI event tags appear to have a primarily descriptive function, since they must appear within a tag describing a person (or a place), and, according to common convention, nested XML tags imply that the enclosing tag has ownership over the enclosed elements. However, as we shall see, TEI P5 event tags have an excellent array of possible qualifiers, and can be associated with places defined elsewhere in the text and with anchor tags indicating the evidence within the text for the event.

In a much-quoted turn of phrase, Dempsey notes the “recombinant potential” of reusable and pervasive cultural information on the Web when encoded in the more formal language of propositional logic. Indeed, recent research shows that historical data encoded in this manner can make good use of artificial intelligence research through data-mining techniques (Ciravegna et al.). Somewhat overlooked, perhaps, is the potential of Semantic Web technologies to represent broad, not necessarily deep, knowledge about the past: to link together data from highly disparate sources and schemas into a common schema, and then to make this pool easily searchable.

The specification that appears most likely to meet humanists’ requirements in this respect is the CIDOC-CRM or “Conceptual Reference Model” (Doerr, Hunter, and Lagoze 169). It also models historical events. While the TEI appears to use historical events as descriptors for places and persons, the CIDOC-CRM was built to meet the needs of the field of cultural heritage, and, in particular, the needs of researchers and practitioners describing its objects. It defines these objects largely in terms of the events they undergo, and thereby through the people and places associated with them. It has achieved the status of a standard approved by the International Organization for Standardization (ISO), the leading international standards-setting body (ISO 21127:2006). In the digital humanities literature, the potential applicability of the CIDOC-CRM standard has generated considerable interest (Eide).

The CIDOC-CRM has not been without its detractors. Rejecting its suitability as a set of specifications to support an ancient world data-mining project, Gilles points out that the CIDOC-CRM’s orientation toward historical events puts entities such as people, places, and inscriptions at a distance from each other. And he notes that there are large problems left unsolved by the CIDOC-CRM, particularly those of reference and evidence. As a result, Gilles proposes to delay its implementation in his informatics; in this paper I hope to show that the incompleteness of the CIDOC-CRM for any given task need not deter us from using it to describe many things and relations (Gilles).


2.0 Historical RDF

The Semantic Web is based on a technology known as the Resource Description Framework, or RDF (Beckett). Although an XML representation of RDF data exists, the scholar accustomed to XML should avoid viewing RDF through an XML frame of reference, and instead understand an RDF database or document as one or more statements, with each statement comprising three parts: a subject, a predicate (or property), and an object. This is illustrated in Table 1. Each row of that table is a single statement, sometimes called a “triple.” In the subject column, strings with underscores, such as john_hammond, represent variables; in common practice these are encoded with Uniform Resource Identifiers (URIs) and represented to users through the string associated with the variable, using a predicate such as has_label.


     Subject                      Predicate/Property   Object
1    john_hammond                 has_label            “John Hammond”
2    founding_of_owens_gallery    has_label            “Founding of Owens Art Gallery”
3    founding_of_owens_gallery    has_participant      john_hammond
4    founding_of_owens_gallery    has_date
5    hammond_visits_san_frans     has_label            “Hammond Visits San Francisco”
6    hammond_visits_san_frans     has_participant      j_hammond
7    j_hammond                    owl:sameAs           john_hammond

Table 1: Historical Statements in RDF


Such “raw” data is usually only expressed to the user in the format described in the Object Column. It is usually material that is “machine readable” – dates, names, latitude and longitude markers – material the computer uses to compute and then form visualizations for the user. This information is not necessarily contained in one document, repository, or library. As long as all the documents and data use the same URIs when referring to the same concept, the statements may originate from multiple databases, documents, and even web services. Moreover, the order of the statements does not change the meaning of the set of statements.
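To make the three-part structure concrete, here is a sketch (not drawn from the article itself) of how the statements in Table 1 might be written in the Turtle serialization of RDF. The ex: namespace and the property names are placeholders, and the incomplete has_date statement of row four is omitted.

    @prefix ex:  <http://example.org/history#> .
    @prefix owl: <http://www.w3.org/2002/07/owl#> .

    # Each statement below is one subject-predicate-object "triple".
    ex:john_hammond              ex:has_label       "John Hammond" .
    ex:founding_of_owens_gallery ex:has_label       "Founding of Owens Art Gallery" ;
                                 ex:has_participant ex:john_hammond .
    ex:hammond_visits_san_frans  ex:has_label       "Hammond Visits San Francisco" ;
                                 ex:has_participant ex:j_hammond .
    ex:j_hammond                 owl:sameAs         ex:john_hammond .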

Scholars in the Humanities who have worked with digital documents are probably familiar with technologies that define a markup specification, such as Document Type Definitions (DTDs) or XML Schemas. These formally state which markup elements may appear and constrain their location and relationships within the document, indicating, for instance, that an element comprising a quotation may not appear within an element denoting a personal name; or that an element that denotes an historical event must have some chronological information associated with it. In common use, these are given as input to computer programs that will declare if a document “conforms” to the DTD or XML Schema. Most documents will “declare” their DTD or XML Schema at the beginning of the document, and the processor will fail with an error if the document does not conform to the specification to which it professes to adhere. Given this stricture, it can be assumed that general purpose tools for markup languages may be applied with good effect.

When dealing with XML documents, one common application that relies on document specifications is a program that transforms a first document, such as one conforming to the TEI specification expressed in a DTD or XML Schema, into another kind of XML document, say an XHTML file that is specified in its own DTD or XML Schema, and can be read by a web browser. The technology most commonly used for this is “Extensible Stylesheet Language Transformations” (XSLT). Because the source document conforms to an XML specification, the author of the XSLT program is able to assume safely that the same location and governing relations in the source document will pertain in all conforming documents, and can therefore reliably replicate them in the end document. Elsewhere, the conforming XML data is used as the input to a program that generates a visual component, such as a chart. In most cases, the specification exists as a kind of contract between the source of data and the processes that consume it, and so it might be said that the tools for specifications commonly employed by Humanities scholars define data in order to constrain it.

While it is possible to use constraining specifications for RDF, it and other features of the Semantic Web are driven by technologies that require a shift in thinking for developers and users. In general, it may be said that Semantic Web tools specify data not in order to constrain it, but rather to permit its discovery and interpolation. We may illustrate the potential of this approach with an example. Consider statement three in Table 1, and assume that has_participant is defined explicitly as a predicate that associates historical events with the human agents that participate in them. (In an XML definition this “has a participant” relationship is usually implied through a nested structure, just as an XML paragraph implies a “has a sentence” predicate because it has sentences nested within it. In RDF, the “has a participant” relationship is expressed directly as a three-part statement, or “triple”.) Using common XML approaches, this information alone would not pass muster: minimally, an historical event should have chronological information associated with it, and this information would be defined in the DTD or schema.

Applications that use specifications for the Semantic Web, such as RDF Schemas, take a much more liberal view of this data. If, as we stated above, the predicate has_participant is defined to associate historical events with human agents, then associated technologies would deduce that the URI founding_of_owens_gallery is an historical event, unless otherwise informed. The fact that founding_of_owens_gallery does not yet have a date or a labelling text associated with it does not impede this deduction, because such associations can later be supplied by data from another, as yet unknown, source. In terms of formal logic, Semantic Web technologies employ the “open world assumption”: they do not reject a statement as false even if a pertinent datum such as a specified date is missing. As a result, computational tools cannot rely on such RDF specifications to act as a guarantee of the completeness of incoming data.
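As a hedged illustration of this open-world behaviour, the following RDF Schema fragment, written with placeholder names, declares the domain and range of has_participant; given only statement three, an RDFS reasoner may then infer the types shown in the comments rather than rejecting the data as incomplete.

    @prefix ex:   <http://example.org/history#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    # Schema: has_participant links historical events to human agents.
    ex:has_participant rdfs:domain ex:HistoricalEvent ;
                       rdfs:range  ex:HumanAgent .

    # Data: a lone statement, with no date or label yet.
    ex:founding_of_owens_gallery ex:has_participant ex:john_hammond .

    # Rather than rejecting the data as incomplete, an RDFS reasoner may infer:
    #   ex:founding_of_owens_gallery rdf:type ex:HistoricalEvent .
    #   ex:john_hammond              rdf:type ex:HumanAgent .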

Table 1 illustrates another powerful and simple RDF tool that supports the discovery of historical events. With reference to rows one and six, it might be suspected that two different variables are used to represent one person, John Hammond. This is a likely scenario, since the aggregation of historical data will result in multiple URIs that, in fact, refer to the same person, place or event. However, the statement in row seven makes the two URIs, john_hammond and j_hammond, equivalent, so that the same person is represented as participating in events where either URI appears. [1]


3.0 The Fawcett Toolkit

The remainder of this paper presents the Fawcett Toolkit, a complementary suite of general-purpose digital tools to support work on the historical Semantic Web. The toolkit first provides components for “servers,” computer programs that collect and reconcile the data as described above and then send out parts of that data in response to queries. It also provides components for “clients,” typically web browsers running on computing devices across the Internet. These browsers access web pages containing components that query a server. The server replies to the query with the appropriate data, and the clients use these to produce historical maps or other visualizations. The servers and clients must of course agree on a common language for the queries and replies. The World Wide Web Consortium has defined just such a query language for RDF, called SPARQL, and it has specified a standard response format for replies to queries. Though in some respects SPARQL is less powerful than other RDF query languages, it has the advantage of being well implemented, especially in the Joseki server software available from Hewlett Packard. The version of Joseki provided in the Fawcett Toolkit is equipped with an RDF schema file that allows it to discover CIDOC-CRM events in heterogeneous RDF data. Data is loaded into the server with, for example, an XSLT stylesheet that transforms conforming TEI P5 documents into the CIDOC-CRM-based markup used internally. Thus, besides serving its original purpose as a means of visualizing large numbers of historical events from across the web, the toolkit can be used as an adjunct to a digital library that comprises historical material, or as “scaffolding” while marking up TEI P5 event tags.
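The following SPARQL query is a rough sketch of the kind of request a Fawcett client might send to such a server; the crm: namespace URI and the exact class and property spellings are assumptions made for illustration, not the toolkit's actual identifiers.

    PREFIX crm:  <http://example.org/cidoc-crm#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    # Ask the endpoint for events and their labels, in alphabetical order.
    SELECT ?event ?label
    WHERE {
      ?event a crm:E5_Event ;
             rdfs:label ?label .
    }
    ORDER BY ?label
    LIMIT 50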


4.0 A Simple Event Schema Based on the CIDOC-CRM

The CIDOC-CRM is a general and abstract model, one that intentionally avoids specifying how to encode machine-actionable data pertaining to time and location. It also does not provide a model for associating evidence with events. In modelling events for the current Fawcett Toolkit, we have taken a pragmatic approach to these matters, using the CIDOC-CRM's classes and properties where appropriate and then supplementing them with simple relations where necessary. Figure 1 presents part of a sample event in graphical form, with the blue elliptical nodes representing URIs, the white rectangles representing strings, numbers and other machine-actionable data, and the arcs between nodes representing the property relationships between RDF resources. The figure shows that some shortcuts have been made to simplify the markup in comparison to the complete CIDOC-CRM. For example, the schema here makes the assumption that all locations are points and so it uses the common geo: namespace to identify latitude and longitude. The representation of time is also simplified compared to the complete CRM schema, though the Fawcett Toolkit can still process all the chronological relations offered by the Heml XML Schema.[2]


Figure 1: RDF Representation of an Event Originally Encoded in TEI P5
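A Turtle sketch along the lines of Figure 1 might look as follows. The crm: and ex: namespaces and most property names are placeholders (only geo: is the standard W3C WGS84 vocabulary), and the coordinates are approximate.

    @prefix crm:  <http://example.org/cidoc-crm#> .                 # placeholder namespace
    @prefix geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#> .      # W3C WGS84 vocabulary
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix ex:   <http://example.org/hammond#> .                   # placeholder namespace

    ex:HammondHeadsTowardPortland
        a crm:E5_Event ;
        rdfs:label "Hammond heads toward Portland, OR" ;
        crm:P11_had_participant ex:JohnHammond ;
        crm:P7_took_place_at ex:MouthOfColumbiaRiver .

    # All locations are treated as simple points (coordinates approximate).
    ex:MouthOfColumbiaRiver
        rdfs:label "Mouth of the Columbia River" ;
        geo:lat  "46.2" ;
        geo:long "-124.0" .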

The project's SPARQL server, packaged as part of the toolkit release, contains 5,600 events and over 115,000 statements. The “reasoners” built into the Joseki server are provided with simple rules for interchange between specifications for historical data, and using these they derive many of the events and statements from other RDF schemas. For instance, some events were encoded by students using “Semantic MediaWiki,” an extension to the software that runs Wikipedia (Krötzsch, Vrandecic, and Völkel). Using this extension, the MediaWiki environment can function as a user-friendly RDF editor. However, the data encoded by the wiki environment employed a different schema from the modified CIDOC-CRM schema described here and were therefore adapted to this schema by using the techniques described above.
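One way such interchange rules can be expressed is with RDFS subproperty statements, as in the hypothetical mapping below; the wiki: property names stand in for whatever vocabulary the Semantic MediaWiki installation actually produced, and the crm: namespace is again a placeholder.

    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix wiki: <http://example.org/wiki/Property#> .   # hypothetical wiki vocabulary
    @prefix crm:  <http://example.org/cidoc-crm#> .       # placeholder CRM namespace

    # Every wiki:Participant statement also counts as a CRM participation statement,
    # so CRM-based queries can be answered over the wiki-encoded data.
    wiki:Participant rdfs:subPropertyOf crm:P11_had_participant .
    wiki:Location    rdfs:subPropertyOf crm:P7_took_place_at .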

Other events were rendered into conforming RDF, represented in XML, through XSLT transformations of XML source documents. Because of their novelty and general utility to Humanities scholars, special attention was given to the XML event tags in the TEI P5 specification, with the aim of exploring how the event visualization tools could interact efficiently with digital library software such as the Perseus Digital Library, which is now available under a free and open-source license. The goal was to augment the users' experience of the library without adding code to the library’s software. The Perseus Digital Library was installed locally, and a summer research student converted the Perseus text of Sallust's Bellum Catilinae from the older P4 standard of the TEI to the more recent P5 standard. He added appropriate P5 event tags, as well as place and person tags in Latin and English. An XSLT stylesheet was written to convert this XML-encoded event data into RDF that conformed to the CIDOC-CRM specification, and the resulting data, stored alongside the other datasets in the server software, could be queried and visualized on Web clients.

The results were satisfactory enough to inspire us to encode a wholly new text, the diary of John Hammond as he traveled from Montreal to the West Coast in 1871, a holding of the Mount Allison University Archives (Hammond).[3] The historical markup of this text was not completed, but enough was done to confirm the results obtained from the ancient text. This finding is significant because the Hammond diary uses a different means of sectioning the text and provides temporal descriptions of events that are sometimes precise to the hour. Examples from the Hammond diary are used to illustrate visualization techniques in the following two sections.


5.0 A Historical Mapping Widget


Figure 2: The Mapping Widget

Although there are many experimental visualizations of the RDF data available in the Fawcett Toolkit, the most developed of these is the mapping widget appearing in Figure 2. The visualization shown here exemplifies the approach we have taken in order to maximize the amount of historical information available in one viewing while minimizing the demands on the server. In the Fawcett Toolkit, the SPARQL query engine is the only server software associated with the RDF data, and its only task is to reply to the client's queries, encoding the results in the lightweight JSON format. These queries are encapsulated in a common JavaScript library used by all Fawcett tools, and this library itself makes use of the sparql.js JavaScript query library (Feigenbaum, Torres, and Yung).

The JSON-only approach benefits the process in two ways. First, it offloads the tasks of drawing the map and its dynamic components onto the client computer, a fair exchange since there are far more clients than servers. Second, it reduces the size of the files that must be communicated from server to client.

Further reductions in transferred data are gained because the mapping widget uses AJAX-like communication to add further information to the map as the user interacts with it.[4] On its first drawing, the map widget builds a layer within the OpenLayers JavaScript mapping toolkit. This layer does not include the event information for each location (some of which might be expected to be associated with a large number of events). Rather, each point is encoded with its co-ordinates, label, and the URI of the location in the RDF database. When the user mouses over the location, a SPARQL query is constructed to request the corresponding list of events, which are then rendered in chronological order below the map. In Figure 2, the user has moused over the location corresponding to the mouth of the Columbia River, and the single event that takes place at this location, according to the original TEI document, is “Hammond heads toward Portland, OR ....” In this way, AJAX programming allows the user to navigate a potentially very large dataset with very few delays.
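The follow-up request might resemble the following SPARQL sketch, in which the location URI and property names are illustrative placeholders rather than the toolkit's actual identifiers.

    PREFIX crm:  <http://example.org/cidoc-crm#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ex:   <http://example.org/hammond#>

    # Fetch the events attached to the moused-over location.
    SELECT ?event ?label
    WHERE {
      ?event crm:P7_took_place_at ex:MouthOfColumbiaRiver ;
             rdfs:label ?label .
    }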


6.0 Rendering Source Documents Inline

In order to improve user experience, the same interactive approach was extended to the display of textual evidence associated with the event. In previous work with Heml-Cocoon software, source documents were accessed through labelled hyperlinks. This was not an optimal solution, however, because it drew users away from the central navigating tool, the map or the timeline, and it prevented side-by-side comparison of source documents. In print media, it is common to list citations inline or to provide a brief quotation that is embedded in the sentence. For the Fawcett Toolkit, it was hoped that we might find an analogue that could apply in the digital world.

One of the important roles of digital library software like Perseus is so-called “chunking,” a function that divides the text into suitable sections such as books and lines of poetry or pages of text (Crane and Wulfman 79). The Perseus software not only serves these chunks as fully rendered web pages, but also, given the proper address, as fragments of the raw TEI file in XML. Because the chunking pattern of the text is encoded in the TEI file, the XSLT file that creates the RDF can also discern the specification and create conforming patterns for the addresses of the page links. Thus it is possible for the mapping widget to provide textual evidence drawn from the digital library and render it inline, in a way that does not draw the reader away from the map.

Figure 3 shows the relations necessary to make this possible. As was the case with the example shown in Figure 1, the evidence, here given the URI hammond_diary#KamloopsLetter, has a label. Its web address, though, points to the Perseus 'xmlchunk.jsp' service. In addition, the evidence object is associated with an appropriate XSLT file available on the web.[5] This transforms the TEI-encoded XML chunk into a fragment of a web page encoded in XHTML.


Figure 3: Graph of Relations Needed to Render a Reference Inline
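In Turtle, the relations of Figure 3 might be approximated as follows. Apart from hemlRDF:Evidence and the #KamloopsLetter and #TeiFragmentRenderer fragments, which the article names, the property names, namespace URIs, and addresses below are placeholders.

    @prefix hemlRDF: <http://example.org/hemlRDF#> .           # placeholder namespace
    @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix diary:   <http://example.org/hammond_diary#> .     # placeholder namespace

    diary:HammondHeadsTowardPortland
        hemlRDF:hasReference diary:KamloopsLetter .

    diary:KamloopsLetter
        a hemlRDF:Evidence ;
        rdfs:label "The Diary of John Hammond p. 10b" ;
        # address of the raw TEI chunk served by the digital library (query string omitted)
        hemlRDF:webAddress "http://example.org/perseus/xmlchunk.jsp" ;
        # stylesheet used client-side to render the chunk as XHTML
        hemlRDF:renderedBy diary:TeiFragmentRenderer .

    diary:TeiFragmentRenderer
        hemlRDF:xsltAddress "http://example.org/xslt/tei-fragment.xsl" .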

The result of providing these two relations is a series of user interactions best explained with reference to Figure 4. When the user clicks on the text “Hammond heads toward Portland, OR ...”, the words “Evidence” and “Referred To In” appear below. These are the labels of two kinds of relations between the event and a text. RDF and the software allow for the creation of a custom taxonomy that describes the relationships between event and text, which may include categories such as “Eyewitness Account” and “Memoir,” and even types of secondary resources such as “Discussion” and “Refutation.”[6]

When a user clicks on either the “Evidence” or “Referred to In” headings, the citation for each applicable source appears. In the case of our example, only one source appears, “The Diary of John Hammond p. 10b.” However, because a large number of citations potentially could be displayed at the instigation of the user, the data is again rendered through an asynchronous update of the page. Finally, when the user clicks on this text, the browser fetches both the TEI fragment and the XSLT file associated with the reference that are provided in RDF. It then renders the TEI fragment into XHTML using the XSL file and inserts the resulting XHTML inline. Subsequent clicks on the text, citation, or resource class label will hide and show all enclosed materials, allowing the user to explore this material in-place.


Figure 4: Evidence Rendered Inline

Some scholars might be concerned that this approach to rendering an historical text is part of an unfortunate trend in the online publication of documents. It is sometimes claimed that the digital medium excessively fragments texts, doing damage to their cohesiveness or to their narrative integrity. However, it should be noted that texts were being fragmented for the sake of historical information many years before the advent of the Internet: the historical sourcebook is a well-established genre. In fact, this inline rendering technique is a substitute, not for extensively quoting the text, but rather for citing it and, in all likelihood, leaving it unread. As a result, a technique such as this makes the text more accessible to the person with historical interests, not less. Finally, the digital medium makes it possible to offer multiple paths of research. In this case, it would be a simple improvement to add a link that allows the researcher to load the text in its own browser tab or window in order to explore the context of a particular source or to read a text that has been discovered through a geo-temporal search process.


7.0 Other Experiments

The Fawcett Toolkit includes some less well-developed experiments in historical encoding and visualization. We used a hemlRDF:comprisesEvent property to nest events within each other. For instance, an event labelled “John Hammond's Trip to Western North America” could comprise all the events marked up in the diary, and the single event referring to Hammond’s trip would itself be one of several events nested under an event labelled “John Hammond’s Life,” again using the hemlRDF:comprisesEvent relationship. This hierarchy of events was used to draw a tree list using a component from the Yahoo user interface library. In this form of visualization, the events in the tree that have hemlRDF:comprisesEvent properties are indicated with an expand icon, shown in the form of a “+” sign that, when clicked, retrieves and renders the child events. The nesting depth is not fixed, so it could continue through a great number of levels. It is yet another way that large numbers of events can be displayed together.
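A minimal Turtle sketch of such nesting, using placeholder URIs and labels alongside the hemlRDF:comprisesEvent property named above, might read:

    @prefix hemlRDF: <http://example.org/hemlRDF#> .            # placeholder namespace
    @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix ex:      <http://example.org/hammond#> .            # placeholder namespace

    ex:HammondsLife
        rdfs:label "John Hammond's Life" ;
        hemlRDF:comprisesEvent ex:TripToWesternNorthAmerica .

    ex:TripToWesternNorthAmerica
        rdfs:label "John Hammond's Trip to Western North America" ;
        hemlRDF:comprisesEvent ex:HammondHeadsTowardPortland ,
                               ex:HammondVisitsSanFrancisco .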

Users might want to add to the pool of references associated with an event. The JavaScript experiment named “Libenter” shows how the client’s browser can capture a description of the location on the page of a portion of text that has been highlighted with the browser’s “select” tool. Because SPARQL now includes commands to “update,” or add to, RDF databases, it would be possible to associate with an event the URI of a pertinent webpage and the description of the important part of that page, similar to what was done above with the TEI chunks from the Perseus server.
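For example, a “Libenter”-style client might issue a SPARQL Update request along these lines; the property names, URIs, and the format of the selection description are all assumptions made for illustration.

    PREFIX hemlRDF: <http://example.org/hemlRDF#>
    PREFIX ex:      <http://example.org/hammond#>

    # Attach a user-supplied web page, and a description of the highlighted
    # portion of that page, to an existing event.
    INSERT DATA {
      ex:HammondHeadsTowardPortland
          hemlRDF:referredToIn <http://example.org/blog/columbia-river-post> .
      <http://example.org/blog/columbia-river-post>
          hemlRDF:selection "third paragraph, second sentence" .
    }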

Finally, the Joseki server used in this work required an important alteration before it could serve historical data successfully. A property function was written for it so that it would translate the various temporal representations applied to events into plain numbers. These numbers allow the client to request a temporal range or data sorted temporally.
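A temporally sorted request might then look like the following sketch, where hemlRDF:numericDate stands in for whatever name the property function actually answers to, and the year-based numbers are merely illustrative.

    PREFIX hemlRDF: <http://example.org/hemlRDF#>
    PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>

    # Request the events of a single year, sorted by their numeric dates.
    SELECT ?event ?label ?n
    WHERE {
      ?event rdfs:label ?label ;
             hemlRDF:numericDate ?n .
      FILTER (?n >= 1871 && ?n < 1872)
    }
    ORDER BY ?n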


8.0 Future Work

The examples given above were all drawn from an English source and with English metadata labels, but often historical research involves many languages. It is, of course, considered best for a researcher to read evidence and discussions in the author's original language. However, even the most linguistically adept professional historian, not to mention the interested layperson, cannot be expected to fulfill this standard at all times. When the historical researcher cannot read the pertinent document in its original language, she would naturally prefer having access to a translation rendered in a language that she is familiar with. For these reasons, a system such as the one described here should, when possible, provide researchers with translated sources and discussions that suit their abilities.

The process of altering the responses of a computer program in order to suit a user's linguistic and cultural preferences is commonly known as “internationalization.” In recent years, several standards have arisen to support this process in the networked world of the Web. First, there is BCP 47, a standard set of abbreviations for languages, their variants, and their computer encodings (Phillips and Davis). Second, documents encoded in XML languages, such as XHTML, identify the language of the text enclosed within an element through the use of an xml:lang attribute set to the appropriate abbreviation. Finally, the hypertext transfer protocol (that is, the language that a Web browser speaks to a Web server when requesting materials) includes a line in the “header” of the request labelled “Accept-Language,” whose list of language tags “restricts the set of natural languages that are preferred as a response to the request” (Fielding et al.). When a server receives a request with this header line, it uses this information to match the text resources available to it, choosing the alternative that is highest on the list of user preferences.
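For example, a browser configured for Canadian French might send a request whose header includes lines such as the following; the path and host are invented, and the q values rank the user's preferences from most to least preferred.

    GET /events/1871 HTTP/1.1
    Host: www.example.org
    Accept-Language: fr-CA, fr;q=0.8, en;q=0.5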

This process is appropriate to, for instance, a commercial Web service, but not to the scholarly processes we have in mind, because it implies that all the linguistic representations of a given text are equal in value. At the very least, a text that is in the author's original language should be preferred if the user's Accept-Language header lists this among the languages that he can read. Ideally, the server might rate the linguistic fidelity of the source – for instance, ranking a translation by the author above other translations – and serve the best text with this and the user's language abilities in mind. Finally, if the only document available to the browser is in a language that the researcher cannot read, it is important that the server provide this source nonetheless, because its absence would suggest that there is no asserted evidence for the event. A determined researcher would, after all, contact a colleague who could offer assistance in the translation.

The web servers built by the Heml project upon the Cocoon XML publishing engine were able to negotiate language content in this manner, but an unmodified server that only returns RDF in response to SPARQL queries cannot. This lack of negotiation occurs because, in common practice with RDF, the metadata that describes the language of resources does not appear within the propositional logic on which SPARQL queries are based. Thus, even though SPARQL makes it possible to filter a query based on language (for example, excluding event labels that are not within a given set), SPARQL does not make it possible to perform queries of the sort, “what languages are used to label this event?” Yet it is exactly the latter form of query that is needed for the more advanced language-matching required by a historian who would like the query to fall back to resources in languages not listed in the Accept-Language list sent by his or her browser. In order to perform these sorts of functions, the metadata that encodes the language of resources must be declared within RDF statements. Since it is not reasonable to hope that those producing the RDF data would adhere to such a language markup scheme, the solution seems to be to modify the RDF server so that it rewrites the RDF graph to express linguistic information in this manner when each RDF source is loaded into the server. Such a system would generate properties that are defined as subproperties of rdfs:label, including one for each linguistic option encoded in BCP 47. These could then be queried through the rdfs:label property using an RDFS reasoner.
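A sketch of this rewriting, with placeholder property names and namespaces, might look as follows.

    @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix hemlRDF: <http://example.org/hemlRDF#> .            # placeholder namespace
    @prefix ex:      <http://example.org/hammond#> .            # placeholder namespace

    # One generated subproperty per BCP 47 language tag.
    hemlRDF:label_en rdfs:subPropertyOf rdfs:label .
    hemlRDF:label_fr rdfs:subPropertyOf rdfs:label .

    # Rewritten data: the language is now visible to SPARQL as a distinct property,
    # so a client can ask which label properties (and hence which languages) exist
    # for an event, while rdfs:label still returns every label under RDFS reasoning.
    ex:HammondHeadsTowardPortland hemlRDF:label_en "Hammond heads toward Portland, OR" .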

RDFS subproperties can also be used to indicate the identity and authority of the person associating a given property with a given event. If, for instance, historian Smith disputed that the passage cited in Figure 1 (labelled with the URI #KamloopsLetter and expanded in Figure 3) actually provides evidence for that event, it would be possible to indicate that the relation between the event and “August 15, 1871” is asserted by Chung, but disputed by Lebans, and to generate different chronologies and event lists according to their opinions. This paves the way for satisfactorily using the more complex relations of the CIDOC-CRM, such as causation, which are only acceptable when ascribed to a scholar's opinion. This work is planned for the Summer of 2009.

These, then, are the pitfalls and potentials of a global index of historical resources based in a simplified use of the CIDOC-CRM, served with slightly modified SPARQL servers, and visualized with in-browser JavaScript programs. There remains, however, one paramount issue on which the success of this approach depends: a changing culture of research in the Humanities. It has long been observed that the emphasis in the humanities on solitary research betrays its roots in monastic scholarship. But this lack of a collaborative spirit is perhaps most obvious in the realm of digital publication, where the real basis of the interchange of ideas, such as the data and metadata encoded in databases and TEI-encoded XML files, is rarely published. Indeed, in today's security-minded Internet there are ever-increasing ways for a Humanities project to impede the aggregation of its resources. It should be understood that, as a discipline, History has more to gain from the responsible popularization of its topic and from the free interchange of ideas than each scholar stands to lose if his data are not always associated with his own web site. The biological sciences recognize the benefits of the free flow of data. For example, the Broad Institute, a leading genome research lab, publishes complete mammal genome sequences and makes them available to fellow researchers based on the following conditions: “1. The data may be freely downloaded, used in analyses, and repackaged in databases. 2. Users are free to use the data in scientific papers analyzing particular genes and regions if the provider of these data ... is properly acknowledged.” (Broad Institute of MIT and Harvard). The rising tide of online historical information is the historians' equivalent to the genome database, and it should be made as widely available, searchable, and interchangeable as possible.



Works Cited

Beckett, Dave. “RDF/XML Syntax Specification (Revised): W3C Recommendation 10 February 2004.” W3C: World Wide Web Consortium Website. 2004. Web. 6 June 2009. <http://www.w3.org/TR/rdf-syntax-grammar>.

Broad Institute of MIT and Harvard. “Horse Genome Project.” Broad Institute Website. 2009. Web. 6 June 2009. <http://www.broad.mit.edu/mammals/horse>.

Ciravegna, Fabio et al. “Finding Needles in Haystacks: Data-mining in Distributed Historical Datasets.” The Virtual Representation of the Past. Ed. Lorna Hughes and Mark Greengrass. Farnham, UK: Ashgate, 2008. 65-79. Print.

Costa, Tom. “The Geography of Slavery in Virginia.” Virginia Center for Digital History Website. 2005. Web. 6 June 2009. <http://www2.vcdh.virginia.edu/gos>.

Crane, Gregory, and Clifford Wulfman. “Towards a cultural heritage digital library.” JCDL ’03: Proceedings of the 3rd ACM / IEEE-CS joint conference on Digital libraries. Houston: IEEE Computer Society, 2003. 75-86. Print.

Dempsey, L. “Divided by a Common Language: Digital Library Developments in the US and UK.” The 4th International JISC / CNI Conference Website, Edinburgh, 26-27 June 2002. Web. 6 June 2009. <http://www.ukoln.ac.uk/events/jisc-cni-2002/presentations/ppt-2000-html/lorcan-dempsey_files/v3_document.htm>.

Doerr, M., J. Hunter, and C. Lagoze. “Towards a Core Ontology for Information Integration.” Journal of Digital Information 4.1 (2003): 169. Print.

Eide, Oyvind. “The Exhibition Problem. A Real-life Example with a Suggested Solution.” Literary and Linguistic Computing 23.1 (2008): 27-37. Print.

Emsley, Clive, Tim Hitchcock, and Robert Shoemaker. Old Bailey Online. 2009. Web. 6 June 2009. <http://www.oldbaileyonline.org>.

Feigenbaum, Lee, Elias Torres, and Wing Yung. Sparql.js. 2007. Web. 6 June 2009. <http://thefigtrees.net/lee/sw/sparql.js>.

Fielding, R. et al. “HTTP/1.1: Header Field Definitions, June 1999.” W3C: World Wide Web Consortium Website. 1999. Web. 6 June 2009. <http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html>.

Gilles, Sean. “Concordia, Vocabularies, and CIDOC CRM.” 2008. Web. 6 June 2009. <http://concordia.atlantides.org/docs/concordia-crm.html#on-the-cidoc-crm>.

Hammond, John. Mount Allison Archives. John Hammond Fonds, 2008. 8800/2. Print.

Harper, J. Russell. Early Painters and Engravers in Canada. Birkenhead, England: U of Toronto P, 1970. Print.

Hughes, Lorna and Mark Greengrass. The Virtual Representation of the Past . Farnham, UK: Ashgate, 2008. Print.

Krötzsch, Markus, et al. “Semantic Wikipedia.” Web Semantics: Science, Services and Agents on the Web 5.4 (2007): 251-261. Print.

Phillips, A. and E. Davis. “BCP 47: Tags for Identifying Languages.” RFC Editor Website. 2006. Web. 6 June 2009. <http://www.rfc-editor.org/rfc/bcp/bcp47.txt>.

Robertson, Bruce. “Exploring Historical RDF with Heml.” Digital Humanities Quarterly 3.1 (2009). Web. 6 June 2009. <http://www.digitalhumanities.org/dhq/vol/003/1/000026.html>.

───. “DTDs and Schemata.” The Historical Event Markup and Linking Project Website. 2009. Web. 6 June 2009. <http://heml.mta.ca/samples/blocks/heml/schemata>.

───. “Visualizing an historical semantic web with Heml.” Proceedings of the 15th International Conference on World Wide Web (WWW’06), Edinburgh, Scotland, May 23-26, 2006. New York: ACM, 2006. 1051-1052. Web. 6 June 2009. <http://www2006.org/programme/files/xhtml/p199/pp199-robertson-xhtml.html>.

Sunstein, Cass R. Infotopia: How Many Minds Produce Knowledge. Oxford: Oxford UP, 2008. Print.

TEI Consortium. The TEI Consortium: guidelines for electronic text encoding and interchange. Oxford: Humanities Computing Unit, University of Oxford, 2002. Print.

TEI Consortium. TEI P5: Guidelines for Electronic Text Encoding and Interchange. 2007. Web. 6 June 2009. <http://www.tei-c.org/release/doc/tei-p5-doc/en/html>.

Thaller, Manfred. “Which? What? When? On the Virtual Representation of Time.” The Virtual Representation of the Past. Eds. Lorna Hughes and Mark Greengrass. Farnham, UK: Ashgate, 2008. 115-124. Print.



Endnotes

[1] In fact, this example is somewhat simplified, since it would likely result in two John Hammonds being associated with all the events in which each is associated elsewhere in the RDF. A better approach is to use RDF Schema’s “subclassing” for this problem.

[2] The text highlighted in this paper admittedly is not representative of the many problems of encoding time. A recent discussion of these issues can be found in (Thaller, 2008).

[3] Hammond was an artist and photographer who studied with Whistler in 1885-6. He was instrumental in establishing the collection of the Owens Art Gallery of Mount Allison University, and was director of the art school at Mount Allison College 1907-1920 (Harper, 1970: 144-45).

[4] AJAX stands for “Asynchronous JavaScript and XML.” This process is only AJAX-like because the mode of communication between the server and the client is not XML, but rather JSON responses to SPARQL queries.

[5] The XSL file is provided indirectly through the hammond_diary#TeiFragmentRenderer so that more than one address could be given for the file, and so that the client can more easily cache the compiled XSLT file.

[6] Even though the reference is defined with the same hemlRDF:Evidence element as illustrated in Figure 1, the “Referred To In” reference type also appears because “Evidence” is encoded as a subclass of “Referred To In,” and all references appear in both their own class and all superclasses. It is not clear if this is the preferred behaviour.

Authors

Bruce G. Robertson (Mount Allison University)

Licence

Creative Commons Attribution 4.0
