Introduction[1]
Canada’s leadership in the digital humanities continues: 2009 welcomed the release of the Canadian Century Research Infrastructure (CCRI). The CCRI is a census-based digital project that involves scholars in the fields of history, sociology, geography, demography, economics, statistics, and interdisciplinary studies, all drawn from seven Canadian universities and partnered with IBM and Statistics Canada. The first project based in the Humanities and Social Sciences to receive a major grant from the Canada Foundation for Innovation, the CCRI’s mandate is to create public use micro-databases of the 1911, 1921, 1931, 1941, and 1951 national censuses of Canada. This new research infrastructure represents a fresh contribution to the study of the making of modern Canada, and has combined ‘traditional’ research methodologies in the humanities and social sciences with an array of digital products and data processing software. As a result, a wealth of new research on individuals, families, households, and communities (ethnic, religious, geographic, etc.) will be supported and enriched. This article discusses the creation of what has become an innovative cyber-infrastructure, one which has successfully merged quantitative and qualitative data into an interactive digital platform.
With an eye towards providing an overview of the project, this article explores several of the goals and challenges that characterized the construction of the database and highlights the innovative solutions that were crafted along the way. It examines some of the tools and strategies that have been created as a result of the project, including the project’s innovative Sample Point Identification, Data Entry and Reporting program (SPIDER), and the conceptual organization of one of the CCRI’s distinguishing characteristics: a contextual database that provides a wealth of qualitative data central to understanding the cultural, social, political, and economic contexts within which censuses were taken. Moreover, the long-term contributions and possibilities of the CCRI database, which have already been fostered both in Canada and abroad, and the type of new research it can support will all be considered. Overall, the article tackles four main topics: the genesis and context of the CCRI in social science research, the challenges presented by the unique data capturing requirements of the project and their resolution, the creation of new and over-arching tools designed to help distribute project data and to make them as accessible as possible, and the long-term benefits of the project itself for research in the digital humanities.
1.0 The project
The CCRI was conceived of and developed by its principal investigator, Chad Gaffield, and team leaders at seven partner Canadian universities: Peter Baskerville and Eric Sager (University of Victoria); Carl Anheim, Charles Jones and Lorne Tepperman (University of Toronto); Gordon Darroch and Evelyn Ruppert (York University); Adam J. Green (University of Ottawa); Claude Bellavance and France Normand (Université du Québec à Trois-Rivières); Marc St-Hillaire (Université Laval); and Sean Cadigan (Memorial University). Each university developed a version of its own CCRI data research centre, many of which were physically constructed to meet the security requirements of Statistics Canada.
In 2002, the CCRI received its initial funding from the Canada Foundation for Innovation (CFI). Over the course of the following five years, project participants created 1.8 million case records from five separate Canadian censuses, all in computer-readable format. These case records were then cross-tabulated so that researchers could run queries across both variables and census year. Each census year was then overlaid with a Geographic Information System (GIS) mapping sequence, thereby representing the shifting boundaries of census districts over the course of the first half of the twentieth century. GIS map layers enable geographic location, selection, aggregation, and analysis of sample data, as well as some mapping of generalized census data. In addition, the database information received added contextual value from a national survey of newspapers (in both English and French) covering the social and political context of census-taking, a review of official parliamentary proceedings at both the federal and provincial levels, and a host of internal working documents, memoranda, and correspondence from the officials who conducted the original censuses and tabulated the original data at Statistics Canada. The result is a massive, searchable, and interactive research infrastructure capable of supporting research in history, geography, sociology, Canadian Studies, medicine, and a host of other disciplines either separately or as multidisciplinary efforts.
The value of census enumerations for research is substantial: arguably, the census offers the most comprehensive evidence concerning the make-up of the Canadian population as a whole. In theory, the modern census covers the entire country, and in practice it reports on more residents in Canada than any other source. Moreover, the Canadian census not only includes almost all the key questions usually included in the censuses of other countries, but also some less frequently asked questions such as those concerning religion, language, and rural/urban distinctions. These diverse features explain why Statistics Canada has, due to demand, created microdata sets for research purposes for the 1971, 1976, 1981, 1986, 1991, 1996, and 2001 enumerations.
Far from being confined to the Canadian context, census-based analysis has become something of a staple across the Western world. Indeed, the CCRI fits into a much larger international context of assembling census-based databases for a range of collaborative projects. Among the more established and innovative of such endeavours are the North Atlantic Population Project (NAPP), the U.S.-based Integrated Public Use Microdata Series (IPUMS USA 2009), the University of Victoria-based Canadian Families Project (Canadian Families Project), and several other smaller government and commercial databases. In each case, the primary goal of these databases is to enhance – or, in some cases, enable – social science research by professionals, students, academics, and interested members of the larger public. Having made accessible a massive storehouse of information on the populations of Canada, the United States, the United Kingdom, and other countries, these projects allow for the study of human behaviour across time, the movement of peoples across borders, large-scale patterns of demography and human geography, the wax and wane of the prevalence of specific languages, religions and ethnic identifications, and economic trends involving occupation, labour force participation, home ownership and income. To some extent, the potential of these databases is only beginning to be fully understood; recent initiatives using the data have included medical studies (by tracing gene pools over time), family patterns (examining fertility, gendered participation in the workforce, and standards of living), and the construction of identity (tracing the origins of self-identification as “Canadian” or “African-American”).
As is discussed by principal investigator Chad Gaffield in his article “Conceptualizing and Constructing the Canadian Century Research Infrastructure,” the CCRI has always been underwritten by the desire to create a true “research infrastructure” – a large-scale research database, normally associated with the natural sciences or engineering, that would facilitate national and interdisciplinary social science research, and which assumes that its content “will be used by other individuals as well as its creators for both ongoing and not-yet anticipated research efforts” (Gaffield 54). The CCRI team leaders were also always focused on an additional principle: that traditionally “quantitative” data (such as census enumeration totals) and traditionally “qualitative” data (such as newspaper articles) be integrated into the same research infrastructure.
In addition, and as has been emphasized by the team leaders since the outset of the project, the Canadian Century Research Infrastructure enables an innovative and internationally significant program of research focused on a central question: What characteristics, processes, and circumstances explain the making of modern Canada? Building upon a wealth of research completed over the past twenty-five years, the new images of complexity and diversity have reflected the reinterpretation of Canadian history in terms of not only the "famous and infamous" but also, and perhaps most importantly, the "anonymous." In this view, an understanding of the thoughts and actions of those in positions of official and unofficial power is considered to be a necessary but not sufficient condition for historical explanation. Rather than being characterized as passive beneficiaries or victims of those in leadership positions, every person is considered to have contributed in diverse and uneven ways to the making of the Canadian project. In recent decades, scholars have been revealing the "hidden history" of Canada in which large-scale social change occurs as a result of individual decisions and actions multiplied over and over across the entire population. It is in this sense that scholars now analyze change in Canada from both the "top down" and the "bottom up," as well as from the vantage point of the many interactions among individuals within families, communities, and larger jurisdictions.[2]
In this context, the CCRI is designed to overcome a major difficulty now impeding research on census data: the lack of a research infrastructure to study the making of modern Canada. Prior to the efforts of the CCRI project, it was arguably impossible to undertake systematic and comprehensive research on census data for the period between 1901 and 1971 because no research infrastructure for such work had been created. For this reason, the CCRI offers not only new databases for the twentieth century, but also ultimately connects them to previously-created databases to build a research infrastructure that extends back to Canadian Confederation and forward to the twenty-first century. The CCRI project therefore represents the culmination of research efforts that have been contributing to what Chad Gaffield has called "the rewriting of Canadian history." It is these research projects that, taken together, will significantly advance our knowledge of the characteristics, processes, and circumstances of the making, by both the "famous" and the "anonymous," of modern Canada.
2.0 CCRI Microdata Processing: Innovation in Information Technology[3]
In order to enable such research, the primary task became capturing the raw data originally generated in the taking of five national Canadian censuses. According to Statistics Canada’s published census tables, this meant processing responses for 7,206,643 people in 1911, 8,787,949 in 1921, 10,376,786 in 1931, 11,506,565 in 1941, and 14,009,429 in 1951.[4] Gordon Darroch, Richard Smith, and Michel Gaudreault (2007) have offered a comprehensive overview of the important theoretical and practical questions surrounding the development of designs for large representative samples of the census along with the software developed by the CCRI for capturing and processing that data (65-75). What we wish to highlight here are the significant and innovative contributions of the CCRI Information Technology (IT) team through a discussion of the evolution of the team's work. Indeed, one of the major contributions of the CCRI has been its ability to further advance the collection and dissemination of census statistics using cutting-edge software tools. Where the existing software was inadequate to CCRI team aspirations, the IT team designed and developed its own.
Computer software was essential to process census data from approximately twenty million images, the result of digitizing the original census schedules which had been stored on microfilm.[5] Software development was the responsibility of the CCRI IT team located across the country, though the core of the IT team was located at the University of Ottawa where the central server was housed. There were three eras of software development during the project’s life. The first era saw the CCRI IT team focus on the 1911 census, and the goal was to develop software similar to the programs used by previous census projects (such as the 1901 Canadian Families Project), with expertise rooted in Microsoft Visual Basic and Microsoft Access. The second era, which began in late 2004 and focused on the 1921 to 1941 censuses, witnessed the growth of the IT team due to the recruitment of software professionals. The goal in this phase became innovation, and software development was built around the Java programming language. The third era, from mid-2006 onward, focused on the challenges of the 1951 census. The census of 1951 marks a new point of departure in the history of the census in Canada in that the majority of responses were recorded using machine-readable technology (punch cards). What follows is an overview of some of the major challenges faced, and innovations advanced, by the IT team in its three major phases of life.
3.0 Sample Point Selection
The CCRI model is based on a partial-count sample; the basic sample units are dwellings, and all responses for each individual in the sampled dwelling were recorded. Thus, the project defined year-specific sampling strategies that determined which of the available dwellings on the census schedule images would be keyed (that is, captured for the sample) for a given census. In the first era of software development for the CCRI, two programs were used together to generate the 1911 samples. The first presented the coordinates of the sample points for data capture, which had been previously identified from microfilm reels. It recorded how many households existed for each census sub-district; that is, using a third-party image viewer the data entry operator examined each schedule and recorded the observed number of dwellings per sub-district. The second program used these counts and a random number generator to generate a list of dwelling numbers for each sub-district. Each number in the list represented a dwelling that was to be included in the sample for a particular sub-district.
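The second 1911 program described above can be illustrated with a minimal sketch. This is not the CCRI's own code (which, for that era, was rooted in Visual Basic and Access); the function name `select_dwellings`, the sampling fraction, the rule of keeping at least one dwelling per non-empty sub-district, and the assumption that dwellings are numbered 1 to N within a sub-district are all hypothetical choices made for illustration.

```python
import random


def select_dwellings(dwelling_counts, sampling_fraction, seed=None):
    """Return {sub_district: sorted list of sampled dwelling numbers}.

    dwelling_counts maps a sub-district label to the number of dwellings
    an operator counted on its schedules. Dwellings are assumed to be
    numbered 1..count (an assumption made for this sketch).
    """
    rng = random.Random(seed)  # seeded so a sample can be reproduced
    sample = {}
    for sub_district, count in dwelling_counts.items():
        if count == 0:
            k = 0
        else:
            # Hypothetical rule: at least one dwelling, never more than exist.
            k = min(count, max(1, round(count * sampling_fraction)))
        # Draw k distinct dwelling numbers without replacement.
        sample[sub_district] = sorted(rng.sample(range(1, count + 1), k))
    return sample


counts = {"sub-district A": 120, "sub-district B": 45}
picked = select_dwellings(counts, sampling_fraction=0.05, seed=1911)
for sub_district, dwellings in picked.items():
    print(sub_district, dwellings)
```

Drawing without replacement guarantees that no dwelling appears twice in a sub-district's list, and fixing the seed makes the list repeatable if the sample has to be regenerated.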
In the second era of software development, which focused on the censuses of 1921-1941, a new program was developed for sample point selection by the CCRI IT team that improved the quality of the sample by giving more control to the data entry operators. Time had been lost during microdata capture for 1911 when the computer selected a dwelling, for example, which had to be relocated and replaced at data entry time because it was illegible. Moreover, data capture was also slowed because the digital image had to be positioned in the viewer to locate the header information and then repositioned to reveal the dwelling data for the selected sample point. As Darroch, Smith, and Gaudreault explain, “we decided to redesign the entire census microdata creation process, enhancing its simplicity and elegance by combining the diverse requirements into a single application. Our ambition was to identify or tag the sample points of interest on a digital image and automatically keep track of their location during the entire research life cycle” (68).
The CCRI IT team began formulating a specially-designed image viewer to browse all the dwellings on a reel. One of the most innovative tools developed by the CCRI IT team was the Sample Point Identification, Data Entry and Reporting (SPIDER) software. SPIDER was envisioned by the CCRI as the launch pad for all of the project’s software tools related to creating microdata from digital images. It was developed to take advantage of the opportunity to use digital images of census schedules in a way that no existing software could. When a target dwelling was encountered (the characteristics that determined a target dwelling were defined by the sampling protocol developed for the particular census year), the data entry operator would confirm the legibility of the sample and could then use a point-and-click interface to highlight the data of the individuals that occupied the dwelling. When the user marked the image, the program used the image coordinates to generate a sample point and save it to a central database. The data entry operator now used the newly designed image viewer to browse all the dwellings on a microfilm reel, and store census schedule images, sample point dwelling coordinates, data capture, error messages, notes, comments, and audit traces in a database for immediate retrieval at any time (Darroch et al. 69).
The sample point selection approach had to be revisited for the 1951 census because the 1951 schedules were markedly different from the schedules used in prior years. The previous versions of the sample point selection software were designed for a multiple tabular census schedule where each row of the form represented the data recorded for one individual. In other words, a single form contained the complete microdata for several individuals. The original 1951 schedule was completely different: each form contained the data for only a single individual. Accordingly, a new image viewer was developed to index the 1951 images. As the data entry operator browsed each reel, he or she could examine each image and use the software to tag it.
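The idea of turning a marked region of an image into a stored sample point can be sketched as follows. SPIDER itself was written in Java against a central database; the sketch below is only an illustration in Python, with an in-memory SQLite database standing in for that store. The `SamplePoint` fields, the table layout, and the function names are assumptions, not the project's actual schema.

```python
import sqlite3
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class SamplePoint:
    """A region an operator highlighted on one census schedule image."""
    reel_id: str    # hypothetical microfilm reel identifier
    image_id: str   # hypothetical image identifier within the reel
    x: int          # left edge of the highlighted region, in pixels
    y: int          # top edge of the highlighted region, in pixels
    width: int
    height: int


def open_store():
    """Create an in-memory stand-in for the central database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sample_points ("
        "id INTEGER PRIMARY KEY, reel_id TEXT, image_id TEXT, "
        "x INTEGER, y INTEGER, width INTEGER, height INTEGER)"
    )
    return conn


def save_sample_point(conn, point):
    """Persist one sample point and return its generated identifier."""
    cur = conn.execute(
        "INSERT INTO sample_points (reel_id, image_id, x, y, width, height) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        astuple(point),
    )
    conn.commit()
    return cur.lastrowid


conn = open_store()
point_id = save_sample_point(
    conn, SamplePoint("reel-001", "img-0042", x=10, y=200, width=900, height=60)
)
print(point_id)
```

Because the coordinates are stored with the image identifier, a later data capture step can reopen the same image and highlight the same rows without the operator having to locate the dwelling again.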
The 1951 sampling program used the results of the indexing process to generate sample points. The household information collected during the indexing phase was used to identify the target households for which sample points were created. Additionally, a substitution program was written to select a suitable replacement for sample points that, during the data entry process, were determined to be unusable.
4.0 Data Capture
After sample point identification, the microdata was captured. Data entry operators keyed the microdata from the images into electronic forms developed by the IT group. Several programs then performed basic validations on the keyed data before storing it in a DB2 database.
For the 1911 census, a data capture program presented a screen that resembled the census form. At the top of the screen was a locator number that identified a particular dwelling on a particular image. The leading characters of the number identified which image should be opened by the operator (using a third-party image viewer). The final digits of the locator number were used to identify the target dwelling among several dwellings that might appear on the image. The microdata for the individuals in the target dwelling could then be keyed into the data capture software.
For the 1921-1941 censuses, the data entry program also used a grid layout for capturing the individual responses. Instead of presenting a locator number to the operator, however, CCRI software used the data collected during sample point selection to present the actual image in a custom viewer that highlighted the target individuals.
The 1951 data capture program was based on an innovative image viewer created by the CCRI IT team that featured optical mark recognition capabilities. The viewer not only allowed the operator to view the images of each form belonging to a particular sample point, but it also interpreted the markings on the form and automatically completed a data grid with the corresponding answers. The operator could then manually specify the correct answer when the computer failed to recognize the correct response.
5.0 Data Cleaning
Once data capture was complete, the collected data was submitted to a CCRI cleaning program that generated an improved (or “cleaned”) copy of the original verbatim data. Data cleaning is the process in which “dirty” data are detected and corrected in order to prepare the data for validation; this usually meant cleaning misspelled, illegible, or erroneous responses. The cleaning cycle software verified the data in three steps: promotion, standardization, and validation.
Promotion: during data entry, an operator could suggest an alternate value when the enumerator’s recorded response was missing, illegible, or suspect. The cleaning software replaced the missing or suspect value with the operator’s suggestion in the “cleaned” copy of the data.
Standardization: misspellings, abbreviations, and synonyms of standard answers were replaced with the standard form of the response in the “cleaned” copy of the data. For example, “c.o.e” in the RELIGION column was replaced with “Church of England” in the “cleaned” copy of the data.
Validation: the data was subjected to various validations. When an exception was detected, the suspect record was sent back, with an error message, to the originating CCRI centre where the data had been entered. For example, if a married individual was less than 12 years old, the data record was flagged and sent back for review.
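The three steps above can be sketched in a few lines. This is an illustrative outline rather than the CCRI cleaning software: the field names (`AGE`, `MARITAL_STATUS`), the `STANDARD_FORMS` table, and the function `clean_record` are hypothetical, and only the two examples given in the text (the “c.o.e” abbreviation and the married-under-12 check) are modelled.

```python
# Hypothetical lookup of non-standard spellings -> standard form, per column.
STANDARD_FORMS = {
    "RELIGION": {"c.o.e": "Church of England"},
}


def clean_record(verbatim, suggestions):
    """Return (cleaned_copy, errors) without modifying the verbatim record.

    verbatim    -- dict of responses exactly as keyed from the schedule
    suggestions -- dict of operator-suggested values for missing,
                   illegible, or suspect responses
    """
    cleaned = dict(verbatim)  # the verbatim data is kept untouched

    # Promotion: the operator's suggestion replaces the suspect value.
    for field, suggestion in suggestions.items():
        cleaned[field] = suggestion

    # Standardization: replace known variants with the standard response.
    for field, table in STANDARD_FORMS.items():
        value = cleaned.get(field)
        if isinstance(value, str):
            cleaned[field] = table.get(value.strip().lower(), value)

    # Validation: collect error messages for records needing review.
    errors = []
    age = cleaned.get("AGE")
    if cleaned.get("MARITAL_STATUS") == "Married" and isinstance(age, int) and age < 12:
        errors.append("married individual is less than 12 years old")
    return cleaned, errors


verbatim = {"RELIGION": "C.O.E", "AGE": None, "MARITAL_STATUS": "Married"}
cleaned, errors = clean_record(verbatim, suggestions={"AGE": 10})
print(cleaned)
print(errors)
```

Working on a copy mirrors the distinction the project drew between the original verbatim data and its “cleaned” counterpart: a flagged record can be sent back for review with the keyed responses still intact.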
6.0 Data Coding
Coding is the process of associating each response to its corresponding code in a chosen coding scheme. A coding scheme can be very specific and address a single domain, or it can be broad and offer codes for several domains. The coding scheme created by the CCRI was very specific: it offered codes for each of the domains (e.g. LANGUAGE, RELIGION) covered by the 1911-1951 censuses. The CCRI IT team developed code management software that allowed the coding team to manage (add, modify, or delete) the coding schemes used to code the captured data.
Moreover, the CCRI IT team developed a code mapping system (a mapping being the association of a response with a particular code in a coding scheme) that allowed members of the cleaning team to associate one or more standardized responses to a specific code from a chosen coding scheme. Mappings were an efficient way to store coding information. For example, presume that there are four thousand nurses in the database. If the occupation “NURSE” was to be coded to 12345, then the CCRI software created a single mapping to reflect the association rather than modifying the data of the four thousand individuals that have nursing as an occupation. This software also provided a review feature, which allowed other team members to improve the work of the coding team, as well as a sign-off function that team leaders in the coding team used to accept or reject the reviewed mappings.
In addition, the CCRI Geocoding Team defined a series of polygons that represent geographical regions within Canada (St. Hilaire et al.). Each polygon has its own unique identifier called a “CCRIUID” by the project. The IT group also developed a program that used summary files produced by the Geocoding Team in order to assign a unique CCRIUID to every individual in the CCRI database, thereby facilitating the use of the integrated data set by later users.
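The economy of the mapping approach in the nurse example can be shown with a short sketch. The data structures and function names here (`add_mapping`, `code_for`) are hypothetical; only the response “NURSE”, the code 12345, and the figure of four thousand records come from the example above.

```python
# Four thousand individual records sharing one standardized response.
records = [{"id": i, "OCCUPATION": "NURSE"} for i in range(4000)]

# A mapping associates (domain, standardized response) with one code.
mappings = {}


def add_mapping(mappings, domain, response, code):
    """Record that a standardized response in a domain takes this code."""
    mappings[(domain, response)] = code


def code_for(mappings, domain, response):
    """Look up the code for a response, or None if it is not yet mapped."""
    return mappings.get((domain, response))


# One mapping is stored; none of the 4,000 records is modified.
add_mapping(mappings, "OCCUPATION", "NURSE", "12345")

codes = {code_for(mappings, "OCCUPATION", r["OCCUPATION"]) for r in records}
print(len(mappings), codes)
```

If the coding team later revises the code for “NURSE”, only the single mapping changes, and every record picks up the new code the next time it is looked up.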
7.0 Data Delivery
With a view towards the horizon of data analysis in the humanities, the CCRI database was constructed in order to be compatible with DB2-type database files. However, the Statistics Canada Research Data Centers – the primary venue through which non-aggregated CCRI data will be available upon completion of the project – do not currently support DB2 data files. As a result, CCRI data will be made available to researchers via a specially developed extract file designed to work around this problem. Specifically, the extract program produces a flat file of coded census microdata for a specific census year. The data in the file is stored in a common format that is readable by most statistical analysis tools such as SPSS and SAS. The two main functions of the extract program are selection and coding. Census data is protected by privacy legislation; therefore, the extract program suppresses all responses that might be used to identify an individual. Moreover, the codable responses that are extracted are coded before they are written to the flat file; that is, the extract program uses the mappings created to convert the textual responses to their equivalent codes. Certain extracted variables are not codable (for example the “CCRIUID”) and are written as is to the extract file.
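The two functions of the extract program, selection and coding, can be sketched as follows. This is a hedged illustration, not the CCRI extract program: the choice of comma-separated text as the flat-file format, the column names, the `SUPPRESSED` and `CODABLE` sets, and all code values are assumptions made for the example.

```python
import csv
import io

SUPPRESSED = {"NAME", "ADDRESS"}        # hypothetical identifying columns
CODABLE = {"RELIGION", "OCCUPATION"}    # hypothetical codable columns


def extract(records, mappings, columns):
    """Return flat-file text for the given records.

    Selection: identifying columns are dropped entirely.
    Coding: codable responses are replaced by their mapped codes;
    other columns (such as CCRIUID) are written as is.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    kept = [c for c in columns if c not in SUPPRESSED]
    writer.writerow(kept)
    for rec in records:
        row = []
        for col in kept:
            value = rec.get(col, "")
            if col in CODABLE:
                # Unmapped responses are left blank in this sketch.
                value = mappings.get((col, value), "")
            row.append(value)
        writer.writerow(row)
    return out.getvalue()


records = [{"NAME": "Jane Doe", "RELIGION": "Church of England",
            "OCCUPATION": "NURSE", "CCRIUID": "X1"}]
mappings = {("RELIGION", "Church of England"): "201",
            ("OCCUPATION", "NURSE"): "12345"}
print(extract(records, mappings, ["NAME", "RELIGION", "OCCUPATION", "CCRIUID"]))
```

Suppressing columns before any row is written means an identifying response cannot reach the extract file by accident, whatever the individual records contain.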
In addition to the coded microdata, for every individual the extract file contains a number of variables that were derived from the keyed data. These derived variables are keys and sequence numbers that are used to structure the contents of the extract file. The IT group also created a program to generate these variables and store them in the database for retrieval by the extract program.
As can be surmised from the details above, the innovative and sophisticated software developed by CCRI IT members was central to the success of the CCRI project. Original software developed by the CCRI significantly advanced the ways in which census data (from approximately twenty million images in this case) could be processed. The innovations by the CCRI IT team have laid the groundwork for future research concerned with the mining of census and other statistical data.
8.0 CCRI Contextual Data
One of the central components of the original vision of the CCRI was the concept of “data on data.” Whereas metadata such as census enumerator instructions reveal much about how the Canadian census was originally organized and constructed, the project’s collection of contextual data provides researchers with the background evidence needed to understand the broader intellectual and sociological influences in the making, taking, and reception of the census across Canada (Bellavance et al.). Contextual data raise and inform a host of questions not directly related to the enumerations themselves: How has the census been used by those outside of government circles? To what extent has the census guided fundamental decisions by Canadians in their lives? To what extent did census taking inform, or was it informed by, the Canadian political process? In order to address such questions, and allow users access to a full range of research possibilities, the CCRI built a contextual database composed of Statistics Canada documents, political debates, and newspaper coverage taken from the era in which the censuses were constructed. Because of the epistemological questions concerning what constitutes such data, capturing them proved challenging. The process itself, however, opened up new research opportunities and avenues of exploration concerning the making and taking of the census.
Part of the work at the University of Ottawa centre, for example, was to develop a strategy for identifying material within the federal political debates – recorded verbatim in the federally archived Hansard volumes – that would be useful for researchers, especially once the content was linked to the process of the census.[6] As one might expect when working with a volume of information this size,[7] one of the immediate questions which arose revolved around what criteria should be established to effectively capture “relevant” data. One obvious research strategy was to utilize the Hansard indexes and capture all of the debates indexed under the term “census.” Such an approach, however, would not exhaust the data which the CCRI team wanted to capture in the spirit of the “contextual” data. A debate on immigration policy, for example, might prove valuable as a contextual source surrounding the immigration, nationality, and origin questions, but such a debate was not indexed under the term “census.” Thus, the task became to capture the debates not indexed. We concluded that there was an abundance of contextual data pertaining to the census that could be captured using more creative and imaginative research strategies.
The CCRI’s approach was to blend a range of search and capture approaches. First, sets of keywords were developed that would match Hansard index terms with census questions. For example, for the cost of education question in 1911 (Schedule 1, column 37) “education” and “schools” were obvious search terms, but they were not the only terms searched in the index that would aid in building a contextual data repository for this question. Other relevant terms that would capture contextual data that might uncover debates pertaining to this question were generated as well, such as “income,” “wealth,” “poverty,” “occupation,” and the like. Developing a set of keywords for each census question allowed us to mine debates that were not included in the index under the obvious terms related to the census question, but were potentially central to many of the contextual debates surrounding the census.
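The keyword strategy was carried out by researchers working through printed Hansard indexes, but its logic can be expressed as a small sketch. Everything below is hypothetical apart from the census question and the search terms quoted above: the shape of an index entry, the page numbers, and the function name `candidate_entries` are invented for illustration.

```python
# Keyword set for one census question, using the terms given in the text.
KEYWORDS = {
    "1911 Schedule 1, column 37 (cost of education)": [
        "education", "schools", "income", "wealth", "poverty", "occupation",
    ],
}


def candidate_entries(index_entries, keywords_by_question):
    """Group index entries by the census question whose keywords they match.

    index_entries is a list of dicts with a "term" key (the index heading)
    plus any locating information, such as a page number.
    """
    hits = {}
    for question, keywords in keywords_by_question.items():
        wanted = {k.lower() for k in keywords}
        hits[question] = [
            entry for entry in index_entries if entry["term"].lower() in wanted
        ]
    return hits


# Invented index entries; the page numbers are placeholders.
index_entries = [
    {"term": "Schools", "page": 101},
    {"term": "Poverty", "page": 250},
    {"term": "Railways", "page": 300},
]
for question, entries in candidate_entries(index_entries, KEYWORDS).items():
    print(question, [e["page"] for e in entries])
```

A match only nominates a debate for reading; deciding whether it belongs in the contextual database remained a matter of the researcher's judgment.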
Second, we utilized the Hansard name index in order to create a “record linkage” of central figures in the making of the census (Winchester). For example, Sidney Arthur Fisher, the Minister of Agriculture (the department responsible for the census at the time of the 1911 enumeration), was an important politician involved in most of the debates concerning the 1911 census. Robert Borden was also a central figure in these census debates. Reading all of the debates in which they took part, according to the Hansard index, was more time consuming, and much of what was read was not captured and included in the contextual database; however, doing so certainly allowed us to capture more contextual data than we would have otherwise captured. Moreover, such a record linkage has also allowed us to increase the historical knowledge regarding these key figures in the history of the census in Canada.
Of course, our strategies in 2004 have already been superseded by new technology designed to execute textual searches in digitally-scanned images. By contrast, in 2004 (when the research began) the Hansard debates were yet to be digitized. The process of digitizing the historical debates, in fact, remains ongoing and highlights one of the unique challenges of long-term research in the digital humanities. Had they been digitized at the time of our contextual data mining, however, it is still unlikely that any effective digital search would have revealed a significant amount of contextual data for capture. That is, textual search technology in digitized images was still nascent at the time; indeed, the digitization and "searchability" of historical documents is a rapidly evolving process. Once the Hansard volumes are fully digitized and made available in the more advanced digital technology of today and tomorrow, they will no doubt allow the researcher to generate his or her own search terms and create his or her own record linkages in order to uncover his or her own topic-specific contextual data.
Original documents first generated at Statistics Canada were also added to the newspaper and Hansard databases. Beginning in the late winter of 2004, members of the CCRI team, with partners at Statistics Canada, examined materials from the Statistics Canada archives and evaluated them according to their relevance to the project. In addition to metadata in the form of published documentary sources, such as enumerator instructions, team members explored hundreds of boxes of unpublished correspondence, internal reports, analytical papers, bulletins, memorandums, personal notes, and the like. Initially, the exploration of these sources was undertaken to enhance the researcher’s understanding of the published metadata. The richness of what was found, however, surpassed CCRI team leaders’ expectations (Gaffield 60). The Statistics Canada sources provided new revelations concerning some of the reasoning behind decisions on the wording of census questions, strategies for elevating public awareness concerning an upcoming census, and the types of post-census requests made by arms of government, corporations, and community groups, all of whom were anxious to update their understanding of the Canadian landscape by using the latest census results. The result has been another layer of contextual data currently being integrated into the contextual database that can enhance the census researcher’s understanding of the ways in which the census was made, taken, and received by Canadians.
By making the contextual data a central component of the research infrastructure, the CCRI has put to rest the notion that training in quantitative research is a prerequisite for explorations into the census. One potential consequence of the CCRI is that it might attract those researchers not normally moved by the numbers themselves. In fact, many of the researchers employed by the CCRI throughout the infrastructure’s development were themselves “qualitative” scholars with little or no quantitative research training. The CCRI has blended quantitative and qualitative research in ways that complement each other, offer the possibility for study in and of themselves, and, perhaps more importantly, suggest collaborative team research opportunities.
9.0 Creating a Comprehensive User Guide
Creating a user guide for the entire CCRI was the final – and perhaps the greatest – challenge. Since the project was a collaborative effort involving scholars from a number of disciplinary backgrounds, one enormous quandary was how to create a guide that would make the CCRI accessible in the disciplinary language of a variety of researchers. Before we could achieve a comprehensive guide for the CCRI, however, we had to devise a process that could bring the exchange of concepts and methods across an interdisciplinary team, an exchange that had characterized the CCRI from its inception, into a common language. In other words, how could a user guide be created in a singular way, using a singular language, while remaining true to the CCRI philosophy of accommodating the disciplinary expectations and priorities of team members using multiple research strategies from a diverse spectrum of disciplines (Gaffield 54-64)? The answer was not clear-cut, but the best strategy quickly emerged: it would require a common effort by team members from across the disciplines, members from across the country, and partners at Statistics Canada, all of whom had to contribute to a symbiotic process. In this way, the user guide’s construction – and ultimately the user guide itself – represents the fusion of disciplinary and institutional cultures. The result, it is hoped, is a user guide that appeals to a wide variety of scholars and specialists.
Having devised a general approach, those overseeing the construction of the user guide shifted to much more specific challenges facing the integration of so many different types of data. The creation, for instance, of the Data Dictionary (a standard tool provided to researchers engaging a census database so as to define all applicable census terms) brought to the table questions that had hovered above the heads of CCRI researchers for years, but had not previously been a major concern. The original conceptualization of the CCRI user guide was one that would provide an overall guide for each census year, then one specific user guide for each census variable. For its part and by its very nature, however, a Data Dictionary seeks to offer definitions for every database variable that exists. An unexpected and complex theoretical question was suddenly a matter of practical concern: what do we mean by a database variable? CCRI researchers had built their concept of “database variable” upon a less clearly defined “census variable,” which really meant “census question.” A single census question, however, did not always provide a single database variable. In fact, a single census question would often produce two or more database variables. For example, the “age” question, as asked between 1911 and 1941, produced two pieces of information that were recorded on the census schedule in a single cell. The first piece of data represented a numeric time value equal to the age of the respondent (e.g. 5, 15, 25), and the second piece of data represented the time unit (e.g. years, months, weeks, especially for the very young). Thus, a response in a cell for one individual could be recorded by the enumerator as “25 years,” while a response for another individual could be recorded as “5 months.”
Understanding how to overcome this challenge required us to rethink our understanding of how variables were captured in the census prior to 1951. That is, since 1951, censuses have been designed with computer data capture and databases in mind. Ideally, this design requires that only a single piece of information be recorded for each census question. Where there are supplementary data to be added to a single question, however, those data are given separate entry fields, which represent separate variables in the database, thereby establishing a modern-day “one-value-per-cell” rule. The 1911 to 1941 census schedule forms, however, were designed prior to the computer age. The census schedule forms for 1911 to 1941 were recorded in manually-completed grids (the columns represented the census questions and the rows represented the individuals enumerated). Ideally, enumerators would record one piece of information per question for each individual in the appropriate census grid cell. As the “age” question example above illustrates, however, this was not always the case. In some other instances, in fact, the enumerator was specifically instructed to record additional information for a single question either inside or outside of the provided cell. For example, in the 1941 “relationship to head” question, enumerators were instructed to record both the relationship to the head of the household being enumerated (in the cell provided) and whether the individual was a dependant of that household head (the enumerator was to denote this supplementary data in either the same cell or elsewhere on the schedule, such as the margins). A review of the completed schedules revealed that the enumerators used a variety of techniques to record more than one piece of information in a single cell or elsewhere on the schedule. Thus, on a number of occasions, the one-value-per-cell rule was violated.
The CCRI data capture system attempted to eliminate all violations of the one-value-per-cell rule that were discovered during an analysis of the census schedules and enumerator instructions. As a result, the CCRI data capture forms contain more columns than the schedules they represent. The extra columns were created to allow for the collection of “supplemental” data into the database. For example, the single census question “age” was captured as two variables: AGE_AMOUNT and AGE_UNIT. The AGE_AMOUNT variable captures the number (e.g. 16), while the AGE_UNIT variable captures the time unit (e.g. years, months, weeks, days).
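The split of the “age” question into two variables can be illustrated with a minimal sketch. It assumes, purely for illustration, that a schedule cell has already been transcribed as a text string such as “25 years”; the function name, the parsing rule, and the handling of singular units are hypothetical and do not describe the CCRI data capture software itself. Only the variable names AGE_AMOUNT and AGE_UNIT and the list of time units come from the project.

```python
import re
from typing import NamedTuple, Optional

# Time units named in the article; accepting singular forms is an assumption.
KNOWN_UNITS = ("years", "months", "weeks", "days")


class AgeRecord(NamedTuple):
    AGE_AMOUNT: Optional[int]
    AGE_UNIT: Optional[str]


def split_age_cell(cell: str) -> AgeRecord:
    """Split a transcribed age cell (hypothetical input) into two variables."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]*)\.?\s*", cell)
    if match is None:
        # Nothing recognisable as an age; leave both variables empty.
        return AgeRecord(None, None)
    amount = int(match.group(1))
    unit_text = match.group(2).lower()
    if not unit_text:
        # A bare number: the unit is left unset rather than guessed.
        return AgeRecord(amount, None)
    unit = unit_text if unit_text.endswith("s") else unit_text + "s"
    if unit not in KNOWN_UNITS:
        return AgeRecord(amount, None)
    return AgeRecord(amount, unit)


if __name__ == "__main__":
    for cell in ("25 years", "5 months", "16"):
        print(cell, "->", split_age_cell(cell))
```

Keeping the amount and the unit in separate fields is what restores the one-value-per-cell rule: each field can then be validated, coded, and queried on its own.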
Thus, the challenge for the User Guide team became one of providing a user guide for each census variable (i.e. question), while at the same time providing descriptions of each CCRI database variable. The challenge, in the end, became an opportunity to add a layer of information for the user regarding the multiplicity of database variables derived from single census questions. The User Guide Team thus employed a two-pronged approach to variable descriptions. On the one hand, the user would be provided with descriptions of each census question, along with relevant metadata appropriate for the analysis of that question; from there the user would be pointed to the single or multiple database variables derived from that question. Conversely, a user choosing to look at the database variable definitions first would be pointed to the “source” of the variable, most often a census question, and thus the relevant metadata appropriate for the analysis of that variable. In this way, they could be made aware that the variable of concern had “partner” variables that might be of interest, or, more importantly, might be essential for the user’s research. The result has been that the user has the range of metadata available in each census question user guide, including volumes of material, both published and unpublished, spanning all five census years from 1911 to 1951, that are dynamically linked to the data dictionary in a single workstation.
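The two-pronged approach amounts to a lookup that works in both directions. The following sketch is illustrative only and assumes a simple in-memory mapping: the “age” entry follows the example given earlier, while the variable names for the “relationship to head” question are hypothetical placeholders and not actual CCRI variable names.

```python
# Census question -> database variables derived from it.
QUESTION_TO_VARIABLES = {
    "age": ["AGE_AMOUNT", "AGE_UNIT"],
    # Hypothetical names, used here only to show a second multi-variable question.
    "relationship to head": ["RELATIONSHIP_TO_HEAD", "DEPENDANT_FLAG"],
}


def build_reverse_index(question_to_variables):
    """Build the database variable -> source census question lookup."""
    reverse = {}
    for question, variables in question_to_variables.items():
        for variable in variables:
            reverse[variable] = question
    return reverse


VARIABLE_TO_QUESTION = build_reverse_index(QUESTION_TO_VARIABLES)


def partner_variables(variable):
    """Return the other database variables derived from the same question."""
    question = VARIABLE_TO_QUESTION[variable]
    return [v for v in QUESTION_TO_VARIABLES[question] if v != variable]


if __name__ == "__main__":
    print(VARIABLE_TO_QUESTION["AGE_UNIT"])
    print(partner_variables("AGE_UNIT"))
```

A user starting from a census question follows the first mapping to its variables; a user starting from a variable follows the reverse index back to the question and its metadata, and from there to any “partner” variables.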
10.0 Looking Forward: Potential and Proliferation
The CCRI project began with the central goal of creating a research infrastructure to enable innovative and internationally significant programs of research focused upon a central question: what characteristics, processes, and circumstances explain the making of modern Canada? After more than five years of research, the construction of seven data centres across Canada, the digitizing of close to twenty million images, and the work of over one hundred academics, students, and coordination personnel, the CCRI is now set to enable systematic, comprehensive, and comparative research and analysis on the first half of the twentieth century in Canada. A wealth of information collected from newspapers, political debates, secondary literature, and Statistics Canada internal records will also be added to the re-sampled content of these censuses, including an entirely new set of tools for geographic analysis.
The CCRI data has already begun its release within the academic community in partnership with Statistics Canada; this has included a range of individual and team presentations at the Carto 2008 conference (joint conference of the Canadian Cartographic Association and the Association of Canadian Map Libraries and Archives) in Vancouver, B.C. (May 2008), the Social Science History Association Meeting, in Miami, Florida (October 2008), le Congrès de l'association Canadienne des géographes / Canadian Association of Geographers, in Quebec City, Quebec (May 2008), and the Canadian Historical Association Annual Meeting in Vancouver, B.C. (June 2008). CCRI team leader Peter Baskerville also organized a gala conference held at the University of Alberta from October 3 to 5, 2008, entitled State of the World: Information Infrastructure Construction and Dissemination for Humanities and Social Science Research. The prime objective of this conference was to boost the image of humanities and social science research within the broader Canadian research environment, with the CCRI as the central backdrop to the conference proceedings. In addition, attendees were invited to read posters and sift through beta versions of the CCRI infrastructure so as to be introduced to its features and organization.
These conferences have certainly highlighted some of the unique potential contributions that the CCRI is offering to the research community. One well-developed example has been the emerging potential for GIS data analysis: the assignment of the above-mentioned CCRIUIDs will allow the linking of aggregated microdata to summary table data. As a result, the GIS team will create pre-defined aggregations by other geographic criteria (e.g. size of community, economic base, ecological regions, history of development) and plans to enable users to aggregate by user-defined criteria, whether based on the characteristics of the CSD summary data (e.g. population density, ethnic make-up), on geographic relationships, or on interactive selection of geographic areas (Moldofsky and Lowe).
In addition to these types of professional and research-based initiatives, the potential for the infrastructure to be utilized as an academic and teaching tool has already been beta tested with some inspiring results. As indicated by team leader Charles Jones at the University of Toronto, census-based databases allow for a nearly infinite range of topics and subject matter for large classes. Jones suggests that using historical census data – once enabled – helps teach students to “specify their hypotheses, argument and findings in a circumscribed space” while also allowing them to develop skills needed for larger-scale project design. Up until this point, projects such as IPUMS and NAPP have been useful for accessing aggregate tables, while Public Use Microdata Files (PUMF) from recent censuses have been made operational by Statistics Canada. Jones found that even uninitiated students quickly came back with questions: Why were certain questions posed or not posed in a given census? Why have coding schemes changed? Why are variables in the public use files from modern censuses configured to protect confidentiality – given that open access is a fundamental requirement for utilizing the wealth of information that is census data – when older censuses indicate no need for similar protection (Jones)?
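The linkage described here, in which a shared geographic identifier ties microdata records to summary table rows so that users can aggregate by their own criteria, can be sketched as follows. This is an assumption-laden illustration and not the GIS team's implementation: the field names, the identifier values, the density figures, and the threshold are all invented for the example, and "ccriuid" simply stands in for the CCRIUID key.

```python
from collections import Counter

# Invented summary rows keyed by a stand-in for the CCRIUID.
csd_summary = {
    "CSD-1": {"population_density": 120.0},
    "CSD-2": {"population_density": 3.5},
}

# Invented microdata records, each carrying the same identifier.
microdata = [
    {"ccriuid": "CSD-1"},
    {"ccriuid": "CSD-1"},
    {"ccriuid": "CSD-2"},
]


def aggregate_by_criterion(records, summary, classify):
    """Count microdata records per group, where the group is decided by a
    user-defined function applied to the linked summary row."""
    counts = Counter()
    for record in records:
        summary_row = summary.get(record["ccriuid"])
        if summary_row is None:
            continue  # no linked summary row; skipped in this sketch
        counts[classify(summary_row)] += 1
    return counts


def by_density(summary_row):
    """A user-defined criterion (hypothetical threshold)."""
    return "dense" if summary_row["population_density"] >= 50 else "sparse"


if __name__ == "__main__":
    print(aggregate_by_criterion(microdata, csd_summary, by_density))
```

Because the grouping rule is passed in as a function, the same linkage serves pre-defined aggregations and ad hoc, user-defined ones alike.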
11.0 CCRI Delivery
These comparisons bring us to the question of the delivery of the CCRI database files to users. Initially, the sample data files – as well as the associated geographic files and User Guides – will be distributed by Statistics Canada through their Research Data Centres (RDCs). These will undergo the same validation procedures and be subject to the same constraints on the publication of data as the Public Use Microdata Files (PUMF) from modern censuses. In addition, part of the CCRI's plan has always been to allow users to extract sample data – properly anonymized and aggregated – through a publicly-accessible website interface, similar to the IPUMS project in the United States.
The implementation of a full data extraction web utility will probably have to wait for future follow-up proposals to CFI and other agencies. The CCRI database will continue to be housed and maintained at several of the current project centres, with the University of Alberta and the Université du Québec à Trois-Rivières likely occupying the lead roles in this maintenance; CCRI geographic reference files, GIS layers, summary tables, and other non-confidential files will ultimately be distributed through the main CCRI website, through some websites for related projects, and, hopefully, through a range of library catalogues as well (Moldofsky and Lowe). Also on the immediate horizon is the creation of an extraction database so that future dissemination centres have at their disposal a full range of service tools for the research community.
The CCRI was a massive initiative. It was, in fact, one of the largest humanities and social sciences projects of its kind to date. And yet the CCRI represents in many ways only the beginning of what is possible in creating advanced, dynamic humanities and social science research initiatives. Through the collaborative use of sophisticated software and information technology, the CCRI has created an infrastructure that will enable census-based research with the potential to fundamentally transform our understanding of the making of Canada throughout the twentieth century. More than that, its success can serve to remind us about the potential of humanities and social science research in the digital age. Its release will mark not an ending but, hopefully, the initiation of a new era in humanities and social science digital research.
Works Cited
Bellavance, Claude, France Normand, and Evelyn S. Ruppert. “Census in Context: Documenting and Understanding the Making of Early-Twentieth-Century Canadian Censuses.” Historical Methods 40.2 (2007): 92-103. Print.
Canadian Families Project. Canadian Families Project Website. 2006. University of Victoria, Victoria. Web. 8 June 2009 <http://web.uvic.ca/hrd/cfp/>.
Darroch, Gordon, Richard D.B. Smith, and Michel Gaudreault. “CCRI Sample Designs and Sample Point Identification, Data Entry, and Reporting (SPIDER) Software.” Historical Methods 40.2 (2007): 65-75. Print.
Gaffield, Chad. “Conceptualizing and Constructing the Canadian Century Research Infrastructure.” Historical Methods 40.2 (2007): 54-64. Print.
IPUMS USA. Integrated Public Use Microdata Series: Census Microdata for Social and Economic Research. 2009. University of Minnesota, Minnesota Population Center, Minneapolis. Web. 8 June 2009 <http://usa.ipums.org/usa/>.
Jones, Charles. “Using Historical Census Data in Teaching.” CCRI Newsletter 1.6 (2008): 2-4. Print.
Moldofsky, Byron, and Kerry Lowe. “Canadian Century Research Infrastructure Project: Historical GIS in the Service of Modern Census Mapping and Analysis.” Unpublished Conference Paper. Presented at Carto2008 Conference, CCA/ACMLA, May 14, 2008; abstract. Web. 8 June 2009 <http://www.rdl.sfu.ca/imgs/carto/CARTO_Abstracts.pdf>.
NAPP. North Atlantic Population Project Website. 2009. University of Minnesota, Minnesota Population Center, Minneapolis. Web. 8 June 2009 <http://www.nappdata.org/napp/>.
St-Hilaire, Marc, Byron Moldofsky, Laurent Richard, and Mariange Beaudry. “Geocoding and Mapping Historical Census Data: The Geographical Component of the Canadian Century Research Infrastructure.” Historical Methods 40.2 (2007): 76-91. Print.
Winchester, Ian. “The Linkage of Historical Records by Man and Computer: Techniques and Problems.” Journal of Interdisciplinary History 1.1 (1970): 107-124. Print.
Population Volumes of the Census of Canada. The Dominion Bureau of Statistics / Statistics Canada, 1911-1951. Print.
Endnotes
[1] Anthony and Adam would like to thank the extended CCRI team for their hard work and generous support. In addition to the team leaders and IT team members mentioned below, Carmen Bauer, Mariange Beaudry, Nicola Farnworth, Kerrie Lowe, Byron Moldofsky, Mirela Matiu, Terry Quinlan, Laurent Richard, and Doug Thompson have all generated material on which this article was based.
[2] This discourse is drawn from the CCRI User Guide, but its underpinnings reach back to the original Canadian Foundation for Innovation (CFI) grant proposal.
[3] This section draws from Richard Smith’s and Brian Jennings’ explanation of CCRI software, its tools and features, and general overview of its use, which can be found in the CCRI User Guide (Canadian Century Research Infrastructure). Smith and Jennings are the key CCRI IT software architects.
[4] These data are available in a wide range of STC published tables. In this case, they were drawn from the Population Volumes of the Census of Canada 1911-1951, published by the Dominion Bureau of Statistics.
[5] The transfer of original census returns onto microfilm began in 1955 and was thought of as a long-term solution to the degradation of the paper records. In many cases, the paper records were thereafter destroyed.
[6] The example of Parliamentary Debates will be used here; for a review of data capturing involving Canadian newspapers, see Claude Bellavance, France Normand, and Evelyn S. Ruppert (2007).
[7] There are thousands of pages of recorded debate in each year of the federal parliament, both in the House of Commons and the Senate.