
HOUSE_OVERSIGHT_017013.jpg

Source: HOUSE_OVERSIGHT  •  Size: 0.0 KB  •  OCR Confidence: 85.0%

Extracted Text (OCR)

II. Construction of Historical N-grams Corpora

As noted in the paper text, we did not analyze the entire set of 15 million books digitized by Google. Instead, we:

1. Performed further filtering steps to select only a subset of books with highly accurate metadata.

2. Subdivided the books into ‘base corpora’ using such metadata fields as language, country of publication, and subject.

3. For each base corpus, constructed a massive numerical table that lists, for each n-gram (often a word or phrase), how often it appears in the given base corpus in every single year between 1550 and 2008 (see the sketches of steps 2 and 3 below).

In this section, we will describe these three steps. These additional steps ensure high data quality, and they also make it possible to examine historical trends without violating the ‘fair use’ principle of copyright law: our object of study is the set of frequency tables produced in step 3 (which are available as supplemental data), not the full text of the books.

II.1. Additional filtering of books

II.1.A. Accuracy of date-of-publication metadata

Accurate date-of-publication data is a crucial component in the production of time-resolved n-grams data. Because our study focused most centrally on the English-language corpus, we applied more stringent inclusion criteria to ensure that the accuracy of the date-of-publication data was as high as possible.

We found that the lion's share of date-of-publication errors were due to so-called ‘bound-withs’: single volumes that contain multiple works, such as anthologies or the collected works of a given author. Among these bound-withs, the most inaccurately dated subclass were serial publications, such as journals and periodicals. For instance, many journals had publication dates erroneously attributed to the year in which the first issue of the journal was published. These journals and serial publications also represented a different aspect of culture than the books did. For these reasons, we decided to filter out all serial publications to the extent possible.

Our ‘Serial Killer’ algorithm removed serial publications by looking for suggestive metadata entries containing one or more of the following:

1. Serial-associated titles, containing such phrases as ‘Journal of’, ‘US Government report’, etc.

2. Serial-associated authors, such as those in which the author field is blank, too numerous, or contains words such as ‘committee’.

Note that the match is case-insensitive and must be to a complete word in the title; thus the filtering of titles containing the word ‘digest’ does not lead to the removal of works with ‘digestion’ in the title (a sketch of this matching rule appears below). The entire list of serial-associated title phrases and serial-associated author phrases is included as supplemental data (Appendix). For English books, 29.4% of books were filtered using the ‘Serial Killer’, with the title filter removing 2% and the author filter removing 27.4%. Foreign-language corpora were filtered in a similar fashion.

This filtering step markedly increased the accuracy of the metadata dates. We determined metadata accuracy by examining 1000 filtered volumes distributed uniformly over time from 1801 to 2000 (5 per year). An annotator with no knowledge of our study manually determined the date of publication; the annotator was unaware of the Google metadata dates during this process. We found that 5.8% of English books had
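To make step 2 concrete, the following is a minimal sketch of subdividing books into base corpora keyed on metadata fields. The record layout and the field names (language, country, subject) are illustrative assumptions, not the actual metadata schema used in the pipeline.

```python
# Minimal sketch of step 2 (base-corpus subdivision). The dict-based record
# format and field names are hypothetical stand-ins for the real metadata.
from collections import defaultdict

def subdivide(books):
    """Group book records into base corpora keyed on metadata fields."""
    corpora = defaultdict(list)
    for book in books:
        key = (book["language"], book["country"], book["subject"])
        corpora[key].append(book)
    return corpora

books = [
    {"language": "eng", "country": "US", "subject": "fiction", "title": "A"},
    {"language": "eng", "country": "GB", "subject": "fiction", "title": "B"},
]
by_corpus = subdivide(books)  # two base corpora, split by country of publication
```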
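Step 3 amounts to a per-year count of each n-gram within a base corpus. The sketch below shows the idea in miniature, assuming whitespace tokenization and a simple (year, text) record format; the actual tables were of course built at a vastly larger scale.

```python
# Schematic sketch of step 3: per-year n-gram frequency counts for one base
# corpus. Whitespace tokenization and the (year, text) record format are
# simplifying assumptions.
from collections import Counter

def ngrams(tokens, n):
    """Yield every contiguous n-gram in a token sequence."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])

def frequency_table(books, n, first_year=1550, last_year=2008):
    """books: iterable of (year, text) pairs from a single base corpus.
    Returns a Counter mapping (ngram, year) -> number of occurrences."""
    counts = Counter()
    for year, text in books:
        if first_year <= year <= last_year:
            for gram in ngrams(text.lower().split(), n):
                counts[(gram, year)] += 1
    return counts

corpus = [(1861, "a house divided against itself"), (1861, "a house of cards")]
table = frequency_table(corpus, 2)
# table[("a house", 1861)] == 2
```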
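The matching rule described for the ‘Serial Killer’ (case-insensitive, complete words only, so that ‘digest’ does not match ‘digestion’) can be sketched with word-boundary regular expressions. The phrase lists and the cutoff for ‘too numerous’ authors below are placeholders; the real lists appear in the supplemental Appendix.

```python
# Hedged sketch of the 'Serial Killer' matching rule: case-insensitive,
# complete-word matches against serial-associated phrases. The phrase lists
# and the author-count threshold are assumed, not taken from the paper.
import re

TITLE_PHRASES = ["journal of", "us government report", "digest"]
AUTHOR_PHRASES = ["committee"]

def whole_word_match(text, phrase):
    """True if `phrase` occurs in `text` as complete words (case-insensitive),
    so filtering on 'digest' leaves 'digestion' untouched."""
    return re.search(r"\b" + re.escape(phrase) + r"\b", text, re.IGNORECASE) is not None

def is_serial(title, authors):
    """Flag a volume whose metadata suggests a serial publication."""
    if any(whole_word_match(title, p) for p in TITLE_PHRASES):
        return True
    if not authors.strip():               # blank author field
        return True
    if len(authors.split(";")) > 5:       # 'too numerous' authors (assumed cutoff)
        return True
    return any(whole_word_match(authors, p) for p in AUTHOR_PHRASES)

assert is_serial("The Journal of Heredity", "")               # title phrase, blank author
assert not is_serial("Digestion and Its Disorders", "Brown, T.")  # whole-word rule holds
```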

