(“3.14159”) and typos (“excesss”). An n-gram is a sequence of
1-grams, such as the phrases “stock market” (a 2-gram) and
“the United States of America” (a 5-gram). We restricted n to
5, and limited our study to n-grams occurring at least 40 times
in the corpus.
Usage frequency is computed by dividing the number of
instances of the n-gram in a given year by the total number of
words in the corpus in that year. For instance, in 1861, the 1-
gram “slavery” appeared in the corpus 21,460 times, on
11,687 pages of 1,208 books. The corpus contains
386,434,758 words from 1861; thus the frequency is 5.5×10⁻⁵.
“slavery” peaked during the Civil War (early 1860s) and then
again during the civil rights movement (1955–1968) (Fig. 1B).
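The frequency computation described above can be restated as a short sketch. The function name is ours, but the counts are the figures quoted in the text for “slavery” in 1861:

```python
# Sketch of the usage-frequency computation: occurrences of an
# n-gram in a year divided by the total words in the corpus that
# year. Counts below are the paper's published 1861 figures.

def usage_frequency(ngram_count: int, total_words: int) -> float:
    """Frequency of an n-gram in a given year."""
    return ngram_count / total_words

# "slavery" in 1861: 21,460 occurrences, 386,434,758 corpus words.
freq = usage_frequency(21_460, 386_434_758)
print(f"{freq:.1e}")  # on the order of 5.5e-05
```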
In contrast, we compare the frequency of “the Great War”
to the frequencies of “World War I” and “World War II.” “the
Great War” peaks between 1915 and 1941. But although its
frequency drops thereafter, interest in the underlying events
had not disappeared; instead, they are referred to as “World
War I” (Fig. 1C).
These examples highlight two central factors that
contribute to culturomic trends. Cultural change guides the
concepts we discuss (such as “slavery”). Linguistic change —
which, of course, has cultural roots — affects the words we use
for those concepts (“the Great War” vs. “World War I”). In
this paper, we will examine both linguistic changes, such as
changes in the lexicon and grammar; and cultural phenomena,
such as how we remember people and events.
The full dataset, which comprises over two billion
culturomic trajectories, is available for download or
exploration at www.culturomics.org.
The Size of the English Lexicon
How many words are in the English language (9)?
We call a 1-gram “common” if its frequency is greater
than one per billion. (This corresponds to the frequency of the
words listed in leading dictionaries (7).) We compiled a list of
all common 1-grams in 1900, 1950, and 2000 based on the
frequency of each 1-gram in the preceding decade. These lists
contained 1,117,997 common 1-grams in 1900, 1,102,920 in
1950, and 1,489,337 in 2000.
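The “common 1-gram” criterion above reduces to a simple threshold test; the following is a hypothetical sketch (function name and toy counts are ours, not the paper's):

```python
# A 1-gram is "common" in a reference year if its frequency over
# the preceding decade exceeds one per billion words. Toy counts
# below are illustrative, not from the corpus.

COMMON_THRESHOLD = 1e-9  # one occurrence per billion words

def is_common(decade_count: int, decade_total_words: int) -> bool:
    return decade_count / decade_total_words > COMMON_THRESHOLD

# 50 occurrences in a decade of 40 billion words: 1.25 per billion.
print(is_common(50, 40_000_000_000))  # True
```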
Not all common 1-grams are English words. Many fell
into three non-word categories: (i) 1-grams with non-
alphabetic characters (“l8r”, “3.14159”); (ii) misspellings
(“becuase”, “abberation”); and (iii) foreign words
(“sensitivo”).
To estimate the number of English words, we manually
annotated random samples from the lists of common 1-grams
(7) and determined what fraction were members of the above
non-word categories. The result ranged from 51% of all
common 1-grams in 1900 to 31% in 2000.
Using this technique, we estimated the number of words in
the English lexicon as 544,000 in 1900, 597,000 in 1950, and
1,022,000 in 2000. The lexicon is enjoying a period of
enormous growth: the addition of ~8500 words/year has
increased the size of the language by over 70% during the last
fifty years (Fig. 2A).
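The estimate above is the common-1-gram count scaled by the sampled word fraction. A back-of-envelope sketch (function name ours; using the rounded percentages quoted in the text, so results only approximate the paper's figures):

```python
# Lexicon-size estimate: common 1-grams times the fraction of the
# annotated sample that were genuine English words (one minus the
# non-word fraction). The paper's 544,000 and 1,022,000 come from
# unrounded sample fractions; rounded inputs land nearby.

def lexicon_size(common_1grams: int, nonword_fraction: float) -> int:
    return round(common_1grams * (1 - nonword_fraction))

print(lexicon_size(1_117_997, 0.51))  # ≈548,000 for 1900
print(lexicon_size(1_489_337, 0.31))  # ≈1,028,000 for 2000
```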
Notably, we found more words than appear in any
dictionary. For instance, the 2002 Webster’s Third New
International Dictionary [W3], which keeps track of the
contemporary American lexicon, lists approximately 348,000
single-word wordforms (10); the American Heritage
Dictionary of the English Language, Fourth Edition (AHD4)
lists 116,161 (11). (Both contain additional multi-word
entries.) Part of this gap is because dictionaries often exclude
proper nouns and compound words (“whalewatching”). Even
accounting for these factors, we found many undocumented
words, such as “aridification” (the process by which a
geographic region becomes dry), “slenthem” (a musical
instrument), and, appropriately, the word “deletable.”
This gap between dictionaries and the lexicon results from
a balance that every dictionary must strike: it must be
comprehensive enough to be a useful reference, but concise
enough to be printed, shipped, and used. As such, many
infrequent words are omitted. To gauge how well dictionaries
reflect the lexicon, we ordered our year 2000 lexicon by
frequency, divided it into eight deciles (ranging from 10⁻⁹–
10⁻⁸ to 10⁻²–10⁻¹), and sampled each decile (7). We manually
checked how many sample words were listed in the OED (12)
and in the Merriam-Webster Unabridged Dictionary [MWD].
(We excluded proper nouns, since neither OED nor MWD
lists them.) Both dictionaries had excellent coverage of high
frequency words, but less coverage for frequencies below
10⁻⁶: 67% of words in the 10⁻⁹–10⁻⁸ range were listed in neither
dictionary (Fig. 2B). Consistent with Zipf’s famous law, a
large fraction of the words in our lexicon (63%) were in this
lowest frequency bin. As a result, we estimated that 52% of
the English lexicon — the majority of the words used in
English books — consists of lexical “dark matter”
undocumented in standard references (12).
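The coverage analysis above pairs a log-frequency binning step with a dictionary lookup. A hypothetical sketch (all names and the toy sample are ours; the real analysis used manual checks against the OED and MWD):

```python
# Assign each word to a decade-wide frequency bin, then measure
# what fraction of a sampled bin appears in a dictionary.
import math

def frequency_bin(freq: float) -> int:
    """Exponent of the decade bin, e.g. -9 for the 10^-9 - 10^-8 range."""
    return math.floor(math.log10(freq))

def coverage(sample: list[str], dictionary: set[str]) -> float:
    """Fraction of sampled words listed in the dictionary."""
    return sum(w in dictionary for w in sample) / len(sample)

# Toy data: rare words from the text checked against a tiny "dictionary".
sample = ["aridification", "slenthem", "deletable", "whale"]
dictionary = {"whale"}
print(frequency_bin(5.5e-5))         # -5
print(coverage(sample, dictionary))  # 0.25
```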
To keep up with the lexicon, dictionaries are updated
regularly (13). We examined how well these changes
corresponded with changes in actual usage by studying the
2077 1-gram headwords added to AHD4 in 2000. The overall
frequency of these words, such as “buckyball” and
“netiquette”, has soared since 1950: two-thirds exhibited
recent, sharp increases in frequency (>2X from 1950-2000)
(Fig. 2C). Nevertheless, there was a lag between
lexicographers and the lexicon. Over half the words added to
AHD4 were part of the English lexicon a century ago
(frequency >10⁻⁹ from 1890–1900). In fact, some newly-
added words, such as “gypseous” and “amplidyne”, have
already undergone a steep decline in frequency (Fig. 2D).
Not only must lexicographers avoid adding words that
have fallen out of fashion, they must also weed obsolete
words from earlier editions. This is an imperfect process. We
Sciencexpress / www.sciencexpress.org / 16 December 2010 / Page 2 / 10.1126/science.1199644