(“3.14159”) and typos (“excesss”). An n-gram is a sequence of
1-grams, such as the phrases “stock market” (a 2-gram) and
“the United States of America” (a 5-gram). We restricted n to
5, and limited our study to n-grams occurring at least 40 times
in the corpus.
Usage frequency is computed by dividing the number of
instances of the n-gram in a given year by the total number of
words in the corpus in that year. For instance, in 1861, the 1-
gram “slavery” appeared in the corpus 21,460 times, on
11,687 pages of 1,208 books. The corpus contains
386,434,758 words from 1861; thus the frequency is 5.5×10⁻⁵.
“slavery” peaked during the Civil War (early 1860s) and then
again during the civil rights movement (1955–1968) (Fig. 1B).
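The frequency computation described above can be restated as a short sketch. The function name is ours, but the counts are the figures quoted in the text for “slavery” in 1861:

```python
# Sketch of the usage-frequency computation: occurrences of an
# n-gram in a year divided by the total words in the corpus that
# year. Counts below are the paper's published 1861 figures.

def usage_frequency(ngram_count: int, total_words: int) -> float:
    """Frequency of an n-gram in a given year."""
    return ngram_count / total_words

# "slavery" in 1861: 21,460 occurrences, 386,434,758 corpus words.
freq = usage_frequency(21_460, 386_434_758)
print(f"{freq:.1e}")  # on the order of 5.5e-05
```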
In contrast, we compare the frequency of “the Great War”
to the frequencies of “World War I” and “World War II.” “the
Great War” peaks between 1915 and 1941. But although its
frequency drops thereafter, interest in the underlying events
had not disappeared; instead, they are referred to as “World
War I” (Fig. 1C).
These examples highlight two central factors that
contribute to culturomic trends. Cultural change guides the
concepts we discuss (such as “slavery”). Linguistic change —
which, of course, has cultural roots — affects the words we use
for those concepts (“the Great War” vs. “World War I”). In
this paper, we will examine both linguistic changes, such as
changes in the lexicon and grammar; and cultural phenomena,
such as how we remember people and events.
The full dataset, which comprises over two billion
culturomic trajectories, is available for download or
exploration at www.culturomics.org.
The Size of the English Lexicon
How many words are in the English language (9)?
We call a 1-gram “common” if its frequency is greater
than one per billion. (This corresponds to the frequency of the
words listed in leading dictionaries (7).) We compiled a list of
all common 1-grams in 1900, 1950, and 2000 based on the
frequency of each 1-gram in the preceding decade. These lists
contained 1,117,997 common 1-grams in 1900, 1,102,920 in
1950, and 1,489,337 in 2000.
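The “common 1-gram” criterion above reduces to a simple threshold test; the following is a hypothetical sketch (function name and toy counts are ours, not the paper's):

```python
# A 1-gram is "common" in a reference year if its frequency over
# the preceding decade exceeds one per billion words. Toy counts
# below are illustrative, not from the corpus.

COMMON_THRESHOLD = 1e-9  # one occurrence per billion words

def is_common(decade_count: int, decade_total_words: int) -> bool:
    return decade_count / decade_total_words > COMMON_THRESHOLD

# 50 occurrences in a decade of 40 billion words: 1.25 per billion.
print(is_common(50, 40_000_000_000))  # True
```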
Not all common 1-grams are English words. Many fell
into three non-word categories: (i) 1-grams with non-
alphabetic characters (“l8r”, “3.14159”); (ii) misspellings
(“becuase”, “abberation”); and (iii) foreign words
(“sensitivo”).
To estimate the number of English words, we manually
annotated random samples from the lists of common 1-grams
(7) and determined what fraction were members of the above
non-word categories. The result ranged from 51% of all
common 1-grams in 1900 to 31% in 2000.
Using this technique, we estimated the number of words in
the English lexicon as 544,000 in 1900, 597,000 in 1950, and
1,022,000 in 2000. The lexicon is enjoying a period of
enormous growth: the addition of ~8500 words/year has
increased the size of the language by over 70% during the last
fifty years (Fig. 2A).
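The estimate above is the common-1-gram count scaled by the sampled word fraction. A back-of-envelope sketch (function name ours; using the rounded percentages quoted in the text, so results only approximate the paper's figures):

```python
# Lexicon-size estimate: common 1-grams times the fraction of the
# annotated sample that were genuine English words (one minus the
# non-word fraction). The paper's 544,000 and 1,022,000 come from
# unrounded sample fractions; rounded inputs land nearby.

def lexicon_size(common_1grams: int, nonword_fraction: float) -> int:
    return round(common_1grams * (1 - nonword_fraction))

print(lexicon_size(1_117_997, 0.51))  # ≈548,000 for 1900
print(lexicon_size(1_489_337, 0.31))  # ≈1,028,000 for 2000
```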
Notably, we found more words than appear in any
dictionary. For instance, the 2002 Webster’s Third New
International Dictionary [W3], which keeps track of the
contemporary American lexicon, lists approximately 348,000
single-word wordforms (10); the American Heritage
Dictionary of the English Language, Fourth Edition (AHD4)
lists 116,161 (11). (Both contain additional multi-word
entries.) Part of this gap is because dictionaries often exclude
proper nouns and compound words (“whalewatching”). Even
accounting for these factors, we found many undocumented
words, such as “aridification” (the process by which a
geographic region becomes dry), “slenthem” (a musical
instrument), and, appropriately, the word “deletable.”
This gap between dictionaries and the lexicon results from
a balance that every dictionary must strike: it must be
comprehensive enough to be a useful reference, but concise
enough to be printed, shipped, and used. As such, many
infrequent words are omitted. To gauge how well dictionaries
reflect the lexicon, we ordered our year 2000 lexicon by
frequency, divided it into eight deciles (ranging from 10⁻⁹–
10⁻⁸ to 10⁻²–10⁻¹), and sampled each decile (7). We manually
checked how many sample words were listed in the OED (12)
and in the Merriam-Webster Unabridged Dictionary [MWD].
(We excluded proper nouns, since neither OED nor MWD
lists them.) Both dictionaries had excellent coverage of high
frequency words, but less coverage for frequencies below
10⁻⁶: 67% of words in the 10⁻⁹–10⁻⁸ range were listed in neither
dictionary (Fig. 2B). Consistent with Zipf’s famous law, a
large fraction of the words in our lexicon (63%) were in this
lowest frequency bin. As a result, we estimated that 52% of
the English lexicon — the majority of the words used in
English books — consists of lexical “dark matter”
undocumented in standard references (12).
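The coverage analysis above pairs a log-frequency binning step with a dictionary lookup. A hypothetical sketch (all names and the toy sample are ours; the real analysis used manual checks against the OED and MWD):

```python
# Assign each word to a decade-wide frequency bin, then measure
# what fraction of a sampled bin appears in a dictionary.
import math

def frequency_bin(freq: float) -> int:
    """Exponent of the decade bin, e.g. -9 for the 10^-9 - 10^-8 range."""
    return math.floor(math.log10(freq))

def coverage(sample: list[str], dictionary: set[str]) -> float:
    """Fraction of sampled words listed in the dictionary."""
    return sum(w in dictionary for w in sample) / len(sample)

# Toy data: rare words from the text checked against a tiny "dictionary".
sample = ["aridification", "slenthem", "deletable", "whale"]
dictionary = {"whale"}
print(frequency_bin(5.5e-5))         # -5
print(coverage(sample, dictionary))  # 0.25
```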
To keep up with the lexicon, dictionaries are updated
regularly (13). We examined how well these changes
corresponded with changes in actual usage by studying the
2077 1-gram headwords added to AHD4 in 2000. The overall
frequency of these words, such as “buckyball” and
“netiquette”, has soared since 1950: two-thirds exhibited
recent, sharp increases in frequency (>2X from 1950-2000)
(Fig. 2C). Nevertheless, there was a lag between
lexicographers and the lexicon. Over half the words added to
AHD4 were part of the English lexicon a century ago
(frequency >10⁻⁹ from 1890–1900). In fact, some newly-
added words, such as “gypseous” and “amplidyne”, have
already undergone a steep decline in frequency (Fig. 2D).
Not only must lexicographers avoid adding words that
have fallen out of fashion, they must also weed obsolete
words from earlier editions. This is an imperfect process. We
Sciencexpress / www.sciencexpress.org / 16 December 2010 / Page 2 / 10.1126/science.1199644