Research Article
Quantitative Analysis of Culture Using Millions of Digitized Books
Jean-Baptiste Michel,1,2,3,4*† Yuan Kui Shen,5 Aviva Presser Aiden,6 Adrian Veres,7 Matthew K. Gray,8 The Google Books
Team,8 Joseph P. Pickett,9 Dale Hoiberg,10 Dan Clancy,8 Peter Norvig,8 Jon Orwant,8 Steven Pinker,3 Martin A. Nowak,1,11,12
Erez Lieberman Aiden1,12,13,14,15,16*†
1Program for Evolutionary Dynamics, Harvard University, Cambridge, MA 02138, USA. 2Institute for Quantitative Social
Sciences, Harvard University, Cambridge, MA 02138, USA. 3Department of Psychology, Harvard University, Cambridge, MA
02138, USA. 4Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA. 5Computer Science and
Artificial Intelligence Laboratory, MIT, Cambridge, MA 02139, USA. 6Harvard Medical School, Boston, MA, 02115, USA.
7Harvard College, Cambridge, MA 02138, USA. 8Google, Inc., Mountain View, CA, 94043, USA. 9Houghton Mifflin Harcourt,
Boston, MA 02116, USA. 10Encyclopaedia Britannica, Inc., Chicago, IL 60654, USA. 11Dept of Organismic and Evolutionary
Biology, Harvard University, Cambridge, MA 02138, USA. 12Dept of Mathematics, Harvard University, Cambridge, MA
02138, USA. 13Broad Institute of Harvard and MIT, Harvard University, Cambridge, MA 02138, USA. 14School of Engineering
and Applied Sciences, Harvard University, Cambridge, MA 02138, USA. 15Harvard Society of Fellows, Harvard University,
Cambridge, MA 02138, USA. 16Laboratory-at-Large, Harvard University, Cambridge, MA 02138, USA.
*These authors contributed equally to this work.
†To whom correspondence should be addressed. E-mail: jb.michel@gmail.com (J.B.M.); erez@erez.com (E.A.).
We constructed a corpus of digitized texts containing
about 4% of all books ever printed. Analysis of this
corpus enables us to investigate cultural trends
quantitatively. We survey the vast terrain of
“culturomics”, focusing on linguistic and cultural
phenomena that were reflected in the English language
between 1800 and 2000. We show how this approach can
provide insights about fields as diverse as lexicography,
the evolution of grammar, collective memory, the
adoption of technology, the pursuit of fame, censorship,
and historical epidemiology. “Culturomics” extends the
boundaries of rigorous quantitative inquiry to a wide
array of new phenomena spanning the social sciences and
the humanities.
Reading small collections of carefully chosen works enables
scholars to make powerful inferences about trends in human
thought. However, this approach rarely enables precise
measurement of the underlying phenomena. Attempts to
introduce quantitative methods into the study of culture (1-6)
have been hampered by the lack of suitable data.
We report the creation of a corpus of 5,195,769 digitized
books containing ~4% of all books ever published.
Computational analysis of this corpus enables us to observe
cultural trends and subject them to quantitative investigation.
“Culturomics” extends the boundaries of scientific inquiry to
a wide array of new phenomena.
The corpus has emerged from Google’s effort to digitize
books. Most books were drawn from over 40 university
libraries around the world. Each page was scanned with
custom equipment (7), and the text digitized using optical
character recognition (OCR). Additional volumes — both
physical and digital — were contributed by publishers.
Metadata describing date and place of publication were
provided by the libraries and publishers, and supplemented
with bibliographic databases. Over 15 million books have
been digitized [12% of all books ever published (7)]. We
selected a subset of over 5 million books for analysis on the
basis of the quality of their OCR and metadata (Fig. 1A) (7).
Periodicals were excluded.
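The two coverage figures quoted above can be cross-checked with simple arithmetic. The following Python sketch infers the total number of books ever published that is implied by the "15 million, or 12%" figure, and then computes the share represented by the 5,195,769-book corpus. The variable names are illustrative, and because the text says "over 15 million", the implied total is only a rough estimate.

```python
# Figures taken from the text; variable names are illustrative.
digitized_books = 15_000_000   # "over 15 million books have been digitized"
digitized_share = 0.12         # "12% of all books ever published"
corpus_books = 5_195_769       # books selected for analysis

# Total number of books ever published implied by the two figures above.
implied_total_books = digitized_books / digitized_share

# Share of all books ever published that the analysis corpus represents.
corpus_share = corpus_books / implied_total_books

print(f"Implied total: about {implied_total_books / 1e6:.0f} million books")
print(f"Corpus share: about {corpus_share:.1%}")
```

Under these inputs the corpus share comes out slightly above 4%, consistent with the "~4% of all books ever published" stated earlier.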
The resulting corpus contains over 500 billion words, in
English (361 billion), French (45B), Spanish (45B), German
(37B), Chinese (13B), Russian (35B), and Hebrew (2B). The
oldest works were published in the 1500s. The early decades
are represented by only a few books per year, comprising
several hundred thousand words. By 1800, the corpus grows
to 60 million words per year; by 1900, 1.4 billion; and by
2000, 8 billion.
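As a quick consistency check on the composition of the corpus, the per-language counts can be summed and compared with the "over 500 billion words" total. The Python sketch below uses only the rounded figures quoted in the text; the variable names are ours, and the growth factors are simple ratios of the quoted words-per-year values, not a fitted growth model.

```python
# Word counts per language, in billions, as quoted in the text.
words_by_language_billions = {
    "English": 361,
    "French": 45,
    "Spanish": 45,
    "German": 37,
    "Chinese": 13,
    "Russian": 35,
    "Hebrew": 2,
}

total_billions = sum(words_by_language_billions.values())
english_share = words_by_language_billions["English"] / total_billions

# Words added to the corpus per year at three reference dates.
words_per_year = {1800: 60e6, 1900: 1.4e9, 2000: 8e9}
growth_1800_to_1900 = words_per_year[1900] / words_per_year[1800]
growth_1900_to_2000 = words_per_year[2000] / words_per_year[1900]

print(f"Total: {total_billions} billion words")
print(f"English share: {english_share:.0%}")
print(f"Growth 1800-1900: x{growth_1800_to_1900:.1f}")
print(f"Growth 1900-2000: x{growth_1900_to_2000:.1f}")
```

The rounded per-language figures sum to 538 billion words, which agrees with the stated total of over 500 billion.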
The corpus cannot be read by a human. If you tried to read
only the entries from the year 2000, at the reasonable
pace of 200 words/minute, without interruptions for food or
sleep, it would take eighty years. The sequence of letters is
one thousand times longer than the human genome: if you
wrote it out in a straight line, it would reach to the moon and
back 10 times over (8).
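These comparisons can be reproduced approximately from the figures given in the text. The Python sketch below is a back-of-the-envelope check, not the authors' own calculation (which is given in reference 8). It assumes an average of 5 letters per word, a human genome of about 3 billion base pairs, and a mean Earth-Moon distance of 384,400 km; all three values and all variable names are our assumptions.

```python
# Reading time for the year-2000 portion of the corpus.
words_in_2000 = 8e9            # from the text: 8 billion words for 2000
reading_speed_wpm = 200        # from the text: 200 words per minute
minutes_per_year = 60 * 24 * 365.25
reading_years = words_in_2000 / reading_speed_wpm / minutes_per_year

# Length of the corpus in letters, compared with the human genome.
corpus_words = 500e9           # from the text: "over 500 billion words"
letters_per_word = 5           # ASSUMPTION: rough average word length
genome_base_pairs = 3e9        # ASSUMPTION: rounded human genome size
corpus_letters = corpus_words * letters_per_word
genome_ratio = corpus_letters / genome_base_pairs

# Letter width implied by "to the moon and back 10 times over".
earth_moon_distance_m = 384_400e3   # ASSUMPTION: mean distance in meters
line_length_m = 10 * 2 * earth_moon_distance_m
implied_letter_width_mm = line_length_m / corpus_letters * 1000

print(f"Reading time: about {reading_years:.0f} years")
print(f"Corpus letters / genome length: about {genome_ratio:.0f}")
print(f"Implied letter width: about {implied_letter_width_mm:.1f} mm")
```

With these assumptions the reading time is roughly 76 years, in line with the "eighty years" quoted, and the letter count is on the order of a thousand times the genome length. The implied letter width of a few millimeters is plausible for written text, but it depends directly on the assumed letters-per-word value.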
To make release of the data possible in light of copyright
constraints, we restricted our study to the question of how
often a given “1-gram” or “n-gram” was used over time. A 1-
gram is a string of characters uninterrupted by a space; this
includes words (“banana”, “SCUBA”) but also numbers
Sciencexpress / www.sciencexpress.org / 16 December 2010 / Page 1 / 10.1126/science.1199644
HOUSE_OVERSIGHT_016996