HOUSE_OVERSIGHT_017015.jpg
(approximately 235,000) of the books were filtered out in this way. Table S1 lists the fraction removed at
this stage for our other non-English corpora.
II.1D. Year Restriction
To further ensure that publication dates are accurate and consistent across all our corpora, we
implemented a publication year restriction and retained only books with publication years from
1550 through 2008. We found that a significant fraction of mis-dated books have a publication year
of 0 or a date prior to the invention of printing. The number of books removed by this year range
restriction is small, usually under 2% of the original number of books.
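The year restriction can be illustrated with a minimal sketch. The record layout (a dict with "id" and "year" keys) and the function name are assumptions made for illustration; only the 1550 and 2008 bounds come from the text.

```python
MIN_YEAR = 1550  # earliest publication year retained (from the text)
MAX_YEAR = 2008  # latest publication year retained (from the text)


def within_year_range(year, lo=MIN_YEAR, hi=MAX_YEAR):
    """Return True if year is an integer in the closed range [lo, hi]."""
    return isinstance(year, int) and lo <= year <= hi


# Hypothetical records; a year of 0 or a pre-printing date marks a mis-dated book.
books = [
    {"id": "a", "year": 0},
    {"id": "b", "year": 1432},
    {"id": "c", "year": 1550},
    {"id": "d", "year": 1999},
    {"id": "e", "year": 2009},
]

kept = [b for b in books if within_year_range(b["year"])]
kept_ids = [b["id"] for b in kept]  # ["c", "d"]
```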
The fraction of the corpus removed by all stages of the filtering is summarized in Table S1. Note that
because the filters are applied in a fixed order, the statistics presented below are influenced by the
sequence in which the filters were applied. For example, books that trigger both the OCR quality filter and
the language correction filter are excluded by the OCR quality filter, which is performed first. Of course,
the actual subset of books filtered is the same regardless of the order in which the filters are applied.
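The effect of filter order on the per-filter statistics can be shown with a small sketch. It assumes each filter is an independent per-book predicate; the field names ("ocr_ok", "lang_ok") and the filter labels are hypothetical.

```python
def apply_filters(books, filters):
    """Apply (name, should_remove) filters in order.

    Each removed book is attributed to the first filter that rejects it.
    Returns (kept_books, removed_by), where removed_by maps filter name to count.
    """
    removed_by = {name: 0 for name, _ in filters}
    kept = []
    for book in books:
        for name, should_remove in filters:
            if should_remove(book):
                removed_by[name] += 1
                break
        else:
            # No filter rejected the book.
            kept.append(book)
    return kept, removed_by


# Hypothetical books: book 1 fails both filters, book 2 fails only the language filter.
books = [
    {"id": 1, "ocr_ok": False, "lang_ok": False},
    {"id": 2, "ocr_ok": True, "lang_ok": False},
    {"id": 3, "ocr_ok": True, "lang_ok": True},
]

ocr_filter = ("ocr_quality", lambda b: not b["ocr_ok"])
lang_filter = ("language", lambda b: not b["lang_ok"])

kept_a, stats_a = apply_filters(books, [ocr_filter, lang_filter])
kept_b, stats_b = apply_filters(books, [lang_filter, ocr_filter])
```

With the OCR filter first, each filter is credited with one removal; with the language filter first, it is credited with both. The surviving set is identical in either order.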
II.2. Metadata-based subdivision of the Google Books Collection
II.2A. Determination of language
To create accurate corpora in particular languages that minimize cross-language contamination, it is
important to be able to accurately associate books with the language in which they were written. To
determine the language in which a text is written, we rely on metadata derived from our 100 bibliographic
sources, as well as statistical language determination using the Popat algorithm (Ref S3). The algorithm
takes advantage of the fact that certain character sequences, such as 'the', 'of', and 'ion', occur more
frequently in English. In contrast, the sequences 'la', 'aux', and 'de' occur more frequently in French.
These patterns can be used to distinguish between books written in English and those written in
French. More generally, given the entire text of a book, the algorithm can reliably classify the book into
one of the 32 supported language types. The final consensus language was determined from the
metadata sources together with the result of the statistical language determination algorithm, with the
statistical algorithm given higher priority.
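The idea of distinguishing languages by characteristic character sequences can be sketched with a toy scorer. This is not the Popat algorithm, whose details are not given here; it simply counts the example sequences quoted above. The function names, and the reading of "higher priority" as "use the statistical result when one exists, otherwise fall back to metadata", are assumptions.

```python
# Marker sequences taken from the examples in the text; a real classifier
# would use a far larger statistical model.
MARKERS = {
    "en": ("the", "of", "ion"),
    "fr": ("la", "aux", "de"),
}


def guess_language(text, markers=MARKERS):
    """Return the language whose marker sequences occur most often, or None."""
    text = text.lower()
    scores = {
        lang: sum(text.count(seq) for seq in seqs)
        for lang, seqs in markers.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def consensus_language(statistical, metadata):
    """Prefer the statistical result; fall back to metadata (assumed rule)."""
    return statistical if statistical is not None else metadata


english = guess_language("the invention of the printing press and the nation")
french = guess_language("la vie de la ville aux portes de la cité")
```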
II.2B. Determination of book subject assignments
Book subject assignments were determined using a book's Book Industry Standards and Communications
(BISAC) subject categories. BISAC subject headings are a system for categorizing books by
content, developed by the BISAC Subject Codes Committee overseen by the Book Industry Study Group.
They are used for a variety of purposes, such as determining how books are shelved in stores. For
English, 92.4% of the books had at least one BISAC subject assignment. In cases where there were
multiple subject assignments, we took the more commonly used subject heading and discarded the rest.
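One way to reduce multiple subject assignments to a single heading is sketched below. It assumes that "commonly used" is measured by how many books in the collection carry each heading, which the text does not specify; the heading labels are placeholders, not real BISAC codes.

```python
from collections import Counter


def pick_subjects(book_subjects):
    """Map each book id to its most widely used heading, or None if it has none.

    book_subjects: dict of book id -> list of subject headings.
    Ties are broken by the order in which headings are listed for the book.
    """
    # Count each heading once per book that carries it.
    usage = Counter(h for heads in book_subjects.values() for h in set(heads))
    return {
        book_id: max(heads, key=lambda h: usage[h]) if heads else None
        for book_id, heads in book_subjects.items()
    }


# Hypothetical assignments with placeholder heading labels.
assignments = {
    "b1": ["SCIENCE", "HISTORY"],
    "b2": ["HISTORY"],
    "b3": [],
}
chosen = pick_subjects(assignments)
```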
II.2C. Determination of book country-of-publication
Country of publication was determined on the basis of our 100 bibliographic sources; 97% of the books
had a country-of-publication assignment. The country code used is the two-letter code defined in the ISO
3166-1 alpha-2 standard. Specifically, when constructing our US versus British English corpora, we
used the codes "us" (United States) and "gb" (Great Britain) to filter our volumes.
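Filtering volumes by country code could look like the following sketch. The record layout and the "country" field name are assumptions; only the codes "us" and "gb" come from the text.

```python
def split_by_country(books, codes=("us", "gb")):
    """Group books by country code, keeping only the requested codes.

    Books with a missing or different country code are left out.
    """
    corpora = {code: [] for code in codes}
    for book in books:
        code = (book.get("country") or "").lower()
        if code in corpora:
            corpora[code].append(book)
    return corpora


# Hypothetical volumes; "v4" has no country-of-publication assignment.
volumes = [
    {"id": "v1", "country": "us"},
    {"id": "v2", "country": "gb"},
    {"id": "v3", "country": "fr"},
    {"id": "v4", "country": None},
]
corpora = split_by_country(volumes)
```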