HOUSE_OVERSIGHT_017011.jpg

Source: HOUSE_OVERSIGHT • Size: 0.0 KB • OCR Confidence: 85.0%

Extracted Text (OCR)

I. Overview of Google Books Digitization

In 2004, Google began scanning books to make their contents searchable and discoverable online. To date, Google has scanned over fifteen million books: over 11% of all the books ever published. The collection contains over five billion pages and two trillion words, with books dating back to as early as 1473 and with text in 478 languages. Over two million of these scanned books were given directly to Google by their publishers; the rest are borrowed from large libraries such as the University of Michigan and the New York Public Library.

The scanning effort involves significant engineering challenges, some of which are highly relevant to the construction of the historical n-grams corpus. We survey those issues here. The result of the next three steps is a collection of digital texts associated with particular book editions, as well as composite metadata for each edition combining the information contained in all metadata sources.

I.1. Metadata

Over 100 sources of metadata information were used by Google to generate a comprehensive catalog of books. Some of these sources are library catalogs (e.g., the list of books in the collections of the University of Michigan, or union catalogs such as the collective list of books in Bosnian libraries), some are from retailers (e.g., Decitre, a French bookseller), and some are from commercial aggregators (e.g., Ingram). In addition, Google also receives metadata from its 30,000 partner publishers.

Each metadata source consists of a series of digital records, typically in either the MARC format favored by libraries, or the ONIX format used by the publishing industry. Each record refers to either a specific edition of a book or a physical copy of a book on a library shelf, and contains conventional bibliographic data such as title, author(s), publisher, date of publication, and language(s) of publication.
Cataloguing practices vary widely among these sources, and even within a single source over time. Thus two records for the same edition will often differ in multiple fields. This is especially true for serials (e.g., the Congressional Record) and multivolume works such as sets (e.g., the three volumes of The Lord of the Rings). The matter is further complicated by ambiguities in the definition of the word ‘book’ itself. Including translations, there are over three thousand editions derived from Mark Twain’s original Tom Sawyer.

Google’s process of converting the billions of metadata records into a single nonredundant database of book editions consists of the following principal steps:

1. Coarsely dividing the billions of metadata records into groups that may refer to the same work (e.g., Tom Sawyer).
2. Identifying and aggregating multivolume works based on the presence of cues from individual records.
3. Subdividing the group of records corresponding to each work into constituent groups corresponding to the various editions (e.g., the 1909 publication of De lotgevallen van Tom Sawyer, translated from English to Dutch by Johan Braakensiek).
4. Merging the records for each edition into a new “consensus” record.

The result is a set of consensus records, where each record corresponds to a distinct book edition and work, and where the contents of each record are formed out of fields from multiple sources. The number of records in this set (i.e., the number of known book editions) increases every year as more books are written.

In August 2010, this evaluation identified 129 million editions, which is the working estimate we use in this paper of all the editions ever published (this includes serials and sets but excludes kits, mixed media, and
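
As a rough illustration of the grouping and merging steps listed above, here is a minimal Python sketch. It is not Google's implementation, which the text does not describe in detail. The record layout, the helper names (`normalize`, `work_key`, `edition_key`, `consensus`, `build_editions`), the choice of grouping keys, and the majority-vote merge rule are all assumptions made for illustration, and the sample records are invented. Step 2 (multivolume works) is omitted, and exact-match keys like these would not group a translation with its original work; a real system would need fuzzier matching.

```python
from collections import Counter, defaultdict


def normalize(text):
    """Lowercase the text and keep only letters, digits, and single spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def work_key(record):
    # Step 1 (coarse grouping): hypothetical key built from author and title.
    return (normalize(record["author"]), normalize(record["title"]))


def edition_key(record):
    # Step 3 (edition split): hypothetical key built from year and language.
    return (record["year"], normalize(record["language"]))


def consensus(records):
    # Step 4 (merge): for each field, keep the most common non-empty value.
    fields = sorted({field for record in records for field in record})
    merged = {}
    for field in fields:
        values = [record[field] for record in records if record.get(field)]
        if values:
            merged[field] = Counter(values).most_common(1)[0][0]
    return merged


def build_editions(records):
    """Group records by (work, edition) and merge each group into one record."""
    groups = defaultdict(list)
    for record in records:
        groups[(work_key(record), edition_key(record))].append(record)
    return [consensus(group) for group in groups.values()]


# Invented sample records: three catalog entries for one edition, plus one
# entry for the 1909 Dutch translation mentioned in the text.
records = [
    {"title": "Tom Sawyer", "author": "Mark Twain", "year": 1876,
     "language": "English", "publisher": "American Publishing Company"},
    {"title": "TOM SAWYER.", "author": "Mark Twain", "year": 1876,
     "language": "english", "publisher": "American Publishing Company"},
    {"title": "Tom Sawyer", "author": "Mark Twain", "year": 1876,
     "language": "English", "publisher": "American Pub. Co."},
    {"title": "De lotgevallen van Tom Sawyer", "author": "Mark Twain",
     "year": 1909, "language": "Dutch", "publisher": ""},
]

editions = build_editions(records)
for edition in editions:
    print(edition)
```

In this sketch the three English-language entries collapse into one consensus record whose publisher is the majority spelling, while the Dutch translation remains a separate edition. Empty fields are skipped during the merge, so a field missing from one source can be filled from another.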

Document Details

Has Readable Text	Yes
Text Length	3,721 characters
Indexed	2026-02-04T16:29:57.827557