Hi all,

I'm excited to announce that Amazon Elastic MapReduce is now hosting the Google Books n-gram dataset in Amazon S3. The data has been converted to SequenceFile format to make it easy to process using Hadoop. I spent some time this week playing with the data using Hive and put together an article which demonstrates how easy it is to get interesting results:
http://aws.amazon.com/articles/5249664154115844

I've included details about the public dataset at the bottom of this e-mail. The original data came from here:

http://ngrams.googlelabs.com/datasets

I'm looking forward to seeing what the community does with this data.

Andrew

== What are n-grams? ==

N-grams are fixed-size tuples of items. In this case the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words or characters.

The n-grams in this dataset were produced by passing a sliding window over the text of the books and emitting a record at each position. For example, the sentence

  The yellow dog played fetch.

would produce the following 2-grams:

  ["The", "yellow"]
  ["yellow", "dog"]
  ["dog", "played"]
  ["played", "fetch"]
  ["fetch", "."]

or the following 3-grams:

  ["The", "yellow", "dog"]
  ["yellow", "dog", "played"]
  ["dog", "played", "fetch"]
  ["played", "fetch", "."]

You can aggregate equivalent n-grams to find the total number of occurrences of each n-gram. This dataset contains counts of n-grams by year along three axes: total occurrences, the number of pages on which they occur, and the number of books in which they appear.

== Dataset format ==

There are a number of different datasets available. Each dataset is a single n-gram type (1-gram, 2-gram, etc.) for a given input corpus (such as English or Russian text), and each dataset is stored as a single object in Amazon S3. The file is in SequenceFile format with block-level LZO compression. The SequenceFile key is the row number of the dataset stored as a LongWritable, and the value is the raw data stored as Text.

The value is a tab-separated string containing the following fields:

  n-gram      - The actual n-gram.
  year        - The year for this aggregation.
  occurrences - The number of times this n-gram appeared in this year.
  pages       - The number of pages this n-gram appeared on in this year.
  books       - The number of books this n-gram appeared in during this year.

The n-gram field is a space-separated representation of the tuple. For example:

  analysis is often described as	1991	1	1	1

== Available Datasets ==

The entire dataset hasn't been released yet, but the datasets that were complete at the time of writing are available. Here are the names of the available corpora and their abbreviations:

  English              - eng-all
  English One Million  - eng-1M
  American English     - eng-us-all
  British English      - eng-gb-all
  English Fiction      - eng-fiction-all
  Chinese (simplified) - chi-sim-all
  French               - fre-all
  German               - ger-all
  Russian              - rus-all
  Spanish              - spa-all

Within each corpus there are up to five datasets, representing the n-grams from length one to five. These can be found in Amazon S3 at the following location:

  s3://datasets.elasticmapreduce/ngrams/books/20090715/<corpus>/<n>gram/data

For example, you can find the American English 1-grams at the following location:

  s3://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data

NOTE: These datasets are hosted in the us-east-1 region. If you process them from other regions you will be charged data transfer fees.
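If you want to poke at the files directly rather than through Hive, here is a minimal sketch of reading a few records with the plain Hadoop Java API. This is not from the article; the class name NgramReader and the ten-record limit are just illustrative, and it assumes your Hadoop configuration already has S3 credentials and an LZO codec available (on some setups you may need the s3n:// scheme instead of s3://). The point is simply that each record is a LongWritable row number plus a Text value that you split on tabs:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class NgramReader {
        public static void main(String[] args) throws IOException {
            // American English 1-grams, path from the listing above.
            Path path = new Path(
                "s3://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data");

            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(path.toUri(), conf);
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);

            LongWritable rowNumber = new LongWritable();
            Text row = new Text();

            // Each value is one tab-separated line: n-gram, year, occurrences,
            // pages, books. Print the first ten records as a sanity check.
            for (int i = 0; i < 10 && reader.next(rowNumber, row); i++) {
                String[] fields = row.toString().split("\t");
                String ngram = fields[0];
                int year = Integer.parseInt(fields[1]);
                long occurrences = Long.parseLong(fields[2]);
                long pages = Long.parseLong(fields[3]);
                long books = Long.parseLong(fields[4]);
                System.out.printf("%s (%d): %d occurrences on %d pages in %d books%n",
                    ngram, year, occurrences, pages, books);
            }
            reader.close();
        }
    }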
== Dataset statistics ==

This table contains information about all available datasets.

Dataset                       Rows            Compressed Size
English 1-gram                472,764,897     4.8 GB
English One Million 1-gram    261,823,186     2.6 GB
American English 1-gram       291,639,822     3.0 GB
American English 2-gram       3,923,370,881   38.3 GB
British English 1-gram        188,660,459     1.9 GB
British English 2-gram        2,000,106,933   19.1 GB
British English 3-gram        5,186,054,851   46.8 GB
British English 4-gram        5,325,077,699   46.6 GB
British English 5-gram        3,044,234,000   26.4 GB
English Fiction 1-gram        191,545,012     2.0 GB
English Fiction 2-gram        2,516,249,717   24.3 GB
Chinese 1-gram                7,741,178       0.1 GB
Chinese 2-gram                209,624,705     2.2 GB
Chinese 3-gram                701,822,863     7.2 GB
Chinese 4-gram                672,801,944     6.8 GB
Chinese 5-gram                325,089,783     3.4 GB
French 1-gram                 157,551,172     1.6 GB
French 2-gram                 1,501,278,596   14.3 GB
French 3-gram                 4,124,079,420   37.3 GB
French 4-gram                 4,659,423,581   41.2 GB
French 5-gram                 3,251,347,768   28.8 GB
German 1-gram                 243,571,225     2.5 GB
German 2-gram                 1,939,436,935   18.3 GB
German 3-gram                 3,417,271,319   30.9 GB
German 4-gram                 2,488,516,783   21.9 GB
German 5-gram                 1,015,287,248   8.9 GB
Russian 1-gram                238,494,121     2.5 GB
Russian 2-gram                2,030,955,601   20.2 GB
Russian 3-gram                2,707,065,011   25.8 GB
Russian 4-gram                1,716,983,092   16.1 GB
Russian 5-gram                800,258,450     7.6 GB
Spanish 1-gram                164,009,433     1.7 GB
Spanish 2-gram                1,580,350,088   15.2 GB
Spanish 5-gram                2,013,934,820   18.1 GB
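Finally, as a rough illustration of the aggregation idea from the n-grams section above, here is a hedged MapReduce sketch that sums total occurrences per n-gram across all years. The class names are made up, the driver is omitted, and it assumes a job configured with SequenceFileInputFormat over one of the s3:// paths above (keys LongWritable, values Text, as described in the format section):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: the key is the row number, the value is the tab-separated
    // record. Emits (n-gram, occurrences) so the reducer can sum over years.
    public class TotalOccurrencesMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            context.write(new Text(fields[0]),
                new LongWritable(Long.parseLong(fields[2])));
        }
    }

    // Reducer: sums the occurrence counts for each n-gram over all years.
    class TotalOccurrencesReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text ngram, Iterable<LongWritable> counts,
                Context context) throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable count : counts) {
                total += count.get();
            }
            context.write(ngram, new LongWritable(total));
        }
    }

    // A driver (omitted) would set SequenceFileInputFormat as the input
    // format, point the job at one of the dataset paths, and register these
    // classes as the mapper and reducer.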