Google to release 5-gram database
We all know that the denizens of the GooglePlex are good, not evil, and I think this press release drives the point home. As part of their ongoing work crawling the web, Google Research has created a database of word sequences, or n-grams, accumulated from processing over a trillion words of input.
For people doing research in data compression, as well as other areas such as speech recognition, data mining, and even AI, this is a motherlode of valuable data - 6 DVDs' worth, to be precise. As just one example, if you had the RAM and the data structures to manage it, you could use this database to predict upcoming words in a text compressor, probably with pretty good accuracy - see the sketch below.
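Here's a minimal sketch of that idea in Python - context-based word prediction, the heart of a PPM-style text model. The tiny hand-built training text is a stand-in for Google's trillion-word corpus; all the names (`train`, `predict`, the `order` parameter) are my own, not anything from the actual release.

```python
from collections import defaultdict

# Maps a context (tuple of preceding words) to counts of following words.
ngram_counts = defaultdict(lambda: defaultdict(int))

def train(words, order=4):
    # Count how often each word follows each (order)-word context,
    # i.e., build 5-gram statistics when order=4.
    for i in range(order, len(words)):
        context = tuple(words[i - order:i])
        ngram_counts[context][words[i]] += 1

def predict(context):
    # Return candidate next words ranked by observed frequency.
    # A compressor would feed these counts to an arithmetic coder,
    # assigning short codes to the likeliest continuations.
    candidates = ngram_counts[tuple(context)]
    return sorted(candidates, key=candidates.get, reverse=True)

text = "the cat sat on the mat and the cat sat on the rug".split()
train(text)
print(predict(["cat", "sat", "on", "the"]))  # ['mat', 'rug']
```

Google's database is the same idea at scale: instead of a toy training pass, you'd look contexts up in the trillion-word counts directly.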
Of course, raw n-grams are good, but not everything. Claude Shannon famously used human prediction to estimate that English text could be encoded using roughly one bit per character, and getting there is going to take a lot more than a few gigabytes of statistics - it's going to take semantics as well. Which means computers are going to have to actually develop some understanding of the text they are processing.
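To make "one bit per character" concrete, here's a quick back-of-the-envelope calculation: the zero-order entropy of a text, H = -sum(p * log2 p) over its character frequencies. For English this comes out around 4 bits per character - already half of raw 8-bit ASCII, but a long way from Shannon's figure, which is the gap statistics alone have to close. The sample string below is just illustrative.

```python
import math
from collections import Counter

# Zero-order (single-character) entropy of a sample text, in bits/char.
# This ignores all context; n-gram models do better, and Shannon's
# ~1 bit/char estimate came from human prediction using semantics.
sample = "the quick brown fox jumps over the lazy dog " * 100

counts = Counter(sample)
total = len(sample)
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())

print(f"Zero-order entropy: {entropy:.2f} bits/char (vs 8 bits raw ASCII)")
```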
No problem, I'm sure the Google Research people are hard at work on that as well.