The data compression world is all abuzz about Marcus Hutter’s recently announced 50,000 euro prize for record-breaking data compressors. Marcus, of the Swiss Dalle Molle Institute for Artificial Intelligence, apparently in cahoots with Florida compression maven Matt Mahoney , is offering cash prizes for what amounts to the most impressive ability to compress 100 MBytes of Wikipedia data. (Note that nobody is going to win exactly 50,000 euros - the prize amount is prorated based on how well you beat the current record.)

This prize differs considerably from my Million Digit Challenge , which is really nothing more than an attempt to silence people foolishly claiming to be able to compress random data. Marcus is instead looking for the most effective way to reproduce the Wiki data, and he’s putting up real money as an incentive. The benchmark that contestants need to beat is that set by Matt Mahoney’s paq8f, the current record holder at 18.3 MB. (Alexander Ratushnyak’s submission of a paq variant looks to clock in at a tidy 17.6 MB, and should soon be confirmed as the new standard.)

So why is an AI guy inserting himself into the world of compression? Well, Marcus realizes that good data compression is all about modeling the data. The better you understand the data stream, the better you can predict the incoming tokens in a stream. Claude Shannon empirically found that humans could model English text with an entropy of 0.6 to 1.3 bits per character, which at at best should mean that 100 MB of Wikipedia data could be reduced to 7.5 MB, with an upper bound of perhaps 16.25 MB. The theory is that reaching that 7.5 MB range is going to take such a good understanding of the data stream that it will amount to a demonstration of Artificial Intelligence.

At least, that’s my take on it. When the prize was first announced on comp.compression that line of thought immediately stirred up quite a bit of controversy, with the Hutter position being represented in a post by Mahoney:

There are many arguments that compression implies AI (but not vice versa).

  1. Hutter proved that the optimal behavior of a rational agent in a computable environment (to maximize an accumulated reward signal from the environment) is to assume the environment is controlled by the shortest program consistent with all observations so far.

  2. A natural language model (the ability to compute the probabilty p(s) of any text string s in human dialog) enables one to pass the Turing test for AI.

  3. Humans exceed machines in the ability to predict successive characters or words in running text, as measured by entropy bounds (estimated from the Shannon game, or compressed size, respectively). Text prediction requires vast, real-world knowledge that machines lack.

These arguments are described in more detail here.

So part of this is predicated on the assertion that passing the Turing Test is equivalent to AI, and that opens up whole new can of worms. (Try explaining why this is true to an undergraduate class on Intro to Cognitive Science if you want to get a handle on the passion it can raise.)

Personally, I think leadership in the Hutter Prize will be gained using many of the same techniques that have been used to develop winning chess programs. The best compression programs (such as paq8) are big bags of heuristics, hacks, and code tuned for specific models. If it turns out that intelligence is coded that way, I won’t be surprised, but those hoping for an elegant model will probably be disappointed.

To keep up with ongoing developments in the Hutter Prize chase, use Google Groups to subscribe to the newsgroup. Awards will be given every time someone leapfrogs the current winner, so expect interest in the prize to stay high for the forseeable future.