Note: As of 9 October 2012 the compression challenge has been updated. See the new post for details. The comment stream on this post is closed due to excessive BMI.
The Folly of Infinite Compression is one that crops up all too often in the world of Data Compression. The man-hours wasted on comp.compression arguing about magic compressors is now approaching that of the Manhattan Project.
In a nutshell, a magic compressor would be one that violates the Abraham Lincoln’s little-known compression invariant: “You can’t compress all of the files all of the time”. It’s trivial to prove, and I won’t do it here, that no single compressor can losslessly and reversibly compress every file.
The easiest way to foil most compressors is with a so-called “Random File”. In the world of Information Theory, Randomness is a tricky thing to define, but I like to fall back on Kolmogorov complexity , where we define the complexity or randomness of a file as the Komogorov complexity w/r/t a Turing Machine.
Several years ago I posted a challenge on comp.compression that I hoped could be used to silence any proponents of Magic Compression, and I’m reposting it here so I have a permanent location I can point to.
How does it work? I took a well-known file of random digits created by the RAND group (no doubt at taxpayer expense), and converted that file into binary format, which squeezed out all the representational fluff.
The result was AMillionRandomDigits.bin , 415,241 bytes of incompressible, dense data.
The challenge is to create a decompressor + data file that when executed together create a copy of this file.
The size of the decompressor and data file added together must be less than 415,241 bytes.
So far nobody has been able to do this. I wouldn’t have been surprised to see somebody get a byte or two, but that hasn’t happened.
The only real rule here is that we have to negotiate exactly how to measure the size of the program plus data file. I’m willing to exclude lots of stuff. For example, if you wrote the program in C++, how would we measure the size of the program?
In this case, I would measure it as length of the source file, after passing through a resaonable compactor. The run time libraries and operating system libraries could be reasonably excluded. But that’s just one example, the rest are all subject to public interpretation.
Really, the only rule we need is that the executable program can’t have free access to a copy of the million random digits file. For example, you can’t create a new programming language called JG++ that includes a copy of this file in its runtime library. You can’t hide the digits in file names. And so on.
As long as those rules are obeyed, the challenge will never be met.
This challenge has a special place for the variant of Magic Compressors known as Recursive Compressors. Some savants will claim that they have a compressor that can compress any file by a very small amount, say 1%. The beauty of this is that of course they can repeately use the output of one cycle as the input to another, compressing the file down to any size they wish.
The obvious absurdity to this is that if we compress every file to a single bit, it’s going to be kind of hard to represent more than two files using this algorithm. So most people in this subspecialty will claim that their process peters out around some fixed lower limit, say 512 bytes.
For those people, a working program should be able to meet the challenge quite easily. If their compressed data file is a mere 512 bytes, that leaves 400K of space for a decompressor that can be called repeatedly until the output is complete.
The first person to achieve the challenge, while staying within the spirit of the challenge, will receive a prize of $100.