Matt Mahoney teaches CS at Florida Tech, and is a big-time data compression guy. He’s also a voice of reason on the somewhat fractious comp.compression newsgroup, who can be counted on to resist the urge to smack down posters who provoke invective from those with less amenable dispositions.
Matt’s labor of love is the PAQ family of data compressors, which can match up with just about anything out there when it comes to compression ratios, in some measure due (IMO) to an inelegant but very effective strategy of adapting the compressor to the input stream type.
So at some time in the last year or so, Matt must have been struck with the idea that what the world needed was a really great text compression corpus. Yes, we have the Calgary Corpus, but it’s showing its age. Maybe the biggest problem with the English text portions of the Calgary Corpus is that they are just too small and lack variety. Matt set out to correct this.
So if you are looking for variety and size, and need your data to be usable under a reasonable license, a great source is the venerable Wikipedia, and that is where Matt went. Using a process described here, he snagged 1 GB of XML-encoded English text, cleaned it up a bit, and called the result the Large Text Compression Benchmark.
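Just to give the flavor of that kind of cleanup (this is a toy sketch of my own, not Matt’s actual pipeline — his real process is described on the benchmark page, and the sample markup below is invented for illustration):

```python
import re

# A tiny invented stand-in for a Wikipedia XML dump fragment.
SAMPLE_DUMP = """<page>
  <title>Data compression</title>
  <text>In [[signal processing]], '''data compression''' reduces size.</text>
</page>"""

def clean_wikitext(xml: str) -> str:
    """Pull article text out of <text> elements and strip some simple wiki markup."""
    # Grab the body of every <text> element (re.S lets '.' span newlines).
    body = " ".join(re.findall(r"<text>(.*?)</text>", xml, re.S))
    # Turn [[link]] and [[link|label]] into plain text.
    body = re.sub(r"\[\[([^|\]]*\|)?([^\]]*)\]\]", r"\2", body)
    # Drop bold/italic quote markup.
    body = body.replace("'''", "").replace("''", "")
    return body.strip()

print(clean_wikitext(SAMPLE_DUMP))
```

The real benchmark data keeps much more of the raw markup than this, of course — part of the point is that the compressor has to cope with it.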
Some might cry foul, but Matt’s tests against the benchmark show PAQ8H right there at the top of the list. If you don’t like it, I guess you have a few choices:
- Point out the flaws in the Large Text Compression Benchmark that allow it to skew towards PAQ
- Create your own data set that puts your compressor on top
- Post flames against Matt on comp.compression
So far, I haven’t seen any activity of this sort; the public comments have been mostly positive, with even a little bit of fawning thrown in.
It’s going to be a bit of a bummer if you are trying to test an exceptionally slow compressor: chewing through 1 GB of data might try your patience. But I think Matt has raised the bar a bit, and you’re just going to have to deal with it.