In The Random Compression Challenge Turns Ten I challenged programmers to compress a notoriously random file. A winning entry consists of a program plus a data file that, when executed, creates a perfect copy of AMillionRandomDigits.bin. If the combined size of the program and data is less than the 415,241 bytes in that file, you will have demonstrated that you can compress it.

To date, no winners.

Why the Fine Print

One of the things that I have to insist on in the competition is that you can’t hide data in the filesystem. This can be done in various ways, and in some cases, programmers don’t even realize they are doing it. In a nutshell, to prevent this, a winning entry with a data file should be able to run like this:

cat data.bin | decompressor > output.bin

I recently had an entry from someone who compressed the data into five files. His decompressor didn’t care what the names of those five files were, but they had to be found in a specific order. He indicated that this was because each of the five files had a somewhat different algorithm applied to it.

So whether it is used deliberately or not, this scheme hides at least 16 bytes of data: each of the first four files has a specific length, and the decompressor depends on it - the length tells the decompressor when to stop and switch to the next algorithm. If the data were read from a stream, with no file system, it would have to look like this:

file-1 length | file-1 data | file-2 length | file-2 data ... | file-5 data

Each of those lengths is four bytes long. (I’m not counting the length of file 5, because we take it as a given that a decompressor operating on a single file knows where that file ends.)
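
To make that concrete, here is a minimal sketch of the packer such an entry would need. This is my own illustration, not the entrant’s code; the big-endian byte order and the stdin/stdout plumbing are assumptions:

#!/usr/bin/env python3
# Sketch only: pack several files into one length-prefixed stream.
# Assumptions (mine): 4-byte big-endian lengths, and the last file
# carries no length because end-of-stream marks its end.
import struct
import sys

def pack(paths):
    out = sys.stdout.buffer
    for path in paths[:-1]:
        data = open(path, "rb").read()
        out.write(struct.pack(">I", len(data)))  # explicit 4-byte length
        out.write(data)
    out.write(open(paths[-1], "rb").read())      # final file: data only

if __name__ == "__main__":
    pack(sys.argv[1:])  # e.g. ./pack.py f1 f2 f3 f4 f5 > data.bin

Once the four lengths are in the stream, they count against the size of the entry - exactly as they should.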

A Demonstration

Here’s a very simple algorithm that would win the competition if it weren’t violating that rule. I created a short Python script that reads the random file and splits it every time it finds a zero byte. When each piece is written out, the trailing zero is deleted:

#!/usr/bin/env python3

import sys

# Read the entire file as bytes from stdin.
s = sys.stdin.buffer.read()
last = len(s) - 1
begin = 0
end = 0
file_no = 1000  # start at 1000 so every name has four digits
while end != last:
    if s[end] == 0:
        print(end)  # progress: offset of the zero being deleted
        f = open(str(file_no) + ".bin", "wb")
        f.write(s[begin:end])  # the piece, minus its trailing zero
        f.close()
        file_no = file_no + 1
        begin = end + 1
    end = end + 1
# Everything after the last split goes to last.bin, untouched.
f = open("last.bin", "wb")
f.write(s[begin:end + 1])
f.close()

I execute this script on the random file:

cat AMillionRandomDigits.bin | ./compress.py

The result is 1,641 .bin files with four-character names - 1,640 numbered pieces plus last.bin - which need to be processed in the same order they were created. The total size of the files?

$ ls -l ????.bin | tr -s ' ' | cut -f 5 -d\  | awk '{s+=$1} END {print s}'
413601

I’ve saved 1,640 bytes through this operation (415,241 - 413,601). Can I now decompress and get an exact copy? Here’s the script:

#!/usr/bin/env bash
echo -n > output.bin
for f in ????.bin
do
 echo -ne $f'\r'  >&2
 cat $f >> output.bin
 if [ $f != "last.bin" ]
 then
  echo -ne "\x00" >> output.bin
 fi
done
echo >&2
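
It’s easy to check that the round trip really is lossless - cmp prints nothing at all when the two files are byte-for-byte identical:

cmp output.bin AMillionRandomDigits.bin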

This 187-byte script creates output.bin, which is an identical copy of the random file. The pieces total 413,601 bytes, so the whole entry comes to 413,788 bytes - 1,453 bytes less than the original 415,241. I’ve compressed the file by 1,453 bytes! What information did I manage to conceal?

In this case, the answer is pretty simple. I need to restore 1,640 bytes, each with a value of 0, to the output file. Where each one goes is determined by the lengths of the first 1,640 .bin files - and those lengths are hidden in the filesystem.
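
To see why this can’t be patched up, imagine the same scheme with no filesystem: the lengths would have to travel in the data stream itself. Here is a sketch of such a decompressor. The stream format is hypothetical - my own two-byte length fields, not anything from a real entry - but the accounting is the point: each deleted zero saves one byte, while even a two-byte length field costs two, so 1,640 splits would add 3,280 bytes of lengths to save 1,640 bytes of zeros.

#!/usr/bin/env python3
# Sketch only: a filesystem-free decompressor for the zero-splitting
# scheme. Hypothetical stream format (my assumption, not a real entry):
#   repeat: 2-byte big-endian piece length, then the piece bytes;
#   a length of 0xFFFF marks the final piece, which runs to EOF.
# (Pieces are assumed to be shorter than 0xFFFF bytes.)
import struct
import sys

inp = sys.stdin.buffer
out = sys.stdout.buffer
while True:
    (length,) = struct.unpack(">H", inp.read(2))
    if length == 0xFFFF:         # sentinel: final piece, no zero follows
        out.write(inp.read())
        break
    out.write(inp.read(length))  # the piece, minus its zero
    out.write(b"\x00")           # put the deleted zero back

The stream version is honest about what it costs, and it loses.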

Avoiding the Problem

Unfortunately, it isn’t necessarily easy to determine that a program is using filesystem data - innocently or not. As a result, the only way I can really accept an entry is if it reads its input from stdin, or perhaps from a single file with a specific name.

If you’re creating an entry, and you find that you need some sort of specific file structure, you probably need to analyze your algorithm a bit to see how much leverage you are getting from the filesystem. Odds are, it’s going to be enough to keep you from beating the challenge.