Dr. Dobb's Journal

September, 1996

The world is not yet exhausted; let me see something tomorrow which I never saw before. – Samuel Johnson

In all chaos there is a cosmos, in all disorder a secret order. – Carl Jung

Introduction

In mathematics, difficult problems can often be simplified by performing a transformation on a data set. For example, digital signal processing programs use the FFT to convert sets of sampled audio data points to equivalent sets of frequency data. Pumping up the bass or applying a notch filter is then just a matter of multiplying some frequency data points by a scaling factor. Perform an inverse FFT on the resulting points, and voila, you have a new audio waveform that has been transformed according to your specifications.

Michael Burrows and David Wheeler recently released the details of a transformation function that opens the door to some revolutionary new data compression techniques. The Burrows-Wheeler Transform, or BWT, transforms a block of data into a format that is extremely well suited for compression. It does such a good job at this that even the simple demonstration programs I’ll present here will outperform state-of-the-art programs.

In this article, I’ll first present the BWT, and demonstrate its reversibility. I’ll then show you how a block of data transformed by the BWT can be compressed using standard techniques. After discussing the advantages and disadvantages of this type of data compression, I’ll present my simple demonstration program. And no article on data compression would be complete without the pro forma comparison table, showing how BWT-based compression stacks up against other modern techniques.

The Burrows-Wheeler Transform

Michael Burrows and David Wheeler released a research report in 1994 discussing work they had been doing at the Digital Systems Research Center in Palo Alto, California. Their paper, “A Block-sorting Lossless Data Compression Algorithm” presented a data compression algorithm based on a previously unpublished transformation discovered by Wheeler in 1983.

While the paper discusses a complete set of algorithms for compression and decompression, the real heart of the paper consists of the disclosure of the BWT algorithm.

BWT Basics

The BWT is an algorithm that takes a block of data and rearranges it using a sorting algorithm. The resulting output block contains exactly the same data elements that it started with, differing only in their ordering. The transformation is reversible, meaning the original ordering of the data elements can be restored with no loss of fidelity.

The BWT is performed on an entire block of data at once. Most of today’s familiar lossless compression algorithms operate in streaming mode, reading a single byte or a few bytes at a time. But with this new transform, we want to operate on the largest chunks of data possible. Since the BWT operates on data in memory, you may encounter files too big to process in one fell swoop. In these cases, the file must be split up and processed a block at a time. The demo programs that accompany this article work comfortably with block sizes of 50 Kbytes up to 250 Kbytes.
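Just to make the mechanics concrete, here is a minimal sketch of a block-at-a-time driver. The 200 Kbyte block size and the process_block() placeholder are arbitrary choices for illustration, not code taken from the demo programs:

#include <cstdio>
#include <vector>

// Process a file one fixed-size block at a time. process_block()
// stands in for the real transform stages (BWT, MTF, and so on).
const size_t BLOCK_SIZE = 200 * 1024;

void process_block( unsigned char *block, size_t length )
{
    printf( "processing a %lu byte block\n", (unsigned long) length );
}

int main( int argc, char *argv[] )
{
    if ( argc < 2 )
        return 1;
    FILE *fp = fopen( argv[ 1 ], "rb" );
    if ( fp == 0 )
        return 1;
    std::vector<unsigned char> block( BLOCK_SIZE );
    size_t count;
    while ( ( count = fread( &block[ 0 ], 1, BLOCK_SIZE, fp ) ) > 0 )
        process_block( &block[ 0 ], count );
    fclose( fp );
    return 0;
}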

This figure shows a small sample dataset, just the letters 'DRDOBBS'.
Figure 1 — A sample data set

For purposes of illustration, I’ll start with a much smaller data set, shown in Figure 1. This string contains seven bytes of data. In order to perform the BWT, the first thing we do is treat a string S, of length N, as if it actually contains N different strings, with each character in the original string being the start of a specific string that is N bytes long. (In this case, the word string doesn’t have the C/C++ semantics of being a null terminated set of characters. A string is just a collection of bytes.) We also treat the buffer as if the last character wraps around back to the first.

This figure shows the seven circular permutations of the string, each starting on a different letter.
Figure 2 — The set of strings associated with the buffer

It’s important to remember at this point that we don’t actually make N - 1 rotated copies of the input string. In the demonstration program, we just represent each of the strings by a pointer or an index into a memory buffer.

The next step in the BWT is to perform a lexicographical sort on the set of input strings. That is, we want to order the strings using a fixed comparison function, similar to the C library functions strcmp() or memcmp(). In this high-level view of the algorithm, though, the comparison function has to wrap around when it reaches the end of the buffer, so a slightly modified version is needed.
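Here is a minimal sketch of such a wraparound comparator, hard-wired to the sample string from Figure 1 for illustration; the demo programs work with indices into a large buffer, but the idea is the same:

const char *buffer = "DRDOBBS";
const int N = 7;

// Compare the rotations starting at offsets a and b, wrapping around
// to the start of the buffer instead of stopping at the end. The
// return convention matches memcmp().
int compare_rotations( int a, int b )
{
    for ( int i = 0 ; i < N ; i++ ) {
        int ca = (unsigned char) buffer[ ( a + i ) % N ];
        int cb = (unsigned char) buffer[ ( b + i ) % N ];
        if ( ca != cb )
            return ca - cb;
    }
    return 0;       // the two rotations are identical
}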

After sorting, the set of strings is arranged as shown in Figure 3. There are two important points to note in this picture. First, the strings have been sorted, but we’ve kept track of which string occupied which position in the original set. So, we know that String 0, the original unsorted string, has now moved down to row 3 in the array.

In this figure, the strings from the previous image have been sorted lexicographically. The resulting columns F and L are created from the first and last columns of the sorted matrix. F contains 'BBDDORS' and L contains 'OBRSDDB'
Figure 3 — The set of strings after sorting

Second, I’ve tagged the first and last columns in the matrix with the special designations F and L, for the first and last columns of the array. Column F contains all the characters in the original string in sorted order. So our original string “DRDOBBS” is represented in F as “BBDDORS”.

The characters in column L don’t appear to be in any particular order, but in fact they have an interesting property. Each of the characters in L is the prefix character to the string that starts in the same row in column F.

The actual output of the BWT, oddly enough, consists of two things: a copy of column L, and the primary index, an integer indicating which row of the sorted matrix contains the original first character of the input buffer. So performing the BWT on our original string generates the output string L, which contains “OBRSDDB”, and a primary index of 5.

The integer 5 is found easily enough since the original first character of the buffer will always be found in column L in the row that contains S1. Since S1 is simply S0 rotated left by a single character position, the very first character of the buffer is rotated into the last column of the matrix. Therefore, locating S1 is equivalent to locating the buffer’s first character position in L.
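In code, once the rotations have been sorted into an index array by a comparator like the one above, generating the output takes one short loop. This is only an illustrative sketch, not the demo programs’ actual output format; rows[ r ] is assumed to hold the starting offset of the rotation that sorted into row r:

#include <cstdio>

// Emit column L and report the primary index. The last character of
// the rotation in row r is the character just before its starting
// offset, wrapping around; the row holding S1 (offset 1) is the row
// whose L entry is the buffer's original first character.
void emit_bwt( const char *S, int N, const int *rows )
{
    int primary_index = -1;
    for ( int r = 0 ; r < N ; r++ ) {
        putchar( S[ ( rows[ r ] + N - 1 ) % N ] );   // column L
        if ( rows[ r ] == 1 )                        // this row holds S1
            primary_index = r;
    }
    printf( "\nprimary index = %d\n", primary_index );
}

For “DRDOBBS” the sorted offsets work out to { 4, 5, 2, 0, 3, 1, 6 }, and the loop prints “OBRSDDB” with a primary index of 5, matching Figure 3.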

Two obvious questions

At this point in the exposition, two obvious questions come up. First, how can this possibly be a reversible transformation? Generally, a sort() function doesn’t come with an unsort() partner that can restore your original ordering. In fact, it isn’t likely that you’ve ever even considered this as something you might want. And second, what possible good does this strange transformation do you?

I’ll defer the answer to the second question while I explain the reversibility of this transform. Unsorting column L requires the use of something called the transformation vector. The transformation vector is an array that defines the order in which the rotated strings are scattered throughout the rows of the matrix of Figure 3.

The transformation vector, T, is an array with one entry for each row in the matrix. If row i contains S[ j ], then T[ i ] is defined as the row where S[ j + 1 ] is found. In Figure 3, row 3 contains S0, the original input string, and row 5 contains S1, the string rotated one character to the left. Thus, T[ 3 ] contains the value 5. S2 is found in row 2, so T[ 5 ] contains a 2. For this particular matrix, the transformation vector can be calculated to be { 1, 6, 4, 5, 0, 2, 3 }.

This figure shows the transformation vector, described in the text.
Figure 4 — The transformation vector routes S[ i ] to S[ i + 1 ]

Figure 4 shows how the transformation vector is used to walk through the various rows. For any row that contains S[ i ], the vector provides the value of the row where S[ i + 1 ] is found.
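On the encoding side, where the sorted matrix is available, this definition translates directly into code. The sketch below assumes the same illustrative rows[ ] array as before, with rows[ r ] holding the offset of the rotation that sorted into row r:

#include <vector>

// Build the transformation vector from the sorted matrix. pos[ j ]
// records which row rotation S[ j ] sorted into; then, for the
// rotation in each row i, T[ i ] is the row holding its successor.
void build_T( const int *rows, int N, int *T )
{
    std::vector<int> pos( N );
    for ( int r = 0 ; r < N ; r++ )
        pos[ rows[ r ] ] = r;
    for ( int i = 0 ; i < N ; i++ )
        T[ i ] = pos[ ( rows[ i ] + 1 ) % N ];
}

The decoder, of course, never sees rows[ ], which is exactly why the rest of this section matters: T can be recovered from L alone.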

The Rosetta Vector

The reason the transformation vector is so important is that it provides the key to restoring L to its original order. Given L and the primary index, you can restore the original S0. For this example, the following code does the job:

#include <iostream>

using namespace std;

int T[] = { 1, 6, 4, 5, 0, 2, 3 };
char L[] = "OBRSDDB";
int primary_index = 5;

void decode()
{
    int index = primary_index;
    for ( int i = 0 ; i < 7 ; i++ ) {
        cout << L[ index ];     // emit this row's character from L
        index = T[ index ];     // hop to the row holding the next rotation
    }
    cout << endl;               // prints "DRDOBBS"
}

int main()
{
    decode();
    return 0;
}

You can get there from here

So now we come to the core premise of the Burrows-Wheeler transform: given a copy of L, you can calculate the transformation vector for the original input matrix. And consequently, given the primary index, you can recreate S0, or the input string.

The key that makes this possible is that calculating the transformation vector requires only that you know the contents of the first and last columns of the matrix. And believe it or not, simply having a copy of L means that you do, in fact, have a copy of F as well.

The initial state of the decoding matrix given just the L string. The first column of the matrix contains L; all other columns hold question marks, indicating that the decoding process hasn’t figured out their values yet.
Figure 5 — The known state of the matrix given L

Given just the copy of L, we don’t know much about the state of the matrix. Figure 5 shows L, which I’ve moved into the first column for purposes of illustration. In this figure, F is going to be in the next column. And fortunately for us, F has an important characteristic: it contains all of the characters from the input string in sorted order. Since L also contains all the same characters, we can determine the contents of F by simply sorting L!

The decoding matrix showing the F column in the recovered state. The L column, first, contains 'OBRSDDB', the F column, second, contains 'BBDDORS'
Figure 6 — The known state after recovering F

Now we can start working on calculating T. The character ‘O’ in row 0 clearly moves to row 4 in column F, which means T[ 4 ] = 0. But what about row 1? The ‘B’ could match up with either the ‘B’ in row 0 or row 1. Which do we select?

Fortunately, the choice here is not ambiguous, although the decision-making process may not be intuitive. Remember that, by definition, column F is sorted. This means that the strings beginning with ‘B’ in column L also appear in sorted order relative to one another. Why? They all start with the same character, and they are sorted on their second characters, by virtue of those second characters appearing in column F, which is itself sorted.

Since by definition the strings in F must appear in sorted order, it means that all the strings that start with a common character in L appear in the same order in F, although not necessarily in the same rows. Because of this, we know that the ‘B’ in row 1 of L is going to move up to row 0 in F. The ‘B’ in row 6 of L moves to row 1 of F.

Once that difficulty is cleared, it’s a simple matter to recover the transformation vector from the two columns. And once that is done, recovering the original input string is short work as well: simply applying the C++ code shown earlier does the job.
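In code, the matching argument boils down to a counting trick, which is one standard way to implement it. This sketch recovers T from L alone, never building F explicitly; the sort of L is implicit in the counting pass:

// count[ c ] starts as the number of occurrences of byte c in L,
// then becomes the row in F where the block of c's begins. Handing
// those rows out to the c's of L in order implements the stable
// matching described above: the m-th c in L pairs with the m-th c
// in F, and per the text, T[ row in F ] = row in L.
void recover_T( const unsigned char *L, int N, int *T )
{
    int count[ 256 ] = { 0 };
    for ( int i = 0 ; i < N ; i++ )
        count[ L[ i ] ]++;
    int sum = 0;
    for ( int c = 0 ; c < 256 ; c++ ) {
        int occurrences = count[ c ];
        count[ c ] = sum;               // starting row of c's block in F
        sum += occurrences;
    }
    for ( int i = 0 ; i < N ; i++ ) {
        int row_in_F = count[ L[ i ] ]++;
        T[ row_in_F ] = i;
    }
}

Running this on L = “OBRSDDB” reproduces the vector { 1, 6, 4, 5, 0, 2, 3 } that appears in the decode() routine above.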

The Punchline

The whole point of this BWT exercise is supposed to be better data compression, so let’s take a look at how this works. Figure 7 shows a snippet of debug output taken from the test program BWT.CPP. Each line of output shows a prefix character followed by a sorted string. The prefix character is actually what is found in column L after the strings have been sorted.

t: hat acts like this:<13><10><1
t: hat buffer to the constructor
t: hat corrupted the heap, or wo
W: hat goes up must come down<13
t: hat happens, it isn't  likely
w: hat if you want to dynamicall
t: hat indicates an error.<13><1
t: hat it removes arguments from
t: hat looks like this:<13><10><
t: hat looks something like this
t: hat looks something like this
t: hat once I detect the mangled
Figure 7 — Debug output from BWT.CPP

This section of output consists of a group of sorted strings that all start with the characters “hat”. Not surprisingly, the prefix character for almost all of these strings is a lower case ‘t’. There are also a couple of appearances of the letter ‘w’.

Anytime you see repetition like this, you clearly have an opportunity for data compression. Sorting all of the strings gives rise to a set of prefix characters that has lots of little clumps where characters repeat.

Using the opportunity

We could take the output of the BWT and just apply a conventional compressor to it, but Burrows and Wheeler suggest an improved approach. They recommend using a Move to Front scheme, followed by an entropy encoder.

A Move to Front (or MTF) encoder is a fairly trivial piece of work. It simply keeps all 256 possible codes in a list, initially arranged in ascending order from 0 through 255. Each time a character is to be output, you send its position in the list, then move it to the front. In the above example, column L might start with the following sequence: “tttWtwttt”. If we encode that using an MTF scheme, we would output the following integers: { 116, 0, 0, 88, 1, 119, 1, 0, 0 }.
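Here is a sketch of the encoding side. The ascending initial order of the list is what makes the first ‘t’ come out as 116, its ASCII value:

#include <cstdio>
#include <cstring>

// Move-to-front encode a buffer: emit each byte's current position
// in the list, slide the earlier entries down one slot, and put the
// byte at the front.
void mtf_encode( const unsigned char *in, int n )
{
    unsigned char order[ 256 ];
    for ( int i = 0 ; i < 256 ; i++ )
        order[ i ] = (unsigned char) i;
    for ( int i = 0 ; i < n ; i++ ) {
        int pos = 0;
        while ( order[ pos ] != in[ i ] )   // linear search for the byte
            pos++;
        printf( "%d ", pos );
        memmove( order + 1, order, pos );   // shift entries 0..pos-1 down
        order[ 0 ] = in[ i ];
    }
}

Feeding it the nine characters “tttWtwttt” prints 116 0 0 88 1 119 1 0 0, just as in the example.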

If the output of the BWT operation contains a high proportion of repeated characters, we can expect that applying the MTF encoder will give us a file filled with lots of zeros, and heavily weighted towards the lowest integers.

At that point, the file can finally be compressed using an entropy encoder, typically a Huffman or arithmetic encoder. The example programs that accompany this article use an adaptive order-0 arithmetic encoder as the final step of the process.

And the winner is…

Using the BWT as a front end for data compression has a result similar to that of simple statistical modeling programs. An order-n statistical model simply uses the previous n characters to predict the value of the next character in a string. Compression programs based on statistical modeling can give good results, at the expense of high memory requirements.

The BWT in effect uses the entire input string as its context. But instead of using a preceding run of characters to model the output, it uses the string following a character as a model. Because of the sorting method used, characters hundreds of bytes downstream can have an effect on the ordering of the BWT output.

This characteristic of the BWT has another side effect. It means that in general, the bigger the block size, the better the compression. As long as the data set is homogeneous, bigger blocks will generate longer runs of repeats, leading to improved compression. Of course, the sorting operations needed at the front end of the BWT will usually have O( N * log(N) ) performance, meaning bigger blocks might slow things down considerably. (Although Burrows and Wheeler point out that a suffix tree sort can be done in linear time and space.)

Programmers today might also be impressed by the fact that the BWT is apparently not covered by any software patents. I asked Burrows and Wheeler if any patents had been initiated by DEC, and both indicated that none had been. Since this algorithm was published in 1994, it may mean the technique is in the public domain. (Consult an attorney!)

Even if the BWT is in the public domain, programmers need to be careful when selecting an entropy encoder for the final stage of compression. Arithmetic coding offers excellent compression, but is covered by quite a few patents. Huffman coding is slightly less efficient, but seems less likely to provoke intellectual property fights.

A Reference Implementation

One of the nice things about BWT compression is that it’s easy to isolate each step of the process. I took advantage of that when writing the demo programs to accompany this article. The compression and decompression processes consist of running several programs in sequence, piping the output of one program to the input of the next.

Compressing a file using the demo programs is done using the following command line:

RLE input-file | BWT | MTF | RLE | ARI > output-file

A brief description of each of the programs follows:

RLE.CPP This program implements a simple run-length encoder. If the input file has long runs of identical characters, the sorting procedure in the BWT can be degraded dramatically. The RLE front end prevents that from happening; a sketch of one plausible run-length scheme appears after this list.
BWT.CPP The standard Burrows-Wheeler transform is done here. This program outputs repeated blocks consisting of a block size integer, a copy of L, the primary index, and a special last character index. This is repeated until BWT.EXE runs out of input data.
MTF.CPP The Move to Front encoder operates as described in the previous section.
RLE.CPP The fact that the output file is top-heavy with runs containing zeros means that applying another RLE pass to the output can improve overall compression. I believe that further processing of the MTF output will provide fertile ground for additional improvements.
ARI.CPP This is an order-0 adaptive arithmetic encoder, directly derived from the code published by Witten, Neal, and Cleary in their 1987 CACM article.
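The text above doesn’t pin down the exact run-length format, so the following is only a guess at one plausible scheme for a front end like RLE.CPP: pass bytes through unchanged, but cap every run at three literal copies followed by a single count byte for the extras. Any format that bounds run lengths this way keeps pathological input away from the sort:

#include <cstdio>

// Hypothetical run-capping encoder: after three identical bytes have
// been copied through, emit one count byte (0-255) for the extras,
// so no run longer than three characters ever reaches the sorter.
void rle_encode( FILE *in, FILE *out )
{
    int c, prev = EOF;
    int run = 0;                        // length of the current run
    while ( ( c = getc( in ) ) != EOF ) {
        if ( c == prev )
            run++;
        else {
            if ( run >= 3 )
                putc( run - 3, out );   // close out the previous run
            prev = c;
            run = 1;
        }
        if ( run <= 3 )
            putc( c, out );             // at most three literal copies
        else if ( run == 3 + 255 ) {    // count byte is full: flush, restart
            putc( 255, out );
            prev = EOF;
            run = 0;
        }
    }
    if ( run >= 3 )
        putc( run - 3, out );           // final pending count
}

The matching decoder simply copies bytes through, and whenever it has seen three identical bytes in a row, reads one more byte as a count of extras to expand.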

The output of the compressor can be decompressed by applying the same algorithms in reverse. The command line that accomplishes this is:

UNARI input-file | UNRLE | UNMTF | UNBWT | UNRLE > output-file

Each of the programs here has a one-to-one correspondence with its namesake from the compression section.

The source files that make up this program suite are all fairly short, and their inner workings should be fairly clear. The most complicated programs are the arithmetic encoder and decoder, which are well documented elsewhere.

Results

I used the standard Calgary Corpus to benchmark my version of BWT compression. The corpus is a set of 18 files with varying characteristics that can be found at:

http://corpus.canterbury.ac.nz/resources/calgary.zip

For purposes of comparison I show the raw size of the file, the compressed sizes using the BWT compressor, and the compressed sizes using PKZIP with the default compression level selected.

File Name    Raw Size   PKZIP Size   PKZIP Bits/Byte   BWT Size   BWT Bits/Byte
bib           111,261       35,821              2.58     29,567            2.13
book1         768,771      315,999              3.29    275,831            2.87
book2         610,856      209,061              2.74    186,592            2.44
geo           102,400       68,917              5.38     62,120            4.85
news          377,109      146,010              3.10    134,174            2.85
obj1           21,504       10,311              3.84     10,857            4.04
obj2          246,814       81,846              2.65     81,948            2.66
paper1         53,161       18,624              2.80     17,724            2.67
paper2         82,199       29,795              2.90     26,956            2.62
paper3         46,526       18,106              3.11     16,995            2.92
paper4         13,286        5,509              3.32      5,529            3.33
paper5         11,954        4,962              3.32      5,136            3.44
paper6         38,105       13,331              2.80     13,159            2.76
pic           513,216       54,188              0.84     50,829            0.79
progc          39,611       13,340              2.69     13,312            2.69
progl          71,646       16,227              1.81     16,688            1.86
progp          49,379       11,248              1.82     11,404            1.85
trans          93,695       19,691              1.68     19,301            1.65
total:      3,251,493    1,072,986              2.64    978,122            2.41
Table 1 — Mandatory Comparison of Results

As Table 1 shows, this demonstration program manages to compete pretty well with the compression offered by PKWare’s commercial product, PKZip. I consider this particularly encouraging in light of the fact that there should be plenty of headroom left for tweaking additional percentage points from my code.

Acknowledgments

I had some excellent help from a few people who reviewed this article and the source code. My thanks go out to Nico de Vries, Leo Broukhis, Jean-loup Gailly, David Wheeler, and Robert Jung.

I also have to thank Quincey Koziol and Philipp Knirsch for the help they gave me with UNIX testing. Without their assistance, the source code for this article would have only been usable under MS-DOS and Windows.

Bibliography and Resources

You can find resources, additional source code, and commentary regarding this compression technique on my personal web page at https://marknelson.us, by going to the categories cloud and selecting the appropriate tags.

Burrows, M. and Wheeler, D.J. (1994) “A Block-sorting Lossless Data Compression Algorithm”, Digital Systems Research Center Research Report 124.

Witten, I.H., Neal, R., and Cleary, J.G. (1987) “Arithmetic coding for data compression,” Communications of the ACM, 30(6), 520-540, June.

Nelson, Mark and Gailly, Jean-loup (1995) The Data Compression Book, second edition, M&T Books, New York.

Source Code

rle.cpp The run length encoder program.
bwt.cpp Burrows-Wheeler transform code, implemented using the traditional C function `qsort()`.
bwta.cpp Burrows-Wheeler transform code, implemented using the STL.
mtf.cpp A Move To Front encoder.
ari.cpp An order-0 adaptive arithmetic encoder, using the code from the 1987 CACM article by Witten, et al.
unrle.cpp Run Length decoder, undoes the encoding performed by rle.cpp.
unbwt.cpp Inverse Burrows-Wheeler transform, using the format used in bwt.cpp and bwta.cpp.
unmtf.cpp The Move to Front decoder, companion to mtf.cpp.
unari.cpp An order-0 adaptive arithmetic decoder, using the code from the 1987 CACM article by Witten, et al.
readme.txt Build instructions, short and sweet! Please read!
bwtcode.zip The whole package in a ZIP file.

Updated Notes

Manzini, Giovanni (1999) “The Burrows-Wheeler Transform: Theory and Practice”.