Dr. Dobb's Journal
September, 1996 
The world is not yet exhausted; let me see something tomorrow which I never saw before.  Samuel Johnson
In all chaos there is a cosmos, in all disorder a secret order. Carl Jung
Introduction
In mathematics, difficult problems can often be simplified by performing a transformation on a data set. For example, digital signal processing programs use the FFT to convert sets of sampled audio data points to equivalent sets of frequency data. Pumping up the bass or applying a notch filter is then just a matter of multiplying some frequency data points by a scaling factor. Perform an inverse FFT on the resulting points, and voila, you have a new audio waveform that has been transformed according to your specifications.
Michael Burrows and David Wheeler recently released the details of a transformation function that opens the door to some revolutionary new data compression techniques. The BurrowsWheeler Transform, or BWT, transforms a block of data into a format that is extremely well suited for compression. It does such a good job at this that even the simple demonstration programs I'll present here will outperform state of the art programs.
In this article, I'll first present the BWT, and demonstrate its reversibility. I'll then show you how a block of data transformed by the BWT can be compressed using standard techniques. After discussing the advantages and disadvantages of this type of data compression, I'll present my simple demonstration program. And no article on data compression would be complete without the pro forma comparison table, showing how BWTbased compression stacks up against other modern techniques.
The BurrowsWheeler Transform
Michael Burrows and David Wheeler released a research report in 1994 discussing work they had been doing at the Digital Systems Research Center in Palo Alto, California. Their paper, "A Blocksorting Lossless Data Compression Algorithm" presented a data compression algorithm based on a previously unpublished transformation discovered by Wheeler in 1983.
While the paper discusses a complete set of algorithms for compression and decompression, the real heart of the paper consists of the disclosure of the BWT algorithm.
BWT Basics
The BWT is an algorithm that takes a block of data and rearranges it using a sorting algorithm. The resulting output block contains exactly the same data elements that it started with, differing only in their ordering. The transformation is reversible, meaning the original ordering of the data elements can be restored with no loss of fidelity.
The BWT is performed on an entire block of data at once. Most of today's familiar lossless compression algorithms operate in streaming mode, reading a single byte or a few bytes at a time. But with this new transform, we want to operate on the largest chunks of data possible. Since the BWT operates on data in memory, you may encounter files too big to process in one fell swoop. In these cases, the file must be split up and processed a block at a time. The demo programs that accompany this article work comfortably with block sizes of 50Kbytes up to 250 Kbytes.
Figure 1
An sample data set
For purposes of illustration, I'll start with a much smaller data set, shown in Figure 1. This string contains seven bytes of data. In order to perform the BWT, the first thing we do is treat a string S, of length N, as if it actually contains N different strings, with each character in the original string being the start of a specific string that is N bytes long. (In this case, the word string doesn't have the C/C++ semantics of being a null terminated set of characters. A string is just a collection of bytes.) We also treat the buffer as if the last character wraps around back to the first.
Figure 2
The set of strings associated with the buffer
It's important to remember at this point that we don't actually make N 1 rotated copies of the input string. In the demonstration program, we just represent each of the strings by a pointer or an index into a memory buffer.
The next step in the BWT is to perform a lexicographical sort on the set of input strings. That is, we want to order the strings using a fixed comparison function. This means we should compare using a function similar to the C functions strcmp() or memcmp(). In this high level view of the algorithm the comparison function has to be able to wrap around when it reaches the end of the buffer, so a slightly modified comparison function would be needed.
After sorting, the set of strings is arranged as shown in Figure 3. There are two important points to note in this picture. First, the strings have been sorted, but we've kept track of which string occupied which position in the original set. So, we know that the String 0, the original unsorted string, has now moved down to row 4 in the array.
Figure 3
The set of strings after sorting
Second, I've tagged the first and last columns in the matrix with the special designations F and L, for the first and last columns of the array. Column F contains all the characters in the original string in sorted order. So our original string "DRDOBBS" is represented in F as "BBDDORS".
The characters in column L don't appear to be in any particular order, but in fact they have an interesting property. Each of the characters in L is the prefix character to the string that starts in the same row in column F.
The actual output of the BWT, oddly enough, consists of two things: a copy of column L, and the primary index, an integer indicating which row contains the original first character of the buffer B. So performing the BWT on our original string generates the output string L which contains "OBRSDDB", and a primary index of 5.
The integer 5 is found easily enough since the original first character of the buffer will always be found in column L in the row that contains S1. Since S1 is simply S0 rotated left by a single character position, the very first character of the buffer is rotated into the last column of the matrix. Therefore, locating S1 is equivalent to locating the buffer's first character position in L.
Two obvious questions
At this point in the exposition, there are two obvious questions. First, it doesn't seem possible that this is a reversible transformation. Generally, a sort() function doesn't come with an unsort() partner that can restore your original ordering. In fact, it isn't likely that you've ever even considered this as something you might like. And second, what possible good does this strange transformation do you?
I'll defer the answer to the second question while I explain the reversibility of this transform. Unsorting column L requires the use of something called the transformation vector. The transformation vector is an array that defines the order in which the rotated strings are scattered throughout the rows of the matrix of Figure 3.
The transformation vector, T, is an array with one index for each row in column F. For a given row i, T[ i ] is defined as the row where S[ i + 1 ] is found. In Figure 3, row 3 contains S0, the original input string, and row 5 contains S1, the string rotated one character to the left. Thus, T[ 3 ] contains the value 5. S2 is found in row 2, so T[ 5 ] contains a 2. For this particular matrix, the transformation vector can be calculated to be { 1, 6, 4, 5, 0, 2, 3 }.
Figure 4
The transformation vector routes S[ i ] to S[ i + 1]
Figure 4 shows how the transformation vector is used to walk through the various rows. For any row that contains S[ i ], the vector provides the value of the row where S[ i + 1 ] is found.
The Rosetta Vector
The reason the transformation vector is so important is that it provides the key to restoring L to its original order. Given L and the primary index, you can restore the original S0. For this example, the following code does the job:

int T[] = { 1, 6, 4, 5, 0, 2, 3 };

char L[] = "OBRSDDB";

int primary_index = 5;


void decode()

{

int index = primary_index;

for ( int i = 0 ; i <7 ; i++ ) {

cout <<L[ index ];

index = T[ index ];

}

}
You can get there from here
So now we come to the core premise of the BurrowsWheeler transform: given a copy of L, you can calculate the transformation vector for the original input matrix. And consequently, given the primary index, you can recreate S0, or the input string.
The key that makes this possible is that calculating the transformation vector requires only that you know the contents of the first and last columns of the matrix. And believe it or not, simply having a copy of L means that you do, in fact, have a copy of F as well.
Figure 5
The known state of the matrix given L
Given just the copy of L, we don't know much about the state of the matrix. Figure 5 shows L, which I've moved into the first column for purposes of illustration. In this figure, F is going to be in the next column. And fortunately for us, F has an important characteristic: it contains all of the characters from the input string in sorted order. Since L also contains all the same characters, we can determine the contents of F by simply sorting L!
Figure 6
The known state after recovering F
Now we can start working on calculating T. The character 'O' in row 0 clearly moves to row 4 in column F, which means T[ 4 ] = 0. But what about row 1? The 'B' could match up with either the 'B' in row 0 or row 1. Which do we select?
Fortunately, the choice here is not ambiguous, although the decision making process may not be intuitive. Remember that by definition, column F is sorted. This means that all the strings beginning with 'B' in column L also appear in sorted order. Why? They all start with the same character, and they are sorted on their second character, by virtue of their second characters appearing in column F.
Since by definition the strings in F must appear in sorted order, it means that all the strings that start with a common character in L appear in the same order in F, although not necessarily in the same rows. Because of this, we know that the 'B' in row 1 of L is going to move up to row 0 in F. The 'B' in row 6 of L moves to row 1 of F.
Once that difficulty is cleared, it's a simple matter to recover the transformation matrix from the two columns. And once that is done, recovering the original input string is short work as well. Simply applying the C++ code shown earlier does the job.
The Punchline
The whole point to this BWT exercise is supposed be better data compression, so let's take a look at how this works. Figure 7 shows a snippet of debug output taken from the test program BWT.CPP. Each line of output shows a prefix character followed by a sorted string. The prefix character is actually what is found in column L after the strings have been sorted.

t: hat acts like this:<13><10><1

t: hat buffer to the constructor

t: hat corrupted the heap, or wo

W: hat goes up must come down<13

t: hat happens, it isn't likely

w: hat if you want to dynamicall

t: hat indicates an error.<13><1

t: hat it removes arguments from

t: hat looks like this:<13><10><

t: hat looks something like this

t: hat looks something like this

t: hat once I detect the mangled
Figure 7
Debug output from BWT.CPP
This section of output consists of a group of sorted strings that all start with the characters "hat". Not surprisingly, the prefix character for almost all of these strings is a lower case 't'. There are also a couple of appearances of the letter 'w'.
Anytime you see repetition like this, you clearly have an opportunity for data compression. Sorting all of the strings gives rise to a set of prefix characters that has lots of little clumps where characters repeat.
Using the opportunity
We could take the output of the BWT and just apply a conventional compressor to it, but Burrows and Wheeler suggest an improved approach. They recommend using a Move to Front scheme, followed by an entropy encoder.
A Move to Front (or MTF) encoder is a fairly trivial piece of work. It simply keeps all 256 possible codes in a list. Each time a character is to be output, you send its position in the list, then move it to the front. In the above example, column L might start with the following sequence: "tttWtwttt". If we encode that using an MTF scheme, we would output the following integers: { 116, 0, 0, 88, 1, 119, 1, 0, 0 }.
If the output of the BWT operation contains a high proportion of repeated characters, we can expect that applying the MTF encoder will give us a file filled with lots of zeros, and heavily weighted towards the lowest integers.
At that point, the file can finally be compressed using an entropy encoder, typically a Huffman or arithmetic encoder. The example programs that accompany this article use an adaptive order0 arithmetic encoder as the final step of the process.
And the winner is…
Using the BWT as a front end for data compression has a result similar to that of simple statistical modeling programs. An ordern statistical model simply uses the previous n characters to predict the value of the next character in a string. Compression programs based on statistical modeling can give good results, at the expense of high memory requirements.
The BWT in effect uses the entire input string as its context. But instead of using a preceding run of characters to model the output, it uses the string following a character as a model. Because of the sorting method used, characters hundreds of bytes downstream can have an effect on the ordering of the BWT output.
This characteristic of the BWT has another side effect. It means that in general, the bigger the block size, the better the compression. As long as the data set is homogenous, bigger blocks will generate longer runs of repeats, leading to improved compression. Of course, the sorting operations needed at the front end of the BWT will usually have O( N * log(N) ) performance, meaning bigger blocks might slow things down considerably. (Although Burrows and Wheeler point out that a suffix tree sort can be done in linear time and space.)
Programmers today might also be impressed by the fact that the BWT is apparently not covered by any software patents. I asked Burrows and Wheeler if any patents had been initiated by DEC, and both indicated not. Since this algorithm was published in 1994, it may mean the technique is in the public domain. (Consult an attorney!)
Even if the BWT is in the public domain, programmers need to be careful when selecting an entropy encoder for the final stage of compression. Arithmetic coding offers excellent compression, but is covered by quite a few patents. Huffman coding is slightly less efficient, but seems less likely to provoke intellectual property fights.
A Reference Implementation
One of the nice things about BWT compression is that it's easy to isolate each step of the process. I took advantage of that when writing the demo programs to accompany this article. The compression and decompression processes will consist of running several programs in sequence, piping the output of one program to the input of another, and repeating until the process is complete.
Compressing a file using the demo programs is done using the following command line:

RLE inputfile  BWT  MTF  RLE  ARI> outputfile
A brief description of each of the programs follows:
RLE.CPP  This program implements a simple runlength encoder. If the input file has many long runs of identical characters, the sorting procedure in the BWT can be degraded dramatically. The RLE front end prevents that from happening. 
BWT.CPP  The standard BurrowsWheeler transform is done here. This program outputs repeated blocks consisting of a block size integer, a copy of L, the primary index, and a special last character index. This is repeated until BWT.EXE runs out of input data. 
MTF.CPP  The Move to Front encoder operates as described in the previous section. 
RLE.CPP  The fact that the output file is topheavy with runs containing zeros means that applying another RLE pass to the output can improve overall compression. I believe that further processing of the MTF output will provide fertile ground for additional improvements. 
ARI.CPP  This is an order0 adaptive arithmetic encoder, directly derived from the code published by Witten and Cleary in their 1987 CACM article. 
The output of the compressor can be decompressed by applying the same algorithms in reverse. The command line that accomplishes this is:

UNARI inputfile  UNRLE  UNMTF  UNBWT  UNRLE> outputfile
Each of the programs here has a onetoone correspondence with its namesake from the compression section.
The 10 source files that make up this program suite are all fairly short, and their inner workings should be fairly clear. The most complicated programs are the arithmetic encoder and decoder, which are well documented elsewhere.
Results
I used the standard Calgary Corpus to benchmark my version of BWT compression. The corpus is a set of 18 files with varying characteristics that can be found via anonymous ftp at:
ftp.cpsc.ucalgary.ca/pub/projects/text.compression.corpus
For purposes of comparison I show the raw size of the file, the compressed sizes using the BWT compressor, and the compressed sizes using PKZIP with the default compression level selected.
File Name  Raw Size  PKZIP Size  PKZIP Bits/Byte  BWT Size  BWT Bits/Byte 
bib  111,261  35,821  2.58  29,567  2.13 
book1  768,771  315,999  3.29  275,831  2.87 
book2  610,856  209,061  2.74  186,592  2.44 
geo  102,400  68,917  5.38  62,120  4.85 
news  377,109  146,010  3.10  134,174  2.85 
obj1  21,504  10,311  3.84  10,857  4.04 
obj2  246,814  81,846  2.65  81,948  2.66 
paper1  53,161  18,624  2.80  17,724  2.67 
paper2  82,199  29,795  2.90  26,956  2.62 
paper3  46,526  18,106  3.11  16,995  2.92 
paper4  13,286  5,509  3.32  5,529  3.33 
paper5  11,954  4,962  3.32  5,136  3.44 
paper6  38,105  13,331  2.80  13,159  2.76 
pic  513,216  54,188  0.84  50,829  0.79 
progc  39,611  13,340  2.69  13,312  2.69 
progl  71,646  16,227  1.81  16,688  1.86 
progp  49,379  11,248  1.82  11,404  1.85 
trans  93,695  19,691  1.68  19,301  1.65 
total:  3,251,493  1,072,986  2.64  978,122  2.41 
Table 1
Mandatory Comparison of Results
As Table 1 shows, this demonstration program manages to compete pretty well with the compression offered by PKWare's commercial product, PKZip. I consider this particularly encouraging in light of the fact that there should be plenty of headroom left for tweaking additional percentage points from my code.
Acknowledgments
I had some excellent help from a few people who reviewed this article and the source code. My thanks go out to Nico de Vries, Leo Broukhis, Jeanloup Gailly, David Wheeler, and Robert Jung.
I also have to thank Quincey Koziol and Philipp Knirsch for the help they gave me with UNIX testing. Without their assistance, the source code for this article would have only been usable under MSDOS and Windows.
Bibliography and Resources
You can find resources, additional source code, and commentary regarding this compression on my personal web page at http://www.dogma.net/markn, under the "Magazine Articles" section.
Burrows, M. and Wheeler, D.J. (1994) "A Blocksorting Lossless Data Compression Algorithm", Digital Systems Research Center Research Report 124,
http://gatekeeper.dec.com/pub/DEC/SRC/researchreports/abstracts/srcrr124.html.
Witten, I.H., Neal, R., and Cleary, J.G., (1987b) "Arithmetic coding for data compression," Communications of the Association for Computing Machinery,30 (6) 520540, June.
Nelson, Mark and Gailly, JeanLoup, (1995) The Data Compression Book, second edition, M&TBooks, New York.
Source Code
rle.cpp  The run length encoder program. 
bwt.cpp  BurrowsWheeler transform code, implemented using the traditional C function qsort(). 
bwta.cpp  BurrowsWheeler transform code, implemented using the STL. 
mtf.cpp  A Move To Front encoder. 
ari.cpp  An order0 adaptive arithmetic encoder, using the code from the 1987 CACM article by Witten, et.al. 
unrle.cpp  Run Length decoder, undoes the encoding performed by rle.cpp. 
unbwt.cpp  Inverse BurrowsWheeler transform, using the format used in bwt.cpp and bwta.cpp. 
unmtf.cpp  The Move to Front decoder, companion to mtf.cpp. 
unari.cpp  An order0 adaptive arithmetic decoder, using the code from the 1987 CACM article by Witten, et.al. 
readme.txt  Build instructions, short and sweet! Please read! 
bwtcode.zip  The whole package in a ZIP file. 
Updated Notes
The BurrowsWheeler Transform: Theory and Practice (1999) Giovanni Manzini