|
Dr. Dobb's Journal
September, 1996 |
The world is not yet exhausted; let me see something tomorrow which I never saw before. -- Samuel Johnson
In all chaos there is a cosmos, in all disorder a secret order.-- Carl Jung
Introduction
In mathematics, difficult problems can often be simplified by performing a transformation on a data set. For example, digital signal processing programs use the FFT to convert sets of sampled audio data points to equivalent sets of frequency data. Pumping up the bass or applying a notch filter is then just a matter of multiplying some frequency data points by a scaling factor. Perform an inverse FFT on the resulting points, and voila, you have a new audio waveform that has been transformed according to your specifications.
Michael Burrows and David Wheeler recently released the details of a transformation function that opens the door to some revolutionary new data compression techniques. The Burrows-Wheeler Transform, or BWT, transforms a block of data into a format that is extremely well suited for compression. It does such a good job at this that even the simple demonstration programs I'll present here will outperform state of the art programs.
In this article, I'll first present the BWT, and demonstrate its reversibility. I'll then show you how a block of data transformed by the BWT can be compressed using standard techniques. After discussing the advantages and disadvantages of this type of data compression, I'll present my simple demonstration program. And no article on data compression would be complete without the pro forma comparison table, showing how BWT-based compression stacks up against other modern techniques.
The Burrows-Wheeler Transform
Michael Burrows and David Wheeler released a research report in 1994 discussing work they had been doing at the Digital Systems Research Center in Palo Alto, California. Their paper, "A Block-sorting Lossless Data Compression Algorithm" presented a data compression algorithm based on a previously unpublished transformation discovered by Wheeler in 1983.
While the paper discusses a complete set of algorithms for compression and decompression, the real heart of the paper consists of the disclosure of the BWT algorithm.
BWT Basics
The BWT is an algorithm that takes a block of data and rearranges it using a sorting algorithm. The resulting output block contains exactly the same data elements that it started with, differing only in their ordering. The transformation is reversible, meaning the original ordering of the data elements can be restored with no loss of fidelity.
The BWT is performed on an entire block of data at once. Most of today's familiar lossless compression algorithms operate in streaming mode, reading a single byte or a few bytes at a time. But with this new transform, we want to operate on the largest chunks of data possible. Since the BWT operates on data in memory, you may encounter files too big to process in one fell swoop. In these cases, the file must be split up and processed a block at a time. The demo programs that accompany this article work comfortably with block sizes of 50Kbytes up to 250 Kbytes.

Figure 1
An sample data set
For purposes of illustration, I'll start with a much smaller data set, shown in Figure 1. This string contains seven bytes of data. In order to perform the BWT, the first thing we do is treat a string S, of length N, as if it actually contains N different strings, with each character in the original string being the start of a specific string that is N bytes long. (In this case, the word string doesn't have the C/C++ semantics of being a null terminated set of characters. A string is just a collection of bytes.) We also treat the buffer as if the last character wraps around back to the first.

Figure 2
The set of strings associated with the buffer
It's important to remember at this point that we don't actually make N -1 rotated copies of the input string. In the demonstration program, we just represent each of the strings by a pointer or an index into a memory buffer.
The next step in the BWT is to perform a lexicographical sort on the set of input strings. That is, we want to order the strings using a fixed comparison function. This means we should compare using a function similar to the C functions strcmp() or memcmp(). In this high level view of the algorithm the comparison function has to be able to wrap around when it reaches the end of the buffer, so a slightly modified comparison function would be needed.
After sorting, the set of strings is arranged as shown in Figure 3. There are two important points to note in this picture. First, the strings have been sorted, but we've kept track of which string occupied which position in the original set. So, we know that the String 0, the original unsorted string, has now moved down to row 4 in the array.

Figure 3
The set of strings after sorting
Second, I've tagged the first and last columns in the matrix with the special designations F and L, for the first and last columns of the array. Column F contains all the characters in the original string in sorted order. So our original string "DRDOBBS" is represented in F as "BBDDORS".
The characters in column L don't appear to be in any particular order, but in fact they have an interesting property. Each of the characters in L is the prefix character to the string that starts in the same row in column F.
The actual output of the BWT, oddly enough, consists of two things: a copy of column L, and the primary index, an integer indicating which row contains the original first character of the buffer B. So performing the BWT on our original string generates the output string L which contains "OBRSDDB", and a primary index of 5.
The integer 5 is found easily enough since the original first character of the buffer will always be found in column L in the row that contains S1. Since S1 is simply S0 rotated left by a single character position, the very first character of the buffer is rotated into the last column of the matrix. Therefore, locating S1 is equivalent to locating the buffer's first character position in L.
Two obvious questions
At this point in the exposition, there are two obvious questions. First, it doesn't seem possible that this is a reversible transformation. Generally, a sort() function doesn't come with an unsort() partner that can restore your original ordering. In fact, it isn't likely that you've ever even considered this as something you might like. And second, what possible good does this strange transformation do you?
I'll defer the answer to the second question while I explain the reversibility of this transform. Unsorting column L requires the use of something called the transformation vector. The transformation vector is an array that defines the order in which the rotated strings are scattered throughout the rows of the matrix of Figure 3.
The transformation vector, T, is an array with one index for each row in column F. For a given row i, T[ i ] is defined as the row where S[ i + 1 ] is found. In Figure 3, row 3 contains S0, the original input string, and row 5 contains S1, the string rotated one character to the left. Thus, T[ 3 ] contains the value 5. S2 is found in row 2, so T[ 5 ] contains a 2. For this particular matrix, the transformation vector can be calculated to be { 1, 6, 4, 5, 0, 2, 3 }.

Figure 4
The transformation vector routes S[ i ] to S[ i + 1]
Figure 4 shows how the transformation vector is used to walk through the various rows. For any row that contains S[ i ], the vector provides the value of the row where S[ i + 1 ] is found.
The Rosetta Vector
The reason the transformation vector is so important is that it provides the key to restoring L to its original order. Given L and the primary index, you can restore the original S0. For this example, the following code does the job:
-
int T[] = { 1, 6, 4, 5, 0, 2, 3 };
-
char L[] = "OBRSDDB";
-
int primary_index = 5;
-
-
void decode()
-
{
-
int index = primary_index;
-
for ( int i = 0 ; i <7 ; i++ ) {
-
cout <<L[ index ];
-
index = T[ index ];
-
}
-
}
You can get there from here
So now we come to the core premise of the Burrows-Wheeler transform: given a copy of L, you can calculate the transformation vector for the original input matrix. And consequently, given the primary index, you can recreate S0, or the input string.
The key that makes this possible is that calculating the transformation vector requires only that you know the contents of the first and last columns of the matrix. And believe it or not, simply having a copy of L means that you do, in fact, have a copy of F as well.

Figure 5
The known state of the matrix given L
Given just the copy of L, we don't know much about the state of the matrix. Figure 5 shows L, which I've moved into the first column for purposes of illustration. In this figure, F is going to be in the next column. And fortunately for us, F has an important characteristic: it contains all of the characters from the input string in sorted order. Since L also contains all the same characters, we can determine the contents of F by simply sorting L!

Figure 6
The known state after recovering F
Now we can start working on calculating T. The character 'O' in row 0 clearly moves to row 4 in column F, which means T[ 4 ] = 0. But what about row 1? The 'B' could match up with either the 'B' in row 0 or row 1. Which do we select?
Fortunately, the choice here is not ambiguous, although the decision making process may not be intuitive. Remember that by definition, column F is sorted. This means that all the strings beginning with 'B' in column L also appear in sorted order. Why? They all start with the same character, and they are sorted on their second character, by virtue of their second characters appearing in column F.
Since by definition the strings in F must appear in sorted order, it means that all the strings that start with a common character in L appear in the same order in F, although not necessarily in the same rows. Because of this, we know that the 'B' in row 1 of L is going to move up to row 0 in F. The 'B' in row 6 of L moves to row 1 of F.
Once that difficulty is cleared, it's a simple matter to recover the transformation matrix from the two columns. And once that is done, recovering the original input string is short work as well. Simply applying the C++ code shown earlier does the job.
The Punchline
The whole point to this BWT exercise is supposed be better data compression, so let's take a look at how this works. Figure 7 shows a snippet of debug output taken from the test program BWT.CPP. Each line of output shows a prefix character followed by a sorted string. The prefix character is actually what is found in column L after the strings have been sorted.
-
t: hat acts like this:<13><10><1
-
t: hat buffer to the constructor
-
t: hat corrupted the heap, or wo
-
W: hat goes up must come down<13
-
t: hat happens, it isn't likely
-
w: hat if you want to dynamicall
-
t: hat indicates an error.<13><1
-
t: hat it removes arguments from
-
t: hat looks like this:<13><10><
-
t: hat looks something like this
-
t: hat looks something like this
-
t: hat once I detect the mangled
Figure 7
Debug output from BWT.CPP
This section of output consists of a group of sorted strings that all start with the characters "hat". Not surprisingly, the prefix character for almost all of these strings is a lower case 't'. There are also a couple of appearances of the letter 'w'.
Anytime you see repetition like this, you clearly have an opportunity for data compression. Sorting all of the strings gives rise to a set of prefix characters that has lots of little clumps where characters repeat.
Using the opportunity
We could take the output of the BWT and just apply a conventional compressor to it, but Burrows and Wheeler suggest an improved approach. They recommend using a Move to Front scheme, followed by an entropy encoder.
A Move to Front (or MTF) encoder is a fairly trivial piece of work. It simply keeps all 256 possible codes in a list. Each time a character is to be output, you send its position in the list, then move it to the front. In the above example, column L might start with the following sequence: "tttWtwttt". If we encode that using an MTF scheme, we would output the following integers: { 116, 0, 0, 88, 1, 119, 1, 0, 0 }.
If the output of the BWT operation contains a high proportion of repeated characters, we can expect that applying the MTF encoder will give us a file filled with lots of zeros, and heavily weighted towards the lowest integers.
At that point, the file can finally be compressed using an entropy encoder, typically a Huffman or arithmetic encoder. The example programs that accompany this article use an adaptive order-0 arithmetic encoder as the final step of the process.
And the winner is…
Using the BWT as a front end for data compression has a result similar to that of simple statistical modeling programs. An order-n statistical model simply uses the previous n characters to predict the value of the next character in a string. Compression programs based on statistical modeling can give good results, at the expense of high memory requirements.
The BWT in effect uses the entire input string as its context. But instead of using a preceding run of characters to model the output, it uses the string following a character as a model. Because of the sorting method used, characters hundreds of bytes downstream can have an effect on the ordering of the BWT output.
This characteristic of the BWT has another side effect. It means that in general, the bigger the block size, the better the compression. As long as the data set is homogenous, bigger blocks will generate longer runs of repeats, leading to improved compression. Of course, the sorting operations needed at the front end of the BWT will usually have O( N * log(N) ) performance, meaning bigger blocks might slow things down considerably. (Although Burrows and Wheeler point out that a suffix tree sort can be done in linear time and space.)
Programmers today might also be impressed by the fact that the BWT is apparently not covered by any software patents. I asked Burrows and Wheeler if any patents had been initiated by DEC, and both indicated not. Since this algorithm was published in 1994, it may mean the technique is in the public domain. (Consult an attorney!)
Even if the BWT is in the public domain, programmers need to be careful when selecting an entropy encoder for the final stage of compression. Arithmetic coding offers excellent compression, but is covered by quite a few patents. Huffman coding is slightly less efficient, but seems less likely to provoke intellectual property fights.
A Reference Implementation
One of the nice things about BWT compression is that it's easy to isolate each step of the process. I took advantage of that when writing the demo programs to accompany this article. The compression and decompression processes will consist of running several programs in sequence, piping the output of one program to the input of another, and repeating until the process is complete.
Compressing a file using the demo programs is done using the following command line:
-
RLE input-file | BWT | MTF | RLE | ARI> output-file
A brief description of each of the programs follows:
| RLE.CPP | This program implements a simple run-length encoder. If the input file has many long runs of identical characters, the sorting procedure in the BWT can be degraded dramatically. The RLE front end prevents that from happening. |
| BWT.CPP | The standard Burrows-Wheeler transform is done here. This program outputs repeated blocks consisting of a block size integer, a copy of L, the primary index, and a special last character index. This is repeated until BWT.EXE runs out of input data. |
| MTF.CPP | The Move to Front encoder operates as described in the previous section. |
| RLE.CPP | The fact that the output file is top-heavy with runs containing zeros means that applying another RLE pass to the output can improve overall compression. I believe that further processing of the MTF output will provide fertile ground for additional improvements. |
| ARI.CPP | This is an order-0 adaptive arithmetic encoder, directly derived from the code published by Witten and Cleary in their 1987 CACM article. |
The output of the compressor can be decompressed by applying the same algorithms in reverse. The command line that accomplishes this is:
-
UNARI input-file | UNRLE | UNMTF | UNBWT | UNRLE> output-file
Each of the programs here has a one-to-one correspondence with its namesake from the compression section.
The 10 source files that make up this program suite are all fairly short, and their inner workings should be fairly clear. The most complicated programs are the arithmetic encoder and decoder, which are well documented elsewhere.
Results
I used the standard Calgary Corpus to benchmark my version of BWT compression. The corpus is a set of 18 files with varying characteristics that can be found via anonymous ftp at:
ftp.cpsc.ucalgary.ca/pub/projects/text.compression.corpus
For purposes of comparison I show the raw size of the file, the compressed sizes using the BWT compressor, and the compressed sizes using PKZIP with the default compression level selected.
| File Name | Raw Size | PKZIP Size | PKZIP Bits/Byte | BWT Size | BWT Bits/Byte |
| bib | 111,261 | 35,821 | 2.58 | 29,567 | 2.13 |
| book1 | 768,771 | 315,999 | 3.29 | 275,831 | 2.87 |
| book2 | 610,856 | 209,061 | 2.74 | 186,592 | 2.44 |
| geo | 102,400 | 68,917 | 5.38 | 62,120 | 4.85 |
| news | 377,109 | 146,010 | 3.10 | 134,174 | 2.85 |
| obj1 | 21,504 | 10,311 | 3.84 | 10,857 | 4.04 |
| obj2 | 246,814 | 81,846 | 2.65 | 81,948 | 2.66 |
| paper1 | 53,161 | 18,624 | 2.80 | 17,724 | 2.67 |
| paper2 | 82,199 | 29,795 | 2.90 | 26,956 | 2.62 |
| paper3 | 46,526 | 18,106 | 3.11 | 16,995 | 2.92 |
| paper4 | 13,286 | 5,509 | 3.32 | 5,529 | 3.33 |
| paper5 | 11,954 | 4,962 | 3.32 | 5,136 | 3.44 |
| paper6 | 38,105 | 13,331 | 2.80 | 13,159 | 2.76 |
| pic | 513,216 | 54,188 | 0.84 | 50,829 | 0.79 |
| progc | 39,611 | 13,340 | 2.69 | 13,312 | 2.69 |
| progl | 71,646 | 16,227 | 1.81 | 16,688 | 1.86 |
| progp | 49,379 | 11,248 | 1.82 | 11,404 | 1.85 |
| trans | 93,695 | 19,691 | 1.68 | 19,301 | 1.65 |
| total: | 3,251,493 | 1,072,986 | 2.64 | 978,122 | 2.41 |
Table 1
Mandatory Comparison of Results
As Table 1 shows, this demonstration program manages to compete pretty well with the compression offered by PKWare's commercial product, PKZip. I consider this particularly encouraging in light of the fact that there should be plenty of headroom left for tweaking additional percentage points from my code.
Acknowledgments
I had some excellent help from a few people who reviewed this article and the source code. My thanks go out to Nico de Vries, Leo Broukhis, Jean-loup Gailly, David Wheeler, and Robert Jung.
I also have to thank Quincey Koziol and Philipp Knirsch for the help they gave me with UNIX testing. Without their assistance, the source code for this article would have only been usable under MS-DOS and Windows.
Bibliography and Resources
You can find resources, additional source code, and commentary regarding this compression on my personal web page at http://www.dogma.net/markn, under the "Magazine Articles" section.
Burrows, M. and Wheeler, D.J. (1994) "A Block-sorting Lossless Data Compression Algorithm", Digital Systems Research Center Research Report 124,
http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/src-rr-124.html.
Witten, I.H., Neal, R., and Cleary, J.G., (1987b) "Arithmetic coding for data compression," Communications of the Association for Computing Machinery,30 (6) 520-540, June.
Nelson, Mark and Gailly, Jean-Loup, (1995) The Data Compression Book, second edition, M&TBooks, New York.
Source Code
| rle.cpp | The run length encoder program. |
| bwt.cpp | Burrows-Wheeler transform code, implemented using the traditional C function qsort(). |
| bwta.cpp | Burrows-Wheeler transform code, implemented using the STL. |
| mtf.cpp | A Move To Front encoder. |
| ari.cpp | An order-0 adaptive arithmetic encoder, using the code from the 1987 CACM article by Witten, et.al. |
| unrle.cpp | Run Length decoder, undoes the encoding performed by rle.cpp. |
| unbwt.cpp | Inverse Burrows-Wheeler transform, using the format used in bwt.cpp and bwta.cpp. |
| unmtf.cpp | The Move to Front decoder, companion to mtf.cpp. |
| unari.cpp | An order-0 adaptive arithmetic decoder, using the code from the 1987 CACM article by Witten, et.al. |
| readme.txt | Build instructions, short and sweet! Please read! |
| bwtcode.zip | The whole package in a ZIP file. |
Updated Notes
The Burrows-Wheeler Transform: Theory and Practice (1999) Giovanni Manzini
97 users commented in " Data Compression with the Burrows-Wheeler Transform "
Follow-up comment rss or Leave a Trackbacki would like to ask you a question that is it the same to use this algorithm to compress a text file or a binary file?
i hope i can express clearly~
yes, the algorithm will work equally well on text or binary data. The same sort of redundancy can be exploited in each kind of file.
Recently I tested the compression ability of bzip2(using BWT and Huffman), and i found that it appears better when the compressed file is smaller. The fact is that when the file is bigger than 20M the compression ratio is very low, but the ratio is much higher by using Winrar.
did you ever found this problem? And would you give me some advice to improve it? Thank you very much~
My guess would be that the problem you are seeing is a variation in the content of the file - for example, the data stream in an EXE file will change dramatically as you move from code to data to resources.
If bzip2 uses huge blocks, it will be modeling the code and data together, which will hurt compression.
Programs like zip use small blocks, so they don't run into this problem. Each 32K block is compressed independently. This can be good or bad... it all depends on the data being compressed.
Sorry for my poor english-_-b
Did you mean that the blocksize 100K in bzip2 was huge ?I shall do some tests more~~
Is it true that the compression one achieves is dependent on the degree of correlation between byte N and byte N 1? Or conversely, without any byte-to-byte correlation of the stream to be encoded, (i.e. if byte[n] is uncorrelated with byte[n 1]), there will be no compression.
Correlation is where you find it - you could conceivably write a data compression program that only seeks to find correlation between byte N and N 5. I don't think you can apply any hard and fast rule to say that there has to be correlation between any two specific bytes to achieve compression.
A good example where this is not the case might be when you are compressing a stream of 80 bit floating point numbers. The numbers will appear in 10 bit blocks, and there will probaly be more correlation between bytes separated by 10 then bytes separateed by one.
That said, if we look at the specific case of the BTW transform followed by the MTF algorithm, I think you will find that strongly exploits unbounded correlation. In other words, BWT can find and expoit correlation between bytes separated by various distances.
Of course, in English text, I think the strongest predictory of a byte will always be the neighboring byte.
Hi Mark,
I think you might be wrong about the complexity of the BWT algorithm, as presented in this article.
I think the algorithm you describe above is really O(N^2 logN), and here's why:
At one point in the algorithm you need to do a lexographical sort of all N the possible rotations of the input strings, correct? This sort is required to generate the 'F' and the 'L' columns needed for the latter stages in the algorithm.
At any rate, this lexographical sort is of N strings, __each string being of length N__. This is an important point.
The best sorting algorithms have a worst-case time complexity of O(N log N). *BUT* this complexity assumes that an elementwise compare takes O(1).
However, in this algorithm you need to compare *strings* to each other, and each string is of length N, so elementwise compares are O(N). This means that the complexity of sorting for a BWT is O(N log N * N), or O(N^2 log N).
What do you think of that? I *do* think that there are faster implementations that don't have this type of complexity -- but they rely on building suffix trees in O(N) time in order to generate the 'F' and consequently the 'L' column of the BWT.
I don't think you are correct on this point, and here is why.
What is the average number of tests needed to compare two strings of length N? Your argument is that it takes N comparisons, and this is surely not even close.
For English text (which tends to not have a lot of repeated data) the number of comparisons is some value K, and my guess would be that K is very small, perhaps 2 or 3. K is also relatively independent of N, the length of the strings - no matter how long the source strings are, you will tend to find a mismatch quite early in the the two.
This can be tested empirically, of course.
In the case where there are pathological texts with lots of repeated data, you would find K suddenly shooting up - which is why most BWT implementations put their data through an RLE filter first.
Given the RLE filter, I think it would be extremely unusual to find a high value of K.
But... the thruth is that K is somewhat data dependent, so it is possible for it to be high in some types of data. But I do believe it will always be constant for a given data stream, and not related to N.
Hi Mark,
I really am very glad you are taking the time to discuss this in the comments section!
I think you are right about the value K, actually. I naively thought it was always average case N/2 but really it depends highly on the input text. Unfortunately, K seems to be rather large for genomic data for some reason -- and now discussing with you I think I know why.
Here's my problem:
It turns out I need to do a BWT on large texts (genomic data) in order to generate a suffix array for fast and efficient substring searches (of said genomic data).
I wrote an algorithm that does essentially what you describe in this article, and actually you are right -- I couldn't explain why on stuff like English text it appeared to be orders of magnitude faster than on (some) genomic data. It turns out some genomic data has lots of redundancies and recurring substrings, so you end up with some quite expensive comparisons. Also, genomic data has pathological runs of the character 'N'! (N signifies not sequenced or unknown).
I have yet to try an RLE filter but I suspect that will help fix things *a lot*. If you think about it pathological runs of the same character are really going to raise K to N/2 in the average case (where N is the length of the run) for that substring!
It never occured to me that this what's breaking the sorting speed -- but it must be! Thanks so much for taking the time to discuss this with me and I'll post a comment again if it turns out to drastically improve my performance numbers -- as it should!
-Calin
Yeah, I think you are right that genomic data is going to have some trouble spots for sorting, including a small alphabet and those annoying repeated sequences. (Which really don't do us any good anyway, do they? Intelligent design?!?)
I've seen discussions on this before (don't have any specific references) and the best idea for improving sorting would be to do some sort of tokenization of the input - turning those sequences into single symbols. The trick is to determine the correct symbol set and I really don't have any good idea how one would do this.
Algorithms like LZ77 and LZ78 are very good at greedily identifying common subsequences, and they operate pretty quickly. One really interesting approach would be to use one of those algorithms instead of RLE - the output would be a stream of 16 bit symbols, and see how that affects sorting speed.
Thank u for this article....... is there any way to improve the compression ratio proposed in this article?
how to implement full text indexing in bwt?
There is a huge amount of work going on in this area, and certainly there are many, many ways to improve on this program. As a starting point, you might want to look at comp.compression and groups.google. and search on BWT. Lots to see.
how do we calculate the complexity of an algorithm generally , and also the complexity of bwt algorithm ?
well babu, that sounds a lot like a homework problem, and it's been a long time since i took that algorithms course.
1) How do we calculate the complexity of an algorithm generally?
This can be easy or hard depending on the algorithm, and of course you have to calculate both the time and space complexity. People make a career thinking about this question.
2) complexity of BWT?
I think you can find this online with some search.
one final question ... is there a way to search a substring in the compressed text or from the output of any of the stages from bwt , mtf , rle , as i am learning about the compression based on bwt and its applications .... any suggestions please
not easily. most compression algorithms don't provide for any way to search for substrings in compressed text. there are a few that do, but they tend to be specialized versions of existing algorithms. "search compressed text" on google and you'll get lots of hits.
hi your article is of great help in understanding BWT....i have to write a technical paper on this which should involve some innovation also....what to do? i have explained the algorithm but for the innovative part i don't understand what should i do? i have read somewhere that new techniques are replacing MTF......what are these and can i write on that?
Parul, the word "innovate" means you come up with new ideas - not that I tell you new stuff.
I would suggest that you look at the proceedings of the data compression conference for the past five years or so. There are many interesting papers being published on block sorting algorithms.
Failing that, try going to scholar.google.com and search on Burrows Wheeler - you'll find a lot to read.
See this interesting current summary when i download, burn and uncompress big files:
* linux-2.6.20.4.tar.bz2 43'373'415 bytes (BWT MT Huffman)
* linux-2.6.20.4.tar.gz 54,553,981 bytes ( 25.78% of size!!!) (LZ77 Huffman)
* bzip2: ( time ( bzip2 -vt linux-2.6.20.4.tar.bz2 ) ) 2>&1 | grep user | tr '\t' ' ' | cut -d' ' -f2
0m23.040s ( bzip2 is slow uncompressing !!! )
* gzip: ( time ( gzip -vt linux-2.6.20.4.tar.gz ) ) 2>&1 | grep user | tr '\t' ' ' | cut -d' ' -f2
0m2.580s ( gzip is 8.93 times more speed uncompressing than bzip2 !!! )
But the LZ77 algorithm is reaching the size to its limit and can't speedup more.
However, can we perform more the BWT without consuming more memory?
BWT does a nice job of compression, but it is going to use a lot of memory, no way around it.
Hello, Mark!
In December, I will deliver my compressor based on BWT, but with higher compression rates.
Sidney,
Higher compression rates than what?
I am looking for using BWT for very low freq band signals of abt 0-100 hz multichannel signal only can i use the same techniques or will i have to change some thing
Like any compression algorithm,the behavior of BWT will vary widely depending on the characteristics of the input data.
If your input data is byte oriented, and responds strongly to an order-1 model, BWT will probably do very well as is.
Without knowing any more about your data I am not really in a position to guess. I will say that if it compresses will with the deflate algorithm (Zip), it is very likely to compress better with BWT.
If your data does not compress well with deflate, it may well need some sort of preprocessing before it can be effectively compressed with BWT.
Mark,
Do you think the combination of BWT MTFT algorithmic encoding is a suitably effective basis to use when designing compression schemes for use on specific data sets?
I suspect it is, but since the BWT becomes less effective depending on local block size, as discussed above (compare with http://portal.acm.org/citation.cfm?id=5688&dl=GUIDE, ), it is plausible that it might pose significant risks to the programmer (and her project!) who implements it, indiscriminately, on every data set.
I don't have an ACM web account, so I can't read the paper.
However, I think in BWT actually has some nice characteristics for exploiting blocked data, particulary on large block sizes. For example, if a block size is 128K, a program like deflate will have a hard time exploting it, since its buffer is limited to 32K bytes. BWT on the other hand, will be pulling in multi-megabyte chunks, and thus will easily see the repeated patterns that appear in those blocks.
I'm not sure you would easily create a realistic data set where BTW perfomed WORSE than deflate.
That said, structured data can often be preprocessed to better exploit its compressibility, and this is often really helfpul.
Very interesting. I like that the algorithm can be partitioned into pipelined stages. That's excellent for multi-core processors.
Other notes:
1) The results would be much easier to interpret if compressed size were given as a percentage instead of raw numbers.
2) What about execution time compared to PKZIP? I understand that you probably didn't even think about optimizing for that, but it would be interesting.
Steve,
>The results would be much easier to interpret if compressed size were given
>as a percentage instead of raw numbers
That was the intent of the bits/byte column - image compression stats are often given in bits/pixel. But I think you're right, just showing a percentage of original size would be an improvement.
>What about execution time compared to PKZIP?
PKZIP or any program using zlib's deflate algorithm has one huge advantage: the amount of memory used in the compression cycle is miniscule for deflate when compared to PKZip.
Compression with BWT usually turns out to be slower. To get the best compression, it is to your advantage to use big blocks of memory. But sorting those big blocks can be slow, and of course sort algorithms are always worse than O(N).
A good comparison of two nicely optimized programs can be done by comparing Zip vs. bzip2 - ports available for most operating systems.
Mark,
Nicely written article. I am interested in compressing very large files of integer data, with values ranging up to 4 digits. (1) Is this type of file conducive to BWT processing, (2) is qsort() appropriate, or is there a better sorting function.
Thanks.
@RW:
Yes, you can certainly use BWT to compress a sequence of integers. You can compress either as a sequence of bytes, or as a sequence of integers. In many cases you will do better by compressing based on the natural storage unit, in this case the integer.
The qsort() routine in the C standard library is fine for sorting the data. If your blocks of input are very large, you probably want to apply RLE encoding in advance to avoid the possibility of long runs of repeated data.
is it better than lz77 or not?
@adzo:
That's not really a yes/no question. When comparing data compression algorithms, you have to consider performance on different types of data sets, speed of the algorithm, memory and other resources used, and so on. There aren't usually any easy answers.
[...] Nelson, Data Compression with the Burrows-Wheeler Transform, Dr. Dobb’s Journal, [...]
May i know how to perform lexicographically sorting? what is the concept behind it?
@zola:
lexicographical sorting simply means sorting in alphabetical order. You can do it with qsort() or other sorts of your choice.
What makes it a little tricker for BWT is the fact that you are sorting rotations of a buffer. You don't want to actually have to perform the rotation, so you need an iterator that walks in a circle through the buffer, past the end, and back to the beginnined. The same iterator needs to stop just before the position where it started.
That's not so easy to do with qsort().
hmm sorry about it.. but can u show me some example on how to do a lexicography sort? i study a few example but still get the idea behind it.. many thz
and btw i m doing in java so i don't think there is any sort method to do that..
Why use the BWT and arithmetic compression, when you could just simply skip the BWT and move-to-front algorithm, and just go straight to the arithmetic compression part? Arithmetic compression is not affected by "clumping of data", all that matters then is frequency of the data, no matter where it is at in the data. Of course Huffman compression would be a different story, it just is the choice of arithmetic compression is unclear, since arithmetic compression is known for already giving higher compression ratios than PKZIP, but is rather slow in implementing. Have you ever done a comparison of the arithmetic compressor to the BWT compression using arithmetic compression?
Since the BWT tends to clump byte pairs and not individual bytes, wouldn't you think that BWT would do wonders paired up with Byte Pair Encoding (BPE)? Together they would be very fast and efficient, but unfortunately I believe BPE still has a patent against it.
@ar18:
You're a little mixed up here. First of all, there really is no such thing as "arithmetic compression." You can use an arithmetic coding stage as part of a data compression program, which is what the BTW program does.
There are many other types of compressors that use arithmetic coding for their output stage, so it's not really clear what you are trying to compare BWT agains. However, let's assume you are talking about an order-n PPM system.
You can find good comparison results fairly easily. In most cases, if you want to outperform BWT in terms of compression ratios, you will find that the resulting compressor is perhaps order-4 or order-5, has very steep memory requirements, and runs fairly slowly.
Not that there is anything wrong with PPM. There are many ways to skin this cat. PPM and BWT are two of them, and they are both used in various places effectively.
>Arithmetic compression is not affected by
>"clumping of data", all that matters then is
>frequency of the data, no matter where it is
>at in the data.
Sounds again like you're talking about PPM. This is of course a weakness, not a strength. Frequency distributions change over files or a distribution of files, and PPM doesn't do much in the way of adaptation. This prevents it from getting better results.
- Mark
I'm not mixed up Mark, arithmetic coding and arithmetic compression are one and the same thing, just looked at from different angles. There are compressors that use nothing but arithmetic coding to compress the data and they all can outperform anything out there, including PKZIP, except when it comes to speed.
So the issue is, how much better is BWT and arithmetic coding compared to just arithmetic coding alone? That is a lot of overhead to add to something that already is slow and compresses data well so what is the benefit of one over the other?
No, I am not talking about PPM but about variable-length entropy encoding, like Huffman, which is just a specialized case of arithmetic coding anyway.
So we are back to my question. I have observed the output of BWT and it tends to clump data in pairs. Data pairs are more quickly and efficiently coded by BPE than by arithmetic coding. BPE, by itself can outperform PKZIP (especially in speed during decompression), so BWT would improve that ever more and it would be faster than arithmetic coding to boot.
Hi Mark,
a very interesting article. Mark, i am looking at stream compression, and the data that will be passed on to the BWT compressor will be around 1K only, and is mostly binary and with very high entropy. Is there some parameter i can tweak in BWT, the block size etc. to get better compression. The standard(bzip) one is not giving me any improvement. As far as i know it does bwt, mtf and huffman. The data is so random, that i don't think having RLE to start will be of any use. Will appreciate any suggestion.
@variac:
The key problem is your input "the data is so random". No algorithm is going to be able to compress data that appears random - there has to be some sort of pattern to pick up on. If you know something about the stream you might be able to preprocess it so a general purpose compressor can do better. But if it's just plain random, well, it's not going to compress.
Thanks Mark, but does BW transform, by any chance help in reducing the randmoness, as in bringing similar characters together.
hi Mark, it is said that BWT is a lossless compression algorithm.But it gives an output size that is same as that of input. then how come its a compression algorithm?
@Naga:
I think I explain this in a huge amount of detail. Did you read the article?
If you did, and still have the same question, either I have failed or you need to brush up on some basics.
BWT is a lossless data transform that takes an input text and produces an output of the same size that is more compressible using certain simple techniques.
Thus data is never compressed solely by BWT - it is only compressed when used in conjunction with some other operations.
- Mark
Thanks for the (re-published) article Mark.
This is indeed very clear writing. It helped me *a lot* to understand this mystery. BWT is no longer "a strange thing".
Something that still puzzles me after this reading is the speed/compression ratio of this technique. This is maybe not a question for a "compression only matters" program, but when trying to optimise the right compression/speed/memory combination, we also have to consider the time spent.
As shown in your examples, BWT advantage compared to Zip is not very huge, it is not even "guaranteed" positive.
What's more, the comparison is not "fair", meaning that we compare 2 completely different things, so we don't exactly know which part of the compression comes from "BWT effect", and which one comes from ARI alone.
A possible comparison test to help understanding this could be to compare your encoding chain :
RLE input-file | BWT | MTF | RLE | ARI> output-file
with a BWT-less one :
RLE input-file | MTF | RLE | ARI> output-file
or even (maybe RLE is useless or counter-productive with ARI).
input-file | MTF | ARI> output-file
(for example)
Both timing and compression result would be interesting.
Well, maybe there are already other places on the web around that have looked into this question ?
Best Regards and thanks for your explanations Mark
@Yann:
Well these are interesting experiments, and easy enough to run. Give it a shot!
- Mark
[...] [upmod] [downmod] Data Compression with the Burrows-Wheeler Transform (marknelson.us) 1 points posted 6 months ago by jeethu tags burrows compression reference [...]
I will replace huffman with arithmetic coding in bzip2 to see what happen...
Does anyone know if BWT patent is still valid? When is the expiration for this patent? Thanks.
@eric:
"the BWT patent?"
The basic BWT technology was not patented by the inventors, and their publication of the basics would have prevented any other parties from acquiring a patent.
I am sure there are various block sorting patents covering refinements and tuning, but the basics of the algorithm have never been patented.
- Mark
Thanks Mark for your prompt reply.
Actually I am curious on the performance for BWT compression. Let say for the final stage compression, if we use only RLE or only Huffman coding, wil there be much differ on compression ratio wise?
Because currently I am having a BWT source code which compress the input texts in this sequence. bwtsort->mtf->huffman. So I am thinking will it impact much on the compression ratio if I remove the huffman process and replace with RLE.
-Eric
At the same time, do anyone has C source code for BWT? Could you send it to me or post the link? Thanks.
Here is my email address:-
alienchean@gmail.com
@eric:
RLE will not have a large effect on compression - it is in place mostly to prevent pathological sort times. If you leave it out, a file with enormously long repeated strings will take a long time to sort, but will still compress well.
Doing RLE only and not Huffman makes no sense.
- Mark
You said in your document that huffman cannot compress better than arithmetic coding but then take a look in this stackoverflow question
http://stackoverflow.com/questions/1755624/request-for-comments-on-huffman-compression
@arabcoder:
I don't see much interesting in the stackoverflow comment thread.
And yes, arithmetic coding will either equal or outperform Huffman coding.
- Mark
thank you for your answer, maybe I am doing something wrong
@arabcoder:
I'm not really sure what you are doing or what your question is.
- Mark
Can anyone please let me know how to run this whole set of codes to compress a file. I really want to compress a set of files using this bwt. But when I try executing it, I have no option of inputting files. Please help me out somehow. I'm really in need of it as I'm doing a dissertation based on it. Please people. Help me out.
Regards,
Mohan
Hi mark,
can you explain to me the reasons why the BWT is implemented in this manner with the MTF?
why is it done this way rather than in any other way?
what are the advantages to it?
@John:
If you look at the raw output of the BWT encoder, you will see that the data stream contains frequent repetitions of a small set of characters. This is the natural result of the sorting process. If we are looking at the characters that precede the string 'he boy got out of bed', you are going to see a preponderance of 'T' and 't'.
So MTF provides a handy way to take advantage of it. If you had, for example, a string like 'tttttTtTtttt' you might have an MTF output of 0 0 0 0 0 0 0 1 0 1 0 0 0 ....
This compresses very well using conventional entropy encoders.
- Mark
Thanks mark, i can see why there are advantages to it now using the BWT with the MTF.
can you have a look at the program which i had sent to you by email and tell me what futher improvements could be done to it, and why it is designed this way rather than any other ways, but if there are better ways in creating the compression program, what would they be?
thank you again
@John:
No, I don't really have the time to evaluate somebody else's program. Perhaps you should check with the original author?
- Mark
ok, dont worry mark, i will try, i will find it out. However, i have one more question, when using the BWT with the MTF, why is it that the decoder is very slow, is there a reason for this? is it because of the block size?
@John:
I can't really answer that - it may be the particular file you are using, it may be the compiler you used to build the code, or it may be inefficiencies in my code. As stated in the source, my code was written for clarity, not speed.
For a more representative test, see how the same files work with bzip2, which is much more optimized.
- Mark
ok, thanks for the quick replies mark, greatly appreciated
john
Hi mark,
sorry to bother you again, but i have found out why the decompression speed is slower when using BWT with the MTF, i read somewhere that if the program does not contain a linear algorithm in it, then the block size will affect the compression and decompression speed. This will not be the case for a linear algorithm though.
can the BWT contain different non-linear and linear algorthims in it?
if so, can you give me an example of a non-linear and linear?
thanks
Good job Mark !
I think BWT isn't complicated theoretically, but when in implementation sorting the rotated strings of BWT is one of the essential thing, cos it needs much time.
How actually to sort the strings? Few days ago I've tried by using Bubble Sort and other algorithm, but it obliged me to holds all rotated strings first in to an array and then sort them using sorting algorithm. As we know saving elements till thousands even only hundreds requires much time. And I think that was bad practice for BWT.
Would you give a concept how sort the rotated strings of BWT?
Thank in Advance.
(Please notify me by my email : penn.thompson@gmail.com)
@Frost:
There is a trivially easy way to sort the strings. Copy each string of length N into an array, then use the standard sort function in whatever language you are using to sort them. This works easily, but it has the disadvantage of creating N copies of the entire block of N characters, which can be huge.
To sort efficiently, do this:
1) Create an array of indices into the block of N characters. Each element in the array points to the starting position of a string. So you can initialize it like this:
for ( int i = 0 ; i array[ j ] )
... do something
You need to change this in your algorithm so that it sorts two strings in a block of N, one starting at position i, one starting at position j, and that wrap around the block and continue until they are length N. This is a slightly tricky piece of work, but it's really the only part of the sort that is difficult.
- Mark
Thank Mark!
That was very help
Hi Mark!
Awesome article!
I think this is the best article for BWT so far, there are a lot of comments too.
Mark, few days ago I implement the BWT based on your article. Instead of qsort() I make my own sorting function using Merge Sort algorithm. I start playing with file from 20kb - 100kb, everything is ok. But, when I'm dealing with file > 100kb, it consume much time (until > 15 minutes), then my computer hang and can't perform the rotating any more. It turns out that "sorting" is the most much time consuming. So, I perform some efforts like RLE my file first, but it doesn't take too much effects.
What's wrong with my own sorting function?? Do you have any idea to improve the sorting speed..?? If I'm not wrong the solution is we must re-size the block size. But, I don't know exactly what the idea behind it.
This is my snippet works for my bwt.
that is my comparing function that compare two rotated strings that go in the x and y position. It reads a byte at a time in buffer until there's the difference byte or the length buffer reached. Would return TRUE if x < y, otherwise FALSE.
Then I pass the comparePresorted() function to MergeSort algorithm to get the suffix array.
As I said, it just ok for a file 100 kb (hang)
Would you give me some suggestions to improve my sorting in order I can dealing with file > 100 kb ???
Thank in Advance Mark!
- Philips Tel
You didn't say much about what language you are using or what your code looks like. There's no mergesort algorithm in the C++ library, so I don't know what language you are implmeenting.
However, you are right that sorting is the most expensive part of the BWT algorithm. You may decide that you want to compress blocks of 100K and limit it to that size.
For more ideas on how your sort can be done better, you might want to look through the source code of bzip2.
- Mark
Hi Mark,
Thank for fast response!
I'm using Delphi. And my comparePresorted function look like this:
That function works by comparing the string a byte at a time based on the starting point that passed in x and y which means "if character in position[x] < character in position[y] would return TRUE, otherwise FALSE";
example : string "BANANA" (characters position from 0,1,3..5)
if we call comparePresorted("BANANA", 2, 4), it would compare strings from position 2,3,4,5,0,1 and 4,5,0,1,2,3.
In this case the comparePresorted will stop the comparison when x = 4 and y = 0 (since position 4 is 'N' and 0 is 'B' and they are difference)
In the original MergeSort algorithm, it compares two numbers to sort the sequence of numbers. It something like this:
Then I modified it by passing the comparePresorted() function in it. So, it would something like this:
After it, we would get a sorted indicies.
But, as I said it just works for file 100 kb (very slow)
Is it good statistic Mark for BWT?
I don't know how improve it, but based on Burrows' paper, he suggest by using an EOF ($). So, our previous "BANANA" would "BANANA$". But I don't know what is the benefit of EOF '$' in sorting the rotated strings.
Would you tell me mark what's actually the using of EOF $ ???
It's really confused.
Thank in Advance
- Philips Telaumbanua
There is a big advantage to placing an EOF character at the end of the string. If the character is not found anywhere else in the input text, it means that you can use normal string compares to sort the strings - so instead of checking to see if the index wraps around at N, you can be guaranteed that the two strings won't match when one index reaches N.
That speeds things up a lot, and it means you can use a native sort, which might even be faster.
The only disadvantage is that you have to take the EOF characters out of the block when you decompress. No problem.
I'd suggest continued study of the Burrows Wheeler suggestions for improving the sort speed, there is a lot of good in there.
- Mark
Thank you Mark for response again!
Yeah, I just realized that the EOF character should be less than every characters in the block. So, it would reduce the comparison. In other words it wouldn't continue the comparison if x or y = N (x or y = EOF), since EOF character always < other characters so the comparison can be terminated if one of them reached EOF. A good example is when we'd like to transform string "aaaaa$".
I think I've got the point.
Thank you very much Mark!
Ooo,,,I just want to tell you that your articles is very great, good written, neat, and clear. (article + source code = perfect). It has been a huge help.
Thank again Mark...
- Philips Tel
HI mark could you put up a java implementation?
Or anyone who has doen it please let me know... would be of great help.
@Shoaib:
Google is your friend.
After reading the posts here, I think BWT can be understood in a much simpler way as follows:
1. the first sorted column is embedded in all columns
2. the adjacent relationships in rows are preserved in any permutations of the rows
3 two adjacent columns and an index to the original row are enough to decode. The last and first column are adjacent in the cyclic view.
4. sorting speed is important in the encoding
Damn.....
Where the hell is my post about the link that i suggest...
Okay, it doesn't matter mark...!!!
lol....
I AM TESTING BWT WITH MTF WHICH USES THE LOCALITY OF THE ALPH-BETICAL SET OF THE LAST COLUMN. THEREFORE bwt-mtf -> bwt-mtf MAY PRODUCE BETTER RESULTS. THEN lzw OR ARITHMETIC OR HUFFMAN IN THE FINAL!
Can anyone help me find the links for BWT Compression Algorithm code in MATLAB??
[...] a lot of better articles about this algorithm out there. I found Michael’s O’Stuff or Dr Dobb’s Journal is a good article so far. I highly recommend you to read them if you are serious studying about [...]
Hi Mark)
What would you say if I tell you that I also came up with an algorithm BWT, it was in 2000. Compression thought, but could not come up with decompression, probably funny to you)
I still did not have internet, it was not necessary, I just programmed for pleasure. Then when he came, I was surprised, much of what I came up with has already been published. And decompression BWT, too)
Data compression - a narrow niche, it's hard to come up with a new one. 1-2% improvement in compression, but not more than 5% of 10% is to dream on a particular algorithm.
Can I claim a copyright on BWT,? I'm also smart, come up with increased compression BWT,)))
And let's tell the truth - no better arithmetic coder Huffman, a slower and a bit more on a percent.
Sant, Russia)
Decompression is the hard part on that one!
- Mark
Mark. Tell others in the book - I rewrote QuickSort in Assembler, the rate increased to 10 times the minimum and the maximum of 100 times, people need to know that)
It certainly is for normal text.
Maybe you also did not know that, I spent a lot of tests. )
@Santa:
Sorting those big blocks is what kills BWT performance, so anything you can do to optimize it is going to be a big deal...
- Mark
Mark)
For blocks of text? There easier.
But this is true for binary data, the compression of the data - a problem. Sorting plays a second role, whether it makes sense to improve the sort of binary data, if improved compression algorithm has the advantage of not more than 3-5%? Deadlock.
How does Burrows-Wheeler transform work?...
I think you will find these articles good for burrow wheeler transform 1.http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform 2.http://marknelson.us/1996/09/01/bwt/ 3.http://www.cs.princeton.edu/courses/archive/spr03/cs226/assignments/burrows...
I know how the BWT, the problem is that there is no solution for the binary files, compression is very bad for them. As an option is a variable-length blocks, but how to calculate? Wikipedia link - is a sign of bad taste. )
BWT will compress binary data very well, as long as there are correlations between bytes that work well with the BWT model. If not, it doesn't matter whether it is binary or not.
- Mark
You did not answer. I think this is the development of BWT not a dead end. Let's talk seriously. Not? Prompt improving compression by 10% for binar )
> think this is the development of BWT not a dead end.
If you are suggesting that BWT doesn't work for binary files, I would just say that bzip2 does a pretty good job.
I'm not sure what else you are suggesing.
- Mark
bzip2 is not the best compressor. I have not figured out that he had done before the ARI. But I'm a little Ari improved by putting other other figures .. Although .. of one percent ...
BWT - a dead end. Or am I wrong? )
[...] Data Compression with the Burrows-Wheeler Transform, http://marknelson.us/1996/09/01/bwt/ [...]
Thank you Mark.
I am now studying BWT. I will be combing through this reference and although I work with C rather than C++ I hope to see functioning code soon.
Leave A Reply
You can insert source code in your comment without fear of too much mangling by tagging it to use the iG:Syntax Hiliter plugin. A typical usage of this plugin will look like this:[c]
Note that tags are enclosed in square brackets, not angle brackets. Tags currently supported by this plugin are: as (ActionScript), asp, c, cpp, csharp, css, delphi, html, java, js, mysql, perl, python, ruby, smarty, sql, vb, vbnet, xml, code (Generic). If you post your comment and you aren't happy with the way it looks, I will do everything I can to edit it to your satisfaction.int main()
{
printf( "Hello, world!\n" );
return 1;
}
[/c]