LZW Data Compression Revisited
One of the first articles I wrote for Dr. Dobb’s Journal, LZW Data Compression turned out to be very popular, and still generates a fair amount of traffic and email over twenty years later.
One of the reasons for its popularity seems to be that LZW compression is a popular homework assignment for CS students around the world. And that audience sometimes found the article to be a bit of a struggle. My code was modeled on the UNIX compress program, which was written in terse C for maximum efficiency. And sometimes optimization comes at the expense of comprehension.
By using C++ data structures I can model the algorithm in a much more straightforward way - the language doesn’t get in the way of a clear implementation. And after 20 years of answering puzzled queries I think I can improve on the overall explanation of just how LZW works.
In this updated look at LZW, I will first give a description of how LZW works, then describe the core C++ code that I use to implement the algorithm. I’ll then walk you through the use of the algorithm with a few varieties of I/O. Finally, I’ll show you some benchmarks.
I’m hoping that this version of the article will be good enough to last for another 20 years.
LZW Basics
LZW compression works by reading a sequence of symbols, grouping the symbols into strings, and converting the strings into codes. Because the codes take up less space than the strings they replace, we get compression.
My implementation of LZW uses the C++ char as its symbol type, the C++ std::string as its string type, and unsigned int as its code type. The tables of codes and strings are implemented using unordered_map, the C++ library's hash table data structure. By using the native types and standard library data structures, the representation in the program is straightforward and easy to follow.
Encoding/Decoding
Rather than jumping directly into a full implementation, I’m going to work my way up to LZW one step at a time.
The first step is getting a clear understanding of how the encoding and decoding process works. As I said earlier, LZW compression converts strings of symbols into integer codes. Decompression converts codes back into strings, returning the same text that we started with.
LZW is a greedy algorithm - it tries to find the longest possible string that it has a code for, then outputs that string. The code below is not quite LZW, but it shows you the basic idea of how a greedy encoder can work:
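Here is a rough sketch of such a greedy encoder, assuming a dictionary named codes that has already been filled with every string we might need (this is an illustration, not yet the real LZW encoder):

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

// Assumed to be filled in ahead of time with every string we will need.
std::unordered_map<std::string, unsigned int> codes;

void greedy_encode(std::istream &input, std::ostream &output)
{
    std::string current_string;
    char c;
    while (input.get(c)) {
        current_string += c;
        if (codes.find(current_string) == codes.end()) {
            // The new character created a string we have no code for:
            // erase it, output the code for the known prefix, and start
            // a new string with the character that didn't fit.
            current_string.erase(current_string.size() - 1);
            output << codes[current_string] << '\n';
            current_string = c;
        }
    }
    if (current_string.size())
        output << codes[current_string] << '\n';
}
```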
The greedy encoder reads characters in from the uncompressed stream, and appends them one by one to the variable current_string. Each time it lengthens the string by one character, it checks to see if it still has a valid code for that string in the dictionary.

This continues until we eventually add a character that forms a string that isn't in the dictionary. So we then erase the last character from that string, and issue the code for the resulting string - the string from the previous pass through the loop.

The value of current_string is then initialized with the character that broke the camel's back, and the algorithm continues in the loop, building new strings until it runs out of input characters. At that point it outputs the last remaining code and exits.
As an example of how this would work, imagine I have the input stream ACABCA, and my code dictionary looks like this:
String | Code |
---|---|
A | 1 |
B | 2 |
C | 3 |
AB | 4 |
ABC | 5 |
If you follow the algorithm above, you'll see that the code output has to be 1 3 5 1. If this wasn't a greedy algorithm, 1 3 4 3 1 would have been another valid output.
Decoding the stream in a system like this is very straightforward:
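A sketch of such a decoder, assuming a reverse dictionary named strings that maps codes back to the strings they stand for:

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

// Assumed to hold the same entries as the encoder's dictionary, keyed by code.
std::unordered_map<unsigned int, std::string> strings;

void greedy_decode(std::istream &input, std::ostream &output)
{
    unsigned int code;
    while (input >> code)          // read one code at a time
        output << strings[code];   // and emit the string it stands for
}
```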
Remember, the decoder shown above is just a hypothetical sample - we’re still working our way up to the full LZW decoder.
The LZW Encoder
The encoder shown above works okay, but there is one missing ingredient: management of the code dictionary. If you think about it, you’ll see that we only achieve reasonable compression when we are able to build up longer strings and find them in the dictionary. Building a useful dictionary is referred to in the data compression world as modeling.
But our management of the dictionary is constrained by an important requirement: the encoder and decoder both have to be working with the same copy of the dictionary. If they have different dictionaries, the encoder might send a string that the decoder can’t resolve.
Some data compression algorithms solve this problem by using a predefined dictionary that both the encoder and the decoder know in advance. But LZW builds a dictionary on the fly, using an adaptive method that ensures both the encoder and decoder are in sync.
LZW manages this in an effective and provably correct fashion. First, both the encoder and decoder initialize the dictionary with all possible single-character strings. For the compressor, that looks like this:
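A sketch of that initialization, assuming the encoder's dictionary is named codes (the decoder builds the mirror image, mapping codes 0-255 back to one-character strings):

```cpp
std::unordered_map<std::string, unsigned int> codes;

// One entry for each possible byte value, so any input can be encoded.
for (unsigned int i = 0; i < 256; i++)
    codes[std::string(1, static_cast<char>(i))] = i;
```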
This ensures that we can encode all possible streams. No matter what, we can always break a stream down into single characters and encode those, knowing that the decoder has the same strings in its dictionary with values 0-255.
Then comes the key component of the LZW algorithm. If you go back to the greedy encoding loop above, you’ll see that I keep adding input symbols to a string until I find a string that isn’t in the dictionary. This string has the characteristic of being composed of a string that currently exists in the dictionary, with one additional character.
LZW then takes that new string and adds it to the dictionary, creating a new code. The strings are added to the table with code values that increment by one with each new entry.
The resulting code is just a slightly modified version of the encoder that I listed above. It still only outputs codes for values that are in the dictionary, but now the dictionary is being updated with a new string every time an existing code is sent:
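A sketch of that encoder, using plain iostreams for now (the EOF handling and the I/O abstraction come later in the article):

```cpp
void compress(std::istream &input, std::ostream &output)
{
    std::unordered_map<std::string, unsigned int> codes;
    for (unsigned int i = 0; i < 256; i++)
        codes[std::string(1, static_cast<char>(i))] = i;
    unsigned int next_code = 257;          // 256 is reserved for the EOF marker

    std::string current_string;
    char c;
    while (input.get(c)) {
        current_string += c;
        if (codes.find(current_string) == codes.end()) {
            codes[current_string] = next_code++;        // add the new string
            current_string.erase(current_string.size() - 1);
            output << codes[current_string] << '\n';    // emit the known prefix
            current_string = c;
        }
    }
    if (current_string.size())
        output << codes[current_string] << '\n';
}
```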
The code above constitutes a more or less complete LZW encoder. I’ve only made a couple of additions to the previous encoder:
- The initialization of codes 0-255 with all possible single character strings.
- The insertion of the newly discovered string into the string table, generating a new code.
(One item of note in this code: you might wonder why next_code is initialized to 257, when 256 is the first free code. This is because I reserve code 256 for an EOF marker. More on this in a later section.)
Just to make sure this all adds up, I'll walk through the steps the encoder takes as it processes a string from a simple two letter alphabet: ABBABBBABBA. There are a lot of steps shown below, but working through the process in detail is a great way to be sure you understand it:
Input Symbol | Action(s) | New Code | Output Code |
---|---|---|---|
'A' | set current_string to 'A'; 'A' is in the dictionary, so continue | | |
'B' | set current_string to 'AB'; 'AB' is not in the dictionary, add it with code 257; output the code for 'A' - 65; set current_string to 'B' | 257 (AB) | 65 (A) |
'B' | set current_string to 'BB'; 'BB' is not in the dictionary, add it with code 258; output the code for 'B' - 66; set current_string to 'B' | 258 (BB) | 66 (B) |
'A' | set current_string to 'BA'; 'BA' is not in the dictionary, add it with code 259; output the code for 'B' - 66; set current_string to 'A' | 259 (BA) | 66 (B) |
'B' | set current_string to 'AB'; 'AB' is in the dictionary, so continue | | |
'B' | set current_string to 'ABB'; 'ABB' is not in the dictionary, add it with code 260; output the code for 'AB' - 257; set current_string to 'B' | 260 (ABB) | 257 (AB) |
'B' | set current_string to 'BB'; 'BB' is in the dictionary, so continue | | |
'A' | set current_string to 'BBA'; 'BBA' is not in the dictionary, add it with code 261; output the code for 'BB' - 258; set current_string to 'A' | 261 (BBA) | 258 (BB) |
'B' | set current_string to 'AB'; 'AB' is in the dictionary, so continue | | |
'B' | set current_string to 'ABB'; 'ABB' is in the dictionary, so continue | | |
'A' | set current_string to 'ABBA'; 'ABBA' is not in the dictionary, add it with code 262; output the code for 'ABB' - 260; set current_string to 'A' | 262 (ABBA) | 260 (ABB) |
(end of input) | exit loop; current_string is 'A'; output the code for 'A' - 65 | | 65 (A) |
After processing string ABBABBBABBA, the output codes are 65, 66, 66, 257, 258, 260, 65.
The dictionary at this point is:
String | Code |
---|---|
AB | 257 |
BB | 258 |
BA | 259 |
ABB | 260 |
BBA | 261 |
ABBA | 262 |

(Dictionary contents after processing ABBABBBABBA; entries 0-255 not shown for brevity.)
Looking at the above table, you can see a few interesting things happening. First, every time the algorithm outputs a code, it also adds a new code to the dictionary.
More importantly, as the dictionary grows, it starts to hold longer and longer strings. And the longer the string, the more compression we can get. If the algorithm starts emitting integer codes for strings of length 10 or more, there is no doubt that we are going to get good compression.
As an example of how this works on real data, here are some entries from the dictionary created when compressing Alice’s Adventures in Wonderland:
34830 : 'even\n'
34831 : '\nwith t'
34832 : 'the dr'
34833 : 'ream '
34834 : ' of Wo'
34835 : 'onderl'
34836 : 'land'
34837 : 'd of l'
34838 : 'long ag'
34839 : 'go:'
These strings have an average length of almost six characters. If we are writing the integer codes to a file using 16 bit binary integers, these entries offer the possibility of 3:1 compression.
The word adaptive is used to describe a compression algorithm that adapts to the type of text it is processing. LZW does an excellent job of this. If a string is seen repeatedly in the text, it will show up in longer and longer entries in the dictionary. If a string is seen rarely, it will not be the foundation for a large batch of longer strings, and thus won’t waste space in the dictionary.
The LZW Decoder
The change made to the basic encoder to accommodate the LZW algorithm was really very simple. One small batch of code that initializes the dictionary, and another few lines of code to add every new unseen string to the dictionary.
As you might suspect, the changes to the decoder will be fairly simple as well. The first change is that the dictionary must be initialized with the same 256 single-symbol strings that the encoder uses.
Once the decoder starts running, each time it reads in a code, it must add a new value to the dictionary. And what is that value? The entire content of the previously decoded string, plus the first letter of the currently decoded string. This is exactly what the encoder does to create a new string, and the decoder must follow the same steps:
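A sketch of the decoder as described so far, mirroring the encoder sketch above (the fix for one troublesome case is added in a later section):

```cpp
void decompress(std::istream &input, std::ostream &output)
{
    std::unordered_map<unsigned int, std::string> strings;
    for (unsigned int i = 0; i < 256; i++)
        strings[i] = std::string(1, static_cast<char>(i));
    unsigned int next_code = 257;          // 256 is reserved for the EOF marker

    std::string previous_string;
    unsigned int code;
    while (input >> code) {
        if (previous_string.size())
            strings[next_code++] = previous_string + strings[code][0];
        output << strings[code];
        previous_string = strings[code];
    }
}
```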
I won’t do a walk-through of the decoder - you should be able to take the codes output from the encoder, shown above, and run them through the decoder to see that the output stream is what we expect.
The important thing is to understand the logic behind the decoder. When the encoder encounters a string that isn’t in the dictionary, it breaks it into two pieces: a root string and an appended character. It outputs the code for the root string, and adds the root string + appended character to the dictionary. It then starts building a new string that starts with the appended character.
So every time the decoder uses a code to extract a string from the dictionary, it knows that the first character in that string was the appended character of the string just added to the dictionary by the encoder. And the root of the string added to the dictionary? That was the previously decoded string. This line of code implements that logic:
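In terms of the decoder sketch above, that line is:

```cpp
strings[next_code++] = previous_string + strings[code][0];
```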
It adds a new string to the dictionary, composed of the previously seen string, and the first character of the current string. Thus, the decoder is adding strings to the dictionary just one step behind the encoder.
You might note one curious point in the decoder. Instead of always adding the string to the dictionary, it is only done conditionally:
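Again using the names from the decoder sketch above:

```cpp
if (previous_string.size())
    strings[next_code++] = previous_string + strings[code][0];
```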
The only time that previous_string.size() is 0 is on the very first pass through the loop. And on the first pass through the loop, we don't have a previous string yet, so the decoder can't build a new dictionary entry. Again, the decoder is always one step behind the encoder, which is a key point in the next section, which puts the final touches on the algorithm.
The Catch
So far the LZW algorithm we’ve seen seems very elegant - that’s a characteristic we associate with algorithms that can be expressed in just a few lines of code.
Unfortunately, there is one small catch in this perceived elegance - the algorithm as I’ve shown it to you has a bug.
The bug in the algorithm relates to the fact that the encoder is always one step ahead of the decoder. When the encoder adds a string with code N to the table, it sends enough information to the decoder to allow the decoder to figure out the value of the string denoted by code N-1. The decoder won’t know what the value of the string corresponding to code N is until it receives code N+1.
This makes sense if you recall the key line of code from the decoder. It calculates the value of the string encoded by N-1 by looking at the string it received on the previous iteration, plus the first character of the current string. And that current string is the one that was sent after encoding N.
So how can this get us in trouble? The encoder is always one entry ahead of the decoder - it has entry N in its dictionary, and the decoder has entry N-1. So if the encoder ever sends code N, the decoder will look in its table and come up empty-handed, unable to do its job of decoding.
A simple example will show you how this can happen. Let's look at the state of the encoder after it has sent the first five symbols in a stream: ABABA:
Input Symbol | Action(s) | New Code | Output Code |
---|---|---|---|
'A' | set current_string to 'A'; 'A' is in the dictionary, so continue | | |
'B' | set current_string to 'AB'; 'AB' is not in the dictionary, add it with code 257; output the code for 'A' - 65; set current_string to 'B' | 257 (AB) | 65 (A) |
'A' | set current_string to 'BA'; 'BA' is not in the dictionary, add it with code 258; output the code for 'B' - 66; set current_string to 'A' | 258 (BA) | 66 (B) |
'B' | set current_string to 'AB'; 'AB' is in the dictionary, so continue | | |
'A' | set current_string to 'ABA'; 'ABA' is not in the dictionary, add it with code 259; output the code for 'AB' - 257; set current_string to 'A' | 259 (ABA) | 257 (AB) |
Now we are set for trouble. The encoder has code 259 in its dictionary, while the decoder has only gotten to 258. If the encoder were to send a code of 259 for its next output, the decoder would not be able to find it in its dictionary. Can this happen?
Yes, if the next two characters in the stream are BA, the next code output by the encoder will be 259, and the decoder will be lost.
In general, this can happen when a dictionary entry exists that consists of a string plus a character, and the encoder encounters the sequence string+character+string+character+string.

In the example above, the value of string is A, and the value of character is B. After the encoder encounters AB, it has string+character in the dictionary, so if the following sequence is ABABA, we will emit code N. (This can happen at the start of a file, when the string is empty, if you simply have a string of three identical characters.)
Whether this is likely to happen or not is not too important; what is important is that it most definitely can happen, and the decoder has to be aware of it. And it will happen repeatedly in the pathological case: a stream that consists of a single symbol, repeated without end.
The good news is that the problem is easily solved. When the decoder receives a code, and finds that this code is not present in its dictionary, it knows right away that the code must be the one that it will add next to its dictionary. And because this only happens when we are encoding the sequence discussed above, the decoder knows that instead of using this value for that code:
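```cpp
strings[code]              // the normal lookup, using the names from the decoder sketch
```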
it can instead use this value:
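```cpp
previous_string + previous_string[0]   // the entry the encoder must have just added
```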
The result of this is the insertion of just two lines of code at the start of the decompress loop, giving a loop that now looks like this:
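```cpp
// The decompress loop from the earlier sketch, with the two-line fix at the top.
while (input >> code) {
    if (strings.find(code) == strings.end())                   // code not seen yet:
        strings[code] = previous_string + previous_string[0];  // it must be this value
    if (previous_string.size())
        strings[next_code++] = previous_string + strings[code][0];
    output << strings[code];
    previous_string = strings[code];
}
```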
And with that, you have a complete implementation of the LZW encoder and decoder.
Implementation
Now that I’ve shown you the algorithm, the next step is to take that code and turn it into a working program. Without changing the algorithm itself, I’m going to take you through four different customizations that work as follows:
- LZW-A reads and writes code values rendered in text mode, which is great for debugging. It means you can view the output of the encoder in a text editor.
- LZW-B reads and writes code values as 16-bit binary integers. This is fast and efficient, and usually results in significant data compression.
- LZW-C reads and writes code values as N-bit binary integers, where N is determined by the maximum code size. Performing I/O on codes that are not aligned on byte boundaries complicates the code somewhat, but allows for greater efficiency and better compression.
- LZW-D reads and writes code values as variable-length binary integers, starting with 9-bit codes and gradually increasing as the dictionary grows. This gives the maximum compression.
Before launching into these implementations, the code I showed above needs some minor tweaking to solve a couple of problems.
The first problem we have to deal with is the ever-expanding dictionary. In the algorithm I’ve presented, we keep adding new codes to the dictionary without end. This needs to be changed for a couple of reasons.
First, we don’t have unlimited memory, so the dictionary simply can’t grow forever. Second, practical experience shows that compression ratios don’t improve as dictionary sizes grow without bound. As the dictionary grows, code sizes get larger and larger, and so they take up more space in the compressed stream, which can reduce compression efficiency.
To resolve this problem, I just add an additional argument to the encoder and decoder that sets the maximum code value that will be added to the dictionary. The function signatures now look like this:
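Keeping to the plain iostream sketches used so far, the signatures become something like this (the parameter order is an assumption):

```cpp
void compress(std::istream &input, std::ostream &output, unsigned int max_code);
void decompress(std::istream &input, std::ostream &output, unsigned int max_code);
```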
Implementing it means one small change in the encoder:
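In the encoder sketch, the line that adds a new entry is simply guarded:

```cpp
if (next_code <= max_code)
    codes[current_string] = next_code++;   // stop growing the dictionary at max_code
```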
And a corresponding change in the decoder:
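```cpp
// The decoder stops growing its dictionary at exactly the same point.
if (previous_string.size() && next_code <= max_code)
    strings[next_code++] = previous_string + strings[code][0];
```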
Input and Output
Finally, I need to give the algorithm a decent way to perform input and output - and this is where C++ offers a huge amount of help.
When writing generic compression code that you intend to use in multiple contexts, one of the more difficult things to deal with is I/O. People using your code might want to compress data in memory, stored in files, or streaming in from sockets or other sources. Some input data sources might be of unknown length (data coming from a TCP socket, for example), while others will be of a prescribed length. Back in the days of C, it was particularly difficult to make your compression code both generic, so it would work with all types of data streams, and efficient, so that I/O doesn’t take any more time than it has to.
With the advent of C++, we have a new tool that can help in this quest - templates. Templates are designed to solve this problem in an efficient way, and I take advantage of this in my sample code. The code below shows the final version of the compressor and decompressor that are used in all four versions of the implementation. There are two final changes made to the routines shown previously. First, both C++ functions are now function templates, parameterized on the types being used for input and output. Second, the actual input and output is done through four newly introduced template classes:
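The general shape is something like this (the exact constructor arguments passed to the I/O objects are assumptions; the loop bodies are the ones developed above):

```cpp
template<class INPUT, class OUTPUT>
void compress(INPUT &input, OUTPUT &output, const unsigned int max_code)
{
    input_symbol_stream<INPUT> in(input);
    output_code_stream<OUTPUT> out(output, max_code);
    // ... the encoding loop shown earlier, reading symbols with in >> c
    // and writing codes with out << code ...
}

template<class INPUT, class OUTPUT>
void decompress(INPUT &input, OUTPUT &output, const unsigned int max_code)
{
    input_code_stream<INPUT> in(input, max_code);
    output_symbol_stream<OUTPUT> out(output);
    // ... the decoding loop shown earlier, reading codes with in >> code
    // and writing strings with out << s ...
}
```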
What exactly is the effect of implementing this algorithm using a pair of function templates, parameterized on the types of the input and output objects? It means that you can call these compression routines with any type of I/O object you can throw at them. They can work with C++ iostreams, C FILE * objects, raw blocks of memory, whatever you want.
But there’s a catch to that flexibility - you have to implement some basic I/O routines for whatever type you are using. Fortunately, this is not too hard.
The actual I/O that is done in the compression routines is defined by four template classes I created. These classes are defined in lzw_streambase.h. These classes don’t have implementations, but they do define the methods you need to implement to work with the compressor and decompressor. The four classes are:
input_symbol_stream<T>
output_symbol_stream<T>
input_code_stream<T>
output_code_stream<T>
The first two classes are the symbol input and output classes. These are normally going to be very simple implementations, as they just have to read and write single characters, while checking for errors or ends of streams. I use the same versions of these classes in all four implementations, so the code in lzw-a.h is unchanged in the other three header files.
The input_symbol_stream<T> class has one member function: the extraction operator, which reads a character from the stream and returns a boolean true or false. You’ll see later in this section that the implementation of this for types such as std::istream is trivial.
The output_symbol_stream<T> class uses the insertion operator to write strings instead of individual characters - because that is what is stored in the dictionary. The C++ std::string class makes a perfectly good container for any variety of symbols, including binary data, and unlike the alternative vector<char>, it comes with hash functions and iostream operators.
The input_code_stream<T> class reads codes, normally unsigned integers, from some type of stream. In my implementations, this class also returns false if it encounters the EOF_CODE in the stream of incoming codes. Removing the responsibility for EOF detection from the decompressor makes the code a bit simpler and more versatile.
The formatting of the integer is entirely up to the implementor, but the most common approach will probably be variable length codes ranging from 9 to 16 or so bits.
The output_code_stream<T> class writes codes, usually unsigned integers, to some type of stream. Whatever class you implement for this function must agree with the implementation for input_code_stream<T>.
You can see that at the top of the compressor and decompressor, I instantiate objects of these types, then use the standard insertion and extraction operators to read and write from these objects.
LZW-A
In my sample Windows program, I include lzw_streambase.h and lzw.h, which accounts for all of the code you have seen so far. I have the following lines that perform compression and decompression:
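A sketch of what those calls might look like; the stream variables and max_code are placeholders, and the casts are explained near the end of the article:

```cpp
lzw::compress(static_cast<std::istream &>(source_file),
              static_cast<std::ostream &>(compressed_file),
              max_code);

lzw::decompress(static_cast<std::istream &>(compressed_file_in),
                static_cast<std::ostream &>(expanded_file),
                max_code);
```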
If I try to build this project as-is, I get a nasty list of eight linker errors:
Visual Studio 10 Error Messages
If you have the fortitude to crawl through those link errors, you will see that what is missing are the implementations of the four classes parameterized on std::ostream and std::istream. Each of the four classes needs the implementation of a constructor and either an insertion or extraction operator. And with no class definitions at all, that adds up to eight missing functions. To get us started on performing actual LZW compression, I’ve created the first implementation of these four classes in lzw-a.h. Let’s take a look at each of these in turn.
It’s tempting to try to read characters using the ifstream extraction operator, as in m_impl >> c, but that operator skips over whitespace, so we don’t get an exact copy of the input stream. Using get() works around this problem. Below is the complete definition of input_symbol_stream<std::istream> used in all four LZW implementations in this article:
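A minimal version of that class looks something like this (the member name m_impl follows the naming used elsewhere in the article; the exact listing may differ in small details):

```cpp
template<>
class input_symbol_stream<std::istream>
{
public:
    input_symbol_stream(std::istream &stream) : m_impl(stream) {}
    bool operator>>(char &c)
    {
        // get() does not skip whitespace, so every byte of the input is seen.
        return static_cast<bool>(m_impl.get(c));
    }
private:
    std::istream &m_impl;
};
```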
Using the insertion operator to output strings seems to work properly, even when the strings contain binary data, so the implementation of the class used to output symbols is as simple as we could hope for. Again, this exact code is used in all four implementations in this article:
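A corresponding sketch of the symbol output class:

```cpp
template<>
class output_symbol_stream<std::ostream>
{
public:
    output_symbol_stream(std::ostream &stream) : m_impl(stream) {}
    output_symbol_stream &operator<<(const std::string &s)
    {
        m_impl << s;       // works even when the string contains binary data
        return *this;
    }
private:
    std::ostream &m_impl;
};
```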
LZW-A prints the text values of integers to the output stream, and reads them back in that format. This is not efficient at all, but it is a great aid in debugging. If you are having a problem with the algorithm, this provides a nice way to examine your stream. The implementation of this is very simple - just use the std::ostream insertion operator, and follow each code by a newline so it can be properly parsed on input, as well as be easily loaded into a text editor.
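A sketch of the text-mode code output class (the constructor signature is an assumption):

```cpp
template<>
class output_code_stream<std::ostream>
{
public:
    output_code_stream(std::ostream &stream, unsigned int /*max_code*/)
        : m_impl(stream) {}
    ~output_code_stream()
    {
        *this << EOF_CODE;      // every code stream ends with the EOF marker
    }
    output_code_stream &operator<<(unsigned int code)
    {
        m_impl << code << '\n'; // one code per line, easy to read in an editor
        return *this;
    }
private:
    std::ostream &m_impl;
};
```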
One important thing to notice in this class: the presence of a destructor that prints the EOF_CODE. Since this object goes out of scope as the compressor exits, this ensures that every code stream will end with this special code. Putting the onus on the I/O routines to deal with EOF issues simplifies the algorithm itself. (It also means that you can implement versions of LZW that don’t use an EOF in the code stream.)
The corresponding version of the input class just reads in the whitespace-separated codes. If there is an error, or an EOF_CODE is encountered in the stream, the extraction operator returns false, which allows the decompressor to know when it is time to stop processing.
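A matching sketch of the text-mode code input class:

```cpp
template<>
class input_code_stream<std::istream>
{
public:
    input_code_stream(std::istream &stream, unsigned int /*max_code*/)
        : m_impl(stream) {}
    bool operator>>(unsigned int &code)
    {
        if (!(m_impl >> code))     // read error or physical end of stream
            return false;
        return code != EOF_CODE;   // stop cleanly when the EOF marker arrives
    }
private:
    std::istream &m_impl;
};
```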
By including lzw-a.h along with the other two header files, I can now create a program that compiles, links, and is able to test the algorithm. Using my UNIX test program, I compress the demo string from earlier in this article, and I see the output as it is sent directly to stdout:
Compressing ABBABBBABBA
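Based on the walk-through earlier in the article, the text-mode code stream for this string should look like this, one code per line, ending with the EOF code 256:

```
65
66
66
257
258
260
65
256
```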
Fortunately, the output is identical to what was shown earlier, with the addition of the final EOF_CODE used to delimit the end of the code stream.
LZW-B
The header file lzw-b.h implements specialized classes that replace the text-mode output of the codes in lzw-a.h with binary codes stored in a short integer - two bytes.
The classes that read and write symbols are unchanged, but reading and writing codes has to change in order to do this new binary output.
Writing the codes to std::ostream as binary values requires breaking the integer code into two bytes and writing the bytes one at a time. There are more efficient ways to write the complete short integer in one function call, but they raise code portability problems, as we don’t always know what order bytes will be written in.
Like the code stream output object in lzw-a.h, this version of the code output class has a destructor that outputs an EOF_CODE value:
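A sketch of the 16-bit code output class; writing the low byte first is one possible convention, and the reader below simply has to agree with it:

```cpp
template<>
class output_code_stream<std::ostream>
{
public:
    output_code_stream(std::ostream &stream, unsigned int /*max_code*/)
        : m_impl(stream) {}
    ~output_code_stream()
    {
        *this << EOF_CODE;     // terminate the code stream
    }
    output_code_stream &operator<<(unsigned int code)
    {
        // Write the two bytes explicitly so the format does not depend
        // on the machine's native byte order.
        m_impl.put(static_cast<char>(code & 0xff));
        m_impl.put(static_cast<char>((code >> 8) & 0xff));
        return *this;
    }
private:
    std::ostream &m_impl;
};
```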
Reading the codes requires reading the two bytes that make up the short integer, then combining them. While reading, if the routine detects an EOF_CODE, it returns false, which tells the decompressor to stop processing. It also returns false if there is an error on the input code stream.
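The matching sketch of the 16-bit code input class:

```cpp
template<>
class input_code_stream<std::istream>
{
public:
    input_code_stream(std::istream &stream, unsigned int /*max_code*/)
        : m_impl(stream) {}
    bool operator>>(unsigned int &code)
    {
        char low;
        char high;
        if (!m_impl.get(low) || !m_impl.get(high))
            return false;                          // error or truncated stream
        code = static_cast<unsigned char>(low) |
               (static_cast<unsigned int>(static_cast<unsigned char>(high)) << 8);
        return code != EOF_CODE;                   // stop at the EOF marker
    }
private:
    std::istream &m_impl;
};
```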
The most exciting thing about lzw-b.h is that you can now see data compression taking place. The figure below shows a sample run of this implementation against the Canterbury Corpus, a standard set of files used to test compression. A run with my Windows test program shows that the files are compressing quite nicely:

Compressing the Canterbury Corpus with lzw-b.h
LZW-C
The third I/O implementation, defined in lzw-c.h, writes binary codes like lzw-b.h, but with one crucial difference. Instead of being hard coded to 16-bit codes, lzw-c.h determines the maximum code size needed based on the maximum code value passed as an argument to compress() and decompress(). It then writes codes based on that width, which will normally be something in the range of 9-18 bits wide.
Since these values are not aligned with byte boundaries, there are some issues writing them to streams that expect to read and write bytes. However, it is definitely worth all the bit shifting, ORing, and ANDing, because when the code size is 12 bits, we are going to save four bits per code when compared to using lzw-b.h. But every read and write potentially starts somewhere in the middle of a byte, so the I/O classes have to do some extra work - mostly involved with shifting bits to the correct position in the output stream.
Note that the code to read and write symbols is unchanged from lzw-a.h and lzw-b.h.
Many of the CS students who read my earlier article on LZW ran into a brick wall when they started trying to understand the code that performs I/O on codes of variable bit lengths. Obviously, writing 11 bit codes when your file system is oriented around eight-bit bytes involves a lot of bit twiddling, and I’m afraid that many novices are woefully deficient in this department. Not just in understanding the bitwise operators in C, such as shifting, masking, etc., but in understanding binary arithmetic in general.
That’s why I’ve structured the code and this article a bit differently this time around.
If the I/O operations in lzw-c.h and lzw-d.h are bewildering, well, no worries. They have absolutely nothing to do with the LZW algorithm itself. You can investigate and explore the algorithm completely using lzw-a.h and lzw-b.h, and just forget about the last two I/O implementations. They provide additional efficiency, but as I have said, have nothing to do with the algorithm itself.
Further, once you use lzw-a.h to debug and understand the algorithm, you can certainly plug in lzw-c.h and lzw-d.h and take advantage of their improved compression, even if you don’t follow all the code.
It might be appropriate to add a sidebar or another section to explain the variable bit length I/O in detail, but this article is quite long already, and there are numerous other resources for the interested reader to explore the details. (But if you find yourself deficient in this area, you owe it to yourself to hit the books and get to the point where these operations make sense. This won’t be the last time you need to understand bitwise operators.)
For those who are ready to tackle this more complicated I/O procedure, we will look first at the output_code_stream<std::ostream> class. Here, the first thing to understand is that the constructor has to initialize the number of bits in the code. This value is calculated from the max_code parameter, and is stored in member m_code_size, where it is used frequently.
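One way the constructor might compute that width (this loop is an illustration, not necessarily the exact code):

```cpp
m_code_size = 1;
while ((1u << m_code_size) <= max_code)
    m_code_size++;       // smallest number of bits that can represent max_code
```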
Next, the insertion operator. Output of codes proceeds as follows. Member m_pending_bits tells us how many bits are pending output while sitting in member m_pending_output. These bits are right justified, and the count will always be less than eight. When the new code is written, it is inserted into m_pending_output after being left shifted so it will be laid down just past the pending bits. After doing that, we presumably have some bytes to output - the exact number depends on various factors. The flush() routine is called, and it flushes all complete bytes out. When it completes, there can be anywhere from zero to seven bits still waiting to be output, and they will be right justified in m_pending_output.
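A rough sketch of how the insertion operator and flush() might fit together, using the member names mentioned above (this illustrates the technique, not the exact listing from lzw-c.h):

```cpp
output_code_stream &operator<<(unsigned int code)
{
    // Lay the new code down just past the bits already waiting for output.
    m_pending_output |= static_cast<unsigned long>(code) << m_pending_bits;
    m_pending_bits += m_code_size;
    flush(8);            // write out every complete byte
    return *this;
}

void flush(int threshold)
{
    // flush(8) writes only complete bytes; flush(1), used in the destructor,
    // also writes a final partial byte padded with zero bits.
    while (m_pending_bits >= threshold) {
        m_impl.put(static_cast<char>(m_pending_output & 0xff));
        m_pending_output >>= 8;
        m_pending_bits = m_pending_bits > 8 ? m_pending_bits - 8 : 0;
    }
}
```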
In the destructor, we output an EOF_CODE, and then do a flush as well. But in this case, we flush all possible bits, not just the complete bytes. There are two good reasons for this. First, we don’t care if the last bits that are flushed out are only part of a code - the code will be EOF_CODE, and that is the last one. And second, if we don’t flush those final bits out in the destructor, they will never be sent to the output stream. This means the decoder will not see those bits, and we will most likely break the decompress process.
Like the output code class, the input code class has to calculate the code size for this decompression based on the max_code value passed in the function call.

When an attempt is made to read a code, there must be a minimum of m_code_size bits in member m_pending_input. If there aren’t, new bytes are read in one at a time, and inserted into m_pending_input after having been shifted left the appropriate amount. Once m_pending_input contains at least m_code_size bits, the code is extracted from m_pending_input using the appropriate mask, the count of pending bits is reduced, and m_pending_input is shifted right by m_code_size bits.
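A sketch of an extraction operator built along those lines; the bit-count member is called m_available_bits here, which is an assumed name:

```cpp
bool operator>>(unsigned int &code)
{
    // Accumulate bytes until at least one full code's worth of bits is buffered.
    while (m_available_bits < m_code_size) {
        char c;
        if (!m_impl.get(c))
            return false;                 // error or physical end of stream
        m_pending_input |=
            static_cast<unsigned long>(static_cast<unsigned char>(c)) << m_available_bits;
        m_available_bits += 8;
    }
    code = m_pending_input & ~(~0UL << m_code_size);   // mask off one code
    m_pending_input >>= m_code_size;
    m_available_bits -= m_code_size;
    return code != EOF_CODE;
}
```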
The table below shows the results of a test run comparing LZW-B and LZW-C run with a maximum code of 4095. With this maximum value, all codes fit in a 12-bit integer. Since LZW-B will use a 16-bit integer to store the code values, and LZW-C will use 12 bits, there should be a 4:3 ratio between the file sizes produced by the two implementations, and this looks to be the case:
File Name | Original Size | Compressed LZW-B | Compressed LZW-C | Ratio |
---|---|---|---|---|
alice29.txt | 152089 | 96428 | 72322 | 0.750 |
alphabet.txt | 100000 | 4538 | 3404 | 0.750 |
asyoulik.txt | 125179 | 83966 | 62975 | 0.750 |
bib | 111261 | 71792 | 53845 | 0.750 |
bible.txt | 4047392 | 2468326 | 1851245 | 0.750 |
It looks like things are working as expected.
LZW-D
The code in lzw-d.h represents the final and most efficient version of I/O for the LZW code streams. It builds on the code in lzw-c.h - at its core it is a variable bit-length I/O stream. However, there is one crucial difference from lzw-c.h: the code I/O in lzw-d.h starts at the smallest possible code size, nine bits, and increases the code size as needed, until it reaches the maximum value for this compression session. The maximum value is the parameter passed in to the invocation of compress() or decompress().
The logic behind this is pretty simple. Even if we are going to use 16-bit codes in an LZW program, when the program first starts, the maximum possible code the program can emit is 256, which only needs nine bits to encode. And each time we output a new symbol, that maximum possible code value only increases by one, which means that the first 256 codes output by the encoder can all fit in nine bits.
So the LZW-D encoder starts encoding using nine-bit code widths, and then bumps the value to ten as soon as the highest possible output code reaches 512. This process continues, incrementing the code size until the maximum code size is reached. At that point the code size stays fixed, as no new codes are being added to the dictionary.
The decoder follows exactly the same process - reading in the first code with a width of nine bits, then bumping to ten when the maximum possible input code reaches 512.
The code for this class is built on that from lzw-c.h, with some added complexity. Due to its increasing length, and the fact that it doesn’t add too much to the discussion of LZW, I’ve omitted the listing, and instead refer you to the download available at the end of the article.
The Windows Test Program
When you develop compression code, there are a few different common tasks you are likely to want to perform:
- Check your code for correctness, often through bulk testing.
- Check your compression ratios against standard benchmarks.
- Analyze your program's performance so as to make it more efficient and locate bottlenecks.
My Windows app is designed to help with all of these tasks. It basically allows you to select a single directory, set a maximum code size, then perform compression and decompression of all the files in the directory. An optional checkbox lets you include files in all directories under the test directory as well.
The application was built using Visual Studio 10, and it is a simple MFC Dialog-based application. It allows you to select a base directory, a maximum code size, and then compress all the files in that directory. If you select the recursion check box, you will also compress all the files in the entire tree of subdirectories below it.
Each file is compressed to a temporary location, then decompressed in a temporary location. The size of the compressed file is saved, and then a comparison is done to ensure that the original and expanded files are identical.
To help with data collection, after running a test, you can press the copy button and get the results of the test stuffed into your clipboard. Although it isn’t visible in the display, the data stored in your clipboard includes the full path name of the original file, not just the basename.
This Visual Studio project takes advantage of a number of C++11 features, and as a result it will need some modification to work with earlier versions. Any version that supports unordered_map can be made to build without too many changes. And if you are going way back in time, you could replace unordered_map with map.
As shipped, the test program uses lzw-d.h. To use any of the other three versions of I/O discussed in this article, just modify the include file selected at the top of LzwTestDlg.cpp. The figure below shows what the app looks like after running through some data:
The Windows test app after a test run
After pressing the copy button at the bottom of the dialog, you can paste the data into a spreadsheet and then crunch it to your heart’s content:
Copying the data into a spreadsheet
The Linux Test Program
The LZW code is platform independent, and will build and run just fine on UNIX or Linux systems.
The Linux test program, lzw.cpp, allows you to compress or decompress files from the command line. It builds just fine with g++ 4.5, as long as you use the -std=c++0x switch to turn on the latest language features. Compiling with earlier versions will require a few minor modifications.
The command line interface to the test program is not too complicated, and is probably best documented by looking at the usage output:
mrn@ubuntu:~/LzwTest$ g++ -std=c++0x lzw.cpp -o lzw
mrn@ubuntu:~/LzwTest$ ./lzw
Usage:
lzw [-max max_code] -c input output  #compress file input to file output
lzw [-max max_code] -c - output      #compress stdin to file otuput
lzw [-max max_code] -c input         #compress file input to stdout
lzw [-max max_code] -c               #compress stdin to stdout
lzw [-max max_code] -d input output  #decompress file input to file output
lzw [-max max_code] -d - output      #decompress stdin to file otuput
lzw [-max max_code] -d input         #decompress file input to stdout
lzw [-max max_code] -d               #decompress stdin to stdout
mrn@ubuntu:~/LzwTest$
Like the Windows test program, the command line program is built by default with lzw-d.h. Replacing this algorithm with any of the three others requires a minor change to the source code.
With the default build, the program produces output nearly identical to UNIX compress. The one difference is that UNIX compress monitors the compression ratio after the dictionary is full, and clears the dictionary if the ratio starts to deteriorate (which it almost always does.) I include a benchmark program that tests UNIX compress against the command line test program, and the results show that for small files, the file size is almost identical:
mrn@ubuntu:~/LzwTest$ ./benchmark.sh 65535 16 canterbury | head -n 15 | column -t
Filename                 Original-size  LZW-size  Compress-size
--------                 -------------  --------  -------------
canterbury/aaa.txt       33406          320       321
canterbury/alice29.txt   152089         62247     62247
canterbury/alphabet.txt  100000         3052      3053
canterbury/asyoulik.txt  125179         54989     54990
canterbury/a.txt         1              3         5
canterbury/bib           111261         46527     46528
canterbury/bible.txt     4047392        1417735   1377093
canterbury/book1         768771         317133    317133
canterbury/book2         610856         247593    251289
canterbury/cp.html       24603          11315     11317
canterbury/E.coli        4638690        1213579   1218349
canterbury/fields.c      11150          4963      4964
canterbury/geo           102400         77777     77777
You can see in this test that LZW-D and UNIX compress perform nearly identically for all but the largest files in the test sample. If I modify UNIX compress to not monitor compression ratios, the difference seen with larger files goes away:
mrn@ubuntu:~/LzwTest$ ./benchmark.sh 65535 16 canterbury | head -n 15 | column -t
Filename                 Original-size  LZW-size  Compress-size
--------                 -------------  --------  -------------
canterbury/aaa.txt       33406          320       321
canterbury/alice29.txt   152089         62247     62247
canterbury/alphabet.txt  100000         3052      3053
canterbury/asyoulik.txt  125179         54989     54990
canterbury/a.txt         1              3         5
canterbury/bib           111261         46527     46528
canterbury/bible.txt     4047392        1417735   1417735
canterbury/book1         768771         317133    317133
canterbury/book2         610856         247593    247593
canterbury/cp.html       24603          11315     11317
canterbury/E.coli        4638690        1213579   1213579
canterbury/fields.c      11150          4963      4964
canterbury/geo           102400         77777     77777
That provides some support for the notion that the algorithm shown here behaves properly.
Your Program
If you want to build your own program and use these classes, all you need is a C++11 compiler, or an earlier version and a willingness to make a few changes.
To use the classes, include in order lzw_streambase.h, one of the four implementation files for iostreams (preferably lzw-d.h), and finally lzw.h. Because the significant code in these files is all implemented as template functions or classes, there is no library to include in your project, and no C++ source you have to compile separately.
All of the code in these header files has been hoisted into the lzw namespace, so you will either have to explicitly use the namespace when you invoke compress() and decompress(), or insert this line into your program:
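```cpp
using namespace lzw;
```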
One thing to note about the I/O routines I have defined: the I/O classes are specialized on std::istream and std::ostream. If you innocently pass in an object such as an std::ifstream, you will get compile time errors. This is because C++ template matching is done on a very strict basis - the compiler won’t generally try to figure out that std::ifstream is derived from std::istream, and use the existing class. So instead, you will need to cast your arguments to the types defined in the header files. (Or write your own implementations.)
Your rights to use this code are covered by my liberal code use policy. As I have mentioned before, this is teaching code; if you decide to use it in a production system, there are many optimizations you might want to perform.
Benchmarks
So how does LZW do when it comes to compression? LZW’s original strength was its combination of good compression ratios with high speed compression. The UNIX compress program is still nice and fast, and Terry Welch’s original application for LZW was in disk controllers. Because my program is a teaching program, it won’t be nearly as fast as compress, but it’s still useful to compare it to the de facto standard for lossless compression: the deflate algorithm.
We can compare LZW against deflate by a small modification of my benchmark script that uses gzip instead of compress. The table below shows the average compression ratios for the files in the Canterbury Corpus when compressed using maximum code widths of 15-18 bits. (The ratio is defined as 100*compressed_size/uncompressed_size, so 0% is perfect compression and 100% is no compression.)
gzip | LZW 15 bits | LZW 16 bits | LZW 17 bits | LZW 18 bits |
---|---|---|---|---|
32.7% | 43.2% | 42.6% | 42.5% | 42.3% |
You can see that LZW does do a good job of compressing data, but the deflate algorithm used by gzip manages to squeeze an additional 10%, more or less, out of the files it compresses. The gap between LZW and deflate is larger on some types of files, and smaller on others, but deflate will almost always show a noticeable difference in compression ratios.
Variations
There are many variations on the code I’ve presented here that make sense.
One obvious change is to eliminate the special EOF_CODE used to delimit the end of the code stream. If the code stream is a file or other stream with an inherent EOF condition, there is no need for an EOF_CODE - simply reaching the end of the input stream will properly signal the end of the decoded material. Freeing up this one code will make a microscopically small improvement in the compression ratios of the product.
If you want to mimic the output of the compress program, you need to remove the EOF_CODE, and replace it with a CLEAR_CODE that has a value of 256. The compress program monitors the compression ratios it achieves after its dictionary is full, and when the ratio starts to decay, it issues the CLEAR_CODE. That code tells the decoder to clear its dictionary and make a fresh start with new nine-bit codes.
Once you get the hang of LZW, a good exercise to make sure you have it working properly is to create a GIF encoder and decoder. GIF uses LZW to losslessly compress images with a constrained palette, and after all these years is still somewhat of a standard on the web.
History
Usually the history lesson on an algorithm is at the start of the article, but this is a how-to piece, and I feel like the trip down memory lane is not as important as understanding how the algorithm works.
The roots of LZW were set down in 1978 when Jacob Ziv and Abraham Lempel published the second of their two seminal works on data compression, “Compression of Individual Sequences via Variable-Rate Coding”. This paper described a general approach to data compression that involved building dictionaries of previously seen strings.
Ziv and Lempel’s work was targeted at an academic audience, and it wasn’t truly popularized until 1984 when Terry Welch published A Technique for High-Performance Data Compression. Welch’s paper took the somewhat abstract Information Theory work of Ziv and Lempel and reduced it to practice in such a way that others could easily implement it.
UNIX compress was probably the first popular program that used LZW compression, and it very quickly became a standard utility on UNIX systems. The freely available code for compress was incorporated into ARC, one of the first archiving programs for PCs. In addition, the algorithm was used in the GIF file format, originally created by Compuserve in 1987.
LZW’s popularity waned in the 1990s for two important reasons. First, Unisys began enforcing their patents that covered LZW compression, demanding and receiving royalties from various software companies. Not only did this make developers think twice about the liability they could incur while using LZW, it resulted in a general public relations backlash against using patented technology.
Secondly, the LZW algorithm was eclipsed on the desktop by deflate, as popularized by PKZIP. Not only did deflate outperform LZW, it was unencumbered by patents, and eventually had a very reliable and free open source implementation in zlib, a library written by a team led by Mark Adler and Jean-loup Gailly. I don’t know if there is any way to actually quantify this, but I think one could speculate that zlib is currently installed on more computer systems than any other software package in existence.
So LZW has settled down to an existence out of the limelight. It is still an important algorithm, used in quite a few file formats, and as this article shows, its simplicity makes it an excellent learning tool.
Downloads
- LzwTest.zip - source for the Windows test app.
- LzwExe.zip - the Windows test app executable.
- lzw.tgz - source for the UNIX test app.