One of the first articles I wrote for Dr. Dobb’s Journal, LZW Data Compression, turned out to be very popular, and still generates a fair amount of traffic and email over twenty years later.

One of the reasons for its popularity seems to be that LZW compression is a popular homework assignment for CS students around the world. And that audience sometimes found the article to be a bit of a struggle. My code was modeled on the UNIX compress program, which was written in terse C for maximum efficiency. And sometimes optimization comes at the expense of comprehension.

By using C++ data structures I can model the algorithm in a much more straightforward way – the language doesn’t get in the way of a clear implementation. And after 20 years of answering puzzled queries I think I can improve on the overall explanation of just how LZW works.

In this updated look at LZW, I will first give a description of how LZW works, then describe the core C++ code that I use to implement the algorithm. I’ll then walk you through the use of the algorithm with a few varieties of I/O. Finally, I’ll show you some benchmarks.

I’m hoping that this version of the article will be good enough to last for another 20 years.

LZW Basics

LZW compression works by reading a sequence of symbols, grouping the symbols into strings, and converting the strings into codes. Because the codes take up less space than the strings they replace, we get compression.

My implementation of LZW uses the C++ char as its symbol type, the C++ std::string as its string type, and unsigned int as its code type. The tables of codes and strings are implemented using unordered_map, the C++ library’s hash table data structure. By using the native types and standard library data structures the representation in the program is straightforward and easy to follow.

Encoding/Decoding

Rather than jumping directly into a full implementation, I’m going to work my way up to LZW one step at a time.

The first step is getting a clear understanding of how the encoding and decoding process works. As I said earlier, LZW compression converts strings of symbols into integer codes. Decompression converts codes back into strings, returning the same text that we started with.

LZW is a greedy algorithm – it tries to find the longest possible string that it has a code for, then outputs that string. The code below is not quite LZW, but it shows you the basic idea of how a greedy encoder can work:

void encode( input_stream in, output_stream out )
{
  //
  // This hash table contains a list of codes, indexed
  // by the string that corresponds to the code.
  //
  std::unordered_map<std::string,unsigned int> codes;
  //
  // There is presumably some code here that initializes
  // the dictionary with a set of codes based on whatever
  // algorithm we are implementing.
  //
  ...initialize the dictionary
  //
  // With codes in the dictionary, encoding is 
  // now ready to begin.
  //
  std::string current_string;
  char c;
  while ( in >> c ) {
    current_string = current_string + c;
    if ( codes.find(current_string) == codes.end() ) {
      current_string.erase(current_string.size()-1);
      out << codes[current_string];
      current_string = c;
    }
  }
  out << codes[current_string];
}

The greedy encoder reads characters in from the uncompressed stream, and appends them one by one to the variable current_string. Each time it lengthens the string by one character, it checks to see if it still has a valid code for that string in the dictionary.

This continues until we eventually add a character that forms a string that isn’t in the dictionary. So we then erase the last character from that string, and issue the code for the resulting string – the string from the previous pass through the loop.

The value of current_string is then initialized with the character that broke the camel’s back, and the algorithm continues in the loop, building new strings until it runs out of input characters. At that point it outputs the last remaining code and exits.

As an example of how this would work, imagine I have the input stream ACABCA, and my code dictionary looks like this:

String   Code
A        1
B        2
C        3
AB       4
ABC      5

A sample dictionary


If you follow the algorithm above, you’ll see that the code output has to be 1 3 5 1. If this wasn’t a greedy algorithm, 1 3 4 3 1 would have been another valid output.

Decoding the stream in a system like this is very straightforward:

void decode( input_stream in, output_stream out )
{
  std::unordered_map<unsigned int,std::string> strings;
  //
  // Initialize the code table with the same set of codes and strings 
  // that the encoder used for your algorithm.
  //
  ...initialize the dictionary
  //
  // With codes in the dictionary, decoding is now 
  // ready to begin.
  //
  unsigned int code;
  while ( in >> code ) 
    out << strings[code];
}

Remember, the decoder shown above is just a hypothetical sample - we're still working our way up to the full LZW decoder.
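These two sketches can be exercised as a pair. Here is a self-contained version of the greedy scheme, run against a fixed dictionary like the sample one above. The names toy_encode and toy_decode are my own choosing for this sketch, and a std::vector of codes stands in for the streams:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Greedy encoder working against a fixed, hand-built dictionary - not
// yet LZW, just the greedy matching idea from the pseudocode above.
std::vector<unsigned int> toy_encode(
    const std::string &in,
    const std::unordered_map<std::string,unsigned int> &codes )
{
    std::vector<unsigned int> out;
    std::string current_string;
    for ( char c : in ) {
        current_string += c;
        if ( codes.find(current_string) == codes.end() ) {
            // Back up to the longest string we do have a code for.
            current_string.erase(current_string.size()-1);
            out.push_back( codes.at(current_string) );
            current_string = c;
        }
    }
    if ( current_string.size() )
        out.push_back( codes.at(current_string) );
    return out;
}

// The matching decoder is nothing more than a table lookup per code.
std::string toy_decode(
    const std::vector<unsigned int> &in,
    const std::unordered_map<unsigned int,std::string> &strings )
{
    std::string out;
    for ( unsigned int code : in )
        out += strings.at(code);
    return out;
}
```

With the sample dictionary loaded into the two maps, encoding ACABCA produces the greedy code sequence 1 3 5 1 from the example above, and decoding that sequence returns ACABCA.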

The LZW Encoder

The encoder shown above works okay, but there is one missing ingredient: management of the code dictionary. If you think about it, you'll see that we only achieve reasonable compression when we are able to build up longer strings and find them in the dictionary. Building a useful dictionary is referred to in the data compression world as modeling.

But our management of the dictionary is constrained by an important requirement: the encoder and decoder both have to be working with the same copy of the dictionary. If they have different dictionaries, the encoder might send a string that the decoder can't resolve.

Some data compression algorithms solve this problem by using a predefined dictionary that both the encoder and the decoder know in advance. But LZW builds a dictionary on the fly, using an adaptive method that ensures both the encoder and decoder are in sync.

LZW manages this in an effective and provably correct fashion. First, both the encoder and decoder initialize the dictionary with all possible single-character strings. For the compressor, that looks like this:

for ( unsigned int i = 0 ; i < 256 ; i++ )
    codes[std::string(1,(char)i)] = i;

This ensures that we can encode all possible streams. No matter what, we can always break a stream down into single characters and encode those, knowing that the decoder has the same strings in its dictionary with values 0-255.

Then comes the key component of the LZW algorithm. If you go back to the greedy encoding loop above, you'll see that I keep adding input symbols to a string until I find a string that isn't in the dictionary. This string has the characteristic of being composed of a string that currently exists in the dictionary, with one additional character.

LZW then takes that new string and adds it to the dictionary, creating a new code. The strings are added to the table with code values that increment by one with each new entry.

The resulting code is just a slightly modified version of the encoder that I listed above. It still only outputs codes for values that are in the dictionary, but now the dictionary is being updated with a new string every time an existing code is sent:

void compress( input_stream in, output_stream out )
{
  std::unordered_map<std::string,unsigned int> codes;
  for ( unsigned int i = 0 ; i < 256 ; i++ )
    codes[std::string(1,(char)i)] = i;
  unsigned int next_code = 257;
  std::string current_string;
  char c;
  while ( in >> c ) {
    current_string = current_string + c;
    if ( codes.find(current_string) == codes.end() ) {
      codes[ current_string ] = next_code++;
      current_string.erase(current_string.size()-1);
      out << codes[current_string];
      current_string = c;
    }
  }
  out << codes[current_string];
}

The code above constitutes a more or less complete LZW encoder. I've only made a couple of additions to the previous encoder:

  • The initialization of codes 0-255 with all possible single character strings.
  • The insertion of the newly discovered string into the string table, generating a new code.

(One item of note in this code: you might wonder why next_code is initialized to 257, when 256 is the first free code. This is because I reserve code 256 for an EOF marker. More on this in a later section.)

Just to make sure this all adds up, I'll walk through the steps the encoder takes as it processes a string from a simple two letter alphabet: ABBABBBABBA. There are a lot of steps shown below, but working through the process in detail is a great way to be sure you understand it:


Input symbol: A
  read 'A' - set current_string to 'A'
  'A' is in the dictionary, so continue

Input symbol: B
  read 'B' - set current_string to 'AB'
  'AB' is not in the dictionary - add it with code 257
  output the code for 'A' - 65
  set current_string to 'B'
  New code: 257 (AB)    Output code: 65 (A)

Input symbol: B
  read 'B' - set current_string to 'BB'
  'BB' is not in the dictionary - add it with code 258
  output the code for 'B' - 66
  set current_string to 'B'
  New code: 258 (BB)    Output code: 66 (B)

Input symbol: A
  read 'A' - set current_string to 'BA'
  'BA' is not in the dictionary - add it with code 259
  output the code for 'B' - 66
  set current_string to 'A'
  New code: 259 (BA)    Output code: 66 (B)

Input symbol: B
  read 'B' - set current_string to 'AB'
  'AB' is in the dictionary, so continue

Input symbol: B
  read 'B' - set current_string to 'ABB'
  'ABB' is not in the dictionary - add it with code 260
  output the code for 'AB' - 257
  set current_string to 'B'
  New code: 260 (ABB)   Output code: 257 (AB)

Input symbol: B
  read 'B' - set current_string to 'BB'
  'BB' is in the dictionary, so continue

Input symbol: A
  read 'A' - set current_string to 'BBA'
  'BBA' is not in the dictionary - add it with code 261
  output the code for 'BB' - 258
  set current_string to 'A'
  New code: 261 (BBA)   Output code: 258 (BB)

Input symbol: B
  read 'B' - set current_string to 'AB'
  'AB' is in the dictionary, so continue

Input symbol: B
  read 'B' - set current_string to 'ABB'
  'ABB' is in the dictionary, so continue

Input symbol: A
  read 'A' - set current_string to 'ABBA'
  'ABBA' is not in the dictionary - add it with code 262
  output the code for 'ABB' - 260
  set current_string to 'A'
  New code: 262 (ABBA)  Output code: 260 (ABB)

Input symbol: EOF
  end of the input stream - exit loop
  current string is 'A'
  output the code for 'A' - 65
  Output code: 65 (A)


After processing string ABBABBBABBA, the output codes are 65,66,66,257,258,260,65. The dictionary at this point is:

String   Code
AB       257
BB       258
BA       259
ABB      260
BBA      261
ABBA     262

The dictionary generated for ABBABBBABBA
(Entries 0-255 not shown for brevity)


Looking at the above table, you can see a few interesting things happening. First, every time the algorithm outputs a code, it also adds a new code to the dictionary.

More importantly, as the dictionary grows, it starts to hold longer and longer strings. And the longer the string, the more compression we can get. If the algorithm starts emitting integer codes for strings of length 10 or more, there is no doubt that we are going to get good compression.
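If you want to double-check the walk-through mechanically, the encoder can be exercised in isolation. Here it is as a self-contained sketch that collects its output codes in a std::vector rather than writing them to a stream; that container choice is mine, for illustration only:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// The LZW encoder from the article, with the code stream captured
// in a vector so the trace above can be verified directly.
std::vector<unsigned int> lzw_compress( const std::string &in )
{
    std::unordered_map<std::string,unsigned int> codes;
    for ( unsigned int i = 0 ; i < 256 ; i++ )
        codes[std::string(1,(char)i)] = i;
    unsigned int next_code = 257;  // 256 is reserved for the EOF marker
    std::vector<unsigned int> out;
    std::string current_string;
    for ( char c : in ) {
        current_string += c;
        if ( codes.find(current_string) == codes.end() ) {
            codes[current_string] = next_code++;
            current_string.erase(current_string.size()-1);
            out.push_back( codes[current_string] );
            current_string = c;
        }
    }
    if ( current_string.size() )
        out.push_back( codes[current_string] );
    return out;
}
```

Running lzw_compress("ABBABBBABBA") yields 65, 66, 66, 257, 258, 260, 65, matching the trace and dictionary shown above.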

As an example of how this works on real data, here are some entries from the dictionary created when compressing Alice's Adventures in Wonderland:

34830 : 'even\n'
34831 : '\nwith t'
34832 : 'the dr'
34833 : 'ream '
34834 : ' of Wo'
34835 : 'onderl'
34836 : 'land'
34837 : 'd of l'
34838 : 'long ag'
34839 : 'go:'

These strings have an average length of almost six characters. If we are writing the integer codes to a file using 16 bit binary integers, these entries offer the possibility of 3:1 compression.

The word adaptive is used to describe a compression algorithm that adapts to the type of text it is processing. LZW does an excellent job of this. If a string is seen repeatedly in the text, it will show up in longer and longer entries in the dictionary. If a string is seen rarely, it will not be the foundation for a large batch of longer strings, and thus won't waste space in the dictionary.

The LZW Decoder

The change made to the basic encoder to accommodate the LZW algorithm was really very simple. One small batch of code that initializes the dictionary, and another few lines of code to add every new unseen string to the dictionary.

As you might suspect, the changes to the decoder will be fairly simple as well. The first change is that the dictionary must be initialized with the same 256 single-symbol strings that the encoder uses.

Once the decoder starts running, each time it reads in a code, it must add a new value to the dictionary. And what is that value? The entire content of the previously decoded string, plus the first letter of the currently decoded string. This is exactly what the encoder does to create a new string, and the decoder must follow the same steps:

void decompress( input_stream in, output_stream out )
{
  std::unordered_map<unsigned int,std::string> strings;
  for ( unsigned int i = 0 ; i < 256 ; i++ )
    strings[i] = std::string(1,(char)i);
  std::string previous_string;
  unsigned int code;
  unsigned int next_code = 257;
  while ( in >> code ) {
    out << strings[code];
    if ( previous_string.size() )
      strings[next_code++] = previous_string + strings[code][0];
    previous_string = strings[code];
  }
}

I won't do a walk-through of the decoder - you should be able to take the codes output from the encoder, shown above, and run them through the decoder to see that the output stream is what we expect.
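If you'd rather let the machine do that check, here is the decoder above as a self-contained sketch, reading from a std::vector of codes instead of a stream (again, the vector-based signature is my own, for illustration):

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// The LZW decoder from the article, fed by an in-memory code sequence.
std::string lzw_decompress( const std::vector<unsigned int> &in )
{
    std::unordered_map<unsigned int,std::string> strings;
    for ( unsigned int i = 0 ; i < 256 ; i++ )
        strings[i] = std::string(1,(char)i);
    std::string previous_string;
    unsigned int next_code = 257;
    std::string out;
    for ( unsigned int code : in ) {
        out += strings[code];
        // Stay one step behind the encoder: previous string plus the
        // first character of the current string becomes a new entry.
        if ( previous_string.size() )
            strings[next_code++] = previous_string + strings[code][0];
        previous_string = strings[code];
    }
    return out;
}
```

Feeding it the code sequence 65, 66, 66, 257, 258, 260, 65 produced by the encoder walk-through returns ABBABBBABBA, as expected.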

The important thing is to understand the logic behind the decoder. When the encoder encounters a string that isn't in the dictionary, it breaks it into two pieces: a root string and an appended character. It outputs the code for the root string, and adds the root string + appended character to the dictionary. It then starts building a new string that starts with the appended character.

So every time the decoder uses a code to extract a string from the dictionary, it knows that the first character in that string was the appended character of the string just added to the dictionary by the encoder. And the root of the string added to the dictionary? That was the previously decoded string. This line of code implements that logic:

    strings[next_code++] = previous_string + strings[code][0];

It adds a new string to the dictionary, composed of the previously seen string, and the first character of the current string. Thus, the decoder is adding strings to the dictionary just one step behind the encoder.

You might note one curious point in the decoder. Instead of always adding the string to the dictionary, it is only done conditionally:

if ( previous_string.size() )
  strings[next_code++] = previous_string + strings[code][0];

The only time that previous_string.size() is 0 is on the very first pass through the loop. And on the first pass through the loop, we don't have a previous string yet, so the decoder can't build a new dictionary entry. Again, the decoder is always one step behind the encoder, which is a key point in the next section, which puts the final touches on the algorithm.

The Catch

So far the LZW algorithm we've seen seems very elegant - that's a characteristic we associate with algorithms that can be expressed in just a few lines of code.

Unfortunately, there is one small catch in this perceived elegance - the algorithm as I've shown it to you has a bug.

The bug in the algorithm relates to the fact that the encoder is always one step ahead of the decoder. When the encoder adds a string with code N to the table, it sends enough information to the decoder to allow the decoder to figure out the value of the string denoted by code N-1. The decoder won't know what the value of the string corresponding to code N is until it receives code N+1.

This makes sense if you recall the key line of code from the decoder. It calculates the value of the string encoded by N-1 by looking at the string it received on the previous iteration, plus the first character of the current string. And that current string is the one that was sent after encoding N.

So how can this get us in trouble? The encoder is always one entry ahead of the decoder - it has entry N in its dictionary, and the decoder has entry N-1. So if the encoder ever sends code N, the decoder will look in its table and come up empty-handed, unable to do its job of decoding.

A simple example will show you how this can happen. Let's look at the state of the encoder after it has sent the first five symbols in a stream: ABABA:


Input symbol: A
  read 'A' - set current_string to 'A'
  'A' is in the dictionary, so continue

Input symbol: B
  read 'B' - set current_string to 'AB'
  'AB' is not in the dictionary - add it with code 257
  output the code for 'A' - 65
  set current_string to 'B'
  New code: 257 (AB)    Output code: 65 (A)

Input symbol: A
  read 'A' - set current_string to 'BA'
  'BA' is not in the dictionary - add it with code 258
  output the code for 'B' - 66
  set current_string to 'A'
  New code: 258 (BA)    Output code: 66 (B)

Input symbol: B
  read 'B' - set current_string to 'AB'
  'AB' is in the dictionary, so continue

Input symbol: A
  read 'A' - set current_string to 'ABA'
  'ABA' is not in the dictionary - add it with code 259
  output the code for 'AB' - 257
  set current_string to 'A'
  New code: 259 (ABA)   Output code: 257 (AB)


Now we are set for trouble. The encoder has symbol 259 in its dictionary, while the decoder has only gotten to 258. If the encoder were to send a code of 259 for its next output, the decoder would not be able to find it in its dictionary. Can this happen?

Yes, if the next two characters in the stream are BA, the next code output by the encoder will be 259, and the decoder will be lost.

In general, this can happen when a dictionary entry exists that consists of a string plus a character, and the encoder encounters the sequence string+character+string+character+string. In the example above, the value of string is A, and the value of character is B. After the encoder encounters AB, it has string+character in the dictionary, so if the following sequence is ABABA, we will emit code N. (This can happen at the start of a file, when the string is empty, if you simply have a run of three identical characters.)

Whether this is likely to happen is not too important; what is important is that it most definitely can happen, and the decoder has to be aware of it. And it will happen repeatedly in the pathological case: a stream that consists of a single symbol, repeated without end.

The good news is that the problem is easily solved. When the decoder receives a code and finds that this code is not present in its dictionary, it knows right away that the code must be the one it would add next to its dictionary. And because this only happens when we are encoding the sequence discussed above, the decoder knows that instead of using this value for that code:

    strings[next_code++] = previous_string + strings[code][0];

it can instead use this value:

    strings[ code ] = previous_string + previous_string[0];

The result of this is the insertion of just two lines of code at the start of the decompress loop, giving a loop that now looks like this:


while ( in >> code ) {
  if ( strings.find( code ) == strings.end() ) 
    strings[ code ] = previous_string + previous_string[0];
  out << strings[code];
  if ( previous_string.size() )
    strings[next_code++] = previous_string + strings[code][0];
  previous_string = strings[code];
}

And with that, you have a complete implementation of the LZW encoder and decoder.
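To see the whole thing work end to end, here is a self-contained round-trip sketch: the encoder and the corrected decoder together, with std::vector standing in for the stream types (that substitution is mine, for illustration). It handles the troublesome ABABABA sequence, where the encoder emits code 259 before the decoder has built that entry:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

std::vector<unsigned int> lzw_compress( const std::string &in )
{
    std::unordered_map<std::string,unsigned int> codes;
    for ( unsigned int i = 0 ; i < 256 ; i++ )
        codes[std::string(1,(char)i)] = i;
    unsigned int next_code = 257;
    std::vector<unsigned int> out;
    std::string current_string;
    for ( char c : in ) {
        current_string += c;
        if ( codes.find(current_string) == codes.end() ) {
            codes[current_string] = next_code++;
            current_string.erase(current_string.size()-1);
            out.push_back( codes[current_string] );
            current_string = c;
        }
    }
    if ( current_string.size() )
        out.push_back( codes[current_string] );
    return out;
}

std::string lzw_decompress( const std::vector<unsigned int> &in )
{
    std::unordered_map<unsigned int,std::string> strings;
    for ( unsigned int i = 0 ; i < 256 ; i++ )
        strings[i] = std::string(1,(char)i);
    std::string previous_string;
    unsigned int next_code = 257;
    std::string out;
    for ( unsigned int code : in ) {
        // The catch: a code one step ahead of our dictionary must be
        // the previous string plus its own first character.
        if ( strings.find( code ) == strings.end() )
            strings[ code ] = previous_string + previous_string[0];
        out += strings[code];
        if ( previous_string.size() )
            strings[next_code++] = previous_string + strings[code][0];
        previous_string = strings[code];
    }
    return out;
}
```

Compressing ABABABA produces 65, 66, 257, 259 - the last code being one the decoder hasn't seen yet - and the corrected decoder still recovers ABABABA. The pathological single-symbol stream AAAAAA round-trips correctly as well.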

Implementation

Now that I've shown you the algorithm, the next step is to take that code and turn it into a working program. Without changing the algorithm itself, I'm going to take you through four different customizations that work as follows:

  • LZW-A reads and writes code values rendered in text mode, which is great for debugging. It means you can view the output of the encoder in a text editor.
  • LZW-B reads and writes code values as 16-bit binary integers. This is fast and efficient, and usually results in significant data compression.
  • LZW-C reads and writes code values as N-bit binary integers, where N is determined by the maximum code size. Performing I/O on codes that are not aligned on byte boundaries complicates the code somewhat, but allows for greater efficiency and better compression.
  • LZW-D reads and writes code values as variable-length binary integers, starting with 9-bit codes and gradually increasing as the dictionary grows. This gives the maximum compression.

Before launching into these implementations, the code I showed above needs some minor tweaking to solve a couple of problems.

The first problem we have to deal with is the ever-expanding dictionary. In the algorithm I've presented, we keep adding new codes to the dictionary without end. This needs to be changed for a couple of reasons.

First, we don't have unlimited memory, so the dictionary simply can't grow forever. Second, practical experience shows that compression ratios don't improve as dictionary sizes grow without bound. As the dictionary grows, code sizes get larger and larger, and so they take up more space in the compressed stream, which can reduce compression efficiency.

To resolve this problem, I just add an additional argument to the encoder and decoder that sets the maximum code value that will be added to the dictionary. The function signatures now look like this:

void compress( input_stream input, 
               output_stream output, 
               const unsigned int max_code = 32767 );
void decompress( input_stream input, 
                 output_stream output, 
                 const unsigned int max_code = 32767 );

Implementing it means one small change in the encoder:

if ( next_code <= max_code )
  codes[ current_string ] = next_code++;

And a corresponding change in the decoder:

if ( previous_string.size() && next_code <= max_code )
  strings[next_code++] = previous_string + strings[code][0];

Input and Output

Finally, I need to give the algorithm a decent way to perform input and output - and this is where C++ offers a huge amount of help.

When writing generic compression code that you intend to use in multiple contexts, one of the more difficult things to deal with is I/O. People using your code might want to compress data in memory, stored in files, or streaming in from sockets or other sources. Some input data sources might be of unknown length (data coming from a TCP socket, for example), while others will be of a prescribed length. Back in the days of C, it was particularly difficult to make your compression code both generic, so it would work with all types of data streams, and efficient, so that I/O doesn't take any more time than it has to.

With the advent of C++, we have a new tool that can help in this quest - templates. Templates are designed to solve this problem in an efficient way, and I take advantage of this in my sample code. The code below shows the final version of the compressor and decompressor that are used in all four versions of the implementation. There are two final changes made to the routines shown previously. First, both C++ functions are now function templates, parameterized on the types being used for input and output. Second, the actual input and output is done through four newly introduced template classes:

template<class INPUT, class OUTPUT>
void compress( INPUT &input, OUTPUT &output, const unsigned int max_code = 32767 )
{
  input_symbol_stream<INPUT> in( input );
  output_code_stream<OUTPUT> out( output, max_code );

  std::unordered_map<std::string, unsigned int> codes( (max_code * 11)/10 );
  for ( unsigned int i = 0 ; i < 256 ; i++ )
    codes[std::string(1,(char)i)] = i;
  unsigned int next_code = 257;
  std::string current_string;
  char c;
  while ( in >> c ) {
    current_string = current_string + c;
    if ( codes.find(current_string) == codes.end() ) {
      if ( next_code <= max_code )
        codes[ current_string ] = next_code++;
      current_string.erase(current_string.size()-1);
      out << codes[current_string];
      current_string = c;
    }
  }
  if ( current_string.size() )
    out << codes[current_string];
}

template<class INPUT, class OUTPUT>
void decompress( INPUT &input, OUTPUT &output, const unsigned int max_code = 32767  )
{
  input_code_stream<INPUT> in( input, max_code );
  output_symbol_stream<OUTPUT> out( output );

  std::unordered_map<unsigned int,std::string> strings( (max_code * 11) / 10 );
  for ( unsigned int i = 0 ; i < 256 ; i++ )
    strings[i] = std::string(1,(char)i);
  std::string previous_string;
  unsigned int code;
  unsigned int next_code = 257;
  while ( in >> code ) {
    if ( strings.find( code ) == strings.end() ) 
      strings[ code ] = previous_string + previous_string[0];
    out << strings[code];
    if ( previous_string.size() && next_code <= max_code )
      strings[next_code++] = previous_string + strings[code][0];
    previous_string = strings[code];
  }
}

What exactly is the effect of implementing this algorithm using a pair of function templates, parameterized on the types of the input and output objects? What this means is that you can call these compression routines with any type of I/O object you can throw at them. It can work with C++ iostreams, C FILE * objects, raw blocks of memory, whatever you want.

But there's a catch to that flexibility - you have to implement some basic I/O routines for whatever type you are using. Fortunately, this is not too hard.

The actual I/O that is done in the compression routines is defined by four template classes I created. These classes are defined in lzw_streambase.h. These classes don't have implementations, but they do define the methods you need to implement to work with the compressor and decompressor. The four classes are:

  • input_symbol_stream<T>
  • output_symbol_stream<T>
  • input_code_stream<T>
  • output_code_stream<T>

The first two classes are the symbol input and output classes. These are normally going to be very simple implementations, as they just have to move single characters to and from streams, while checking for errors or ends of streams. I use the same versions of these classes in all four implementations, so the code in lzw-a.h is unchanged in the other three header files.

The input_symbol_stream<T> class has one member function: the extraction operator, which reads a character from the stream and returns a boolean true or false. You'll see later in this section that the implementation of this for types such as std::istream is trivial.

template<typename T>
class input_symbol_stream
{
public :
    input_symbol_stream( T & );
    bool operator>>( char &c );
};

The output_symbol_stream<T> class uses the insertion operator to write strings instead of individual characters - because that is what is stored in the dictionary. The C++ std::string class makes a perfectly good container for any variety of symbols, including binary data, and unlike the alternative vector<char>, it comes with hash functions and iostream operators.

template<typename T>
class output_symbol_stream
{
public :
    output_symbol_stream( T &  );
    void operator<<( const std::string &s );
};

The input_code_stream<T> class reads codes, normally unsigned integers, from some type of stream. In my implementations, this class also returns false if it encounters the EOF_CODE in the stream of incoming codes. Removing the responsibility for EOF detection from the decompressor makes the code a bit simpler and more versatile.

The formatting of the integer is entirely up to the implementor, but the most common approach will probably be variable length codes ranging from 9 to 16 or so bits.

template<typename T>
class input_code_stream
{
public :
    input_code_stream( T &, unsigned int );
    bool operator>>( unsigned int &i );
};

The output_code_stream<T> class writes codes, usually unsigned integers, to some type of stream. Whatever class you implement for this function must agree with the implementation for input_code_stream<T>.

template<typename T>
class output_code_stream 
{
public :
    output_code_stream( T &, unsigned int );
    void operator<<( const unsigned int i );
};

You can see that at the top of the compressor and decompressor, I instantiate objects of these types, then use the standard insertion and extraction operators to read and write from these objects.
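As a small illustration of that flexibility, here is a sketch of how you might specialize input_symbol_stream for an in-memory source. The memory_input type and its layout are inventions for this example, not part of the article's code:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Primary template, as declared in lzw_streambase.h: no implementation,
// just the interface the compressor expects.
template<typename T>
class input_symbol_stream
{
public :
    input_symbol_stream( T & );
    bool operator>>( char &c );
};

// A hypothetical in-memory source type - an invention for this sketch.
struct memory_input
{
    std::string data;
    std::size_t pos = 0;
};

// Specialization that pulls symbols out of the buffer one at a time,
// returning false once the data is exhausted.
template<>
class input_symbol_stream<memory_input>
{
public :
    input_symbol_stream( memory_input &input )
        : m_input( input ) {}
    bool operator>>( char &c )
    {
        if ( m_input.pos >= m_input.data.size() )
            return false;
        c = m_input.data[ m_input.pos++ ];
        return true;
    }
private :
    memory_input &m_input;
};
```

Implementing the other three classes for the same source and sink types would let the compressor and decompressor run entirely in memory, with no changes to the algorithm itself.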

LZW-A

In my sample Windows program, I include lzw_streambase.h and lzw.h, which accounts for all of the code you have seen so far. I have the following lines that perform compression and decompression:

std::ifstream in( name, std::ios_base::binary );
std::ofstream lzw_out( temp_name_lzw, std::ios_base::binary );
compress( (std::istream &) in, (std::ostream&) lzw_out, pDlg->m_MaxCodeSize );
.
.
.
std::ifstream lzw_in( temp_name_lzw, std::ios_base::binary );
std::fstream out( temp_name_out, 
                  std::fstream::in    | 
                  std::fstream::out   | 
                  std::fstream::binary );
decompress( (std::istream &) lzw_in, (std::ostream&) out, pDlg->m_MaxCodeSize );

If I try to build this project as-is, I get a nasty list of eight linker errors:

Visual Studio 10 Error Messages


If you have the fortitude to crawl through those link errors, you will see that what is missing are the implementations of the four classes parameterized on std::ostream and std::istream. Each of the four classes needs the implementation of a constructor and either an insertion or extraction operator. And with no class definitions at all, that adds up to eight missing functions. To get us started on performing actual LZW compression, I've created the first implementation of these four classes in lzw-a.h. Let's take a look at each of these in turn.

It's tempting to try to read characters using the ifstream extraction operator, as in m_impl >> c, but that operator skips over whitespace, so we don't get an exact copy of the input stream. Using get() works around this problem. Below is the complete definition of input_symbol_stream<std::istream> used in all four LZW implementations in this article:

template<>
class input_symbol_stream<std::istream> {
public :
    input_symbol_stream( std::istream &input ) 
        : m_input( input ) {}
    bool operator>>( char &c )
    {
        if ( !m_input.get( c ) )
            return false;
        else
            return true;
    }
private :
    std::istream &m_input;
};

Using the insertion operator to output strings seems to work properly, even when the strings contain binary data, so the implementation of the class used to output symbols is as simple as we could hope for. Again, this exact code is used in all four implementations in this article:

template<>
class output_symbol_stream<std::ostream> {
public :
    output_symbol_stream( std::ostream &output ) 
        : m_output( output ) {}
    void operator<<( const std::string &s )
    {
        m_output << s;
    }
private :
    std::ostream &m_output;
};

LZW-A prints the text values of integers to the output stream, and reads them back in that format. This is not efficient at all, but it is a great aid in debugging. If you are having a problem with the algorithm, this provides a nice way to examine your stream. The implementation of this is very simple - just use the std::ostream insertion operator, and follow each code by a newline so it can be properly parsed on input, as well as be easily loaded into a text editor.

One important thing to notice in this class: the presence of a destructor that prints the EOF_CODE. Since this object goes out of scope as the compressor exits, this ensures that every code stream will end with this special code. Putting the onus on the I/O routines to deal with EOF issues simplifies the algorithm itself. (It also means that you can implement versions of LZW that don't use an EOF in the code stream.)

template<>
class output_code_stream<std::ostream> {
public :
    output_code_stream( std::ostream &output, const unsigned int ) 
        : m_output( output ) {}
    void operator<<( unsigned int i )
    {
        m_output << i << '\n';
    }
    ~output_code_stream()
    {
        *this << EOF_CODE;
    }
private :
    std::ostream &m_output;
};

The corresponding version of the input class just reads in the whitespace-separated codes. If an error occurs or an EOF_CODE is encountered in the stream, the extraction operator returns false, which lets the decompressor know when it is time to stop processing.

template<>
class input_code_stream<std::istream> {
public :
    input_code_stream( std::istream &input, unsigned int ) 
        : m_input( input ) {}
    bool operator>>( unsigned int &i )
    {
        m_input >> i;
        if ( !m_input || i == EOF_CODE )
            return false;
        else
            return true;
    }
private :
    std::istream &m_input;
};

By including lzw-a.h along with the other two header files, I can now create a program that compiles, links, and is able to test the algorithm. Using my UNIX test program, I compress the demo string from earlier in this article, and I see the output as it is sent directly to stdout:

Compressing ABBABBBABBA


Fortunately, the output is identical to what was shown earlier, with the addition of the final EOF_CODE used to delimit the end of the code stream.

LZW-B

The header file lzw-b.h implements specialized classes that replace the text-mode output of the codes in lzw-a.h with binary codes stored in a short integer - two bytes.

The classes that read and write symbols are unchanged, but reading and writing codes has to change in order to do this new binary output.

Writing the codes to std::ostream as binary values requires breaking the integer code into two bytes and writing the bytes one at a time. There are more efficient ways to write the complete short integer in one function call, but they raise code portability problems, as we don't always know what order bytes will be written in.

Like the code stream output object in lzw-a.h, this version of the code output class has a destructor that outputs an EOF_CODE value:

template<>
class output_code_stream<std::ostream> {
public :
    output_code_stream( std::ostream &output, const unsigned int ) 
        : m_output( output ) {}
    void operator<<( unsigned int i )
    {
        m_output.put( i & 0xff );
        m_output.put( (i>>8) & 0xff);
    }
    ~output_code_stream()
    {
        *this << EOF_CODE;
    }
private :
    std::ostream &m_output;
};

Reading the codes requires reading the two bytes that make up the short integer, then combining them. While reading, if the routine detects an EOF_CODE, it returns false, which tells the decompressor to stop processing. It also returns false if there is an error on the input code stream.

template<>
class input_code_stream<std::istream> {
public :
    input_code_stream( std::istream &input, unsigned int ) 
        : m_input( input ) {}
    bool operator>>( unsigned int &i )
    {
        char c;
        if ( !m_input.get(c) )
            return false;
        i = c & 0xff;
        if ( !m_input.get(c) )
            return false;
        i |= (c & 0xff) << 8;
        if ( i == EOF_CODE )
            return false;
        else
            return true;
    }
private :
    std::istream &m_input;
};

The most exciting thing about lzw-b.h is that you can now see data compression taking place. The figure below shows a run of my Windows test program against the Canterbury Corpus, a standard set of files used to test compression, and the files are compressing quite nicely:

Compressing the Canterbury Corpus with lzw-b.h

LZW-C

The third I/O implementation, defined in lzw-c.h, writes binary codes like lzw-b.h, but with one crucial difference. Instead of being hard-coded to 16-bit codes, lzw-c.h determines the maximum code size needed based on the maximum code value passed as an argument to compress() and decompress(). It then writes codes of exactly that width, which will normally be something in the range of 9-18 bits.

Since these values are not aligned with byte boundaries, there are some issues writing them to streams that expect to read and write bytes. However, the bit shifting, ORing, and ANDing is definitely worth it: when the code size is 12 bits, we save four bits per code compared to lzw-b.h. But every read and write potentially starts somewhere in the middle of a byte, so the I/O classes have to do some extra work - mostly shifting bits into the correct position in the stream.

Note that the code to read and write symbols is unchanged from lzw-a.h and lzw-b.h.

Many of the CS students who read my earlier article on LZW ran into a brick wall when they started trying to understand the code that performs I/O on codes of variable bit lengths. Obviously, writing 11 bit codes when your file system is oriented around eight-bit bytes involves a lot of bit twiddling, and I'm afraid that many novices are woefully deficient in this department. Not just in understanding the bitwise operators in C, such as shifting, masking, etc., but in understanding binary arithmetic in general.

That's why I've structured the code and this article a bit differently this time around. If the I/O operations in lzw-c.h and lzw-d.h are bewildering, well, no worries. They have absolutely nothing to do with the LZW algorithm itself. You can investigate and explore the algorithm completely using lzw-a.h and lzw-b.h, and just forget about the last two I/O implementations. They provide additional efficiency, but as I have said, have nothing to do with the algorithm itself.

Further, once you use lzw-a.h to debug and understand the algorithm, you can certainly plug in lzw-c.h and lzw-d.h and take advantage of their improved compression, even if you don't follow all the code.

It might be appropriate to add a sidebar or another section to explain the variable bit length I/O in detail, but this article is quite long already, and there are numerous other resources for the interested reader to explore the details. (But if you find yourself deficient in this area, you owe it to yourself to hit the books and get to the point where these operations make sense. This won't be the last time you need to understand bitwise operators.)

For those who are ready to tackle this more complicated I/O procedure, we will look first at the output_code_stream<std::ostream> class. Here, the first thing to understand is that the constructor has to initialize the number of bits in the code. This value is calculated from the max_code parameter, and is stored in member m_code_size, where it is used frequently.

Next, the insertion operator. Output of codes proceeds as follows. Member m_pending_bits tells us how many bits are waiting for output in member m_pending_output. These bits are right justified, and the count will always be less than eight. When a new code is written, it is inserted into m_pending_output after being left shifted so it will be laid down just past the pending bits. At that point we presumably have some complete bytes to output - the exact number depends on the code size and on how many bits were already pending. The flush() routine is called, and it flushes all complete bytes out. When it completes, there can be anywhere from zero to seven bits still waiting to be output, and they will be right justified in m_pending_output.

In the destructor, we output an EOF_CODE, and then do a flush as well. But in this case, we flush all remaining bits, not just the complete bytes. There are two good reasons for this. First, we don't care if the last bits that are flushed out are only part of a code - the code will be EOF_CODE, and that is the last one. And second, if we don't flush those final bits out in the destructor, they will never be sent to the output stream. This means the decoder will not see those bits, and we will most likely break the decompression process.

template<>
class output_code_stream<std::ostream>
{
public :
    output_code_stream( std::ostream &out, unsigned int max_code ) 
        : m_output( out ),
          m_code_size(1),
          m_pending_bits(0),
          m_pending_output(0)
    {
        while ( max_code >>= 1 )
            m_code_size++;
    }
    ~output_code_stream()
    {
        *this << EOF_CODE;
        flush(0);
    }
    void operator<<( const unsigned int &i )
    {
        m_pending_output |= i << m_pending_bits;
        m_pending_bits += m_code_size;
        flush( 8 );
    }
private :
    void flush( const int val )
    {
        while ( m_pending_bits >= val ) {
            m_output.put( m_pending_output & 0xff );
            m_pending_output >>= 8;
            m_pending_bits -= 8;
        }
    }
    std::ostream &m_output;
    int m_code_size;
    int m_pending_bits;
    unsigned int m_pending_output;
};

Like the output code class, the input code class has to calculate the code size for this decompression based on the max_code value passed in the function call.

When an attempt is made to read a code, there must be a minimum of m_code_size bits in member m_pending_input. If there aren't, new bytes are read in one at a time, and inserted into m_pending_input after having been shifted left the appropriate amount. Once m_pending_input contains at least m_code_size bits, the code is extracted from m_pending_input using the appropriate mask, m_pending_input is shifted right by m_code_size bits, and the count in m_available_bits is reduced by the same amount.

template<>
class input_code_stream<std::istream>
{
public :
    input_code_stream( std::istream &in, unsigned int max_code ) 
        : m_input( in ),
          m_code_size(1),
          m_available_bits(0),
          m_pending_input(0)
    {
        while ( max_code >>= 1 )
            m_code_size++;
    }
    bool operator>>( unsigned int &i )
    {
        while ( m_available_bits < m_code_size )
        {
            char c;
            if ( !m_input.get(c) )
                return false;
            m_pending_input |= (c & 0xff) << m_available_bits;
            m_available_bits += 8;
        }
        i = m_pending_input & ~(~0u << m_code_size);
        m_pending_input >>= m_code_size;
        m_available_bits -= m_code_size;
        if ( i == EOF_CODE )
            return false;
        else
            return true;
    }
private :
    std::istream &m_input;
    int m_code_size;
    int m_available_bits;
    unsigned int m_pending_input;
};

The table below shows the results of a test run comparing LZW-B and LZW-C with a maximum code of 4095. With this maximum value, all codes fit in a 12-bit integer. Since LZW-B stores each code in a 16-bit integer, and LZW-C uses 12 bits, the compressed file sizes should stand in a 4:3 ratio, and this looks to be the case:

File Name     Original Size  Compressed LZW-B  Compressed LZW-C  Ratio
alice29.txt   152089         96428             72322             0.750
alphabet.txt  100000         4538              3404              0.750
asyoulik.txt  125179         83966             62975             0.750
bib           111261         71792             53845             0.750
bible.txt     4047392        2468326           1851245           0.750

Comparing 12-bit compression between LZW-B and LZW-C


It looks like things are working as expected.

LZW-D

The code in lzw-d.h represents the final and most efficient version of I/O for the LZW code streams. It builds on the code in lzw-c.h - at its core it is a variable bit-length I/O stream. However, there is one crucial difference from lzw-c.h: the code I/O in lzw-d.h starts at the smallest possible code size, nine bits, and increases the code size as needed, until it reaches the maximum value for this compression session. The maximum value is the parameter passed in to the invocation of compress() or decompress().

The logic behind this is pretty simple. Even if we are going to use 16-bit codes in an LZW program, when the program first starts, the maximum possible code it can emit is 256, which needs only nine bits to encode. And each time we output a code, that maximum possible code value only increases by one, which means that roughly the first 256 codes output by the encoder can all fit in nine bits.

So the LZW-D encoder starts encoding using nine-bit code widths, and then bumps the value to ten as soon as the highest possible output code reaches 512. This process continues, incrementing the code size until the maximum code size is reached. At that point the code size stays fixed, as no new codes are being added to the dictionary.

The decoder follows exactly the same process - reading in the first code with a width of nine bits, then bumping to ten when the maximum possible input code reaches 512.

The code for this class is built on that from lzw-c.h, with some added complexity. Because the listing is fairly long, and doesn't add much to the discussion of LZW itself, I've omitted it here and instead refer you to the download available at the end of the article.

The Windows Test Program

When you develop compression code, there are a few different common tasks you are likely to want to perform:

  • Check your code for correctness, often through bulk testing.
  • Check your compression ratios against standard benchmarks.
  • Analyze your program's performance to locate bottlenecks and make it more efficient.

My Windows app is designed to help with all of these tasks. It is a simple MFC dialog-based application, built with Visual Studio 2010. It lets you select a base directory and a maximum code size, then compresses and decompresses all the files in that directory. If you select the recursion check box, it also processes every file in the tree of subdirectories below the one you selected.

Each file is compressed to a temporary location, then decompressed to another temporary location. The size of the compressed file is saved, and a comparison is done to ensure that the original and expanded files are identical.

To help with data collection, after running a test you can press the copy button to place the results of the test on the clipboard. Although it isn't visible in the display, the clipboard data includes the full path name of each original file, not just the basename.

This Visual Studio project takes advantage of a number of C++11 features, and as a result it will need some modification to work with earlier versions. Any version that supports unordered_map can be made to build without too many changes. And if you are going way back in time, you could replace unordered_map with map.

As shipped, the test program uses lzw-d.h. To use any of the three other versions of I/O discussed in this article, just modify the include file selected at the top of LzwTestDlg.cpp. The figure below shows what the app looks like after running through some data:

The Windows test app after a test run


After pressing the copy button at the bottom of the dialog, you can paste the data into a spreadsheet and then crunch it to your heart's content:

Copying the data into a spreadsheet

The Linux Test Program

The LZW code is platform independent, and will build and run just fine on UNIX or Linux systems. The Linux test program, lzw.cpp, allows you to compress or decompress files from the command line. It builds just fine with g++ 4.5, as long as you use the -std=c++0x switch to turn on the latest language features. Compiling with earlier versions will require a few minor modifications.

The command line interface to the test program is not too complicated, and is probably best documented by looking at the usage output:

mrn@ubuntu:~/LzwTest$ g++ -std=c++0x lzw.cpp -o lzw
mrn@ubuntu:~/LzwTest$ ./lzw
Usage:
lzw [-max max_code] -c input output #compress file input to file output
lzw [-max max_code] -c - output     #compress stdin to file output
lzw [-max max_code] -c input        #compress file input to stdout
lzw [-max max_code] -c              #compress stdin to stdout
lzw [-max max_code] -d input output #decompress file input to file output
lzw [-max max_code] -d - output     #decompress stdin to file output
lzw [-max max_code] -d input        #decompress file input to stdout
lzw [-max max_code] -d              #decompress stdin to stdout
mrn@ubuntu:~/LzwTest$ 

Like the Windows test program, the command line program is built by default with lzw-d.h. Replacing this algorithm with any of the three others requires a minor change to the source code.

With the default build, the program produces output nearly identical to UNIX compress. The one difference is that UNIX compress monitors the compression ratio after the dictionary is full, and clears the dictionary if the ratio starts to deteriorate (which it almost always does). I include a benchmark script that tests UNIX compress against the command line test program, and the results show that for small files, the file sizes are almost identical:

mrn@ubuntu:~/LzwTest$ ./benchmark.sh 65535 16 canterbury | head -n 15 | column -t
Filename                 Original-size  LZW-size  Compress-size
--------                 -------------  --------  -------------
canterbury/aaa.txt       33406          320       321
canterbury/alice29.txt   152089         62247     62247
canterbury/alphabet.txt  100000         3052      3053
canterbury/asyoulik.txt  125179         54989     54990
canterbury/a.txt         1              3         5
canterbury/bib           111261         46527     46528
canterbury/bible.txt     4047392        1417735   1377093
canterbury/book1         768771         317133    317133
canterbury/book2         610856         247593    251289
canterbury/cp.html       24603          11315     11317
canterbury/E.coli        4638690        1213579   1218349
canterbury/fields.c      11150          4963      4964
canterbury/geo           102400         77777     77777

You can see in this test that LZW-D and UNIX compress perform nearly identically for all but the largest files in the test sample. If I modify UNIX compress to not monitor compression ratios, the difference seen with larger files goes away:

mrn@ubuntu:~/LzwTest$ ./benchmark.sh 65535 16 canterbury | head -n 15 | column -t
Filename                 Original-size  LZW-size  Compress-size
--------                 -------------  --------  -------------
canterbury/aaa.txt       33406          320       321
canterbury/alice29.txt   152089         62247     62247
canterbury/alphabet.txt  100000         3052      3053
canterbury/asyoulik.txt  125179         54989     54990
canterbury/a.txt         1              3         5
canterbury/bib           111261         46527     46528
canterbury/bible.txt     4047392        1417735   1417735
canterbury/book1         768771         317133    317133
canterbury/book2         610856         247593    247593
canterbury/cp.html       24603          11315     11317
canterbury/E.coli        4638690        1213579   1213579
canterbury/fields.c      11150          4963      4964
canterbury/geo           102400         77777     77777

That provides some support for the notion that the algorithm shown here behaves properly.

Your Program

If you want to build your own program and use these classes, all you need is a C++11 compiler, or an earlier version and a willingness to make a few changes.

To use the classes, include in order lzw_streambase.h, one of the four implementation files for iostreams, preferably lzw-d.h, and finally, lzw.h. Because the significant code in these files is all implemented as template functions or classes, there is no library to include in your project, and no C++ source you have to compile separately.

All of the code in these header files has been hoisted into the lzw namespace, so you will either have to explicitly use the namespace when you invoke compress() and decompress(), or insert this line into your program:

using namespace lzw;

One thing to note about the I/O routines I have defined: the template classes are specialized on std::istream and std::ostream. If you innocently pass in an object such as a std::ifstream, you will get compile-time errors. This is because template argument deduction is strict - the compiler deduces the exact type std::ifstream, and won't fall back to the std::istream specialization just because one derives from the other. So instead, you will need to cast your arguments to the types defined in the header files. (Or write your own implementations.)

Your rights to use this code are covered by my Liberal Code Use Policy. As I have mentioned before, this is teaching code; if you decide to use it in a production system, there are many optimizations you might want to perform.

Benchmarks

So how does LZW do when it comes to compression? LZW's original strength was its combination of good compression ratios with high-speed operation - Terry Welch's original application for LZW was in disk controllers, and the UNIX compress program is still nice and fast. Because my program is a teaching program, it won't be nearly as fast as compress, but it's still useful to compare it to the de facto standard for lossless compression: the deflate algorithm.

We can compare LZW against deflate with a small modification of my benchmark script that uses gzip instead of compress. The table below shows the average compression ratios for the files in the Canterbury Corpus when compressed using maximum code widths of 15-18 bits. (The ratio is defined as 100*compressed_size/uncompressed_size, so 0% is perfect compression and 100% is no compression.)

gzip   LZW 15 bits  LZW 16 bits  LZW 17 bits  LZW 18 bits
32.7%  43.2%        42.6%        42.5%        42.3%


You can see that LZW does do a good job of compressing data, but the deflate algorithm used by gzip manages to squeeze an additional 10%, more or less, out of the files it compresses. The gap between LZW and deflate is larger on some types of files, and smaller on others, but deflate will almost always show a noticeable difference in compression ratios.

Variations

There are many variations on the code I've presented here that make sense.

One obvious change is to eliminate the special EOF_CODE used to delimit the end of the code stream. If the code stream is a file or other stream with an inherent EOF condition, there is no need for an EOF_CODE - simply reaching the end of the input stream properly signals the end of the encoded data. Freeing up this one code will make a microscopically small improvement in compression ratios.

If you want to mimic the output of the compress program, you need to remove the EOF_CODE, and replace it with a CLEAR_CODE that has a value of 256. The compress program monitors the compression ratios it achieves after its dictionary is full, and when the ratio starts to decay, it issues the CLEAR_CODE. That code tells the decoder to clear its dictionary and make a fresh start with new nine-bit codes.

Once you get the hang of LZW, a good exercise to make sure you have it working properly is to create a GIF encoder and decoder. GIF uses LZW to losslessly compress images with a constrained palette, and after all these years it is still something of a standard on the web.

History

Usually the history lesson on an algorithm is at the start of the article, but this is a how-to piece, and I feel like the trip down memory lane is not as important as understanding how the algorithm works.

The roots of LZW were set down in 1978 when Jacob Ziv and Abraham Lempel published the second of their two seminal works on data compression, "Compression of Individual Sequences via Variable-Rate Coding". This paper described a general approach to data compression that involved building dictionaries of previously seen strings.

Ziv and Lempel's work was targeted at an academic audience, and it wasn't truly popularized until 1984, when Terry Welch published "A Technique for High-Performance Data Compression". Welch's paper took the somewhat abstract Information Theory work of Ziv and Lempel and reduced it to practice in such a way that others could easily implement it.

UNIX compress was probably the first popular program that used LZW compression, and it very quickly became a standard utility on UNIX systems. The freely available code for compress was incorporated into ARC, one of the first archiving programs for PCs. In addition, the algorithm was used in the GIF file format, originally created by Compuserve in 1987.

LZW's popularity waned in the 1990s for two important reasons. First, Unisys began enforcing their patents that covered LZW compression, demanding and receiving royalties from various software companies. Not only did this make developers think twice about the liability they could incur while using LZW, it resulted in a general public relations backlash against using patented technology.

Secondly, the LZW algorithm was eclipsed on the desktop by deflate, as popularized by PKZIP. Not only did deflate outperform LZW, it was unencumbered by patents, and eventually had a very reliable, free, open source implementation in zlib, a library written by a team led by Mark Adler and Jean-loup Gailly. I don't know if there is any way to actually quantify this, but I think one could speculate that zlib is currently installed on more computer systems than any other software package in existence.

So LZW has settled down to an existence out of the limelight. It is still an important algorithm, used in quite a few file formats, and as this article shows, its simplicity makes it an excellent learning tool.

Downloads