Dr. Dobb's Journal, October, 1989

Note: I have an updated article on LZW posted here. Please check out the new article and tell me what you think. I hope it improves on this post and makes LZW easier to understand.

Thanks to Jan Hakenberg for correction of a couple of errors! In Figure 4, the values for new table entries 259 and 265 were truncated.

Thanks to David Littlewood for pointing out the missing line of pseudocode in Figure 6.

Thanks to Joe Snyder for pointing out a line where a macro should replace a hard-coded constant.

Any programmer working on mini or microcomputers in this day and age should have at least some exposure to the concept of data compression. In the MS-DOS world, programs like ARC, by System Enhancement Associates, and PKZIP, by PKware, are ubiquitous. ARC has also been ported to quite a few other machines, running UNIX, CP/M, and so on. CP/M users have long had SQ and USQ to squeeze and expand programs. UNIX users have the COMPRESS and COMPACT utilities. Yet the data compression techniques used in these programs typically only show up in two places: file transfers over phone lines, and archival storage.

Data compression has an undeserved reputation for being difficult to master, hard to implement, and tough to maintain. In fact, the techniques used in the previously mentioned programs are relatively simple, and can be implemented with standard utilities taking only a few lines of code. This article discusses a good all-purpose data compression technique: Lempel-Ziv-Welch, or LZW compression.

The routines shown here belong in any programmer’s toolbox. For example, a program that has a few dozen help screens could easily chop 50K bytes off by compressing the screens. Or 500K bytes of software could be distributed to end users on a single 360K byte floppy disk. Highly redundant database files can be compressed down to 10% of their original size. Once the tools are available, the applications for compression will show up on a regular basis.

LZW Fundamentals

The original Lempel-Ziv approach to data compression was first published in 1977, followed by an alternate approach in 1978. Terry Welch’s refinements to the 1978 algorithm were published in 1984. The algorithm is surprisingly simple. In a nutshell, LZW compression replaces strings of characters with single codes. It does not do any analysis of the incoming text. Instead, it just adds every new string of characters it sees to a table of strings. Compression occurs when a single code is output instead of a string of characters.

The code that the LZW algorithm outputs can be of any arbitrary length, but it must have more bits in it than a single character. The first 256 codes (when using eight bit characters) are by default assigned to the standard character set. The remaining codes are assigned to strings as the algorithm proceeds. The sample program runs as shown with 12 bit codes. This means codes 0-255 refer to individual bytes, while codes 256-4095 refer to substrings.

Compression

The LZW compression algorithm in its simplest form is shown in Figure 1. A quick examination of the algorithm shows that LZW is always trying to output codes for strings that are already known. And each time a new code is output, a new string is added to the string table.

Routine LZW_COMPRESS

STRING = get input character
WHILE there are still input characters DO
    CHARACTER = get input character
    IF STRING+CHARACTER is in the string table then
        STRING = STRING+CHARACTER
    ELSE
        output the code for STRING
        add STRING+CHARACTER to the string table
        STRING = CHARACTER
    END of IF
END of WHILE
output the code for STRING
Figure 1 — The Compression Algorithm

A sample string used to demonstrate the algorithm is shown in Figure 2. The input string is a short list of English words separated by the ‘/’ character. Stepping through the start of the algorithm for this string, you can see that on the first pass through the loop, a check is performed to see if the string "/W" is in the table. Since it isn’t, the code for ‘/’ is output, and the string "/W" is added to the table. Since we have 256 characters already defined for codes 0-255, the first string definition can be assigned to code 256. After the third letter, ‘E’, has been read in, the second string code, "WE" is added to the table, and the code for letter ‘W’ is output. This continues until in the second word, the characters ‘/’ and ‘W’ are read in, matching string number 256. In this case, the code 256 is output, and a three character string is added to the string table. The process continues until the string is exhausted and all of the codes have been output.

Input String = /WED/WE/WEE/WEB/WET
Character Input    Code Output    New code value    New String
/W                 /              256               /W
E                  W              257               WE
D                  E              258               ED
/                  D              259               D/
WE                 256            260               /WE
/                  E              261               E/
WEE                260            262               /WEE
/W                 261            263               E/W
EB                 257            264               WEB
/                  B              265               B/
WET                260            266               /WET
EOF                T
Figure 2 — The Compression Process

The sample output for the string is shown in Figure 2 along with the resulting string table. As can be seen, the string table fills up rapidly, since a new string is added to the table each time a code is output. In this highly redundant input, 5 code substitutions were output, along with 7 characters. If we were using 9 bit codes for output, the 19 character input string would be reduced to a 13.5 byte output string. Of course, this example was carefully chosen to demonstrate code substitution. In real world examples, compression usually doesn’t begin until a sizable table has been built, usually after at least one hundred or so bytes have been read in.
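
Before moving on to decompression, here is a minimal sketch of the Figure 1 loop in C. This is not the accompanying LZW.C program: the string table here is searched with a naive linear scan just to keep the shape of the algorithm visible, and find_string() and output_code() are illustrative stand-ins for the table search and the bit-packed output routine. The real program replaces the linear search with the hashing technique described later in this article.

#include <stdio.h>

#define TABLE_ENTRIES 4096              /* 12 bit codes: 0 - 4095             */
#define FIRST_CODE     256              /* codes 0 - 255 are single bytes     */

static int prefix_code[TABLE_ENTRIES];
static unsigned char append_character[TABLE_ENTRIES];
static int next_code = FIRST_CODE;

/* Linear search for the string (prefix, c); returns its code, or -1
   if the string is not yet in the table.                                     */
static int find_string(int prefix, int c)
{
    int i;
    for (i = FIRST_CODE; i < next_code; i++)
        if (prefix_code[i] == prefix && append_character[i] == (unsigned char) c)
            return i;
    return -1;
}

/* output_code() is assumed to pack each code into the output stream.         */
void compress(FILE *input, void (*output_code)(unsigned int))
{
    int string_code = getc(input);      /* STRING = get input character       */
    int character, index;

    while ((character = getc(input)) != EOF) {
        index = find_string(string_code, character);
        if (index != -1) {              /* STRING+CHARACTER is in the table   */
            string_code = index;
        } else {
            output_code(string_code);   /* output the code for STRING         */
            if (next_code < TABLE_ENTRIES) {
                prefix_code[next_code] = string_code;
                append_character[next_code] = (unsigned char) character;
                next_code++;            /* add STRING+CHARACTER to the table  */
            }
            string_code = character;    /* STRING = CHARACTER                 */
        }
    }
    output_code(string_code);           /* output the code for STRING         */
}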

Decompression

The companion algorithm for compression is the decompression algorithm. It needs to be able to take the stream of codes output from the compression algorithm, and use them to exactly recreate the input stream. One reason for the efficiency of the LZW algorithm is that it does not need to pass the string table to the decompression code. The table can be built exactly as it was during compression, using the input stream as data. This is possible because the compression algorithm always outputs the STRING and CHARACTER components of a code before it uses it in the output stream. This means that the compressed data is not burdened with carrying a large string translation table.

Routine LZW_DECOMPRESS

Read OLD_CODE
output OLD_CODE
WHILE there are still input characters DO
    Read NEW_CODE
    STRING = get translation of NEW_CODE
    output STRING
    CHARACTER = first character in STRING
    add OLD_CODE + CHARACTER to the translation table
    OLD_CODE = NEW_CODE
END of WHILE
Figure 3 — The Decompression Algorithm

The algorithm is shown in Figure 3. Just like the compression algorithm, it adds a new string to the string table each time it reads in a new code. All it needs to do in addition to that is translate each incoming code into a string and send it to the output.

Figure 4 shows the output of the algorithm given the input created by the compression earlier in the article. The important thing to note is that the string table ends up looking exactly like the table built up during compression. The output string is identical to the input string from the compression algorithm. Note that the first 256 codes are already defined to translate to single character strings, just like in the compression code.

Input Codes: / W E D 256 E 260 261 257 B 260 T
Input/NEW_CODE    OLD_CODE    STRING/Output    CHARACTER    New table entry
/                             /                /
W                 /           W                W            256 = /W
E                 W           E                E            257 = WE
D                 E           D                D            258 = ED
256               D           /W               /            259 = D/
E                 256         E                E            260 = /WE
260               E           /WE              /            261 = E/
261               260         E/               E            262 = /WEE
257               261         WE               W            263 = E/W
B                 257         B                B            264 = WEB
260               B           /WE              /            265 = B/
T                 260         T                T            266 = /WET
Figure 4 — The Decompression Process

The Catch

Unfortunately, the nice simple decompression algorithm shown in Figure 3 is just a little too simple. There is a single exception case in the LZW compression algorithm that causes some trouble to the decompression side. If there is a string consisting of a (STRING,CHARACTER) pair already defined in the table, and the input stream then sees a sequence of STRING, CHARACTER, STRING, CHARACTER, STRING, the compression algorithm will output a code before the decompressor gets a chance to define it.

A simple example will illustrate the point. Imagine that the string JOEYN is defined in the table as code 300. Later on, the sequence JOEYNJOEYNJOEY appears in the input data. The compression output will look like that shown in Figure 5.

Input String: ...JOEYNJOEYNJOEY
Character Input    New Code/String    Code Output
JOEYN              300 = JOEYN        288 (JOEY)
A                  301 = NA           N
...                ...                ...
JOEYNJ             400 = JOEYNJ       300 (JOEYN)
JOEYNJO            401 = JOEYNJO      400 (???)
Figure 5 — A Problem

When the decompression algorithm sees this input stream, it first decodes the code 300, and outputs the JOEYN string. After doing the output, it will add the definition for code 399 to the table, whatever that may be. It then reads the next input code, 400, and finds that it is not in the table. This is a problem. What do we do?

Fortunately, this is the only case where the decompression algorithm will encounter an undefined code. Since it is in fact the only case, we can add an exception handler to the algorithm. The modified algorithm just looks for the special case of an undefined code, and handles it. In the example in Figure 5, the decompression routine sees a code of 400, which is undefined. Since it is undefined, it translates the value of OLD_CODE, which is code 300. It then adds the CHARACTER value, which is ‘J’, to the string. This results in the correct translation of code 400 to string "JOEYNJ".

Routine LZW_DECOMPRESS

Read OLD_CODE
output OLD_CODE
CHARACTER = OLD_CODE
WHILE there are still input characters DO
    Read NEW_CODE
    IF NEW_CODE is not in the translation table THEN
        STRING = get translation of OLD_CODE
        STRING = STRING+CHARACTER
    ELSE
        STRING = get translation of NEW_CODE
    END of IF
    output STRING
    CHARACTER = first character in STRING
    add OLD_CODE + CHARACTER to the translation table
    OLD_CODE = NEW_CODE
END of WHILE
Figure 6 — The Modified Decompression Algorithm
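
Here is a sketch of Figure 6 in C, including a stack-based decode_string() helper of the kind described in the implementation discussion below. The input_code() function and the end_marker value are assumptions of this sketch, standing in for the bit-unpacking input routine and the end-of-data code used by the surrounding program.

#include <stdio.h>

#define TABLE_ENTRIES 4096              /* 12 bit codes                        */
#define FIRST_CODE     256

static unsigned int prefix_code[TABLE_ENTRIES];
static unsigned char append_character[TABLE_ENTRIES];

/* Walk a code back through the table, writing its characters into buffer
   in reverse order.  Returns a pointer to the last character written,
   which is the FIRST character of the decoded string.                        */
static unsigned char *decode_string(unsigned char *buffer, unsigned int code)
{
    while (code > 255) {
        *buffer++ = append_character[code];
        code = prefix_code[code];
    }
    *buffer = (unsigned char) code;
    return buffer;
}

void expand(unsigned int (*input_code)(void), unsigned int end_marker, FILE *output)
{
    unsigned char decode_stack[TABLE_ENTRIES];
    unsigned int next_code = FIRST_CODE;
    unsigned int old_code, new_code, character;
    unsigned char *string;

    old_code = input_code();            /* Read OLD_CODE                       */
    character = old_code;               /* CHARACTER = OLD_CODE                */
    putc(old_code, output);             /* output OLD_CODE                     */

    while ((new_code = input_code()) != end_marker) {
        if (new_code >= next_code) {    /* NEW_CODE is not in the table        */
            *decode_stack = (unsigned char) character;
            string = decode_string(decode_stack + 1, old_code);
        } else {
            string = decode_string(decode_stack, new_code);
        }
        character = *string;            /* first character in STRING           */
        while (1) {                     /* output STRING by popping the stack  */
            putc(*string, output);
            if (string == decode_stack)
                break;
            string--;
        }
        if (next_code < TABLE_ENTRIES) {            /* add OLD_CODE+CHARACTER  */
            prefix_code[next_code] = old_code;
            append_character[next_code] = (unsigned char) character;
            next_code++;
        }
        old_code = new_code;            /* OLD_CODE = NEW_CODE                 */
    }
}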

The Implementation Blues

The concepts used in the compression algorithm are very simple – so simple that the whole algorithm can be expressed in only a dozen lines. Implementation of this algorithm is somewhat more complicated, mainly due to management of the string table.

In the code accompanying this article, I have used code sizes of 12, 13, and 14 bits. In a 12 bit code program, there are potentially 4096 strings in the string table. Each and every time a new character is read in, the string table has to be searched for a match. If a match is not found, then a new string has to be added to the table. This causes two problems. First, the string table can get very large very fast. If string lengths average even as low as three or four characters each, the overhead of storing a variable length string and its code could easily reach seven or eight bytes per code.

In addition, the amount of storage needed is indeterminate, as it depends on the total length of all the strings.

The second problem involves searching for strings. Each time a new character is read in, the algorithm has to search for the new string formed by STRING+CHARACTER. This means keeping a sorted list of strings. Searching for each string then takes on the order of log2 of the table size string comparisons; with 12 bit codes, that means potentially doing 12 string compares for each code. The computational overhead caused by this would be prohibitive.

The first problem can be solved by storing the strings as code/character combinations. Since every string is actually a combination of an existing code and an appended character, we can store each string as a single code plus a character. For example, in the compression example shown, the string "/WEE" is actually stored as code 260 with appended character E. This takes only three bytes of storage instead of 5 (counting the string terminator). By backtracking, we find that code 260 is stored as code 256 plus an appended character E. Finally, code 256 is stored as a ‘/’ character plus a ‘W’.

Doing the string comparisons is a little more difficult. Our new method of storage reduces the amount of time for a string comparison, but it doesn’t cut into the number of comparisons needed to find a match. This problem is solved by storing strings using a hashing algorithm. What this means is that we don’t store code 256 in location 256 of an array. Instead, we store it in a location in the array based on an address formed by the string itself. When we are trying to locate a given string, we can use the test string to generate a hashed address, and with luck can find the target string in one search.

Since the code for a given string is no longer known merely by its position in the array, we need to store the code for a given string along with the string data. In the attached program, there are three array elements for each string. They are code_value[i], prefix_code[i], and append_character[i].

When we want to add a new code to the table, we use the hashing function in routine find_match to generate the correct value of i. First find_match generates an address, then checks to see if the location is already in use by a different string. If it is, it performs a secondary probe until an open location is found.

The hashing function in use in this program is a straightforward XOR type hash function. The prefix code and appended character are combined to form an array address. If the contents of the prefix code and character in the array are a match, the correct address is returned. If that element in the array is in use, a fixed offset probe is used to search new locations. This continues until either an empty slot is found, or a match is found. By using a table about 25% larger than needed, the average number of searches in the table usually stays below 3. Performance can be improved by increasing the size of the table.

Note that in order for the secondary probe to always work, the size of the table needs to be a prime number. This is because the probe can be any integer between 0 and the table size. If the probe and the table size were not mutually prime, a search for an open slot could fail even if there were still open slots available.
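
As a concrete illustration, here is a hashing routine along the lines of the find_match function just described. The table size of 5021 used here is a prime roughly 25% larger than the 4096 possible 12 bit codes, and code_value[] is assumed to be initialized to -1 so empty slots can be recognized; treat the exact constants as assumptions of this sketch rather than a listing of the accompanying program.

#define TABLE_SIZE    5021              /* a prime, about 25% larger than 4096  */
#define HASHING_SHIFT 4                 /* 12 bit codes minus 8 character bits  */

int code_value[TABLE_SIZE];             /* code stored in this slot, -1 if empty */
unsigned int prefix_code[TABLE_SIZE];   /* prefix code of the string stored here */
unsigned char append_character[TABLE_SIZE];  /* appended character of the string */

/* Return the slot holding (hash_prefix, hash_character), or the empty slot
   where that string should be added if it is not yet in the table.           */
int find_match(int hash_prefix, unsigned int hash_character)
{
    int index  = (int) (hash_character << HASHING_SHIFT) ^ hash_prefix;
    int offset = (index == 0) ? 1 : TABLE_SIZE - index;

    for (;;) {
        if (code_value[index] == -1)    /* empty slot: string not found        */
            return index;
        if (prefix_code[index] == (unsigned int) hash_prefix &&
            append_character[index] == hash_character)
            return index;               /* found the string                    */
        index -= offset;                /* secondary probe at a fixed offset   */
        if (index < 0)
            index += TABLE_SIZE;
    }
}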

Implementing the decompression algorithm has its own set of problems. One of the problems from the compression code goes away. When we are compressing, we need to search the table for a given string. During decompression, we are looking for a particular code. This means that we can store the prefix codes and appended characters in the table indexed by their string code. This eliminates the need for a hashing function, and frees up the array used to store code values.

Unfortunately, the method we are using of storing string values causes the strings to be decoded in reverse order. This means that all the characters for a given string have to be decoded into a stack buffer, then output in reverse order. In the program here this is done in the decode_string function. Once this code is written, the rest of the algorithm turns into code very easily.

One problem encountered when reading in data streams is determining when you have reached the end of the input data stream. In this particular implementation, I have reserved the last defined code, MAX_VALUE, as a special end of data indicator. While this may not be necessary when reading in data files, it is very helpful when reading compressed buffers out of memory. The expense of losing one defined code is minimal in comparison to the convenience.
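
With 12 bit codes that convention might look like the following, where MAX_VALUE is the reserved terminator and MAX_CODE is the last code the string table is allowed to hand out. The output_code() call in the comment is again a stand-in for the bit-packing output routine.

#define BITS      12
#define MAX_VALUE ((1 << BITS) - 1)     /* 4095: reserved end-of-data code       */
#define MAX_CODE  (MAX_VALUE - 1)       /* 4094: largest code usable for a string */

/* At the end of compression, after the final string code goes out:
       output_code(output, string_code);
       output_code(output, MAX_VALUE);      signal end of the data stream
   The expander simply stops when it reads a code equal to MAX_VALUE.             */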

Results

It is somewhat difficult to characterize the results of any data compression technique. The level of compression achieved varies quite a bit depending on several factors. LZW compression excels when confronted with data streams that have any type of repeated strings. Because of this, it does extremely well when compressing English text. Compression levels of 50% or better should be expected. Likewise, compressing saved screens and displays will generally show very good results.

Trying to compress binary data files is a little more risky. Depending on the data, compression may or may not yield good results. In some cases, data files will compress even more than text. A little bit of experimentation will usually give you a feel for whether your data will compress well or not.

Your Implementation

The code accompanying this article works. However, it was written with the goal of being illuminating, not efficient. Some parts of the code are relatively inefficient. For example, the variable length input and output routines are short and easy to understand, but require a lot of overhead. An LZW program using fixed length 12 bit codes could experience real improvements in speed just by recoding these two routines.

One problem with the code listed here is that it does not adapt well to compressing files of differing sizes. Using 14 or 15 bit codes gives better compression ratios on large files (because they have a larger string table to work with), but actually has poorer performance on small files. Programs like ARC get around this problem by using variable length codes. For example, while the value of next_code is between 256 and 511, ARC inputs and outputs 9 bit codes. When the value of next_code increases to the point where 10 bit codes are needed, both the compression and decompression routines adjust the code size. This means that the 12 bit and 15 bit versions of the program will do equally well on small files.
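
The accompanying program does not do this, but a sketch of the idea might look like the code below. Both the compressor and the expander already maintain next_code in lockstep, so as long as both apply the same width rule at the same points in the code stream, they will always agree on how many bits to read or write. The write_bits() routine and the exact bump rule are illustrative assumptions; real programs such as ARC have to be careful about exactly when the width changes.

#define MIN_BITS  9                     /* enough for codes 0 - 511             */
#define MAX_BITS 14                     /* largest code size supported          */

/* Smallest code width able to hold next_code, the next code that will be
   assigned.  While next_code is between 256 and 511 this returns 9; when
   next_code reaches 512 it returns 10, and so on.                              */
unsigned int code_size_for(unsigned int next_code)
{
    unsigned int bits = MIN_BITS;
    while (bits < MAX_BITS && (1u << bits) <= next_code)
        bits++;
    return bits;
}

/* write_bits() is assumed to pack the low-order 'bits' bits of 'code' into
   the output stream.  The expander mirrors this with a matching read_bits()
   call, driven by the same code_size_for() test.                               */
void output_variable_code(unsigned int code, unsigned int next_code,
                          void (*write_bits)(unsigned int code, unsigned int bits))
{
    write_bits(code, code_size_for(next_code));
}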

Another problem on long files is that frequently the compression ratio begins to degrade as more of the file is read in. The reason for this is simple. Since the string table is of finite size, after a certain number of strings have been defined, no more can be added. But the string table is only good for the portion of the file that was read in while it was built. Later sections of the file may have different characteristics, and really need a different string table.

The conventional way to solve this problem is to monitor the compression ratio. After the string table is full, the compressor watches to see if the compression ratio degrades. After a certain amount of degradation, the entire table is flushed and rebuilt from scratch. The expansion code is notified that this has happened by a special code emitted by the compression routine.
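
A sketch of what that monitoring might look like is shown below. Everything here is hypothetical: the article's program does not flush its table, the 10% threshold is arbitrary, and the reserved "flush" code that the compressor would emit (and the expander would watch for) is left to the surrounding program.

static unsigned long bytes_in   = 0;    /* raw bytes read since the last flush      */
static unsigned long bytes_out  = 0;    /* compressed bytes written since the flush */
static unsigned long best_ratio = 0;    /* best bytes_in per 100 bytes_out seen     */

/* Once the table is full, flush when the running compression ratio has
   fallen noticeably below the best ratio seen since the last flush.  On a
   flush, the compressor would emit a reserved code so the expander clears
   its table too, and both sides would start rebuilding from scratch.           */
int should_flush_table(int table_is_full)
{
    unsigned long ratio;

    if (!table_is_full || bytes_out == 0)
        return 0;
    ratio = bytes_in * 100 / bytes_out;
    if (ratio > best_ratio) {
        best_ratio = ratio;             /* still compressing well                   */
        return 0;
    }
    return ratio + 10 < best_ratio;     /* degraded by an arbitrary 10% margin      */
}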

An alternative method would be to keep track of how frequently strings are used, and to periodically flush values that are rarely used. An adaptive technique like this may be too difficult to implement in a reasonably sized program.

One final technique for compressing the data is to take the LZW codes and run them through an adaptive Huffman coding filter. This will generally squeeze out a few more percentage points of compression, but at the cost of considerably more complexity in the code, as well as quite a bit more run time.

Portability

The code linked below was written and tested on MS-DOS machines, and has successfully compiled and executed with several C compilers. It should be portable to any machine that supports 16 bit integers and 32 bit longs in C. MS-DOS C compilers typically have trouble dealing with arrays larger than 64 Kbytes, preventing an easy implementation of 15 or 16 bit codes in this program. On machines using different processors, such as a VAX, these restrictions are lifted, and using larger code sizes becomes much simpler.

In addition, porting this code to assembly language should be fairly simple on any machine that supports 16 and 32 bit math. Significant performance improvements could be seen this way. Implementations in other high level languages should be straightforward.

Bibliography

  1. Terry Welch, “A Technique for High-Performance Data Compression”, Computer, June 1984
  2. J. Ziv and A. Lempel, “A Universal Algorithm for Sequential Data Compression”, IEEE Transactions on Information Theory, May 1977
  3. Rudy Rucker, “Mind Tools”, Houghton Mifflin Company, 1987

Source Code

LZW.C

Update - September 9, 2007

Reader David McKellar offers up some source code with the following comments:

I modified your fine 1989 lzw.c code so it outputs the same as the GNU/Linux compress command. Of course, I also made the corresponding changes in the expand part of your code.

I made the smallest number of changes possible. You can diff to see what I did. I realize your program is an example program to show the concepts but I think it doesn’t add to the complexity too much to be compress compatible.

I should mention that my changes don’t totally support the GNU/Linux compress. The real compress flushes its hash table when it notices the ratio goes down. It changes the bits used for encoding when the table is full. And probably more tricky stuff. However that is only an issue when lzw.c is attempting to read a compressed file.

lzw_nelson.cpp

Update - November 8, 2007

Reader Bogdan Banica contributes source code which is an adapted version of the program that operates on memory buffers instead of files: stream_lzw.c