Dr. Dobb's Journal, October 1989
Note: I have an updated article on LZW posted here. Please check out the new article and tell me what you think. I hope it improves on this post and makes LZW easier to understand.
Thanks to Jan Hakenberg for correction of a couple of errors! In Figure 4, the values for new table entries 259 and 265 were truncated.
Thanks to David Littlewood for pointing out the missing line of pseudocode in Figure 6.
Thanks to Joe Snyder for pointing out a line where a macro should replace a hard-coded constant.
Any programmer working on mini or microcomputers in this day and age should have at least some exposure to the concept of data compression. In the MS-DOS world, programs like ARC, by System Enhancement Associates, and PKZIP, by PKWARE, are ubiquitous. ARC has also been ported to quite a few other machines, running UNIX, CP/M, and so on. CP/M users have long had SQ and USQ to squeeze and expand programs. UNIX users have the COMPRESS and COMPACT utilities. Yet the data compression techniques used in these programs typically only show up in two places: file transfers over phone lines, and archival storage.
Data compression has an undeserved reputation for being difficult to master, hard to implement, and tough to maintain. In fact, the techniques used in the previously mentioned programs are relatively simple, and can be implemented with standard utilities taking only a few lines of code. This article discusses a good all-purpose data compression technique: Lempel-Ziv-Welch, or LZW compression.
The routines shown here belong in any programmer’s toolbox. For example, a program that has a few dozen help screens could easily chop 50K bytes off by compressing the screens. Or 500K bytes of software could be distributed to end users on a single 360K byte floppy disk. Highly redundant database files can be compressed down to 10% of their original size. Once the tools are available, the applications for compression will show up on a regular basis.
The original Lempel Ziv approach to data compression was first published in 1977, followed by an alternate approach in 1978. Terry Welch’s refinements to the 1978 algorithm were published in 1984. The algorithm is surprisingly simple. In a nutshell, LZW compression replaces strings of characters with single codes. It does not do any analysis of the incoming text. Instead, it just adds every new string of characters it sees to a table of strings. Compression occurs when a single code is output instead of a string of characters.
The code that the LZW algorithm outputs can be of any arbitrary length, but it must have more bits in it than a single character. The first 256 codes (when using eight bit characters) are by default assigned to the standard character set. The remaining codes are assigned to strings as the algorithm proceeds. The sample program runs as shown with 12 bit codes. This means codes 0-255 refer to individual bytes, while codes 256-4095 refer to substrings.
The LZW compression algorithm in its simplest form is shown in Figure 1. A quick examination of the algorithm shows that LZW is always trying to output codes for strings that are already known. And each time a new code is output, a new string is added to the string table.
STRING = get input character
WHILE there are still input characters DO
    CHARACTER = get input character
    IF STRING+CHARACTER is in the string table then
        STRING = STRING+CHARACTER
    ELSE
        output the code for STRING
        add STRING+CHARACTER to the string table
        STRING = CHARACTER
    END of IF
END of WHILE
output the code for STRING
A sample string used to demonstrate the algorithm is shown in Figure 2. The input string is a short list of English words separated by the ‘/’ character. Stepping through the start of the algorithm for this string, you can see that on the first pass through the loop, a check is performed to see if the string "/W" is in the table. Since it isn’t, the code for ‘/’ is output, and the string "/W" is added to the table. Since we have 256 characters already defined for codes 0-255, the first string definition can be assigned to code 256. After the third letter, ‘E’, has been read in, the second string code, "WE", is added to the table, and the code for the letter ‘W’ is output. This continues until in the second word, the characters ‘/’ and ‘W’ are read in, matching string number 256. In this case, the code 256 is output, and a three character string is added to the string table. The process continues until the string is exhausted and all of the codes have been output.
|Input String = /WED/WE/WEE/WEB/WET|
|Character Input||Code Output||New code value||New String|
|/W||/||256||/W|
|E||W||257||WE|
|D||E||258||ED|
|/||D||259||D/|
|WE||256||260||/WE|
|/||E||261||E/|
|WEE||260||262||/WEE|
|/W||261||263||E/W|
|EB||257||264||WEB|
|/||B||265||B/|
|WET||260||266||/WET|
|EOF||T||||
The sample output for the string is shown in Figure 2 along with the resulting string table. As can be seen, the string table fills up rapidly, since a new string is added to the table each time a code is output. In this highly redundant input, 5 code substitutions were output, along with 7 characters. If we were using 9 bit codes for output, the 19 character input string would be reduced to a 13.5 byte output string. Of course, this example was carefully chosen to demonstrate code substitution. In real world examples, compression usually doesn’t begin until a sizable table has been built, usually after at least one hundred or so bytes have been read in.
The companion algorithm for compression is the decompression algorithm. It needs to be able to take the stream of codes output from the compression algorithm, and use them to exactly recreate the input stream. One reason for the efficiency of the LZW algorithm is that it does not need to pass the string table to the decompression code. The table can be built exactly as it was during compression, using the input stream as data. This is possible because the compression algorithm always outputs the STRING and CHARACTER components of a code before it uses it in the output stream. This means that the compressed data is not burdened with carrying a large string translation table.
Read OLD_CODE
output OLD_CODE
WHILE there are still input characters DO
    Read NEW_CODE
    STRING = get translation of NEW_CODE
    output STRING
    CHARACTER = first character in STRING
    add OLD_CODE + CHARACTER to the translation table
    OLD_CODE = NEW_CODE
END of WHILE
The algorithm is shown in Figure 3. Just like the compression algorithm, it adds a new string to the string table each time it reads in a new code. All it needs to do in addition to that is translate each incoming code into a string and send it to the output.
Figure 4 shows the output of the algorithm given the input created by the compression earlier in the article. The important thing to note is that the string table ends up looking exactly like the table built up during compression. The output string is identical to the input string from the compression algorithm. Note that the first 256 codes are already defined to translate to single character strings, just like in the compression code.
|Input Codes: / W E D 256 E 260 261 257 B 260 T|
|Input/NEW_CODE||OLD_CODE||STRING/Output||CHARACTER||New table entry|
|W||/||W||W||256 = /W|
|E||W||E||E||257 = WE|
|D||E||D||D||258 = ED|
|256||D||/W||/||259 = D/|
|E||256||E||E||260 = /WE|
|260||E||/WE||/||261 = E/|
|261||260||E/||E||262 = /WEE|
|257||261||WE||W||263 = E/W|
|B||257||B||B||264 = WEB|
|260||B||/WE||/||265 = B/|
|T||260||T||T||266 = /WET|
Unfortunately, the nice simple decompression algorithm shown in Figure 3 is just a little too simple. There is a single exception case in the LZW compression algorithm that causes trouble on the decompression side. If there is a string consisting of a (STRING, CHARACTER) pair already defined in the table, and the input stream then sees the sequence STRING, CHARACTER, STRING, CHARACTER, STRING, the compression algorithm will output a code before the decompressor has had a chance to define it.
A simple example will illustrate the point. Imagine that the string JOEYN is defined in the table as code 300. Later on, the sequence JOEYNJOEYNJOEY occurs in the input. The compression output will look like that shown in Figure 5.
|Input String: ...JOEYNJOEYNJOEY|
|Character Input||New Code/String||Code Output|
|JOEYN||300 = JOEYN||288 (JOEY)|
|A||301 = NA||N|
|JOEYNJ||400 = JOEYNJ||300 (JOEYN)|
|JOEYNJO||401 = JOEYNJO||400 (???)|
When the decompression algorithm sees this input stream, it first decodes code 300 and outputs the JOEYN string. After doing the output, it will add the definition for code 399 (the next unused code at this point in the example) to the table, whatever that may be. It then reads the next input code, 400, and finds that it is not in the table. This is a problem. What do we do?
Fortunately, this is the only case where the decompression algorithm will encounter an undefined code. Since it is in fact the only case, we can add an exception handler to the algorithm. The modified algorithm just looks for the special case of an undefined code, and handles it. In the example in Figure 5, the decompression routine sees a code of 400, which is undefined. Since it is undefined, it translates the value of OLD_CODE, which is code 300. It then adds the CHARACTER value, which is ‘J’, to the string. This results in the correct translation of code 400 to string "JOEYNJ".
Read OLD_CODE
output OLD_CODE
CHARACTER = OLD_CODE
WHILE there are still input characters DO
    Read NEW_CODE
    IF NEW_CODE is not in the translation table THEN
        STRING = get translation of OLD_CODE
        STRING = STRING+CHARACTER
    ELSE
        STRING = get translation of NEW_CODE
    END of IF
    output STRING
    CHARACTER = first character in STRING
    add OLD_CODE + CHARACTER to the translation table
    OLD_CODE = NEW_CODE
END of WHILE
The Implementation Blues
The concepts used in the compression algorithm are very simple – so simple that the whole algorithm can be expressed in only a dozen lines. Implementation of this algorithm is somewhat more complicated, mainly due to management of the string table.
In the code accompanying this article, I have used code sizes of 12, 13, and 14 bits. In a 12 bit code program, there are potentially 4096 strings in the string table. Each and every time a new character is read in, the string table has to be searched for a match. If a match is not found, then a new string has to be added to the table. This causes two problems. First, the string table can get very large very fast. If string lengths average even as low as three or four characters each, the overhead of storing a variable length string and its code could easily reach seven or eight bytes per code.
In addition, the amount of storage needed is indeterminate, as it depends on the total length of all the strings.
The second problem involves searching for strings. Each time a new character is read in, the algorithm has to search for the new string formed by STRING+CHARACTER. This means keeping a sorted list of strings. Searching for each string will take on the order of log2 N string comparisons, where N is the number of strings in the table. Using 12 bit codes means potentially doing 12 string compares for each code. The computational overhead caused by this would be prohibitive.
The first problem can be solved by storing the strings as code/character combinations. Since every string is actually a combination of an existing code and an appended character, we can store each string as single code plus a character. For example, in the compression example shown, the string "/WEE" is actually stored as code 260 with appended character E. This takes only three bytes of storage instead of 5 (counting the string terminator). By backtracking, we find that code 260 is stored as code 256 plus an appended character E. Finally, code 256 is stored as a ‘/’ character plus a ‘W’.
Doing the string comparisons is a little more difficult. Our new method of storage reduces the amount of time for a string comparison, but it doesn’t cut into the number of comparisons needed to find a match. This problem is solved by storing strings using a hashing algorithm. What this means is that we don’t store code 256 in location 256 of an array. We store it in a location in the array based on an address formed by the string itself. When we are trying to locate a given string, we can use the test string to generate a hashed address, and with luck can find the target string in one search.
Since the code for a given string is no longer known merely by its position in the array, we need to store the code for a given string along with the string data. In the attached program, there are three array elements for each string, holding the code value, the prefix code, and the appended character.

When we want to add a new code to the table, we use the hashing function in the find_match routine to generate the correct array index. First find_match generates an address, then checks to see if that location is already in use by a different string. If it is, it performs a secondary probe until an open location is found.
The hashing function in use in this program is a straightforward XOR type hash function. The prefix code and appended character are combined to form an array address. If the contents of the prefix code and character in the array are a match, the correct address is returned. If that element in the array is in use, a fixed offset probe is used to search new locations. This continues until either an empty slot is found, or a match is found. By using a table about 25% larger than needed, the average number of searches in the table usually stays below 3. Performance can be improved by increasing the size of the table.
Note that in order for the secondary probe to always work, the size of the table needs to be a prime number. This is because the probe can be any integer between 0 and the table size. If the probe and the table size were not mutually prime, a search for an open slot could fail even if there were still open slots available.
Implementing the decompression algorithm has its own set of problems. One of the problems from the compression code goes away. When we are compressing, we need to search the table for a given string. During decompression, we are looking for a particular code. This means that we can store the prefix codes and appended characters in the table indexed by their string code. This eliminates the need for a hashing function, and frees up the array used to store code values.
Unfortunately, the method we are using of storing string values causes the strings to be decoded in reverse order. This means that all the characters for a given string have to be decoded into a stack buffer, then output in reverse order. In the program here this is done in the decode_string function. Once this code is written, the rest of the algorithm turns into code very easily.
One problem encountered when reading in data streams is determining when you have reached the end of the input data stream. In this particular implementation, I have reserved the last defined code, MAX_VALUE, as a special end-of-data indicator. While this may not be necessary when reading in data files, it is very helpful when reading compressed buffers out of memory. The expense of losing one defined code is minimal in comparison to the convenience.
It is somewhat difficult to characterize the results of any data compression technique. The level of compression achieved varies quite a bit depending on several factors. LZW compression excels when confronted with data streams that have any type of repeated strings. Because of this, it does extremely well when compressing English text. Compression levels of 50% or better should be expected. Likewise, compressing saved screens and displays will generally show very good results.
Trying to compress binary data files is a little more risky. Depending on the data, compression may or may not yield good results. In some cases, data files will compress even more than text. A little bit of experimentation will usually give you a feel for whether your data will compress well or not.
The code accompanying this article works. However, it was written with the goal of being illuminating, not efficient. Some parts of the code are relatively inefficient. For example, the variable length input and output routines are short and easy to understand, but require a lot of overhead. An LZW program using fixed length 12 bit codes could experience real improvements in speed just by recoding these two routines.
One problem with the code listed here is that it does not adapt well to compressing files of differing sizes. Using 14 or 15 bit codes gives better compression ratios on large files (because they have a larger string table to work with), but actually gives poorer performance on small files. Programs like ARC get around this problem by using variable length codes. For example, while the value of next_code is between 256 and 511, ARC inputs and outputs 9 bit codes. When the value of next_code increases to the point where 10 bit codes are needed, both the compression and decompression routines adjust the code size. This means that the 12 bit and 15 bit versions of the program will do equally well on small files.
Another problem on long files is that frequently the compression ratio begins to degrade as more of the file is read in. The reason for this is simple. Since the string table is of finite size, after a certain number of strings have been defined, no more can be added. But the string table is only good for the portion of the file that was read in while it was built. Later sections of the file may have different characteristics, and really need a different string table.
The conventional way to solve this problem is to monitor the compression ratio. After the string table is full, the compressor watches to see if the compression ratio degrades. After a certain amount of degradation, the entire table is flushed and rebuilt from scratch. The compression routine signals the expansion code that this has happened by emitting a special code.
An alternative method would be to keep track of how frequently strings are used, and to periodically flush values that are rarely used. An adaptive technique like this may be too difficult to implement in a reasonably sized program.
One final technique for compressing the data is to take the LZW codes and run them through an adaptive Huffman coding filter. This will generally squeeze out a few more percentage points of compression, but at the cost of considerably more complexity in the code, as well as quite a bit more run time.
The code linked below was written and tested on MS-DOS machines, and has successfully compiled and executed with several C compilers. It should be portable to any machine that supports 16 bit integers and 32 bit longs in C. MS-DOS C compilers typically have trouble dealing with arrays larger than 64K bytes, preventing an easy implementation of 15 or 16 bit codes in this program. On machines using different processors, such as a VAX, these restrictions are lifted, and using larger code sizes becomes much simpler.
In addition, porting this code to assembly language should be fairly simple on any machine that supports 16 and 32 bit math. Significant performance improvements could be seen this way. Implementations in other high level languages should be straightforward.
- Terry Welch, “A Technique for High-Performance Data Compression”, Computer, June 1984
- J. Ziv and A. Lempel, “A Universal Algorithm for Sequential Data Compression”, IEEE Transactions on Information Theory, May 1977
- Rudy Rucker, “Mind Tools”, Houghton Mifflin Company, 1987
Update - September 9, 2007
Reader David McKellar offers up some source code with the following comments:
I modified your fine 1989 lzw.c code so it outputs the same as the GNU/Linux compress command. Of course, I also made the corresponding changes in the expand part of your code.
I made the smallest number of changes possible. You can diff to see what I did. I realize your program is an example program to show the concepts but I think it doesn’t add to the complexity too much to be compress compatible.
I should mention that my changes don’t totally support the GNU/Linux compress. The real compress flushes its hash table when it notices the ratio goes down. It changes the bits used for encoding when the table is full. And probably more tricky stuff. However that is only an issue when lzw.c is attempting to read a compressed file.
Update - November 8, 2007
Reader Bogdan Banica contributes source code which is an adapted version of the program that operates on memory buffers instead of files: stream_lzw.c