|
Dr. Dobb's Journal
October, 1989 |
Thanks to Jan Hakenberg for correction of a couple of errors! In Figure 4, the values for new table entries 259 and 265 were truncated.
Thanks to David Littlewood for pointing out the missing line of pseudocde in Figure 6.
Thanks to Joe Snyder for pointing out a line where a macro should replace a hard-coded constant.
Any programmer working on mini or microcomputers in this day and age should have at least some exposure to the concept of data compression. In MS-DOS world, programs like ARC, by System Enhancement Associates, and PKZIP, by PKware are ubiquitous. ARC has also been ported to quite a few other machines, running UNIX, CP/M, and so on. CP/M users have long had SQ and USQ to squeeze and expand programs. Unix users have the COMPRESS and COMPACT utilities. Yet the data compression techniques used in these programs typically only show up in two places: file transfers over phone lines, and archival storage.
Data compression has an undeserved reputation for being difficult to master, hard to implement, and tough to maintain. In fact, the techniques used in the previously mentioned programs are relatively simple, and can be implemented with standard utilities taking only a few lines of code. This article discusses a good all-purpose data compression technique: Lempel-Ziv-Welch, or LZW compression.
The routines shown here belong in any programmer's toolbox. For example, a program that has a few dozen help screens could easily chop 50K bytes off by compressing the screens. Or 500K bytes of software could be distributed to end users on a single 360K byte floppy disk. Highly redundant database files can be compressed down to 10% of their original size. Once the tools are available, the applications for compression will show up on a regular basis.
LZW Fundamentals
The original Lempel Ziv approach to data compression was first published in in 1977, followed by an alternate approach in 1978. Terry Welch's refinements to the 1978 algorithm were published in 1984. The algorithm is surprisingly simple. In a nutshell, LZW compression replaces strings of characters with single codes. It does not do any analysis of the incoming text. Instead, it just adds every new string of characters it sees to a table of strings. Compression occurs when a single code is output instead of a string of characters.
The code that the LZW algorithm outputs can be of any arbitrary length, but it must have more bits in it than a single character. The first 256 codes (when using eight bit characters) are by default assigned to the standard character set. The remaining codes are assigned to strings as the algorithm proceeds. The sample program runs as shown with 12 bit codes. This means codes 0-255 refer to individual bytes, while codes 256-4095 refer to substrings.
Compression
The LZW compression algorithm in its simplest form is shown in Figure 1. A quick examination of the algorithm shows that LZW is always trying to output codes for strings that are already known. And each time a new code is output, a new string is added to the string table.
Routine LZW_COMPRESS
-
STRING = get input character
-
WHILE there are still input characters DO
-
CHARACTER = get input character
-
IF STRING+CHARACTER is in the string table then
-
STRING = STRING+character
-
ELSE
-
output the code for STRING
-
add STRING+CHARACTER to the string table
-
STRING = CHARACTER
-
END of IF
-
END of WHILE
-
output the code for STRING
The Compression Algorithm
Figure 1
A sample string used to demonstrate the algorithm is shown in Figure 2. The input string is a short list of English words separated by the '/' character. Stepping through the start of the algorithm for this string, you can see that the first pass through the loop, a check is performed to see if the string "/W" is in the table. Since it isn't, the code for '/' is output, and the string "/W" is added to the table. Since we have 256 characters already defined for codes 0-255, the first string definition can be assigned to code 256. After the third letter, 'E', has been read in, the second string code, "WE" is added to the table, and the code for letter 'W' is output. This continues until in the second word, the characters '/' and 'W' are read in, matching string number 256. In this case, the code 256 is output, and a three character string is added to the string table. The process continues until the string is exhausted and all of the codes have been output.
| Input String = /WED/WE/WEE/WEB/WET | |||
| Character Input | Code Output | New code value | New String |
| /W | / | 256 | /W |
| E | W | 257 | WE |
| D | E | 258 | ED |
| / | D | 259 | D/ |
| WE | 256 | 260 | /WE |
| / | E | 261 | E/ |
| WEE | 260 | 262 | /WEE |
| /W | 261 | 263 | E/W |
| EB | 257 | 264 | WEB |
| / | B | 265 | B/ |
| WET | 260 | 266 | /WET |
| EOF | T | ||
The Compression Process
Figure 2
The sample output for the string is shown in Figure 2 along with the resulting string table. As can be seen, the string table fills up rapidly, since a new string is added to the table each time a code is output. In this highly redundant input, 5 code substitutions were output, along with 7 characters. If we were using 9 bit codes for output, the 19 character input string would be reduced to a 13.5 byte output string. Of course, this example was carefully chosen to demonstrate code substitution. In real world examples, compression usually doesn't begin until a sizable table has been built, usually after at least one hundred or so bytes have been read in.
Decompression
The companion algorithm for compression is the decompression algorithm. It needs to be able to take the stream of codes output from the compression algorithm, and use them to exactly recreate the input stream. One reason for the efficiency of the LZW algorithm is that it does not need to pass the string table to the decompression code. The table can be built exactly as it was during compression, using the input stream as data. This is possible because the compression algorithm always outputs the STRING and CHARACTER components of a code before it uses it in the output stream. This means that the compressed data is not burdened with carrying a large string translation table.
Routine LZW_DECOMPRESS
-
Read OLD_CODE
-
output OLD_CODE
-
WHILE there are still input characters DO
-
Read NEW_CODE
-
STRING = get translation of NEW_CODE
-
output STRING
-
CHARACTER = first character in STRING
-
add OLD_CODE + CHARACTER to the translation table
-
OLD_CODE = NEW_CODE
-
END of WHILE
The Decompression Algorithm
Figure 3
The algorithm is shown in Figure 3. Just like the compression algorithm, it adds a new string to the string table each time it reads in a new code. All it needs to do in addition to that is translate each incoming code into a string and send it to the output.
Figure 4 shows the output of the algorithm given the input created by the compression earlier in the article. The important thing to note is that the string table ends up looking exactly like the table built up during compression. The output string is identical to the input string from the compression algorithm. Note that the first 256 codes are already defined to translate to single character strings, just like in the compression code.
| Input Codes: / W E D 256 E 260 261 257 B 260 T | ||||
| Input/ NEW_CODE |
OLD_CODE | STRING/ Output |
CHARACTER | New table entry |
| / | / | / | ||
| W | / | W | W | 256 = /W |
| E | W | E | E | 257 = WE |
| D | E | D | D | 258 = ED |
| 256 | D | /W | / | 259 = D/ |
| E | 256 | E | E | 260 = /WE |
| 260 | E | /WE | / | 261 = E/ |
| 261 | 260 | E/ | E | 262 = /WEE |
| 257 | 261 | WE | W | 263 = E/W |
| B | 257 | B | B | 264 = WEB |
| 260 | B | /WE | / | 265 = B/ |
| T | 260 | T | T | 266 = /WET |
The Decompression Process
Figure 4
The Catch
Unfortunately, the nice simple decompression algorithm shown in Figure 4 is just a little too simple. There is a single exception case in the LZW compression algorithm that causes some trouble to the decompression side. If there is a string consisting of a (STRING,CHARACTER) pair already defined in the table, and the input stream then sees a sequence of STRING, CHARACTER, STRING, CHARACTER, STRING, the compression algorithm will output a code before the decompressor gets a chance to define it.
A simple example will illustrate the point. Imagine the the string JOEYN is defined in the table as code 300. Later on, the sequence JOEYNJOEYNJOEY occurs in the table. The compression output will look like that shown in Figure 5.
| Input String: ...JOEYNJOEYNJOEY | ||
| Character Input | New Code/String | Code Output |
| JOEYN | 300 = JOEYN | 288 (JOEY) |
| A | 301 = NA | N |
| . | . | . |
| . | . | . |
| . | . | . |
| JOEYNJ | 400 = JOEYNJ | 300 (JOEYN) |
| JOEYNJO | 401 = JOEYNJO | 400 (???) |
A problem
Figure 5
When the decompression algorithm sees this input stream, it first decodes the code 300, and outputs the JOEYN string. After doing the output, it will add the definition for code 399 to the table, whatever that may be. It then reads the next input code, 400, and finds that it is not in the table. This is a problem, what do we do?
Fortunately, this is the only case where the decompression algorithm will encounter an undefined code. Since it is in fact the only case, we can add an exception handler to the algorithm. The modified algorithm just looks for the special case of an undefined code, and handles it. In the example in Figure 5, the decompression routine sees a code of 400, which is undefined. Since it is undefined, it translates the value of OLD_CODE, which is code 300. It then adds the CHARACTER value, which is 'J', to the string. This results in the correct translation of code 400 to string "JOEYNJ".
Routine LZW_DECOMPRESS
-
Read OLD_CODE
-
output OLD_CODE
-
CHARACTER = OLD_CODE
-
WHILE there are still input characters DO
-
Read NEW_CODE
-
IF NEW_CODE is not in the translation table THEN
-
STRING = get translation of OLD_CODE
-
STRING = STRING+CHARACTER
-
ELSE
-
STRING = get translation of NEW_CODE
-
END of IF
-
output STRING
-
CHARACTER = first character in STRING
-
add OLD_CODE + CHARACTER to the translation table
-
OLD_CODE = NEW_CODE
-
END of WHILE
The Modified Decompression Algorithm
Figure 6
The Implementation Blues
The concepts used in the compression algorithm are very simple -- so simple that the whole algorithm can be expressed in only a dozen lines. Implementation of this algorithm is somewhat more complicated, mainly due to management of the string table.
In the code accompanying this article, I have used code sizes of 12, 13, and 14 bits. In a 12 bit code program, there are potentially 4096 strings in the string table. Each and every time a new character is read in, the string table has to be searched for a match. If a match is not found, then a new string has to be added to the table. This causes two problems. First, the string table can get very large very fast. If string lengths average even as low as three or four characters each, the overhead of storing a variable length string and its code could easily reach seven or eight bytes per code.
In addition, the amount of storage needed is indeterminate, as it depends on the total length of all the strings.
The second problem involves searching for strings. Each time a new character is read in, the algorithm has to search for the new string formed by STRING+CHARACTER. This means keeping a sorted list of strings. Searching for each string will take on the order of log2 string comparisons. Using 12 bit words means potentially doing 12 string compares for each code. The computational overhead caused by this would be prohibitive.
The first problem can be solved by storing the strings as code/character combinations. Since every string is actually a combination of an existing code and an appended character, we can store each string as single code plus a character. For example, in the compression example shown, the string "/WEE" is actually stored as code 260 with appended character E. This takes only three bytes of storage instead of 5 (counting the string terminator). By backtracking, we find that code 260 is stored as code 256 plus an appended character E. Finally, code 256 is stored as a '/' character plus a 'W'.
Doing the string comparisons is a little more difficult. Our new method of storage reduces the amount of time for a string comparison, but it doesn't cut into the the number of comparisons needed to find a match. This problem is solved by storing strings using a hashing algorithm. What this means is that we don't store code 256 in location 256 of an array. We store it in a location in the array based on an address formed by the string itself. When we are trying to locate a given string, we can use the test string to generate a hashed address, and with luck can find the target string in one search.
Since the code for a given string is no longer known merely by its position in the array, we need to store the code for a given string along with the string data. In the attached program, there are three array elements for each string. They are code_value[i], prefix_code[i], and append_character[i].
When we want to add a new code to the table, we use the hashing function in routine find_match to generate the correct value of i. First find_match generates an address, then checks to see if the location is already in use by a different string. If it is, it performs a secondary probe until an open location is found.
The hashing function in use in this program is a straightforward XOR type hash function. The prefix code and appended character are combined to form an array address. If the contents of the prefix code and character in the array are a match, the correct address is returned. If that element in the array is in use, a fixed offset probe is used to search new locations. This continues until either an empty slot is found, or a match is found. By using a table about 25% larger than needed, the average number of searches in the table usually stays below 3. Performance can be improved by increasing the size of the table.
Note that in order for the secondary probe to always work, the size of the table needs to be a prime number. This is because the probe can be any integer between 0 and the table size. If the probe and the table size were not mutually prime, a search for an open slot could fail even if there were still open slots available.
Implementing the decompression algorithm has its own set of problems. One of the problems from the compression code goes away. When we are compressing, we need to search the table for a given string. During decompression, we are looking for a particular code. This means that we can store the prefix codes and appended characters in the table indexed by their string code. This eliminates the need for a hashing function, and frees up the array used to store code values.
Unfortunately, the method we are using of storing string values causes the strings to be decoded in reverse order. This means that all the characters for a given string have to be decoded into a stack buffer, then output in reverse order. In the program here this is done in the decode_string function. Once this code is written, the rest of the algorithm turns into code very easily.
One problem encountered when reading in data streams is determining when you have reached the end of the input data stream. In this particular implementation, I have reserved the last defined code, MAX_VALUE, as a special end of data indicator. While this may not be necessary when reading in data files, it is very helpful when reading compressed buffers out of memory. The expense of losing one defined code is minimal in comparison to the convenience.
Results
It is somewhat difficult to characterize the results of any data compression technique. The level of compression achieved varies quite a bit depending on several factors. LZW compression excels when confronted with data streams that have any type of repeated strings. Because of this, it does extremely well when compressing English text. Compression levels of 50% or better should be expected. Likewise, compressing saved screens and displays will generally show very good results.
Trying to compress binary data files is a little more risky. Depending on the data, compression may or may not yield good results. In some cases, data files will compress even more than text. A little bit of experimentation will usually give you a feel for whether your data will compress well or not.
Your Implementation
The code accompanying this article works. However, it was written with the goal of being illuminating, not efficient. Some parts of the code are relatively inefficient. For example, the variable length input and output routines are short and easy to understand, but require a lot of overhead. An LZW program using fixed length 12 bit codes could experience real improvements in speed just by recoding these two routines.
One problem with the code listed here is that it does not adapt well to compressing files of differing sizes. Using 14 or 15 bit codes gives better compression ratios on large files, (because they have a larger string table to work with), but actually has poorer performance on small files. Programs like ARC get around this problem by using variable length codes. For example, while the value of next_code is between 256 and 511, ARC inputs and outputs 9 bit codes. When the value of next_code increases to the point where 10 bit codes are needed, both the compression and decompression routines adjust the code size. This means that the 12 bit and 15 bit versions of the program will do equally well on small files.
Another problem on long files is that frequently the compression ratio begins to degrade as more of the file is read in. The reason for this is simple. Since the string table is of finite size, after a certain number of strings have been defined, no more can be added. But the string table is only good for the portion of the file that was read in while it was built. Later sections of the file may have different characteristics, and really need a different string table.
The conventional way to solve this problem is to monitor the compression ratio. After the string table is full, the compressor watches to see if the compression ratio degrades. After a certain amount of degradation, the entire table is flushed, and gets rebuilt from scratch. The expansion code is flagged when this happens by seeing a special code from the compression routine.
An alternative method would be to keep track of how frequently strings are used, and to periodically flush values that are rarely used. An adaptive technique like this may be too difficult to implement in a reasonably sized program.
One final technique for compressing the data is to take the LZW codes and run them through an adaptive Huffman coding filter. This will generally exploit a few more percentage points of compression, but at the cost of considerable more complexity in the code, as well as quite a bit more run time.
Portability
The code linked below was written and tested on MS-DOS machines, and has successfully compiled and executed with several C compilers. It should be portable to any machine that supports 16 integers and 32 bit longs in C. MS-DOS C compilers typically have trouble dealing with arrays larger than 64 Kbytes, preventing an easy implementation of 15 or 16 bit codes in this program. On machines using different processors, such as a VAX, these restrictions are lifted, and using larger code sizes becomes much simpler.
In addition, porting this code to assembly language should be fairly simple on any machine that supports 16 and 32 bit math. Significant performance improvements could be seen this way. Implementations in other high level languages should be straightforward.
Bibliography
-
Terry Welch, "A Technique for High-Performance Data Compression", Computer, June 1984
J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, May 1977
>Rudy Rucker, "Mind Tools", Houghton Mifflin Company, 1987
Source Code
Update - September 9, 2007
Reader David McKellar offers up some source code with the following comments:
I modified your fine 1989 lzw.c code so it outputs the same as the GNU/Linux compress command. Of course, I also made the corresponding changes in the expand part of your code.
I made the smallest number of changes possible. You can diff to see what I did. I realize your program is an example program to show the concepts but I think it doesn't add to the complexity too much to be compress compatible.
I should mention that my changes don't totally support the GNU/Linux compress. The real compress flushes its hash table when it notices the ratio goes down. It changes the bits used for encoding when the table is full. And probably more tricky stuff. However that is only an issue when lzw.c is attempting to read a compressed file.
Update - November 8, 2007
Reader Bogdan Banica contributes source code which is an adapted version of the program that operates on memory buffers instead of files: stream_lzw.c
Recommended Reading
If you are interested in data compression, the texts below are all excellent reference material, and I've included a link to WinZip just in case you need a copy of that. If you make a purchase on Amazon.com after clicking through on the links below, you'll be expanding your knowledge and supporting this site at the same time. Thanks. (Note: all links will open in a new window.)
228 users commented in " LZW Data Compression "
Follow-up comment rss or Leave a TrackbackDear Mark,
I have one doubt on "If that element in the array is in use, a fixed offset probe is used to search new locations. This continues until either an empty slot is found, or a match is found, the average is usually below 3".
My question is:
Will it be possible that one search will never finish or take a huge number of search (infinite loop) to find a new location for some unpredictable strings? Any upper bounds on the number of search during find_match() given the size of the table? Thanks.
Steven.
It isn't possible for the search to never complete - the hash table has maybe 10% more slots than will be used - so for example if the code size is 14 bits, you only need 16K slots, and the table is sized at 18K.
Because the number of slots in the table is a prime number, you are guaranteed to eventually visit every slot, meaning sooner or later you will hit an empty slot.
As to whether this is an efficient method or nort, all I can say is that the original authors of the UNIX compress program who developed this strategy though that it was generally efficient, and their tests came up with the figure that said no more than three probes were usually needed.
I'm not an expert on hashing, so I just used the same algorithm that was in that code. I think in general to really know if your hash table is performing well you just have to run tests and gather statistics - it's difficult to predict in advance what the usage patterns will be.
Dear Mark,
what is the algorithm used by Winzip?
Someone said it is LZW, but someone said it is hybrid of LZ77 and Huffman coding known as 'Deflate'.
Thanks.
The second someone is right, the algorithm popularized by PKZip and still used in most "Zip-compatible" compressors is called deflate. It is an LZ77 compressor that uses a Huffman compressor on its token stream. It was very fast and efficient when it was first used 18 years ago 9http://www.c10n.info/archives/430) and is still the most popular compression algorithm for general purpose work today. LZ77 and LZ78, despite the similar names, are completely different from one another.
Mark,
How close is the output of this algo to the Unix compress (LZC?) algorithm? Will the output of your 12-bit version generate a file that can be restored by 'uncompress'? I'm looking for code that is compatible with the Unix version. Thanks!
My code is very close to UNIX compress, but not identical. I think the best approach to getting compatibility is, unfortunately, to hack the compress source code to get it to do your bidding.
If you search long enough on the net, you might be able to find the UNIX compress algorithm as implemented long ago by the ARC archiver - that source should still be compatible but might be a little easier to work with.
This might also be answered by one of the experts in comp.compression, you should consider posting the question there.
You know, I had been looking all over the net for it, but all I found were man pages and such. Thanks for pointing me in the right direction -- I think I found an original Unix distribution version!
Will this code work as is for 16 bit unsigned integers? If not, can it be easily modified to do so and how?
thanks
Martin, any time you change the integer type from signed to unsigned you run the risk of a few bugs. But I'm curious, why would you care? If you are on a compiler that uses unsigned 16 bit ints by default, you can just change the declaration of all ints in this this program to "signed int" and you are free of the problem.
I suspect the hashing probe code will break, but I'm not sure about any of the rest of it.
Give it a try!
Thanks a lot. I compiled your code on Microsoft Visual C and it looks like it's running just fine despite the file format I'm using. The expanded file and the original had no differences. What I'm wondering now is what is the integer size of the compressed file? If this sounds like a weird question here's why. I'm proficient in MatLab which as you may or may not know is a lot like C with the biggest single exception being the uncompiled execution of code. I know the basics of C but I'm still in the dark about a lot of things. In order to read the compressed file into MatLab, you have to specify the binary type in which the files were written (i.e. uint16 = unsigned 16 bit integer, or uint8 and so forth). I hope you know what I'm asking, but thanks for the help you've given me already. I wrote a LZW encoder in MatLab and the execution time is at least 100 times that of your code so you may have saved me months of execution time.
thanks again
Martin, the compressed file is a bit stream organized as bytes - so if you are trying to read it in Matlab you are not going to describe it as being organized as integers. Unsigned bytes is probably the way to go.
The program should run and write your output to a file called test.lzw. After it completes, check to see if test.lzw is present. It then decompresses the file to lzw.out. lzw.out should be identical to your original input file.
If you want to just compress or just decompress, and if you want the output file to have a different name, you will have to modify the program.
I have a couple of question ...
1) I am looking for LZW source code (C or C ) that operates totally in memory (no files) and on text strings. Do you know where I can find one? All the implementations I can find use files.
2) Can the LZW be modified to take advantage of a list of "words" that will most likely make up the text? For example, imagine compressing a C source code file, could you preload the keywords for better performance?
I don't have an LZW implementation handy that operates on memory, although you could certainly modify this code easily enough - no big deal. Since LZW kind of loses out to the much more efficient deflate algorithm, maybe you could try zlib, which will easily do what you want and do a lot better. If not, start with this source and hack it.
To test your question number 2, simply preload the compressor with a list of words, then chop those same words off when the output is read. If the list of words is chosen carefully, yes, you will see a big improvement.
You might also look at my article on star encoding. You could use that to preprocess the input, and you'd probably see a big improvement there as well.
For those of you that like VB, this article on The CodeProject web site has a VB.NET implementation of this code:
http://www.codeproject.com/useritems/VB_LZW.asp
Please note that I find that this site moves articles around from time to time. If you find that the URL has changed, please drop me a line so I can fix this comment.
Is there a way to use your code with 7 bits output? I mean, I need to be able to print out the output. So, transforming output bytes from 0-127 to 32-159 will solve my problem :-)
It's often useful to be able to see the numeric values of the output, and it's easy enough to do. Just rewrite output_code so that all it does is
fprintf( output, "%d\n", code )
Then do the same thing for the input routine.
Of course, you won't get actual compression this way, because you'll be using at least a 3:1 multiplier on your output stream, but you will be able to read it.
Mark,
Thanks for this quick answer.
But it doesn't solve my problem: when I said "print out", I meant "print out in a textarea" (to be copied). Of course, your solution works (I did it) but, as you said, in this case, I'd lose the compression ratio :-(
Well then, I don't understand exactly what you want. (For example, what is a textarea?) If what you are saying is that you want the output stream to only use characters 0x20 through 0x7f, I would suggest you just uuencode the stream. Still, you're going to lose most of your compression that way, because it's going to expand your data by at least 2:1.
I think you are stretching for something that is not quite possible.
Mark,
Thanks again.
A textarea is a zone on a web page where you can type text in (like the one I'm typing in).
I ported your code to javascript and wanted to store the output in a textarea as you can manipulate files with it.
Too bad, it's not possible... anyway, it was quite interesting!
Dear Mark,
Have you ever written Code in C to calculate the entropy and public available?
Thanks.
Aloha, remember that calculating the entropy always means "calculating the entropy with respect to a specific model." The idea that there is an absolute entropy that we can calculate is sort of a red herring.
So to answer your question, no. If I wanted to know the entropy with respect to an LZW compressor, I would just compress it with this code and see what the file size turned out to be!
Dear Mark,
Yes, the book written by you mentioned that the entropy depends on the model.
I would like to know how to calculate relative entropy by using LZW.
Thanks.
compress file.
entropy = -logbase2( ouput_file_size/input_file_size )
Hi Mark, your article say: "LZW is always trying to output codes for strings that are already known. And each time a new code is output, a new string is added to the string table."
My question is if It`s posible (or practical) create an external file, containing a large numbers of redundant strings in a arbitrary type of file (like JPG or other) and its code-equivalent (like a external dictionary), and make the code rebuild the original file looking at this "external dictionary"?
I´ve thinking in this concept for a while, and I cant figure out the code.
Thanks and greetings from Argentina!
What you're talking about is feasible, but what you end up with is not really LZW, it's something else.
To stick with the LZW paradigm and use an external dictionary is not too hard. When you are running the compressor, you simply load your dictionary file first, which adds the strings to the internal LZW dictionary. You then compress the file using the modified dictionary. On decompression, you will have to do something similar to load the dictionary before processing your compressed data.
Giving up on LZW and doing what you are suggesting is more like star encoding. See my article here for details on that:
http://marknelson.us/2002/08/01/star-encoding-in-cpp/
I want to know what the pros and cons are when lzw compressing tiff image files that will be e-mailed to a Lab. Do I risk loosing data? As I am not a programmer but a novice photographer please reply in simple english. Your help in understanding would very helpful.
you don't risk losing any data, lzw compression is lossless - it doesn't alter the data in any way. as long as your lab is able to accomodate the format you can save some upload time by using it.
Mark,
I'm using your compression algorithm in an other way:
for a particular reason I have to store binary data as a basic text-file without any bytes being interpreted as control-characters etc.
So I modified your algorithm (MS Access) in order to generate only byte values between 32 and 127 (basic ascii) and to store each LZW-code in 2 bytes (16 bits).
Although at first glance it looks like a poor compression, it's comparable with a 13 bits compression with the overhead of storing 13 bits in 2 bytes. The compression ratio degradation is less than 20%.
Maybe this idea is usefull to others.
Example of a text-based compression:
]~]E]R^w^i^n]4] ]F^i^l^e]V^e^r^s^i^o^n]=]"]4]2]...
Robert
Do anyone has an implementation with java?
Edge, I don't have any familiarity with any Java implementations, but a Google search on "lzw java" turns up 220,000 hits as of today so I don't think you'll have too much trouble.
this entry says:
I just compiled your test program and tried it out on an mpeg 2 formatted file and the compressed file was 30% larger than the source! It would appear that LZW is not optimum for the compression methods already inherent in the mpeg 2 file. The source file was 17M and a huffman compression reduced the size by about 500k. Win rar compresses the file by about 1M, clearly it is using some other compression algorithm.
Jabberwocky -
MPEG-2 is already pretty tightly compressed. In general, attempting to run a second pass on compressed data is not too productive, so your results aren't too suprising. WinRAR has a number of good algorithms, most of which are considerably more powerful than LZW, so squeezing an extra couple of percent out of the MPEG-2 file is possible.
If you really want to shring the MPEG-2 files, the best way to go about it is to transcode them to H.264, in which case you could easily shrink them by 50%.
Thanks very much. It'S a very good article.
i have a question , when i type "file.txt" ,i can get "file.lzw" and "file.out", "file.out" is the same than "file.txt". But i can't type "file.lzw" to get "file.out" .... why ? it's not a decompression program??
kiwi, this is just a demonstration program. It does both compression and decompression of a single file. If you enter file.txt as your input, it compresses it to file.lzw, then decompresses it to file.out. You can then compare to make sure the round trip was successful.
If you want a program that just does compression or decompression, it should be easy enough to take this program and give it some command line options to do just that. But for the purposes of making this program illustrate the topics in the article, that wasn't really necessary.
now i give some command line
char *a="test.lzw";
char *b=input_file_name;
if (strcmp(a,b)==0)
{
expand(input_file,output_file);
fclose(input_file);
fclose(output_file);
free(prefix_code);
free(append_character);
exit(-3);
};
but the result not correct , can you mark it where i got wrong?
No, sorry, I can't really tell from just that code fragment. I suggest you walk through the debugger - both in the original program and in your altered version and see if you can tell what you're doing wrong. It's pretty hard for me to debug your program in the comments section of the article. :-)
^^ that's ok , thank you again !!!!
Here is a VB implementation of LZW and Huffman compression:
http://www.danesfahani.com/Telecommunications%202/Tels2%20Simulink%20Simulations.htm
Dear Mark,
I would like to thank you for this article. I'm a 16 years old student from Holland and for school, we need to make a final work. I have chosen to do something with datacompression, I already wrote the Huffman compression program, know I'm almost finished with LZW. Your article is very helpful to me and I learnt much about LZW!
Peter.
Hello, Mark.
Is it possible to Gif encoding because using your lzw sorce?
It's possible that you could adapt this code to do GIF encoding and decoding. But it would be a bit of work.
I recently found a pretty easy to use C++ GIF codec in the VIGRA project:
http://kogs-www.informatik.uni-hamburg.de/~koethe/vigra/
The VIGRA license is pretty reasonable for non-commercial use.
Download vigra1.5.0.tar.gz and look in the source folder for gif.hxx and gif.cxx. They've added a framework around it for portability with other image codecs, you can strip it out without to much effort.
Hi, Mark,
I've already written an lzw implementation with Python for my university assignment. The biggest trouble was to make it read different length symbols, from 2 to 13 bits, as a symbol. Your article is great, and as you mentioned, my encoding program is very slow because of searching symbols in a 2k-8k dictionary. I dont understand why decoder doesnt need hash function.. as if the encoder stores strings in a dictionary at hash function indexes, the decoder should make the same dictionary, so the indexes are the same for same strings, to decode correctly (because compressed file is made from those indexes). Dictionaries must be identical, as i understand. then the question is, if the location in the dictionary at hash index is used, it jumps some other place for a fixed number. but how the decoder should know, if to take the string that is in that hash index place, or to jump further and take next string.. your program with xor hash and bit operations is to difficult for me to understand, i want to make something more simple, but cant understand the basics of hashing in this case. maybe you have wrote this in your article, but i read several times, and still cant understand why decoder doesnt need hashing (im not english speaker), how then should it store strings in the dictionary, that indexes match.. sorry for the big post, i just wanted to explain everything clearly.
greeting from Lithuania, Vilnius University
thanks, Aleksandras
You don't need to use a hash when decoding. This is because the strings are added to a table as sequential indices, and because of this are easy to look up directly. For example, if I see a code of 258 in the input stream of an encoded LZW file, I just look at location 258 in the table to get its definition.
The dictionary is structured different when encoding, which means you can't just look up an index. For example, if I want to see if 'A' followed by 'B' is in the encoder table, I can't just create an index by concatenating'A' and 'B' - this obviously won't work. unless I have a 16 bit table, which is usually not the case. And that won't work when I try to create an entry for "AB" 'C'. I have to combine the string and the following in character in some way to create an entry into a smaller table, and that's where the hash function comes in.
thanks a lot for the trouble Mark ! i laid in the bed, and just understood everything :) i just missed one tiny thing, maybe thats because im too exhausted doing those assignments.. thank you for your comment, it helped to put dots on "i" :) good luck to you !
Hi Mark,
Would it be possible to compress a text file with LZW, then search the compressed text without decompressing it? My idea was to compress the keywords with the same dictionary, then do something like a Boyer-Moore search, but I'm not sure if (A) it would even work, or (B) it would be accurate. I'm pretty green when it comes to compression, so any input you can give would be very helpful.
Thanks!
There are some (but not many) specialized algorithms used for searching through compressed text. I don't know of any that work with LZW. This is an area that people ask about fairly frequently, but I never really have a good answer or a good direction to point them.
As an example, here is somebody that has a commercial product that is designed to search through compressed text:
http://www.techworld.com/features/index.cfm?RSS&FeatureID=3137
The only person I know of that has any substantial work published on this is Paolo Ferragina, who has a few good papers and some software you might be interested in.
http://roquefort.di.unipi.it/~ferrax/index.html
hello, I have a question, how can I modify your LZW implementation for read different length symbols? I need to process symbols with 12 bits length.
DJ. the code posted with article reads 12 bit symbols by default, and you can change it by simply altering a #define value. Since it already does what you are asking, I would suggest that you don't need to do any modification.
I'm a little confused because in your implementation the dictionary length is 12,13,14 bits, but, i think you're processing 8 bit word. It's really right or am i mistake?
I have a file with 12-bit words and I need to process 12 bits really. Do I have to make some modification?
...thanks for your time.
Ah, you confused me by saying "symbols with 12 bit length."
The LZW.c encodes source text composed of an eight bit alphabet into (by default) 12 bit symbols.
So you are saying that your source text is composed of a 12-bit alphabet.
Yes, you can adapt this program to deal with those words without too much trouble, but it will require modifications of the key data structures and the hashing algorithm to fit the new size. In addition you will obviously need to adjust the input and output routines.
Hi Mark!
I m a student from India. At first thank u very much for giving a clear understanding of the LZW algorithm. I have also seen the source code. Though tough, but is understandable. But I m not getting anything about the function
"void output_code(FILE *output,unsigned int code)". How u have entered the characters into the file using a code buffer? Plz make it clear 4 me.
The output_code function is a pretty simple piece of code that supports the output of tokens in sizes different than the standard eight bits.
Most compression algorithms either use a bit stream of some kind or write words of varying size, so they are all going to have an output_code() routine of some kind. If it doesn't make since to you, I suggest you brush up on your C shift and mask operators.
Sir,
we are working on a algo of partial string matching.....but only for text files...can you please help us with any study material that would help us for the same.......
A good topic for a final year project might be to come up with a good implementation of the LZSS algorith, or some other variant of LZ77. The classic paper on compression by Ziv and Lempel from 1977 is available in illegal copies all over the internet, like here:
http://www.stanford.edu/class/ee398a/resources/ziv:77-SDC.pdf
Ziv and Lempel didn't concern themselves too much with implementation, so it was many years before programs like PKZip actually started using this algorithm.
The most interesting part of an LZ77 ipmlementation is the requirement that you look back through previously seen string matches. Replacing those matches with pointers is what allows you to achieve compression. You can implement a program that does a brute force search, but it will be hopelessly slow.
Your assignment, then, would be to create an algorithm to look through a bounded window of previously seen text for matches, and discuss the various parameters and compromises one must make to get reasonable performance.
Quick note, think there is a small bug in the second pseudocode for decompressing (modified decompression algorithm):
The line:
add OLD_CODE CHARACTER to the translation table
should be (I believe)
add (translation of OLD_CODE) character to the translation table
the reason is that OLD_CODE will always be a single character, and if there is enough repetition then the string to be added will be more than two characters.
Cheers, and thanks for the great description!!
Matt,
No, that's not the the way it works. Let's say old_code has a value of 555, and translates to "ABCD".
If you look at the implementation, we definitely store the value 555 in the translation table, not "ABCD". In LZW.c, old_code is stored in an integer array called prefix_table, and the character being appended to it is stored in append_character, a character array.
If you look at the decode_string() routine, you'll see how that provides enough information to decode any known code. Given a code n, I decode it in a loop like this:
string s = "";
while ( n > 255 )
{
s = append_character[ n ];
n = prefix_character[ n ];
}
s = n;
I work my backwards through older and older codes until I finally find the original character that started the sequence. String s then contains the decoded string, backwards.
Hi Mark,
Could this algorithm could be implemented for use in mobile phones using J2ME?
You could implement this in mobile phones, you'll find plenty of examples of LZW written in Java. But J2ME includes the deflate/zlib library, which performs quite a bit better and is very well tested. You'd be much better off using that.
Hi...... do you know about other implementations with best performance: compression ratio and time? What can i do for i to get better compression ratio and less time with this LZW implementation?
hiiii, are you there?
What can i do for i to get better compression ratio and less time with this LZW implementation? or
do you know about other implementations with best performance: compression ratio and time?
If you want a better algorithm, I suggest using zlib with the deflate algorithm. It's faster and performs much better.
Hi Mark,
After few days of sitting above the code, I decided to write here.
My english isn't even good, but I hope that you'll understand me (I'm 16, from Poland).
I'm working on a database project using FLTK and C .
After closing the main window, app is writing all the data into the file base.txt.
After it, it calls com() function: http://rafb.net/p/bg7wFK96.html (It compress the file and delete old base.txt file, leaving just base.lzw).
But my problem is, that not everytime compression is ok, sometimes, last few words (or lines), is crashed up. I'm wondering if it's your algorithm problem or mine, cause without compression, everything's ok. Maybe I missed something ? could you someday remake your LZW source code, to use new c technicues?
Sorry for my post length and a problem. Regards and have a nice day.
Hi marcin,
This code has been posted for a long time and I haven't had any reports of decompression errors. Do you have a specific file that demonstrates a failed compression/decompress cycle?
Theoretically everything in your example is ok... hmm, look at my 'algo':
- decompress file.lzw to file.txt
- open file.txt
- make changes etc in file.txt
- close file.txt
- compress file.txt to file.lzw
And sometimes compression is ok, sometimes last lines are messed up. But if there were no errors commited, I think that the problem is by my site. So sorry for problem.
hi there, it's me again.
I found out what's wrong. It's not cause of your algorithm, but error with my implementation of ARC4 coding.
Sorry for problem, and congratulate of best article refering to the LZW compression! Thank you!
Hi Mark,
Is this LZW also same as GIF89a?
Gif89a uses LZW as its compression method. You won't be able to just plug in the code from this example, there are minor variations in the implementation, but yes, same algorithm.
An interesting article that people might be interested in is located here, titled:
Multiple Pattern Matching in LZW Compressed Text
Abstract:
In this paper we address the problem of searching in LZW compressed text directly, and present a new algorithm for finding multiple patterns by simulating the move of the Aho-Corasick pattern matching machine. The new algorithm finds all occurrences of multiple patterns whereas the algorithm proposed by Amir, Benson, and Farach finds only the first occurrence of a single pattern.
Hello, very clear article. another question about entropy. Lempel-ziv entropy is been used in analysis of EEG signals, it is a relation between length of dictionary and length of signal, I think. I am doing my ph, and I need calculate that in matlab, so I will port your code and... how do I calculate the entropy? could you please give any advice?
thanks in advance
Hi Mark,
Can you please explain how you are printing the output into the file?
When we open the file test.lzw in a notepad (or changing it to test.txt and opening it in notepad),
we cannot see the symbols clearly.
Also i could not understand the shifting operations that are performed in both the find_match routine and output_code routine. Can you please explain these operations?
Extension of the above query.....
I actually want to incorporate a Hamming(7,4) code on the compressed file, introduce 1-bit error (randomly, but for every 7 bits), correct it , and then decompress the file to get the original file.
I have separated the encoder and decoder as two separate files and now I need to read the compressed file as binary data(blocks of 4 bits).
What modification in the output_code routine will make this possible?
I have tried reading the compressed file and creating a new file with 8bit ASCII values of the symbols, but the file terminated prematurely.
Please help me, I can compromise a bit on the compression ratio but it should not be bad.
Juan, LZW seems completely inappropriate for processing of EEG signals, unless they have been heavily preprocessed, or perhaps transfered to the frequency domain. Still, it doesn't really make sense. Analog signals such as the EEG are not particulary well served by processing with dictionary-based algorithms.
You don't really need to port my code to Matlab, you can get a couple of implementations from Mathworks.
My advice would be to pick another topic, this doesn't make sense.
Sainith, test.lzw contains compressed binary data, you are definitely not going to bea ble to view it in notepad. If you want to see nicely composed symbols you are going to have to modify the source code.
The shift operatins are all standard C code. What part of it isn't clear, maybe I could explain if I had a more specific question.
I want to know why you used variable shifting in " output_bit_buffer |= (unsigned long) code > 24,output); " both these lines are from the output_code routine.
I have tried to understand the code by giving a small input of 5 characters and writing down the steps and analysing them, but unable to understand it.
What is the modification required in the output_code routine to print the output in binary form so that any error correcting code(like Hamming(7,4) say) can take the input from this compressed file.When I tried to read directly it gave a lot of errors. Please suggest the modification to work for my program.
Variable shifting in " output_bit_buffer |= (unsigned long) code > 24,output); ".
These are the correct lines. Sorry for the mistake.
Anyway my main concern is to read the compressed file to encode error correction using hamming(7,4).
And to use the same for the decompression routine. Hence suggest some modifications for this.
Hi...What's the algorithm complexity? computational time?
Hi MG,
As to to algorithm's time and space complexity, I'll leave that as an exercise for the reader - sounds like you've been given that exercise ;-)
yes, my question is...for example the string: ABBBCCCC
so, AB is the first code, ABB is the second, ABBB is the third, ABBBC exist?, ABBBCC exist?, if ABBBC not exist then the maximum string length is 4 (ABBB), else, the maximum is 5 (ABBBC).
in other words, What's the maximum number of symbols in one comparison(to dictionary)
Hi great article by the way has been invaluable for a college project ive undertaken i have one small question, which is why the TABLE_SIZE constant is et to 5021 i understand the MAX_VALUE and MAX_CODE settings but am at a loss as to why 5021. I am probably being dense here but cannot figure it out. Thanks in advance
Gordon, the TABLE_SIZE is the size of the table that holds all the tokenized strings. When MAX_CODE is 12 bits, the maximum number of entries we will use in the table is 4096.
Because collisions in the hash table reduce its efficiency, we jack up the size to be a bit bigger than 4096, which reduces the likelihood of a collision. And to top it off, we need the size to be a prime number. The probing mechanism needed used in this algorithm is only guaranteed to find an empty slot if the table size is prime.
Hashing isn't too complicated, and you can find good explanations of it on the net without too much trouble.
Hi Mark Thanks for the reply, everything is clear now. Guess i was being dense and should've realised the MAX_CODE was a prime to ensure probing of the hashcodes was successful, and larger than table size to ensure a load factor, covered hashtables this year too so am feeling very sheepish, thanks again
please said to me if the lzw.c witch is in this web site can be compiled by unix or said to me witch langage i can compiled him whith it( visual c or.....)
Juliette, the code is written in very generic ANSI C, and should compile with any C compiler you are using. Visual C and gcc should both work.
Hi, just wanna say I'm performing a research on LZW vs. Huffman. Well, i know it's a long issue but anyway I'm still doing it.
You know LZW won't work well with random data. I actually mean data with little repetition. But if that data has a more frequent characters than others it will be well compressed by Huffman. Just try it with pi. you can get a 1 million number of pi from the calgary canterbury corpus.
The actual research i was actually doing if i can switch to huffman if the data is random and to lzw otherwise. But it seems i cannot do that without reading the file first. So now my problem is how to detect the randomness of a file while i am encoding.
If you want to comment please do so. Thanks.
You're tackling a difficult problem :-)
Characterizing "randomness" is not easy (not that I'm an expert in the area.) My personal bias is to define randomness using a pseudo-Kolmogorov Complexity definition, which basically says that you define how random a stream is by how well it compresses wrt. a specific model. I don't accept the notion that there is an absolute measure of randomness.
So, for example, a binary representation of Pi will not compress at all with an LZW compressor, and it will not compress at all with an order-0 Huffman coder. But does this make it random? Not at all - you can write a very simple program with perhaps a couple K characters that generates as many digits of PI as you please - resulting in a compression ratio of thousands to 1.
In the case you are describing, data with little repetition but a strong bias towards certain characters. I think you might be surprised if you try to come up with data sets and compare order-0 Huffman and LZW. I suspect that their performance tracks one another fairly closely. Let's say that a binary file is biased so that it has 50% character 'A' and the remaining characters are all randomly distributed. LZW is still going to pick up on that by generating codes for all the strings that start with 'A' more frequently than other codes.
I think it's a good topic to experiment with - but the first thing you have to do is come up with an adequate characterization of "randomness", and I think that's the hard part.
Keep us posted!
Well thanks. and you're right the randomness part is the hardest in the research. Even if we can use certain tests...it would not provide a significant drag in the performance of the compression...because I was actually considering an online approach.
Actually I'm now in the process of testing their performance according to generated random sources.
Don't worry I'll go back here.
Anyway I like your site because you don't have to do tiring registration and you're editor also has a built in spell checker. :)
hello, i am new at this algorithm and all. i need help...can i do an image compression using LZW algorithm in MatLab? Thanks..
yes, certainly, with something like this:
http://www.mathworks.co.uk/matlabcentral/fileexchange/loadFile.do?objectId=14741&objectType=FILE
hello mr mark.. maybe my english is very bad, i'm sorry. you just teach us how to compress using LZW algorithm but you don't teach us how much the rasio ( percentage this algorithm can save ). i hope that you want to teach us. I'm making a thesis with this algorithm, it's very difficult to find the literature in indonesia.. i don't understand how to calculate the rasio. If you have some literature, i hope you can send to jack_padang@yahoo.com , it's my email.. i will very gratefull to you. thank's before
You generally calculate the compression ratio as simply output-size/input-size. Thus, perfect compression would give a size of 0, no compression would give a size of 1. You can also multiply by 100 to make it a percentage.
Frequently when compressing images, people multiply the compression number by the size of a pixel to give a bpP number - bits per Pixel.
Hi, Mark,
THX a lot for this article and source codes. It is really helpful to my understanding of this algorithm. I am a software engineer working for a company, and a project requires compression. If I use your source codes in LZW.C in my project, does my company need to pay any fee? The company lawyer told us that since your article and source codes are published in the journal "Dr. Dobb's", it meant it was free for us to use for a commercial product. Is this true?
In general, the questions are: 1) Is it free for us to use for a commercial product? 2) What kind of license for your source codes?
Thanks a lot.
Scott Pan
Hi Scott,
My code use policy is here. Basically, the answer is yes, you are free to use this source.
However, I highly recommend you think about using zlib - it is fast, efficient, well-supported, and free. Plus, the deflate algorithm will outperform LZW in almost all cases.
Mark
There's a neat alternative implementation of the compression (data->code) and decompression (code->data) tables for in-memory compression (where all of the source data and compressed data is in-memory simultaneously). Rather than storing your code-mapped-data (your "strings") as either raw data, or a code character combination, you reference an offset/length in your uncompressed data (for the compression side) and similarly an offset/length in your decompressed data (for the decompression side). This works because all codes in the table map to contiguous data in the uncompressed data, and that when decompressing you always output data before adding new codes.
The advantage of doing this is that it makes transcription to/from the tables trivial, as it does the equality test for the hash-table lookup.
J
Yeah, I think I can see how this would be good. Of course, it's an unusual situation, and probably is only useful in a few situations... generally you won't have both tables in memory simultaneously.
Ah - I didn't mean that you have both tables in-memory simultaneously - the compressor would have the uncompressed text and the compression table in-memory simultaneously, the decompressor would have the decompressed data and the decompression table in memory simultaneously. For messaging-based applications this can be a pretty useful approach.
[...] My 1989 DDJ article on LZW data compression, including C source [...]
This might be a little off-topic, but I'm curious to know...
Do you know if RAR implements any part of the basic LZW code you have shown here, or is it more along the lines of variable deflate/zlib/arithmetic coding/range coding/bwt and other combinations as a hybrid algorithm?
@james:
RAR doesn't use LZW for compression. It chooses from several different algorithms depending on what it decides about the characteristics of the data.
I believe that the general purpose RAR algorithm is undisclosed.
This might sound a stupid question but i am unable to unsderstand and looking forward for your reply.
Question is "in your code while outputing any data you have
static unsigned long output_bit_buffer=0L;
and then
putc(output_bit_buffer >> 24,outfile);
does this mean that first you right shift output_bit_buffer 24 times and then write 32 bits in the output file? "
Because when I am trying to compress a file i get the output file size larger than the original file.
@jaidee:
putc outputs a single byte, so this line of code is used to first shift the bit buffer right by 24 positions, then output the result.
In effect, it outputs the top byte of that long value.
The C I/O library is oriented around bytes - that is all it knows how to read and write.
The LZW algorithm wants to read and write codes of variable length, generally greater than eight bits.
Because of this, we have to shift the bits into a temporary location (output_bit_buffer), then write them out eight bits at a time as the buffer fills.
I realize this code might be somewhat confusing, but most libraries and languages have no support for bit-oriented I/O, so this is what we are stuck with.
But why am I getting the size of compressed file (i.e. test.lzw) greater than the original file size?
I can't tell you that, I don't know what you are doing.
LZW won't compress every file - some files will actually get larger. But a standard test file made up of just readable text will definitely be compressed.
Mark - thanks for your demo and for the support you kindly do!
As others have already mentioned I was after a version of LZW that would work on memory streams of bytes, as opposed to files, so I've taken your algo as a basis and changed it here and there so it works as such.
How should I send you the modified file so you can include here for others to use?
@Bog:
If you mail me a copy of the program I'll be glad to add it to this article. Contact info on my About page.
thanks for this. i'm understanding the algorithm better.
but i'm a little confused about how the modified decompression algorithm would work. on the first try, isn't CHARACTER just a copy of OLD_CODE and essentially the same character? and because NEW_CODE wouldn't be in the dictionary at the point, would ENTRY just be (in this case) //?
@lehdi:
When you enter the loop OLD_CODE and CHARACTER have the same value, and OLD_CODE has already been sent to the output.
In the first pass through the loop, we then read in a second compressed character, which is stored in NEW_CODE. When we try to translate it, we will normally succeed, and get a single character translation. In the example in the article, we first read '/', and output it immediately before entering the loop.
We'll then read 'W', translate that to a string, which will also be "W", and output that. We then add '/' and 'W' to the table.
I recommend walking through the code for a simple example to see how it works.
Also, I am working on an updated version of the article that will include much simpler source code. By using modern C++ containers, I can implement this so that it's a lot easier to read.
How much efficient this compression algorithm are?. can it reduce 1GB "any" file to 1MB?. Mainly movie and data files.
So far i see all this compressions are useless. They do only 20%.
unless it is filled with same characters.
@Alex:
LZW compression is not really suitable for movie files - movies need a lossy compression algorith, such as H.264, in order to see substantial improvement.
It will work fine on text, but if you want higher performance compression, I suggest you look into the deflate algorithm, as implemented with zlib.
Can you teach me how to organize data structure to program lZW? I was read LZW.c, but i cant understand it. I want to know by step how to compress a string or row of integer. How to print result of compressed string or row of integer.(By pascal or nature language). Thanks a lots!!!
@TuanSon:
What do you need to know that isn't in this article? I don't know what else I can add to it.
Thanks your reply!
I have 2 questions to you:
At first, i want to process on a string. Do you have a simple source code for me?
Can you show me how to use 12->14 bits for string in table? Because i dont know how to use it. 1byte=8bits, 2bytes=16bits, 12 or 14 bits=?
Thanks
@TuanSon:
LZW.C is the only source code i have.
Reading and writing data of non-integral byte size is done with input_code() and output_code(). The source is pretty simple, if you take a look you should have an easy time seeing how it works.
Program lzw.c does not work with gcc. Segmentation fault at 289 line in function decode_string. I think function input_code(FILE *input) works not right.
Error occurs because of on my machine
sizeof(long)=8 (not 4). So, input_bit_buffer
@Neo: Okay, that makes sense, sorry for the problem. When the program was published in 1989 it was pretty unusual for a machine to have a 64 bit long. Nonetheless, the fact that it is hard-coded for 32 bits is a bit of a mistake.
You can make a quick fix by changing input_code and output_code to use an unsigned int instead of an unsigned long for the input bit buffer and output bit buffer.
Dear Mark,
Thanks for your code. I have a question about the LZW decoding. Am I right that any bit stream can be decoded by the LZW decoder, even though, the decoded symbol string is incorrect? For example, if I input 0000..0, namely, all zero sequences into the decoder. It seems that the decoder can still decode it into aa..a, where 'a' is the symbol corresponding to index 0 in the table. However, if we encode 'aaa..a', the obtained bit stream is not '000..0'. My question is: Is this statement right?
Many thanks
JT
@JT:
No, this definitely not the case. It would be very easy to give the LZW decoder an bit stream that it would recognize as being in error. For example, if the first symbol decoded by my program had a value of 260, it would generate an error.
If you want to find out more about this search the comp.compression archives for the phrase "bijective." David Scott has invested quite a bit of time making various algorithms obey the property you describe above. However, I don't know whether he has a modified version of LZW.
@Mark:
Thanks for your reply. I am thinking how we can find all the bit streams that can be detected as en error by the decoder. These bit streams can be treated as 'forbidden' bit stream that can be used for error detection. I am also thinking whether the initialization or the insertion algorithm will affect these 'forbidden' bit streams. If yes, we can design a LZW algorithm that has such kind of error detection capability. Any comments on that?
Thanks
JT
@JT: Two comments on that.
First, check into David Scott's work on bijective coding, as he has been working on similar problems.
Second, I am actually in the middle of an article on forbidden symbol usage for error detection in network streams. My particular implementation uses an arithmetic coder engine that simply reserves a fixed percentage of the symbol space for a special "error" symbol. In the case of a transmission error, you will sooner or later bumble into the forbidden symbol, with the amount of time varying depending on how much symbol space you have carved out for it.
With LZW, the easiest forbidden symbols are those for which we have not yet defined a code.
First off, thanks for the fantastic discussion and examples! Way back in the early '90s, I worked for a game company and your code from DDJ was used in our routines to compress graphic images. We used 2 or 3 different compression schemes. Recently, I came across some old files that I have not been able to uncompress. They don't appear to be RLE - looks more like LZW to me. I tried every piece of C code I could find - but alas, I guess that particular code is just gone. (it's been 16 years!) However I DO have 2 sample files that are both compressed and uncompressed. So because I have a 'before and after' state for those, I was hoping to be able to figure out how to uncompress the other dozen files. Could you help?
DL
@David:
The best way to get to the bottom of a 'what format is this' question is usually with a post to comp.compression. I've seen a few pretty obscure formats decoded there. If you have a nice hex dump of the first part of the message there's a good chance somebody will have seen it before.
I'm no format expert, so the chances of me recognizing it are pretty slim, sorry. :-(
Thanks! I'll put up a post and see what happens!
ei mark.
hi..i am currently doing a thesis on text compression algorithms and i ran to your code and convert it to java. I find it very useful but i was a bit confused with the bitwise part.
anyways, thanks man for the code..it was a great help.
while (1)
{
if (code_value[index] == -1)
return(index);
if (prefix_code[index] == hash_prefix &&
append_character[index] == hash_character)
return(index);
index -= offset;
if (index
[...] was designed to improve upon the GIF format. But if you want to read further about GIF’s LZW data compression, the GIF file format, the JPG file format, the PNG file format, or a comparison between image file [...]
Hello Mark!
Im trying to compress bmp->gif. I was tried on string, that right but in image-> wrong. Can u show me how to compress bmp->gif? I think output_code( ) was wrong in image.I was knew bmp and gif structure, but i can't do it.
Tuanson:
The LZW encoder used in GIF files has a few differences from my code. You can find a number of LZW codecs designed specifically for GIF format by going to http://www.koders.com/ and entering search terms lzw and gif.
hi,Mark!
I am using lzw to compress data with Matlab, I do not understand what can I do when the dictionary is full, but at the same time the data has not compressed fully yet. Should we just reset the dictionary and then continue to encoder? If so, when we do decoding, should we reset the dictionary too?can we get the right data?
@marie:
Normally when the dictionary is full you have two choices:
1) Just keep compressing using the dictionary you have. Don't add new phrases, but continue using the old ones.
2) Flush the dictionary completely and start a new one from scratch.
In both cases, the decompressor has to be aware of this behavior, normally flagged using a special message. Because there is no LZW standard, you are on your own for implementation.
When the dictionary gets full under UNIX compress, it will keep using it, but it monitors the compression ratio. If compression starts to drop significantly, it drops the dictionary and starts from scratch.
If anybody wants a simple example on how to use LZW with GIF then I would recommend..
http://www.koders.com/c/fid02C7FC781D17762D22469C3EE59E1F20E5692EAA.aspx
http://www.koders.com/c/fidE48FFD96079A1031E34F983550AC0E5B7DA6F4B6.aspx
Hi Mark!,
I am new to LZW.I want to know what are the file formats(*.txt,*.doc,*html,*.jpg...etc) does LZW supports?Also how is the performance of LZW for files more that 1MB since the algorithm reads character by charcter?
Thanks
-Veeresh
@Vereesh:
LZW can compress any kind of file, it doesn't matter what the type is.
Reading a file character by character isn't necessarily a bad thing... In addition, a good algorithm will actually read in big chunks of data and then process it character by character.
Hi,
Do you know the fastest LZW implementation so fat and its speed?
Thanks
JT
@jt:
No, I'm sorry to say I don't have any sort of statistics on this.
I want code of MATLab for Lempel ziv encoding. I have the code that called (norm2lzw.m). But, it doesn't work. how can i make it work. it needs some modify to work. please, if any bady know, tell me now. because i need to submit the project.
Dear Mark,
In the standard LZW, each pointer is encoded using fixed length coding of 12 bits (assuming the dictionary size is 4096).
In GIF, it uses the minimum bits to encode each pointer (e.g. use 9 bits to encode the pointer '256'). This can be easily understood in the sequential insertion, namely, the new string is inserted in a sequential manner. However, in hashing-based new string insertion, it seems that we cannot use this variable-length coding to encode the pointer. Am I right?
Thanks a lot
JT
@JT:
JT, I'm not sure I know what you mean. ALL types of LZW build libraries with sequential codes that increase as each new string is added. It doesn't matter how you store the string table.
UNIX compress provides a nice example of how this works.
To tell you the truth, I still haven't sat down and worked out how they handle the implicit code lengthening when it occurs at just the wrong place in the STRING, CHARACTER, STRING, CHARACTER, STRING special sequence. I need to work that through, maybe it's not a real problem.
Let the input symbol sequence be S = s_1s_2...s_M, and the
alphabet size N = 256. Since s_1 can be found in the initial
dictionary, the first pointer I_1 satisfies I_1 \in [0 255]. In
GIF, I_1 is encoded using 8 bits (instead of 12 bits). Then the
new string s_1s_2 in inserted into the 256th entry. Hence, the
next pointer I_2, whatever its valus is, can be encoded using 9
bits. Following similar fashion, the consequent pointer can be
encoded with gradually increasing bits.
However, in hashing-based insertion method, the new string s_1s_2
may be inserted into the 4094th entry for example, which leads to
the fact that the next reference of s_1s_2 has to be represented
using 12 bits. This breaks the gradually increasing length
structure as mentioned above.
My question: How GIF solve the problem of incorporating the
variable-length coding of pointer and the hashing-based insertion?
Hope it is a bit clearer.
Thanks
JT
@JT:
I am not aware of any LZW encoding scheme that uses the system you are describing.
Dear Mark,
I was trying to use your code to compress and decompress an unsigned char array:
unsigned char array[12] = { 0x01, 0x01, 0x00, 0x12};
and then right the compressed code to an another array:
unsigned char array_code[50]={};
then used the "array_code" to expand it to it original form and write it to array_output[50].
I have been trying to analyze this code and some reason but for some reasons I don't get the same array back after expanding. Could you tell me what changes I need to make in terms of BIT_SIZE or any bitwise manipulations.
Thanks,
Tim
@Tim:
Well, you must have modified the I/O routines to write to a memory instead of a file. I would suggest that you do a sanity check. Compress the same data to a file using the original code, then see what the contents of test.lzw are.
Next, use your modified code, and perform the compression to an array. Now dump the contents of the array and see how it compares to the test.lzw from the previous step.
- mark
Mark,
thanks for your prompt reply. I have one more question. This is the problem I faced using an another LZW implementation.
I had an array say:
unsigned char array[5] = { 0x00, 0x01, 0x02, 0x00, 0x01};
Now when I compressed and expanded this array, I would loose the '0x00' which is between '0x02' & '0x0x'. I was combining two array members by converting the unsigned chars to int. When it read 0x00 it treated it as 0 and I lost it, and hence the expanding was lossy.
Does your algorithm solve this problem? and if it doesn't could you give me an headstart on how to solve this issue.
Thanks
@Tim:
I'm sorry, you're asking me about a problem in somebody else's code, which you haven't identified. There's not much I can do to understand problems in that code.
If you're losing data when a 0 byte is encountered, it's likely you are using a str*() function, which treats the 0 value as a terminator.
Thanks Mark,
I did the sanity check advised by you. And it seems to be working fine.
The problem is I am trying to compress a 32KB array:
unsigned char array[32768] = { 0x00.....}
when the test.lzw is created its size is 30285 , so there hardly is any compression. This is with bit-size of 12. I tried to increase the BIT_SIZE to 13 and the compressed file was 28850 bytes.
When I increased the BIT_SIZE to 14 I got a segmentation fault. Can I increase the BIT_SIZE beyond 14 to get higher compressed ratio?
May I also ask how did you come with the TABLE_SIZE based on BIT_SIZE? Didn't get the Math behind it ?
THanks
Hello Mark,
Do you think it would be advisable to perform compression twice i.e run the compress function again on the already compressed file. Will this achieve better compression?
Thanks,
Tim
@Tim:
No, you will generally not get better compression. If you do, any improvement will be very small, and will not be repeatable.
mire algunos de sus articulos publicados y por tal motivo si me pueden proporcionar la informacion siguiente.
Si han realizado investigacion de las imágenes *.GIF con el propósito de encontrar el algoritmo
principalmente el de comprensión o el que ya existe.o si conocen alguien que me pudiera proporcionar esa informacion.
Soy del Tecnologico de Tuxtla Gutierrez Chiapas, Mexico.
Esta informacion lo estoy solicitando, porque es una investigacion de la materia de graficacion,
donde nos pidieron informacion, seria y confiable.
estudio la carrera de ing. en Sistemas Computacionales
Octavio Vazquez Toledo.
Agradeceria mucho su ayuda.
Hello, Mark!
Is there LZW code on C#?
Hi Mark.
Are you aware of any problems running your lzw.c with BITS = 9 and 10?
I've been playing around with it using the test documents @ http://corpus.canterbury.ac.nz/descriptions/#cantrbry and am getting expanded files where some final characters (2 - 6) are replaced by an incorrect replicated final character in test.out.
Specifically, when BITS = 9, it occurs on files alice29.txt, asyoulik.text, cp.html, fields.c, kennedy.xls, lcet10.txt, and sum. When BITS = 10, it occurs on alice29.txt and grammar.lsp. I see no problems when compiling with BITS = 11, 12, 13, or 14.
Given my C-challenged status, I haven't been able to track down the problem yet. I'm on a a WinXP Pro machine and using MS VC++ 2008 Express Edition compiler.
Do you see any such problems when running lzw.c against any of these documents compiled for 9- or 10-bit code lengths? Any ideas?
BTW, thanks for a great article!
@jj:
Are you sure that there is really a problem here? When using BITS=9 and BITS=10, you are not likely to get very good compression, and so expansion on some files is not suprising. It's important to differentiate between that and a true error.
- Mark
Mark wrote:
Are you sure that there is really a problem here? When using BITS=9 and BITS=10, you are not likely to get very good compression, and so expansion on some files is not suprising. It's important to differentiate between that and a true error.
Yes, the problem is that that the expanded files (in test.out) are different from the original files. The expanded files are missing some final characters relative to the originals and have the penultimate character replicated.
A quick update to my previous post to correct the record...
Turns out the hexdump tool I was using for debug reported a non-existent final character on files being dumped. Thus no "penultimate character replication" is occuring in the expanded files as previously reported.
However, lzw.c in my environment still eliminates some final characters on the indicated files when BITS = 9 or 10. Seems like a buffer isn't being flushed. Perhaps something to do with the hashing routine or decode_stack (or my compiler)?
Hello,
I wanted to ask wiether you have any proof that running the algorithm again on a compresses string would now produce more compression ?
@vadim:
No, running the algorithm again will *usually* create a bigger sequence, but not necessarily always.
- Mark
Hello, Mark!
..resend...
Do you think this code variant simplify the expand function and it works fine?
[c]
while ((new_code=input_code(input,n_bits)) = decode_stack)
putc(*string--,output);
/*
** Finally, if possible, add a new code to the string table.
*/
if (next_code
@Marco:
Don't know, have you run tests on it?
Hello Mark,
I was willing to implement a LZW compressor/decompressor, and thanks to your explainations, I finally managed to make it (or it seems so, I still have to test it a bit!).
I looked for a bug for some time, and finally got it. I might be wrong, but I think there is something weird in your decompression algorithm.
At line 8, fig 6, you wrote: STRING = STRING + CHARACTER.
What I did is STRING = STRING + STRING[0] (Like in Wikipedia's decompression algorithm).
The thing is, CHARACTER gets the value of STRING[0] at line 13, but the STRING is 'outputed' at line 12, so the appended char is not the same.
Again, I might be wrong. But this is the problem I had (It might also come from my implementation).
Now I have to take a look at your hashing stuff, it seems very interesting! (I'm searching into my string table sequentially at the moment).
Thanks again for the post, and excuse my english mistakes, I'm not an english speaker.
PS: Can you make a similar post about adaptative huffman algorithm? (^_^)
@Clem:
If you think you have a bug, please demonstrate using the source code published here!
- Mark
Mark,
I think the VB.NET implementation of the code you refer to above is now at this url: http://www.codeproject.com/KB/recipes/VB_LZW.aspx
Regards.
@Luis:
Thanks, that looks like a good link.
Hi,
If I compress one file on windows and I compress (with the same code) the same file on lunux.
Why do I have 2 diferents sizes and files???
Conclusion...If I comppress on windows I can't descompress on linux :(
some idea?
thanks
javi
PD: If I compress and descompress on windows work fine and if I compress and descompress on linux work.
@javi:
Something is wrong, the files should be identical on both systems. Keep in mind that the code was posted almost 20 years ago, and compilers have changed a bit.
I think step 1 is to actually decode the binary output of the compressor and see if it is being properly stored in the file. This is a bit painful but it is a great exercise. Start by compressing a very short file and it should be relatively easy.
Let me know what you find.
Hi Mark;
Actualy; I am doing a researsh which related in some way to the compression of the random data and I would like to ask you about the variable bit counting VBC whether can perform a recursive compression on random data or not. If NO could you please tell me which lossless compression algorithm would be better to be used to compress recusively any kind of any random data.
Regards.
Mohammed.
@Mohammed:
You should probably stop your research - it is not going to work.
In order to compress what you are calling "random" data, a compressor must be able to compress *all* input files - not just a select few. This is a violation of the counting argument, also known as the pigeonhole principle.
A lot of people think they can find a tricky way to beat this problem, but it's just not going to happen. The counting argument is so exceedingly obvious that there is just no way around it.
Sorry.
hello guys first of all i would like to thank you for your
laxarious algorithms,
but i want to say a few words on the lzw.c programs,
just both the compression and decompression are written in
one program and after you enter the file to be compressed i don't know where it will put it after it compressed it.
in this time i'm developing an RPC program based on the LZW compression.
this program is going to work as below
1- the compresser program is to be placed on the server and the decompresser on the client side then when the client wants to acces any file which is found on the server , it will send a request to the server with the file name and then the server will reply for the clients request by compressing the data using the LZW and then after the client will decompress the data and access it.
so can you please send me any suggesion on this thing.
thanks
@alemat:
You're going to have to write some code to convert the demo program into a production program. should be pretty easy.
[...] matched my requirement fully was Mark Nelson’s compendium. You can find there information about LZW fundamentals, compression and decompression theory and of [...]
Hi Mark:
Glad to read this article. I was born in DEC,1989. So I was quite shocked when I found that this article was written before I was born ^_^
I've learned a lot from this article. Just want to say thanks.
Yuan Yang
@Yuan Yang:
Yeah there was a little interesting stuff that happened back in the old days. Not much, but some.
Thanks for the feedback!
Hi,
I tried to implement compress and decompress algorithms in java, but not able to decompress it well, when I expect a word with repeated characters like cccccccccc I am getting Nullpointer exception. can any one help me
Hi Mark/Raju,
I ran the lzw.c file and as expected I got the files test.out and test.lzw
However, the file test.lzw is not readable. I wanted to understand how does it compress using dictionary. Please advise me.
Thanks,
ash
@ash:
test.lzw is a file that is composed of variable length binary data, you will not be able to read it.
If you want to see the actual output of the compressor in human readable form, you will need to modify the program - both the output of the compressor and the input of the decompressor will need to be modified.
- Mark
Hello!
I'm now writing third implementation of LZW algorithm. First was using plane array of std::string's as a dictionary with linear search :) That was incredibly slow. The second was using binary tree of strings for faster search, and it was really much faster. Now I'm implementing hash table. I'm not a good programmer, and I can hardly understand your code. So, I have a question. My second LZW program (with tree) had better compression due to variable output code length: index length grew while dictionary was filling (I've wrote a stream class for bitwise file I/O). But the same trick with variable index length seems to be impossible with hash table. Am I right or wrong?
Thanks.
@Alex:
compress.c uses a hash table and manages variable length codes nicely. I don't see why it would be impossible?
I really don't undertand what's going on in your program:(
You don't use hash as index?..
Look at my hash func:
[cpp]
unsigned int CLZHashStringDict::hash(const std::string& str) const
{
unsigned int h = 1, length = str.length (), i = 0;
while (i
How do I post code?..
I got it. I can use variable index length if I use string counter as index. I'll try to implement it when I'll have spare time. But decompressing would be VERY slow, I need to use linear search for it.
Hi,
I've tried the compression on a binary file and the compressed file is larger than the original. The original data contains words.
Any thoughts?
@frankoni:
I don't know what you mane when you say the original data contains "words".
First make sure the executable you created is useful - try compressing some highly redundant text files to make sure you get a decrease.
The exe is functional. Words mean two bytes. The binary data is not an ascii stream, but 16bit unsigned integer values.
@frankoni:
In almost all data compression, you'll find that the compressor needs to operate on the same size units of data that are being used in the source input. This is why, for example, files containing say, floating point data, do so poorly with general purpose compressors. The compressor is looking for correlation between adjacent bytes, when the *real* correlation is between longer units.
This is the same problem you are seeing when trying to compress two-byte words. The best solution in a case like this is to modify the compressor to operate on the full size word instead of input bytes. Sometimes this is easy, sometimes not. But if you look around you can find sample source code written using various compressors to attack different size input streams.
- Mark
I've written pretty optimized (I guess:) ) LZW file archiver. If someone's interested in algorithms used, or their implementation (optimization still in progress), you can contact me by e-mail: alexxxx89[at]ya[dot]ru.
To Mark Nelson: should I relese my project on this site?
I have one doubt, is this coding only works for text files??or it works for binary files too?? because when try to compress a binary files it won't work but it work effciently for a text file..what is the reason for that??pls help me to clear my doubt...thanks in advance.
@rajesh:
LZW compression works fine on binary files. If you doubt this, create a file with one megabyte of nothing but the byte '00', then compress it - you will get excellent results.
To almost any general-purpose compressor, there is no concept of 'text' vs. 'binary'. These don't mean anything.
What you are observing is simply the difference between files that are more or less compressible using this method.
- Mark
@Alex:
I would be happy to post your code on the site, if you want to mail me a copy. Please be sure to include an appropriate license/usage declaration in the source itself - I might suggest the zlib license as an appropriate template.
- Mark
thanks for quick reply..and one more question i convert the compress file in ASCII format by using fprintf(output,"%d",code); then i need to read this table as input for decompression, what changes should i make to input so as to read this ascii file for decompression...pls reply...thanks in advance..
The people who created the RAR algorithm have gone through such lengths to protect it that they have a staff of approx. 100 people worldwide whose job is just to look out for infringers. Eugene Roshal has pledged to protect the secrets of RAR with his life.
Dear Mark,
Your work was extremely helpful. Thanks a lot :-)
I am doing a project which should compress any type of file to relatively smaller size, in smaller duration of time and using low memory possible.
Is LZW suitable for this purpose or should I use a case statement and use different algorithm for different file types?
Also the above code does not work on compressed files, Please guide me sir.
Thanks a lot in advance ...
Also can we use LZW in our commercial product, do we have to pay royalty for using it, or are there any legal issues involved ??
Thank You...
@Ceena:
>I am doing a project which should compress
>any type of file to relatively smaller size
Since you are proposing to do the impossible, LZW is as good a choice as any other. No algorithm can compress any file to a smaller size.
As far as I know all the LZW patents expired at least two years ago, you can use it in any commercial project you like. The deflate algorithm is probably a better choice for general-purpose lossless compression, but LZW will certainly work.
- Mark
Hi Mark,
Great article!
I am trying to implement LZW-12 bit compression such that it is able to compress and decompress ASCII as well as binary files. My I/O handler reads 8-bits byte as input (program for I/O handler appended), and I would like to write the output file in 12bit chunks.
How would I need to vary the algorithm presented by you to accomodate these requirements? Will my code have to handle files that are one or two or three bytes in length as individual cases given I am trying to write the output file in 12bit chunks?
Thanks,
AJ
MY I/O Handler -
@ICodeLikeAGirl:
First, your coment about wanting to compress binary and ASCII doesn't make any sense to me. The code as written doesn't distinguish between the type of data - as long as it comes in a stream of bytes it's going to compress properly.
If you want to write 12 bit chunks you can certainly just use the code in the article, which is bit-oriented. Or you can save up two 12-bit tokens and write them out as a three-byte sequence. This would be very efficient but you would have to possibly pad the file with one additional token at the end.
- Mark
Dear Mark,
I'm doing a dissertation in evaluating the compression algorithms. When I compiled the code in dev C++ its fine. But when I execute it it returns me a file which has got nothing in it. Since I'm not very good at code development I'm not able to understand the problem. I wanted to know whether this code has any data base connectivity and what shoudl I need to do in order to compress a text file. Should I need to change anything in the code. Please let me know some basic procedures behind this execution. Thank you.
Regards,
Mohan
Mohan, the executable will either take input and output file names from the command line, or will ask you for them if you leave the command line empty.
That is the basic procedure.
- Mark
Thank you Mark. I got the output. But the test.lzw is not in a readable form. Also I kindly request you for the codes of other compression algorithms like the one you have posted for lzw as its more easy to compile and execute. Possibly if I could get lzss, bwt or others it will be very helpful for my dissertation. Thanks for your immediate response.
Regards,
Mohan
Mohan -
Sorry the program does not meet your needs. Looks like you have some work to do, better get busy.
- Mark
thanks for this minimal and clear-cut implementation of LZW
i've been trying to adapt your code to use variable bit encoding so as to behave good for both small and large files
I've hit a but in your code which has been mentioned earlier by "jj" when you define BITS (constant) 10 or 9
to reproduce, take your code and add these 2 lines _after_ the definition of the hash table size
this way the hash table will be defined large enough for BITS=14 (any value is ok) then you redefine BITS to a lower value to demonstrate the bug
for certain kinds of input files, the final file test.out is DIFFERENT from the input file by a couple of bytes near the end
I am certain that the problem is your output_code() implementation that for 10 bits and lower isn't guaranteed to flush the final integer to the file (despite you writing MAX_VALUE and a zero at the end of your compress function).
It is easy to reproduce this bug if you try your code in a few files in your %TEMP% folder, I'm sure one of them will fail the test. Or if you can't find one I can send you a small test file.
Dear Mark,
I am glad to use your code to compress binary data in PC-based program and it results in great compression power.
However I'm having a problem of not enough RAM to decompress it in a small footprint device. Therefore, I'm thinking of hard-coding the tables. From my experiment, it seems that compression and decompression make use of different tables, i.e. prefix_code, append_character. Do you think is it possible to implement fix pattern tables for both compression and decompression?
Thank you
Regards,
YH
@YH:
I don't really think LZW is suitable for a static dictionary. There would be a lot of problems creating it, and what you would end up with would be something a bit different from LZW.
- Mark
I wrote a small blog article about this code, including a C++ wrapper to it, that extends it for variable bit length encoding which should help improving the compression ratio regardless of source file size:
http://zabkat.com/blog/24Jan10-lzw-compression-code.htm
Hello,
I noticed your code crashes on 64 bits architecture:
In input_code() left-shifting the bit buffer doesn't clear the top 32 bits, so I just clear non significative bits in the return code
Dear Mark
Thank you for such a gr8 article on LZw. I t has been really helpful for my final year project on data compression. I went through your code and i have sum doubts.The code execues for almost all types of text files like .txt,.doc,.xls,.rtf etc. Can you please explain which part of your code handles the different header formats pertaining to different formats of the text files.
I have also developed codes for RLE compression and HUFFMAN compression, but they work on only .txt files. How do I modify them to work for all formats.
I am also trying to develop a new algorithm. I plan to apply a fixed algorithm for 50% of the file, and then depending on the text content i want to apply a feasible algorithm. Is this algorithm feasible?
Regards
IMAGINOR(India)
@IMAGINOR:
There is no part of the algorithm that tries to determine the file type. It treats every file as a binary stream.
Your idea for an algorithm has some merit, I suggest you read up on PAQ for some ideas on differing algorithms depending on context.
- Mark
Hi Mark ,
Thnx for this excellent article . I hav a query ..
How does ur LZW implementation behave when the string table gets filled up ??
regards
Amit
@Amit:
My implementation simply stops adding new strings to the table when it is full. This is probably not optimaly, but it is simple.
If you can find the source code for UNIX compress.c, you will see that they monitor the compression ratio of the file after this occurs. If the compression ratio starts to drop, they flush the table and start over.
Those are probably the only two reasonable options for managing LZW table space.
- Mark
Hi Mark,
Thnx fr ur reply . I had 2 queries :-
a)if the table gets FULL then how will the compression proceed then as per ur implementation ?
b) Can you please explain the significance of MAX_CODE ?
Regards
Amit
@Amit, I think I explained this, but you can get the answers by simply reading the code.
Once the table is full compression proceeds normally, we just don't add new codes to the table.
MAX_CODE is the largest value that we will use.
- Mark
Hi Mark ,
Thnx for this excellent article . I have 3 questions for you :)
1) Can you explain what is the HASHING_SHIFT ?
2) Can you explain what is the difference between output_code and output_code_ORIG?
3) Can you explain why do you give "0L" value to output_bit_buffer?
Thank you
@sevdalone:
HASHING_SHIFT is used when creating a hash code. Note that the hash code is created by combining the character and its code, mangling them together into a single int. One of them is shifted to the left HASHING_SHIFT bits before this happens.
For your questions about the bit oriented I/O, all I can say is it is pretty straightforward. The variable length bits have to be shifted into an integer or byte of some kind before they can be written to the file.
I don't know what you are talking about when you say output_code_ORIG. That token does not appear in LZW.C.
- Mark
Dear Mark
Thank you for your reply and help. I am almost through with the development of the new algorithm i mentioned in my last post.
I have been working on development of another compression technique for text files. I propose to replace every new word in the file with a special character. Whenever i find a new word in the file i assign an index number to it and replace the word with the special character in the compressed file and update my dynamic dictionary with the integer index,the special character and the word. The next time the word is found in the file i replace the word with my defined special character. Once the single character replacement is saturated(i.e all the special characters from 1-254) are assigned, i concatenate the integers to get my new coder- defined special characters for the next 254^2 words and the porces continues.
I will be obliged if u comment on the merit of the algorithm i am attempting to develop and also suggest ways to optimise the algorithm.
Regards
Imaginor(India)
@IMAGINOR:
I imagine you will get performance similar to LZW with this scheme. If you think about it, you'll see that you are doing basically the same thing, you are just building tokens in a slightly different fashion.
- Mark
Dear Mark
Thank You for your reply. I did tally mt results with Lzw results. The compression ratio is superior in the range of 5% to 25% depending on the content of the file.
Moreover in lzw if we go for an example like : CAT, the first time it is coded we need to go for a check for existence of the word aftr every letter of the word is read and we continue with the process of continuation and finally assign the code equivalent to Whereas as per my implementation i just assign one special character to it(depending on the index assigned i.e if the index value is 2 i assign char(2) to the word), which not only saves my computation time and also increases my compression ratio.
I will be obliged if you kindly see whether my thought process is correct and would like to know where my logic of implementation might be wrong.
Regards
Imaginor(India)
Hi Mark
Thanks for this awesome article.
i am developing an algorithm where the dictionary size is kept 4096.Whenever it becomes full then some entries will be deleted depending on priority basis.If a pattern already exists in the dictionary then we increases it's HIT factor.The patterns which have lowest HIT will be deleted.If 2 or more have same HIT but length is different then the pattern with smallest length will be deleted.If 2 or more have same HIT and length is also same then the pattern which enters first in the dictionary will be deleted.this is the synopsis.
Can you please tell me whether it is a previous work or not?I went through several books but can not find this approach.Also may I use your sample program for modification.waiting for your reply.Thanks in advance.
@Cyber:
I am not sure that this is a very good idea with LZW. In fact, I'm not sure you can do it feasibly.
Just as an example, if you decide to remove the string for "ax", you will also have to delete all strings built on "ax", including "axi", "axia", "axial", etc.
To avoid this problem you will have to completely redo the way LZW stores its dictionary.
Keep in mind that you will also have to synchronize the dictionary management between the compressor and decompressor.
As for whether this has been done before, I don't know.
- Mark
Thanks for your reply Mark.
Checking the HIT count of all 4096 patterns and using the HIT count shall I not be able to delete a single pattern every time the dictionary becomes full?
Suppose we have various entries such as "ax","axi","axial","axials" etc.May I not be able to delete only "ax" at a time?Can you plz recommend any suitable approach for doing this?
Another question: may I use your sample code?If u grant me a permission then it will be very much helpful for me.
Thanks once again.waiting for reply.
@Cyber:
see: http://marknelson.us/code-use-policy/
Dear Mark,
I am trying to use this code to compress some web-based communication of wireless sensors. When I simply transfer the LZW.c to my server and compile and run it, I get a segmentation fault. It appears the fault happens during expansion. What could be causing this?
My server is 64-bit linux. I notice that there is an earlier post referring to this incompatibility. Could you explain further?
@prhaugen:
You are asking me to debug an unspecified error on an unkwown system using an unknown compiler and an unknown operating system with unknown input data. That's a bit tough.
- Mark
Hi Mark,
I came acorss this site and gone through a lot of the discussion. Find very useful and thanks. I have a question - and not sure if you can help me.
I have to do LZW comprssion as part of my college project for image processing. After a lot of confusion and trying, I finally understand the algorithm. I have written a java code which works for Strings (i would say probably not efficiently), however to work with pixels, I need it to work in Bytes / or Ints. My issues is how do I parse through bytes / ints and concatinate them to build dictionary? Say if I have 1 12 3, LZW works on Stream of 1123 and building the dictionary from there. Any hint - I am using Java - and any help to nudge me in the right direction is appreciated.
@addis:
The code already reads in bytes, so you don't have to do anything different for that.
To read integers you probably should just use DataInputStream and call the ReadInt() method. The LZW code will require a fairly large amount of modifications though. The library is currently built around chars and you will need to modify it to build entries using integers.
- Mark
I am doing my final year project on LZW. This site is really very nice and helpful..
but I have few doubts please clear them.
1. first of all i want to know why code_value is an array of int wheras prefix_code unsigned int,append_character is unsigned char.
2. u have mentioned that there is little difference between your implementation of lzw and unix one. may i know what is the difference.
3. how do u handle the situation when the table is filled completing. do u flush the table and start new one. I am unable to get it from code.
sorry for the long list of questions...
I will be really very thankful to you if u can provide me the answers to these questions.
@DJ:
1) code_value could be unsigned just as easily as signed, it wouldn't make any difference. In fact, it is probably better for it to be an unsigned integer. But this would only be an issue if we were building code tables that used all 16 or 32 bits, and in this code we are not.
2) The main difference is that the UNIX implementation starts with 9 bit code values and bumps them up implicitly.
3) This program does not flush the table, it simply stops adding new codes.
- Mark
Sir, i went through your site, and your code. This topic appears more like 'Mark Nelson's' compression, than Lzw compression! But i am not clear with the following issues. Kindly draw some light on them.
1.what is the reason for using unsigned EOF in compress()?
2.What is the logic behind this hash function? Is it the most efficient, or could it be replaced by any such formula
3. In decmpress(), string has the translation in reverse. In order to equate the 'first character' of 'string' to 'character', shouldnt we use the last letter of 'string'?
I would be very grateful to you if you reply back. Thank you...
@Archies:
1) EOF needs to be case to unsigned due to C semantics. After getc() assigns its value to character, whatever was returned is now unsigned. Thus, the comparison needs to be to an unsigned value.
2) The hash function is fairly efficient, it is very simple. But if I were writing this code today, I would simply use a C++ hash table and use the default hashing function. I am using the function that was used in UNIX compress something like 25 years ago when memory was dear and CPU cycles were expensive.
3) I'm not sure about this, but since the code works properly, I suggest examination inside a debugger to satisfy your curiosity.
- Mark
Thanks a lot for the quick reply :)
Leave A Reply
You can insert source code in your comment without fear of too much mangling by tagging it to use the iG:Syntax Hiliter plugin. A typical usage of this plugin will look like this:[c]
Note that tags are enclosed in square brackets, not angle brackets. Tags currently supported by this plugin are: as (ActionScript), asp, c, cpp, csharp, css, delphi, html, java, js, mysql, perl, python, ruby, smarty, sql, vb, vbnet, xml, code (Generic). If you post your comment and you aren't happy with the way it looks, I will do everything I can to edit it to your satisfaction.int main()
{
printf( "Hello, world!\n" );
return 1;
}
[/c]