<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Mark Nelson &#187; Data Compression</title>
	<atom:link href="http://marknelson.us/category/data-compression/feed/" rel="self" type="application/rss+xml" />
	<link>http://marknelson.us</link>
	<description>Programming, mostly.</description>
	<lastBuildDate>Mon, 06 Feb 2012 00:35:24 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>I&#8217;m In the Money</title>
		<link>http://marknelson.us/2012/02/01/im-in-the-money/</link>
		<comments>http://marknelson.us/2012/02/01/im-in-the-money/#comments</comments>
		<pubDate>Wed, 01 Feb 2012 16:36:28 +0000</pubDate>
		<dc:creator>Mark Nelson</dc:creator>
				<category><![CDATA[Data Compression]]></category>
		<category><![CDATA[Humor]]></category>

		<guid isPermaLink="false">http://marknelson.us/?p=1437</guid>
		<description><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2012/02/01/im-in-the-money/' addthis:title='I&#8217;m In the Money' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div>It looks like all my long years of studying data compression might be ready to pay off: Hello Good Day, This is Troop Emonds With regards to your Company i am sending this email Regards to order some( Compression Machine )I will like to know the type and sizes you have in stock and get [...]]]></description>
			<content:encoded><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2012/02/01/im-in-the-money/' addthis:title='I&#8217;m In the Money' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div><p>It looks like all my long years of studying data compression might be ready to pay off:</p>
<blockquote><p>Hello Good Day,</p>
<p>This is Troop Emonds With regards to your Company i am sending this email Regards to order some( Compression Machine )I will like to know the type and sizes you have in stock and get me the sales price of one so that i will tell you the quantity i will be ordering, and if you accept credit card as a form of payment..</p>
<p>Hope to read from you soon about my order request&#8230;&#8230;<br />
With Kind Regards.<br />
Troop</p></blockquote>
<p>I just need to put together some compression machines, and then I&#8217;m set.</p>
]]></content:encoded>
			<wfw:commentRss>http://marknelson.us/2012/02/01/im-in-the-money/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>A Visit With Tim Bell</title>
		<link>http://marknelson.us/2012/01/21/a-visit-with-tim-bell/</link>
		<comments>http://marknelson.us/2012/01/21/a-visit-with-tim-bell/#comments</comments>
		<pubDate>Sun, 22 Jan 2012 02:22:50 +0000</pubDate>
		<dc:creator>Mark Nelson</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Data Compression]]></category>
		<category><![CDATA[People]]></category>

		<guid isPermaLink="false">http://marknelson.us/?p=1407</guid>
		<description><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2012/01/21/a-visit-with-tim-bell/' addthis:title='A Visit With Tim Bell' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div>I was in Christchurch, New Zealand, recently and had a chance to meet Tim for the first time in person. Tim teaches at the <a href=" http://www.canterbury.ac.nz/" class="newpage">University of Canterbury in Christchurch</a>, and is <a href="http://www.cosc.canterbury.ac.nz/tim.bell/" class="newpage">Deputy Head of the Computer Science and Software Engineering</a> department. I got a chance to ask him about his work in data compression as well as one of his new areas of interest, Computer Science education.]]></description>
			<content:encoded><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2012/01/21/a-visit-with-tim-bell/' addthis:title='A Visit With Tim Bell' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div><p><img src="/attachments/2012/bell/TimBell2.jpg" alt="Dr. Timothy Bell" align="right" style="margin-left:15px;border-style:solid;border-width:2px"><br />
In my early years of learning about data compression, the book <a href="http://books.google.com/books/about/Text_compression.html?id=sdZQAAAAMAAJ" class="newpage">Text Compression</a> by Timothy Bell, John Cleary, and Ian Witten was my resource of first resort. I was in Christchurch, New Zealand, recently and had a chance to meet Tim for the first time in person. Tim teaches at the <a href=" http://www.canterbury.ac.nz/" class="newpage">University of Canterbury in Christchurch</a>, and is <a href="http://www.cosc.canterbury.ac.nz/tim.bell/" class="newpage">Deputy Head of the Computer Science and Software Engineering</a> department. I got a chance to ask him about his work in data compression as well as one of his new areas of interest, Computer Science education.<br />
<span id="more-1407"></span></p>
<hr/>
<p>MN: Tim, it seems like there has been a lot of interest in data compression in the Antipodes. Names that come to mind include you, John Cleary, and Peter Fenwick in New Zealand, and Ross Williams in Australia. Is this just coincidence, or is compression in the air down there?</p>
<p>TB: I’ve sometimes wondered about this&#8230; during the early days of computing and especially personal computers, it took some time for the latest technology to reach us “down under”, so perhaps we were motivated to get more out of what we had rather than wait some months for a larger disk or new memory to arrive from overseas. When the Internet arrived we started with a very small pipe, so a good compression algorithm could do the equivalent of laying a second cable from NZ to the US – who can resist getting something for free?</p>
<p>MN: Since you wrote Text Compression back in the early 90s, I&#8217;d say the biggest development in lossless compression has been the Burrows-Wheeler transform. Is lossless text compression basically done? Are we left with just incremental improvements as processor resources increase?</p>
<p>TB: That seems to be the case; the only big improvements we’ve seen have turned out to be frauds &#8212; we even had one in NZ recently, where a Nelson man raised NZ$5.3 million for an impressive sounding method; he was <a href="http://www.stuff.co.nz/nelson-mail/news/3892853/Whitley-found-guilty-of-fraud" class="newpage">convicted of fraud</a> last year. The main indicator we have that we’re running out of steam (apart from a lack of new discoveries) is Shannon’s experiments on predicting text which gave a bound in the order of 1 bit per character for English text, and current methods are approaching this. Of course, there’s plenty of room for dealing with new kinds of data (for example, bioinformatics deals with massive amounts of data that we’re still trying to understand) and for finding better data structures and algorithms for performing the compression and decompression. Lossy compression is a whole different story&#8230;</p>
<h4>A Change In Focus</h4>
<p>MN: It looks like you are now dedicating a large amount of your time to establishing computer science as part of the basic curriculum in high school education, for students in the 15-18 age range. In many ways, this is as much a bureaucratic problem as an academic one. What motivated you to take it on?</p>
<p>TB: It’s been a problem that we’ve complained about for decades, and it’s been getting worse and worse as computing in schools has focussed increasingly on using computers and not preparing students to be developers. A lot of this can be attributed to bureaucracy – it’s hard to explain to government officials that putting word processors in every classroom isn’t the same as building a computationally literate society. As a result of some strategic lobbying done by others, a small window of opportunity opened for me to be on a group to advise our Ministry of Education, just over 3 years ago. The group managed to convince the officials that something useful could be done, and then we had to work very quickly to come up with a concrete proposal before the enthusiasm died down.  This has happened rapidly; the advisory group first met in November 2008, and Computer Science started being taught in schools in February 2011.</p>
<p>MN: What have you been able to accomplish in New Zealand so far?</p>
<p>TB: Computer science (including programming, but also topics that involve understanding the importance of things like algorithms, HCI, programming languages and even compression) is currently available as part of computing courses for two of the three final years of our main high school graduation qualification, with all three years being covered from 2013. After that we would expect some of the introductory material to start filtering down to earlier classes, and for wider offerings as teachers become more confident in the subject. One of the biggest challenges has been preparing teachers, few of whom have significant experience in Computer Science. Many have embraced it enthusiastically, and the universities and others have done a lot of work to help them get up to speed. It’s been a wild ride doing it so quickly, but there have been some very pleasing outcomes.</p>
<p>MN: And how do things look in the rest of the world? Are there any obvious winners and losers at this point? Do you have any concise advice for the world?</p>
<p>TB: Computing in schools is a hot topic around the world; the UK have just announced a strong drive to introduce this sort of material to schools, and the US has people working hard to make it available to students. Israel and Korea have had computer science in schools for some time. We’re learning a lot about what is worth teaching, and what the best pedagogy is for the general classroom (most of our experience is for specialist students who have chosen the subject!) The New Zealand path of getting something going quickly with grass-roots support seems to be more effective than waiting for a top-down approach which could take years to develop and prepare teachers for, although it does make for a bumpy ride as problems are ironed out as we go along!</p>
<p>MN: This might be straying out of your area a bit, but do you see CS in a K-12 education setting having an effect on the representation of women in the STEM fields?</p>
<p>TB: Attitudes that affect representation definitely start at school, and to me the biggest goal of teaching CS in high school is not so much to prepare students for further study, but to enable them to find out what the subject is! School students rarely know what CS is, and even worse, it’s common for them to assume that it must be advanced word processing or some other dull area, and hence they avoid it. It’s particularly important for female students to have the opportunity to find out if it’s something that they might be good at, as the stereotypes associated with computing can make them assume that they shouldn’t consider it as a career.</p>
<p>MN: One final question, Tim. The whole world has seen the devastating damage Christchurch has suffered from the earthquakes in the last year. How has the University of Canterbury held up? Have you managed to maintain continuity in your academic calendar?</p>
<p>TB: It’s been quite a year! Thankfully our university has escaped the brunt of the earthquakes (most of the damage is some distance from the university), and we’ve managed to keep a full programme going despite being closed for three weeks for safety checks. Many students joined the “student volunteer army”, who helped with the cleanup in the damaged parts of town, and that was probably one of the most valuable experiences of their career! It hasn’t been without disruption as buildings need to be checked carefully, and some are still under repair, but with a bit of resourcefulness we managed to keep going (for a while I even delivered my classes in a restaurant while lecture theatres were being inspected). The city is now going through a massive program of redevelopment with some pretty creative ideas, and it’s an exciting time to be part of these changes.</p>
<hr/>
<p>
<img src="/attachments/2012/bell/New_Zealand.png" alt="New Zealand" align="left" style="margin-right:15px;border-style:solid;border-width:2px">Thanks to Dr. Bell for taking the time to share all this with us. My visit to his amazing homeland was a real treat, and the short time I got to spend with Tim in Christchurch was worth the trip all in itself.</p>
]]></content:encoded>
			<wfw:commentRss>http://marknelson.us/2012/01/21/a-visit-with-tim-bell/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Streams or Iterators?</title>
		<link>http://marknelson.us/2011/12/24/streams-or-iterators/</link>
		<comments>http://marknelson.us/2011/12/24/streams-or-iterators/#comments</comments>
		<pubDate>Sat, 24 Dec 2011 18:21:11 +0000</pubDate>
		<dc:creator>Mark Nelson</dc:creator>
				<category><![CDATA[C/C++]]></category>
		<category><![CDATA[Data Compression]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://marknelson.us/?p=1393</guid>
		<description><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2011/12/24/streams-or-iterators/' addthis:title='Streams or Iterators?' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div>When I updated my LZW reference code to use the latest C++ features, I abstracted my input and output functions using templates. Data was read and written using the iostreams paradigm, which requires simple classes that implement just a few functions. Would I have been better off using the iterator paradigm instead? The C++ algorithms [...]]]></description>
			<content:encoded><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2011/12/24/streams-or-iterators/' addthis:title='Streams or Iterators?' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div><p>When I updated my <a href="http://marknelson.us/2011/11/08/lzw-revisited/" class="newpage">LZW</a> reference code to use the latest C++ features, I abstracted my input and output functions using templates. Data was read and written using the iostreams paradigm, which requires simple classes that implement just a few functions. Would I have been better off using the iterator paradigm instead? The C++ algorithms library favors that method of processing data, and it can be both elegant and powerful. Which of the two paradigms is the right one for data compression?<br />
<span id="more-1393"></span></p>
<h4>The Conflict</h4>
<p>General purpose data compression routines tend to be used on binary streams of data, either from files or in-memory objects. So what is the best general paradigm for input and output when compressing data? </p>
<p>You might analyze this problem by imagining that you need to write a binary copy routine. </p>
<pre>
template&lt;class INPUT_ITERATOR, class OUTPUT_ITERATOR&gt;
void bcopy( INPUT_ITERATOR input, INPUT_ITERATOR eof, OUTPUT_ITERATOR output )
{
    while ( input != eof )
        *output++ = *input++;
}
</pre>
<p>This routine is particularly nice when you are performing a simple copy using pointers to memory &#8211; the generated code should be really efficient.</p>
<p>However, the iterator paradigm doesn&#8217;t work quite as well when you want to perform a binary copy of data in a file. I can make use of iterators that almost do the job:</p>
<pre>
 std::ifstream in( &quot;input.txt&quot;, std::ios_base::binary );
 std::ofstream out( &quot;output.txt&quot;, std::ios_base::binary );
 bcopy( std::istream_iterator&lt;char&gt;(in),
        std::istream_iterator&lt;char&gt;(),
        std::ostream_iterator&lt;char&gt;(out) );
</pre>
<p>But the bad news is that both <code>istream_iterator</code> and <code>ostream_iterator</code> use the insertion and extraction operators, which are really meant for whitespace-delimited textual data, not binary data. The copy routine shown here will not make a binary byte-for-byte copy of the input file.</p>
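<p>To see the problem concretely, here is a minimal sketch that pushes a short string through the same pair of iterators, using string streams in place of files. The helper name <code>copy_with_stream_iterators</code> is made up for this illustration and is not part of any library:</p>

```cpp
#include <algorithm>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical helper: copies a string through istream_iterator and
// ostream_iterator, the same pair of iterators used in the file copy above.
std::string copy_with_stream_iterators( const std::string &data )
{
    std::istringstream in( data );
    std::ostringstream out;
    // istream_iterator<char> reads with operator>>, which skips
    // whitespace by default, so spaces and newlines never arrive.
    std::copy( std::istream_iterator<char>( in ),
               std::istream_iterator<char>(),
               std::ostream_iterator<char>( out ) );
    return out.str();
}
```

<p>Passing in the five-character string <code>"a b\nc"</code> gives back <code>"abc"</code>: the space and the newline are lost on the way through, which is exactly what you do not want in a binary copy.</p>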
<p>So when using files, the stream approach seems to be the way to go:</p>
<pre>
template&lt;class INPUT_STREAM, class OUTPUT_STREAM&gt;
void bcopy( INPUT_STREAM &amp;in, OUTPUT_STREAM &amp;out )
{
    char c;
    while ( in.get(c) )
        out.put(c);
}
</pre>
<p>If your files have been opened using the <code>iostream</code> classes, you can use this binary copy function without having to write any glue code &#8211; they already support the <code>get</code> and <code>put</code> methods, so this works right out of the box.</p>
<h4>My Choice</h4>
<p>If I&#8217;ve made up my mind that my data compression routine is going to use one of these two paradigms, it means I am going to have to write some glue code. If I choose the iterator-based approach, I need the equivalent of <code>istream_iterator</code> and <code>ostream_iterator</code> for binary files &#8211; and these aren&#8217;t in the standard library. If I choose the stream-based approach, I need efficient <code>put()</code> and <code>get()</code> members for blocks of memory. In some cases <code>basic_stringstream</code> might do the job, but not in all cases.</p>
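<p>As a rough idea of what the stream-side glue code could look like, here is a sketch of two small adapters for blocks of memory. The class names <code>memory_input</code> and <code>memory_output</code>, and the copy routine name <code>stream_copy</code>, are assumptions made for this example; only the <code>get</code> and <code>put</code> member names come from the stream paradigm described above:</p>

```cpp
#include <cstddef>
#include <vector>

// Hypothetical input adapter: hands out chars from a block of memory.
class memory_input {
public:
    memory_input( const char *data, std::size_t size )
        : m_pos( data ), m_end( data + size ) {}
    // Same shape as istream::get(char&), but returns a plain bool:
    // false once the block is exhausted.
    bool get( char &c )
    {
        if ( m_pos == m_end )
            return false;
        c = *m_pos++;
        return true;
    }
private:
    const char *m_pos;
    const char *m_end;
};

// Hypothetical output adapter: appends chars to a growable buffer.
class memory_output {
public:
    void put( char c ) { m_data.push_back( c ); }
    const std::vector<char> &data() const { return m_data; }
private:
    std::vector<char> m_data;
};

// Stream-style copy loop, taking both streams by reference so that
// non-copyable stream types can be passed in.
template<class INPUT_STREAM, class OUTPUT_STREAM>
void stream_copy( INPUT_STREAM &in, OUTPUT_STREAM &out )
{
    char c;
    while ( in.get( c ) )
        out.put( c );
}
```

<p>Because the copy loop only asks for <code>get</code> and <code>put</code>, the same template accepts these adapters or a pair of standard file streams. Embedded zero bytes and whitespace pass through untouched, since nothing here goes through the formatted extraction operators.</p>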
<p>After dithering around with various solutions, I tentatively opted for the stream paradigm. I found the implementation for various sources of data to be fairly simple, and the interface is easy to understand. I don&#8217;t know if it&#8217;s the perfect choice, and I&#8217;ll keep experimenting, but for now it works for me. My abstraction of the LZW code still needs a lot of work, so it&#8217;s always possible I could rethink this at a later date.</p>
<p>I&#8217;d like to hear your thoughts &#8211; is there an obvious right answer to this question?</p>
]]></content:encoded>
			<wfw:commentRss>http://marknelson.us/2011/12/24/streams-or-iterators/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>LZW Revisited</title>
		<link>http://marknelson.us/2011/11/08/lzw-revisited/</link>
		<comments>http://marknelson.us/2011/11/08/lzw-revisited/#comments</comments>
		<pubDate>Tue, 08 Nov 2011 15:21:41 +0000</pubDate>
		<dc:creator>Mark Nelson</dc:creator>
				<category><![CDATA[C/C++]]></category>
		<category><![CDATA[Data Compression]]></category>

		<guid isPermaLink="false">http://marknelson.us/?p=1056</guid>
		<description><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2011/11/08/lzw-revisited/' addthis:title='LZW Revisited' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div>In this updated look at LZW, I will first give a description of how LZW works, then describe the core C++ code that I use to implement the algorithm. I'll then walk you through the use of the algorithm with a few varieties of I/O. Finally, I'll show you some benchmarks and go over the history of this well-known compression algorithm.]]></description>
			<content:encoded><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2011/11/08/lzw-revisited/' addthis:title='LZW Revisited' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div><p>One of the first articles I wrote for Dr. Dobb&#8217;s Journal, <a href="http://marknelson.us/1989/10/01/lzw-data-compression/" class="newpage">LZW Data Compression</a>, turned out to be very popular, and still generates a fair amount of traffic and email over twenty years later.</p>
<p>One of the reasons for its popularity seems to be that LZW compression is a popular homework assignment for CS students around the world. And that audience sometimes found the article to be a bit of a struggle. My code was modeled on the UNIX <a href="http://en.wikipedia.org/wiki/Compress" class="newpage">compress program</a>, which was written in terse C for maximum efficiency. And sometimes optimization comes at the expense of comprehension.</p>
<p>By using C++ data structures I can model the algorithm in a much more straightforward way &#8211; the language doesn&#8217;t get in the way of a clear implementation. And after 20 years of answering puzzled queries I think I can improve on the overall explanation of just how LZW works. </p>
<p>In this updated look at LZW, I will first give a description of how LZW works, then describe the core C++ code that I use to implement the algorithm. I&#8217;ll then walk you through the use of the algorithm with a few varieties of I/O. Finally, I&#8217;ll show you some benchmarks.<br />
<span id="more-1056"></span><br />
I&#8217;m hoping that this version of the article will be good enough to last for another 20 years.</p>
<h4>LZW Basics</h4>
<p>LZW compression works by reading a sequence of <em>symbols</em>, grouping the symbols into <em>strings</em>, and converting the strings into <em>codes</em>. Because the codes take up less space than the strings they replace, we get compression.</p>
<p>My implementation of LZW uses the C++ <code>char</code> as its symbol type, the C++ <code>std::string</code> as its string type, and <code>unsigned int</code> as its code type.  The tables of codes and strings are implemented using <code>unordered_map</code>, the C++ library&#8217;s hash table data structure. By using the native types and standard library data structures the representation in the program is straightforward and easy to follow.</p>
<h4>Encoding/Decoding</h4>
<p>Rather than jumping directly into a full implementation, I&#8217;m going to work my way up to LZW one step at a time.</p>
<p>The first step is getting a clear understanding of how the encoding and decoding process works. As I said earlier, LZW compression converts strings of symbols into integer codes. Decompression converts codes back into strings, returning the same text that we started with.</p>
<p>LZW is a greedy algorithm &#8211; it tries to find the longest possible string that it has a code for, then outputs the code for that string. The code below is not quite LZW, but it shows you the basic idea of how a greedy encoder can work:</p>
<pre>
void encode( input_stream in, output_stream out )
{
  //
  // This hash table contains a list of codes, indexed
  // by the string that corresponds to the code.
  //
  std::unordered_map&lt;std::string,unsigned int&gt; codes;
  //
  // There is presumably some code here that initializes
  // the dictionary with a set of codes based on whatever
  // algorithm we are implementing.
  //
  ...initialize the dictionary
  //
  // With codes in the dictionary, encoding is
  // now ready to begin.
  //
  std::string current_string;
  char c;
  while ( in &gt;&gt; c ) {
    current_string = current_string + c;
    if ( codes.find(current_string) == codes.end() ) {
      current_string.erase(current_string.size()-1);
      out &lt;&lt; codes[current_string];
      current_string = c;
    }
  }
  out &lt;&lt; codes[current_string];
}
</pre>
<p>The greedy encoder reads characters in from the uncompressed stream, and appends them one by one to the variable <code>current_string</code>. Each time it lengthens the string by one character, it checks to see if it still has a valid code for that string in the dictionary.</p>
<p>This continues until we eventually add a character that forms a string that isn&#8217;t in the dictionary. So we then erase the last character from that string, and issue the code for the resulting string &#8211; the string from the previous pass through the loop. </p>
<p>The value of <code>current_string</code> is then initialized with the character that broke the camel&#8217;s back, and the algorithm continues in the loop, building new strings until it runs out of input characters. At that point it outputs the last remaining code and exits.</p>
<p>As an example of how this would work, imagine I have the input stream <code>ACABCA</code>, and my code dictionary looks like this:<br />
<center></p>
<table border="1">
<tr>
<td>String</td>
<td>Code</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
</tr>
<tr>
<td>B</td>
<td>2</td>
</tr>
<tr>
<td>C</td>
<td>3</td>
</tr>
<tr>
<td>AB</td>
<td>4</td>
</tr>
<tr>
<td>ABC</td>
<td>5</td>
</tr>
</table>
<p>A sample dictionary<br />
</center><br />
If you follow the algorithm above, you&#8217;ll see that the code output has to be <code>1 3 5 1</code>. If this wasn&#8217;t a greedy algorithm, <code>1 3 4 3 1</code> would have been another valid output.</p>
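<p>If you would rather check this by running it, here is a small self-contained sketch of the same greedy loop. It works on a <code>std::string</code> and returns a <code>std::vector</code> of codes instead of using stream classes, and the names <code>code_table</code>, <code>sample_dictionary</code> and <code>greedy_encode</code> are made up for this example. It assumes every single character of the input has an entry in the dictionary:</p>

```cpp
#include <string>
#include <unordered_map>
#include <vector>

typedef std::unordered_map<std::string, unsigned int> code_table;

// The five-entry sample dictionary from the table above.
code_table sample_dictionary()
{
    code_table codes;
    codes["A"] = 1;
    codes["B"] = 2;
    codes["C"] = 3;
    codes["AB"] = 4;
    codes["ABC"] = 5;
    return codes;
}

// Greedy encoder: the same loop as encode(), reading from a string and
// collecting the output codes in a vector. at() throws std::out_of_range
// if a single input character has no code, which this sketch assumes
// never happens.
std::vector<unsigned int> greedy_encode( const std::string &input,
                                         const code_table &codes )
{
    std::vector<unsigned int> out;
    std::string current_string;
    for ( char c : input ) {
        current_string += c;
        if ( codes.find( current_string ) == codes.end() ) {
            // The longer string is unknown: back up one character,
            // emit the code for the string we did know, and restart.
            current_string.erase( current_string.size() - 1 );
            out.push_back( codes.at( current_string ) );
            current_string = c;
        }
    }
    if ( !current_string.empty() )
        out.push_back( codes.at( current_string ) );
    return out;
}
```

<p>Calling <code>greedy_encode( "ACABCA", sample_dictionary() )</code> returns the codes 1, 3, 5, 1.</p>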
<p>Decoding the stream in a system like this is very straightforward:</p>
<pre>
void decode( input_stream in, output_stream out )
{
  std::unordered_map&lt;unsigned int,std::string&gt; strings;
  //
  // Initialize the code table with the same set of codes and strings
  // that the encoder used for your algorithm.
  //
  ...initialize the dictionary
  //
  // With codes in the dictionary, decoding is now
  // ready to begin.
  //
  unsigned int code;
  while ( in &gt;&gt; code )
    out &lt;&lt; strings[code];
}
</pre>
<p>Remember, the decoder shown above is just a hypothetical sample - we're still working our way up to the full LZW decoder.</p>
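<p>For completeness, here is a runnable sketch of that hypothetical decoder, fed from a <code>std::vector</code> of codes rather than a stream. The names <code>string_table</code>, <code>sample_strings</code> and <code>table_decode</code> are invented for this example, and the table is simply the earlier sample dictionary turned around:</p>

```cpp
#include <string>
#include <unordered_map>
#include <vector>

typedef std::unordered_map<unsigned int, std::string> string_table;

// The sample dictionary from the encoding example, indexed by code.
string_table sample_strings()
{
    string_table strings;
    strings[1] = "A";
    strings[2] = "B";
    strings[3] = "C";
    strings[4] = "AB";
    strings[5] = "ABC";
    return strings;
}

// Table-lookup decoder: each code is replaced by its string.
// at() throws std::out_of_range for a code that is not in the table.
std::string table_decode( const std::vector<unsigned int> &input,
                          const string_table &strings )
{
    std::string out;
    for ( unsigned int code : input )
        out += strings.at( code );
    return out;
}
```

<p>Both <code>1 3 5 1</code> and the non-greedy <code>1 3 4 3 1</code> decode to <code>ACABCA</code>, which shows that the decoder does not care how the encoder chose to split up the input.</p>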
<h4>The LZW Encoder</h4>
<p>The encoder shown above works okay, but there is one missing ingredient: management of the code dictionary. If you think about it, you'll see that we only achieve reasonable compression when we are able to build up longer strings and find them in the dictionary. Building a useful dictionary is referred to in the data compression world as <em>modeling</em>.</p>
<p>But our management of the dictionary is constrained by an important requirement: the encoder and decoder both have to be working with the same copy of the dictionary. If they have different dictionaries, the encoder might send a code that the decoder can't resolve.</p>
<p>Some data compression algorithms solve this problem by using a predefined dictionary that both the encoder and the decoder know in advance. But LZW builds a dictionary on the fly, using an <em>adaptive</em> method that ensures both the encoder and decoder are in sync.</p>
<p>LZW manages this in an effective and provably correct fashion. First, both the encoder and decoder initialize the dictionary with all possible single character strings. For the compressor, that looks like this:</p>
<pre>
for ( unsigned int i = 0 ; i &lt; 256 ; i++ )
    codes[std::string(1,(char)i)] = i;
</pre>
<p>This ensures that we can encode all possible streams. No matter what, we can always break a stream down into single characters and encode these, knowing that the decoder has the same strings in its dictionary with values 0-255.</p>
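<p>As a sketch of what "the same strings with values 0-255" means on both sides, the function below fills an encoder-style table and a decoder-style table in one loop. The function name <code>init_tables</code> is an assumption for this example, and a real encoder and decoder would each build only their own table:</p>

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper: fills the encoder table (string -> code) and the
// decoder table (code -> string) with the 256 single character strings.
void init_tables( std::unordered_map<std::string, unsigned int> &codes,
                  std::unordered_map<unsigned int, std::string> &strings )
{
    for ( unsigned int i = 0 ; i < 256 ; i++ ) {
        std::string s( 1, (char) i );
        codes[s] = i;      // encoder side: look up a code by string
        strings[i] = s;    // decoder side: look up a string by code
    }
}
```

<p>After the call each table holds 256 entries, and looking a string up in one table and the resulting code up in the other always leads back to the string you started with.</p>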
<p>Then comes the key component of the LZW algorithm. If you go back to the greedy encoding loop above, you'll see that I keep adding input symbols to a string until I find a string that isn't in the dictionary. This string has the characteristic of being composed of a string that currently exists in the dictionary, with one additional character.</p>
<p>LZW then takes that new string and adds it to the dictionary, creating a new code. The strings are added to the table with code values that increment by one with each new entry.</p>
<p>The resulting code is just a slightly modified version of the encoder that I listed above. It still only outputs codes for values that are in the dictionary, but now the dictionary is being updated with a new string every time an existing code is sent:</p>
<pre>
void compress( input_stream in, output_stream out )
{
  std::unordered_map&lt;std::string,unsigned int&gt; codes;
  for ( unsigned int i = 0 ; i &lt; 256 ; i++ )
    codes[std::string(1,(char)i)] = i;
  unsigned int next_code = 257;
  std::string current_string;
  char c;
  while ( in &gt;&gt; c ) {
    current_string = current_string + c;
    if ( codes.find(current_string) == codes.end() ) {
      codes[ current_string ] = next_code++;
      current_string.erase(current_string.size()-1);
      out &lt;&lt; codes[current_string];
      current_string = c;
    }
  }
  out &lt;&lt; codes[current_string];
}
</pre>
<p>The code above constitutes a more or less complete LZW encoder. I've only made a couple of additions to the previous encoder:</p>
<ul>
<li>The initialization of codes 0-255 with all possible single character strings.</li>
<li>The insertion of the newly discovered string into the string table, generating a new code.</li>
</ul>
<p>(One item of note in this code: you might wonder why <code>next_code</code> is initialized to 257, when 256 is the first free code. This is because I reserve code 256 for an EOF marker. More on this in a later section.)</p>
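<p>To experiment with the encoder without writing any stream classes, here is a sketch of the same loop that reads from a <code>std::string</code> and returns the codes in a <code>std::vector</code>. The function name <code>lzw_codes</code> is made up for this example, the EOF code is not emitted, and the code values mentioned afterwards assume an ASCII character set:</p>

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Same algorithm as compress(), with the stream parameters swapped
// for a string in and a vector of codes out.
std::vector<unsigned int> lzw_codes( const std::string &input )
{
    std::unordered_map<std::string, unsigned int> codes;
    for ( unsigned int i = 0 ; i < 256 ; i++ )
        codes[std::string( 1, (char) i )] = i;
    unsigned int next_code = 257;   // 256 is reserved for the EOF marker
    std::vector<unsigned int> out;
    std::string current_string;
    for ( char c : input ) {
        current_string += c;
        if ( codes.find( current_string ) == codes.end() ) {
            // New string: give it the next free code, then emit the
            // code for the string without its last character.
            codes[current_string] = next_code++;
            current_string.erase( current_string.size() - 1 );
            out.push_back( codes[current_string] );
            current_string = c;
        }
    }
    // Flush whatever is left, unless the input was empty.
    if ( !current_string.empty() )
        out.push_back( codes[current_string] );
    return out;
}
```

<p>For the input <code>ABBABBBABBA</code> this returns <code>65,66,66,257,258,260,65</code>.</p>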
<p>Just to make sure this all adds up, I'll walk through the steps the encoder takes as it processes a string from a simple two letter alphabet: <code>ABBABBBABBA</code>. There are a lot of steps shown below, but working through the process in detail is a great way to be sure you understand it:<br />
<center><br />
<table border="1">
<tr>
<th>Input<br/>Symbol</th>
<th>Action(s)</th>
<th>New<br/>Code</th>
<th>Output<br/>Code</th>
</tr>
<tr>
<td valign="top"><center>A</center></td>
<td>read 'A' - set current_string to 'A'<br/>'A' is in the dictionary, so continue</td>
<td>&nbsp;</td>
<td>&nbsp;</td>
</tr>
<tr>
<td valign="top"><center>B</center></td>
<td>read 'B' - set current_string to 'AB'<br/>'AB' is not in the dictionary, add it with code 257<br/>output the code for 'A' - 65<br/>set current_string to 'B'</td>
<td valign="top">257 (AB)</td>
<td valign="top">65 (A)</td>
</tr>
<tr>
<td valign="top"><center>B</center></td>
<td>read 'B' - set current_string to 'BB'<br/>'BB' is not in the dictionary, add it with code 258<br/>output the code for 'B' - 66<br/>set current_string to 'B'</td>
<td valign="top">258 (BB)</td>
<td valign="top">66 (B)</td>
</tr>
<tr>
<td valign="top"><center>A</center></td>
<td>read 'A' - set current_string to 'BA'<br/>'BA' is not in the dictionary - add it with code 259<br/>output the code for 'B' - 66<br/>set current_string to 'A'</td>
<td valign="top">259 (BA)</td>
<td valign="top">66 (B)</td>
</tr>
<tr>
<td valign="top"><center>B</center></td>
<td>read 'B' - set current_string to 'AB'<br/>'AB' is in the dictionary, so continue</td>
<td>&nbsp;</td>
<td>&nbsp;</td>
</tr>
<tr>
<td valign="top"><center>B</center></td>
<td>read 'B' - set current_string to 'ABB'<br/>'ABB' is not in the dictionary - add it with code 260<br/>output the code for 'AB' - 257<br/>set current_string to 'B'</td>
<td valign="top">260 (ABB)</td>
<td valign="top">257 (AB)</td>
</tr>
<tr>
<td valign="top"><center>B</center></td>
<td>read 'B' - set current_string to 'BB'<br/>'BB' is in the dictionary, so continue</td>
<td>&nbsp;</td>
<td>&nbsp;</td>
</tr>
<tr>
<td valign="top"><center>A</center></td>
<td>read 'A' - set current_string to 'BBA'<br/>'BBA' is not in the dictionary - add it with code 261<br/>output the code for 'BB' - 258<br/>set current_string to 'A'</td>
<td valign="top">261 (BBA)</td>
<td valign="top">258 (BB)</td>
</tr>
<tr>
<td valign="top"><center>B</center></td>
<td>read 'B' - set current_string to 'AB'<br/>'AB' is in the dictionary, so continue</td>
<td>&nbsp;</td>
<td>&nbsp;</td>
</tr>
<tr>
<td valign="top"><center>B</center></td>
<td>read 'B' - set current_string to 'ABB'<br/>'ABB' is in the dictionary, so continue</td>
<td>&nbsp;</td>
<td>&nbsp;</td>
</tr>
<tr>
<td valign="top"><center>A</center></td>
<td>read 'A' - set current_string to 'ABBA'<br/>'ABBA' is not in the dictionary - add it with code 262<br/>output the code for 'ABB' - 260<br/>set current_string to 'A'</td>
<td valign="top">262 (ABBA)</td>
<td valign="top">260 (ABB)</td>
</tr>
<tr>
<td valign="top"><center>EOF</center></td>
<td>end of the input stream - exit loop<br/>current string is 'A'<br/>output the code for 'A' - 65</td>
<td>&nbsp;</td>
<td>65 (A)</td>
</tr>
</table>
<p></center><br />
After processing string <code>ABBABBBABBA</code>, the output codes are <code> 65,66,66,257,258,260,65</code>. The dictionary at this point is:<br />
<center></p>
<table border="1">
<tr>
<td>String</td>
<td>Code</td>
</tr>
<tr>
<td>AB</td>
<td>257</td>
</tr>
<tr>
<td>BB</td>
<td>258</td>
</tr>
<tr>
<td>BA</td>
<td>259</td>
</tr>
<tr>
<td>ABB</td>
<td>260</td>
</tr>
<tr>
<td>BBA</td>
<td>261</td>
</tr>
<tr>
<td>ABBA</td>
<td>262</td>
</tr>
</table>
<p>The dictionary generated for <code>ABBABBBABBA</code><br/>(Entries 0-255 not shown for brevity)<br />
</center><br />
Looking at the above table, you can see a few interesting things happening. First, every time the algorithm outputs a code, it also adds a new code to the dictionary.</p>
<p>More importantly, as the dictionary grows, it starts to hold longer and longer strings. And the longer the string, the more compression we can get. If the algorithm starts emitting integer codes for strings of length 10 or more, there is no doubt that we are going to get good compression.</p>
<p>As an example of how this works on real data, here are some entries from the dictionary created when compressing <em>Alice's Adventures in Wonderland</em>:</p>
<pre>
34830 : 'even\n'
34831 : '\nwith t'
34832 : 'the dr'
34833 : 'ream '
34834 : ' of Wo'
34835 : 'onderl'
34836 : 'land'
34837 : 'd of l'
34838 : 'long ag'
34839 : 'go:'
</pre>
<p>These strings have an average length of almost six characters. If we are writing the integer codes to a file using 16 bit binary integers, these entries offer the possibility of 3:1 compression.</p>
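<p>As a rough check of that estimate, here is a small sketch that averages the lengths of the ten entries listed above and converts the result into a ratio for fixed 16-bit codes. The helper names <code>sample_entries</code>, <code>average_length</code>, and <code>compression_ratio</code> are made up for this illustration, and the figure applies only to these particular entries, not to the file as a whole:</p>

```cpp
#include <cstddef>
#include <string>
#include <vector>

// The ten dictionary entries listed above, with '\n' as a real newline.
std::vector<std::string> sample_entries()
{
  return { "even\n", "\nwith t", "the dr", "ream ", " of Wo",
           "onderl", "land", "d of l", "long ag", "go:" };
}

// Average number of input characters represented by one code.
double average_length( const std::vector<std::string> &entries )
{
  std::size_t total = 0;
  for ( const std::string &s : entries )
    total += s.size();
  return static_cast<double>( total ) / entries.size();
}

// Ratio of input bits to output bits when every code costs code_bits.
double compression_ratio( double average_chars, unsigned int code_bits )
{
  return average_chars * 8.0 / code_bits;
}
```

<p>For these ten entries the average works out to 5.5 characters, so each 16-bit code stands in for 44 bits of input, a ratio of 2.75:1.</p>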
<p>The word <em>adaptive</em> is used to describe a compression algorithm that adapts to the type of text it is processing. LZW does an excellent job of this. If a string is seen repeatedly in the text, it will show up in longer and longer entries in the dictionary. If a string is seen rarely, it will not be the foundation for a large batch of longer strings, and thus won't waste space in the dictionary.</p>
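<p>The <code>ABBABBBABBA</code> walk-through earlier in this section can be reproduced with a short, self-contained sketch of the encoder. To keep it independent of any I/O classes, this version takes a <code>std::string</code> and returns the codes in a <code>std::vector</code>. The name <code>lzw_encode</code> and that interface are assumptions made for the example, not part of the article's code:</p>

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of the encoder loop, returning codes in a vector rather than
// writing them to a stream. Code 256 is left unused for the EOF marker.
std::vector<unsigned int> lzw_encode( const std::string &input )
{
  std::unordered_map<std::string, unsigned int> codes;
  for ( unsigned int i = 0 ; i < 256 ; i++ )
    codes[std::string( 1, static_cast<char>( i ) )] = i;
  unsigned int next_code = 257;
  std::vector<unsigned int> output;
  std::string current_string;
  for ( char c : input ) {
    current_string += c;
    if ( codes.find( current_string ) == codes.end() ) {
      // New string: add it, then emit the code for the part already known.
      codes[current_string] = next_code++;
      current_string.pop_back();
      output.push_back( codes[current_string] );
      current_string = c;
    }
  }
  if ( !current_string.empty() )
    output.push_back( codes[current_string] );
  return output;
}
```

<p>Calling <code>lzw_encode( "ABBABBBABBA" )</code> should return the same seven codes as the table: 65, 66, 66, 257, 258, 260, 65.</p>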
<h4>The LZW Decoder</h4>
<p>The change made to the basic encoder to accommodate the LZW algorithm was really very simple: one small batch of code that initializes the dictionary, and another few lines of code that add every new unseen string to the dictionary.</p>
<p>As you might suspect, the changes to the decoder will be fairly simple as well. The first change is that the dictionary must be initialized with the same 256 single-symbol strings that the encoder uses.</p>
<p>Once the decoder starts running, each time it reads in a code, it must add a new value to the dictionary. And what is that value? The entire content of the previously decoded string, plus the first letter of the currently decoded string. This is exactly what the encoder does to create a new string, and the decoder must follow the same steps:</p>
<pre>
void decompress( input_stream in, output_stream out )
{
  std::unordered_map&lt;unsigned int,std::string&gt; strings;
  for ( unsigned int i = 0 ; i &lt; 256 ; i++ )
    strings[i] = std::string(1,i);
  std::string previous_string;
  unsigned int code;
  unsigned int next_code = 257;
  while ( in &gt;&gt; code ) {
    out &lt;&lt; strings[code];
    if ( previous_string.size() )
      strings[next_code++] = previous_string + strings[code][0];
    previous_string = strings[code];
  }
}
</pre>
<p>I won't do a walk-through of the decoder - you should be able to take the codes output from the encoder, shown above, and run them through the decoder to see that the output stream is what we expect.</p>
<p>The important thing is to understand the logic behind the decoder. When the encoder encounters a string that isn't in the dictionary, it breaks it into two pieces: a root string and an appended character. It outputs the code for the root string, and adds the root string + appended character to the dictionary. It then starts building a new string that starts with the appended character.</p>
<p>So every time the decoder uses a code to extract a string from the dictionary, it knows that the first character in that string was the appended character of the string just added to the dictionary by the encoder. And the root of the string added to the dictionary? That was the <em>previously</em> decoded string. This line of code implements that logic:</p>
<pre>
    strings[next_code++] = previous_string + strings[code][0];
</pre>
<p>It adds a new string to the dictionary, composed of the previously seen string, and the first character of the current string. Thus, the decoder is adding strings to the dictionary just one step behind the encoder.</p>
<p>You might note one curious point in the decoder. Instead of always adding the string to the dictionary, it is only done conditionally:</p>
<pre>
if ( previous_string.size() )
  strings[next_code++] = previous_string + strings[code][0];
</pre>
<p>The only time that <code>previous_string.size()</code> is 0 is on the very first pass through the loop. And on the first pass through the loop, we don't have a previous string yet, so the decoder can't build a new dictionary entry. Again, the decoder is always one step behind the encoder, which is a key point in the next section, which puts the final touches on the algorithm.</p>
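<p>For readers who would rather run the codes than trace them by hand, here is a minimal sketch of the decoder as described so far. It works on a <code>std::vector</code> of codes and returns a <code>std::string</code>; the name <code>lzw_decode_simple</code> and that interface are assumptions for the example. It deliberately uses <code>at()</code>, so a code that is missing from the dictionary raises <code>std::out_of_range</code> instead of being silently created:</p>

```cpp
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of the decoder as described so far. It assumes every incoming
// code is already in the dictionary, which holds for the sample input.
std::string lzw_decode_simple( const std::vector<unsigned int> &input )
{
  std::unordered_map<unsigned int, std::string> strings;
  for ( unsigned int i = 0 ; i < 256 ; i++ )
    strings[i] = std::string( 1, static_cast<char>( i ) );
  std::string previous_string;
  unsigned int next_code = 257;
  std::string output;
  for ( unsigned int code : input ) {
    const std::string current = strings.at( code );  // throws if unknown
    output += current;
    // One step behind the encoder: previous string + first char of this one.
    if ( !previous_string.empty() )
      strings[next_code++] = previous_string + current[0];
    previous_string = current;
  }
  return output;
}
```

<p>Feeding it the codes 65, 66, 66, 257, 258, 260, 65 from the encoder walk-through should give back <code>ABBABBBABBA</code>.</p>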
<h4>The Catch</h4>
<p>So far the LZW algorithm we've seen seems very elegant - that's a characteristic we associate with algorithms that can be expressed in just a few lines of code.</p>
<p>Unfortunately, there is one small catch in this perceived elegance - the algorithm as I've shown it to you has a bug.</p>
<p>The bug in the algorithm relates to the fact that the encoder is always one step ahead of the decoder. When the encoder adds a string with code <em>N</em> to the table, it sends enough information to the decoder to allow the decoder to figure out the value of the string denoted by code <em>N-1</em>. The decoder won't know what the value of the string corresponding to code <em>N</em> is until it receives code <em>N+1</em>.</p>
<p>This makes sense if you recall the key line of code from the decoder. It calculates the value of the string encoded by <em>N-1</em> by looking at the string it received on the previous iteration, plus the first character of the current string. And that current string is the one that was sent after encoding <em>N</em>.</p>
<p>So how can this get us in trouble? The encoder is always one entry ahead of the decoder - it has entry <em>N</em> in its dictionary, and the decoder has entry <em>N-1</em>. So if the encoder ever sends code <em>N</em>, the decoder will look in its table and come up empty-handed, unable to do its job of decoding.</p>
<p>A simple example will show you how this can happen. Let's look at the state of the encoder after it has processed the first five symbols in a stream, <code>ABABA</code>:</p>
<p><center><br />
<table border="1">
<tr>
<th>Input<br/>Symbol</th>
<th>Action(s)</th>
<th>New<br/>Code</th>
<th>Output<br/>Code</th>
</tr>
<tr>
<td valign="top"><center>A</center></td>
<td>read 'A' - set current_string to 'A'<br/>'A' is in the dictionary, so continue</td>
<td>&nbsp;</td>
<td>&nbsp;</td>
</tr>
<tr>
<td valign="top"><center>B</center></td>
<td>read 'B' - set current_string to 'AB'<br/>'AB' is not in the dictionary, add it with code 257<br/>output the code for 'A' - 65<br/>set current_string to 'B'</td>
<td valign="top">257 (AB)</td>
<td valign="top">65 (A)</td>
</tr>
<tr>
<td valign="top"><center>A</center></td>
<td>read 'A' - set current_string to 'BA'<br/>'BA' is not in the dictionary, add it with code 258<br/>output the code for 'B' - 66<br/>set current_string to 'A'</td>
<td valign="top">258 (BA)</td>
<td valign="top">66 (B)</td>
</tr>
<tr>
<td valign="top"><center>B</center></td>
<td>read 'B' - set current_string to 'AB'<br/>'AB' is in the dictionary, so continue</td>
<td>&nbsp;</td>
<td>&nbsp;</td>
</tr>
<tr>
<td valign="top"><center>A</center></td>
<td>read 'A' - set current_string to 'ABA'<br/>'ABA' is not in the dictionary, add it with code 259<br/>output the code for 'AB' - 257<br/>set current_string to 'A'</td>
<td valign="top">259 (ABA)</td>
<td valign="top">257 (AB)</td>
</tr>
</table>
<p></center><br />
Now we are set for trouble. The encoder has code 259 in its dictionary, while the decoder has only gotten to 258. If the encoder were to send a code of 259 for its next output, the decoder would not be able to find it in its dictionary. Can this happen?</p>
<p>Yes, if the next two characters in the stream are <code>BA</code>, the next code output by the encoder will be 259, and the decoder will be lost.</p>
<p>In general, this can happen when a dictionary entry exists that consists of a string plus a character, and the encoder encounters the sequence <code>string+character+string+character+string</code>. In the example above, the value of <em>string</em> is <code>A</code>, and the value of <em>character</em> is <code>B</code>. After the encoder encounters <code>AB</code>, it has <code>string+character</code> in the dictionary, so if the following sequence is <code>ABABA</code>, we will emit code <em>N</em>.</p>
<p>Whether this is likely to happen is not too important. What is important is that it most definitely can happen, and the decoder has to be aware of it. And it will happen repeatedly in the pathological case: a stream that consists of a single symbol, repeated over and over.</p>
<p>The good news is that the problem is easily solved. When the decoder receives a code, and finds that this code is not present in its dictionary, it knows right away that the code must be the one that it will add next to its dictionary. And because this only happens when we are encoding the sequence discussed above, the decoder knows that instead of using this value for that code:</p>
<pre>
    strings[next_code++] = previous_string + strings[code][0];
</pre>
<p>it can instead use this value:</p>
<pre>
    strings[ code ] = previous_string + previous_string[0];
</pre>
<p>The result of this is the insertion of just two lines of code at the start of the decompress loop, giving a loop that now looks like this:</p>
<pre>

while ( in &gt;&gt; code ) {
  if ( strings.find( code ) == strings.end() )
    strings[ code ] = previous_string + previous_string[0];
  out &lt;&lt; strings[code];
  if ( previous_string.size() )
    strings[next_code++] = previous_string + strings[code][0];
  previous_string = strings[code];
}
</pre>
<p>And with that, you have a complete implementation of the LZW encoder and decoder.</p>
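<p>To see the special case in action, the sketch below pairs the encoder loop with the corrected decoder loop and runs them on in-memory data. The function names <code>encode_codes</code> and <code>decode_codes</code>, and the use of <code>std::vector</code> in place of streams, are assumptions made to keep the example self-contained:</p>

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Encoder loop, collecting codes in a vector.
std::vector<unsigned int> encode_codes( const std::string &input )
{
  std::unordered_map<std::string, unsigned int> codes;
  for ( unsigned int i = 0 ; i < 256 ; i++ )
    codes[std::string( 1, static_cast<char>( i ) )] = i;
  unsigned int next_code = 257;
  std::vector<unsigned int> output;
  std::string current_string;
  for ( char c : input ) {
    current_string += c;
    if ( codes.find( current_string ) == codes.end() ) {
      codes[current_string] = next_code++;
      current_string.pop_back();
      output.push_back( codes[current_string] );
      current_string = c;
    }
  }
  if ( !current_string.empty() )
    output.push_back( codes[current_string] );
  return output;
}

// Decoder loop with the two-line fix for a code it has not built yet.
std::string decode_codes( const std::vector<unsigned int> &input )
{
  std::unordered_map<unsigned int, std::string> strings;
  for ( unsigned int i = 0 ; i < 256 ; i++ )
    strings[i] = std::string( 1, static_cast<char>( i ) );
  std::string previous_string;
  unsigned int next_code = 257;
  std::string output;
  for ( unsigned int code : input ) {
    // Unknown code: it must be previous string + its own first character.
    if ( strings.find( code ) == strings.end() )
      strings[code] = previous_string + previous_string[0];
    output += strings[code];
    if ( !previous_string.empty() )
      strings[next_code++] = previous_string + strings[code][0];
    previous_string = strings[code];
  }
  return output;
}
```

<p>With this pair, <code>ABABABA</code> encodes to 65, 66, 257, 259, and the decoder recovers the original even though code 259 arrives before it has been added. The single-symbol stream <code>AAAAAA</code> encodes to 65, 257, 258, which hits the special case on every code after the first.</p>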
<h4>Implementation</h4>
<p>Now that I've shown you the algorithm, the next step is to take that code and turn it into a working program. Without changing the algorithm itself, I'm going to take you through four different customizations that work as follows:</p>
<ul>
<li/>LZW-A reads and writes code values rendered in text mode, which is great for debugging. It means you can view the output of the encoder in a text editor.
<li/>LZW-B reads and writes code values as 16-bit binary integers. This is fast and efficient, and usually results in significant data compression.
<li/>LZW-C reads and writes code values as N-bit binary integers, where N is determined by the maximum code size. Performing I/O on codes that are not aligned on byte boundaries complicates the code somewhat, but allows for greater efficiency and better compression.
<li/>LZW-D reads and writes code values as variable-length binary integers, starting with 9-bit codes and gradually increasing as the dictionary grows. This gives the maximum compression.
</ul>
<p>Before launching into these implementations, the code I showed above needs some minor tweaking to solve a couple of problems.</p>
<p>The first problem we have to deal with is the ever-expanding dictionary. In the algorithm I've presented, we keep adding new codes to the dictionary without end. This needs to be changed for a couple of reasons.</p>
<p>First, we don't have unlimited memory, so the dictionary simply can't grow forever. Second, practical experience shows that compression ratios don't improve as dictionary sizes grow without bound. As the dictionary grows, code sizes get larger and larger, and so they take up more space in the compressed stream, which can reduce compression efficiency. </p>
<p>To resolve this problem, I just add an additional argument to the encoder and decoder that sets the maximum code value that will be added to the dictionary. The function signatures now look like this:</p>
<pre>
void compress( input_stream input,
               output_stream output,
               const unsigned int max_code = 32767 );
void decompress( input_stream input,
                 output_stream output,
                 const unsigned int max_code = 32767 );
</pre>
<p>Implementing it means one small change in the encoder:</p>
<pre>
if ( next_code &lt;= max_code )
  codes[ current_string ] = next_code++;
</pre>
<p>And a corresponding change in the decoder:</p>
<pre>
if ( previous_string.size() &amp;&amp; next_code &lt;= max_code )
  strings[next_code++] = previous_string + strings[code][0];
</pre>
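<p>The effect of the limit is easy to check with an in-memory sketch. The two functions below add the <code>max_code</code> test to the encoder and decoder loops; their names, <code>encode_capped</code> and <code>decode_capped</code>, and the vector-based interface are assumptions for the example. Once the limit is reached, both sides simply stop adding entries and carry on with the dictionary they have:</p>

```cpp
#include <string>
#include <unordered_map>
#include <vector>

std::vector<unsigned int> encode_capped( const std::string &input,
                                         const unsigned int max_code = 32767 )
{
  std::unordered_map<std::string, unsigned int> codes;
  for ( unsigned int i = 0 ; i < 256 ; i++ )
    codes[std::string( 1, static_cast<char>( i ) )] = i;
  unsigned int next_code = 257;
  std::vector<unsigned int> output;
  std::string current_string;
  for ( char c : input ) {
    current_string += c;
    if ( codes.find( current_string ) == codes.end() ) {
      if ( next_code <= max_code )           // stop growing at the limit
        codes[current_string] = next_code++;
      current_string.pop_back();
      output.push_back( codes[current_string] );
      current_string = c;
    }
  }
  if ( !current_string.empty() )
    output.push_back( codes[current_string] );
  return output;
}

std::string decode_capped( const std::vector<unsigned int> &input,
                           const unsigned int max_code = 32767 )
{
  std::unordered_map<unsigned int, std::string> strings;
  for ( unsigned int i = 0 ; i < 256 ; i++ )
    strings[i] = std::string( 1, static_cast<char>( i ) );
  std::string previous_string;
  unsigned int next_code = 257;
  std::string output;
  for ( unsigned int code : input ) {
    if ( strings.find( code ) == strings.end() )
      strings[code] = previous_string + previous_string[0];
    output += strings[code];
    if ( !previous_string.empty() && next_code <= max_code )  // same limit
      strings[next_code++] = previous_string + strings[code][0];
    previous_string = strings[code];
  }
  return output;
}
```

<p>The one requirement is that both sides are given the same <code>max_code</code>. With a small limit such as 300, no emitted code can exceed 300, and the round trip still reproduces the input.</p>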
<h4>Input and Output</h4>
<p>Finally, I need to give the algorithm a decent way to perform input and output - and this is where C++ offers a huge amount of help.</p>
<p>When writing generic compression code that you intend to use in multiple contexts, one of the more difficult things to deal with is I/O. People using your code might want to compress data in memory, stored in files, or streaming in from sockets or other sources. Some input data sources might be of unknown length (data coming from a TCP socket, for example), while others will be of a prescribed length. Back in the days of C, it was particularly difficult to make your compression code both generic, so it would work with all types of data streams, and efficient, so that I/O doesn't take any more time than it has to.</p>
<p>With the advent of C++, we have a new tool that can help in this quest - templates. Templates are designed to solve this problem in an efficient way, and I take advantage of this in my sample code. The code below shows the final version of the compressor and decompressor that are used in all four versions of the implementation. There are two final changes made to the routines shown previously. First, both C++ functions are now function templates, parameterized on the types being used for input and output. Second, the actual input and output is done through four newly introduced template classes:</p>
<pre>
template&lt;class INPUT, class OUTPUT&gt;
void compress( INPUT &amp;input, OUTPUT &amp;output, const unsigned int max_code = 32767 )
{
  input_symbol_stream&lt;INPUT&gt; in( input );
  output_code_stream&lt;OUTPUT&gt; out( output, max_code );

  std::unordered_map&lt;std::string, unsigned int&gt; codes( (max_code * 11)/10 );
  for ( unsigned int i = 0 ; i &lt; 256 ; i++ )
    codes[std::string(1,i)] = i;
  unsigned int next_code = 257;
  std::string current_string;
  char c;
  while ( in &gt;&gt; c ) {
    current_string = current_string + c;
    if ( codes.find(current_string) == codes.end() ) {
      if ( next_code &lt;= max_code )
        codes[ current_string ] = next_code++;
      current_string.erase(current_string.size()-1);
      out &lt;&lt; codes[current_string];
      current_string = c;
    }
  }
  if ( current_string.size() )
    out &lt;&lt; codes[current_string];
}

template&lt;class INPUT, class OUTPUT&gt;
void decompress( INPUT &amp;input, OUTPUT &amp;output, const unsigned int max_code = 32767  )
{
  input_code_stream&lt;INPUT&gt; in( input, max_code );
  output_symbol_stream&lt;OUTPUT&gt; out( output );

  std::unordered_map&lt;unsigned int,std::string&gt; strings( (max_code * 11) / 10 );
  for ( unsigned int i = 0 ; i &lt; 256 ; i++ )
    strings[i] = std::string(1,i);
  std::string previous_string;
  unsigned int code;
  unsigned int next_code = 257;
  while ( in &gt;&gt; code ) {
    if ( strings.find( code ) == strings.end() )
      strings[ code ] = previous_string + previous_string[0];
    out &lt;&lt; strings[code];
    if ( previous_string.size() &amp;&amp; next_code &lt;= max_code )
      strings[next_code++] = previous_string + strings[code][0];
    previous_string = strings[code];
  }
}
</pre>
<p>What exactly is the effect of implementing this algorithm using a pair of <em>function templates</em>, parameterized on the types of the input and output objects? What this means is that you can call these compression routines with any type of I/O object you can throw at them. It can work with C++ iostreams, C FILE&nbsp;* objects, raw blocks of memory, whatever you want.</p>
<p>But there's a catch to that flexibility - you have to implement some basic I/O routines for whatever type you are using. Fortunately, this is not too hard.</p>
<p>The actual I/O that is done in the compression routines is defined by four template classes I created. These classes are defined in <code>lzw_streambase.h</code>. These classes don't have implementations, but they do define the methods you need to implement to work with the compressor and decompressor. The four classes are: </p>
<ul>
<li/><code>input_symbol_stream&lt;T&gt;</code>
<li/><code>output_symbol_stream&lt;T&gt;</code>
<li/><code>input_code_stream&lt;T&gt;</code>
<li/><code>output_code_stream&lt;T&gt;</code>
</ul>
<p>The first two classes are the symbol input and output classes. These are normally going to be very simple implementations, as they just have to read single characters from a stream or write strings to one, while checking for errors or ends of streams. I use the same versions of these classes in all four implementations, so the code in <code>lzw-a.h</code> is unchanged in the other three header files.</p>
<p>The <code>input_symbol_stream&lt;T&gt;</code> class has one member function: the extraction operator, which reads a character from the stream and returns a boolean true or false. You'll see later in this section that the implementation of this for types such as <code>std::istream</code> is trivial.</p>
<pre>
template&lt;typename T&gt;
class input_symbol_stream
{
public :
    input_symbol_stream( T &amp; );
    bool operator&gt;&gt;( char &amp;c );
};
</pre>
<p>The <code>output_symbol_stream&lt;T&gt;</code> class uses the insertion operator to write strings instead of individual characters - because that is what is stored in the dictionary. The C++ <code>std::string</code> class makes a perfectly good container for any variety of symbols, including binary data, and unlike the alternative <code>vector&lt;char&gt;</code>, it comes with hash functions and <code>iostream</code> operators.</p>
<pre>
template&lt;typename T&gt;
class output_symbol_stream
{
public :
    output_symbol_stream( T &amp;  );
    void operator&lt;&lt;( const std::string &amp;s );
};
</pre>
<p>The <code>input_code_stream&lt;T&gt;</code> class reads codes, normally unsigned integers, from some type of stream. In my implementations, this class also returns false if it encounters the <code>EOF_CODE</code> in the stream of incoming codes. Removing the responsibility for EOF detection from the decompressor makes the code a bit simpler and more versatile.</p>
<p>The formatting of the integer is entirely up to the implementor, but the most common approach will probably be variable length codes ranging from 9 to 16 or so bits.</p>
<pre>
template&lt;typename T&gt;
class input_code_stream
{
public :
    input_code_stream( T &amp;, unsigned int );
    bool operator&gt;&gt;( unsigned int &amp;i );
};
</pre>
<p>The <code>output_code_stream&lt;T&gt;</code> class writes codes, usually unsigned integers, to some type of stream. Whatever class you implement for this function must agree with the implementation for <code>input_code_stream&lt;T&gt;</code>.</p>
<pre>
template&lt;typename T&gt;
class output_code_stream
{
public :
    output_code_stream( T &amp;, unsigned int );
    void operator&lt;&lt;( const unsigned int i );
};
</pre>
<p>You can see that at the top of the compressor and decompressor, I instantiate objects of these types, then use the standard insertion and extraction operators to read and write from these objects. </p>
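<p>To illustrate what implementing these four interfaces for a new type might look like, here is a sketch that works entirely in memory: symbols live in a <code>std::string</code> and codes in a <code>std::vector&lt;unsigned int&gt;</code>. None of this comes from the article's headers. The <code>code_vector</code> typedef is hypothetical, the primary templates are only forward-declared here for brevity, and because a vector knows its own length this version does not write an EOF code at all:</p>

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Primary templates, forward-declared only for this sketch.
template<typename T> class input_symbol_stream;
template<typename T> class output_symbol_stream;
template<typename T> class input_code_stream;
template<typename T> class output_code_stream;

typedef std::vector<unsigned int> code_vector;  // hypothetical code store

template<>
class input_symbol_stream<std::string> {
public :
    input_symbol_stream( std::string &input )
        : m_input( input ), m_pos( 0 ) {}
    bool operator>>( char &c )
    {
        if ( m_pos >= m_input.size() )
            return false;               // end of the in-memory "stream"
        c = m_input[m_pos++];
        return true;
    }
private :
    std::string &m_input;
    std::size_t m_pos;
};

template<>
class output_symbol_stream<std::string> {
public :
    output_symbol_stream( std::string &output )
        : m_output( output ) {}
    void operator<<( const std::string &s ) { m_output += s; }
private :
    std::string &m_output;
};

template<>
class input_code_stream<code_vector> {
public :
    input_code_stream( code_vector &input, unsigned int )
        : m_input( input ), m_pos( 0 ) {}
    bool operator>>( unsigned int &i )
    {
        if ( m_pos >= m_input.size() )
            return false;
        i = m_input[m_pos++];
        return true;
    }
private :
    code_vector &m_input;
    std::size_t m_pos;
};

template<>
class output_code_stream<code_vector> {
public :
    output_code_stream( code_vector &output, unsigned int )
        : m_output( output ) {}
    void operator<<( const unsigned int i ) { m_output.push_back( i ); }
private :
    code_vector &m_output;
};
```

<p>Given specializations along these lines, a call such as <code>compress( text, codes )</code> with a <code>std::string</code> and a <code>code_vector</code> should pick them up automatically, since the function templates instantiate the stream classes on whatever types they are handed.</p>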
<h4>LZW-A</h4>
<p>In my sample Windows program, I include <code>lzw_streambase.h</code> and <code>lzw.h</code>, which accounts for all of the code you have seen so far. I have the following lines that perform compression and decompression:</p>
<pre>
std::ifstream in( name, std::ios_base::binary );
std::ofstream lzw_out( temp_name_lzw, std::ios_base::binary );
compress( (std::istream &amp;) in, (std::ostream&amp;) lzw_out, pDlg-&gt;m_MaxCodeSize );
.
.
.
std::ifstream lzw_in( temp_name_lzw, std::ios_base::binary );
std::fstream out( temp_name_out,
                  std::fstream::in    |
                  std::fstream::out   |
                  std::fstream::binary );
decompress( (std::istream &amp;) lzw_in, (std::ostream&amp;) out, pDlg-&gt;m_MaxCodeSize );
</pre>
<p>If I try to build this project as-is, I get a nasty list of eight linker errors:<br />
<center></p>
<table border="0">
<tr>
<td><img src="/attachments/2011/lzw/Figure01.png"/></td>
</tr>
<tr>
<td><center>Visual Studio 10 Error Messages</center></td>
</tr>
</table>
<p></center><br />
If you have the fortitude to crawl through those link errors, you will see that what is missing are the implementations of the four classes parameterized on <code>std::ostream</code> and <code>std::istream</code>. Each of the four classes needs the implementation of a constructor and either an insertion or extraction operator. And with none of those member functions defined, that adds up to eight missing functions. To get us started on performing actual LZW compression, I've created the first implementation of these four classes in <code>lzw-a.h</code>. Let's take a look at each of these in turn.</p>
<p>It's tempting to try to read characters using the <code>istream</code> extraction operator, as in <code>m_input &gt;&gt; c</code>, but that operator skips over whitespace, so we don't get an exact copy of the input stream. Using <code>get()</code> works around this problem. Below is the complete definition of <code>input_symbol_stream&lt;std::istream&gt;</code> used in all four LZW implementations in this article:</p>
<pre>
template&lt;&gt;
class input_symbol_stream&lt;std::istream&gt; {
public :
    input_symbol_stream( std::istream &amp;input )
        : m_input( input ) {}
    bool operator&gt;&gt;( char &amp;c )
    {
        if ( !m_input.get( c ) )
            return false;
        else
            return true;
    }
private :
    std::istream &amp;m_input;
};
</pre>
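<p>The difference between the two ways of reading is easy to demonstrate. This small sketch reads the same text once with the extraction operator and once with <code>get()</code>; the function names are invented for the example, and a <code>std::istringstream</code> stands in for a file:</p>

```cpp
#include <sstream>
#include <string>

// Reads every character with the formatted extraction operator,
// which skips whitespace by default.
std::string read_with_extraction( const std::string &text )
{
  std::istringstream in( text );
  std::string result;
  char c;
  while ( in >> c )
    result += c;
  return result;
}

// Reads every character with unformatted get(), which skips nothing.
std::string read_with_get( const std::string &text )
{
  std::istringstream in( text );
  std::string result;
  char c;
  while ( in.get( c ) )
    result += c;
  return result;
}
```

<p>Given the input <code>"A B\nC"</code>, the first function returns <code>ABC</code>, silently dropping the space and the newline, while the second returns all five characters. A compressor built on the first would not round-trip text files.</p>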
<p>Using the insertion operator to output strings works properly even when the strings contain binary data, because it writes every character held in the <code>std::string</code>, embedded zero bytes included. So as long as the stream is opened in binary mode, the implementation of the class used to output symbols is as simple as we could hope for. Again, this exact code is used in all four implementations in this article:</p>
<pre>
template&lt;&gt;
class output_symbol_stream&lt;std::ostream&gt; {
public :
    output_symbol_stream( std::ostream &amp;output )
        : m_output( output ) {}
    void operator&lt;&lt;( const std::string &amp;s )
    {
        m_output &lt;&lt; s;
    }
private :
    std::ostream &amp;m_output;
};
</pre>
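<p>If you want to convince yourself that binary data survives this path, a check along the following lines should do it. The helper name <code>written_bytes</code> is hypothetical, and a <code>std::ostringstream</code> stands in for a file opened in binary mode:</p>

```cpp
#include <sstream>
#include <string>

// Writes s with the insertion operator and returns the bytes that
// actually ended up in the stream.
std::string written_bytes( const std::string &s )
{
  std::ostringstream out;
  out << s;
  return out.str();
}
```

<p>A string built as <code>std::string( "A\0B\n", 4 )</code> should come back with all four bytes intact, including the embedded zero.</p>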
<p>LZW-A prints the text values of integers to the output stream, and reads them back in that format. This is not efficient at all, but it is a great aid in debugging. If you are having a problem with the algorithm, this provides a nice way to examine your stream. The implementation of this is very simple - just use the <code>std::ostream</code> insertion operator, and follow each code by a newline so it can be properly parsed on input, as well as be easily loaded into a text editor.</p>
<p>One important thing to notice in this class: the presence of a destructor that prints the <code>EOF_CODE</code>. Since this object goes out of scope as the compressor exits, this ensures that every code stream will end with this special code. Putting the onus on the I/O routines to deal with EOF issues simplifies the algorithm itself. (It also means that you can implement versions of LZW that don't use an EOF in the code stream.)</p>
<pre>
template&lt;&gt;
class output_code_stream&lt;std::ostream&gt; {
public :
    output_code_stream( std::ostream &amp;output, const unsigned int )
        : m_output( output ) {}
    void operator&lt;&lt;( unsigned int i )
    {
        m_output &lt;&lt; i &lt;&lt; '\n';
    }
    ~output_code_stream()
    {
        *this &lt;&lt; EOF_CODE;
    }
private :
    std::ostream &amp;m_output;
};
</pre>
<p>The corresponding version of the input class just reads in the white-space separated codes. If there is an error or an <code>EOF_CODE</code> encountered in the stream, the extraction operator returns false, which allows the decompressor to know when it is time to stop processing.</p>
<pre>
template&lt;&gt;
class input_code_stream&lt;std::istream&gt; {
public :
    input_code_stream( std::istream &amp;input, unsigned int )
        : m_input( input ) {}
    bool operator&gt;&gt;( unsigned int &amp;i )
    {
        m_input &gt;&gt; i;
        if ( !m_input || i == EOF_CODE )
            return false;
        else
            return true;
    }
private :
    std::istream &amp;m_input;
};
</pre>
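<p>Stripped of the class wrappers, the text format used by LZW-A comes down to the pair of functions sketched below. This is an illustration rather than the code in <code>lzw-a.h</code>: the function names are invented, and <code>EOF_CODE</code> is assumed to be 256, the value reserved for the EOF marker earlier in the article:</p>

```cpp
#include <sstream>
#include <string>
#include <vector>

const unsigned int EOF_CODE = 256;  // assumed value of the reserved code

// Writes each code as decimal text on its own line, then the EOF marker.
void write_text_codes( std::ostream &out,
                       const std::vector<unsigned int> &codes )
{
  for ( unsigned int code : codes )
    out << code << '\n';
  out << EOF_CODE << '\n';
}

// Reads whitespace-separated codes until EOF_CODE or a stream error.
std::vector<unsigned int> read_text_codes( std::istream &in )
{
  std::vector<unsigned int> codes;
  unsigned int code;
  while ( in >> code && code != EOF_CODE )
    codes.push_back( code );
  return codes;
}
```

<p>Here the whitespace-skipping behavior of the extraction operator is exactly what we want, since the newlines are only separators. Writing the seven codes from the <code>ABBABBBABBA</code> example produces eight lines of text, the last of which is 256.</p>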
<p>By including <code>lzw-a.h</code> along with the other two header files, I can now create a program that compiles, links, and is able to test the algorithm. Using my UNIX test program, I compress the demo string from earlier in this article, and I see the output as it is sent directly to <code>stdout</code>:<br />
<center></p>
<table border="0">
<tr>
<td><img src="/attachments/2011/lzw/Figure02.png"/></td>
</tr>
<tr>
<td><center>Compressing <code>ABBABBBABBA</code></center></td>
</tr>
</table>
<p></center><br />
Fortunately, the output is identical to what was shown earlier, with the addition of the final <code>EOF_CODE</code> used to delimit the end of the code stream.</p>
<h4>LZW-B</h4>
<p>The header file <code>lzw-b.h</code> implements specialized classes that replace the text-mode output of the codes in <code>lzw-a.h</code> with binary codes stored in a short integer - two bytes. </p>
<p>The classes that read and write symbols are unchanged, but reading and writing codes has to change in order to do this new binary output.</p>
<p>Writing the codes to <code>std::ostream</code> as binary values requires breaking the integer code into two bytes and writing the bytes one at a time. There are more efficient ways to write the complete short integer in one function call, but they raise code portability problems, as we don't always know what order bytes will be written in.</p>
<p>Like the code stream output object in <code>lzw-a.h</code>, this version of the code output class has a destructor that outputs an <code>EOF_CODE</code> value:</p>
<pre>
template&lt;&gt;
class output_code_stream&lt;std::ostream&gt; {
public :
    output_code_stream( std::ostream &amp;output, const unsigned int )
        : m_output( output ) {}
    void operator&lt;&lt;( unsigned int i )
    {
        m_output.put( i &amp; 0xff );
        m_output.put( (i&gt;&gt;8) &amp; 0xff);
    }
    ~output_code_stream()
    {
        *this &lt;&lt; EOF_CODE;
    }
private :
    std::ostream &amp;m_output;
};
</pre>
<p>Reading the codes requires reading the two bytes that make up the short integer, then combining them. While reading, if the routine detects an <code>EOF_CODE</code>, it returns false, which tells the decompressor to stop processing. It also returns false if there is an error on the input code stream.</p>
<pre>
template&lt;&gt;
class input_code_stream&lt;std::istream&gt; {
public :
    input_code_stream( std::istream &amp;input, unsigned int )
        : m_input( input ) {}
    bool operator&gt;&gt;( unsigned int &amp;i )
    {
        char c;
        if ( !m_input.get(c) )
            return false;
        i = c &amp; 0xff;
        if ( !m_input.get(c) )
            return false;
        i |= (c &amp; 0xff) &lt;&lt; 8;
        if ( i == EOF_CODE )
            return false;
        else
            return true;
    }
private :
    std::istream &amp;m_input;
};
</pre>
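<p>The byte order used here is low byte first, regardless of the machine the code runs on. The sketch below isolates that packing and unpacking in two free functions so it can be tried on its own. The names <code>put_code16</code> and <code>get_code16</code> are made up for the example, and unlike the class above the reader does not look for <code>EOF_CODE</code>:</p>

```cpp
#include <sstream>
#include <string>

// Packs a code into two bytes, low byte first.
void put_code16( std::ostream &out, unsigned int code )
{
  out.put( static_cast<char>( code & 0xff ) );
  out.put( static_cast<char>( ( code >> 8 ) & 0xff ) );
}

// Reads two bytes and reassembles the code; returns false on a short read.
bool get_code16( std::istream &in, unsigned int &code )
{
  char low, high;
  if ( !in.get( low ) || !in.get( high ) )
    return false;
  // Go through unsigned char so bytes above 0x7f are not sign-extended.
  code = static_cast<unsigned char>( low ) |
         ( static_cast<unsigned int>( static_cast<unsigned char>( high ) ) << 8 );
  return true;
}
```

<p>Code 260, which is 0x0104, is written as the byte 0x04 followed by 0x01. The casts through <code>unsigned char</code> do the same job as the <code>&amp; 0xff</code> masks in the class above.</p>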
<p>The most exciting thing about <code>lzw-b.h</code> is that you can now see data compression taking place. The figure below shows a sample run of this implementation against the <a href="http://corpus.canterbury.ac.nz/descriptions/" class="newpage">Canterbury Corpus</a>, a standard set of files used to test compression. A run with my Windows test program shows that the files are compressing quite nicely:<br />
<center></p>
<table border="0">
<tr>
<td><img src="/attachments/2011/lzw/Figure03.png"/></td>
</tr>
<tr>
<td><center>Compressing the Canterbury Corpus with <code>lzw-b.h</code></center></td>
</tr>
</table>
<p></center></p>
<h4>LZW-C</h4>
<p>The third I/O implementation, defined in <code>lzw-c.h</code>, writes binary codes like <code>lzw-b.h</code>, but with one crucial difference. Instead of being hard coded to 16 bit codes, <code>lzw-c.h</code> determines the maximum code size needed based on the maximum code value passed as an argument to <code>compress()</code> and <code>decompress()</code>. It then writes codes based on that width, which will normally be something in the range of 9-18 bits wide.</p>
<p>Since these values are not aligned with byte boundaries, there are some issues writing them to streams that expect to read and write bytes. However, it is definitely worth all the bit shifting, ORing, and ANDing, because when the size is 12 bits, we are going to save four bits per code when compared to using <code>lzw-b.h</code>. But every read and write potentially starts somewhere in the middle of a byte, so the I/O classes have to do some extra work - mostly involved with shifting bits to the correct position in the output stream.</p>
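<p>The width itself is just the number of bits needed to hold the largest code. One way it might be computed is sketched below; the function name <code>code_size_for</code> is hypothetical and this is not necessarily how <code>lzw-c.h</code> does it:</p>

```cpp
#include <limits>

// Number of bits needed to represent every value from 0 to max_code.
unsigned int code_size_for( unsigned int max_code )
{
  const unsigned int limit = std::numeric_limits<unsigned int>::digits;
  unsigned int bits = 1;
  // Keep widening until no set bits remain above the current width.
  while ( bits < limit && ( max_code >> bits ) != 0 )
    bits++;
  return bits;
}
```

<p>A maximum code of 4095 needs 12 bits, so each code is four bits shorter than the 16 used by <code>lzw-b.h</code>. The default maximum of 32767 needs 15 bits, which saves only one bit per code.</p>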
<p>Note that the code to read and write symbols is unchanged from <code>lzw-a.h</code> and <code>lzw-b.h</code>.</p>
<p>Many of the CS students who read my earlier article on LZW ran into a brick wall when they started trying to understand the code that performs I/O on codes of variable bit lengths. Obviously, writing 11 bit codes when your file system is oriented around eight-bit bytes involves a lot of bit twiddling, and I'm afraid that many novices are woefully deficient in this department. Not just in understanding the bitwise operators in C, such as shifting, masking, etc., but in understanding binary arithmetic in general.</p>
<p>That's why I've structured the code and this article a bit differently this time around. If the I/O operations in <code>lzw-c.h</code> and <code>lzw-d.h</code> are bewildering, well, no worries. They have absolutely nothing to do with the LZW algorithm itself. You can investigate and explore the algorithm completely using <code>lzw-a.h</code> and <code>lzw-b.h</code>, and just forget about the last two I/O implementations. They provide additional efficiency, but as I have said, have nothing to do with the algorithm itself. </p>
<p>Further, once you use <code>lzw-a.h</code> to debug and understand the algorithm, you can certainly plug in <code>lzw-c.h</code> and <code>lzw-d.h</code> and take advantage of their improved compression, even if you don't follow all the code. </p>
<p>It might be appropriate to add a sidebar or another section to explain the variable bit length I/O in detail, but this article is quite long already, and there are numerous other resources for the interested reader to explore the details. (But if you find yourself deficient in this area, you owe it to yourself to hit the books and get to the point where these operations make sense. This won't be the last time you need to understand bitwise operators.)</p>
<p>For those who are ready to tackle this more complicated I/O procedure, we will look first at the <code>output_code_stream&lt;std::ostream&gt;</code> class. Here, the first thing to understand is that the constructor has to initialize the number of bits in the code. This value is calculated from the <code>max_code</code> parameter, and is stored in member <code>m_code_size</code>, where it is used frequently.</p>
<p>Next, the insertion operator. Output of codes proceeds as follows. Member <code>m_pending_bits</code> tells us how many bits are pending output while sitting in member <code>m_pending_output</code>. These bits are right justified, and the count will always be less than eight. When the new code is written, it is inserted into <code>m_pending_output</code> after being left shifted so it will be laid down just past the pending bits. After doing that, we presumably have some bytes to output - the exact number depends on various factors. The <code>flush()</code> routine is called, and it flushes all complete bytes out. When it completes, there can be anywhere from zero to seven bits still waiting to be output, and they will be right justified in <code>m_pending_output</code>.</p>
<p>In the destructor, we output an <code>EOF_CODE</code>, and then do a flush as well. But in this case, we flush all possible bits, not just the complete bytes. There are two good reasons for this. First,  we don't care if the last bits that are flushed out are only part of a code - the code will be <code>EOF_CODE</code>, and that is the last one. And second, if we don't flush those final bits out in the destructor, they will never be sent to the output stream. This means the decoder will not see those bits, and we will most likely break the decompress process.</p>
<pre>
template&lt;&gt;
class output_code_stream&lt;std::ostream&gt;
{
public :
    output_code_stream( std::ostream &amp;out, unsigned int max_code )
        : m_output( out ),
          m_pending_bits(0),
          m_pending_output(0),
          m_code_size(1)
    {
        while ( max_code &gt;&gt;= 1 )
            m_code_size++;
    }
    ~output_code_stream()
    {
        *this &lt;&lt; EOF_CODE;
        flush(0);
    }
    void operator&lt;&lt;( const unsigned int &amp;i )
    {
        m_pending_output |= i &lt;&lt; m_pending_bits;
        m_pending_bits += m_code_size;
        flush( 8 );
    }
private :
    void flush( const int val )
    {
        while ( m_pending_bits &gt;= val ) {
            m_output.put( m_pending_output &amp; 0xff );
            m_pending_output &gt;&gt;= 8;
            m_pending_bits -= 8;
        }
    }
    std::ostream &amp;m_output;
    int m_code_size;
    int m_pending_bits;
    unsigned int m_pending_output;
};
</pre>
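<p>To make the packing order concrete, here is a minimal standalone sketch of the same steps. The function <code>pack_codes()</code> is a hypothetical helper of my own, not part of <code>lzw-c.h</code>, and it assumes every code fits in <code>code_size</code> bits. Unlike the destructor above, which always writes one last byte, this sketch only writes a final byte when some bits are still pending.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of lzw-c.h: packs fixed-width codes into
// bytes, low-order bits first, following the steps described in the text.
std::vector<std::uint8_t> pack_codes(const std::vector<unsigned int> &codes,
                                     int code_size)
{
    assert(code_size >= 1 && code_size <= 24); // pending bits stay under 32
    std::vector<std::uint8_t> out;
    std::uint32_t pending_output = 0; // right-justified bits awaiting output
    int pending_bits = 0;             // always less than 8 between codes
    for (unsigned int code : codes) {
        // Lay the new code down just past the bits that are still pending.
        pending_output |= static_cast<std::uint32_t>(code) << pending_bits;
        pending_bits += code_size;
        // Flush every complete byte.
        while (pending_bits >= 8) {
            out.push_back(static_cast<std::uint8_t>(pending_output & 0xff));
            pending_output >>= 8;
            pending_bits -= 8;
        }
    }
    // Flush the final partial byte, if any, padded with zero bits.
    if (pending_bits > 0)
        out.push_back(static_cast<std::uint8_t>(pending_output & 0xff));
    return out;
}
```

<p>Packing the two 12-bit codes 0xABC and 0x123 this way yields the three bytes 0xBC, 0x3A, and 0x12: the low eight bits of the first code, then a byte shared by both codes, then the high eight bits of the second.</p>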
<p>Like the output code class, the input code class has to calculate the code size for this decompression based on the <code>max_code</code> value passed in the function call. </p>
<p>When an attempt is made to read a code, there must be a minimum of <code>m_code_size</code> bits in member <code>m_pending_input</code>. If there aren't, new bytes are read in one at a time, and inserted into <code>m_pending_input</code> after having been shifted left the appropriate amount. Once <code>m_pending_input</code> contains at least <code>m_code_size</code> bits, the code is extracted from <code>m_pending_input</code> using the appropriate mask, <code>m_pending_input</code> is shifted right by <code>m_code_size</code> bits, and the count in <code>m_available_bits</code> is reduced by the same amount.</p>
<pre>
template&lt;&gt;
class input_code_stream&lt;std::istream&gt;
{
public :
    input_code_stream( std::istream &amp;in, unsigned int max_code )
        : m_input( in ),
          m_available_bits(0),
          m_pending_input(0),
          m_code_size(1)
    {
        while ( max_code &gt;&gt;= 1 )
            m_code_size++;
    }
    bool operator&gt;&gt;( unsigned int &amp;i )
    {
        while ( m_available_bits &lt; m_code_size )
        {
            char c;
            if ( !m_input.get(c) )
                return false;
            m_pending_input |= (c &amp; 0xff) &lt;&lt; m_available_bits;
            m_available_bits += 8;
        }
        i = m_pending_input &amp; ~(~0u &lt;&lt; m_code_size);
        m_pending_input &gt;&gt;= m_code_size;
        m_available_bits -= m_code_size;
        if ( i == EOF_CODE )
            return false;
        else
            return true;
    }
private :
    std::istream &amp;m_input;
    int m_code_size;
    int m_available_bits;
    unsigned int m_pending_input;
};
</pre>
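<p>The reading steps can be sketched the same way, outside of any stream class. The function <code>unpack_codes()</code> below is a hypothetical helper, not part of <code>lzw-c.h</code>. It assumes the bytes were written low-order bits first, as described above, and that the code size is at least nine bits, so the zero padding in the last byte can never be mistaken for a code. It simply stops when the input runs out instead of looking for <code>EOF_CODE</code>.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of lzw-c.h: extracts fixed-width codes from
// a byte buffer that was packed low-order bits first.
std::vector<unsigned int> unpack_codes(const std::vector<std::uint8_t> &bytes,
                                       int code_size)
{
    assert(code_size >= 9 && code_size <= 24); // pending bits stay under 32
    const std::uint32_t mask = (std::uint32_t(1) << code_size) - 1;
    std::vector<unsigned int> codes;
    std::uint32_t pending_input = 0; // right-justified bits not yet consumed
    int available_bits = 0;          // number of valid bits in pending_input
    std::size_t next = 0;            // index of the next unread byte
    for (;;) {
        // Pull in bytes until a whole code is available.
        while (available_bits < code_size) {
            if (next == bytes.size())
                return codes; // out of input; leftover bits are padding
            pending_input |= static_cast<std::uint32_t>(bytes[next++])
                             << available_bits;
            available_bits += 8;
        }
        // Mask off the code, then discard the bits it occupied.
        codes.push_back(pending_input & mask);
        pending_input >>= code_size;
        available_bits -= code_size;
    }
}
```

<p>Feeding this function the bytes 0xBC, 0x3A, and 0x12 with a code size of 12 returns the codes 0xABC and 0x123.</p>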
<p>The table below shows the results of a test run comparing LZW-B and LZW-C run with a maximum code of 4095. With this maximum value, all codes fit in a 12-bit integer. Since LZW-B uses a 16-bit integer to store each code value and LZW-C uses 12 bits, the LZW-C output should be three quarters the size of the LZW-B output, and this looks to be the case:<br />
<center></p>
<table border="1">
<tr>
<th>File Name</th>
<th>Original<br/>Size</th>
<th>Compressed<br/>LZW-B</th>
<th>Compressed<br/>LZW-C</th>
<th>Ratio</th>
</tr>
<tr>
<td>alice29.txt</td>
<td>152089</td>
<td>96428</td>
<td>72322</td>
<td>0.750</td>
</tr>
<tr>
<td>alphabet.txt</td>
<td>100000</td>
<td>4538</td>
<td>3404</td>
<td>0.750</td>
</tr>
<tr>
<td>asyoulik.txt</td>
<td>125179</td>
<td>83966</td>
<td>62975</td>
<td>0.750</td>
</tr>
<tr>
<td>bib</td>
<td>111261</td>
<td>71792</td>
<td>53845</td>
<td>0.750</td>
</tr>
<tr>
<td>bible.txt</td>
<td>4047392</td>
<td>2468326</td>
<td>1851245</td>
<td>0.750</td>
</tr>
</table>
<p>Comparing 12-bit compression between LZW-B and LZW-C<br />
</center><br />
It looks like things are working as expected.</p>
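<p>The figures in the table can be cross-checked with a little arithmetic. The sketch below rests on two assumptions of mine: that the LZW-B file holds exactly two bytes per code, including the final <code>EOF_CODE</code>, and that LZW-C writes the same codes at <code>code_size</code> bits each, followed by the one last byte that <code>flush(0)</code> always emits. The function <code>predicted_lzw_c_size()</code> is a hypothetical helper, not part of the download.</p>

```cpp
#include <cassert>

// Hypothetical helper: predicts the LZW-C output size from the LZW-B size.
// Assumes two bytes per code in the LZW-B file, and one trailing byte
// written by flush(0) in the LZW-C destructor.
unsigned long long predicted_lzw_c_size(unsigned long long lzw_b_size,
                                        int code_size)
{
    assert(lzw_b_size % 2 == 0);     // LZW-B writes whole 16-bit codes
    assert(code_size >= 9 && code_size <= 16);
    const unsigned long long codes = lzw_b_size / 2;
    const unsigned long long total_bits = codes * code_size;
    return total_bits / 8 + 1;       // complete bytes, plus the final flush
}
```

<p>Under those assumptions, the 96428 bytes of LZW-B output for alice29.txt hold 48214 codes, which take 72321 complete bytes at 12 bits each, plus the final byte, for 72322. The other four rows of the table work out the same way.</p>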
<h4>LZW-D</h4>
<p>The code in <code>lzw-d.h</code> represents the final and most efficient version of I/O for the LZW code streams. It builds on the code in <code>lzw-c.h</code> - at its core it is a variable bit-length I/O stream. However, there is one crucial difference from <code>lzw-c.h</code>: the code I/O in <code>lzw-d.h</code> starts at the smallest possible code size, nine bits, and increases the code size as needed, until it reaches the maximum value for this compression session. The maximum value is the parameter passed in to the invocation of <code>compress()</code> or <code>decompress()</code>.</p>
<p>The logic behind this is pretty simple. Even if we are going to use 16-bit codes in an LZW program, when the program first starts, the maximum possible code the program can emit is 256, which only needs nine bits to encode. And each time we output a new code, that maximum possible code value only increases by one, which means that the first 256 codes output by the encoder can all fit in nine bits.</p>
<p>So the LZW-D encoder starts encoding using nine-bit code widths, and then bumps the value to ten as soon as the highest possible output code reaches 512. This process continues, incrementing the code size until the maximum code size is reached. At that point the code size stays fixed, as no new codes are being added to the dictionary.</p>
<p>The decoder follows exactly the same process - reading in the first code with a width of nine bits, then bumping to ten when the maximum possible input code reaches 512.</p>
<p>The code for this class is built on that from <code>lzw-c.h</code>, with some added complexity. Due to its increasing length, and the fact that it doesn't add too much to the discussion of LZW, I've omitted the listing, and instead refer you to the download available at the end of the article.</p>
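<p>The width rule itself is easy to sketch in isolation. The function below is not taken from <code>lzw-d.h</code>. It is a hypothetical helper that assumes the encoder and decoder both track the highest code that could possibly appear next, and that the width starts at nine bits and is capped at the width needed for <code>max_code</code>.</p>

```cpp
#include <cassert>

// Hypothetical helper, not taken from lzw-d.h: returns the number of bits
// used for the next code, given the highest code that could appear and the
// maximum code value for this compression session.
int code_width(unsigned int highest_possible_code, unsigned int max_code)
{
    assert(max_code >= 256);
    // Width needed for max_code, computed as in the lzw-c.h constructors.
    int max_width = 1;
    unsigned int m = max_code;
    while (m >>= 1)
        max_width++;
    // Start at nine bits and grow until the highest code fits or the cap
    // is reached.
    int width = 9;
    while (width < max_width && (highest_possible_code >> width) != 0)
        width++;
    return width;
}
```

<p>With a maximum code of 65535, this returns nine bits while the highest possible code is 511 or less, ten bits once it reaches 512, and so on up to 16 bits, where it stays.</p>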
<h4>The Windows Test Program</h4>
<p>When you develop compression code, there are a few different common tasks you are likely to want to perform:</p>
<ul>
<li>Check your code for correctness, often through bulk testing.</li>
<li>Check your compression ratios against standard benchmarks.</li>
<li>Analyze your program's performance so as to make it more efficient and locate bottlenecks.</li>
</ul>
<p>My Windows app is designed to help with all of these tasks. It basically allows you to select a single directory, set a maximum code size, then perform compression and decompression of all the files in the directory. An optional checkbox lets you include files in all directories under the test directory as well.</p>
<p>The application was built using Visual Studio 10, and it is a simple MFC Dialog-based application.</p>
<p>Each file is compressed to a temporary location, then decompressed to a temporary location. The size of the compressed file is saved, and then a comparison is done to ensure that the original and expanded files are identical.</p>
<p>To help with data collection, after running a test, you can press the copy button and get the results of the test stuffed into your clipboard. Although it isn't visible in the display, the data stored in your clipboard includes the full path name of the original file, not just the basename.</p>
<p>This Visual Studio project takes advantage of a number of C++11 features, and as a result it will need some modification to work with earlier versions. Any version that supports <code>unordered_map</code> can be made to build without too many changes. And if you are going way back in time, you could replace <code>unordered_map</code> with <code>map</code>.</p>
<p>As shipped, the test program uses <code>lzw-d.h</code>. To use any of the other three versions of I/O discussed in this article, just modify the include file selected at the top of LzwTestDlg.cpp. The figure below shows what the app looks like after running through some data:<br />
<center></p>
<table border="0">
<tr>
<td><img src="/attachments/2011/lzw/Figure04.png"/></td>
</tr>
<tr>
<td><center>The Windows test app after a test run</center></td>
</tr>
</table>
<p></center><br />
After pressing the copy button at the bottom of the dialog, you can paste the data into a spreadsheet and then crunch it to your heart's content:<br />
<center></p>
<table border="0">
<tr>
<td><img src="/attachments/2011/lzw/Figure05.png"/></td>
</tr>
<tr>
<td><center>Copying the data into a spreadsheet</center></td>
</tr>
</table>
<p></center></p>
<h4>The Linux Test Program</h4>
<p>The LZW code is platform independent, and will build and run just fine on UNIX or Linux systems. The Linux test program, <code>lzw.cpp</code>, allows you to compress or decompress files from the command line. It builds just fine with g++ 4.5, as long as you use the <code>-std=c++0x</code> switch to turn on the latest language features. Compiling with earlier versions will require a few minor modifications.</p>
<p>The command line interface to the test program is not too complicated, and is probably best documented by looking at the usage output:</p>
<pre>
mrn@ubuntu:~/LzwTest$ g++ -std=c++0x lzw.cpp -o lzw
mrn@ubuntu:~/LzwTest$ ./lzw
Usage:
lzw [-max max_code] -c input output #compress file input to file output
lzw [-max max_code] -c - output     #compress stdin to file otuput
lzw [-max max_code] -c input        #compress file input to stdout
lzw [-max max_code] -c              #compress stdin to stdout
lzw [-max max_code] -d input output #decompress file input to file output
lzw [-max max_code] -d - output     #decompress stdin to file otuput
lzw [-max max_code] -d input        #decompress file input to stdout
lzw [-max max_code] -d              #decompress stdin to stdout
mrn@ubuntu:~/LzwTest$
</pre>
<p>Like the Windows test program, the command line program is built by default with <code>lzw-d.h</code>. Replacing this algorithm with any of the three others requires a minor change to the source code.</p>
<p>With the default build, the program produces output nearly identical to UNIX compress. The one difference is that UNIX compress monitors the compression ratio after the dictionary is full, and clears the dictionary if the ratio starts to deteriorate (which it almost always does). I include a benchmark program that tests UNIX compress against the command line test program, and the results show that for small files, the file sizes are almost identical:</p>
<pre>
mrn@ubuntu:~/LzwTest$ ./benchmark.sh 65535 16 canterbury | head -n 15 | column -t
Filename                 Original-size  LZW-size  Compress-size
--------                 -------------  --------  -------------
canterbury/aaa.txt       33406          320       321
canterbury/alice29.txt   152089         62247     62247
canterbury/alphabet.txt  100000         3052      3053
canterbury/asyoulik.txt  125179         54989     54990
canterbury/a.txt         1              3         5
canterbury/bib           111261         46527     46528
canterbury/bible.txt     4047392        1417735   1377093
canterbury/book1         768771         317133    317133
canterbury/book2         610856         247593    251289
canterbury/cp.html       24603          11315     11317
canterbury/E.coli        4638690        1213579   1218349
canterbury/fields.c      11150          4963      4964
canterbury/geo           102400         77777     77777
</pre>
<p>You can see in this test that LZW-D and UNIX compress perform nearly identically for all but the largest files in the test sample. If I modify UNIX compress to not monitor compression ratios, the difference seen with larger files goes away:</p>
<pre>
mrn@ubuntu:~/LzwTest$ ./benchmark.sh 65535 16 canterbury | head -n 15 | column -t
Filename                 Original-size  LZW-size  Compress-size
--------                 -------------  --------  -------------
canterbury/aaa.txt       33406          320       321
canterbury/alice29.txt   152089         62247     62247
canterbury/alphabet.txt  100000         3052      3053
canterbury/asyoulik.txt  125179         54989     54990
canterbury/a.txt         1              3         5
canterbury/bib           111261         46527     46528
canterbury/bible.txt     4047392        1417735   1417735
canterbury/book1         768771         317133    317133
canterbury/book2         610856         247593    247593
canterbury/cp.html       24603          11315     11317
canterbury/E.coli        4638690        1213579   1213579
canterbury/fields.c      11150          4963      4964
canterbury/geo           102400         77777     77777
</pre>
<p>That provides some support for the notion that the algorithm shown here behaves properly.</p>
<h4>Your Program</h4>
<p>If you want to build your own program and use these classes, all you need is a C++11 compiler, or an earlier version and a willingness to make a few changes. </p>
<p>To use the classes, include in order <code>lzw_streambase.h</code>, one of the four implementation files for <code>iostreams</code>, preferably <code>lzw-d.h</code>, and finally, <code>lzw.h</code>. Because the significant code in these files is all implemented as template functions or classes, there is no library to include in your project, and no C++ source you have to compile separately.</p>
<p>All of the code in these header files has been hoisted into the <code>lzw</code> namespace, so you will either have to explicitly use the namespace when you invoke <code>compress()</code> and <code>decompress()</code>, or insert this line into your program:</p>
<pre>
using namespace lzw;
</pre>
<p>One thing to note about the I/O routines I have defined: the template functions are specialized on <code>std::istream</code> and <code>std::ostream</code>. If you innocently pass in an object such as an <code>std::ifstream</code>, you will get compile time errors. This is because C++ template matching is done on a very strict basis - the compiler won't generally try to figure out that <code>std::ifstream</code> is derived from <code>std::istream</code>, and use the existing class. So instead, you will need to cast your arguments to the types defined in the header files. (Or write your own implementations.)</p>
<p>Your rights to use this code are covered by my <a href="http://marknelson.us/code-use-policy/" class="newpage">Liberal Code Use Policy</a>. As I have mentioned before, this is teaching code; if you decide to use it in a production system, there are many optimizations you might want to perform.</p>
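<p>Here is a small self-contained illustration of that matching rule and the cast that works around it. None of these names come from my headers: <code>symbol_reader</code>, <code>read_all()</code>, and <code>read_all_from_string()</code> are hypothetical stand-ins that only mimic the pattern of a class template that is declared for any type but defined just for <code>std::istream</code>.</p>

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical stand-in for the article's I/O classes: declared for any
// type, but only defined for std::istream.
template<typename T>
class symbol_reader;

template<>
class symbol_reader<std::istream>
{
public :
    explicit symbol_reader( std::istream &in ) : m_input( in ) {}
    bool operator>>( char &c )
    {
        return static_cast<bool>( m_input.get( c ) );
    }
private :
    std::istream &m_input;
};

// Hypothetical function template that picks its reader from the deduced
// argument type, the same way a templated compress() would.
template<typename Input>
std::string read_all( Input &input )
{
    symbol_reader<Input> reader( input );
    std::string result;
    char c;
    while ( reader >> c )
        result += c;
    return result;
}

std::string read_all_from_string( const std::string &text )
{
    std::istringstream source( text );
    // read_all( source ) would fail to compile: Input would be deduced as
    // std::istringstream, and symbol_reader<std::istringstream> is never
    // defined. The cast makes the deduced type std::istream.
    return read_all( static_cast<std::istream &>( source ) );
}
```

<p>The same kind of cast applies to an <code>std::ifstream</code> or <code>std::ofstream</code>: cast to a reference to the base class so that the deduced type matches the specialization.</p>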
<h4>Benchmarks</h4>
<p>So how does LZW do when it comes to compression? LZW's original strength was its combination of good compression ratios with high speed compression. The UNIX compress program is still nice and  fast, and Terry Welch's original application for LZW was in disk controllers. Because my program is a teaching program, it won't be nearly as fast as compress, but it's still useful to compare it to the de facto standard for lossless compression: the deflate algorithm.</p>
<p>We can compare LZW against deflate by a small modification of my benchmark script that uses gzip instead of compress. The table below shows the average compression ratios for the files in the Canterbury Corpus when compressed using maximum code widths of 15-18 bits. (The ratio is defined as 100*compressed_size/uncompressed_size, so 0% is perfect compression and 100% is no compression.)<br />
<center></p>
<table border="1">
<tr>
<th>gzip</th>
<th>LZW 15 bits</th>
<th>LZW 16 bits</th>
<th>LZW 17 bits</th>
<th>LZW 18 bits</th>
</tr>
<tr>
<td>32.7%</td>
<td>43.2%</td>
<td>42.6%</td>
<td>42.5%</td>
<td>42.3%</td>
</tr>
</table>
<p></center><br />
You can see that LZW does do a good job of compressing data, but the deflate algorithm used by gzip manages to squeeze an additional 10 percentage points, more or less, out of the files it compresses. The gap between LZW and deflate is larger on some types of files, and smaller on others, but deflate will almost always show a noticeable difference in compression ratios.</p>
<h4>Variations</h4>
<p>There are many variations on the code I've presented here that make sense. </p>
<p>One obvious change is to eliminate the special <code>EOF_CODE</code> used to delimit the end of the code stream. If the code stream is a file or other stream with an inherent EOF condition, there is no need for an <code>EOF_CODE</code> - simply reaching the end of the input stream will properly signal the end of the decoded material. Freeing up this one code will make a microscopically small improvement in the compression ratios of the product.</p>
<p>If you want to mimic the output of the compress program, you need to remove the <code>EOF_CODE</code>, and replace it with a <code>CLEAR_CODE</code> that has a value of 256. The compress program monitors the compression ratios it achieves after its dictionary is full, and when the ratio starts to decay, it issues the <code>CLEAR_CODE</code>. That code tells the decoder to clear its dictionary and make a fresh start with new nine-bit codes.</p>
<p>Once you get the hang of LZW, a good exercise to make sure you have it working properly is to create a GIF encoder and decoder. GIF uses LZW to losslessly compress images with a constrained palette, and after all these years is still somewhat of a standard on the web.</p>
<h4>History</h4>
<p>Usually the history lesson on an algorithm is at the start of the article, but this is a how-to piece, and I feel like the trip down memory lane is not as important as understanding how the algorithm works.</p>
<p>The roots of LZW were set down in 1978 when Jacob Ziv and Abraham Lempel published the second of their two seminal works on data compression, <a href="http://www.cs.duke.edu/courses/spring03/cps296.5/papers/ziv_lempel_1978_variable-rate.pdf" class="newpage">"Compression of Individual Sequences via Variable-Rate Coding"</a>. This paper described a general approach to data compression that involved building dictionaries of previously seen strings.</p>
<p>Ziv and Lempel's work was targeted at an academic audience, and it wasn't truly popularized until 1984 when Terry Welch published <a href="http://www.cs.duke.edu/courses/spring03/cps296.5/papers/welch_1984_technique_for.pdf" class="newpage">A Technique for High-Performance Data Compression</a>. Welch's paper took the somewhat abstract Information Theory work of Ziv and Lempel and reduced it to practice in such a way that others could easily implement it.</p>
<p>UNIX compress was probably the first popular program that used LZW compression, and it very quickly became a standard utility on UNIX systems. The freely available code for compress was incorporated into <a href="http://en.wikipedia.org/wiki/ARC_(file_format)" class="newpage">ARC</a>, one of the first archiving programs for PCs. In addition, the algorithm was used in the GIF file format, originally created by CompuServe in 1987.</p>
<p>LZW's popularity waned in the 1990s for two important reasons. First, Unisys began enforcing their patents that covered LZW compression, demanding and receiving royalties from various software companies. Not only did this make developers think twice about the liability they could incur while using LZW, it resulted in a general public relations backlash against using patented technology.</p>
<p>Secondly, the LZW algorithm was eclipsed on the desktop by deflate, as popularized by PKZIP. Not only did deflate outperform LZW, it was unencumbered by patents, and eventually had a very reliable and free open source implementation in <a href="http://zlib.net/" class="newpage">zlib</a>, a library written by a team led by Mark Adler and Jean-loup Gailly. I don't know if there is any way to actually quantify this, but I think one could speculate that zlib is currently installed on more computer systems than any other software package in existence.</p>
<p>So LZW has settled down to an existence out of the limelight. It is still an important algorithm, used in quite a few file formats, and as this article shows, its simplicity makes it an excellent learning tool. </p>
<h4>Downloads</h4>
<ul>
<li><a href="/attachments/2011/lzw/LzwTest.zip">LzwTest.zip</a> - source for the Windows test app.
<li><a href="/attachments/2011/lzw/LzwExe.zip">LzwExe.zip</a> - The Windows test app executable.
<li><a href="/attachments/2011/lzw/lzw.tgz">lzw.tgz</a> - source for the UNIX test app.
</ul>
]]></content:encoded>
			<wfw:commentRss>http://marknelson.us/2011/11/08/lzw-revisited/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Combinatorial Data Compression</title>
		<link>http://marknelson.us/2011/01/09/combinatorial-data-compression/</link>
		<comments>http://marknelson.us/2011/01/09/combinatorial-data-compression/#comments</comments>
		<pubDate>Sun, 09 Jan 2011 23:10:57 +0000</pubDate>
		<dc:creator>Mark Nelson</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Data Compression]]></category>
		<category><![CDATA[Mathematics]]></category>

		<guid isPermaLink="false">http://marknelson.us/?p=154</guid>
		<description><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2011/01/09/combinatorial-data-compression/' addthis:title='Combinatorial Data Compression' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div>Newcomers to the world of data compression often stumble on this old idea in hopes of creating a novel and powerful algorithm. In a nutshell, the idea is to create an enumerative coding system that uses combinatorial numbering to identify a message, in hopes of providing a more compact representation . Unfortunately, these schemes always [...]]]></description>
			<content:encoded><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2011/01/09/combinatorial-data-compression/' addthis:title='Combinatorial Data Compression' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div><p>Newcomers to the world of data compression often stumble on this old idea in hopes of creating a novel and powerful algorithm. In a nutshell, the idea is to create an enumerative coding system that uses combinatorial numbering to identify a message, in hopes of providing a more compact representation. Unfortunately, these schemes always fail, for reasons that I&#8217;ll lay out in this article.<br />
<span id="more-154"></span></p>
<h4>Combinations</h4>
<p>To design a combinatorial algorithm that will compress files, you can think of the file as a series of integers. Since most of the files that you use are streams of bytes, consider each file to be a sequence of integers with values from 0 to 255.</p>
<p>If you go back to your first classes on probability and statistics, you might remember the definition of a <a href="http://en.wikipedia.org/wiki/Combination" class="newpage">combination</a>. A combination is simply a way of selecting a number of things from a larger set. When you are trying to compress a file of bytes, the natural size of this set is 256.</p>
<p>Probability theory tells us that we can count the number of combinations of a given size using a pretty simple formula. If a set has <em>n</em> elements and we are choosing <em>k</em> at a time, the number of possible combinations is given by the formula n!/(k!*(n-k)!). This number is also known as the <em>binomial coefficient</em>.</p>
<p>Just for a simple example, the number of different ways you can select three bytes out of a set of 256 is 2,763,520. In general, with a large set, most combinations are going to generate very large numbers. The exceptions will be for values of k that are either very small or very close to the size of the set.</p>
<p>Combinations are well ordered, so any instance of the three bytes has a specific number between 0 and 2,763,519. We can call this the combinatorial rank. This means I can identify any three byte sequence by a combination number.</p>
<p>Assuming all combinations are equally likely, we can use an optimal arithmetic coder to encode this number in lg(2,763,520) bits, roughly 21.4. That&#8217;s interesting, because the three bytes actually take up 24 bits, so maybe there is some savings to be found here.</p>
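<p>These two figures are easy to reproduce. The sketch below uses two hypothetical helpers, <code>binomial()</code> and <code>bits_for_rank()</code>. The first computes the binomial coefficient with exact integer arithmetic and makes no attempt to detect overflow, so it is only suitable for small results like this one.</p>

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: number of ways to choose k items from n.
// Exact for small results; overflow is not checked.
std::uint64_t binomial(unsigned int n, unsigned int k)
{
    assert(k <= n);
    std::uint64_t result = 1;
    for (unsigned int i = 1; i <= k; ++i)
        result = result * (n - k + i) / i; // division is exact at each step
    return result;
}

// Hypothetical helper: bits needed to encode a combinatorial rank when
// all combinations are equally likely.
double bits_for_rank(unsigned int n, unsigned int k)
{
    return std::log2(static_cast<double>(binomial(n, k)));
}
```

<p>Calling <code>binomial(256, 3)</code> gives 2,763,520, and <code>bits_for_rank(256, 3)</code> comes out at roughly 21.4.</p>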
<h4>The First Problem</h4>
<p>Knowing the combinatorial rank is good, but it won&#8217;t let you reconstruct a compressed file on its own. The combinatorial rank gives you the set of bytes in the file, but doesn&#8217;t tell you the <i>order</i> of those bytes. If there are k bytes, they can be ordered in k! different permutations. So to fully describe the file, you need to encode the combination rank <em>and</em> the permutation number.</p>
<p>Encoding the permutation for your 3 byte file is going to take 2.58 bits, calculated as lg(3!). This makes the total needed to encode your three byte file 23.98 bits. Admittedly not a lot of savings, but it&#8217;s also non-zero.</p>
<p>Let&#8217;s look at the number of bits needed to encode a 20 byte message. The number of combinations of this length is roughly 2.8*10^29, which will take 97.8 bits to encode. 20! is roughly 2.43*10^18, which will take 61.1 bits to encode. The total comes out to 158.9 bits. Since we&#8217;re encoding 160 bits of information, there is clearly a greater savings.</p>
<p>As the message size increases, the savings start to grow. At a message length of 50 bytes we save 7.5 bits, and at 75 bytes we save 16 bits per message. The trend looks good. By the time you get to a message length of 100 bytes, you&#8217;re saving 32 bits per message &#8211; a compression of 4% for doing nothing but recoding!</p>
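<p>For longer messages the factorials overflow any integer type, so a sketch of this calculation has to work with logarithms. The helpers below are hypothetical names of my own, and they assume a 256-symbol alphabet. They use <code>std::lgamma()</code>, which returns the natural log of the gamma function, so that <code>lgamma(n + 1)</code> is the natural log of n!.</p>

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: lg(n!), computed through the log-gamma function.
double lg_factorial(unsigned int n)
{
    return std::lgamma(n + 1.0) / std::log(2.0);
}

// Hypothetical helper: lg of the number of k-element combinations of n.
double lg_combinations(unsigned int n, unsigned int k)
{
    assert(k <= n);
    return lg_factorial(n) - lg_factorial(k) - lg_factorial(n - k);
}

// Hypothetical helper: bits for the combination rank plus the permutation
// number of a k byte message with no duplicate bytes.
double encoded_bits(unsigned int k)
{
    return lg_combinations(256, k) + lg_factorial(k);
}
```

<p>With these, <code>encoded_bits(3)</code> is about 23.98 and <code>encoded_bits(20)</code> is about 158.9. The savings for a message of k bytes is then <code>8.0 * k - encoded_bits(k)</code>, which comes to a little over 32 bits when k is 100.</p>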
<h4>The Second Problem</h4>
<p>The second problem you encounter in the combinatorial system is that, by definition, a combination is composed of unique elements. So if you are compressing a three byte file, you can&#8217;t have any duplicate bytes. Is this a problem?</p>
<p>Your inclination is to hope not. You know that every compression scheme only works on a subset of files, so perhaps the combinatorial scheme can be developed to work on segments of files with no duplicates. </p>
<p>How likely are you to find a duplicate in a file of three bytes? You can start by enumerating the total number of files of that length: 256^3. And you know how many files there are with no duplicates: the combinatorial number times the number of permutations. So it&#8217;s a simple matter to calculate the probability that a message of length k has no duplicated bytes. The value will be n!/((n-k)!*256^k), with n = 256.</p>
<p>For k = 3, we see that the probability of no duplicate bytes is .988 &#8211; this means you can compress almost every file by a fraction of a bit.</p>
<p>You&#8217;d like to think that you can look at pretty long stretches of data and expect a low probability of duplicates, but unfortunately you run into the <a href="http://en.wikipedia.org/wiki/Birthday_problem" class="newpage">Birthday Paradox</a>. In the birthday paradox, you&#8217;re asked a question something like this: in a room of 23 people, what are the chances that two people share a common birthday? For most people, the answer, 50% or so, is non-intuitive. </p>
<p>Likewise, it means that a file with 100 bytes and no duplicates is such a rarity that it might as well never appear &#8211; the chances are less than one in a billion.</p>
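<p>A short sketch shows how quickly that probability collapses. The function <code>no_duplicate_probability()</code> is a hypothetical helper that multiplies out the formula one factor at a time, which avoids computing any large factorials. The second parameter defaults to 256 byte values, but it can be set to 365 to check the birthday figure as well.</p>

```cpp
#include <cassert>

// Hypothetical helper: probability that k values drawn uniformly from n
// possibilities are all different, computed as n!/((n-k)! * n^k).
double no_duplicate_probability(unsigned int k, unsigned int n = 256)
{
    assert(n > 0);
    if (k > n)
        return 0.0; // more draws than values forces a duplicate
    double p = 1.0;
    for (unsigned int i = 0; i < k; ++i)
        p *= static_cast<double>(n - i) / n;
    return p;
}
```

<p>This gives roughly .988 for three bytes, .837 for ten bytes, and a value below one in a billion for 100 bytes. Calling it with 23 and 365 returns about .493, which is where the 50% or so chance of a shared birthday comes from.</p>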
<h4>Facing the Music</h4>
<p>You can see the problem here. We can compress very short sequences using a combinatorial system, but the savings are very small. Even so, we can compress most files. We can compress longer files for greater savings, but very few sequences will prove to be eligible.</p>
<p>It&#8217;s actually worse than that. Let&#8217;s work out the numbers on a hypothetical compressor. This compressor will use a combinatorial scheme to compress all files of 10 bytes. The compressor will look at the file, and if it has no duplicates, it will set a flag symbol in the output stream to be true, followed by the combinatorial number, followed by the permutation.</p>
<p>If the 10 byte file has duplicates, the compressor will generate a flag symbol of false, followed by the uncompressed data.</p>
<p>If this scheme gives us some savings, we can scale it up to operate on files of any size &#8211; we&#8217;ll just compress them in 10-byte chunks.</p>
<p>So let&#8217;s analyze the result. First, the number of files that will make it past the first test is pretty impressive: 83.695%. Each of these files will be compressed down to 79.743 bits. The remaining 16.305 percent will take exactly 80 bits in the output stream. So the overall size of our output file thus far is going to add up to 79.78519 bits. Our algorithm is still in the black!</p>
<p>Unfortunately, we also need to account for the cost of the flag message. Using optimal coding, when the flag is true we are going to require 0.25679 bits, which brings a duplicate-free file right back up to 80 bits. When it is false, optimal coding of the much rarer message will require about 2.6 bits. Add in the cost of the flag, and the average output size goes up to a smidgen over 80 bits (roughly 80.43). </p>
<p>In other words, you lose.</p>
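<p>The arithmetic above can be reproduced with a few lines of Python. This is only a sketch under the same assumptions as the text: 10-byte chunks, uniformly random bytes, and an ideal entropy coder for the flag. The variable names are illustrative and not taken from any real compressor.</p>

```python
from math import log2, prod

CHUNK = 10       # bytes per chunk, as in the text
ALPHABET = 256   # possible byte values

# Probability that a random chunk has no duplicate bytes.
p_unique = prod((ALPHABET - i) / ALPHABET for i in range(CHUNK))
p_dup = 1.0 - p_unique

# Payload sizes in bits: log2(256!/246!) orderings of 10 distinct bytes,
# versus the raw 80 bits for a chunk that contains duplicates.
payload_unique = log2(prod(range(ALPHABET - CHUNK + 1, ALPHABET + 1)))
payload_dup = 8 * CHUNK

# Ideal (entropy-coded) cost of the flag in each case.
flag_unique = -log2(p_unique)
flag_dup = -log2(p_dup)

expected_payload = p_unique * payload_unique + p_dup * payload_dup
expected_total = (p_unique * (payload_unique + flag_unique)
                  + p_dup * (payload_dup + flag_dup))

print(f"chunks with no duplicates: {p_unique:.5f}")
print(f"payload, no duplicates:    {payload_unique:.3f} bits")
print(f"average payload, no flag:  {expected_payload:.3f} bits")
print(f"flag cost true / false:    {flag_unique:.5f} / {flag_dup:.4f} bits")
print(f"average total with flag:   {expected_total:.3f} bits")
```

<p>Under these assumptions the flag cancels the saving exactly whenever the chunk has no duplicates (79.743 + 0.257 = 80 bits), so the 2.6-bit flag on the remaining chunks is pure loss.</p>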
<h4>Conclusion</h4>
<p>The problem is a familiar one in data compression. Every time you come up with a way to encode a subset of files that saves some space, you find that all your savings are lost when you try to encode the files that aren&#8217;t part of the subset. Even using a single bit to flag special files as being incompressible is enough to wipe out your savings. It is the definition of <a href="http://www.amazon.com/Hasbro-40509-Whac-A-Mole-Game/dp/B0001GDP00?tag=theinternetdatac" class="newpage">Whac-A-Mole</a>.</p>
<p>With combinatorial coding you will find that the same rule holds true as for all forms of data compression: It isn&#8217;t going to be a universal compressor that can reduce every file in size. The only reason it will be useful is if you have an input set of files that all have a common characteristic: a preponderance of streams where duplicates are rare. And odds are, this set of files will probably be compressible using some more reasonable algorithm.</p>
]]></content:encoded>
			<wfw:commentRss>http://marknelson.us/2011/01/09/combinatorial-data-compression/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>The Pigeonhole Principle</title>
		<link>http://marknelson.us/2010/08/01/the-pigeonhole-principle/</link>
		<comments>http://marknelson.us/2010/08/01/the-pigeonhole-principle/#comments</comments>
		<pubDate>Sun, 01 Aug 2010 19:01:01 +0000</pubDate>
		<dc:creator>Mark Nelson</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Data Compression]]></category>
		<category><![CDATA[Snarkiness]]></category>

		<guid isPermaLink="false">http://marknelson.us/2010/08/01/the-pigeonhole-principle/</guid>
		<description><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2010/08/01/the-pigeonhole-principle/' addthis:title='The Pigeonhole Principle' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div>The Pigeonhole Principle, also referred to as the Counting Theorem, is a handy tool for mathematicians, and naturally, computer programmers. The loose version of this principle says &#8220;After placing n pigeons into m compartments, if n is greater than m, you will find that some compartment must contain more than one pigeon.&#8221; Seems obvious, and [...]]]></description>
			<content:encoded><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2010/08/01/the-pigeonhole-principle/' addthis:title='The Pigeonhole Principle' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div><p>The <a href="http://en.wikipedia.org/wiki/Pigeonhole_principle">Pigeonhole Principle</a>, also referred to as the Counting Theorem, is a handy tool for mathematicians, and naturally, computer programmers.</p>
<p>The loose version of this principle says &#8220;After placing n pigeons into m compartments, if n is greater than m, you will find that some compartment must contain more than one pigeon.&#8221;</p>
<p>Seems obvious, and perhaps it is, but at least in the world of data compression it must be trotted out from time to time in order to bludgeon dreams back to reality.<br />
<span id="more-127"></span></p>
<h2>
<div>Impossible Compression</div>
</h2>
<p>A common dream for the novice is the creation of a compressor that will reduce the size of <i>all</i> files. (Often touted as the ability to compress &#8220;random&#8221; data.) For example, Dr. Constant Wong of <a href="http://recursiveware.com/">Recursiveware</a> has been polishing his technique for compressing random data since 2003. And the USENET newsgroup <a href="http://groups.google.com/group/comp.compression/topics">comp.compression</a> always has at least one thread dedicated to thrashing a new and eager theorist with a (flawed) idea.</p>
<p>The Pigeonhole Principle quickly puts this idea to rest. We know that if a file is of length n bits, there are 2<sup>n</sup> possible input files. If a compressor can reduce the size of <i>every</i> file, the number of possible output files is at most 2<sup>n</sup>-1, the count of all files shorter than n bits. The Pigeonhole Principle tells us that the outputs of at least two file compressions have to be identical. And since they are identical, the decompressor cannot create two different output files. </p>
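<p>The counting argument is easy to check by brute force for small n. The Python sketch below enumerates every 4-bit input and every shorter bit string, then shows that a hypothetical compressor which always shrinks its input must send two inputs to the same output. The helper names and the toy lookup-table compressor are assumptions made purely for illustration.</p>

```python
from itertools import product

def bitstrings(length):
    """All bit strings of exactly `length` bits, as Python strings."""
    return ["".join(bits) for bits in product("01", repeat=length)]

def find_collision(compress, n):
    """Return two distinct n-bit inputs that compress to the same output."""
    seen = {}
    for original in bitstrings(n):
        packed = compress(original)
        if packed in seen:
            return seen[packed], original
        seen[packed] = original
    return None

n = 4
inputs = bitstrings(n)
# Every string shorter than n bits, including the empty string.
shorter = [s for k in range(n) for s in bitstrings(k)]
print(len(inputs), "inputs,", len(shorter), "shorter outputs")

# Hypothetical "always shrinks" compressor: hand out the shorter strings
# in order and wrap around when they run out.
table = {s: shorter[i % len(shorter)] for i, s in enumerate(inputs)}
print("collision:", find_collision(table.get, n))
```

<p>Any other assignment of shorter outputs runs into the same wall: there are 16 pigeons and only 15 holes.</p>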
<h2>
<div>And More</div>
</h2>
<p>Wikipedia has another nice example of the principle in use. Imagine that you have a party with n people attending. At random, people shake hands with one another as they mill about. At the end of the night, we check the number of unique individuals each person has shaken hands with. What are the odds that two people will have shaken hands with the same number of people?</p>
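<p>The answer is of course that there will always be two people who have shaken the same number of hands. Each person&#8217;s count lies between 0 and n-1, but 0 and n-1 cannot both occur at the same party: anyone who shook hands with all n-1 others leaves nobody with a count of zero. That leaves n-1 pigeonholes and n pigeons, <i>QED</i>.</p>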
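<p>A quick simulation makes the handshake claim concrete. In this Python sketch, the probability <code>p</code> that any given pair shakes hands and the function names are arbitrary illustrative choices; the principle says a repeated count must appear no matter how the handshakes fall.</p>

```python
import random

def handshake_counts(n, p=0.5, seed=None):
    """Simulate a party of n people; each pair shakes hands with probability p.

    Returns a list where entry i is the number of distinct people
    that person i shook hands with.
    """
    rng = random.Random(seed)
    counts = [0] * n
    for a in range(n):
        for b in range(a + 1, n):
            if rng.random() < p:
                counts[a] += 1
                counts[b] += 1
    return counts

def has_repeat(counts):
    """True if at least two people share the same handshake count."""
    return len(set(counts)) < len(counts)

# Try many random parties of different sizes; a repeat shows up every time.
for n in range(2, 12):
    for seed in range(100):
        assert has_repeat(handshake_counts(n, seed=seed))
print("every simulated party had two people with the same count")
```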
<h2>
<div>Don&#8217;t Go There</div>
</h2>
<p>If you ever find yourself spiraling down the rabbit hole of impossible data compression, I urge you to grab the life jacket of the Pigeonhole Principle before you are lost. It will save you a lot of pointless effort, plus clue you in to the fact that there are two IBM employees with the same number of hairs on their heads.</p>
]]></content:encoded>
			<wfw:commentRss>http://marknelson.us/2010/08/01/the-pigeonhole-principle/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Cashing in On Electronic Books</title>
		<link>http://marknelson.us/2008/02/11/cashing-in-on-e-books/</link>
		<comments>http://marknelson.us/2008/02/11/cashing-in-on-e-books/#comments</comments>
		<pubDate>Mon, 11 Feb 2008 20:19:05 +0000</pubDate>
		<dc:creator>Mark Nelson</dc:creator>
				<category><![CDATA[Business]]></category>
		<category><![CDATA[Data Compression]]></category>
		<category><![CDATA[Graphics]]></category>
		<category><![CDATA[People]]></category>
		<category><![CDATA[Writing]]></category>

		<guid isPermaLink="false">http://marknelson.us/2008/02/11/cashing-in-on-e-books/</guid>
		<description><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2008/02/11/cashing-in-on-e-books/' addthis:title='Cashing in On Electronic Books' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div>Jeff Bezos Hawks the Kindle It&#8217;s still not clear whether electronic books are the wave of the future or a consumer products cul-de-sac. Technology continues to improve, and there are certainly lots of good reasons for a device like Amazon&#8217;s Kindle to be the leading edge of a major wave of adoption. A few of [...]]]></description>
			<content:encoded><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2008/02/11/cashing-in-on-e-books/' addthis:title='Cashing in On Electronic Books' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div><table align="right" cellspacing="5" border="0">
<tr>
<td><center><img src="http://marknelson.us/attachments/2008/cashing-in-on-e-books/bezos.jpg"><br />
Jeff Bezos Hawks the Kindle</center>
</td></tr>
</table>
<p><font size="+1"><strong>It&#8217;s </strong></font>still not clear whether electronic books are the wave of the future or a consumer products cul-de-sac. Technology continues to improve, and there are certainly lots of good reasons for a device like Amazon&#8217;s Kindle to be the leading edge of a major wave of adoption. A few of the more obvious arguments include:</p>
<ul>
<li/>Reduced cost of distribution. The publishing industry wastes a lot of money printing and shipping books, and because of historical practices, creates huge numbers of books that never even get sold. Not very green, and a waste of money.
<li/>Niche markets that can be very well-served. For example, high school and college students can replace those 30-pound backpacks with a 30-ounce tablet-sized device.
<li/>Removal of barriers to publication. The news and magazine businesses are being revolutionized by self-publication in the form of blogs. Self-publication is possible in the printed book world, but it is still a rather awkward process. Publication to electronic format is presumably a trivial problem.
<li/>Integration of information resources. A device like the Kindle allows you to consult the Internet, written reference materials, and your personal notes all from the same device, making it a true information portal.
</ul>
<p>But consumer acceptance is a fickle thing, so we don&#8217;t know if these rational arguments are going to fly. And of course, every writer who reviews a device like the Kindle or the equally capable Sony Reader feels compelled to write something along the lines of &#8220;But I just can&#8217;t imagine forgoing the pleasure of curling up on my couch with a good book.&#8221; I&#8217;m sure that when Gutenberg introduced movable type there were millions of industry reviewers posting notes to their blogs saying &#8220;The uniformity of the type is just esthetically unpleasing &#8211; I love it when I recognize the script of one of my favorite transcribing brothers.&#8221;</p>
<p>And then of course there&#8217;s Steve Jobs&#8217;s <a href="http://bits.blogs.nytimes.com/2008/01/15/the-passion-of-steve-jobs/index.html" class="newpage">money quote</a> on the subject:</p>
<blockquote><p>
“It doesn’t matter how good or bad the product is, the fact is that people don’t read anymore, forty percent of the people in the U.S. read one book or less last year. The whole conception is flawed at the top because people don’t read anymore.”
</p></blockquote>
<p>So it&#8217;s really pretty hard to be sure just which way this is going to go.<br />
<span id="more-119"></span></p>
<h4>Looking Beyond The Reader</h4>
<table align="right" cellspacing="5" border="0">
<tr>
<td><center><img src="http://marknelson.us/attachments/2008/cashing-in-on-e-books/sony.jpg"><br />
Sony&#8217;s Reader</center>
</td></tr>
</table>
<p><font size="+1"><strong>For </strong></font>electronic books to succeed, one thing is certain: the physical reading experience needs to match up well with the one we have right now for our printed media. In some ways this problem is already solved &#8211; readers like the products from Amazon and Sony are book-sized and lightweight, with displays that are doing their best to match the various good qualities of paper.</p>
<p>But there are still issues that need work. One of the most important is in the area of layout and markup. A presentation format such as HTML is designed to work with multiple display sizes, repositioning elements as needed. This doesn&#8217;t necessarily work so well with textbooks, magazines, etc., where graphic artists invest huge amounts of time and energy on positioning, font selection, and other esthetic issues.</p>
<p>Additionally, there is a staggering amount of material that is simply not in a format compatible with today&#8217;s electronic books. Even magazines being published today are not always ready for transfer to an eBook format, and there is of course a massive backlog of valuable material in the world that has never existed in digital format.</p>
<p>Into this void steps Robert Maxwell Case and his company, <a href="http://seeandbelieve.com/" class="newpage">SeeAndBelieve.com</a>. SeeAndBelieve.com has created a digital layout technology called ReadAllOver (the company seemingly has a love affair with awkward CamelCase constructions) that does a superior job of preserving the look of printed materials. The company web site gives a good demonstration of exactly what they are capable of doing &#8211; it is definitely worth your time to take a look.</p>
<p>I asked Robert if he could take the time to answer a few questions about ReadAllOver, and he was gracious enough to respond.</p>
<h4>Questions With Robert Maxwell Case</h4>
<p><strong>Mark Nelson:</strong> Hi Robert. I just recently became aware of your company, SeeAndBelieve.Com, and your imaging system, ReadAllOver. Before we get into the details of your technology, can you tell me a little bit about the history of your company? How long have you been at it? Are you working mostly solo or do you have some help? What kind of background do you have that got you into your current work?</p>
<p><strong>Robert Maxwell Case:</strong> Sure, Mark. I come from a background of being a full-time musician and a part-time graphic designer. Around 1991-92, I was unhappy with then-current digital halftoning routines and began experimenting on my own. So I&#8217;ve been at it 15-plus years. </p>
<table align="right" cellspacing="5" border="0">
<tr>
<td><center><img src="http://marknelson.us/attachments/2008/cashing-in-on-e-books/AtWorkSABC.JPG"><br />
Jimmy Kung (left) and Robert</center>
</td></tr>
</table>
<p>I&#8217;m not a programmer, so I have had a succession of programming assistants.  In recent years, they have come from my affiliation with the Computer Science department at Texas State University-San Marcos where I am a seven-year member of the Industrial Advisory Board. Currently SeeAndBelieve.Com, in addition to myself, has one full-time employee, Jimmy Kung, and several part-timers. </p>
<p>The first of my five U.S. patents (three issued, two pending) was filed in 1993 in response to some interest expressed by Steve Carlsen, developer of the .TIF  graphics file format at Aldus (he&#8217;s now with Adobe Systems.)</p>
<p><strong>MN:</strong> Can you give me a capsule summary of ReadAllOver? How does it differ from page layout systems like we see in web browsers or PDF viewers? Does it differ from the rendering systems used in Sony and Amazon&#8217;s current eBook readers?</p>
<p><strong>RC:</strong> Well, ReadAllOver in a nutshell is a digital halftone-based graphics system suitable for eBooks. It renders on the screen a digital page with the &#8220;look and feel&#8221; of a <a href="http://www.gi.alaska.edu/ScienceForum/ASF8/823.html" target="_blank">printed page</a>, with all included graphic elements, typography and images, placed precisely as the graphic designer intended. It differs from existing web browsers and .PDF viewers in that it relies less on text files and font metrics and instead places more emphasis on a simplified, highly-compressible bitmap image. In many respects, it is a &#8220;picture&#8221; of the page, with an ancillary text file. </p>
<p>The Sony and Amazon eBook readers are primarily text-based, offering a limited number of typefaces and few graphics. They both use the E-Ink subtractive screen and we think ReadAllOver&#8217;s halftone system can be tailored to enable a good fit with that screen.</p>
<p><strong>MN:</strong> It looks like your technology emulates the halftone process used to render photographic images in newspapers and magazines. How does it improve on that process to achieve smaller file sizes? Do you have data showing the level of compression you get for specific images? And do you also render type as halftone images? That would seem a lot less efficient than treating type as marked-up text.</p>
<p><strong>RC:</strong> That&#8217;s right, ReadAllOver does emulate the halftone process with one major difference, and that is that typography can be processed with it and not fall apart. </p>
<table align="left" cellspacing="5" border="0">
<tr>
<td><center><img src="http://marknelson.us/attachments/2008/cashing-in-on-e-books/sample.png"><br />
Samples of ReadAllOver output<br/>(Detail may not be representative, images were resized)</center>
</td></tr>
</table>
<p>The big idea is that the output image starts as an interim monochromatic checkerboard pattern, beginning from gray. To simplify, we derive the output image by rendering local areas of the input image that are darker than checkerboard gray by turning corresponding output monochrome white pixels to black. Conversely, we render local areas of the input image that are lighter than checkerboard gray by turning corresponding output monochrome black pixels to white. The result is a checkerboard-based ordered dither that can be re-ordered for variable-length run-length compression. </p>
<p>We presently are comparing compression levels for specific images and plan to publish the results. So with typography, abrupt gray level changes such as font outlines fall on the black pixels of the interim checkerboard to render any font without reliance on font metrics or hinting. For example, ReadAllOver pages containing <em>only</em> typography are competitive in file size to marked-up systems, and with the additional attribute that our system is able to efficiently render any font, in any language, and any image, placed correctly. That&#8217;s exciting.</p>
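<p>To make the checkerboard idea a little more concrete, here is a toy Python sketch of one way to read Robert&#8217;s description. It is not the ReadAllOver algorithm, whose details are not given here: the two thresholds <code>dark</code> and <code>light</code>, the per-pixel (rather than local-area) decision, and the function name are all assumptions made for illustration.</p>

```python
def checkerboard_dither(gray, dark=96, light=160):
    """Toy checkerboard-based dither (an illustrative guess, not ReadAllOver).

    gray  -- list of rows of grayscale values, 0 (black) to 255 (white)
    dark  -- assumed threshold at or below which a pixel is forced to black
    light -- assumed threshold at or above which a pixel is forced to white
    Returns rows of 0 (black) and 1 (white).
    """
    out = []
    for y, row in enumerate(gray):
        out_row = []
        for x, value in enumerate(row):
            # Interim pattern: alternating black and white squares,
            # which reads as mid-gray from a distance.
            pixel = (x + y) % 2
            if value <= dark:
                pixel = 0      # darker than checkerboard gray: white turns black
            elif value >= light:
                pixel = 1      # lighter than checkerboard gray: black turns white
            out_row.append(pixel)
        out.append(out_row)
    return out

sample = [
    [0, 0, 0, 0],          # solid black
    [255, 255, 255, 255],  # solid white
    [128, 128, 128, 128],  # mid-gray: checkerboard is left alone
]
for row in checkerboard_dither(sample):
    print("".join("#" if p == 0 else "." for p in row))
```

<p>In this reading, flat dark and light regions collapse into long runs of a single value while mid-gray regions keep a perfectly regular pattern, which hints at why such output might suit run-length coding.</p>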
<p><strong>MN:</strong> I get digital delivery of some magazines already. For example, IEEE Spectrum is delivered using technology from Qmags. Do you think you can do a better job than they are already doing?</p>
<p><strong>RC:</strong> Qmags has selected Adobe PDF as its preferred file format, so the comparisons I&#8217;ve made previously between ReadAllOver and PDF come into play here. Taking a quick look at Qmags&#8217; file sizes, I would say that ReadAllOver is competitive and looks subjectively more &#8220;print-like.&#8221;</p>
<p><strong>MN:</strong> Have you released any sample code or SDKs for people to work with your compression technology?</p>
<p><strong>RC:</strong> Not at the present time &#8230; we have been too busy developing our prototype (you can see it at SeeAndBelieve.Com ).</p>
<p><strong>MN:</strong> The idea of an electronic book reader has been floating around for a long time, but right now it seems like we&#8217;re finally seeing designs that are actually gaining some traction. There are still a lot of naysayers, however. What do you see in the future for the electronic book reader? Will it eventually do for reading what the iPod did for music? Or will it forever be a niche product that is stuck on the verge of popular success? And how hard will it be for you to get the manufacturers of eBook readers to adopt ReadAllOver?</p>
<p><strong>RC:</strong> Mark, I do believe that an electronic book reader will achieve iPod-like popular success, and, hopefully, in the near-term. In my opinion, the more book-like the readers become, the closer we will get to that so-called tipping point. Then the public will recognize the added benefit of having access to any publication, including one&#8217;s own library, available whenever and wherever they desire. There are some obstacles still to be overcome, like screen pixels that are pretty large, and wired and wireless transmission pipes that are pretty small. </p>
<p>I personally would like to see a reading device with 8-1/2&#8243; x 11&#8243; facing screens, about the size and weight of a coffee-table book, with graphics-intensive magazines and newspapers rendered on screen nearly indistinguishable from paper and ink. I certainly can imagine a college student not having to carry a backpack full of textbooks around campus.</p>
<p>Here&#8217;s my vision for our product: We&#8217;re hoping to make adapting to ReadAllOver as seamless and easy as possible. We envision three communities of ReadAllOver users: </p>
<ol>
<li/>The reader
<li/>The publisher/bookseller
<li/>The hardware manufacturer.
</ol>
<p>We plan to provide each of these groups with a cost effective, easy-to-use solution. The ReadAllOver Viewer will be free for readers. They&#8217;ll read .RAO files on dedicated eBook devices, and most likely on other devices that could emulate a book-like reading experience. The inherent look and feel of printed material should reduce eye strain and, of course, we plan to offer such extras as page flipping, content search, printing, etc.</p>
<p>Publishers and booksellers will use ReadAllOver Publisher, our media content production system. They&#8217;ll be able to convert content from scanned material, as well as from existing editing and page layout systems and standards, including Adobe InDesign and FrameMaker, Quark XPress, MS/Office, Open Office, etc. (and, of course, we&#8217;ll include a .PDF to .RAO converter.) This community will appreciate a built-in high level of content protection with ReadAllOver&#8217;s emphasis on bitmaps. Hopefully they won&#8217;t need more, but if they do, our product should easily be able to incorporate additional encryption and DRM mechanisms. </p>
<p>We also plan to collaborate with eBook (and other display device) manufacturers in order to provide built-in support for the ReadAllOver rendering system. We feel we can efficiently adapt our system to display components with limited available grayscale and color levels. We also hope to offer a fixed-bit-rate option where every delivered page is the same file size. </p>
<h4>Conclusion</h4>
<p><font size="+1"><strong>Thanks </strong></font>Robert, this is all interesting stuff. I can see advantages to ReadAllOver that we don&#8217;t get from layout systems like PDF or HTML, so perhaps you will be able to hammer out an effective market position. I don&#8217;t have an eBook reader yet, but I think when the technology reaches the point where I can have a color Kindle I&#8217;ll probably jump on board. I&#8217;m almost there now, but since the Kindle is perpetually sold out at Amazon.com, my dollars are still safely in my wallet.</p>
]]></content:encoded>
			<wfw:commentRss>http://marknelson.us/2008/02/11/cashing-in-on-e-books/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Abraham Lempel Honored by IEEE</title>
		<link>http://marknelson.us/2007/07/13/lempel-award/</link>
		<comments>http://marknelson.us/2007/07/13/lempel-award/#comments</comments>
		<pubDate>Fri, 13 Jul 2007 13:09:51 +0000</pubDate>
		<dc:creator>Mark Nelson</dc:creator>
				<category><![CDATA[Data Compression]]></category>
		<category><![CDATA[People]]></category>

		<guid isPermaLink="false">/2007/07/13/lempel-award/</guid>
		<description><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2007/07/13/lempel-award/' addthis:title='Abraham Lempel Honored by IEEE' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div>The IEEE has announced its list of medal winners for 2007, and this year&#8217;s Richard Hamming medal was awarded to Dr. Abraham Lempel: For pioneering work in data compression especially the Lempel-Ziv algorithm. This is a timely award, because it comes on the 30th anniversary of the publication of the first of two seminal papers [...]]]></description>
			<content:encoded><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2007/07/13/lempel-award/' addthis:title='Abraham Lempel Honored by IEEE' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div><p>The IEEE has <a href="http://www.theinstitute.ieee.org/portal/site/tionline/menuitem.130a3558587d56e8fb2275875bac26c8/index.jsp?&#038;pName=institute_level1_article&#038;TheCat=2202&#038;article=tionline/legacy/inst2007/may07/newsmajorawards.xml">announced</a> its list of medal winners for 2007, and this year&#8217;s Richard Hamming medal was awarded to <a href="http://www.ieee.org/portal/pages/about/awards/bios/2007_Bios/2007Hamming-Lempel.html">Dr. Abraham Lempel</a>: </p>
<blockquote><p>For pioneering work in data compression especially the Lempel-Ziv algorithm.</p></blockquote>
<p><img src="http://marknelson.us/attachments/2007/lempel-award/lempel.jpg" class="alignleft"/><br />
This is a timely award, because it comes on the 30th anniversary of the publication of the first of two seminal papers by Dr. Lempel and Jacob Ziv, his associate at Technion &#8211; Israel Institute of Technology.<br />
<span id="more-83"></span><br />
In 1977 Ziv and Lempel published <i><a href="http://www.cs.duke.edu/courses/spring03/cps296.5/papers/ziv_lempel_1977_universal_algorithm.pdf">A Universal Algorithm for Sequential Data Compression</a></i>, describing what would come to be known as the LZ77 algorithm. In 1978, they followed this with <i><a href="http://citeseer.ist.psu.edu/rd/44576777%2C580359%2C1%2C0.25%2CDownload/http://citeseer.ist.psu.edu/cache/papers/cs/27647/http:zSzzSzcompression.graphicon.ruzSzdownloadzSzarticleszSzlzzSzziv_lempel_1978_variable-rate.pdf/ziv78compression.pdf">Compression of Individual Sequences via Variable-Rate Coding</a></i>, which described what came to be known as the LZ78 algorithm.</p>
<p>Both of these algorithms use macro substitution to compress arbitrary data. By replacing a long sequence of bytes with a short macro, compression is achieved. Both algorithms build this library of macros dynamically, adjusting to the input text as it is read. They differ in the way they build the library of macros.</p>
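<p>For readers who want to see the dictionary-building idea in action, here is a minimal LZ78-style sketch in Python. It is a teaching toy, not the construction from the paper and not a production codec: it works on text strings, emits (phrase index, next character) pairs instead of packed bits, and lets its dictionary grow without limit. LZ77 takes a different route, replacing repeats with pointers back into a sliding window of recent input.</p>

```python
def lz78_compress(data):
    """Encode a string as (phrase_index, next_char) pairs, LZ78 style."""
    dictionary = {"": 0}          # phrase -> index; index 0 is the empty phrase
    output = []
    phrase = ""
    for ch in data:
        candidate = phrase + ch
        if candidate in dictionary:
            phrase = candidate    # keep extending the current match
        else:
            output.append((dictionary[phrase], ch))
            dictionary[candidate] = len(dictionary)   # new "macro"
            phrase = ""
    if phrase:
        # Input ended in the middle of a known phrase: emit it as
        # its (already known) prefix plus its final character.
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output

def lz78_decompress(pairs):
    """Rebuild the original string from (phrase_index, next_char) pairs."""
    phrases = [""]
    pieces = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)    # mirror the encoder's dictionary growth
        pieces.append(phrase)
    return "".join(pieces)

pairs = lz78_compress("abababab")
print(pairs)
print(lz78_decompress(pairs))
```

<p>Note that the decoder never receives the dictionary. It rebuilds the same library of macros from the pairs as they arrive, which is the adaptive behavior described above.</p>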
<p>LZ78 was eventually reduced to practice by Terry Welch with the publication of the LZW algorithm, used in UNIX compress, GIF files, and elsewhere. LZ77 provided the core of the deflate algorithm, which is used in the Zip standard, arguably the dominant lossless compression algorithm since it was introduced by PKZIP in 1993.</p>
<p>It would be hard to overstate the impact of these two papers in the world of data compression. While there are other lossless algorithms that can compress as well as LZ77 and LZ78 (<i>e.g.</i> <a href="http://en.wikipedia.org/wiki/Prediction_by_Partial_Matching">PPM</a>), the two LZ algorithms have won the battle of speed and efficiency for almost 20 years. Applications for the algorithms vary from desktop applications to graphics file formats to tape drive controllers. As an example of their influence, Citeseer lists 458 citations for the <a href="http://citeseer.ist.psu.edu/ziv77universal.html">1977 paper</a>, and 293 for the <a href="http://citeseer.ist.psu.edu/ziv78compression.html">1978 paper</a>.</p>
<p>Dr. Lempel is currently employed by HP, performing and directing research at HP Labs in Israel. He was kind enough to answer a few questions for this article.</p>
<p><b>Mark Nelson:</b> In 1977 and 1978, you published two now-famous compression papers with Jacob Ziv that provided the background for some of today&#8217;s most effective and popular compression algorithms, such as the deflate standard used in Zip-compatible programs. At the time, did you imagine that your work would be so broadly used 30 years later?</p>
<p><b>Abraham Lempel:</b> We were so excited with the results of our work, that we did not give too much thought to this question at the time. As more and more extensions and versions were published by other researchers, we recognized the seminal nature of this work and viewed our approach to lossless data compression as a long term method.</p>
<p><b>MN:</b> The papers you published provide a mathematical and theoretical description of universal compressors, but don&#8217;t dig into implementation details. Do you have much personal interest in converting theory into practice, or do you get the most satisfaction from the research work that provides the foundation?</p>
<p><b>AL:</b> Following the publication of 1977 and 1978 papers, we both participated in writing an invention disclosure which included all the details of a preferred embodiment implementation. This was done while I was on sabbatical at Sperry Research, and the 2 granted patents ended up as Sperry, now Unisys, Intellectual Property.</p>
<p><b>MN:</b> Since your seminal work, I think the most significant advance in lossless  compression has been the creation of block-sorting algorithms, as described by Burrows and Wheeler. Do you think there are any revolutionary new techniques on the horizon, or should we just expect minor incremental improvements?</p>
<p><b>AL:</b> Hard to tell. When we worked on our method, the Huffman algorithm was considered the last word in lossless data compression. Now, the various LZ versions are treated as such. All this is in the context of text compression. I do expect radical progress in lossless <i>image</i> compression, which is a different beast from text.</p>
<p><b>MN:</b> After a distinguished career at Technion &#8211; Israel Institute of Technology and almost 25 years with HP Labs, you&#8217;re at a point where many people would be thinking of retiring and slowing down. Is that in your near-term plans?</p>
<p><b>AL:</b> I do plan to retire in 2008.</p>
<p>It is great to see the IEEE honor Dr. Lempel for his work. It would have been hard to understand how important this was in 1977. Today, with the benefit of 30 years hindsight, it clearly stands out as a masterpiece.</p>
<hr />
<h4>Reference Links</h4>
<ul>
<li><a href="http://citeseer.ist.psu.edu/ziv77universal.html">Citeseer page</a> for <i>A Universal Algorithm for Sequential Data Compression (1977)</i></li>
<li><a href="http://citeseer.ist.psu.edu/ziv78compression.html">Citeseer page</a> for <i>Compression of Individual Sequences via Variable-Rate Coding (1978)</i></li>
<li><a href="http://www.ieee.org/portal/pages/about/awards/bios/2007_Bios/2007Hamming-Lempel.html">IEEE Bio Page</a> that accompanied the Hamming Medal Announcement</li>
<li><a href="http://www.ieee.org/portal/pages/about/awards/pr/hampr.html">List of Hamming Medal Recipients</a> on the IEEE site.</li>
<li><a href="http://en.wikipedia.org/wiki/Abraham_Lempel">Wikipedia entry</a> for Abraham Lempel</li>
<li><a href="http://en.wikipedia.org/wiki/Jacob_Ziv">Wikipedia entry</a> for Jacob Ziv</li>
<li><a href="http://www.hpl.hp.com/about/bios/abraham_lempel.html">HP Bio page</a> for Abraham Lempel</li>
<li><a href="http://en.wikipedia.org/wiki/Lzw">Wikipedia entry</a> for LZW compression.</li>
<li><a href="http://marknelson.us/1989/10/01/lzw-data-compression/">My 1989 DDJ article</a> on LZW data compression, including C source</li>
<li><a href="http://www.zlib.net/">The zlib home page</a>. zlib is the free and widely used library that implements the deflate algorithm used in Zip programs. deflate combines an LZ77-style algorithm with a Huffman coder.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://marknelson.us/2007/07/13/lempel-award/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Forgent&#8217;s JPEG Litigation Run is Over</title>
		<link>http://marknelson.us/2006/11/02/forgents-jpeg-litigation-run-is-over/</link>
		<comments>http://marknelson.us/2006/11/02/forgents-jpeg-litigation-run-is-over/#comments</comments>
		<pubDate>Thu, 02 Nov 2006 15:57:23 +0000</pubDate>
		<dc:creator>Mark Nelson</dc:creator>
				<category><![CDATA[Data Compression]]></category>

		<guid isPermaLink="false">/2006/11/02/forgents-jpeg-litigation-run-is-over/</guid>
		<description><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2006/11/02/forgents-jpeg-litigation-run-is-over/' addthis:title='Forgent&#8217;s JPEG Litigation Run is Over' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div>It looks like Forgent&#8217;s long run of JPEG lawsuit revenue has now officially dried up. The Austin American Statesman has an article (registration required) that says the remaining JPEG lawsuits have all been settled, with a grand total of less than $8 million from fewer than 20 defendants. Apparently, the whole JPEG lawsuit run [...]]]></description>
			<content:encoded><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2006/11/02/forgents-jpeg-litigation-run-is-over/' addthis:title='Forgent&#8217;s JPEG Litigation Run is Over' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div><p>It looks like Forgent&#8217;s long run of JPEG lawsuit revenue has now officially dried up. The Austin American Statesman has an <a href="http://www.statesman.com/business/content/business/stories/technology/11/02/2forgent.html">article</a> (registration required) that says the remaining JPEG lawsuits have all been settled, with a grand total of less than $8 million from fewer than 20 defendants.</p>
<p>Apparently, the whole JPEG lawsuit run that Forgent was on must have racked up a lot of bad karma for the company. Although they collected over $100 million over a few years, they had to spend over half of that on legal fees, and while that was going on, they seem to have taken their eye off the ball and forgotten about developing new products and business. Their low stock price has generated a delisting warning from the NASDAQ, and to top it off, a digital video patent they had high hopes for appears to have been invalidated as well.</p>
<p>For those of you who haven&#8217;t been keeping up on Forgent&#8217;s activities, the company has been generating fairly large sums of money since 2002 by suing users of JPEG technology, starting with camera companies and moving on to computer manufacturers, web publishers, and so on. It all came to an end this year when the patent office invalidated key portions of the patent. A good recap can be found in <a href="http://en.wikipedia.org/wiki/Forgent_Networks">this Wikipedia article</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://marknelson.us/2006/11/02/forgents-jpeg-litigation-run-is-over/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Voice From the Past</title>
		<link>http://marknelson.us/2006/10/17/a-voice-from-the-past/</link>
		<comments>http://marknelson.us/2006/10/17/a-voice-from-the-past/#comments</comments>
		<pubDate>Wed, 18 Oct 2006 01:39:37 +0000</pubDate>
		<dc:creator>Mark Nelson</dc:creator>
				<category><![CDATA[Data Compression]]></category>
		<category><![CDATA[People]]></category>
		<category><![CDATA[Writing]]></category>

		<guid isPermaLink="false">/2006/10/17/a-voice-from-the-past/</guid>
		<description><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2006/10/17/a-voice-from-the-past/' addthis:title='A Voice From the Past' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div>The eminently quotable Samuel Johnson had a very pragmatic view about writing, and was quoted by Boswell as having said: No man but a blockhead ever wrote, except for money. Personally, I think Johnson is pretty close to the mark on this one, but I will add one caveat. Ask any writer about their first [...]]]></description>
			<content:encoded><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2006/10/17/a-voice-from-the-past/' addthis:title='A Voice From the Past' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div><p><img src="http://marknelson.us/attachments/2006/misc/dcus.gif" class="alignleft"/> The eminently quotable <a href="http://en.wikipedia.org/wiki/Samuel_Johnson">Samuel Johnson</a> had a very pragmatic view about writing, and was quoted by Boswell as having said:</p>
<blockquote><p>No man but a blockhead ever wrote, except for money.</p></blockquote>
<p>Personally, I think Johnson is pretty close to the mark on this one, but I will add one caveat. Ask any writer about their first book, and the thing they remember best is the thrill of seeing a volume with their name on it up on the shelf &#8211; money has nothing to do with it.</p>
<p>My first effort was <a href="http://dogma.net/markn/tdcb/tdcb.htm">The Data Compression Book</a>, published all the way back in 1992, when there was a lot less interest in the field than there is now. It was unbroken ground, which meant there was room for an amateur in the field, and I nearly had it to myself. </p>
<p>With the help of DDJ Editor and mentor <a href="http://ddj.com/erickson.htm">Jon Erickson</a>, I convinced M&#038;T Books I could do a creditable job on this book. They took me up on it, and believe me, the first time I walked into Taylor&#8217;s books in Dallas and saw three or four of these on the shelf, it was a thrill you can hardly imagine. </p>
<p>Fifteen years later, with a handful of other books behind me, I&#8217;m much more Johnson-esque and blas&#233; these days. Show me the money. </p>
<p>But every once in a while, something manages to pierce this hard-boiled shell and remind me of what it was like to first see that book in 1992. This week it was an email out of the blue from somebody named Steve Johnson, who was kind enough to let me reprint his email in its entirety:</p>
<blockquote><p>
<em>Dear Mark,</p>
<p>I have had a copy of The Data Compression Book in my possession for a long time.  It was a book of instrumental importance to me when I was starting out a fledgling business long ago &#8211; in 1992 &#8211; which turned into an image compression company partly because of your book.  Your book captivated me with the compression problem, and taught me the basics of information/coding theory (enough to become a dangerous dilettante).  Out of that understanding sprang a company called Johnson-Grace Company, which created the first streaming online media ever.  AOL used my algorithm to put pictures online in early 1994 when the world dialed in at 2400bps or 9600bps, and &#8216;digital pictures&#8217; were as fantastic as radio in 1920.  My little company went on in 1995 (around the time MSN was launched) and created &#8216;streaming sound&#8217; and &#8216;moving pictures&#8217; (slideshows with sound) and then simple telephony (still over dialup, now at 14.4kbps, we added a &#8216;talk&#8217; button to AOL’s Instant Message box – and then subtracted it after seeing the challenge of creating a consumer grade experience over low bandwidth dial up!). </p>
<p>AOL bought my company in 1996, and the algorithm (AOL’s proprietary &#8216;ART&#8217; format) still compresses billions of images everyday on the backend of their web delivery system.   I owe a great deal to your book – for its clear accessible style, and excellent coverage of the subject.  It quite simply taught me (an economist) how compression works, and I managed to put it to a use that solved an important problem.</p>
<p>So here’s why I’m writing (besides finally thanking you after all these years!).</p>
<p>My eldest daughter Emma graduated from high school earlier this year (we live in Boston) and has just commenced her freshman year at Oberlin College this fall.  As a graduation gift, I’m presenting her with a bound set of my favorite books called &#8220;Dad’s Great Books,&#8221; of which I have included my rebound copy of Data Compression.  It is no doubt one of the most influential books I’ve ever read, and I hope Emma cherishes the book as much as I did (she happens to be fascinated with information theory at the moment).</p>
<p>If it isn’t too much trouble to ask, I would greatly appreciate it if I could send this bound copy to you for your autograph and have you return the book in the FedEx envelope that I would include.</p>
<p>Please let me know, at your earliest convenience, as well as a good address to use for sending this your way. </p>
<p>Warm Regards,</p>
<p>Steve Johnson<br />
</em></p></blockquote>
<p>When I read this email to my wife she had tears in her eyes, and yes, I was a bit verklempt myself. </p>
<p>Steve sent me a nicely bound copy of the book and I inscribed it to him and Emma (without even asking for an honorarium!), and with luck she&#8217;ll have it on her shelf in a few days, a curious souvenir of bygone times.</p>
<p>All in all a Hallmark moment, although there is a scary side to the whole thing. Steve comes from a background in Economics, known as the Dismal Science. Emma is interested in Information Theory, which has no nickname yet, but perhaps should be known as The Yet More Dismal Science.</p>
<p>Should this continue, I fear for Emma&#8217;s children. They&#8217;ll have to find a branch of mathematical science even more obscure, confusing, and impossible to explain at family functions. Will they be Set Theorists? Transfinite Algebrists? </p>
<p>Regardless of their intellectual or business pursuits, it seems inevitable that anyone from this line of people will be delightful to have around. </p>
<p>Thanks, Steve, for a reminder of what it was like to write that book in 1992, and best of all, to know that at least for you, it was exactly what I had hoped it would be.</p>
]]></content:encoded>
			<wfw:commentRss>http://marknelson.us/2006/10/17/a-voice-from-the-past/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

