DDJ Cover from August, 2002 Dr. Dobb's Journal

August, 2002

This article describes an interesting data compression technique called star-encoding. I'll explain what star-encoding is, describe a simple C++ implementation, and show you what sort of results you can expect to see when using it.

Pre-Compression Transformations

One interesting class of data compression techniques consists of transformation algorithms. It is often possible to perform a reversible transformation on a data set that increases its susceptibility to other compression techniques.

A great example of this is the Burrows-Wheeler Transform (BWT), which I described in an article in Dr. Dobb's Journal in 1996. The BWT is a completely reversible sorting algorithm that tends to create long runs of identical characters in the output text. Using a Move To Front algorithm followed by Huffman or arithmetic coding gives compression that outperforms all but the best algorithms.

A similar example is that of the Discrete Cosine Transform (DCT). In the JPEG compression algorithm, 8X8 blocks of pixels are run through the DCT, which transforms the spatial data points to the frequency domain. At that point, lossy compression algorithms can then be effectively applied.

Of course, the transformations I'm describing here generally don't compress data at all; they merely prep the data before compression is applied. Thus, a key factor in using any transform is understanding what compression technique can be effectively be used on the output. With the Burrows-Wheeler transform, the best results come from a three-stage compressor. The output of the BWT transform is usually piped through a move-to-front stage, then a run length encoder stage, and finally an entropy encoder, normally arithmetic or Huffman coding. The actual command line to perform this sequence will look something like this:

BWT < input-file | MTF | RLE | ARI > output-file

Star-Encoding

The Burrows-Wheeler transform is a fairly complicated sorting algorithm that radically changes the nature of an input file. Star-encoding is much simpler to understand, and can be implemented easily with the help of the standard C++ library.

Star-encoding works by creating a large dictionary of commonly used words expected in the input files. The dictionary must be prepared in advance, and must be known to the compressor and decompressor.

Each word in the dictionary has a star-encoded equivalent, in which as many letters a possible are replaced by the '*' character. For example, a commonly used word such as the might be replaced by the string t**. The star-encoding transform simply replaces every occurrence of the word the in the input file with t**.

Ideally, the most common words will have the highest percentage of '*' characters in their encodings. If done properly, this means that transformed file will have a huge number of '*' characters. This ought to make the transformed file more compressible than the original plain text.

As an example, a section of text from Project Gutenberg's version of Romeo and Juliet looks like this in the original text:

But soft, what light through yonder window breaks?
It is the East, and Iuliet is the Sunne,
Arise faire Sun and kill the enuious Moone,
Who is already sicke and pale with griefe,
That thou her Maid art far more faire then she

Running this text through the star-encoder yields the following text:

B** *of*, **a* **g** *****g* ***d*r ***do* b*e***?
It *s *** E**t, **d ***i** *s *** *u**e,
A***e **i** *un **d k*** *** e****** M****,
*ho *s a****** **c*e **d **le ***h ****fe,
***t ***u *e* *ai* *r* f*r **r* **i** ***n s**

You can clearly see that the encoded data has exactly the same number of characters, but is dominated by stars. It certainly looks as though it is more compressible. But before you can test the compressibility of the output data, you need to build the three programs this system requires:

  • A program to create the dictionary.
  • An encoder which star-encodes a file using a dictionary.
  • A decoder which decodes the file using an identical dictionary.

I'll discuss each of these in turn in the remainder of this article.

Creating the Dictionary

The program that creates the dictionary, MakeDictA.cpp, is show in Listing 1. This program is a testament to the power of the C++ standard library classes. It would literally triple in size if it were written in C, and would have been a nightmare to debug.

MakeDictA is invoked with a list of file names as command line arguments. It scans through each file looking for tokens, adds them to the dictionary if needed, and updates the token's frequency count.

After all of the tokens have been counted, the list is sorted by frequency. The program then goes through the list, starting with the most frequent tokens, and assigns them star-codes. As a final step, the dictionary is written to standard output.

Tokenizing and scanning

Unfortunately, the C++ standard library doesn't have a stream tokenizer. If my needs were more complicated, I might use the Boost Tokenizer found at the boost.org Tokenizer page, but in this case I spurned reuse and wrote just a few lines:

C++:
  1. string get_token( ifstream &file )
  2. {
  3.     string token;
  4.     while ( file )
  5.     {
  6.         char c;
  7.         file>> c;
  8.         if ( isalpha( c ) || c == '\'' )
  9.             token += c;
  10.         else if ( token.size() )
  11.             return token;
  12.     }
  13.     return token;
  14. }

This function is called repeatedly from main(). The map object named frequency is updated as each token is encountered. If the string is not found in frequency, it is inserted with a count of 1. Otherwise the count is incremented:

C++:
  1. string token = get_token( infile );
  2.     if ( token.size() == 0 )
  3.         break;
  4.     map<string, int>::const_iterator ii = frequency.find( token );
  5.     if ( ii == frequency.end() )
  6.         frequency[ token ] = 1;
  7.     else
  8.         frequency[ token ]++;

Sorting and Assigning Codes

After all the tokens from all the input files have been scanned, you have a map with an appearance count for every word encountered in the input text. The next step requires that you iterate through all the words starting at the most frequent and going all the way to the least.

Unfortunately, the frequency map is keyed on the word name, not the frequency. So before code assignment can start, you need to reorder the data. You can do this with just a few lines of code by inserting all the string/integer pairs into a map object called counts. Since this map is keyed on the integer value instead of the string, the values are automatically sorted in the form you want.

C++:
  1. multimap<int, string> counts;
  2.     for ( map<string, int>::iterator ii = frequency.begin() ;
  3.           ii != frequency.end();
  4.           ii++ )
  5.       counts.insert( pair<int,string>( (*ii).second, (*ii).first ) )

With the data sorted correctly, the remaining piece of hard work is to iterate over the counts map and assign a code to each one:

C++:
  1. set<string> used_codes;
  2. for ( multimap<int, string>::reverse_iterator jj = counts.rbegin() ;
  3.       jj != counts.rend() ;
  4.       jj++ )
  5. {
  6.   string code = create_star_code( (*jj).second, used_codes );
  7.   used_codes.insert( code );
  8.   codes[ (*jj).second ] = code;
  9.   cout <<(*jj).second <<" " <<code <<" " <<(*jj).first <<endl;
  10. }

A couple of interesting things of note show up in this loop. First, each time a new code is assigned to a token, the code value is stored in a set called used_codes. You have to keep track of the used code values during the assignment process so that no star code is inadvertently used twice.

Second, you can see that the value of each token and its code is being written to standard output. (I write the frequency count as well, although this is technically not needed. It helps me to keep an eye on the process and ensure it is working as desired.)

The Code Assignment Algorithm

In MakeDictA, I do star code assignment in a routine called create_star_code(). This algorithm uses a greedy heuristic to assign codes, attempting to use the highest possible number of '*' characters to each token.:

C++:
  1. for ( int star_count = token.size() ; star_count> 0 ; star_count-- )
  2. {
  3.   vector<char> pattern( token.size(), '*' );
  4.   for ( int i = star_count ; i <token.size() ; i++ )
  5.     pattern[ i ] = '-';
  6.   do
  7.   {
  8.     string test( token );
  9.     for ( int j = 0 ; j <token.size() ; j++ )
  10.       if ( pattern[ j ] == '*' )
  11.         test[ j ] = '*';
  12.     set<string>::const_iterator kk = used_codes.find( test );
  13.     if ( kk == used_codes.end() )
  14.       return test;
  15.   } while ( next_permutation( pattern.begin(), pattern.end() ) );
  16. }
  17. return token;

For a token of length N, the assignment routine first attempts to replace all characters in the token with stars, then all but one, then all but two, and so one. Note that the often-overlooked next_permutation function from the standard library helps out quite a bit with this.

Executing MakeDictA

To test this program, I went to Project Gutenberg and procured the text of eight of Shakespeare's plays. (The text is included in the source zip file for this project.) With allowances for line breaks, the text below shows the command line I used to create the dictionary file:

[c:\star] MakeDictA text\AsYouLikeIt.txt
text\Hamlet.txt
text\JuliusCaesar.txt
text\KingLear.txt
text\Macbeth.txt
text\RomeoAndJuliet.txt
text\TheTamingOfTheShrew.txt
text\TheTempest.txt
> DictA.txt

The program creates 16,927 codes in fairly short order, producing DictA.txt, a file that is shown below in somewhat abridged form:

the *** 4556
I * 4390
and **d 3676
to ** 3218
of *f 2899
you **u 2775
a a 2714
.
.[skipping 8500 lines]
.
vexations ****t**n* 1
vestall **s*a** 1
versall **r**l* 1
verities ***i***s 1
verge v*r** 1
.
. [skipping 8500 lines]
.
'Twentie *T****** 1
'Saue *S*** 1
'Prethee *P****** 1
'Pre *P** 1
'Boue *B***

With this somewhat limited vocabulary, you can see that even words that appear very rarely are replaced with a very high percentage of stars. The most frequent words are replaced completely.

Compressing and Decompressing the Transformed Data

Listing 2 and 3 show StarEncode.cpp and StarDecode.cpp, the short programs used to transform text files to and from star-encoded format. I won't go into the details of their operation - if you understand MakeDictA.cpp, you won't have any trouble with these two programs.

Both programs are invoked with the name of the dictionary file on the command line. The programs then transform standard input to standard output. Internally, they simply read the dictionary file into a map object that holds all the encodings. They then read in text, passing special characters through unchanged, and translating tokens whenever possible.

Figure 1 shows the encoding program in operation, as it encodes Romeo and Juliet using the dictionary created earlier:



Figure 1 - Encoding Romeo and Juliet

The result of this encoding should be a file that is now more compressible than the original. To test this theory, I compressed both the original and the encoded file in WinZip, with the results shown in Figure 2.



Figure 2 - Compressing the encoded file

The results are encouraging. The compressed file decreases in size by almost 12%, with the compression ratio improving from 60% to 64%. Tangible savings with only a small increase in computation.

Variant B

While building the encoder, I started wondering about the structure of my star codes. It seemed kind of arbitrary to insist that every star code have the same number of stars as the word it was encoding. For example, the most frequent word, the, was encoded as "***", while the second most frequent word, I, was encoded with a single star. Shouldn't the most frequent word get the fewest number of stars?

To test this theory, I created a second dictionary builder using a new heuristic for assigning codes. In this heuristic, a star code is simply a single star character concatenated with a plain-text code number, with the most frequent token being assigned code 0.

Using this scheme simplified my encoder quite a bit. The main loop that assigns codes to tokens no longer needed to keep track of used codes, and was simple enough to do code assignment inside the loop, instead of in a separate function:

C++:
  1. int star_code = 0;
  2. for ( multimap<int, string>::reverse_iterator jj = counts.rbegin() ;
  3.       jj != counts.rend() ;
  4.       jj++ )
  5. {
  6.   string token = (*jj).second;
  7.   if ( token.size()> 1 )
  8.   {
  9.     cout <<token
  10.          <<" "
  11.          <<'*'
  12.          <<star_code++
  13.      <<" "
  14.          <<(*jj).first
  15.          <<endl;
  16.   }
  17. }

Comparing the output from Variant B to that describe earlier shows how the two strategies create quite different dictionaries:

the *0 4556
and *1 3676
to *2 3218
of *3 2899
you *4 2775
.
.[skipping 8500 lines]
.
vexations *8480 1
vestall *8481 1
versall *8482 1
verities *8483 1
verge *8484 1
.
. [skipping 8500 lines]
.
'Twentie *16899 1
'Saue *16900 1
'Prethee *16901 1
'Pre *16902 1
'Boue *16903 1

One notable difference in this case is that I don't bother to encode strings of just a single character in length, as their encoded values would always be larger. To test this variant, I compiled MakeDictB.cpp, created DictB.txt, and ran RomeoandJuliet.txt through StarEncode. The five lines shown earlier in this article have now morphed into the following:

*43 *1069, *41 *349 *469 *1492 *2583 *12782?
*114 *7 *0 *2550, *1 *417 *7 *0 *931,
*5359 *193 *902 *1 *521 *0 *1548 *616,
*168 *7 *1283 *731 *1 *695 *13 *1042,
*34 *21 *27 *938 *144 *3817 *49 *193 *37 *67

The text is somewhat harder to read, but also a bit more compact. Note also that the character set is much more limited here, since every word is star-encoded using only the star character and the integers.

Comparing A and B

Figure 3 shows the listing of a Zip file that contains both encoded texts plus the original.



Figure 3 - A Zip file with both A and B variants

In this case, the difference between a standard Zip compressor and variant B runs nearly 15%, quite a nice savings. You can see that variant B clearly comes out ahead, with a delta that seems worth going for.

Recommendations and Caveats

Star-encoding is clearly a handy way to squeeze some additional fluff out of your files. Of course, there is a catch. For star-encoding to be effective, you have to know in advance what the vocabulary of your source files is going to be. There are many times when this will be the case. Typical examples might include:

  • Computer-generated reports
  • Email or newsgroup archives
  • Trace or log files
  • Cached web pages

I doubt that star-encoding could be used in a general purpose compressor. A hypothetical archiver called StarZip might attempt to create a good dictionary and tokenizer for various file types, but the task seems difficult.

The code I have written here is typical magazine code: short and simple. It could form the basis for production code, but would need some hardening first. For example, the programs shown here can't cope with something as simple as a star character embedded in the plain text. Caveat emptor.

There is plenty of room for experimentation here. I'd love to hear back from any readers who develop encodings that can outdo the ones shown here. Send me an email if you create something interesting here. Enjoy!

References

Robert Franceschini and Amar Mukherjee. Data Compression Using Encrypted Text. Proceedings of the third Forum on Research and Technology, Advances on Digital Libraries, ADL 96, 1996, pp. 130-138. Robert Franceschini, Holger Kruse, Nan Zhang, Raja Iqbal and Amar Mukherjee, Lossless, Reversible Transformations that Improve Text Compression Ratios.

Mark Nelson, Data Compression with the Burrows-Wheeler Transform, Dr. Dobb's Journal, September 1996.

Listings

Listing 1 - MakeDictA.cpp
Listing 2 - StarEncode.cpp
Listing 3 - StarDecode.cpp
Listing 4 - MakeDictB.cpp
Sample text - Shakespeare.zip