Star-Encoding in C++
Dr. Dobb's Journal, August 2002
This article describes an interesting data compression technique called star-encoding. I’ll explain what star-encoding is, describe a simple C++ implementation, and show you what sort of results you can expect to see when using it.
Pre-Compression Transformations
One interesting class of data compression techniques consists of transformation algorithms. It is often possible to perform a reversible transformation on a data set that increases its susceptibility to other compression techniques.
A great example of this is the Burrows-Wheeler Transform (BWT), which I described in an article in Dr. Dobb’s Journal in 1996. The BWT is a completely reversible sorting algorithm that tends to create long runs of identical characters in the output text. Using a Move To Front algorithm followed by Huffman or arithmetic coding gives compression that outperforms all but the best algorithms.
A similar example is the Discrete Cosine Transform (DCT). In the JPEG compression algorithm, 8x8 blocks of pixels are run through the DCT, which transforms the spatial data points to the frequency domain. At that point, lossy compression techniques can be effectively applied.
Of course, the transformations I’m describing here generally don’t compress data at all; they merely prep the data before compression is applied. Thus, a key factor in using any transform is understanding what compression technique can effectively be used on its output. With the Burrows-Wheeler transform, the best results come from a three-stage compressor: the output of the BWT is piped through a move-to-front stage, then a run-length encoder, and finally an entropy encoder, normally arithmetic or Huffman coding. The actual command line to perform this sequence looks something like this:
BWT < input-file | MTF | RLE | ARI > output-file
Star-Encoding
The Burrows-Wheeler transform is a fairly complicated sorting algorithm that radically changes the nature of an input file. Star-encoding is much simpler to understand, and can be implemented easily with the help of the standard C++ library.
Star-encoding works by creating a large dictionary of commonly used words expected in the input files. The dictionary must be prepared in advance, and must be known to the compressor and decompressor.
Each word in the dictionary has a star-encoded equivalent, in which as many letters as possible are replaced by the ‘*’ character. For example, a commonly used word such as the might be replaced by the string t**. The star-encoding transform simply replaces every occurrence of the word the in the input file with t**.
Ideally, the most common words will have the highest percentage of ‘*’ characters in their encodings. If done properly, this means the transformed file will have a huge number of ‘*’ characters. This ought to make the transformed file more compressible than the original plain text.
As an example, a section of text from Project Gutenberg’s version of Romeo and Juliet looks like this in the original text:
But soft, what light through yonder window breaks?
It is the East, and Iuliet is the Sunne,
Arise faire Sun and kill the enuious Moone,
Who is already sicke and pale with griefe,
That thou her Maid art far more faire then she
Running this text through the star-encoder yields the following text:
B** *of*, **a* **g** *****g* ***d*r ***do* b*e***?
It *s *** E**t, **d ***i** *s *** *u**e,
A***e **i** *un **d k*** *** e****** M****,
*ho *s a****** **c*e **d **le ***h ****fe,
***t ***u *e* *ai* *r* f*r **r* **i** ***n s**
You can clearly see that the encoded data has exactly the same number of characters, but is dominated by stars. It certainly looks as though it is more compressible. But before you can test the compressibility of the output data, you need to build the three programs this system requires:
- A program to create the dictionary.
- An encoder which star-encodes a file using a dictionary.
- A decoder which decodes the file using an identical dictionary.
I’ll discuss each of these in turn in the remainder of this article.
Creating the Dictionary
The program that creates the dictionary, MakeDictA.cpp, is shown in Listing 1. This program is a testament to the power of the C++ standard library classes. It would literally triple in size if it were written in C, and would be a nightmare to debug.
MakeDictA is invoked with a list of file names as command line arguments. It scans through each file looking for tokens, adds each one to the dictionary if needed, and updates the token’s frequency count.
After all of the tokens have been counted, the list is sorted by frequency. The program then goes through the list, starting with the most frequent tokens, and assigns them star-codes. As a final step, the dictionary is written to standard output.
Tokenizing and Scanning
Unfortunately, the C++ standard library doesn’t have a stream tokenizer. If my needs were more complicated, I might use the Boost Tokenizer found at the boost.org Tokenizer page, but in this case I spurned reuse and wrote just a few lines:
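What follows is my sketch of that tokenizer, reconstructed from the description rather than copied from the listing. (The fragments in this article all assume the same handful of headers: <iostream>, <fstream>, <string>, <map>, <set>, <algorithm>, and <cctype>.) It gathers runs of letters, plus a leading apostrophe, which the sample dictionary shows surviving; anything else ends the current token:

bool get_token(std::istream &in, std::string &token)
{
    token.erase();
    int c;
    while ((c = in.get()) != EOF) {
        if (isalpha(c) || c == '\'')
            token += (char) c;     // still inside a token
        else if (!token.empty())
            return true;           // a delimiter with a token in hand
    }
    return !token.empty();         // flush a token that runs up to EOF
}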
This function is called repeatedly from main(). The map<string,int> object named frequency is updated as each token is encountered. If the string is not found in frequency, it is inserted with a count of 1. Otherwise the count is incremented:
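Here is a sketch of that counting loop, again my reconstruction rather than the published listing. (file is assumed to be an ifstream opened on each command-line argument.)

std::map<std::string, int> frequency;
std::string token;
while (get_token(file, token)) {
    std::map<std::string, int>::iterator it = frequency.find(token);
    if (it == frequency.end())
        frequency[token] = 1;  // first sighting: insert with a count of 1
    else
        it->second++;          // already present: bump the count
}

In fact, plain frequency[token]++ would do the same job, since operator[] default-constructs a missing entry to zero before the increment.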
Sorting and Assigning Codes
After all the tokens from all the input files have been scanned, you have a map with an appearance count for every word encountered in the input text. The next step requires that you iterate through all the words starting at the most frequent and going all the way to the least.
Unfortunately, the frequency map is keyed on the word, not the frequency. So before code assignment can start, you need to reorder the data. You can do this with just a few lines of code by inserting all the string/integer pairs into a multimap<int,string> object called counts. (A multimap is needed rather than a plain map, since many distinct words share the same count.) Since this container is keyed on the integer value instead of the string, the entries are automatically sorted in the order you want.
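The reordering might look like this (my sketch, not the published listing):

std::multimap<int, std::string> counts;
for (std::map<std::string, int>::iterator it = frequency.begin();
     it != frequency.end(); ++it)
    counts.insert(std::make_pair(it->second, it->first));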
With the data sorted correctly, the remaining piece of hard work is to iterate over the counts map and assign a code to each entry:
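Here is a reconstruction of that loop; the real listing may differ in detail. Walking the multimap in reverse visits the most frequent tokens first:

std::set<std::string> used_codes;
for (std::multimap<int, std::string>::reverse_iterator it = counts.rbegin();
     it != counts.rend(); ++it) {
    std::string code = create_star_code(it->second, used_codes);
    used_codes.insert(code);  // remember it so it is never handed out twice
    std::cout << it->second << " " << code << " " << it->first << "\n";
}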
A couple of interesting things show up in this loop. First, each time a new code is assigned to a token, the code value is stored in a set<string> called used_codes. You have to keep track of the used code values during the assignment process so that no star code is inadvertently used twice.
Second, you can see that the value of each token and its code is being written to standard output. (I write the frequency count as well, although this is technically not needed. It helps me to keep an eye on the process and ensure it is working as desired.)
The Code Assignment Algorithm
In MakeDictA, I do star code assignment in a routine called create_star_code(). This algorithm uses a greedy heuristic, attempting to assign the highest possible number of ‘*’ characters to each token:
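Here is my reconstruction of create_star_code(); it reproduces the sample codes shown later in this article, though the published listing may differ in detail. The mask string holds a '1' wherever an original character should show through, and next_permutation steps through every placement of the visible characters:

std::string create_star_code(const std::string &token,
                             const std::set<std::string> &used_codes)
{
    size_t n = token.size();
    for (size_t visible = 0; visible <= n; ++visible) {
        std::string mask(n - visible, '0'); // '0' means star
        mask.append(visible, '1');          // lowest permutation first
        do {
            std::string code(token);
            for (size_t i = 0; i < n; ++i)
                if (mask[i] == '0')
                    code[i] = '*';
            if (used_codes.count(code) == 0)
                return code;                // first unused code wins
        } while (std::next_permutation(mask.begin(), mask.end()));
    }
    return token; // no free code at all; fall back to the word itself
}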
For a token of length N, the assignment routine first attempts to replace all N characters in the token with stars, then all but one, then all but two, and so on. Note that the often-overlooked next_permutation function from the standard library helps out quite a bit here.
Executing MakeDictA
To test this program, I went to Project Gutenberg and procured the text of eight of Shakespeare’s plays. (The text is included in the source zip file for this project.) With allowances for line breaks, the text below shows the command line I used to create the dictionary file:
[c:\star] MakeDictA text\AsYouLikeIt.txt
text\Hamlet.txt
text\JuliusCaesar.txt
text\KingLear.txt
text\Macbeth.txt
text\RomeoAndJuliet.txt
text\TheTamingOfTheShrew.txt
text\TheTempest.txt
> DictA.txt
The program creates 16,927 codes in fairly short order, producing DictA.txt, a file that is shown below in somewhat abridged form:
the *** 4556
I * 4390
and **d 3676
to ** 3218
of *f 2899
you **u 2775
a a 2714
.
.[skipping 8500 lines]
.
vexations ****t**n* 1
vestall **s*a** 1
versall **r**l* 1
verities ***i***s 1
verge v*r** 1
.
. [skipping 8500 lines]
.
'Twentie *T****** 1
'Saue *S*** 1
'Prethee *P****** 1
'Pre *P** 1
'Boue *B*** 1
With this somewhat limited vocabulary, you can see that even words that appear very rarely are replaced with a very high percentage of stars. The most frequent words are replaced completely.
Compressing and Decompressing the Transformed Data
Listings 2 and 3 show StarEncode.cpp and StarDecode.cpp, the short programs used to transform text files to and from star-encoded format. I won’t go into the details of their operation; if you understand MakeDictA.cpp, you won’t have any trouble with these two programs.
Both programs are invoked with the name of the dictionary file on the command line; they then transform standard input to standard output. Internally, they simply read the dictionary file into a map<string,string> object that holds all the encodings. They then read in text, passing special characters through unchanged and translating tokens whenever possible.
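The heart of the encoder might look something like the following sketch (the decoder is the same with the two sides of the map reversed); encodings is assumed to have already been loaded from the dictionary file:

std::map<std::string, std::string> encodings; // word -> star code
std::string token;
int c;
while ((c = std::cin.get()) != EOF) {
    if (isalpha(c) || c == '\'')
        token += (char) c;          // still inside a token
    else {
        if (!token.empty()) {
            std::map<std::string, std::string>::iterator it =
                encodings.find(token);
            std::cout << (it == encodings.end() ? token : it->second);
            token.erase();
        }
        std::cout.put((char) c);    // special characters pass through
    }
}
if (!token.empty()) {               // flush a token that ends at EOF
    std::map<std::string, std::string>::iterator it = encodings.find(token);
    std::cout << (it == encodings.end() ? token : it->second);
}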
Figure 1 shows the encoding program in operation, as it encodes Romeo and Juliet using the dictionary created earlier:
Figure 1 - Encoding Romeo and Juliet
The result of this encoding should be a file that is now more compressible than the original. To test this theory, I compressed both the original and the encoded file in WinZip, with the results shown in Figure 2.
Figure 2 - Compressing the encoded file
The results are encouraging. The compressed file decreases in size by almost 12%, with the compression ratio improving from 60% to 64%. That’s a tangible savings with only a small increase in computation.
Variant B
While building the encoder, I started wondering about the structure of my star codes. It seemed kind of arbitrary to insist that every star code have the same number of characters as the word it was encoding. For example, the most frequent word, the, was encoded as “***”, while the second most frequent word, I, was encoded with a single star. Shouldn’t the most frequent word get the fewest stars?
To test this theory, I created a second dictionary builder using a new heuristic for assigning codes. In this heuristic, a star code is simply a single star character concatenated with a plain-text code number, with the most frequent token being assigned code 0.
Using this scheme simplified my dictionary builder quite a bit. The main loop that assigns codes to tokens no longer needed to keep track of used codes, and the logic was simple enough to do the code assignment right in the loop, instead of in a separate function:
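Here is a sketch of that loop, reconstructed from the description: the star code is just ‘*’ plus a running decimal number, handed out in frequency order. (The skipping of one-character tokens is explained below.)

int next_code = 0;
for (std::multimap<int, std::string>::reverse_iterator it = counts.rbegin();
     it != counts.rend(); ++it) {
    if (it->second.size() < 2)
        continue;   // a coded single character could only get bigger
    std::cout << it->second << " *" << next_code++
              << " " << it->first << "\n";
}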
Comparing the output from Variant B to that described earlier shows how the two strategies create quite different dictionaries:
the *0 4556
and *1 3676
to *2 3218
of *3 2899
you *4 2775
.
.[skipping 8500 lines]
.
vexations *8480 1
vestall *8481 1
versall *8482 1
verities *8483 1
verge *8484 1
.
. [skipping 8500 lines]
.
'Twentie *16899 1
'Saue *16900 1
'Prethee *16901 1
'Pre *16902 1
'Boue *16903 1
One notable difference in this case is that I don’t bother to encode strings just a single character in length, as their encoded values would always be larger. To test this variant, I compiled MakeDictB.cpp, created DictB.txt, and ran RomeoAndJuliet.txt through StarEncode.
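Assuming the same invocation conventions as before (the output file name here is just for illustration), the test looks something like this:

[c:\star] MakeDictB text\AsYouLikeIt.txt ... text\TheTempest.txt > DictB.txt
[c:\star] StarEncode DictB.txt < text\RomeoAndJuliet.txt > RomeoB.txt

The ellipsis stands in for the same eight play texts used earlier.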
The five lines shown earlier in this article have now morphed into the following:
*43 *1069, *41 *349 *469 *1492 *2583 *12782?
*114 *7 *0 *2550, *1 *417 *7 *0 *931,
*5359 *193 *902 *1 *521 *0 *1548 *616,
*168 *7 *1283 *731 *1 *695 *13 *1042,
*34 *21 *27 *938 *144 *3817 *49 *193 *37 *67
The text is somewhat harder to read, but also a bit more compact. Note also that the character set is much more limited here, since every word is star-encoded using only the star character and decimal digits.
Comparing A and B
Figure 3 shows the listing of a Zip file that contains both encoded texts plus the original.
Figure 3 - A Zip file with both A and B variants
In this case, the difference between a standard Zip compressor and variant B runs nearly 15%, quite a nice savings. You can see that variant B clearly comes out ahead, with a delta that seems worth going for.
Recommendations and Caveats
Star-encoding is clearly a handy way to squeeze some additional fluff out of your files. Of course, there is a catch. For star-encoding to be effective, you have to know in advance what the vocabulary of your source files is going to be. There are many times when this will be the case. Typical examples might include:
- Computer-generated reports
- Email or newsgroup archives
- Trace or log files
- Cached web pages
I doubt that star-encoding could be used in a general purpose compressor. A hypothetical archiver called StarZip might attempt to create a good dictionary and tokenizer for various file types, but the task seems difficult.
The code I have written here is typical magazine code: short and simple. It could form the basis for production code, but would need some hardening first. For example, the programs shown here can’t cope with something as simple as a star character embedded in the plain text. Caveat emptor.
There is plenty of room for experimentation here. I’d love to hear from any readers who develop encodings that can outdo the ones shown in this article. Send me an email if you create something interesting. Enjoy!
References
Robert Franceschini and Amar Mukherjee, “Data Compression Using Encrypted Text,” Proceedings of the Third Forum on Research and Technology Advances in Digital Libraries (ADL ’96), 1996, pp. 130-138.
Robert Franceschini, Holger Kruse, Nan Zhang, Raja Iqbal, and Amar Mukherjee, “Lossless, Reversible Transformations that Improve Text Compression Ratios.”
Mark Nelson, “Data Compression with the Burrows-Wheeler Transform,” Dr. Dobb’s Journal, September 1996.
Listings
Listing 1 - MakeDictA.cpp
Listing 2 - StarEncode.cpp
Listing 3 - StarDecode.cpp
Listing 4 - MakeDictB.cpp
Sample text - Shakespeare.zip