This article first appeared in the
February, 1991 issue of
Dr. Dobb’s Journal

Update:I have a substantially improved article that covers Arithmetic Coding only which you can find here: Data Compression With Arithmetic Coding. I strongly encourage you to read this article first for an improved take on the topic.

Note:This article was originally published in two parts. Both parts have been combined here into one post – please note that when you reach the footnotes and attachments for the first part, you are only halfway through the article. Part 2 starts here.

Most of the data compression methods in common use today fall into one of two camps: dictionary based schemes and statistical methods. In the world of small systems, dictionary based data compression techniques seem to be more popular at this time. However, by combining arithmetic coding with powerful modeling techniques, statistical methods for data compression can actually achieve better performance. This two-part article discusses how to combine arithmetic coding with several different modeling methods to achieve some impressive compression ratios. The first part of the article details how arithmetic coding works. The second shows how to develop some effective models that can use an arithmetic coder to generate high performance compression programs.

Terms of Endearment

Data compression operates in general by taking symbols from an input text, processing them, and writing codes to a compressed file. For the purposes of this article, symbols are usually bytes, but they could just as easily be pixels, 80 bit floating point numbers, or EBCDIC characters. To be effective, a data compression scheme needs to be able to transform the compressed file back into an identical copy of the input text. Needless to say, it also helps if the compressed file is smaller than the input text!

Dictionary based compression systems operate by replacing groups of symbols in the input text with fixed length codes. A well-known example of a dictionary technique is LZW data compression. (See LZW Data Compression in the 10/89 issue of DDJ). LZW operates by replacing strings of essentially unlimited length with codes that usually range in size from 9 to 16 bits.

Statistical methods of data compression take a completely different approach. They operate by encoding symbols one at a time. The symbols are encoded into variable length output codes. The length of the output code varies based on the probability or frequency of the symbol. Low probability symbols are encoded using many bits, and high probability symbols are encoded using fewer bits.

In practice, the dividing line between statistical and dictionary methods is not always so distinct. Some schemes can’t be clearly put in one camp or the other, and there are always hybrids which use features from both techniques. However, the methods discussed in this article use arithmetic coding to implement purely statistical compression schemes.

Huffman Coding: The Retired Champion

It isn’t enough to just be able to accurately calculate the probability of characters in a datastream, we also need a coding method that effectively takes advantage of that knowledge. Probably the best known coding method based on probability statistics is Huffman coding. D. A. Huffman published a paper in 1952 describing a method of creating a code table for a set of symbols given their probabilities. The Huffman code table was guaranteed to produce the lowest possible output bit count possible for the input stream of symbols, when using fixed length codes. Huffman called these minimum redundancy codes, but the scheme is now universally called Huffman coding. Other fixed length coding systems, such as Shannon-Fano codes, were shown by Huffman to be non-optimal.

Huffman coding assigns an output code to each symbol, with the output codes being as short as 1 bit, or considerably longer than the input symbols, strictly depending on their probabilities. The optimal number of bits to be used for each symbol is the log base 2 of (1/p), where p is the probability of a given character. Thus, if the probability of a character is 1/256, such as would be found in a random byte stream, the optimal number of bits per character is log base 2 of 256, or 8. If the probability goes up to 1/2, the optimum number of bits needed to code the character would go down to 1.

The problem with this scheme lies in the fact that Huffman codes have to be an integral number of bits long. For example, if the probability of a character is 1/3, the optimum number of bits to code that character is around 1.6. The Huffman coding scheme has to assign either 1 or 2 bits to the code, and either choice leads to a longer compressed message than is theoretically possible.

This non-optimal coding becomes a noticeable problem when the probability of a character becomes very high. If a statistical method can be developed that can assign a 90% probability to a given character, the optimal code size would be 0.15 bits. The Huffman coding system would probably assign a 1 bit code to the symbol, which is 6 times longer than is necessary.

Adaptive Schemes

A second problem with Huffman coding comes about when trying to do adaptive data compression. When doing non-adaptive data compression, the compression program makes a single pass over the data to collect statistics. The data is then encoded using the statistics, which are unchanging throughout the encoding process.

In order for the decoder to decode the compressed data stream, it first needs a copy of the statistics. The encoder will typically prepend a statistics table to the compressed message, so the decoder can read it in before starting. This obviously adds a certain amount of excess baggage to the message.

When using very simple statistical models to compress the data, the encoding table tends to be rather small. For example, a simple frequency count of each character could be stored in as little as 256 bytes with fairly high accuracy. This wouldn’t add significant length to any except the very smallest messages. However, in order to get better compression ratios, the statistical models by necessity have to grow in size. If the statistics for the message become too large, any improvement in compression ratios will be wiped out by added length needed to prepend them to the encoded message.

In order to bypass this problem, adaptive data compression is called for. In adaptive data compression, both the encoder and decoder start with their statistical model in the same state. Each of them process a single character at a time, and update their models after the character is read in. This is very similar to the way most dictionary based schemes such as LZW coding work. A slight amount of efficiency is lost by beginning the message with a non-optimal model, but it is usually more than made up for by not having to pass any statistics with the message.

The problem with combining adaptive modeling with Huffman coding is that rebuilding the Huffman tree is a very expensive process. For an adaptive scheme to be efficient, it is necessary to adjust the Huffman tree after every character. Algorithms to perform adaptive Huffman coding were not published until over 20 years after Huffman coding was first developed. Even now, the best adaptive Huffman coding algorithms are still relatively time and memory consuming.

Arithmetic Coding: how it works

It has only been in the last ten years that a respectable candidate to replace Huffman coding has been successfully demonstrated: Arithmetic coding. Arithmetic coding completely bypasses the idea of replacing an input symbol with a specific code. Instead, it takes a stream of input symbols and replaces it with a single floating point output number. The longer (and more complex) the message, the more bits are needed in the output number. It was not until recently that practical methods were found to implement this on computers with fixed sized registers.

The output from an arithmetic coding process is a single number less than 1 and greater than or equal to 0. This single number can be uniquely decoded to create the exact stream of symbols that went into its construction. In order to construct the output number, the symbols being encoded have to have a set probabilities assigned to them. For example, if I was going to encode the random message “BILL GATES”, I would have a probability distribution that looks like this:

Character    Probability
---------    -----------
 SPACE          1/10
   A            1/10
   B            1/10
   E            1/10
   G            1/10
   I            1/10
   L            2/10
   S            1/10
   T            1/10

Once the character probabilities are known, the individual symbols need to be assigned a range along a probability line, which is nominally 0 to 1. It doesn’t matter which characters are assigned which segment of the range, as long as it is done in the same manner by both the encoder and the decoder. The nine character symbol set use here would look like this:

  Character    Probability      Range
 ---------    -----------   -----------
  SPACE          1/10       0.00 - 0.10
    A            1/10       0.10 - 0.20
    B            1/10       0.20 - 0.30
    E            1/10       0.30 - 0.40
    G            1/10       0.40 - 0.50
    I            1/10       0.50 - 0.60
    L            2/10       0.60 - 0.80
    S            1/10       0.80 - 0.90
    T            1/10       0.90 - 1.00

Each character is assigned the portion of the 0-1 range that corresponds to its probability of appearance. Note also that the character “owns” everything up to, but not including the higher number. So the letter ‘T’ in fact has the range 0.90 – 0.9999….

The most significant portion of an arithmetic coded message belongs to the first symbol to be encoded. When encoding the message “BILL GATES”, the first symbol is “B”. In order for the first character to be decoded properly, the final coded message has to be a number greater than or equal to 0.20 and less than 0.30. What we do to encode this number is keep track of the range that this number could fall in. So after the first character is encoded, the low end for this range is 0.20 and the high end of the range is 0.30.

After the first character is encoded, we know that our range for our output number is now bounded by the low number and the high number. What happens during the rest of the encoding process is that each new symbol to be encoded will further restrict the possible range of the output number. The next character to be encoded, ‘I’, owns the range 0.50 through 0.60. If it was the first number in our message, we would set our low and high range values directly to those values. But ‘I’ is the second character. So what we do instead is say that ‘I’ owns the range that corresponds to 0.50-0.60 in the new subrange of 0.2 – 0.3. This means that the new encoded number will have to fall somewhere in the 50th to 60th percentile of the currently established range. Applying this logic will further restrict our number to the range 0.25 to 0.26.

The algorithm to accomplish this for a message of any length is is shown below:

Set low to 0.0
Set high to 1.0
While there are still input symbols do
    get an input symbol
    code_range = high - low.
    high = low + range*high_range(symbol)
    low = low + range*low_range(symbol)
End of While
output low

Following this process through to its natural conclusion with our chosen message looks like this:

New Character   Low value        High Value
-------------   ---------        ----------
                0.0              1.0
    B           0.2              0.3
    I           0.25             0.26
    L           0.256            0.258
    L           0.2572           0.2576
  SPACE         0.25720          0.25724
    G           0.257216         0.257220
    A           0.2572164        0.2572168
    T           0.25721676       0.2572168
    E           0.257216772      0.257216776
    S           0.2572167752     0.2572167756

So the final low value, 0.2572167752 will uniquely encode the message “BILL GATES” using our present encoding scheme.

Given this encoding scheme, it is relatively easy to see how the decoding process will operate. We find the first symbol in the message by seeing which symbol owns the code space that our encoded message falls in. Since the number 0.2572167752 falls between 0.2 and 0.3, we know that the first character must be “B”. We then need to remove the “B” from the encoded number. Since we know the low and high ranges of B, we can remove their effects by reversing the process that put them in. First, we subtract the low value of B from the number, giving 0.0572167752. Then we divide by the range of B, which is 0.1. This gives a value of 0.572167752. We can then calculate where that lands, which is in the range of the next letter, “I”.

The algorithm for decoding the incoming number looks like this:

get encoded number 
Do 
    find symbol whose range straddles the encoded number 
    output the symbol 
    range = symbol low value - symbol high value 
    subtract symbol low value from encoded number 
    divide encoded number by range 
until no more symbols 

Note that I have conveniently ignored the problem of how to decide when there are no more symbols left to decode. This can be handled by either encoding a special EOF symbol, or carrying the stream length along with the encoded message.

The decoding algorithm for the “BILL GATES” message will proceed something like this:

Encoded Number  Output Symbol    Low   High   Range
--------------  -------------    ---   ----   -----
0.2572167752          B          0.2   0.3     0.1
0.572167752           I          0.5   0.6     0.1
0.72167752            L          0.6   0.8     0.2
0.6083876             L          0.6   0.8     0.2
0.041938            SPACE        0.0   0.1     0.1
0.41938               G          0.4   0.5     0.1
0.1938                A          0.2   0.3     0.1
0.938                 T          0.9   1.0     0.1
0.38                  E          0.3   0.4     0.1
0.8                   S          0.8   0.9     0.1
0.0

In summary, the encoding process is simply one of narrowing the range of possible numbers with every new symbol. The new range is proportional to the predefined probability attached to that symbol. Decoding is the inverse procedure, where the range is expanded in proportion to the probability of each symbol as it is extracted.

Practical Matters

The process of encoding and decoding a stream of symbols using arithmetic coding is not too complicated. But at first glance, it seems completely impractical. Most computers support floating point numbers of up to 80 bits or so. Does this mean you have to start over every time you finish encoding 10 or 15 symbols? Do you need a floating point processor? Can machines with different floating point formats communicate using arithmetic coding?

As it turns out, arithmetic coding is best accomplished using standard 16 bit and 32 bit integer math. No floating point math is required, nor would it help to use it. What is used instead is a an incremental transmission scheme, where fixed size integer state variables receive new bits in at the low end and shift them out the high end, forming a single number that can be as many bits long as are available on the computer’s storage medium.

In the previous section, I showed how the algorithm works by keeping track of a high and low number that bracket the range of the possible output number. When the algorithm first starts up, the low number is set to 0.0, and the high number is set to 1.0. The first simplification made to work with integer math is to change the 1.0 to 0.999…., or .111… in binary.

In order to store these numbers in integer registers, we first justify them so the implied decimal point is on the left hand side of the word. Then we load as much of the initial high and low values as will fit into the word size we are working with. My implementation uses 16 bit unsigned math, so the initial value of high is 0xFFFF, and low is 0. We know that the high value continues with FFs forever, and low continues with 0s forever, so we can shift those extra bits in with impunity when they are needed.

If you imagine our “BILL GATES” example in a 5 digit register, the decimal equivalent of our setup would look like this:

HIGH: 99999
LOW: 00000

In order to find our new range numbers, we need to apply the encoding algorithm from the previous section. We first had to calculate the range between the low value and the high value. The difference between the two registers will be 100000, not 99999. This is because we assume the high register has an infinite number of 9′s added on to it, so we need to increment the calculated difference. We then compute the new high value using the formula from the previous section:

high = low + high_range(symbol)

In this case the high range was .30, which gives a new value for high of 30000. Before storing the new value of high, we need to decrement it, once again because of the implied digits appended to the integer value. So the new value of high is 29999.

The calculation of low follows the same path, with a resulting new value of 20000. So now high and low look like this:

HIGH: 29999 (999…)
LOW: 20000 (000…)

At this point, the most significant digits of high and low match. Due to the nature of our algorithm, high and low can continue to grow closer to one another without quite ever matching. This means that once they match in the most significant digit, that digit will never change. So we can now output that digit as the first digit of our encoded number. This is done by shifting both high and low left by one digit, and shifting in a 9 in the least significant digit of high. The equivalent operations are performed in binary in the C implementation of this algorithm.

As this process continues, high and low are continually growing closer together, then shifting digits out into the coded word. The process for our “BILL GATES” message looks like this:

                        HIGH    LOW    RANGE   CUMULATIVE OUTPUT

Initial state           99999  00000   100000
Encode B (0.2-0.3)      29999  20000
Shift out 2             99999  00000   100000    .2
Encode I (0.5-0.6)      59999  50000             .2
Shift out 5             99999  00000   100000    .25
Encode L (0.6-0.8)      79999  60000   20000     .25
Encode L (0.6-0.8)      75999  72000             .25
Shift out 7             59999  20000   40000     .257
Encode SPACE (0.0-0.1)  23999  20000             .257
Shift out 2             39999  00000   40000     .2572
Encode G (0.4-0.5)      19999  16000             .2572
Shift out 1             99999  60000   40000     .25721
Encode A (0.1-0.2)      67999  64000             .25721
Shift out 6             79999  40000   40000     .257216
Encode T (0.9-1.0)      79999  76000             .257216
Shift out 7             99999  60000   40000     .2572167
Encode E (0.3-0.4)      75999  72000             .2572167
Shift out 7             59999  20000   40000     .25721677
Encode S (0.8-0.9)      55999  52000             .25721677
Shift out 5             59999  20000             .257216775
Shift out 2                                      .2572167752
Shift out 0                                      .25721677520

Note that after all the letters have been accounted for, two extra digits need to be shifted out of either the high or low value to finish up the output word.

A Complication

This scheme works well for incrementally encoding a message. There is enough accuracy retained during the double precision integer calculations to ensure that the message is accurately encoded. However, there is potential for a loss of precision under certain circumstances.

In the event that the encoded word has a string of 0s or 9s in it, the high and low values will slowly converge on a value, but may not see their most significant digits match immediately. For example, high and low may look like this:

High: 700004
Low: 699995

At this point, the calculated range is going to be only a single digit long, which means the output word will not have enough precision to be accurately encoded. Even worse, after a few more iterations, high and low could look like this:

High: 70000
Low: 69999

At this point, the values are permanently stuck. The range between high and low has become so small that any calculation will always return the same values. But, since the most significant digits of both words are not equal, the algorithm can’t output the digit and shift. It seems like an impasse.

The way to defeat this underflow problem is to prevent things from ever getting this bad. The original algorithm said something like “If the most significant digit of high and low match, shift it out”. If the two digits don’t match, but are now on adjacent numbers, a second test needs to be applied. If High and low are one apart, we then test to see if the 2nd most significant digit in high is a 0, and the 2nd digit in low is a 9. If so, it means we are on the road to underflow, and need to take action.

When underflow rears its ugly head, we head it off with a slightly different shift operation. Instead of shifting the most significant digit out of the word, we just delete the 2nd digits from high and low, and shift the rest of the digits left to fill up the space. The most significant digit stays in place. We then have to set an underflow counter to remember that we threw away a digit, and we aren’t quite sure whether it was going to end up as a 0 or a 9. The operation looks like this:

               Before    After
               ------    -----
High:          40344     43449
Low:           39810     38100
Underflow:     0         1

After every recalculation operation, if the most significant digits don’t match up, we can check for underflow digits again. If they are present, we shift them out and increment the counter.

When the most significant digits do finally converge to a single value, we first output that value. Then, we output all of the “underflow” digits that were previously discarded. The underflow digits will be all 9s or 0s, depending on whether High and Low converged to the higher or lower value. In the C implementation of this algorithm, the underflow counter keeps track of how many ones or zeros to put out.

Decoding

In the “ideal” decoding process, we had the entire input number to work with. So the algorithm has us do things like “divide the encoded number by the symbol probability.” In practice, we can’t perform an operation like that on a number that could be billions of bytes long. Just like the encoding process, the decoder can operate using 16 and 32 bit integers for calculations.

Instead of maintaining two numbers, high and low, the decoder has to maintain three integers. The first two, high and low, correspond exactly to the high and low values maintained by the encoder. The third number, code, contains the current bits being read in from the input bits stream. The code value will always lie in between the high and low values. As they come closer and closer to it, new shift operations will take place, and high and low will move back away from code.

The high and low values in the decoder correspond exactly to the high and low that the encoder was using. They will be updated after every symbol just as they were in the encoder, and should have exactly the same values. By performing the same comparison test on the upper digit of high and low, the decoder knows when it is time to shift a new digit into the incoming code. The same underflow tests are performed as well, in lockstep with the encoder.

In the ideal algorithm, it was possible to determine what the current encoded symbols was just by finding the symbol whose probabilities enclosed the present value of the code. In the integer math algorithm, things are somewhat more complicated. In this case, the probability scale is determined by the difference between high and low. So instead of the range being between 0.0 and 1.0, the range will be between two positive 16 bit integer counts. The current probability is determined by where the present code value falls along that range. If you were to divide (value-low) by (high-low+1), you would get the actual probability for the present symbol.

Where’s the Beef?

So far, this encoding process may look interesting, but it is not immediately obvious why it is an improvement over Huffman coding. The answer becomes clear when we examine a case where the probabilities are a little different. Lets examine the case where we have to encode the stream “AAAAAAA”, and the probability of “A” is known to be 0.90. This means there is a 90% chance that any incoming character will be the letter “A”. We set up our probability table so that the letter “A” occupies the range 0.0 to 0.9, and the End of message symbol occupies the 0.9 to 1.0 range. The encoding process then looks like this:

New Character   Low value        High Value
-------------   ---------        ----------
                0.0              1.0
    A           0.0              0.9
    A           0.0              0.81
    A           0.0              0.729
    A           0.0              0.6561
    A           0.0              0.59049
    A           0.0              0.531441
    A           0.0              0.4782969
   END          0.43046721       0.4782969

Now that we know what the range of low and high values are, all that remains is to pick a number to encode this message. The number “0.45″ will make this message uniquely decode to “AAAAAAA”. Those two decimal digits take slightly less than 7 bits to specify, which means that we have encoded eight symbols in less than 8 bits! An optimal Huffman coded message would have taken a minimum of 9 bits using this scheme.

To take this point even further, I set up a test that had only two symbols defined. The byte value ’0′ had a 16382/16383 probability, and an EOF symbol had a 1/16383 probability of occurring. I then created a test file which was filled with 100,000 ’0′s. After compression using this model, the output file was only 3 bytes long! The minimum size using Huffman coding would have been 12,501 bytes. This is obviously a contrived example, but it does illustrate the point that arithmetic coding is able to compress data at rates much better than 1 bit per byte when the symbol probabilities are right.

The Code

An encoding procedure written in C is shown in listing 1 and 2. It contains two parts, an initialization procedure, and the encoder routine itself. The code for the encoder as well as the decoder were first published in an article entitled Arithmetic Coding for Data Compression in the February 1987 issue of Communications of the ACM, by Ian H. Witten, Radford Neal, and John Cleary, and is being published here with the author’s permission. I have modified the code slightly so as to further isolate the modeling and coding portions of the program.

There are two major differences between the algorithms shown earlier and the code in listing 1-2. The first difference is in the way probabilities are transmitted. In the algorithms shown above, the probabilities were kept as a pair of floating point numbers on the 0.0 to 1.0 range. Each symbol had its own section of that range. In the programs that are shown here, a symbol has a slightly different definition. Instead of two floating point numbers, the symbol’s range is defined as two integers, which are counts along a scale. The scale is also included as part of the symbol definition. So, for the “BILL GATES” example, the letter L was defined previously as the high/low pair of 0.60 and 0.80. In the code being used here, “B” would be defined as the low and high counts of 6 and 8, with the symbol scale being 10. The reason it is done this way is because it fits in well with the way we keep statistics when modeling this type of data. The two methods are equivalent, it is just that keeping integer counts cuts down on the amount of work needing to be done. Note that the character actually owns the range up to but not including the high value.

The second difference in this algorithm is that all of the comparison and shifting operations are being done in base 2, rather than base 10. The illustrations given previously were done on base 10 numbers to make the algortihms a little more comprehensible. The algorithms work properly in base 10, but masking off digits and shifting in base 10 on most computers is difficult. So instead of comparing the top two digits, we now compare the top two bits.

There are two things missing that are needed in order to use the encoding and decoding algorithms. The first is a set of bit oriented input and output routines. These are shown in listing 3 and 4, and are presented here without comment. The modeling code is responsible for keeping track of the probabilities of each character, and performing two different transformations. During the encoding process, the modeling code has to take a character to be encoded and convert it to a probability range. The probability range is defined as a low count, a high count, and a total range. During the decoding process, the modeling code has to take a count derived from the input bit stream and convert it into a character for output.

Test Code

A short test program is shown Listing 3. It implements a compression/expansion program that uses the “BILL GATES” model discussed previously. Adding a few judiciously placed print statements will let you track exactly what is happening in this program.

BILL.C has its own very simple modeling unit, which has a set of fixed probabilities for each of the possible letters in the message. I added a new character, the null string terminator, so as to know when to stop decoding the message.

BILL.C encodes an arbitrarily defined input string, and writes it out to a file called TEST.CMP. It then decodes the message and prints it to the screen for verification. There are a few functions in BILL.C that are modeling functions, which haven’t been discussed yet. During the encoding process, a routine called convert_int_to_symbol is called. This routine gets a given input character, and converts it to a low count, high count, and scale using the current model. Since our model for BILL.C is a set of fixed probabilities, this just means looking up the probabilities in the table. Once those are defined, the encoder can be called.

During the decoding process, there are two functions associated with modeling. In order to determine what character is waiting to be decoded on the input stream, the model needs to be interrogated to determine what the present scale is. In our example, the scale (or range of counts) is fixed at 11, since there are always 11 counts in our model. This number is passed to the decoding routine, and it returns a count for the given symbol using that scale. Then, a modeling function called convert_symbol_to_int is called. It takes the given count, and determines what character matches the count. Finally, the decoder is called again to process that character out of the input stream.

Next Time

Once you successfully understand how to encode/decode symbols using arithmetic coding, you can then begin trying to create models that will give superior compression performance. Next month’s concluding article discusses some of the methods you can try for adaptive compression. A fairly simple statistical modeling program is able to provide compression superior to respected programs such as PKZIP or COMPRESS.

References

As I mentioned previously, the article in the June 1987 Communications of the ACM provides the definitive overview of arithmetic coding. Most of the article is reprinted in the book Text Compression, by Timothy C. Bell, John G. Cleary, and Ian H. Witten. This book provides an excellent overview of both statistical and dictionary based compression techniques. Another good book is Data Compression by James Storer.

  • Bell, Timothy C., Cleary, John G., Witten, Ian H, (1990) “Text Compression”, Prentice Hall, Englewood NJ
  • Nelson, Mark (1989) LZW Data Compression, Doctor Dobb’s Journal, October, pp 29-37
  • Storer, J.A., (1988) “Data Compression”, Computer Science Press, Rockville, MD
  • Witten, Ian H., Neal, Radford M., and Cleary, John G. (1987) Arithmetic Coding for Data Compression, Communications of the ACM, June, pp 520-540.
Listing 1 coder.h Constants, declarations, and prototypes needed to use the arithmetic coding routines. These declarations are for routines that need to interface with the arithmetic coding stuff in coder.c
Listing 2 coder.c This file contains the code needed to accomplish arithmetic coding of a symbol. All the routines in this module need to know in order to accomplish coding is what the probabilities and scales of the symbol counts are. This information is generally passed in a SYMBOL structure.
Listing 3 bitio.h This header file contains the function prototypes needed to use the bitstream i/o routines.
Listing 4 bitio.c This routine contains a set of bit oriented i/o routines used for arithmetic data compression.
Listing 5 bill.c

This short demonstration program will use arithmetic data compression to encode and then decode a string that only uses the letters out of the phrase “BILL GATES”. It uses a fixed, hardcoded table of probabilities.


Part 2 – Statistical Modeling Using Prediction by Partial Matching (PPM)

Note:This article was originally published in two parts. Both parts have been combined here into one post – at this point you are halfway through the combined article. Part 1 starts here.

The first article in this two part series explained in some detail exactly what arithmetic coding was. To summarize, arithmetic coding provides a way to encode symbols using an optimal number of bits. The number of bits used by each symbol is not necessarily an integer, as would be the case with Huffman coding. In order to use arithmetic coding to compress data, a model for the data is needed. The model needs to be able to do two things to effectively compress data:

  • The model needs to accurately predict the frequency/probability of symbols in the input data stream.
  • The symbol probabilities generated by the model need to deviate from a uniform distribution.

This article will discuss some ways to effectively accomplish these goals, leading to high performance compression.

Modeling

The need to accurately predict the probability of symbols in the input data is inherent in the nature of arithmetic coding. The principal of this type of coding is to reduce the number of bits needed to encode a character as its probability of appearance increases. So if the letter “e” represents 25% of the input data, it would only take 2 bits to code. If the letter “Z” represents only .1% of the input data, it might take 10 bits to code. If the model is not generating probabilities accurately, it might take 10 bits to represent “e” and 2 bits to represent “Z”, causing data expansion instead of compression.

The second condition is that the model needs to make predictions that deviate from a uniform distribution. The better the model is at making predictions that deviate from uniform, the better the compression ratios will be. For example, a model could be created that assigned all 256 possible symbols a uniform probability of 1/256. This model would create an output file that was exactly the same size as the input file, since every symbol would take exactly 8 bits to encode. Only by correctly finding probabilities that deviate from a normal distribution can the number of bits be reduced, leading to compression. The increased probabilities have to accurately reflect reality, of course, as prescribed by the first condition.

It may seem that the probability of a given symbol occurring in a data stream is fixed, but this is not quite true. Depending on the model being used, the probability of the character can change quite a bit. For example, when compressing a C program, the probability of a new-line character in the text might be 1/40. This probability could be determined by scanning the entire text and dividing the number of occurrences of the character by the total number of characters. But if we use a modeling technique that looks at a single previous character, the probabilities change. In that case, if the previous character was a “}”, the probability of a new-line character goes up to 1/2. This improved modeling technique leads to better compression, even though both models were generating accurate probabilities.

Finite Context Modeling and PPM

The type of modeling I will present in this article is a type of as finite-context modeling. This type of modeling is based on a very simple idea: the probabilities for each incoming symbol are calculated based on the context the symbol appears in. In all of the examples I will show here, the context consists of nothing more than the symbols that have previously been encountered. The order of the model refers to the number of previous symbols that make up the context.

The specific algorithm I am implementing is often called Predication by Partial Matching, or PPM. PPM programs generally make statistical predictions about upcoming characters based on finite-context models

The simplest finite-context model would be an order-0 model. This means that the probability of each symbol is independent of any previous symbols. In order to implement this model, all that is needed is a single table containing the frequency counts for each symbol that might be encountered in the input stream. For an order-1 model, you will be keeping track of 256 different tables of frequencies, since you need to keep a separate set of counts for each possible context. Likewise, an order-2 model needs to be able to handle 65,536 different tables of contexts.

Adaptive Modeling

It seems logical that as the order of the model increases, the compression ratios ought to improve as well. For example, the probability of the symbol “u” appearing in this article may only be 5%, but if the previous context character is “q”, the probability goes up to 95%. Being able to predict characters with high probability lowers the number of bits needed, and larger contexts ought to let us make better predictions.

Unfortunately, as the order of the model increases linearly, the memory consumed by the model increases exponentially. With an order 0 model, the space consumed by the statistics could be as small as 256 bytes. Once the order of the model increases to 2 or 3, even the most cleverly designed models are going to consume hundreds of kilobytes.

One conventional way of compressing data is to take a single pass over the symbols to be compressed to gather the statistics for the model. A second pass is then made to actually encode the data. The statistics are then usually prepended to the compressed data so the decoder will have a copy of the statistics to work with. This approach obviously is going to have serious problems if the statistics for the model consume more space than the data to be compressed.

The solution to this problem is to perform adaptive compression. In adaptive data compression, both the compressor and the decompressor start with the same model. The compressor encodes a symbol using the existing model, then updates the model to account for the new symbol. The decompressor likewise decodes a symbol using the existing model, then updates the model. As long as the algorithm to update the model operates identically for the compressor and the decompressor, the process can operate perfectly without the need to pass a statistics table from the compressor to the decompressor.

Adaptive data compression does have a slight disadvantage in that it starts compressing with less than optimal statistics. However, by subtracting the cost of transmitting the statistics with the compressed data, an adaptive algorithm will usually perform better than a fixed statistical model.

The place where adaptive compression really does suffer is in the cost of updating the model. When updating the count for a particular symbol using arithmetic coding, the update code has the potential cost of updating the cumulative counts for all of the other symbols as well, leading to code that on the average has to perform 128 arithmetic operations for every single symbol encoded or decoded.

Because of the high cost in both memory and CPU time, higher order adaptive models have only become practical in perhaps the last 10 years. It is somewhat ironic that as the cost of disk space and memory goes down, the cost of compressing the data stored there also goes down. As these costs continue to decline, we will be able to implement even more effective programs than are practical today.

A Simple Example

An example of an order-0 compression program is shown here in Listings 6 through 9. (Listings 1-5 are in last month’s article). COMP-1.C is the compression program driver, EXPAND-1.C is the expansion program driver. MODEL.H and MODEL-1.C comprise the modeling unit used to implement the order-0 statistical model.

The compression program itself is quite simple when using a fixed order model such as this one. The program just has to sit in a loop, reading in characters from the plain text file, converting the character to an arithmetic code, encoding the symbol, then updating the model.

The modeling code in MODEL-1.C is the key to understanding how this code works. In last month’s article, we saw how each character owned a probability range in the scale ranging from 0.0 to 1.0. In order to implement arithmetic coding using integer math, this ownership was restated to be a low count and a high count along a integer range from 0 to the maximum count. This is exactly how the statistics are kept in MODEL-1.C.

The counts for all of the possible symbols using MODEL-1 are kept in an integer array referred to as totals[]. For each symbol c, the low count is stored in totals[c] and the high count is in totals[c+1]. The total range for all the symbols is stored in totals[256]. These three counts are what need to be passed to the arithmetic encoder in order to encode a given symbol.

This makes the operation of looking up the counts for a symbol very straightforward. However, updating the totals[] array after a symbol is encoded or decoded is another matter. In order to update the cumulative counts for symbol c, the every count in totals[] from c up to 256 has to be incremented. This means that on the average, 128 increment operations have to be performed for every character encoded or decoded. For a simple demonstration program such as the ones that use MODEL-1.C, this is not a major problem, but a production program ought to be modified to be more efficient.

One way to cut back on the number of increment operations is to move the counts for the most frequently accessed symbols to the top of the array. This means that the model has to keep track of each symbol’s position in the totals[] array, but it has the positive side effect of reducing the number of increment operations by an order of magnitude. This is a relatively simple enhancement to make to this program. A very good example of a program that uses this technique has been published as part of the paper Arithmetic Coding for Data Compression in the June 1987 issue of Communications of the ACM. This paper by Ian H. Witten, Radford Neal, and John Cleary is an excellent source of information regarding arithmetic coding, with some sample C source code illustrating the text.

Performance

COMP-1 is a short program that needs very little memory. Because it is only maintaining order-0 statistics, it doesn’t compress particularly well when compared to commonly used compression programs. For example, when compressing C code, COMP-1 will usually compress the text to about 5 bits per byte. While this is not spectacular, it could be useful for implementations that need to run in an environment where memory is at a premium. If just a little more memory is available, maintaining a sorted list will increase the speed of compression and expansion, without affecting the compression ratio.

Improvements

COMP-1 makes an adequate demonstration program, but is probably not going to be too exciting to anyone. It will compress a little better than a comparable order-0 Huffman coding program, but nowhere near as well as a program such as ARC or PKZIP. In order to surpass the compression ratio of these programs, we need to start adding enhancements to the modeling code. The culmination of all enhancements will be found in COMP-2.C, EXPAND-2.C, and MODEL-2.C, which are used to build a compression program and an expansion program.

Highest-Order Modeling

The first enhancement to the model used here is to increase the order of the model. An order-0 model doesn’t take in to account any of the previous symbols from the text file when calculating the probabilities for the current symbol. By looking at the previous characters in the text file, or the context, we can more accurately predict the incoming symbols.

When we move up to an order-2 or order-3 model, one problem becomes apparent right away. A good example to demonstrate the use of a previous context is to try to predict which character comes after “REQ” in an word processing text file. The probability of the next letter being a ‘U’ ought to be at least 90%, letting us encode that symbol in less than one bit. But when we are using an adaptive model, it may be some time before we build up a large enough body of statistics to start predicting the “U” with a high probability.

The problem is that in a fixed order model, each character has to have a finite non-zero probability, so that it can be encoded if and when it appears. Since we don’t have any advance knowledge about what type of statistics our input text will have, we generally have to start by assigning each character an equal probability. This means that if we include the EOF symbol, every byte from -1 up to 255 has a 1/257 chance of appearing after the “REQ” symbols. Now, even if the letter ‘U’ follows this particular context 10 times in a row, its probability will only get built up to 11/267. The rest of the symbols are taking up valuable space in the probability table, when they in all likelihood will never be seen.

The solution to this is to set the initial probabilities of all of the symbols to 0 for a given context, and to have a method to fall-back to a different context when a previously unseen symbol occurs. This is done by emitting a special code, referred to as an Escape code. For the previous context of “REQ”, we could set the Escape code to a count of 1, and have all other symbols set to a count of 0. The first time the character ‘U’ followed “REQ”, we would have to emit an Escape code, followed by the code for ‘U’ in a different context. During the model update immediately following that, we could increment the count for ‘U’ in the “REQ” context to 1, so it now had a probability of 1/2. The next time it appeared, it would be encoded in only 1 bit, with the probability increasing and the number of bits decreasing with each appearance.

The obvious question then is: what do we use as our fallback context to go to after emitting an Escape code? In MODEL-2, if the order-3 context generates an Escape code, the next context tried is the order-2 context. This means that the first time the context “REQ” is used, and ‘U’ needs to be encoded, an Escape code is then generated. Following that, the MODEL-2 program drops back to an order-2 model, and tries to encode the character ‘U’ using the context “EQ”. This continues on down through the order-0 context. If the Escape code is still generated at order-0, we fall back to a special order (-1) context. The -1 context is set up at initialization to have a count of 1 for every possible symbol, and doesn’t ever get updated. So it is guaranteed to be able to encode every symbol.

Using this escape code technique means only a slight modification to the driver program. The COMP-2.C program now sits in a loop trying to encode its characters:

 do {
      escaped = convert_int_to_symbol( c, &s );
      encode_symbol( compressed_file, &s );
 } while ( escaped );

The modeling code is responsible for keeping track of what the current order is, and decrementing it whenever an escape is emitted. Even more complicated is the modeling module’s job of keeping track of which context table needs to be used for the current order.

Updating the Model

The use of the highest-order modeling algorithm requires that instead of keeping just one set of context tables for the highest order, we need to keep a full set of context tables for every order up to the highest order. This means that if we are doing order 2 modeling, there will be a single order-0 table, 256 order-1 tables, and 65535 order-2 tables. When a new character has been encoded or decoded, the modeling code needs to update one of these tables for each order. In the previous example, a ‘U’ following “REQ”, the modeling code would update the ‘U’ counter in the order-3 “REQ” table, the ‘U’ counter in the order-2 “EQ” table, the ‘U’ counter in the order-1 “Q” table, and the ‘U’ counter in the order-0 table.

The code to update all of these tables will look like this:

  for ( order = 0 ; order <= max_order ; order++ )
      update_model( order, symbol );

A slight modification to this algorithm will results in both faster updates as well as better compression. Instead of updating all of the different order models for the current context, we can instead update only those models that were actually used to encode the symbol. For example, if 'U' was encoded after "REQ" as ESCAPE, 'U', we would only increment the counter in the "REQ" and "EQ" models, since the count of 'U' was found in the "EQ" table. The "R" and "" tables would both be unaffected.

This modification of full updating is called update exclusion, as it excludes lower order models that weren't used from being updated. It will generally give a small but noticeable improvement in the compression ratio. Update exclusion works due to the fact that if symbols are showing up frequently in the higher order models, they won't be seen as often in the lower order models, which means we shouldn't increment their counters in the lower order models. The modified update code for update exclusion will look like this:

 for ( order=encoding_order ; order <= max_order ; order++ )
      update_model( order, symbol );

Escape Probabilities

When we first start encoding a text stream, it stands to reason that we will be emitting quite a few escape codes. Given this, the number of bits used to encode escape characters will probably have a large effect on the compression ratio achieved, particularly for smaller files. In MODEL-2A.C, we set the escape count to 1 and left it there, regardless of the state of the rest of the context table. This is equivalent to what Bell, Cleary and Witten call Method A. Method B, on the other hand, sets the count for the escape character to be the number of symbols presently defined for the context table. Thus, if eleven different characters have been seen so far, the escape symbol count will be set to eleven, regardless of what the counts are.

Both Method A and Method B seem to work fairly well. The code in MODEL-2.C can be easily modified to support either of these Methods. Probably the best thing about Methods A and B are that they are not computationally intensive. When using Method A, the addition of the Escape symbol to the table can be done so that takes almost no more work to update the table with the symbol than without it.

In MODEL-2.C, I have implemented a slightly more complicated Escape count calculation algorithm. This algorithm tries to take into account three different factors when calculating the escape probability. First, it would seem to make sense that as the number of symbols defined in the context table increases, the escape probability will decrease. This reaches its maximum when the table has all 256 symbols defined, making the escape probability 0.

Second, this algorithm tries to take into account a measure of randomness in the table. It calculates this by dividing the maximum count found in the table by the average count. The higher this ratio is, the less random the table will be. A good example would be in the "REQ" table. It may have only three symbols defined: 'U' with a count of 50, 'u' with a count of 10, and '.' with a count of 3. The ratio of 'U's count, 50, to the average, 21, is fairly high. This means the character 'U' is being predicted with a fairly high amount of accuracy, and the escape probability ought to be lower. In a table where the high count was 10, and the average was 8, things would seem to be a little more random, and the Escape probability should be higher.

The final factor to be taken into account when determining the escape probability is simply the raw count of how many symbols have been seen for the particular table. As the number of symbols increases, the predictability of the table should go up, making the Escape probability go down.

The formula I am using for calculating the number of counts for the Escape symbol is shown here:

     count = (256 - number of symbols seen)*number of symbols seen
     count = count /(256 * the highest symbol count)
     if count is less than 1
         count = 1

The missing variable in this equation is the raw symbol count. This is implicit in the calculation, because the escape probability is the escape count divided by the raw count. The raw count will automatically scale the Escape count to a probability.

Scoreboarding

When using a highest-order modeling technique, there is an additional enhancement that can be used to improve compression efficiency. When we first try to compress a symbol using the highest order context, we can either generate the code for the symbol or generate an Escape code. If we generate an Escape code, it means the symbol hadn't previously occurred in that context, so we had a count of 0. But we do gain some information about the symbol just by generating an Escape. We can look at the Escaped context and generate a list of symbols that did not match the symbol to be encoded. These symbols can then temporarily have their counts set to 0 when we calculate the probabilities for lower order models. The counts will be reset back to their permanent values after the encoding for the particular character is complete.

An example is shown below. If the present context is "HAC", and the next symbol is 'K', we will use the tables below to encode the K. Without using scoreboarding, the "HAC" context generates an Escape, with a probability of 1/6. The "AC" context generates an Escape, with a probability of 1/7. The "C" context generates an Escape with a probability of 1/40, and the "" context finally generates a 'K' with a probability of 1/73.

     ""            "C"            "AC"           "HAC"
  --------       --------       --------        -------
  ESC   1        ESC   1        ESC    1        ESC   1
  'K'   1        'H'   20       'C'    5        'E'   3
  'E'   40       'T'   11       'H'    2        'L'   1
  'I'   22       'L'   5                        'C'   1
  'A'   9        'A'   3

If we use scoreboarding to exclude counts of previously seen characters, we can make a significant improvement in these probabilities. The first encoding of an Escape from "HAC" isn't affected, since no characters have been seen before. However, the "AC" Escape code gets to eliminate the 'C' symbol from its calculations, resulting in a probability of 1/3. The "C" Escape code excludes the 'H' and the 'A' counts from its probabilities, increasing the probability from 1/40 to 1/17. And finally, the "" context gets to exclude the 'E' and 'A' counts, reducing that probability from 1/73 to 1/24. This results in a reduction in the number of bits required to encode the symbol from 14.9 to 12.9, a significant savings.

Keeping a symbol scoreboard will almost always result in some improvement in compression, and it will never make things worse. The major problem with scoreboarding is that the probability tables for all of the lower order contexts have to be recalculated every time the table is accessed. This results in a big increase in the CPU time required to encode text. Scoreboarding is left in MODEL-2.C in order to demonstrate the gains possible when compressing text using it.

Data Structures

All of the improvements to the basic statistical modeling assume that higher order modeling can actually be accomplished on the target computer. The problem with increasing the order is one of memory. For the order-0 model in MODEL-1, the cumulative totals table occupied 516 bytes of memory. If we were to use the same data structures for an order-1 model, the memory used would shoot up to 133 Kbytes, which is still probably an acceptable number. But going to order-2 will increase the RAM requirements for the statistics unit to 34 Mbytes! Since we would like to be able to try out orders even higher than 2, we need to redesign the data structures used to hold the counts.

In order to save memory space, we have to redesign the context statistics tables. In MODEL-1, the table is about as simple as can be, with each symbol being used as an index directly into a pair of counts. In the order-1 model, the appropriate context tables would be found by indexing once into an array of context tables, then indexing again to the symbol in question, something like this:

      low_count = totals[last_char][current_char];
      high_count = totals[last_char][current_char+1];
      range = totals[last_char][256];

This is convenient, but enormously wasteful. First of all, the full context space is allocated even for unused tables. Secondly, within the tables, space is allocated for all symbols, whether they have been seen or not. Both of these factors lead to the waste of enormous amounts of memory in higher order models.

The solution to the first problem, that of reserving space for unused contexts, is to organize the context tables as a tree. We can start by placing our order-0 context table at a known location, and then using it to contain pointers to order-1 context tables. The order-1 context tables will contain their own statistics, and pointers to order-2 context tables. This continues until the "leaves" of the context tree, which contain order-n tables, but no pointers to higher orders.

By using a tree structure, we can keep the unused pointer nodes set to NULL pointers until a context is seen. Once the context is seen, a table is created and added to the parent node of the tree.

The second problem is the creation of a table of 256 counts every time a new context is created. In reality, the highest order contexts will frequently only have a few symbols, so we can save a lot of space by only allocating space for symbols that have been seen for a particular context.

After implementing these changes, I have a set of data structures that looks like this:

typedef struct {
                unsigned char symbol;
                unsigned char counts;
               } STATS;

typedef struct {
                struct context *next;
               } LINKS;

typedef struct context {
                         int max_index;
                         STATS *stats;
                         LINKS *links;
                         struct context *lesser_context;
                       } CONTEXT;

The new CONTEXT structure has four major elements. The first is the counter, max_index. max_index tells how many symbols are presently defined for this particular context table. When a table is first created, it has no defined symbols, and max_index is -1. A completely defined table will have a max_index of 255. max_index tells how many elements are allocated for the arrays pointed to by *stats and *links. stats is an array of structures, each containing a symbol and a count for that symbol. If the CONTEXT table is not one of the highest order tables, it will also have a links array. For each symbol defined in the stats array, there will be a pointer to the next higher order CONTEXT table in the links table.

A sample of the CONTEXT table tree is shown in Figure 1. The table shown is the one that will have just been created after the input text "ABCABDABE" when keeping maximum order-3 statistics. Just 9 input symbols have already generated a fairly complicated data structure, but it is orders of magnitude smaller than one consisting of arrays of arrays.

    Order 0              Order 1             Order 2            Order 3
|------------------|------------------|-------------------|------------------|
|Context: ""       |Context: "A"      |Context: "AB"      |Context: "ABC"    |
| Lesser: NULL     | Lesser: ""       | Lesser: "B"       | Lesser: "BC"     |
| Symbol Count Link| Symbol Count Link| Symbol Count Link | Symbol Count Link|
|   A      3   "A" |   B      3   "AB"|   C      1   "ABC"|   A      1   NULL|
|   B      3   "B" |------------------|   D      1   "ABD"|------------------|
|   C      1   "C" |Context: "B"      |   E      1   "ABE"|Context: "BCA"    |
|   D      1   "D" | Lesser: ""       |-------------------| Lesser: "CA"     |
|   E      1   "E" | Symbol Count Link|Context: "BC"      | Symbol Count Link|
|------------------|   C      1   "BC"| Lesser: "C"       |   B      1   NULL|
                   |   D      1   "BD"| Symbol Count Link |------------------|
                   |   E      1   "BE"|   A      1   "BCA"|Context: "CAB"    |
                   |------------------|-------------------| Lesser: "AB"     |
                   |Context: "C"      |Context: "CA"      | Symbol Count Link|
                   | Lesser: ""       | Lesser: "A"       |   D      1   NULL|
                   | Symbol Count Link| Symbol Count Link |------------------|
                   |   A      1   "CA"|   B      1   "CAB"|Context: "ABD"    |
                   |------------------|-------------------| Lesser: "BD"     |
                   |Context: "D"      |Context: "BD"      | Symbol Count Link|
                   | Lesser: ""       | Lesser: "D"       |   A      1   NULL|
                   | Symbol Count Link| Symbol Count Link |------------------|
                   |   A      1   "DA"|   A      1   "BDA"|Context: "BDA"    |
                   |------------------|-------------------| Lesser: "DA"     |
                   |Context: "E"      |Context: "BE"      | Symbol Count Link|
                   | Lesser: ""       | Lesser: "E"       |   B      1   NULL|
                   | Symbol Count Link| Symbol Count Link |------------------|
                   |------------------|-------------------|Context: "DAB"    |
                                                          | Lesser: "AB"     |
                                                          | Symbol Count Link|
                                                          |   E      1   NULL|
                                                          |------------------|
                                                          |Context: "ABE"    |
                                                          | Lesser: "BE"     |
                                                          | Symbol Count Link|
                                                          |------------------|

"ABCABDABE"
Figure 1

The one element in this structure that I haven't explained is the lesser_context pointer. The need for this pointer becomes evident when using higher order models. If my modeling code is trying to locate an order 3 context table, it first has to scan through the order-0 symbol list looking for the first symbol match, then the order-1 symbol list, and so on. If the symbol lists in the lower orders are relatively full, this can be a lengthy process. What is even worse, every time an Escape is generated, the process has to be repeated when looking up the lower order context. These searches can consume an inordinate amount of CPU time.

The solution to this is to maintain a pointer for each table that points to the table for the context one order less. For example, the context table "ABC" should have its back pointer point to "BC", which should have a back pointer to "C", which should have a pointer to "", the null table. Then, the modeling code needs to always keep a pointer to the current highest order context. Given that, finding the order-i context table is simply a matter of performing (n-i) pointer operations.

For example, in the table shown in Figure 1, suppose the next incoming text symbol is 'X', and the current context is "ABE". Without the benefit of the lesser context pointers, I have to check the order-3, 2, 1, and 0 tables for the symbol 'X'. This takes a total of 15 symbol comparisons, and 3 table lookups. Using reverse pointers eliminates all the symbol comparisons, and just has me perform 3 table lookups.

The place where the work comes in maintaining back pointers is during the update of the model. When updating the Figure 1 context tree to contain an 'X' after "ABE", the modeling code has to perform a single set of lookups for each order/context. This code is shown in MODEL-2.C in the add_character_to_model() routine. Every time a new table is created, it needs to have its back pointer created properly at the same time, which requires a certain amount of care in the design of the update procedures.

The Finishing Touch: Tables -1 and -2

The final touch to the context tree in MODEL-2.C is the addition of two special tables. The order -1 table has been discussed previously. This is a table with a fixed probability for every symbol. In the event that a symbol is not found in any of the higher order models, we are guaranteed that it will show up in the order -1 model. This is the table of last resort. Since it has to guarantee that it will always be able to provide a code for ever symbol, we don't update this table, which means it uses a fixed probability for every symbol.

In addition, I have added a special order -2 table that is used for passing control information from the encoder to the decoder. For example, the encoder can pass a -1 to the decoder to indicate an EOF condition. Since the normal symbols are always defined as unsigned values ranging from 0 to 255, the modeling code recognizes a negative number as a special symbol that will generate escapes all the way back to the order -2 table. The modeling code can also detect that since it is a negative number, when the update_model() code is called, the symbol should just be ignored.

Model Flushing

The creation of the order -2 model lets me pass a second control code from the encoder to the expander. This is the Flush code, which tells the decoder to flush statistics out of the model. I perform this operation when the performance of the compressor starts to slip. The ratio is adjustable, but I have been using 90% as the worst compression ratio I will tolerate. When this ratio is exceeded, I flush the model by dividing all counts by two. This will give more weight to newer statistics, which ought to help improve the compression performance of the model.

In reality the model should probably be flushed whenever the input symbol stream drastically changes character. For example, if the program is compressing an executable file, the statistics accumulated during the compression of the executable code are probably of no value when compressing the program's data. Unfortunately, it isn't easy to define an algorithm that detects a "change in the nature" of the input.

Implementation

Even with the somewhat byzantine data structures used here, the compression and expansion programs built around MODEL-2.C have prodigious memory requirements. When running on DOS machines limited to 640K, these programs have to be limited to order 1, or perhaps order 2 for text that has a higher redundancy ratio.

In order to examine compression ratios for higher orders on binary files, there are three choices for these programs. First, they can be compiled using Zortech C and use EMS __handle pointers. Second, they can be built using a DOS Extender, such as Rational Systems 16/M. Third, they can be built on a machine that supports virtual memory, such as VMS. The code distributed here was written in an attempt to be portable across all three of these options.

I found that with an extra megabyte of EMS, I could compress virtually any ASCII file on my PC using order 3 compression. Some binary files require more memory. My Xenix system had no problem with order 3 compression, and turned in the best performance overall in terms of speed. I had mixed results with DOS Extenders. For unknown reasons, my tests with Lattice C 286 and Microsoft C + DOS 16/M ran much slower than Zortech's EMS code, when logic would indicate that the opposite should be true. This was not an optimization problem either, since the Lattice and Microsoft implementations ran faster than Zortech's when inside 640K. Executables built with the Lattice C/286 and the Zortech 2.06 code are available on the DDJ listing service. The Rational Systems 16/M license agreement requires royalty payments, so that code cannot be distributed.

Testing and Comparing: CHURN

In order to test compression programs, I have created a general purpose test program called CHURN. CHURN simply churns through a directory and all of its subdirectories, compressing and then expanding all the files it finds there. After each compression/expansion, the original input and the final output are compared to ensure that they are identical. The compression statistics for both time and space are added to the cumulative totals, and the program continues. After the program is complete, it prints out the cumulative statistics for the run so they can be examined. (CHURN and a Unix variant, CHURNX, are not printed here for reasons of space. Both are available from the DDJ listing service).

CHURN is called with a single parameter, the directory name containing the data to be tested. I generally direct the output of CHURN to a log file so it can be analyzed later, like this:

     churn c:\textdata > textdtat.log

The compression programs write some information to stderr, so this information can be displayed on the screen while the compression is running. Note that as various different programs are tested, CHURN.C frequently has to be modified in terms of what spawn() or system() calls it makes.

Shown below in Table 1 are the results returned when testing various compression programs against two different bodies of input. The first sample, TEXTDATA, is roughly 1 megabyte of text-only files, consisting of source code, word processing files, and documentation. The second sample, BINDATA, is roughly a megabyte of randomly selected files, containing executables, database files, binary images, etc.

I ran complete tests on both of these data sets with seven different compression programs. MODEL-1 is a the simple order-0 model discussed at the beginning of the article. MODEL-1A implements an order-1 algorithm without escape codes. MODEL-2A implements a highest-context order-1 algorithm, which will generate escape codes for previously unseen symbols. Finally, MODEL-2 is a highest-context order-3 model which implements all of the improvements discussed in this article.

For comparison, three dictionary based coding schemes were also run on the datasets. COMPRESS is a 16 bit LZW implemention which is in widespread use on Unix systems. The PC implementation that uses 16 bits takes up about 500K of RAM. LZHUF is is an LZSS program with an adaptive Huffman coding stage written by Haruyasu Yoshikazi, later modified and posted on Fidonet by Paul Edwards. This is essentially the same compression used in the LHARC program. Finally, the commercial product PKZIP 1.10 by PKWare, (Glendale, Wisconsin), was also tested on the datasets. PKZIP uses a proprietary dictionary based scheme which is discussed in the program's documentation.

The results show that the fully optimized MODEL-2 algorithm provides superior compression performance on the datasets tested here, and in particular performs extremely well on text based data. Binary data seems to not be as well suited to statistical analysis, where the dictionary based schemes turned in performances nearly identical to MODEL-2.

TEXTDATA - Total input bytes:    987070    Text data
Model-1   Fixed context order-0.          548218 bytes    4.44 bits/byte
Model-1A  Fixed context order-1.          437277 bytes    3.54 bits/byte
Model-2A  Highest context, max order-1.   381703 bytes    3.09 bits/byte
COMPRESS  Unix 16 bit LZW implementation. 351446 bytes    2.85 bits/byte
LZHUF     Sliding window dictionary       313541 bytes    2.54 bits/byte
PKZIP     Proprietary dictionary based.   292232 bytes    2.37 bits/byte
Model-2   Highest context, max order-3.   239327 bytes    1.94 bits/byte

BINDATA - Total input bytes:   989917     Binary data
Model-1   Fixed context order-0.          736785 bytes    5.95 bits/byte
COMPRESS  Unix 16 bit LZW implementation. 662692 bytes    5.35 bits/byte
Model-1A  Fixed context order-1.          660260 bytes    5.34 bits/byte
Model-2A  Highest context, max order-1.   641161 bytes    5.18 bits/byte
PKZIP     Proprietary dictionary based.   503827 bytes    4.06 bits/byte
LZHUF     Sliding window dictionary       503224 bytes    4.06 bits/byte
Model-2   Highest context, max order-3.   500055 bytes    4.03 bits/byte

Conclusions

In terms of compression ratios, these tests show that statistical modeling can perform at least as well as dictionary based methods. However, these programs at present are somewhat impractical due to the high resource requirements they have. MODEL-2 is fairly slow, compressing data with speeds in the range of 1 Kbyte per second, and needs huge amounts of memory to use higher order modeling. However, as memory becomes cheaper and processors become more powerful, schemes such as the ones shown here may become practical. They could be applied today to circumstances where either storage or transmission costs are extremely high.

Order 0 adaptive modeling using arithmetic coding could be usefully applied today to situations requiring extremely low consumption of memory. The compression ratios might not be as good as more sophisticated models, but the memory consumption is minimized.

Enhancements

The performance of these algorithms could be improved significantly beyond the implementations discussed here. The first area for improvement would be in memory management. Right now, when the programs run out of memory, they abort. A more sensible approach would be to start with fixed amount of memory available for statistics. When the statistics fill up the space, the program would then stop updating the tables, and just use what it had. This would mean implementing internal memory management routines instead of using the C run time library routines.

Another potential improvement would be in the tree structure for the context tables. Locating tables through the use of hashing could be quite a bit faster, and might require less memory. The context tables themselves could also be improved. When a table reaches the point where it has over 50% of the potential symbols defined for it, an alternate data structure could be used, with the counts stored in a linear array. This would allow for faster indexing, and would reduce memory requirements.

Finally, it might be valuable to try ways to adaptively modify the order of the model being used. When compressing using order-3 statistics, early portions of the text input generate a lot of escapes while the statistics tables fill up. It ought to be possible to start encoding using order-0 statistics, while keeping order-3 statistics. As the table fills up, the order used for encoding could be incremented until it reaches the maximum.

Listing 6 comp-1.c The driver program for an order 0 fixed context compression program. It follows the model shown in BILL.C for a compression program, by reading in symbols from a file,converting them to a high, low, range set, then encoding them.
Listing 7 expand-1.c The driver program for the decompression routine using the order 0 fixed context model.
Listing 8 model.h The function prototypes and external variable declarations needed to interface with the modeling code found in model-1.c or model-2.c.
Listing 9 model-1.c The modeling module for an order 0 fixed context data compression program.
Listing 10 comp-2.c This module is the driver program for a variable order finite context compression program. The maximum order is determined by command line option. This particular version also monitors compression ratios, and flushes the model whenever the local (last 256 symbols) compression ratio hits 90% or higher.
Listing 11 expand-2.c This module is the driver program for a variabler order finite context expansion program. The maximum order is determined by command line option. This particular version can respond to the FLUSH code inserted in the bit stream by the compressor.
Listing 12 model-2.c This module contains all of the modeling functions used with comp-2.c and expand-2.c. This modeling unit keeps track of all contexts from 0 up to max_order, which defaults to 3.
Listing 13 churn.c This is a utility program used to test compression/decompression programs for accuracy, speed, and compression ratios. Calling CHURN.EXE with a single argument will cause CHURN to compress then decompress every file in the specified directory tree, checking the compression ratio and the accuracy of the operation. This is a good program to run overnight when you think your new algorithm works properly.
Listing 14 churnx.c A version of the churn program that works with UNIX systems.