|Dr. Dobb's Journal January, 1996|
This page contains my original text and figures for the article that appeared in the January 1996 DDJ. Please download the new source code if you want to work with Visual C++ 6.0.
When I wrote the chapter on Huffman coding for my Data Compression book, my sample programs emphasized clarity at the expense of efficiency. I felt it was much more important for my code to be easy to understand than it was to be fast or tight.
Were I writing that book today, I wouldn’t have to wrestle with this choice. The addition of the Standard Template Library to the ANSI/ISO C++ standard provided me with a tool that is perfect for the creation of Huffman coding trees: the priority_queue container adaptor. The STL lets me have the best of both worlds. I can write easy to read code that runs with optimal efficiency.
This article will first show you just what a priority queue is. Next, you will see how the STL implements a priority queue using its heap algorithms. Finally, I’ll show you some sample code that implements a Huffman encoder using the STL priority queue.
What is a priority queue?
When you read your first book on data structures, you undoubtedly ran into a container type known as a queue. Another name for the queue, FIFO, describes the container by the way it works: First In, First Out. Figure 1 shows a representation of the way a queue normally processes data.
Figure 1 — A FIFO Queue
The standard interface to the FIFO queue simply consists of a
push() function, which adds new
elements, and a
pop() function, which always removes the oldest element.
This data structure is relatively easy to implement. However, there are many times when a more sophisticated form of queue is needed. The one this article is concerned with is the priority queue.
Figure 2 — A Priority Queue
Figure 2 shows a representation of a priority queue. This type of queue assigns a priority to
every element that it stores. New elements are added to the queue using the
just as with a FIFO queue. This queue also has a
pop() function, which differs from the FIFO
pop() in one key area. When you call
pop() for the priority queue, you don’t get the
oldest element in the queue. Instead, you get the element with the highest priority.
The priority queue obviously fits in well with certain types of tasks. For example, the scheduler in an operating system might use a priority queue to track processes running in the operating system. In this article, I’ll use a priority queue to build a Huffman encoding tree.
Implementation with a heap
There are a few brute force ways to implement a priority queue. In my inefficient Huffman encoder,
I kept the elements in the queue in an unsorted list, then searched the entire list for the
highest priority item when it came time to
pop(). As an alternative, you could perform a sorted
insertion when the
push() function is called, so that
pop() has an easy job of it. This
could be accomplished using some sort of sorted tree.
In the case of the priority queue, there is an approach that is superior to either of these methods. A data structure known as a heap can be used to keep the queue elements in a partially-sorted order. Insertions and extractions can then be done in log(N) time, with virtually no extra memory overhead.
A heap has the following key characteristics:
- The elements of a heap are stored in a contiguous array. If the heap holds N elements, they will be stored at locations 1 through N of the array. (Please note that this convention of arrays starting at 1 and running through N is used in this article, mostly for notational convenience.)
- The elements in the array are also part of an implicit binary tree. Every element X (except the root node) has a parent at location X/2.
- Each node has a priority that is greater then or equal to both of its children.
Figure 3 shows what a typical heap might look like. The important thing to note about a heap is that is not a completely ordered tree. Even though parent nodes are always greater than children, there is no ordering among siblings. Notice for example that node 10 has a higher priority (39), than node 3, which is two levels higher up in the tree.
Figure 3 — A Heap
Even though a heap is not completely sorted, it has one very useful characteristic: the node with
the highest priority will always be at the top of the tree. This makes it suitable for a
priority queue, since the
pop() operation only requires the highest element. Combining this
characteristic with the negligible overhead and fast insertions and removals makes the heap an
ideal data structure for implementation of a priority queue.
The STL heap functions
The STL has four template functions that work with heaps:
||Takes an existing sequence of elements and rearranges it to form a heap.|
||Adds a new element to an existing heap.|
||Removes the highest priority value from a heap.|
||Takes a heap and turns it into a sorted sequence.|
All four of these template functions are parameterized by two different types:
class RandomAccessIterator and
class Compare. The iterator class simply defines the iterator
type used to point to the elements in the heap. The iterator type will usually be a type defined
by the container the heap is stored in. If the heap is stored in a built-in C/C++ array, the
iterator will be a standard pointer type.
Compare class is used to define an STL comparison object that is used to order the elements
in the heap. The STL gives us maximum flexibility when it comes to determining how heap elements
are compared. The comparison class defaults to
less<T>, which is a predefined STL comparison
less<T> simply compares two objects using the less than relational
The STL uses only three of these heap functions to implement a priority queue.
used to create a heap out of unordered data.
pop_heap() are used to insert
and remove individual items from the heap. I’m only going to look at
pop_heap() in this article.
make_heap() uses techniques very similar to those of
push_heap() to create an initial heap, so understanding the way the latter works should make the
former easy to follow.
The first two arguments to this function are iterators that define the start and end of the
sequence of elements that make up the heap. (In canonical STL fashion,
first points to the
first element in the sequence, and
last points one past the last element.) The third
parameter is the optional comparison object that is used to order the elements in the heap.
push_heap() isn’t organized the way you might expect. The algorithms supplied with the STL in
general work on sequences of elements, and don’t perform any operations that require knowledge of
how to add or remove elements from containers. (This is one reason why so many of the algorithms
work with built-in arrays.)
Given this restriction,
push_heap() operates a little differently than you might expect. If
you have an existing heap with N elements number 1 through N,
push_heap() requires you to place
a new element at location N+1 before calling
push_heap(). This has the effect of creating a
heap that has a new element in the last position.
push_heap() with the heap in this configuration. All
push_heap() has to do at that
point is correct the heap to take into account this new element. This is done by repeatedly
swapping the new node with its parent node until it reaches a position where its parent is
greater than it. The procedure for doing this can be described with code something like this:
Figure 4 — Inserting a new node
Figure 4 graphically illustrates how this procedure works. The new node, with a value of 55, is
added to the heap at the next available position.
push_heap() is called, which then begins
moving the node up the heap. The initial parent of the new node has a value of 49, which is less
than 55. The two elements are swapped, and the comparison loop repeats.
After the first swap, the new node is in a valid position in the heap. It is greater than all of its children, and less than all of its parents. This means it is in the correct spot.
If the new node had a value greater than 60, a second swap would have been performed, and the new node would then appear at the root of the heap.
Note also that every time I talk about a comparison being performed, I refer to it as testing for
greater than or less than. This doesn’t mean the comparison has to be performed with
operator<(). It could just as easily be done with the comparison object passed as an optional
pop_heap() requires that you work using a set of idiosyncratic conventions similar to those for
pop_heap() doesn’t want to have to worry about changing the size of your
container. What it does instead is to move the current root node of the heap to position N.
pop_heap() then adjusts the heap to be correct for positions 1 through N-1, and control passes
back to the calling function.
This means that after you call
pop_heap(), you should expect a couple of things. First, the
size of the heap is now N-1 instead of N. However, there is still a single element in the last
position of your array. (This can be removed from STL containers using the container’s
pop_back() function.) Second, if the heap still contains any data, the highest value is now
contained in the root node of the heap.
The actual mechanics of this operation are similar to those for
pop_heap(). First, the root
node is swapped with the least node in the heap. Then, the new root node is moved down
the tree, swapping the greater of its two children, until it is greater than both of its children.
You might write code like this to implement this function:
This code is slightly more complicated than that of
push_heap(), but you can clearly see that
it is performing the reverse operation of
push_heap(). It moves the least element to the root
of the heap, then loops while checking to see if the root node is less than one of its children.
If it is, the node is swapped with its largest child, and the process repeats.
Looking at both of these functions, you can see why inserting and removing items from the heap is a lg(N) operation. Regardless of what sort of number is inserted or extracted, the adjustment process will only have to go from the root of the tree to its lowest level, or vice versa. Because of the way the tree is designed for the heap, this will never take more than lg(N) swaps operations.
The priority_queue container adaptor
The STL implements priority queues using something called a container adaptor. A container
adaptor is a simple wrapper class. It uses the framework of an existing container to implement a
new container type. The STL has three container adaptor types:
priority_queue class uses an underlying container which can either be an
deque. In most cases, a vector object is probably the best choice.
When you create a
priority_queue, you have to specify the container type, and optionally a
comparison object type. Typical declarations look like this:
The priority_queue adaptor presents you with a very limited interface. It has a pair of
constructors, one which creates an empty queue, and one which loads a sequence of initial values
into the queue. The copy constructor, assignment operator, and destructor are all implicitly
defined by the compiler. The constructors and destructors take care of initializing two
protected data members:
Container c and the comparison object,
In addition, the container offers up just five member functions. (Six, if you count the
mutable versions of
top() as two separate functions.) Like the
itself, these member functions amount to nothing more than wrappers around container member
functions. The Hewlett Packard STL release defines these functions as shown below:
These member functions are all nearly self-explanatory.
size() are used to
determine the number of elements in the heap at a give time.
top() returns a reference to the
top element in the heap, which should be the element with the highest priority.
pop() are used to insert and remove elements from the queue.
pop() both deal with the idiosyncratic behavior of the STL heap
push() first adds the new element to the end of the container that holds the heap.
It then calls the STL
push_heap() function to walk the new element up to its proper place.
pop() first calls the STL
pop_heap() function to move the top element to the end of the
container. A subsequent call to
pop_back() removes that element and shrinks the container.
As a programmer, I feel the most exciting thing about this container adaptor is the relative ease with which it was built around an existing container class. It would take hundreds of lines of code to create a new container class from scratch. But this example shows that you can hijack existing container capabilities to define new types without even breaking a sweat.
An example: Huffman coding
At the start of this article, I mentioned that priority queue containers would be ideal for developing Huffman coding trees. A Huffman coding tree is a binary tree made up of internal nodes and leaf nodes. Coding starts at the root, and moves down the tree, issuing 0s and 1s until a leaf node is reached. Leaf nodes signify that a character has been completely encoded, while internal nodes are a stop along the way. Each node has two children, one designated with a 0 bit, and one designated with a 1 bit. To encode a particular character, you take the path down the tree that leads to to the target leaf.
Building the Hufman encoding tree is done by first creating a list of unattached leaf nodes, each having an internal weight proportional to the frequency of the character they are encoding. The encoding process then executes a simple loop that combines nodes until there is only one root node. That root node is then the starting point for the Huffman encoder.
Each pass through the loop you have to identify the two available nodes with the lowest weights. Those two nodes are then removed from the pool of available nodes (the priority queue.) A new internal node is then created, and it is assigned those two nodes as children. This internal node is then entered into the priority queue.
Figure 5 — A Huffman encoding tree with the node weight circled
Figure 5 shows the tree that results from processing the following input data:
AAAAAAAAAAAAAA BBB C D E F GGG HHHH
This data is found in INPUT.DAT, the file used to exercise the test program, PQHUFF.CPP. As you can see, the nodes with the lowest frequency counts are found the farthest down the tree, meaning they will take more bits to encode. High frequency characters, such as the ‘A’, will take fewer bits to encode. The result? Data compression.
The test program PQHUFF.CPP is shown in Listing 1. PQHUFF.CPP builds the Huffman encoding tree
using a priority queue filled with objects of class
node. The node objects are simply C++
representations of the nodes shown in Figure 5 above. Each node has a weight, which determines
its place in the eventual Huffman coding tree.
The utility of the priority queue finally shows up in
main() of PQHUFF.CPP. After counting all
the characters in the input file, and adding the initial count values to priority queue, the loop
that builds the encoding tree is deceptively simple:
In the loop, the two lowest weight nodes are popped from the priority queue. A new node that has those two nodes for children is then created, and inserted into the queue. The process continues until there is just one node left in the queue, which is then the root node for the encoding tree.
The output from PQHUFF.EXE for the data shown above is:
Char Symbol Code A 14 0 B 3 100 G 3 101 H 4 110 C 1 11100 F 1 11101 D 1 11110 E 1 11111
The STL template class priority_queue provides an efficient encapsulation of a priority queue. The details of both heap management and container support are provided in a fairly transparent manner.
Having this class available as part of the C++ standard library gives programmers a potent new tool to bring to bear for a variety of problems. I find that it makes my Huffman table creation much more efficient, while maintaining the readability I like.
AAAAAAAAAAAAAA BBB C D E F GGG HHHH