There are a few shortcomings to the straight Huffman compression. First of all, you need to send the Huffman tree at the beginning of the compressed file, or the decompressor will not be able to decode it. This can cause some overhead.
Also, Huffman compression looks at the statistics of the whole file, so that if a part of the code uses a character more heavily, it will not adjust during that section. Not to mention the fact that sometimes the whole file is not available to get the counts from (such as in live information).
The solution to all of these problems is to use an Adaptive method.
The basic concept behind an adaptive compression algorithm is very simple:
Decompression works the same way. As long as both sides have the same initialize and update model algorithms, they will have the same information.
The problem is how to update the model. To make Huffman compression adaptive, we could just re-make the Huffman tree every time a character is sent, but that would cause an extermely slow algorithm. The trick is to only update the part of the tree that is affected.
The Huffman tree is initialized with a single node, known as the Not-Yet-Transmitted (NYT) or escape code. This code will be sent every time that a new character, which is not in the tree, is ecountered, followed by the ASCII encoding of the character. This allows for the decompressor to destinguish between a code and a new character. Also, the procedure creates a new node for the character and a new NYT from the old NYT node.
Whenever a character that is already in the tree is encountered, the code is sent and the weight is increased.
In order to for this algorithm to work, we need to add some additional information to the Huffman tree. In addition to each node having a weight, it will now also be assigned a unique node number. Also, all the nodes that have the same weight are said to be in the same block.
These node numbers will be assigned in such a way that:
This is known as the sibling property, and the update algorithm simply swaps nodes to make sure that this property is upheld. Obviously, the root node will have the highest node number because it has the highest weight.
Here is a flowchart of the update procedure, as taken from Introduction to Data Compression by by Sayood Khalid:
After a count is increased, the update procedure moves up the tree and inspects the ancestors of the node one at a time. It checks to make sure that the node have the highest node in its block, and if not, swaps it with the highest node number. It then increases the node weight and goes to the parent. It continues until it reaches the root node.
As you will see, this assures that the nodes with the highest weight are closer to the top and have shorter codes.
I realize that the update procedure is a little hard to understand, so I made this applet to demonstrate exactly what happens. The appplet creates and updates the tree while interactively sending over the characters. It also monitors the compression ratio of the string so far.
You can set the applet to Play-Pause or Continuous mode. In Play-Pause, a "Next Step" button pops up every time something is going to happen. You need to press this button to continue. In Continuous mode, the applet only pauses for 1 second in between steps.
You can run the applet with a 10-point font or an 18-point type. The 18 point type is easier to read, but will not fit as large of a tree.
Try and follow the udpate flowchart while putting in a string like "aabbb" and watch the swap at the end.
The source code to this applet is available in AHuffman.zip