LZW compression has its roots in the work of Jacob Ziv and Abraham Lempel. In 1977, they published a paper on "sliding-window" compression, and followed it with another paper in 1978 on "dictionary" based compression. These algorithms were named LZ77 and LZ78, respectively. Then in 1984, Terry Welch made a modification to LZ78 which became very popular and was dubbed LZW (guess why). The LZW algorithm is what we are going to talk about here.
Many files, especially text files, have certain strings that repeat very often, for example " the ". With the spaces, the string takes 5 bytes, or 40 bits to encode. But what if we were to add the whole string to the list of characters after the last one, at 256. Then every time we came across " the ", we could send the code 256 instead of 32,116,104,101,32. This would take 9 bits instead of 40 (since 256 does not fit into 8 bits).
This is exactly the approach that LZW compression takes. It starts with a "dictionary" of all the single character with indexes 0..255. It then starts to expand the dictionary as information gets sent through. Pretty soon, redundant strings will be coded as a single bit, and compression has occured.
Ok, so how is this done? Here is the basic algorithm:
So what happens here? The program reads one character at a time. If the code is in the dictionary, then it adds the character to the current work string, and waits for the next one. This occurs on the first character as well. If the work string is not in the dictionary, (such as when the second character comes across), it adds the work string to the dictionary and sends over the wire the works string without the new character. It then sets the work string to the new character.
How about decompression?
The nice thing is that the decompressor builds its own dictionary on its side, that matches exactly the compressor's, so that only the codes need to be sent.
If you want more detail, there is an article on LZW Compression by Mark Nelson, the author of the Data Compression Book.
To help demonstrate exactly what happens in the LZW algorithm, I wrote an Interactive LZW Compressor/Decompressor. Simply type in a string into the input and the applet will build the dictionaries and send the string across.
The screen is split into two parts. On the left is the compression and on the right us the decompression. The only communication that the two have is through the codes that run at the bottom of the screen. Both sides build a dictionary that you can see growing. The program constantly monitors the level of compression.
To see it working well, try a very reduntant string such as:
Some notes: the "base" dictionary are the ASCII characters from 32-126, because they are the displayable ones. However, the codes are one bit longer than the raw data because usually the base dictionary is 256 characters, and the additional entries are added at 256.
If you would like the source code, the .java files are in lzw.zip. I warn you, I haven't cleaned up and properly commented the code, so its pretty ugly.