Data Compression (Ch 7)

Huffman Tutorial

This expands on the Shortest Symbol trick from Nine Algorithms and introduces a concrete algorithm for coming up with a shortest symbol code called Huffman encoding.

Say we have the message “abcabcdeab”. If we do a count of the letters, we get 3 a’s, 3 b’s, 2 c’s, 1 d and 1 e. ../_images/he6.png
We take the least common pair and join them. The new circle has a value of 2 because it represents the combined d and e (1 of each). ../_images/he5.png
Now we do it again to join the smallest two ungrouped circles together. The new circle has a 4 because it represents 2 c’s, and the 2 d’s and e’s (1 d of each). ../_images/he4.png
Now we want to do the join trick again with the smallest two remaining circles. The two smallest circles are the a and b (3 instances of each). So we will join them into a group of 6.

You always merge the smallest circles. Do not just work from right to left!
../_images/he3.png
Finally, we can merge the two remaining circles into one final circle. Once we get down to one circle we are ready to make the code. ../_images/he2.png
Think of each circle as choosing between two branches. Put a 0 on the left branch and a 1 on the right.

To get the code for a letter, read off the 0’s and 1’s along the branches you take to get from the top circle to the given letter.
‘a’ would be coded as 00
’d’ would be coded as 110
../_images/he1.png

The full table of codes is shown in the table below. Note that the process of building the code resulted in two important features: 1) The most common symbols have the shortest codes. 2) No code is a prefix of another code.

a 00
b 01
c 10
d 110
e 111

Using the code, the sequence aabcea is encoded as “00 00 01 10 111 00”. But because of the way we designed the code, we don’t even need the spaces. We could decode “0000011011100” by working our way from left to right. as shown below:

Code Decoded Notes
0000011011100   Start
0000011011100   Is 0 a code? No, continue…
0000011011100 a Is 00 a code? Yes, it represents a. We have used up the 00.
0000011011100 a Is 0 a code? No, continue…
0000011011100 aa Is 00 a code? Yes, it is another a. We have used up that 00.
0000011011100 aa Is 0 a code? No, continue…
0000011011100 aab Is 01 a code? Yes, it represents b. We have used up the 01.
0000011011100 aab Is 1 a code? No, continue…
0000011011100 aabc Is 10 a code? Yes, it represents c. We have used up the 10.
0000011011100 aabc Is 1 a code? No, continue…
0000011011100 aabc Is 11 a code? No, continue…
0000011011100 aabce Is 111 a code? Yes, it represents e. We have used up the 111.
0000011011100 aabce Is 0 a code? No, continue…
0000011011100 aabcea Is 00 a code? Yes, it is another a. We are now done.

Optional: Related Readings

Here are three articles related to data compression you might find interesting: