Consider the following sentence:
dead beef cafe deeded dad. dad faced a faded cab. dad acceded. dad be bad.
NOTE: This sentence is 77 characters long.
A fixed-length code could be used:
code | character |
---|---|
000 | (space) |
001 | a |
010 | b |
011 | c |
100 | d |
101 | e |
110 | f |
111 | . |
NOTE: Then the sentence, which is of length 77, consumes 77 * 3 = 231 bits.
Alternative, a Variable length code is used like this:
code | character |
---|---|
100 | (space) |
110 | a |
11110 | b |
1110 | c |
0 | d |
1010 | e |
11111 | f |
1011 | . |
NOTE: The optimal number of bits needed to represent each letter is obtained by the Huffman Algorithm.
The text here can be encoded in 3 * 12 + 4 * 5 + 5 * 4 + 19 * 1 + 12 * 4 + 4 * 5 + 17 * 3 + 4 * 4 = 230 bits.
This code must be a prefix code, where we can distinguish where one code stops and another starts;
If the characters have non-uniform frequency distributions, then finding such a code can lead to great savings in storage space.
A prefix code can be represented as a binary tree, where the leaves are the characters and the codes are derived by tracing a path from root to leaf, using 0 when we go left and 1 when we go right.
The Variable-length code above would be represented by this tree:
_@_ _/ \_ _/ \_ _/ \_ _/ \_ _/ \_ _/ \_ d _@_ / \ / \ / \ / \ / \ _@_ _@_ / \ / \ / \ / \ (space) @ a @ / \ / \ / \ / \ e "." c @ / \ b f
NOTE: In this tree, the code for e is found by going right, left, right, left, i.e., 1010.
Label each internal node recursively with the sum of the values of its children, starting at the leaves.
The tree looks like this:
_77 _/ \_ _/ \_ _/ \_ _/ \_ _/ \_ _/ \_ d _58 19 / \ / \ / \ / \ / \ _33 _25 / \ / \ / \ / \ (space) 16 a 13 17 / \ 12 / \ / \ / \ e "." c 8 12 4 5 / \ b f 4 4
NOTE: The root node has value 77, which is just the number of characters.
total_bits_for_a_character = f(c) * dT
Huffman (C) n = the size of C // Insert all the elements of C into Q, using the value of the node as the priority. for i in 1..n-1 do z = a new tree node x = Extract-Minimum (Q) y = Extract-Minimum (Q) left node of z = x right node of z = y f[z] = f[x] + f[y] Insert (Q, z) end for return Extract-Minimum (Q) as the complete tree
NOTE: In the following algorithm,
At first, the queue contains all the leaf nodes as a “forest” of singleton binary trees.
Here are the contents of Q after each step through the for loop:
Initially, all nodes are leaf nodes.
We stick all 8 in Q:
(space) a b c d e f . 17 12 4 5 19 12 4 4
We join two of the nodes with least value; now there are 7 things in Q:
8 / \ (space) a b c d e f . 17 12 4 5 19 12 4 4
Then the next two with least value, Q has 6 elements:
8 9 / \ (space) a / \ d e f . 17 12 b c 19 12 4 4 4 5
Now the two nodes with least values are the two trees we just made, and Q has 5 elements:
17 __/ \__ __/ \__ / \ 8 9 / \ / \ d e (space) a f . b c 19 12 17 12 4 4 4 5 <code> ---- The next two with least value, Q has 4 elements: <code> 17 __/ \__ __/ \__ / \ 24 8 9 / \ / \ / \ d e a (space) f . b c 19 12 12 17 4 4 4 5
The next two with least value, three items left:
34 ______/ \_____ ______/ \_____ / (space) 17 17 __/ \__ __/ \__ / \ 24 8 9 / \ / \ / \ e a d f . b c 12 12 19 4 4 4 5
The next two with least value, two big trees left:
34 ______/ \_____ ______/ \_____ / (space) 17 17 __/ \__ 43 __/ \__ / \ / \ 24 d 8 9 / \ 19 / \ / \ e a f . b c 12 12 4 4 4 5
Finally, we join the whole thing up:
77 __________________/ \_______ / \ 34 43 ______/ \_____ / \ ______/ \_____ 24 d / (space) / \ 19 17 17 e a __/ \__ 12 12 __/ \__ / \ 8 9 / \ / \ f . b c 4 4 4 5
NOTE: At each point, we chose the joining that would force less frequent characters to be deeper in the tree.
So an optimal prefix code is:
01 (space) 101 a 0010 b 0011 c 11 d 100 e 0000 f 0001 .
And
B(T) = 17 * 2 + 12 * 3 + 4 * 4 + 5 * 4 + 19 * 2 + 12 * 3 + 4 * 4 = 196 bits
NOTE: A savings of 15% in the size of the data.