Chess - Programming - Huffman - Example

Consider the following sentence:

dead beef cafe deeded dad.  dad faced a faded cab.  dad acceded.  dad be bad.

NOTE: This sentence is 77 characters long.

There are 12 a's, 4 b's, 5 c's, 19 d's, 12 e's, 4 f's, 17 spaces, and 4 periods.

Using a Fixed-length code

A fixed-length code could be used:

code	character
000	(space)
001	a
010	b
011	c
100	d
101	e
110	f
111	.

NOTE: Then the sentence, which is of length 77, consumes 77 * 3 = 231 bits.

Each character is taking up 3 bits.

Using a Variable-length code

Alternative, a Variable length code is used like this:

code	character
100	(space)
110	a
11110	b
1110	c
0	d
1010	e
11111	f
1011	.

NOTE: The optimal number of bits needed to represent each letter is obtained by the Huffman Algorithm.

It uses the frequency of the letters in the text to determine the optimal code for each letter.

The text here can be encoded in 3 * 12 + 4 * 5 + 5 * 4 + 19 * 1 + 12 * 4 + 4 * 5 + 17 * 3 + 4 * 4 = 230 bits.

That is a savings of 1 bit.
It does not seem like much, but it is a start.

This code must be a prefix code, where we can distinguish where one code stops and another starts;

One code may not be a prefix of another code or there will be confusion.

If the characters have non-uniform frequency distributions, then finding such a code can lead to great savings in storage space.

A process with this effect is called data compression.
This can be applied to any data where the frequency distribution is known or can be computed, not just sentences in languages.
Examples are computer graphics, digitized sound, binary executables, etc.

A prefix code can be represented as a binary tree, where the leaves are the characters and the codes are derived by tracing a path from root to leaf, using 0 when we go left and 1 when we go right.

Represent the Variable-length code in a tree

The Variable-length code above would be represented by this tree:

                                _@_
                             _/     \_
                           _/         \_
                         _/             \_
                       _/                 \_
                     _/                     \_
                   _/                         \_
                  d                            _@_
                                              /   \
                                             /     \ 
                                            /       \
                                           /         \
                                          /           \
                                         _@_          _@_
                                        /   \        /   \
                                       /     \      /     \
                                   (space)    @    a       @
                                             / \          / \
                                            /   \        /   \
                                           e    "."     c     @
                                                             / \
                                                            b   f

NOTE: In this tree, the code for e is found by going right, left, right, left, i.e., 1010.

Label each internal node

Label each internal node recursively with the sum of the values of its children, starting at the leaves.

The tree looks like this:

                                _77
                             _/     \_
                           _/         \_
                         _/             \_
                       _/                 \_
                     _/                     \_
                   _/                         \_
                  d                            _58
                 19                           /   \
                                             /     \ 
                                            /       \
                                           /         \
                                          /           \
                                         _33          _25
                                        /   \        /   \
                                       /     \      /     \
                                   (space)    16   a      13
                                     17      / \  12      / \
                                            /   \        /   \
                                           e    "."     c     8
                                          12     4      5    / \
                                                            b   f
                                                            4   4

NOTE: The root node has value 77, which is just the number of characters.

The number of bits needed to encode the data is the the sum, for each character, of the number of bits in its code times its frequency.
Let T be the tree, C be the set of characters c that comprise the alphabet, and f© be the frequency of character c.
Since the number of bits is the same as the depth in the binary tree, we can express the sum in terms of dT, the depth of character c in the tree:
```
total_bits_for_a_character = f(c) * dT
```
For example:
- a had frequency of 12 and depth of 3, so total_bits = 36.
- b had frequency of 4 and depth of 5, so total_bits = 20.
- c had frequency of 5 and depth of 4, so total_bits = 20.
- d had frequency of 19 and depth of 1, so total_bits = 19.
- e had frequency of 12 and depth of 4, so total_bits = 48.
- f had frequency of 4 and depth of 5, so total_bits = 20.
- (spaces) had frequency of 17 and depth of 3, so total_bits = 51.
- . had frequency of 4 and depth of 4, so total_bits = 16.

This gives a total of 36 + 20 + 20 + 19 + 48 + 20 + 51 + 16 = 230 bits.
- This is the sum we want to minimize.
- Lets call it the cost, B(T) of the tree.

Build the low-cost tree

Huffman (C)
  n = the size of C

  // Insert all the elements of C into Q, using the value of the node as the priority.
  for i in 1..n-1 do
    z = a new tree node
    x = Extract-Minimum (Q)
    y = Extract-Minimum (Q)
    left node of z = x
    right node of z = y
    f[z] = f[x] + f[y]
    Insert (Q, z)
  end for
  
  return Extract-Minimum (Q) as the complete tree

NOTE: In the following algorithm,

B(T) is defined as above;
- it can be stored efficiently in an array indexed by characters.
- It is extended as needed to accommodate the values of internal tree nodes.
C is the set of characters represented as leaf tree nodes.
Q is a priority queue of tree nodes where we can quickly extract the minimum element;
- this can be done with a heap where the heap property is reversed.
- We build the tree in a bottom up manner, starting with the individual characters and ending up with the root of the tree as the only element of the queue.

At first, the queue contains all the leaf nodes as a “forest” of singleton binary trees.

Then the two nodes with least value are grabbed out of the queue, joined by a new tree node, and put back on the queue.
After n-1 iterations, there is only one node left in the queue: the root of the tree.

Stepping through the code using the Huffman algorithm

Here are the contents of Q after each step through the for loop:

Initially, all nodes are leaf nodes.

We stick all 8 in Q:

(space)  a    b    c    d    e    f    .
  17    12    4    5   19   12    4    4

We join two of the nodes with least value; now there are 7 things in Q:

                 8
                / \   (space) a    b    c    d    e
               f   .    17   12    4    5   19   12
               4   4

Then the next two with least value, Q has 6 elements:

                                       
                 8                    9 
                / \   (space) a      / \     d    e
               f   .    17   12     b   c   19   12
               4   4                4   5

Now the two nodes with least values are the two trees we just made, and Q has 5 elements:

                          17
                      __/   \__
                   __/         \__
                  /               \
                 8                 9 
                / \               / \     d    e  (space)  a
               f   .             b   c   19   12    17    12
               4   4             4   5
<code>

----

The next two with least value, Q has 4 elements:

<code>
                          17
                      __/   \__
                   __/         \__
                  /               \              24
                 8                 9            /  \
                / \               / \     d    e    a   (space)
               f   .             b   c   19   12    12    17
               4   4             4   5

The next two with least value, three items left:

                                            34
                                    ______/   \_____
                             ______/                \_____
                            /                            (space)
                          17                               17
                      __/   \__
                   __/         \__
                  /               \                      24
                 8                 9                    /  \
                / \               / \                  e    a      d
               f   .             b   c                12    12     19
               4   4             4   5

The next two with least value, two big trees left:

                                            34
                                    ______/   \_____
                             ______/                \_____
                            /                          (space)
                          17                             17
                      __/   \__                             43
                   __/         \__                         /  \
                  /               \                      24    d
                 8                 9                    /  \  19
                / \               / \                  e    a      
               f   .             b   c                12    12
               4   4             4   5

Finally, we join the whole thing up:

                                                                  77
                                               __________________/  \_______
                                              /                             \
                                            34                               43
                                    ______/   \_____                        /  \
                             ______/                \_____                24    d
                            /                          (space)           /  \  19
                          17                             17             e   a  
                      __/   \__                                        12  12
                   __/         \__ 
                  /               \
                 8                 9
                / \               / \
               f   .             b   c
               4   4             4   5

NOTE: At each point, we chose the joining that would force less frequent characters to be deeper in the tree.

So an optimal prefix code is:

01	(space)
101	a
0010	b
0011	c
11	d
100	e
0000	f
0001	.

And

B(T) = 17 * 2 + 12 * 3 + 4 * 4 + 5 * 4 + 19 * 2 + 12 * 3 + 4 * 4 = 196 bits

NOTE: A savings of 15% in the size of the data.

Wiki

Table of Contents