Huffman Coding

Ishaan Verma
Mar 21, 2021 · 6 min read

Have you ever wondered how file compression works on a modern computer? What does the system actually do to make a file smaller?

Today let us explore one famous file compression algorithm, devised by David A. Huffman in 1951. Huffman worked it out for a term paper, using frequency-sorted binary trees, and the technique has since been known as Huffman coding.

As we all know, data is stored as 1s and 0s; to build a lossless compression algorithm one only has to find a way to shorten this sequence of bits without losing any information. A normal ASCII character set consists of about 100 ‘printable’ characters, and to distinguish that many characters we need ⌈log₂ 100⌉ = 7 bits per character. Seven bits can actually represent 128 characters, so some ‘non-printable’ control characters also fit into those 128 slots. One more bit is often added as a parity check during encoding. The important point is that if there are N characters in a character set, then ⌈log₂ N⌉ bits are required to represent each of them in a standard fixed-length encoding.
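As a quick sanity check of that formula, here is a tiny Python sketch (the numbers are the ones mentioned above):

```python
import math

def bits_needed(n_chars):
    """Bits needed to give n_chars characters distinct fixed-length codes."""
    return math.ceil(math.log2(n_chars))

print(bits_needed(100))  # 7 bits for roughly 100 printable characters
print(bits_needed(128))  # 7 bits also cover all 128 ASCII characters
```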

Let us say that a file contains only 7 distinct characters, each occurring multiple times: 15 e’s, 11 a’s, 2 i’s, 8 o’s, 10 u’s, 13 blanks and 5 newlines. See the table below:

As we can see, the total number of bits required would be 192, since there are 64 characters in total and each character requires 3 bits. In real life, files may be considerably larger, and there is usually a great disparity between the most frequent and least frequent characters. For example, a document may contain a large number of digits, blank spaces and vowels, but very few q’s and z’s. We may want to compress the file if we are transmitting this data over a very slow phone line.
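Here is the same arithmetic for the example file, sketched in Python with the frequencies listed above:

```python
import math

# Frequencies from the example above ('sp' = blank, 'nl' = newline)
freqs = {'e': 15, 'a': 11, 'i': 2, 'o': 8, 'u': 10, 'sp': 13, 'nl': 5}

bits_per_char = math.ceil(math.log2(len(freqs)))  # ceil(log2(7)) = 3
total_chars = sum(freqs.values())                 # 64 characters in the file
print(bits_per_char * total_chars)                # 192 bits with a fixed-length code
```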

The solution is a general strategy that allows the code for each character to have a variable length, giving short codes to the characters that occur frequently. This strategy saves about 25% on a typical large file and about 50–60% on very large data files.

Let’s see how these characters can be represented in binary code. See the figure below:

Fig. 1

In this figure, all the characters are stored in the leaf nodes of a binary tree. The code for a character can be found by following the path from the root node to that character: for every left branch we record a 0 and for every right branch we record a 1. For example, to find ‘o’ we first take a left from the root node, which gives us a 0. Next, we take a right, which gives us a 1, so we have 01 as the partial code for ‘o’. Finally, we take another right to reach ‘o’, recording another 1. So the final code for ‘o’ is 011, and this can be verified by looking at the table above. The traversal stops once we reach a leaf node. This data structure is sometimes referred to as a trie. If character ci is at depth di and occurs fi times, then the cost of the code is ∑ di·fi.
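To make the traversal concrete, here is a small sketch of such a trie in Python. The node shape and function names are my own, not from the article; a node is either a leaf holding a character or an internal node whose left edge contributes a 0 and whose right edge contributes a 1.

```python
class Node:
    def __init__(self, char=None, left=None, right=None):
        self.char = char    # set only on leaf nodes
        self.left = left    # following the left child appends '0'
        self.right = right  # following the right child appends '1'

def codes_from_trie(node, prefix=""):
    """Return {character: code} by walking every root-to-leaf path."""
    if node.char is not None:              # leaf reached: traversal stops here
        return {node.char: prefix}
    table = {}
    table.update(codes_from_trie(node.left, prefix + "0"))
    table.update(codes_from_trie(node.right, prefix + "1"))
    return table

def cost(codes, freqs):
    """Cost of the code: the sum over all characters of depth * frequency."""
    return sum(len(codes[c]) * freqs[c] for c in freqs)
```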

We can improve this tree slightly by moving ‘nl’ up to its parent node, since it is the only child of that parent. By doing this we would get:

Fig. 2

With this change the code for ‘nl’ becomes 11, and 5 more bits are saved since ‘nl’ appears 5 times in the document, giving 187 bits instead of 192. But as you can see, this is not exactly substantial.

The tree in Fig. 2 is a full tree: every node is either a leaf or has 2 children. An optimal tree will always have this property, since otherwise, as we have just seen, a node that is an only child could be moved up a level.

The sequence can always be decoded without ambiguity if the characters are all located at the leaf nodes. Suppose 0100111100010110001000111 is the encoded string. 0 is not a character code, 01 is not a character code, but 010 represents ‘i’, so the first character is ‘i’. Then 011 follows, giving an ‘o’. Then 11 follows, giving a newline, and so on for the rest. As you can see, the first 2 characters are represented using 3 bits each while the newline is represented using only 2 bits, yet there is no ambiguity, and this stays true as long as no character code is a prefix of another character code. Such a code is called a prefix code.
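A minimal sketch of that decoding loop, assuming a code table consistent with Fig. 2 (i = 010, o = 011 and nl = 11 come from the text; the codes for a, e, u and sp are my guess at the figure, so treat them as illustrative):

```python
# One assignment of prefix codes consistent with the codes quoted in the text.
codes = {'a': '000', 'e': '001', 'i': '010', 'o': '011',
         'u': '100', 'sp': '101', 'nl': '11'}

def decode(bits, codes):
    """Decode a prefix-coded bit string by matching one full code at a time."""
    by_code = {code: char for char, code in codes.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in by_code:   # safe because no code is a prefix of another
            decoded.append(by_code[current])
            current = ""
    return decoded

print(decode("0100111100010110001000111", codes))
# ['i', 'o', 'nl', 'a', 'sp', 'u', 'i', 'e', 'nl']
```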

So now all we have to do is build a coding tree of minimum cost. Such a tree was given by Huffman, and hence the procedure is called Huffman coding. The main idea of Huffman’s algorithm is to maintain a forest of trees and, at each step, select the 2 trees with the least weight (i.e., total frequency) and combine them. This is done C − 1 times, where C is the number of characters in the set.
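A minimal sketch of that procedure, using Python’s heapq module as the forest and the Node class from the earlier sketch (tie-breaking and the left/right order of merged trees are arbitrary here, so the resulting bit patterns may differ from the figures even though the total cost will not):

```python
import heapq
import itertools

def build_huffman_tree(freqs):
    """Repeatedly merge the two lightest trees; C characters need C - 1 merges."""
    order = itertools.count()  # tie-breaker so the heap never compares Nodes
    forest = [(weight, next(order), Node(char=c)) for c, weight in freqs.items()]
    heapq.heapify(forest)
    while len(forest) > 1:
        w1, _, t1 = heapq.heappop(forest)   # lightest tree in the forest
        w2, _, t2 = heapq.heappop(forest)   # second lightest tree
        heapq.heappush(forest, (w1 + w2, next(order), Node(left=t1, right=t2)))
    return forest[0][2]                     # the single remaining tree
```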

Let us understand Huffman coding by taking the data in the table above as an example. We first create a forest of single-node trees along with their weights.

Initial stage of Huffman’s algorithm

Next, we combine the 2 trees with the least weights into a single tree and call it T1, so that future merges can be described unambiguously.

Huffman’s algorithm after first merge

Next, we again combine the 2 trees with the least weights into a single tree and call it T2. This time one of the two is T1.

Huffman’s algorithm after second merge

Next, since ‘u’ and ‘a’ now have the least weights, we combine these two into a tree and call it T3.

Huffman’s algorithm after third merge

Now we combine ‘sp’ with T2 and, for simplicity, name the tree T4. (We could combine ‘sp’ with ‘e’ instead, since T2 and ‘e’ have the same weight, but in this example we chose not to.)

Huffman’s algorithm after fourth merge

Next, we combine T3 and ‘e’ into a tree and name it T5.

Huffman’s algorithm after fifth merge

Lastly, we combine T5 and T4.

Huffman’s algorithm after final merge
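Summing the weights at each step: T1 = 2 + 5 = 7, T2 = 7 + 8 = 15, T3 = 10 + 11 = 21, T4 = 13 + 15 = 28, T5 = 21 + 15 = 36, and the final merge gives 36 + 28 = 64, the total number of characters in the file.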

And now our code tree is complete. As a sanity check, ‘i’ and ‘nl’, the least frequent characters, end up at the bottom of the code tree, exactly where an optimal tree should place them.

Now we can create the code table as follows:
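The same table can be regenerated programmatically from the finished tree. Here is a sketch using the helpers from the earlier sketches (the exact bit patterns depend on which merged tree becomes the left child at each step, but the code lengths match the tree built above and the total comes out to 171 bits):

```python
freqs = {'e': 15, 'a': 11, 'i': 2, 'o': 8, 'u': 10, 'sp': 13, 'nl': 5}

tree = build_huffman_tree(freqs)
codes = codes_from_trie(tree)
for char, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
    print(char, code, freqs[char], len(code) * freqs[char])
print("total bits:", cost(codes, freqs))   # 171, versus 192 for the fixed-length code
```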

As you can see, this is an 11% decrease from the previous code table, and on a larger file the savings would be far greater. That is the Huffman coding algorithm. There are two details that must be considered. First, the encoding information must be transmitted at the start of the compressed file, since otherwise it will be impossible to decode. For small files, the cost of transmitting this table can outweigh any savings from compression, and the result may even be file expansion; of course, this can be detected and the original left intact. For large files, the size of the table is not significant. The second problem is that, as described, this is a two-pass algorithm: the first pass collects the frequency data and the second pass does the encoding. This is obviously not a desirable property for a program dealing with large files.
