Compression - Huffman Coding (AQA GCSE Computer Science): Revision Note
Huffman Coding
What is huffman coding?
Huffman coding is a method of lossless compression primarily used on text based data (documents)
A huffman coding tree is used to compress the data whilst keeping all the data so that it can be uncompressed back to its original state
Example (step 1)
A text file contains the following characters
Text file | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
I |
| L | O | V | E |
| C | O | M | P | U | T | E | R | S |
The text file contain 16 characters including spaces
A frequency analysis shows there is some repetition in the characters, for example there are 2 x E's
Frequency | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
I | space | L | O | V | E | C | M | P | U | T | R | S |
1 | 2 | 1 | 2 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Huffman Trees
What is a huffman tree?
A huffman tree, also known a binary tree is used to lossless compress text based data as part of huffman coding
A huffman tree consists of nodes which can have either 0, 1 or 2 child nodes
Binary trees are covered in much more detail at A-Level
Example (step 2)
The frequency analysis data is ordered from least to most frequent
data:image/s3,"s3://crabby-images/c4518/c45183fbe81385de0f90e490a458b2399eb27438" alt="Table with two rows; the first row has the letters "I L V C M P U T R S" and "space O E," and the second row numbers "1 1 1 1 1 1 1 1 1 1 2 2 2"."
The 2 least frequent characters (I, L) are joined together into a node as part of a binary tree as below
data:image/s3,"s3://crabby-images/6b1fd/6b1fdc3a85ee5971c6f30d3e40917f073c1687a2" alt="huffman-coding-image1"
Update the frequency data to show the combined characters
data:image/s3,"s3://crabby-images/205f4/205f4383058c537fe1fa8fb3004db2aa05235b1c" alt="Table with letters "V C M P U T R S I L space O E" in the first row and corresponding numbers "1 1 1 1 1 1 1 1 2 2 2 2 2" in the second row."
Repeat steps until all characters are combined
data:image/s3,"s3://crabby-images/8b4de/8b4de2f17f5babf573857329223bac223c973b56" alt="huffman-coding-image2"
Next, follow each branch and place a 0 on left branches and a 1 on right branches, this creates a unique code for each character
data:image/s3,"s3://crabby-images/08649/08649b687c502e1efff7811574665f8ac957c46d" alt="huffman-coding-image3"
This allows unique bit patterns for each character to be created and characters that are more frequent have a smaller bit pattern making it more efficient
Character | Bit pattern | Times used | Total bits |
---|---|---|---|
space |
| 2 | 3*2=6 |
O |
| 2 | 3*2=6 |
E |
| 2 | 3*2=6 |
V |
| 1 | 4*1=4 |
C |
| 1 | 4*1=4 |
M |
| 1 | 4*1=4 |
P |
| 1 | 4*1=4 |
U |
| 1 | 4*1=4 |
T |
| 1 | 4*1=4 |
R |
| 1 | 4*1=4 |
S |
| 1 | 4*1=4 |
|
| Total number of bits | 50 |
In this example, using huffman coding we have compressed the original text files to 50 bits
Uncompressed, using ASCII, the text file would be 16 characters x 7 bits per character (16*7=112 bits) which means huffman coding has saved 62 bits of storage (112-50=62)
You've read 0 of your 5 free revision notes this week
Sign up now. It’s free!
Did this page help you?