CSCI 2720
Data Structures
Spring 2007
Project 3 -- String Compression
Assigned: Thursday, March 8th
Due: Thursday, April 12th
See the Project 3 Grade Sheet.
and
questions and answers regarding project 3 .
Also, here's a
sample program
using command-line parameters
Description
In this project you will implement two programs:
- A string compression program.
- A corresponding uncompression program.
The compression program is to be implemented using one of three
algorithms:
- "Standard" 2-pass Huffman encoding. In the first pass, you
will count the frequency of occurence of the characters. You will
then construct a Huffman encoding tree, and build a lookup table.
Next, for full credit, you should create the "compressed" output
file containing a representation of the tree followed by the
encoded characters. For partial credit (-5) you may choose to
create the "compressed" output file containing a representation
of the encoding table followed by the encoded characters.
In a real compression program, this encoding would consist of
the bit representation for each character.
- Adaptive Huffman encoding. In this algorithm, you begin
with a Huffman encoding tree in which the frequency of each
ASCII character is set to 0, and the code for each character
is the binary representation of its ASCII code. As characters
occur in the input file, write out the current code for the
character, update the frequency of occurence of that character,
and then update the tree.
- Lempel-Ziv encoding. In this algorithm, create a table
(or other data structure) large enough to contain 4096 elements
(a 12-bit address). Initialize the first 256 elements with the
ASCII character set. Using the Lempel-Ziv algorithm discussed
in class and found in the text, iterate through the input
string, outputting the code for each prefix removed from the
string, and updating the table accordingly.
For these simulated compression algorithms, you can write out
the 0s and 1s representing bits as characters, if you find it
more convenient to do so.
For the selected compression algorithm, write the corresponding
uncompression algorithm. To test your program we will
compress an input file, view the compressed version to verify
the format of the compressed file, and the uncompress the
file to verify that the compression/uncompression was lossless.
Both the compression and uncompression algorithms should accept
command-line parameters specifying the name of the input file
and the name of the output file (in that order). You may
choose to accept additional command-line parameters.
Structure your program so that if a user types in the program
name without the parameters, he/she will receive a help message
describing the order and meaning of the expected parameters.
A representation of the Huffman Encoding Tree
One way to encode the Huffman Encoding tree is to encode each
leaf with a "1" followed by the character that the leaf represents,
and to encode each internal node with a "0", followed by an
encoding of its left subtree, followed by an encoding of its
right subtree. (A preorder traversal). Since each node in
a Huffman Encoding tree is either a leaf node or has two
non-null subtrees, this will be a unique representation.