Pieces o' Eight: December 2011

I've been doing some research on data compression and differencing recently and came across an excellent article by Mark Nelson on the LZW compression algorithm: http://marknelson.us/2011/11/08/lzw-revisited/

It's well worth a read, however all the code examples are in C++ so I decided to implement LZW in Python to enhance my understanding of the algorithm.

To start with we need something to compress. I picked the opening paragraph from George Orwell's 1984:

string = """It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him."""

The next step is to initialize a dictionary with all possible single character codes:

codes = dict([(chr(x), x) for x in range(256)])

Here's the compressor:

compressed_data = []       
code_count = 257
current_string = ""
for c in string:
    current_string = current_string + c
    if not (codes.has_key(current_string)):
        codes[current_string] = code_count
        compressed_data.append(codes[current_string[:-1]])
        code_count += 1
        current_string = c
compressed_data.append(codes[current_string])

A couple of quick notes: I'm using a list - compressed_data - to store the compressed data, in the real world you'd probably want to use a file. You may also notice that code_count is initialized to 257, whilst the dictionary only has codes 0-255. What happened to 256? This is reserved for a control character, which I haven't implemented.

To decompress the data we just compressed, we need to initialize another dictionary containing all possible single character codes. In order for the algorithm to work, this dictionary needs to be the same as the dictionary used to compress the data:

strings = dict([(x, chr(x)) for x in range(256)])

You might have noticed that whilst this new dictionary contains the same data as the first the Keys and Values have swapped positions.

Here's the decompressor:

next_code = 257
decompressed_string = ""
previous_string = ""
for c in compressed_data:
    if not (strings.has_key(c)):
        strings[c] = previous_string + (previous_string[0])
    decompressed_string += strings[c]
    if not(len(previous_string) == 0):
        strings[next_code] = previous_string + (strings[c][0])
        next_code +=1
    previous_string = strings[c]

Again, next_code reserves 256 for an un-implemented control character. Also you'll see that I'm decompressing the data to a string, decompressed_string.

Finally, let's take a peek at the codes the compressor created and also display the decompressed string:

print "".join(str(compressed_data))
print decompressed_string

Conclusion

LZW is a pretty clever algorithm, but is also pretty easy to implement. Part of the cleverness is the way the decompressor is able to reconstruct the code dictionary used by the compressor on-the-fly and does not require the dictionary to be sent with the data.

My implementation is incomplete as it doesn't put a restriction on the number of codes generated and used, specify the control character or interface with files or streams external to the script, but even so I found it to be a useful examination of LZW and a good introduction to the techniques of lossless data compression.

Pieces o' Eight

Saturday, 31 December 2011

Understanding eBay/PayPal fees

Sunday, 11 December 2011

An implementation of LZW compression in Python