Google+ Pieces o' Eight: December 2011

Saturday 31 December 2011

Understanding eBay/PayPal fees

Selling things on eBay used to be a pretty simple and straightforward affair. The costs were fairly easy to understand and people, generally speaking, were happy. Then eBay bought PayPal which should have made things even easier for all concerned, but strangely (or not so strangely) has resulted in a multitude of costs, fees and other considerations that directly impact your bottom line. I wouldn't exactly call these "hidden" costs, but nor would I describe them as "well advertised" and could certainly catch the unwary seller off-guard. Here is a brief guide on what to watch out for.

Sunday 11 December 2011

An implementation of LZW compression in Python

I've been doing some research on data compression and differencing recently and came across an excellent article by Mark Nelson on the LZW compression algorithm: http://marknelson.us/2011/11/08/lzw-revisited/

It's well worth a read, however all the code examples are in C++ so I decided to implement LZW in Python to enhance my understanding of the algorithm.

To start with we need something to compress. I picked the opening paragraph from George Orwell's 1984:

string = """It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him."""

The next step is to initialize a dictionary with all possible single character codes:

codes = dict([(chr(x), x) for x in range(256)])

Here's the compressor:

compressed_data = []       
code_count = 257
current_string = ""
for c in string:
    current_string = current_string + c
    if not (codes.has_key(current_string)):
        codes[current_string] = code_count
        compressed_data.append(codes[current_string[:-1]])
        code_count += 1
        current_string = c
compressed_data.append(codes[current_string])

A couple of quick notes: I'm using a list - compressed_data - to store the compressed data, in the real world you'd probably want to use a file. You may also notice that code_count is initialized to 257, whilst the dictionary only has codes 0-255. What happened to 256? This is reserved for a control character, which I haven't implemented.

To decompress the data we just compressed, we need to initialize another dictionary containing all possible single character codes. In order for the algorithm to work, this dictionary needs to be the same as the dictionary used to compress the data:

strings = dict([(x, chr(x)) for x in range(256)])

You might have noticed that whilst this new dictionary contains the same data as the first the Keys and Values have swapped positions.

Here's the decompressor:

next_code = 257
decompressed_string = ""
previous_string = ""
for c in compressed_data:
    if not (strings.has_key(c)):
        strings[c] = previous_string + (previous_string[0])
    decompressed_string += strings[c]
    if not(len(previous_string) == 0):
        strings[next_code] = previous_string + (strings[c][0])
        next_code +=1
    previous_string = strings[c]

Again, next_code reserves 256 for an un-implemented control character. Also you'll see that I'm decompressing the data to a string, decompressed_string.

Finally, let's take a peek at the codes the compressor created and also display the decompressed string:

print "".join(str(compressed_data))
print decompressed_string

Conclusion

LZW is a pretty clever algorithm, but is also pretty easy to implement. Part of the cleverness is the way the decompressor is able to reconstruct the code dictionary used by the compressor on-the-fly and does not require the dictionary to be sent with the data.

My implementation is incomplete as it doesn't put a restriction on the number of codes generated and used, specify the control character or interface with files or streams external to the script, but even so I found it to be a useful examination of LZW and a good introduction to the techniques of lossless data compression.