I have a lot of short English strings I'd like to compress in order to reduce the size of a database. That is, I'd like a compression function that takes a string like (for example) "George Washington" and returns a shorter string, with luck maybe 6 bytes or so. One obvious idea: take the gzip function, compress some large text corpus with it in streaming mode and throw away the output (this sets up the internal state to model the statistics of English text), then feed in "George Washington" and treat the additional output as the compressed string. Obviously, to get reasonable speed, there would have to be a way to save the internal state after initializing from the corpus.
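For what it's worth, zlib's preset-dictionary feature gets close to this without replaying a corpus each time: you prime both compressor and decompressor with up to 32 KiB of representative text (Python 3.3+ exposes this as the zdict argument to zlib.compressobj/decompressobj). A minimal sketch, assuming a made-up sample dictionary, and using raw deflate (wbits=-15) so zlib's header and checksum don't swamp the tiny outputs:

```python
import zlib

# Hypothetical preset dictionary: text statistically similar to the
# strings being compressed.  zlib only uses the last 32 KiB of it,
# and both sides must use the exact same bytes.
ZDICT = (b"George Washington John Adams Thomas Jefferson "
         b"James Madison Abraham Lincoln Theodore Roosevelt ")

def compress(s: str) -> bytes:
    # wbits=-15 selects raw deflate: no zlib header, dict ID, or Adler-32.
    c = zlib.compressobj(level=9, wbits=-15, zdict=ZDICT)
    return c.compress(s.encode("utf-8")) + c.flush()

def decompress(data: bytes) -> str:
    d = zlib.decompressobj(wbits=-15, zdict=ZDICT)
    return (d.decompress(data) + d.flush()).decode("utf-8")

if __name__ == "__main__":
    s = "George Washington"
    packed = compress(s)
    print(len(s.encode()), "->", len(packed), "bytes")
    assert decompress(packed) == s
```

Strings that appear verbatim in the dictionary shrink to a few bytes (a single back-reference plus end-of-block); strings that merely share the dictionary's letter statistics shrink less, since deflate's Huffman coding adapts per block rather than being pre-trained. The compressed form must always be decoded against the same dictionary, so store ZDICT alongside the database.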
Anyone know if this has been done, and if there's code around for it? Maybe I'm better off freezing a dynamic Markov model? I think there's DMM code around but am not sure where to look. Thanks. -- http://mail.python.org/mailman/listinfo/python-list