I don't see anybody mentioning Huffman encoding. It works per symbol (typically per byte), so it won't compress as tightly as gzip, but it's easy to implement and needs no corpus-wide compression state, just a character frequency count if you want the codes to be near-optimal.
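For what it's worth, here's a minimal sketch of that idea in Python. Assumptions: in practice you'd build the frequency table once from a representative corpus rather than, as here, from the string itself, and the function names are my own illustration, not from any library:

import heapq
from collections import Counter

def build_codes(freq):
    """Build a Huffman code table {char: bitstring} from a frequency map."""
    # Heap entries are (frequency, unique tiebreaker, subtree); the tiebreaker
    # keeps heapq from ever comparing two subtrees directly.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two least-frequent subtrees...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (left, right)))  # ...get merged
        count += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a character
            codes[node] = prefix or "0"      # degenerate one-symbol case
    walk(heap[0][2], "")
    return codes

def encode(s, codes):
    return "".join(codes[ch] for ch in s)

freq = Counter("George Washington")          # stand-in for real corpus counts
codes = build_codes(freq)
bits = encode("George Washington", codes)
print(len(bits), "bits ->", (len(bits) + 7) // 8, "bytes")

Decoding just walks the same tree bit by bit, so both ends only need to share the frequency table (or the code table built from it), not any per-string state.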
(Disclaimer: I don't know anything about compression.)

"Paul Rubin" <http://[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]

> I have a lot of short English strings I'd like to compress in order to
> reduce the size of a database. That is, I'd like a compression
> function that takes a string like (for example) "George Washington"
> and returns a shorter string, with luck maybe 6 bytes or so. One
> obvious idea is to take the gzip function, compress some large text
> corpus with it in streaming mode and throw away the output (but
> setting up the internal state to model the statistics of English
> text), then put in "George Washington" and treat the additional output
> as the compressed string. Obviously, to get reasonable speed there
> would have to be a way to save the internal state after initializing
> from the corpus.
>
> Anyone know if this has been done and if there's code around for it?
> Maybe I'm better off with freezing a dynamic Markov model? I think
> there's DMM code around but am not sure where to look.
>
> Thanks.
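Not quite the saved-gzip-state approach, but zlib's preset-dictionary feature gets close to what's described above, and Python's zlib module exposes it via the zdict argument (the compression objects also have a .copy() method if you really do want to prime a stream once and fork it per string). A rough sketch, assuming a recent Python; the sample dictionary below is made up, and both ends must share exactly the same bytes:

import zlib

# Shared "dictionary" of text resembling the strings to be compressed.
# Compressor and decompressor must be constructed with identical bytes.
ZDICT = (b"George Washington John Adams Thomas Jefferson "
         b"James Madison James Monroe John Quincy Adams")

def compress(s):
    co = zlib.compressobj(level=9, zdict=ZDICT)
    return co.compress(s.encode("utf-8")) + co.flush()

def decompress(data):
    do = zlib.decompressobj(zdict=ZDICT)
    return (do.decompress(data) + do.flush()).decode("utf-8")

blob = compress("George Washington")
print(len(blob), "bytes ->", decompress(blob))

One caveat: the zlib framing (2-byte header, 4-byte dictionary id, 4-byte Adler-32 trailer) costs roughly ten bytes per string, so you won't see 6-byte outputs on inputs this short even when the dictionary contains the whole phrase; the dictionary only shrinks the deflate payload itself.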