Chris Lasher wrote:
Given that the information content is 2 bits per character, but each character takes up 8 bits of storage, there must be a good reason for storing and/or transmitting them this way? I.e., it is easy to think up a count-prefixed compressed format packing 4:1 in subsequent data bytes (except for the last byte, which may hold fewer than four 2-bit codes).
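For concreteness, here is a minimal sketch of one such count-prefixed format. The 4-byte big-endian length prefix, the A/C/G/T code assignment, and the names pack() and unpack() are all assumptions made for illustration, not an existing standard or library API.

```python
import struct

# Assumed 2-bit code assignment; any fixed mapping would do.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq):
    """Pack a string of A/C/G/T into a base count followed by 4 bases per byte."""
    out = bytearray(struct.pack(">I", len(seq)))  # 4-byte count prefix
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for ch in chunk:
            byte = (byte << 2) | CODES[ch]
        # Left-align a short final chunk; the unused low bits stay zero.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)

def unpack(data):
    """Reverse pack(): read the count, then decode four 2-bit codes per byte."""
    (count,) = struct.unpack(">I", data[:4])
    bases = []
    for byte in data[4:]:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    # The count prefix tells us how many codes in the last byte are real.
    return "".join(bases[:count])
```

Calling unpack(pack("GATTACA")) gives back "GATTACA"; the seven bases take two data bytes plus the four-byte prefix. Note that this sketch has no room for ambiguity codes such as N, which real sequence files often contain.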
My guess is that the inefficiency in storage size is tolerated because the format is human-readable, and because most in-silico molecular biology is just a bunch of fancy string algorithms. This is my limited view of these things at least.
Yeah, that pretty much matches my guess (not that I'm involved in anything related to computational molecular biology or genetics). Given the current technology, the cost of the extra storage size is presumably lower than the cost of translating into/out of a packed format. Heck, hard drives cost less than $1/GB now.
And besides, for long-term archiving purposes, I'd expect that zip et al on a character-stream would provide significantly better compression than a 4:1 packed format, and that zipping the packed format wouldn't be all that much more efficient than zipping the character stream.
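That expectation is easy to test rather than guess at. A rough sketch, assuming a uniformly random A/C/G/T sequence as stand-in data (real genomes have repeats and skewed base frequencies, so their numbers will differ) and zlib as the "zip et al" compressor; compare_sizes() and pack_bases() are hypothetical helper names:

```python
import random
import struct
import zlib

CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit code assignment

def pack_bases(seq):
    """Count-prefixed 4:1 packing: 4-byte length, then four bases per byte."""
    out = bytearray(struct.pack(">I", len(seq)))
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for ch in chunk:
            byte = (byte << 2) | CODES[ch]
        out.append(byte << (2 * (4 - len(chunk))))
    return bytes(out)

def compare_sizes(n, seed=0):
    """Return byte sizes of a random n-base sequence in four representations."""
    rng = random.Random(seed)
    seq = "".join(rng.choice("ACGT") for _ in range(n))
    text = seq.encode("ascii")
    packed = pack_bases(seq)
    return {
        "raw_text": len(text),
        "packed": len(packed),
        "zipped_text": len(zlib.compress(text, 9)),
        "zipped_packed": len(zlib.compress(packed, 9)),
    }

if __name__ == "__main__":
    for name, size in compare_sizes(100000).items():
        print(name, size)
```

Run it on an actual sequence file before drawing conclusions; random input is close to the worst case for a general-purpose compressor, so it says little about how much zip gains on real data.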
Jeff Shannon
Technician/Programmer
Credit International
-- http://mail.python.org/mailman/listinfo/python-list