On Tue, 10 May 2005 07:59:31 +0000 (UTC), Thomas Bellman <[EMAIL PROTECTED]> wrote:
>John Machin <[EMAIL PROTECTED]> writes:
>
>> Which raises a question: who or what is going to read your file? If a
>> Unicode-aware application, and never a human, you might like to
>> consider encoding the text as utf-16.
>
>Why would one want to use an encoding that is neither semi-compatible
>with ASCII (the way UTF-8 is), nor uses fixed-width characters (like
>UTF-32 does)?

UTF-32 is yet another encoding. You still need to decode it into the
internal form supported by your processing software. With UTF-32xE, you
can only skip the decoding step when the file's x == the software's x
and your software uses 32 bits internally.

Python (2.4.1) doesn't have a utf_32 codec. Perhaps that's because there
isn't much call for it (yet). Let's pretend there is such a codec in
Python. Once you have done codecs.open('inputfile', 'rb', 'utf_32') or
receivedstring.decode('utf_32'), what do you care whether your *external
representation* has fixed-width characters or not?

Putting it another way, any advantage of fixed-width characters is to be
found in *internal* storage, not *external* transmission or storage.

At the other end, if you don't have to squeeze your data through an
8-bit-wide non-binary channel, and you have no need for legibility to
humans, then the remaining considerations are efficiency and (if you
have no control over what's used at the other end) whether the necessary
codec is widely implemented.

So rather than utf-16, perhaps I should have written something like:

"""
Consider utf-8 or utf-16. Consider following this by compression using a
widely-implemented protocol (gzip/zip/bzip2).
"""

Cheers,
John
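P.S. A rough sketch of the "encode, then compress" suggestion, for
anyone who wants to measure it on their own data. It assumes a current
Python 3, where a utf_32 codec does exist (unlike 2.4.1) and where
gzip.compress/gzip.decompress are available. The helper names
encoded_sizes and roundtrip, and the sample text, are made up for
illustration only.

```python
import gzip

def encoded_sizes(text, encodings=("utf_8", "utf_16", "utf_32")):
    """Return {encoding: (raw_byte_count, gzipped_byte_count)} for text."""
    sizes = {}
    for enc in encodings:
        raw = text.encode(enc)
        sizes[enc] = (len(raw), len(gzip.compress(raw)))
    return sizes

def roundtrip(text, encoding):
    """Encode and gzip, then gunzip and decode; should give text back."""
    packed = gzip.compress(text.encode(encoding))
    return gzip.decompress(packed).decode(encoding)

# Hypothetical sample: mixed Latin, Greek and Japanese, repeated so that
# the compressor has something to work with.
sample = "na\u00efve caf\u00e9, \u03b1\u03b2\u03b3, \u65e5\u672c\u8a9e\n" * 200

for enc, (raw, packed) in encoded_sizes(sample).items():
    print(enc, raw, packed)
```

Once the text has been decoded, the caller sees the same string whichever
external encoding was used, which is the point about fixed width being an
internal concern. The actual byte counts depend heavily on the text, so
treat any one run as an illustration rather than a benchmark.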