"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> Thomas Bellman wrote:
>> Fixed-width characters *do* have advantages, even in the external
>> representation.  With fixed-width characters you don't have to
>> parse the entire file or stream in order to read the Nth character;
>> instead you can skip or seek to an octet position that can be
>> calculated directly from N.
>
> OTOH, encodings that are free of null bytes and ASCII compatible
> also have advantages.

Indeed, indeed.  But that's no reason to choose UTF-16 over UTF-32,
since you don't get those advantages then.

>> And not the least, UTF-32 is *beautiful* compared to UTF-16.
>
> But ugly compared to UTF-8.  Not only does it have the null byte
> and the ASCII incompatibility problem, but it also has the
> endianness problem.  So for exchanging Unicode between systems,
> I can see no reason to use anything but UTF-8 (unless, of course,
> one end, or the protocol, already dictates a different encoding).

UTF-8 beats UTF-32 in the practicality department, due to its
compatibility with legacy software, but in my opinion UTF-32 wins
over UTF-8 for sheer beauty, even with the endianness problem.

I do wish they had standardized on one single endianness for UTF-32
(and UTF-16), instead of allowing both to exist.  In the mid-1990s
I had to work with files in the TIFF format, which allows both
endiannesses.  The specification *requires* readers to handle both,
but it was a rare sight to find MS Windows software that didn't barf
on big-endian TIFF files. :-(  Unix software tended to be better at
reading both byte orders, but generally wrote in the native format,
meaning big endian on Sun SPARC.  Luckily I could convert files using
tiffcp on our Unix machines, but it was irritating to have to
introduce that extra step.  I fully expect the same problem to happen
with UTF-16 and UTF-32 too.

Anyway, back to UTF: my complaint is that UTF-16 doesn't give you
the advantages of *either* UTF-8 or UTF-32, so if you have the
choice, UTF-16 is always the worst alternative of those three.
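
The trade-offs above can be sketched in a few lines of Python.  This
is only an illustration under assumptions not stated in the thread:
the helper name nth_char_utf32 and the sample string are made up,
"character" is taken to mean a Unicode code point, and the UTF-32
data is assumed to be little-endian with no byte order mark.

```python
import io


def nth_char_utf32(stream, n):
    """Return code point number n (0-based) from a seekable binary
    stream assumed to hold BOM-less little-endian UTF-32 (hypothetical
    helper, for illustration only)."""
    stream.seek(4 * n)          # octet position computed directly from n
    unit = stream.read(4)
    if len(unit) != 4:
        raise IndexError(n)
    return unit.decode("utf-32-le")


# Sample text whose characters take 1, 2, 3, 4 and 1 octets in UTF-8.
text = "a\u00f6\u20ac\U0001D11Ez"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")

# UTF-32: fixed width, so the last character is found by seeking.
last = nth_char_utf32(io.BytesIO(utf32), len(text) - 1)

# UTF-8: ASCII compatible and free of null bytes for this text.
utf8_is_ascii_compatible = "a".encode("utf-8") == b"a"
utf8_has_nulls = b"\x00" in utf8

# UTF-16: contains null bytes *and* is not fixed width, because the
# character outside the BMP is stored as a surrogate pair.
utf16_has_nulls = b"\x00" in utf16
utf16_is_fixed_width = len(utf16) == 2 * len(text)

# Both UTF-16 and UTF-32 come in two byte orders.
byte_orders_differ = text.encode("utf-32-le") != text.encode("utf-32-be")
```

With this sample, seeking works only for UTF-32; the same 2*N
calculation on the UTF-16 data would land in the middle of the
surrogate pair.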
I see no reason to recommend UTF-16 at all.

--
Thomas Bellman,   Lysator Computer Club,   Linköping University,  Sweden
"God is real, but Jesus is an integer."      !  bellman @ lysator.liu.se
                                             !  Make Love -- Nicht Wahr!
-- http://mail.python.org/mailman/listinfo/python-list