On 06/05/2014 05:02 PM, Steven D'Aprano wrote:
> [...]
> But Linux Unicode support is much better than Windows. Unicode support in
> Windows is crippled by continued reliance on legacy code pages, and by
> the assumption deep inside the Windows APIs that Unicode means "16 bit
> characters". See, for example, the amount of space spent on fixing
> Windows Unicode handling here:
>
> http://www.utf8everywhere.org/

While not disagreeing with the general premise of that page, it has
some problems that raise doubts in my mind about taking everything the
author says at face value. For example:

  "Q: Why would the Asians give up on UTF-16 encoding, which saves
  them 50% the memory per character?"
  [...] in fact UTF-8 is used just as often in those [Asian] countries.

That is not my experience, at least for Japan. See my comments in
  https://mail.python.org/pipermail/python-ideas/2012-June/015429.html
where I show that utf8 files are a tiny minority of the text files
found by Google.

He then gives a table with the size of the utf8 and utf16 encoded
contents (i.e. stripped of html stuff) of an unnamed Japanese wikipedia
page, to show that even without a lot of (html-mandated) ascii, the
space savings are not very much compared to the theoretical "50%"
savings he stated:

  "             Dense text  (Δ UTF-8)
   UTF-8 ...    222 KB      (0%)
   UTF-16 ...   176 KB      (−21%)"

Note that in the table he calculates the space saving as
(utf8-utf16)/utf8. Yet by that metric the theoretical saving is *NOT*
50%, it is 33%. For example, 1000 Japanese characters will use 2000
bytes in utf16 and 3000 in utf8. The 50% figure only comes out if you
divide by the utf16 size instead: (utf8-utf16)/utf16.

I did the same test using
  http://ja.wikipedia.org/wiki/%E7%B9%94%E7%94%B0%E4%BF%A1%E9%95%B7
I stripped html tags, javascript and redundant ascii whitespace
characters. The stripped utf-8 file was 164946 bytes; the utf-16
encoded version of the same was 117756 bytes. That gives (using the
(utf8-utf16)/utf16 metric he used to claim 50% idealized savings) 40%,
which is quite a bit closer to the idealized 50% than his 21%.

I would have more faith in his opinions about things I don't know
about (such as unicode programming on Windows) if his other info
were more trustworthy. IOW, just because it's on the internet doesn't
mean it's true.
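
For anyone who wants to check the arithmetic, here is a minimal sketch in
Python 3. The helper name `saving` and the 1000-character sample string are
illustrative assumptions, not anything taken from the page; the byte counts
are the ones quoted in this thread. It simply evaluates both metrics,
(utf8-utf16)/utf8 and (utf8-utf16)/utf16, on the idealized case, on the
page's table, and on the sizes measured for the stripped wikipedia article.

```python
def saving(utf8_size, utf16_size, relative_to):
    """Fraction of space saved by UTF-16, relative to the chosen baseline.

    relative_to is "utf8" for (utf8-utf16)/utf8,
    or "utf16" for (utf8-utf16)/utf16.
    """
    if relative_to not in ("utf8", "utf16"):
        raise ValueError("relative_to must be 'utf8' or 'utf16'")
    baseline = utf8_size if relative_to == "utf8" else utf16_size
    return (utf8_size - utf16_size) / baseline


# Idealized case: 1000 Japanese characters (hypothetical sample text,
# the four kanji of the article title repeated 250 times).
text = "\u7e54\u7530\u4fe1\u9577" * 250
ideal_utf8 = len(text.encode("utf-8"))       # 3 bytes per character
ideal_utf16 = len(text.encode("utf-16-le"))  # 2 bytes per character, no BOM

cases = [
    ("idealized (bytes)", ideal_utf8, ideal_utf16),
    ("page's table (KB)", 222, 176),
    ("stripped article (bytes)", 164946, 117756),
]

for label, u8, u16 in cases:
    print(
        f"{label:26s} "
        f"/utf8: {saving(u8, u16, 'utf8'):6.1%}   "
        f"/utf16: {saving(u8, u16, 'utf16'):6.1%}"
    )
```

Note that "utf-16-le" is used so that no byte order mark is counted; the
plain "utf-16" codec would add two bytes for the BOM. Whichever metric is
chosen, the point is that the idealized figure and the measured figure
should be computed the same way before they are compared.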