On Sun, Mar 13, 2016 at 12:18 AM, BartC <b...@freeuk.com> wrote:
> On 12/03/2016 12:13, Marko Rauhamaa wrote:
>>
>> BartC <b...@freeuk.com>:
>>
>>> If you're looking at fast processing of language source code (in a
>>> thread partly about efficiency), then you cannot ignore the fact that
>>> the vast majority of characters being processed are going to have
>>> ASCII codes.
>>
>>
>> I don't know why you would optimize for inputting program source code.
>> Text in general has left ASCII behind a long time ago. Just go to
>> Wikipedia and click on any of the other languages.
>>
>> Why, look at the *English* page on Hillary Clinton:
>>
>> Hillary Diane Rodham Clinton /ˈhɪləri daɪˈæn ˈrɒdəm ˈklɪntən/ (born
>> October 26, 1947) is an American politician.
>> <URL: https://en.wikipedia.org/wiki/Hillary_Clinton>
>>
>> You couldn't get past the first sentence in ASCII.
>
>
> I saved that page locally as a .htm file in UTF-8 encoding. I ran a modified
> version of my benchmark, and it appeared that 99.7% of the bytes had ASCII
> codes. The other 0.3% presumably were multi-byte sequences, so that the
> actual proportion of Unicode characters would be even less.
>
> I then saved the Arabic version of the page, which visually, when rendered,
> consists of 99% Arabic script. But the .htm file was still 80% ASCII!
>
> So what were you saying about ASCII being practically obsolete ... ?
Now take the same file and save it as plain text. See how much smaller
it is. If you then take that text and embed it in a 10GB file
consisting of nothing but byte value 246, it will be plainly obvious
that ASCII is almost completely obsolete, and that we should optimize
our code for byte 246.

Or maybe, all you've proven is that *the framing around the text* is
entirely ASCII, which makes sense, since HTML is trying to be
compatible with a wide range of messy encodings (many of them
eight-bit ASCII-compatible ones). The text itself may also consist
primarily of ASCII characters, but that's a separate point. In the
Arabic version, that is far less likely to be true (there'll still be
a good number of ASCII characters in it, as U+0020 SPACE is heavily
used in Arabic text, but a far smaller percentage).

But neither of those says that ASCII is "practically obsolete", any
more than you could say that the numbers from 1 to 10 become obsolete
once a child learns to count further than that. The ASCII characters
are an important part of the Unicode set; you can't ignore the rest of
Unicode, but you certainly can't ignore ASCII, and there'll be very
few pieces of human-language text which include no ASCII characters
whatsoever. That's why UTF-8 is so successful; even Chinese text is
often more compact in UTF-8 than in UTF-16 when framed in HTML (even
though many of its characters fit into a single UTF-16 code unit but
require three bytes in UTF-8).

However, once again, we have a sharp distinction: semantically, you
support all Unicode characters equally, but then you optimize for the
common ones.

ChrisA
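
To put rough numbers on the framing point and the UTF-8 versus UTF-16
comparison, here is a minimal sketch. The sample text and the HTML
wrapper are made up for illustration (they are not taken from the
Wikipedia pages discussed above), and the helper names
ascii_byte_fraction and encoded_sizes are hypothetical.

```python
def ascii_byte_fraction(data: bytes) -> float:
    """Fraction of the bytes in `data` that are below 0x80 (i.e. ASCII)."""
    if not data:
        return 0.0
    return sum(b < 0x80 for b in data) / len(data)


def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 size, UTF-16 size) in bytes, with no BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Made-up sample: 30 BMP characters, each three bytes in UTF-8 and
# two bytes in UTF-16, with no ASCII at all.
body = "你好，世界。" * 5

# The same text wrapped in a small, purely ASCII HTML frame.
page = '<html><body><div class="content"><p>' + body + "</p></div></body></html>"

print(ascii_byte_fraction(body.encode("utf-8")))  # 0.0: the text has no ASCII
print(ascii_byte_fraction(page.encode("utf-8")))  # well above zero: framing only
print(encoded_sizes(body))  # UTF-16 is smaller for the bare text
print(encoded_sizes(page))  # UTF-8 is smaller once the markup is included
```

With n three-byte characters and a ASCII characters, UTF-8 needs
3n + a bytes and UTF-16 needs 2n + 2a, so UTF-8 comes out smaller
whenever a > n. Whether a given real page crosses that line depends on
how much markup surrounds the text.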