On Fri, Nov 27, 2015 at 9:27 AM, BartC <b...@freeuk.com> wrote:
> On 26/11/2015 13:15, Chris Angelico wrote:
>>
>> On Thu, Nov 26, 2015 at 11:53 PM, BartC <b...@freeuk.com> wrote:
>
>>> http://pastebin.com/JrVTher6
>
>> #14 and #15: Are you assuming that a character is a byte and that
>> diacritical-free English is the only language in the world?
>
> I don't think that need be the assumption. Any UTF8 string that fits within
> 8 bytes could also be represented by an integer value.
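
To make the quoted idea concrete, here is a minimal Python sketch of packing a short UTF-8 string into an integer. The names pack_utf8 and unpack_utf8 and the big-endian byte order are assumptions chosen for illustration; nothing here is taken from the pastebin proposal. Note that the limit is 8 bytes, not 8 characters.

```python
def pack_utf8(s: str) -> int:
    """Pack a string into an int via its UTF-8 bytes (hypothetical helper)."""
    data = s.encode("utf-8")
    if len(data) > 8:
        # The limit applies to encoded bytes, not to characters.
        raise ValueError(f"{s!r} needs {len(data)} bytes, more than 8")
    return int.from_bytes(data, "big")


def unpack_utf8(n: int) -> str:
    """Inverse of pack_utf8 (hypothetical helper).

    Caveat: a string starting with U+0000 does not round-trip, because
    leading zero bytes vanish in the integer form.
    """
    length = (n.bit_length() + 7) // 8
    return n.to_bytes(length, "big").decode("utf-8")


print(hex(pack_utf8("if")))                # 0x6966
print(unpack_utf8(pack_utf8("teßting")))   # 7 characters, exactly 8 bytes
```

A ten-character name such as "Parıldıyor" encodes to 12 bytes and is rejected, so the scheme only covers quite short non-ASCII names.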

Okay, so you're making UTF-8 your visible string representation.
That's better than assuming character==byte, but it still has the case
insensitivity problem.

>> Case
>> insensitivity is a *pain* when you try to be language-agnostic; for
>> instance, the case-folding rules of English state that U+0069 LATIN
>> SMALL LETTER I and U+0049 LATIN CAPITAL LETTER I are identical, but
>> Turkish would upper-case the first to U+0130 LATIN CAPITAL LETTER I
>> WITH DOT ABOVE and lower-case the second to U+0131 LATIN SMALL LETTER
>> DOTLESS I. German has U+00DF LATIN SMALL LETTER SHARP S (also called
>> eszett), which traditionally upper-cases to "SS", which lower-cases to
>> "ss".
>
> I use Windows which is also case insensitive with regard to filenames and
> such. How does it solve those problems? How about web-site names, email
> addresses and Google searches?

Windows: I'm not sure, and frankly, I don't trust it. A quick test
showed a couple of failures:

C:\Users\Rosuav\Desktop>dir /b TE*
teßting

C:\Users\Rosuav\Desktop>dir /b TESST*
File Not Found

C:\Users\Rosuav\Desktop>dir /b ParıldıYOR*
Parıldıyor Parts & Pieces

C:\Users\Rosuav\Desktop>dir /b PARILDIYOR*
File Not Found

It might be case insensitive only for ASCII. (Note: This test was done
on Windows 7, because that's the VM I had handy. Things might be
different on newer Windowses, but I can't check.)

Web site names: Presumably you mean DNS. It started out as an
ASCII-only protocol, and grew a number of gross hacks to support
"internationalized domain names". I'm not sure where the case
insensitivity is applied; but it doesn't matter too much, because
conflicts can be resolved at registration. Also, you'll generally see
IDNs in country-specific TLDs, so there'll be only one language (or a
small family of languages) used, reducing the likelihood of
collisions.

Google searches are (deliberately) a LOT sloppier than just case
insensitivity.
You can search for something without diacriticals and get back results
with diacriticals; you can transpose letters, omit letters, have extra
letters, and it'll generally figure out what you want. This is
absolutely awesome for a search engine, but equally horrifying for
name lookups in a program.

None of these is something I'd recommend following.

> Within a program source code (where you have mainly technical users), you
> can just impose some restrictions on keywords and identifiers otherwise
> there are plenty of problems even without case switching, if you want to
> allow Unicode here.

I would strongly support ASCII-only *language keywords*. You don't
have many of them (compared to the number of identifiers in a
program), and everyone has to type them. But for identifiers, Python 3
defines character validity based on Unicode categories, and performs
NFKC normalization on all names. That's pretty straightforward: no
case sensitivity hassles, no messy non-transitive equalities. It's
easy.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
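
As a rough illustration of the case-mapping and NFKC points in this message, here is a small Python 3 sketch. The helper names case_roundtrip and nfkc are made up for this example. Python's str.upper() and str.lower() apply the default, language-independent Unicode mappings, so the Turkish behaviour described above is not what you get from them.

```python
import unicodedata


def case_roundtrip(s: str) -> str:
    """Upper-case, then lower-case, using Python's default mappings."""
    return s.upper().lower()


def nfkc(name: str) -> str:
    """Apply NFKC, the normalization form Python 3 uses for identifiers."""
    return unicodedata.normalize("NFKC", name)


# Round-tripping through upper case does not always return the input.
print(case_roundtrip("\u00df"))   # U+00DF sharp s -> "SS" -> "ss"
print(case_roundtrip("\u0131"))   # U+0131 dotless i -> "I" -> "i"

# Two different lower-case letters share one upper-case form, which is
# where the messy equalities come from.
print("i".upper() == "\u0131".upper())   # True
print("i" == "\u0131")                   # False

# NFKC folds compatibility characters without touching case.
print(nfkc("\ufb01le"))           # U+FB01 "fi" ligature -> "file"
print(nfkc("File") == "File")     # True: case is preserved
```

The design point is that NFKC is a single, locale-free mapping, so two spellings of a name either normalize to the same string or they do not; case-insensitive matching offers no such guarantee across languages.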