On Tuesday, 3 December 2013 at 06:06:26 UTC+1, Steven D'Aprano wrote:
> On Mon, 02 Dec 2013 16:14:13 -0500, Ned Batchelder wrote:
>
> > On 12/2/13 3:38 PM, Ethan Furman wrote:
> >> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
> >>>
> >>> Out of the nine tests, Python 3.3 passes six, with three tests being
> >>> failures or dubious. If you believe that the native string type should
> >>> operate on code-points, then you'll think that Python does the right
> >>> thing.
> >>
> >> I think Python is doing it correctly. If I want to operate on
> >> "clusters" I'll normalize the string first.
> >>
> >> Thanks for this excellent post.
> >>
> >> --
> >> ~Ethan~
> >
> > This is where my knowledge about Unicode gets fuzzy. Isn't it the case
> > that some grapheme clusters (or whatever the right word is) can't be
> > normalized down to a single code point? Characters can accept many
> > accents, for example. In that case, you can't always normalize and use
> > the existing string methods, but would need more specialized code.
>
> That is correct.
>
> If Unicode had a distinct code point for every possible combination of
> base-character plus an arbitrary number of diacritics or accents, the
> 0x10FFFF code points wouldn't be anywhere near enough.
>
> I see over 300 diacritics used just in the first 5000 code points. Let's
> pretend that's only 100, and that you can use up to a maximum of 5 at a
> time. That gives 79375496 combinations per base character, much larger
> than the total number of Unicode code points.
>
> If anyone wishes to check my logic:
>
> # count distinct combining chars
> import unicodedata
> s = ''.join(chr(i) for i in range(33, 5000))
> s = unicodedata.normalize('NFD', s)
> t = [c for c in s if unicodedata.combining(c)]
> len(set(t))
>
> # calculate the number of combinations
> def comb(r, n):
>     """Combinations nCr"""
>     p = 1
>     for i in range(r+1, n+1):
>         p *= i
>     for i in range(1, n-r+1):
>         p //= i  # integer division keeps the result an exact int
>     return p
>
> sum(comb(i, 100) for i in range(6))
>
>
> I'm not suggesting that all of those accents are necessarily in use in
> the real world, but there are languages which construct arbitrary
> combinations of accents. (Or so I have been led to believe.)
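
A small sketch of the point confirmed above ("That is correct"), assuming
Python 3 and the standard unicodedata module: NFC only folds a base
character plus accent into one code point when Unicode defines a
precomposed form. The two sample strings and the helper name nfc_length
are illustrative choices, not something taken from the thread.

```python
import unicodedata

def nfc_length(s):
    """Return the number of code points left after NFC normalization."""
    return len(unicodedata.normalize('NFC', s))

# 'e' + COMBINING ACUTE ACCENT has a precomposed form (U+00E9).
e_acute = 'e\u0301'
# 'q' + COMBINING TILDE has no precomposed form, so it stays two code points.
q_tilde = 'q\u0303'

print(nfc_length(e_acute))  # 1
print(nfc_length(q_tilde))  # 2
```

So len(), slicing and reversal on the normalized string still work on
code points, and would split the second cluster.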
From one of my libs, BMP only:

>>> import fourbiunicode5
>>> print(len(fourbiunicode5.AllCombiningMarks))
240

jmf
-- 
https://mail.python.org/mailman/listinfo/python-list
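
For readers without the private fourbiunicode5 module, a rough
standard-library sketch of a similar count over the BMP follows. It is an
approximation under stated assumptions: the thread does not say how
AllCombiningMarks is defined, so two plausible definitions are tried, and
the helper name bmp_combining is hypothetical. The totals depend on the
Unicode version bundled with the interpreter, so they need not match 240.

```python
import unicodedata

def bmp_combining(predicate):
    """Collect BMP code points (U+0000..U+FFFF) whose character satisfies predicate."""
    return [cp for cp in range(0x10000) if predicate(chr(cp))]

# Assumed definition 1: non-zero canonical combining class.
by_class = bmp_combining(lambda c: unicodedata.combining(c) != 0)

# Assumed definition 2: general category Mn, Mc or Me (the "Mark" categories).
by_category = bmp_combining(lambda c: unicodedata.category(c).startswith('M'))

# Report the Unicode database version next to both counts.
print(unicodedata.unidata_version, len(by_class), len(by_category))
```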