On 24/03/2006 2:19 PM, Jean-Paul Calderone wrote:
> On Fri, 24 Mar 2006 09:33:19 +1100, John Machin <[EMAIL PROTECTED]> wrote:
>
>> On 24/03/2006 8:36 AM, Peter Otten wrote:
>>
>>> John Machin wrote:
>>>
>>>> You can replace ALL of this upshifting and accent removal in one blow by
>>>> using the string translate() method with a suitable table.
>>>
>>> Only if you convert to unicode first or if your data maintains 1 byte == 1
>>> character, in particular it is not UTF-8.
>>
>> I'm sorry, I forgot that there were people who are unaware that
>> variable-length gizmos like UTF-8 and various legacy CJK encodings are
>> for storage & transmission, and are better changed to a
>> one-character-per-storage-unit representation before *ANY* data
>> processing is attempted.
>
> Unfortunately, unicode only appears to solve this problem in a sane
> manner. Most people conveniently forget (or never learn in the first
> place) about combining sequences and denormalized forms. Consider
> u'e\u0301', u'U\u0301', or u'C\u0327'.

Yes, and many people don't even bother to look at their data. If they
did, and found combining forms, then they would treat them, as I said, as
"variable-length gizmos" which are "better changed to a
one-character-per-storage-unit representation before *ANY* data
processing is attempted."

In any case, as the OP is upshifting and stripping accents [presumably
as elementary preparation for some sort of fuzzy matching], all that is
needed is to throw away the combining accents (0301, 0327, etc).

> These difficulties can be
> mitigated to some degree via normalization (see unicodedata.normalize),
> but this step is often forgotten

It's not a matter of forgetting or not. People should bother to examine
their data and see what characters are in use; then they would know
whether they had a problem or not.

> and, for things like u'\u0565\u0582'
> (ARMENIAN SMALL LIGATURE ECH YIWN), it does not even work.

Sorry, I don't understand.
0565 is stand-alone ECH
0582 is stand-alone YIWN
0587 is the ligature.
What doesn't work? At first guess, in the absence of an Armenian
informant, for pre-matching normalisation, I'd replace 0587 by the two
constituents -- just like 00DF would be expanded to "ss" (before
upshifting and before not caring too much about differences caused by
doubled letters).
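
For illustration only, here is a rough sketch of the pre-matching clean-up described above: expand the ligatures, decompose, throw away the combining accents, then upshift. It is written for a current Python 3 (where str is already unicode). The names EXPANSIONS and fold_for_matching are made up for this example, the table holds just the two characters mentioned in this thread, and whether such folding suits your data is something only a look at the data can tell you.

```python
import unicodedata

# Hypothetical expansion table: only the two characters discussed above.
EXPANSIONS = {
    "\u00df": "ss",            # LATIN SMALL LETTER SHARP S -> "ss"
    "\u0587": "\u0565\u0582",  # ARMENIAN SMALL LIGATURE ECH YIWN -> ECH + YIWN
}


def fold_for_matching(text):
    """Return an upshifted, accent-free key for crude fuzzy matching."""
    # 1. Expand ligatures and similar letters into their constituents.
    expanded = "".join(EXPANSIONS.get(ch, ch) for ch in text)
    # 2. Canonical decomposition: a precomposed e-acute becomes 'e' + U+0301.
    decomposed = unicodedata.normalize("NFD", expanded)
    # 3. Drop the combining marks (0301, 0327, etc.).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # 4. Upshift last, after the expansions.
    return stripped.upper()


if __name__ == "__main__":
    for sample in ("e\u0301", "\u00e9", "C\u0327", "Stra\u00dfe"):
        print(ascii(sample), "->", ascii(fold_for_matching(sample)))
```

With this, the decomposed u'e\u0301' and the precomposed u'\u00e9' fold to the same key, and the 0587 ligature folds to the same key as 0565 followed by 0582.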