On 30 May, 13:54, Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa...@spamschutz.glglgl.de> wrote:
> On 30.05.2012 08:52, ru...@yahoo.com wrote:
>
> > This breaks a lot of my code because in python 2
> >       re.split (ur'[\u3000]', u'A\u3000A') ==> [u'A', u'A']
> > but in python 3 (the result of running 2to3),
> >       re.split (r'[\u3000]', 'A\u3000A' ) ==> ['A\u3000A']
>
> > I can remove the "r" prefix from the regex string but then
> > if I have other regex backslash symbols in it, I have to
> > double all the other backslashes -- the very thing that
> > the r-prefix was invented to avoid.
>
> > Or I can leave the "r" prefix and replace something like
> > r'[ \u3000]' with r'[ ]'.  But that is confusing because
> > one can't distinguish between the space character and
> > the ideographic space character.  It also a problem if a
> > reader of the code doesn't have a font that can display
> > the character.
>
> > Was there a reason for dropping the lexical processing of
> > \u escapes in strings in python3 (other than to add another
> > annoyance in a long list of python3 annoyances?)
>
> Probably it is more consequent. Alas, it makes the whole stuff
> incompatible to Py2.
>
> But if you think about it: why allow for \u if \r, \n etc. are
> disallowed as well?
>
> > And is there no choice for me but to choose between the two
> > poor choices I mention above to deal with this problem?
>
> There is a 3rd one: use r'[ ' + '\u3000' + ']'. Not very nice to read,
> but should do the trick...
>
> Thomas
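
To make the quoted problem and Thomas's third option concrete, here is a minimal sketch. It deliberately does not hand r'\u3000' to re directly, because whether the re module itself interprets \u in a pattern depends on the Python 3 version. The variable names are only for illustration.

```python
import re

text = 'A\u3000A'

# In Python 3 a raw literal keeps the backslash: r'\u3000' is six
# characters (\, u, 3, 0, 0, 0), while '\u3000' is one code point.
raw_escape = r'\u3000'
cooked_escape = '\u3000'
print(len(raw_escape), len(cooked_escape))   # 6 1

# Thomas's suggestion: keep the regex part raw and splice the code
# point in from a normal (non-raw) literal.
pattern = r'[ ' + '\u3000' + r']'
parts = re.split(pattern, text)
print(parts)                                 # ['A', 'A']
```

The concatenation is ugly, but the regex part stays raw, so other regex backslashes in it would not need doubling.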

I suggest approaching the problem differently. Python 3 succeeded in
bringing order to the muddle of character encodings that Python 2
offered.

In your case, the

>>> import unicodedata as ud
>>> ud.name('\u3000')
'IDEOGRAPHIC SPACE'

"character" (in fact a Unicode code point) is just a "character",
like

>>> ud.name('a')
'LATIN SMALL LETTER A'

The code point / Unicode logic that Python 3 proposes and follows
then becomes straightforward.

>>> s = 'a\u3000é\u3000€'
>>> s.split('\u3000')
['a', 'é', '€']
>>>
>>> import re
>>> re.split('\u3000', s)
['a', 'é', '€']

A backslash used as a literal backslash behaves just as it did in
Python 2. Note the absence of r'...'.

>>> s = 'a\\b\\c'
>>> print(s)
a\b\c
>>> s.split('\\')
['a', 'b', 'c']
>>> re.split('\\\\', s)
['a', 'b', 'c']
>>> hex(ord('\\'))
'0x5c'
>>> re.split('\u005c\u005c', s)
['a', 'b', 'c']

jmf
--
http://mail.python.org/mailman/listinfo/python-list
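
P.S. A sketch of what the non-raw approach costs once a pattern mixes a code point escape with a regex escape such as \d: the regex backslash has to be doubled, while the \u escape does not. The constant name IDEOGRAPHIC_SPACE and the sample string are assumptions made up for this example; unicodedata.lookup() is the standard-library counterpart of the ud.name() call used above.

```python
import re
import unicodedata as ud

# Hypothetical constant name, introduced only for this sketch.
IDEOGRAPHIC_SPACE = ud.lookup('IDEOGRAPHIC SPACE')

# Non-raw literal: Python turns \u3000 into the code point, and the
# regex escape \d needs a doubled backslash to survive.
mixed = '[\u3000\\d]+'

# Same pattern with the regex part kept raw and the code point
# spliced in by name, which also reads well without a suitable font.
spliced = '[' + IDEOGRAPHIC_SPACE + r'\d]+'
print(mixed == spliced)          # True

sample = 'a\u30001\u3000b'
result = re.split(mixed, sample)
print(result)                    # ['a', 'b']
```

Which of the two spellings is easier to read is a matter of taste; the resulting pattern strings are identical.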