[issue14068] problem with re split

Ezio Melotti Tue, 21 Feb 2012 18:01:18 -0800

Ezio Melotti <ezio.melo...@gmail.com> added the comment:

As long as you don't mix str and unicode everything works.


With strings:
>>> s = '与清新。阿德莱'
>>> re.split('。', s)
['\xe4\xb8\x8e\xe6\xb8\x85\xe6\x96\xb0', '\xe9\x98\xbf\xe5\xbe\xb7\xe8\x8e\xb1']
>>> s.split('。')
['\xe4\xb8\x8e\xe6\xb8\x85\xe6\x96\xb0', '\xe9\x98\xbf\xe5\xbe\xb7\xe8\x8e\xb1']

With unicode:
>>> u = u'与清新。阿德莱'
>>> re.split(u'。', u)
[u'\u4e0e\u6e05\u65b0', u'\u963f\u5fb7\u83b1']
>>> u.split(u'。')
[u'\u4e0e\u6e05\u65b0', u'\u963f\u5fb7\u83b1']

Mixing str and unicode:
>>> re.split(u'。', s)
['\xe4\xb8\x8e\xe6\xb8\x85\xe6\x96\xb0\xe3\x80\x82\xe9\x98\xbf\xe5\xbe\xb7\xe8\x8e\xb1']
>>> re.split('。', u)
[u'\u4e0e\u6e05\u65b0\u3002\u963f\u5fb7\u83b1']
>>>
>>> s.split(u'。')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
>>> u.split('。')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
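
A small sketch of why the mixed re.split() calls return the input unsplit,
assuming the source is UTF-8 encoded: the unicode separator is a single code
point, while the str separator is a sequence of three bytes, so neither one
can match inside a string of the other type.  The variable names here are only
illustrative.

```python
# -*- coding: utf-8 -*-
# Illustrative only: compare the text and UTF-8 byte forms of the separator.
sep_text = u'。'                       # one code point, U+3002
sep_bytes = sep_text.encode('utf-8')   # three bytes: e3 80 82

code_point = ord(sep_text)
byte_values = list(bytearray(sep_bytes))  # integer byte values on 2.7 and 3.x

print(hex(code_point))
print([hex(b) for b in byte_values])
```

The three byte values are the same \xe3\x80\x82 sequence that is still
visible in the unsplit str result above.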


In Python 3, a SyntaxError is raised for bytes literals that contain non-ASCII
characters, but that change can't be backported to 2.7.  Raising an error when
str and unicode are mixed in re would not be backward compatible, and re does
work with mixed types as long as both are ASCII only.  I'm therefore closing
this as invalid.
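
One possible way to avoid mixing types in user code is to decode any byte
strings before calling re.split().  This is only a sketch: to_text() and
split_text() are hypothetical helpers, not part of re, and UTF-8 is an assumed
encoding.

```python
# -*- coding: utf-8 -*-
import re


def to_text(value, encoding='utf-8'):
    """Hypothetical helper: decode byte strings, return text unchanged."""
    if isinstance(value, bytes):  # bytes is an alias of str on 2.7
        return value.decode(encoding)
    return value


def split_text(pattern, string, encoding='utf-8'):
    """Hypothetical helper: run re.split() with both arguments as text."""
    return re.split(to_text(pattern, encoding), to_text(string, encoding))


u = u'与清新。阿德莱'
s = u.encode('utf-8')                # byte string, assumed UTF-8
sep_bytes = u'。'.encode('utf-8')

parts_from_bytes = split_text(u'。', s)     # text pattern, byte string
parts_from_text = split_text(sep_bytes, u)  # byte pattern, text string
```

Both calls return the same two-element list of text strings as the unicode-only
example, because the separator and the string have the same type by the time
they reach re.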

----------
nosy: +mrabarnett
resolution:  -> invalid
stage:  -> committed/rejected
status: open -> closed

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue14068>
_______________________________________