Jeffrey Barish wrote:

> I have a regular expression that I use to extract the surname:
>
> surname = r'(?u).+ (\w+)'
>
> However, when I apply it to this Unicode string, I get only the first 3
> letters of the surname:
>
> name = 'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'

That's a byte string. You can either modify the literal

name = u'Anton\xedn Dvo\u0159\xe1k'

or decode it with the proper encoding

name = 'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'
name = name.decode("utf-8")

> surname_re = re.compile(surname)
> m = surname_re.search(name)
> m.groups()
> ('Dvo\xc5',)
>
> I suppose that there is an encoding problem, but I don't understand
> Unicode well enough to know what to do to digest properly the Unicode
> characters in the surname.

>>> name = 'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'
>>> re.compile(r"(?u).+ (\w+)").search(name.decode("utf-8")).groups()
(u'Dvo\u0159\xe1k',)
>>> print _[0]
Dvořák

Peter
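
For anyone reading this on Python 3, the same decode-first approach might look like the sketch below. It assumes the name arrives as UTF-8 bytes, as in the example above; the names raw, surname and bytes_match are only illustrative. On Python 3, \w in a str pattern is already Unicode-aware, so the (?u) flag should not be needed.

```python
import re

# UTF-8 bytes for the name from the original post.
raw = b'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'

# Decode to text first, so the regex sees characters rather than bytes.
name = raw.decode("utf-8")

# Same pattern as in the post, minus (?u): str patterns are Unicode-aware.
surname_re = re.compile(r".+ (\w+)")
surname = surname_re.search(name).group(1)
print(surname)

# For contrast: a bytes pattern applied to the undecoded bytes.
# Here \w only matches ASCII word characters, so the match stops
# at the first non-ASCII byte of the surname.
bytes_match = re.compile(rb".+ (\w+)").search(raw).group(1)
print(bytes_match)
```

The point is the same as above: decode once at the boundary where the bytes come in, and do all matching on the decoded text.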