Re: Regular expressions and Unicode

2008-10-02 Thread Peter Otten
Jeffrey Barish wrote:
> I have a regular expression that I use to extract the surname:
>
> surname = r'(?u).+ (\w+)'
>
> However, when I apply it to this Unicode string, I get only the first 3
> letters of the surname:
>
> name = 'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'

That's a byte string. You

Re: Regular expressions and Unicode

2008-10-02 Thread skip
Jeffrey> However, when I apply it to this Unicode string, I get only the
Jeffrey> first 3 letters of the surname:

Jeffrey> name = 'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'

Maybe

    name = unicode('Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k', "utf-8")

? Yup, that works:

    >>> name = unico
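
For anyone reading this thread later, here is a minimal sketch of the same point. It assumes Python 3, whereas the thread predates it and uses Python 2's unicode(); bytes.decode('utf-8') is the rough Python 3 counterpart. The variable names are made up for illustration, and the (?u) flag is left out because Python 3 already treats str patterns as Unicode and rejects that flag on bytes patterns.

```python
import re

# The name from the thread, held as UTF-8 encoded bytes rather than text.
raw = b'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'

# With a bytes pattern, \w only matches ASCII word characters, so the
# match stops at the first non-ASCII byte of the surname.
surname_bytes = re.match(rb'.+ (\w+)', raw).group(1)

# Decode the bytes to text first; \w then matches Unicode letters too.
name = raw.decode('utf-8')
surname_text = re.match(r'.+ (\w+)', name).group(1)

print(surname_bytes)        # three ASCII letters only
print(ascii(surname_text))  # full surname, shown with escapes for portability
```

The design point is the one both replies make: decode once at the boundary where the bytes come in, then run the regular expression on text rather than on the encoded form.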