On 3 Jul., 18:54, MRAB <pyt...@mrabarnett.plus.com> wrote:
> Tep wrote:
> > On 3 Jul., 16:58, "Mark Tolonen" <metolone+gm...@gmail.com> wrote:
> >> "Tep" <petshm...@googlemail.com> wrote in message
> >>news:46d36544-1ea2-4391-8922-11b8127a2...@o6g2000yqj.googlegroups.com...
> >>> On 3 Jul., 06:40, Simon Forman <sajmik...@gmail.com> wrote:
> >>>> On Jul 2, 4:31 am, Tep <petshm...@googlemail.com> wrote:
> >> [snip]
> >>>>>>>> how can I replace '—' sign from string? Or do split at that
> >>>>>>>> character?
> >>>>>>>> Getting unicode error if I try to do it:
> >>>>>>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in
> >>>>>>>> position 1: ordinal not in range(128)
> >>>>>>>> Thanks, Pet
> >>>>>>>> script is # -*- coding: UTF-8 -*-
> >> [snip]
> >>>> I just tried a bit of your code above in my interpreter here and it
> >>>> worked fine:
> >>>> |>>> data = 'foo — bar'
> >>>> |>>> data.split('—')
> >>>> |['foo ', ' bar']
> >>>> |>>> data = u'foo — bar'
> >>>> |>>> data.split(u'—')
> >>>> |[u'foo ', u' bar']
> >>>> Figure out the smallest piece of "html source code" that causes the
> >>>> problem and include that with your next post.
> >>> The problem was, I've converted "html source code" to unicode object
> >>> and didn't encoded to utf-8 back, before using split...
> >>> Thanks for help and sorry for not so smart question
> >>> Pet
> >> You'd still benefit from posting some code. You shouldn't be converting
> >
> > I've posted code below
> >
> >> back to utf-8 to do a split, you should be using a Unicode string with
> >> split on the Unicode version of the "html source code". Also make sure
> >> your file is actually saved in the encoding you declare. I print the
> >> encoding of your symbol in two encodings to illustrate why I suspect this.
> >
> > File was indeed in windows-1252, I've changed this. For errors see
> > below
> >
> >> Below, assume "data" is your "html source code" as a Unicode string:
> >
> >> # -*- coding: UTF-8 -*-
> >> data = u'foo — bar'
> >> print repr(u'—'.encode('utf-8'))
> >> print repr(u'—'.encode('windows-1252'))
> >> print data.split(u'—')
> >> print data.split('—')
> >
> >> OUTPUT:
> >
> >> '\xe2\x80\x94'
> >> '\x97'
> >> [u'foo ', u' bar']
> >> Traceback (most recent call last):
> >>   File "C:\dev\python\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 427, in ImportFile
> >>     exec codeObj in __main__.__dict__
> >>   File "<auto import>", line 1, in <module>
> >>   File "x.py", line 6, in <module>
> >>     print data.split('—')
> >> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0:
> >> ordinal not in range(128)
> >
> >> Note that using the Unicode string in split() works. Also note the decode
> >> byte in the error message when using a non-Unicode string to split the
> >> Unicode data. In your original error message the decode byte that caused
> >> an error was 0x97, which is 'EM DASH' in Windows-1252 encoding. Make sure
> >> to save your source code in the encoding you declare. If I save the above
> >> script in windows-1252 encoding and change the coding line to windows-1252
> >> I get the same results, but the decode byte is 0x97.
> >
> >> # coding: windows-1252
> >> data = u'foo — bar'
> >> print repr(u'—'.encode('utf-8'))
> >> print repr(u'—'.encode('windows-1252'))
> >> print data.split(u'—')
> >> print data.split('—')
> >
> >> '\xe2\x80\x94'
> >> '\x97'
> >> [u'foo ', u' bar']
> >> Traceback (most recent call last):
> >>   File "C:\dev\python\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 427, in ImportFile
> >>     exec codeObj in __main__.__dict__
> >>   File "<auto import>", line 1, in <module>
> >>   File "x.py", line 6, in <module>
> >>     print data.split('—')
> >> UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 0:
> >> ordinal not in range(128)
> >
> >> -Mark
> >
> > #! /usr/bin/python
> > # -*- coding: UTF-8 -*-
> > import urllib2
> > import re
> > def getTitle(input):
> >     title = re.search('<title>(.*?)</title>', input)
>
> The input is Unicode, so it's probably better for the regular expression
> to also be Unicode:
>
>     title = re.search(u'<title>(.*?)</title>', input)
>
> (In the current implementation it actually doesn't matter.)
>
> >     title = title.group(1)
> >     print "FULL TITLE", title.encode('UTF-8')
> >     parts = title.split(' — ')
>
> The title is Unicode, so the string with which you're splitting should
> also be Unicode:
>
>     parts = title.split(u' — ')
>
Oh, so simple. I'm new to Python and still feel uncomfortable with Unicode stuff.
Thanks to all for the help!

>
> >     return parts[0]
> >
> > def getWebPage(url):
> >     user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
> >     headers = { 'User-Agent' : user_agent }
> >     req = urllib2.Request(url, '', headers)
> >     response = urllib2.urlopen(req)
> >     the_page = unicode(response.read(), 'UTF-8')
> >     return the_page
> >
> > def main():
> >     url = "http://bg.wikipedia.org/wiki/%D0%91%D0%B0%D1%85%D1%80%D0%B5%D0%B9%D0%BD"
> >     title = getTitle(getWebPage(url))
> >     print title[0]
> >
> > if __name__ == "__main__":
> >     main()
> >
> > Traceback (most recent call last):
> >   File "C:\user\Projects\test\src\new_main.py", line 29, in <module>
> >     main()
> >   File "C:\user\Projects\test\src\new_main.py", line 24, in main
> >     title = getTitle(getWebPage(url))
> > FULL TITLE Бахрейн — Уикипедия
> >   File "C:\user\Projects\test\src\new_main.py", line 9, in getTitle
> >     parts = title.split(' — ')
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
> > 1: ordinal not in range(128)

--
http://mail.python.org/mailman/listinfo/python-list
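
For anyone finding this thread later, here is a minimal sketch of the rule the replies arrive at: decode the bytes to text once, then split with a text separator. It is written for Python 3, which is an assumption on my part since the thread uses Python 2, and `first_title_part` is a hypothetical helper name, not something from the original script. It also shows why the error byte differs (0xe2 vs 0x97) depending on how the source file was saved.

```python
# Python 3 sketch (assumption: the thread itself is Python 2).

# EM DASH written as an escape, so the encoding the file is saved in
# cannot change what the separator means.
EM_DASH = "\u2014"


def first_title_part(raw_bytes, encoding="utf-8"):
    """Hypothetical helper: decode bytes to text, then split on a text separator."""
    text = raw_bytes.decode(encoding)
    return text.split(" " + EM_DASH + " ")[0]


# The same character is a different byte sequence in each encoding,
# which is why the failing byte in the traceback depends on the file encoding.
utf8_bytes = EM_DASH.encode("utf-8")
cp1252_bytes = EM_DASH.encode("windows-1252")

title_bytes = ("foo " + EM_DASH + " bar").encode("utf-8")
first = first_title_part(title_bytes)

# Mixing text and bytes fails. Python 3 raises TypeError right away instead of
# attempting the implicit ASCII decode that Python 2 performs.
try:
    ("foo " + EM_DASH + " bar").split(utf8_bytes)
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(repr(utf8_bytes), repr(cp1252_bytes), repr(first), mixed_ok)
```

In Python 2 the equivalent fix is the one MRAB gives above: keep the page as a `unicode` object and pass a `u'...'` separator to `split`.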