On 3 Jul., 16:58, "Mark Tolonen" <metolone+gm...@gmail.com> wrote:
> "Tep" <petshm...@googlemail.com> wrote in message
> news:46d36544-1ea2-4391-8922-11b8127a2...@o6g2000yqj.googlegroups.com...
>
> > On 3 Jul., 06:40, Simon Forman <sajmik...@gmail.com> wrote:
> > > On Jul 2, 4:31 am, Tep <petshm...@googlemail.com> wrote:
> [snip]
> > > > how can I replace '—' sign from string? Or do split at that
> > > > character?
> > > > Getting unicode error if I try to do it:
> > > >
> > > > UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in
> > > > position 1: ordinal not in range(128)
> > > >
> > > > Thanks, Pet
> > > >
> > > > script is # -*- coding: UTF-8 -*-
> [snip]
> > > I just tried a bit of your code above in my interpreter here and it
> > > worked fine:
> > >
> > > |>>> data = 'foo — bar'
> > > |>>> data.split('—')
> > > |['foo ', ' bar']
> > > |>>> data = u'foo — bar'
> > > |>>> data.split(u'—')
> > > |[u'foo ', u' bar']
> > >
> > > Figure out the smallest piece of "html source code" that causes the
> > > problem and include that with your next post.
> >
> > The problem was, I've converted "html source code" to unicode object
> > and didn't encoded to utf-8 back, before using split...
> > Thanks for help and sorry for not so smart question
> > Pet
>
> You'd still benefit from posting some code.  You shouldn't be converting

I've posted the code below.

> back to utf-8 to do a split, you should be using a Unicode string with split
> on the Unicode version of the "html source code".  Also make sure your file
> is actually saved in the encoding you declare.  I print the encoding of your
> symbol in two encodings to illustrate why I suspect this.

The file was indeed saved as windows-1252; I've changed this. For the errors, see below.

>
> Below, assume "data" is your "html source code" as a Unicode string:
>
> # -*- coding: UTF-8 -*-
> data = u'foo — bar'
> print repr(u'—'.encode('utf-8'))
> print repr(u'—'.encode('windows-1252'))
> print data.split(u'—')
> print data.split('—')
>
> OUTPUT:
>
> '\xe2\x80\x94'
> '\x97'
> [u'foo ', u' bar']
> Traceback (most recent call last):
>   File "C:\dev\python\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 427, in ImportFile
>     exec codeObj in __main__.__dict__
>   File "<auto import>", line 1, in <module>
>   File "x.py", line 6, in <module>
>     print data.split('—')
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0:
> ordinal not in range(128)
>
> Note that using the Unicode string in split() works.  Also note the decode
> byte in the error message when using a non-Unicode string to split the
> Unicode data.  In your original error message the decode byte that caused an
> error was 0x97, which is 'EM DASH' in Windows-1252 encoding.  Make sure to
> save your source code in the encoding you declare.  If I save the above
> script in windows-1252 encoding and change the coding line to windows-1252 I
> get the same results, but the decode byte is 0x97.
>
> # coding: windows-1252
> data = u'foo — bar'
> print repr(u'—'.encode('utf-8'))
> print repr(u'—'.encode('windows-1252'))
> print data.split(u'—')
> print data.split('—')
>
> '\xe2\x80\x94'
> '\x97'
> [u'foo ', u' bar']
> Traceback (most recent call last):
>   File "C:\dev\python\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 427, in ImportFile
>     exec codeObj in __main__.__dict__
>   File "<auto import>", line 1, in <module>
>   File "x.py", line 6, in <module>
>     print data.split('—')
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 0:
> ordinal not in range(128)
>
> -Mark

#! /usr/bin/python
# -*- coding: UTF-8 -*-
import urllib2
import re

def getTitle(input):
    title = re.search('<title>(.*?)</title>', input)
    title = title.group(1)
    print "FULL TITLE", title.encode('UTF-8')
    parts = title.split(' — ')
    return parts[0]

def getWebPage(url):
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = { 'User-Agent' : user_agent }
    req = urllib2.Request(url, '', headers)
    response = urllib2.urlopen(req)
    the_page = unicode(response.read(), 'UTF-8')
    return the_page

def main():
    url = "http://bg.wikipedia.org/wiki/%D0%91%D0%B0%D1%85%D1%80%D0%B5%D0%B9%D0%BD"
    title = getTitle(getWebPage(url))
    print title[0]

if __name__ == "__main__":
    main()

Traceback (most recent call last):
  File "C:\user\Projects\test\src\new_main.py", line 29, in <module>
    main()
  File "C:\user\Projects\test\src\new_main.py", line 24, in main
    title = getTitle(getWebPage(url))
FULL TITLE Бахрейн — УикипедиÑ�
  File "C:\user\Projects\test\src\new_main.py", line 9, in getTitle
    parts = title.split(' — ')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)

--
http://mail.python.org/mailman/listinfo/python-list
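
For reference, here is a minimal sketch of what Mark's advice seems to amount to for the script above: keep the page as a Unicode string and split it with a Unicode separator. Several things here are assumptions and not part of the original script: the name get_title, the hard-coded sample page that stands in for response.read() (so nothing is fetched over the network), and writing the dash as the escape \u2014 so that the result does not depend on how the source file is saved. It is written so that it should also run on Python 3.

```python
# -*- coding: utf-8 -*-
import re

# U+2014 EM DASH written as an escape, so the script does not depend on
# the encoding the source file happens to be saved in.
EM_DASH = u'\u2014'


def get_title(page):
    """Return the part of <title> before ' <em dash> ', or None if absent.

    `page` must already be a Unicode string (decoded from bytes).
    """
    match = re.search(u'<title>(.*?)</title>', page, re.DOTALL)
    if match is None:
        return None
    # Unicode separator on Unicode data: no implicit ASCII decode is needed.
    return match.group(1).split(u' ' + EM_DASH + u' ')[0]


# Hypothetical stand-in for response.read(): UTF-8 bytes of a made-up page
# whose title mirrors the one in the traceback above.
raw = (u'<html><title>'
       u'\u0411\u0430\u0445\u0440\u0435\u0439\u043d'
       u' \u2014 '
       u'\u0423\u0438\u043a\u0438\u043f\u0435\u0434\u0438\u044f'
       u'</title></html>').encode('utf-8')

page = raw.decode('utf-8')   # bytes -> Unicode, once, at the boundary
title = get_title(page)

# The same character as bytes in the two encodings discussed in the thread.
utf8_bytes = EM_DASH.encode('utf-8')
cp1252_bytes = EM_DASH.encode('windows-1252')

# Print byte reprs only, so the output stays ASCII on any console.
print(repr(title.encode('utf-8')))
print(repr(utf8_bytes))
print(repr(cp1252_bytes))
```

In the original script the equivalent change would presumably be just the u prefix on the separator passed to split(), with the file saved in the encoding named on the coding line.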