On Sun, Apr 17, 2011 at 8:01 PM, Nikunj Badjatya <nikunjbadja...@gmail.com> wrote:
> Hi All,
>
> I am working on a personal project to grab certain URLs from the web, do
> some processing, and store the final contents in a text/PDF file.
>
> I am also using html2text (
> https://github.com/aaronsw/html2text/archives/master ) to convert the
> fetched page into text format.
> As a first step, I tried fetching a page and converting it to text with
> the following code.
>
> Code:
> {{{
> #!/bin/python
>
> import os
> import urllib
>
> fetch = urllib.urlopen("some-web-link.htm")
>
> mainfile = open ('main.html', 'w' )
>
> mainfile.write(fetch.read())
>
> os.system('python2.6 html2text.py main.html > main.txt')
>
> }}}
>
> It flags an error:
> {{{
> Traceback (most recent call last):
>   File "html2text.py", line 447, in <module>
>     data = open(arg, 'r').read().decode(encoding)
>   File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
>     return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x88 in position 11366:
> invalid start byte
>
> }}}
>
> I also tried with
> {{{
> + import codecs
>
> ...
> ...
> - mainfile = open ('main.html', 'w' )
> + mainfile = codecs.open('xyz.htm', 'w', None, 'ignore')
>
> ...
> ...
> }}}
>
> The result is the same.
>
> Please tell me what can be done to avoid this error.
>
>
Try this:

from django.utils.encoding import smart_str
myunistr = smart_str(YOUR_STRING)

This will solve the issue.

--
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog
*ILUGCBE*
http://ilugcbe.techstud.org
_______________________________________________
BangPypers mailing list
BangPypers@python.org
http://mail.python.org/mailman/listinfo/bangpypers
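
A note on the traceback quoted above: 0x88 is not a valid UTF-8 start byte, so the saved page is probably in some encoding other than UTF-8 (which one is a guess; the thread does not say). If Django is not available, one possible approach is to decode the fetched bytes yourself and write the file back out as UTF-8 before running html2text on it. The following is only a minimal sketch under those assumptions. The names `decode_page`, `save_as_utf8`, and `FALLBACK_ENCODINGS`, and the fallback order itself, are hypothetical and not part of html2text or Django.

```python
import io

# Hypothetical fallback order (an assumption); adjust it to the sites you fetch.
FALLBACK_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def decode_page(raw, declared=None):
    """Decode fetched bytes to text, trying the declared charset first."""
    candidates = ([declared] if declared else []) + list(FALLBACK_ENCODINGS)
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # Not reached while "latin-1" is in the list, since it accepts any byte.
    return raw.decode("utf-8", "replace")


def save_as_utf8(raw, path, declared=None):
    """Write the page to `path` as UTF-8 so a UTF-8 reader can decode it."""
    text = decode_page(raw, declared)
    # The with-block closes (and flushes) the file before anything else reads it.
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return text
```

In the script from the question, this would be called as `save_as_utf8(fetch.read(), 'main.html', charset)` before the `os.system(...)` line. The `charset` value can usually be taken from the response headers (`fetch.info().getparam('charset')` on Python 2, or `response.headers.get_content_charset()` on Python 3) and may be `None`. A wrong fallback guess can still produce a few odd characters, but the file will at least be valid UTF-8.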