Hi All, I am working on a personal project to grab certain URLs from the web, do some processing, and store the final contents in a text/PDF file.
I am also using html2text ( https://github.com/aaronsw/html2text/archives/master ) to convert the fetched page to text. As a first step, I tried fetching a page and converting it to text with the following code.

Code:
{{{
#!/bin/python
import os
import urllib

fetch = urllib.urlopen("some-web-link.htm")
mainfile = open ('main.html', 'w' )
mainfile.write(fetch.read())
os.system('python2.6 html2text.py main.html > main.txt')
}}}

It fails with this error:
{{{
Traceback (most recent call last):
  File "html2text.py", line 447, in <module>
    data = open(arg, 'r').read().decode(encoding)
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x88 in position 11366: invalid start byte
}}}

I also tried:
{{{
+ import codecs
...
...
- mainfile = open ('main.html', 'w' )
+mainfile = codecs.open('xyz.htm', 'w', None, 'ignore')
...
...
}}}

The result is the same. Please tell me what can be done to avoid this error.

Thanks,
Nikunj
Bangalore, India

_______________________________________________
BangPypers mailing list
BangPypers@python.org
http://mail.python.org/mailman/listinfo/bangpypers
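One possible way around the error, as a minimal sketch rather than a definitive fix: the traceback suggests html2text decodes main.html as UTF-8, while the fetched page contains bytes (such as 0x88) that are not valid UTF-8. The idea below is to decode the raw bytes first and re-save them as UTF-8 before running html2text. The names `decode_html`, `save_as_utf8`, `declared` and `fallback` are hypothetical, and the choice of latin-1 as the last resort is an assumption, not something known about the actual page. As far as I can tell, `codecs.open` with an encoding of `None` returns an ordinary file object, which would explain why the `'ignore'` argument changed nothing.

```python
import io


def decode_html(raw, declared=None, fallback="latin-1"):
    """Decode raw page bytes to text.

    Tries the declared charset (e.g. taken from the Content-Type header,
    if one is known), then UTF-8, then a fallback that accepts any byte.
    All three choices are assumptions for illustration.
    """
    for enc in (declared, "utf-8"):
        if not enc:
            continue
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate.
            pass
    # latin-1 maps every byte to a character, so this never raises,
    # although the result may be wrong for pages in other encodings.
    return raw.decode(fallback, "replace")


def save_as_utf8(raw, path, declared=None):
    """Decode the raw bytes and write them to `path` as UTF-8."""
    text = decode_html(raw, declared)
    # The with-block closes (and flushes) the file before any later
    # step, such as running html2text on it, reads it back.
    with io.open(path, "w", encoding="utf-8") as out:
        out.write(text)
    return text
```

With this in place, `save_as_utf8(fetch.read(), 'main.html')` would stand in for the plain `open`/`write` pair. The original script also never closes `mainfile` before calling `os.system`, so part of the page may still be sitting in the write buffer when html2text reads the file; the `with` block avoids that.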