Veek M <vek.m1...@gmail.com> writes:
> dieter wrote:
>
>> Veek M <vek.m1...@gmail.com> writes:
>>> UnicodeEncodeError: 'gbk' codec can't encode character u'\xa0' in
>>> position 8: illegal multibyte sequence
>>
>> You give us very little context.
>
> It's a longish chunk of code: basically, i'm trying to download using the
> 'requests.Session' module and that should give me Unicode once it's told
> what encoding is being used 'gbk'.
>
> def get_page(s, url):
>     print(url)
>     r = s.get(url, headers = {
>         'User-Agent' : user_agent,
>         'Keep-Alive' : '3600',
>         'Connection' : 'keep-alive',
>     })
>     s.encoding='gbk'
It looks strange that you can set "s.encoding" after you have called
"s.get" - but, as you apparently get an error related to the "gbk"
encoding, it seems to work.

>     text = r.text
>     return text
>
> # Open output file
> fh=codecs.open('/tmp/out', 'wb')
> fh.write(header)
>
> # Download
> s = requests.Session()
> ------------
>
> If 'text' is NOT proper unicode because the server introduced some junk,
> then when i do anchor.getparent() on my 'text' it'll traceback..
> ergo the question, how do i set a replacement char within 'requests'

I see the following options for you:

  * you look at the code (of "requests.Session"), determine
    where "s.encoding" is handled and check whether that code
    also supports a replacement strategy.
    Then, you use this knowledge to set up your replacement.

  * you avoid the unicode translation functionality of "requests.Session".
    If it does not directly support this, you can trick it
    by using the "iso-8859-1" encoding (this maps bytes to the first
    256 unicode codepoints in a one-to-one way) and then
    do the unicode handling in your own code -- with facilities
    you already know of (including replacement)

  * you contact the website administrator and ask him
    why the delivered pages do not contain valid "gbk" content.

--
https://mail.python.org/mailman/listinfo/python-list
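
The second option above could be sketched roughly as follows. This is only
an illustration under stated assumptions, not code from the original
program: the helper names "decode_with_replacement" and
"redecode_latin1_text" are made up for this example, and the sample bytes
stand in for a downloaded page so that the sketch runs without network
access.

```python
# Sketch only: the helper names below are hypothetical and are not part
# of the "requests" library.


def decode_with_replacement(raw, encoding="gbk"):
    """Decode raw bytes, substituting U+FFFD for bytes that cannot be decoded."""
    return raw.decode(encoding, errors="replace")


def redecode_latin1_text(text, encoding="gbk"):
    """Undo an iso-8859-1 decode, then decode again with replacement.

    iso-8859-1 maps every byte to exactly one code point, so encoding the
    text back with it recovers the bytes that were originally received.
    """
    raw = text.encode("iso-8859-1")
    return decode_with_replacement(raw, encoding)


if __name__ == "__main__":
    # Two valid gbk characters followed by one byte that is not valid gbk.
    payload = u"\u4e2d\u6587".encode("gbk") + b"\xff"
    print(repr(decode_with_replacement(payload)))

    # The same bytes, first decoded as iso-8859-1 (the "trick" encoding).
    as_latin1 = payload.decode("iso-8859-1")
    print(repr(redecode_latin1_text(as_latin1)))
```

With "requests", the raw bytes of a response are available as "r.content",
so they could be passed to the first helper directly. Alternatively, the
encoding can be set on the response object ("r.encoding = 'iso-8859-1'")
before reading "r.text", and the result passed to the second helper.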