William Johnston, 29.07.2010 14:12:
I have a Python app that parses XML files and then writes to text files.

XML or HTML?


However, the output text file is "sometimes" encoded in some Asian language.

Here is my code:


encoding = "iso-8859-1"
clean_sent = nltk.clean_html(sent.text)
clean_sent = clean_sent.encode(encoding, "ignore")


I also tried "UTF-8" encoding, but received the same results.

What result?

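Maybe NLTK cannot determine the encoding of the HTML file (because the file is broken and/or doesn't correctly specify its own encoding) and thus fails to decode it? If the input bytes are decoded with the wrong codec, the text is already garbled before your encode() call, so changing the output encoding will not help.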
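
To illustrate the point about decoding, here is a minimal sketch of handling the encoding explicitly on both the input and the output side. It assumes Python 3 and that you have the raw bytes of the file at hand; decode_html and write_text are hypothetical helper names made up for this example, not part of NLTK, and the fallback to iso-8859-1 is only one possible policy.

```python
import io


def decode_html(raw_bytes, declared_encoding="utf-8"):
    """Decode raw bytes to text using the encoding we believe is correct.

    Assumption: if the declared encoding fails, fall back to iso-8859-1,
    which maps every byte to a character and therefore never raises.
    """
    try:
        return raw_bytes.decode(declared_encoding)
    except UnicodeDecodeError:
        return raw_bytes.decode("iso-8859-1")


def write_text(path, text, encoding="utf-8"):
    """Write text to a file, stating the output encoding explicitly."""
    with io.open(path, "w", encoding=encoding) as handle:
        handle.write(text)
```

Whatever reads the output file afterwards (editor, viewer, another script) has to use the same encoding that was passed to write_text, otherwise the text can still show up as unrelated characters.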

Stefan

--
http://mail.python.org/mailman/listinfo/python-list
