William Johnston, 29.07.2010 14:12:
> I have a Python app that parses XML files and then writes to text files.

XML or HTML?

> However, the output text file is "sometimes" encoded in some Asian language.
> Here is my code:
>
> encoding = "iso-8859-1"
> clean_sent = nltk.clean_html(sent.text)
> clean_sent = clean_sent.encode(encoding, "ignore");
>
> I also tried "UTF-8" encoding, but received the same results.

What result?
Maybe NLTK cannot determine the encoding of the HTML file (because the
file is broken and/or doesn't correctly specify its own encoding) and thus
fails to decode it correctly?
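
If that is the cause, one way to rule it out is to decode the raw bytes
yourself before any text processing, instead of relying on auto-detection.
The following is only a minimal sketch, assuming Python 3 and that you have
the file contents as bytes. The helper name decode_html and its regular
expression are made up for illustration and are not part of NLTK.

```python
import re

# Hypothetical helper (not an NLTK API): look for a declared charset
# in the first bytes of the document, e.g. <meta charset="utf-8">.
_CHARSET_RE = re.compile(rb'charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def decode_html(raw, fallback="iso-8859-1"):
    """Decode raw HTML bytes to str, preferring the declared charset."""
    candidates = []
    match = _CHARSET_RE.search(raw[:2048])
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates.append("utf-8")
    for name in candidates:
        try:
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown or wrong encoding, try the next candidate
    # With the default fallback this cannot fail, because iso-8859-1
    # maps every possible byte value to a character.
    return raw.decode(fallback)
```

You would then pass the decoded string on to the cleaning step and write the
result with an explicit encoding, e.g. open(path, "w", encoding="utf-8"), so
that both ends of the pipeline use encodings you chose rather than guessed.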
Stefan
--
http://mail.python.org/mailman/listinfo/python-list