william tanksley wrote:
> william tanksley <[EMAIL PROTECTED]> wrote:
>> I'm still puzzled why I'm getting some non-Unicode out of an
>> ElementTree's text, though.
>
> Now I know.
>
> Okay, my answer is that cElementTree (in Python 2.5) is simply
> deranged when it comes to Unicode. It assumes everything's ASCII.
It does not "assume" that. It *requires* byte strings to be ASCII. If it
didn't enforce that, how could it possibly know what encoding they were
using, i.e. what they were supposed to mean at all? Read the Zen of Python:
in the face of ambiguity, ElementTree refuses the temptation to guess.
Python 2.x does exactly the same thing when it comes to implicit conversion
between encoded strings and Unicode strings.

If you want to pass plain ASCII strings, you can pass either a byte string
or a Unicode string (that's purely a convenience feature). If you want to
pass anything that's not ASCII, you *must* pass a Unicode string.

> Reference: http://codespeak.net/lxml/compatibility.html
>
> (Note that the lxml version also doesn't handle Unicode correctly; it
> errors when XML declares its encoding.)

It definitely does "handle Unicode correctly". Let me guess: you tried
passing XML as a Unicode string into the parser, and your XML declared
itself as having a byte encoding (<?xml encoding="..."?>). How can that
*not* be an error?

> This is unpleasant, but at least now I know WHY it was driving me
> insane.

You should *really* read a bit about Unicode and byte encodings. Not
understanding a topic is not a good excuse for complaining that it is
broken for you.

Stefan
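To illustrate the point about byte strings and encodings made above, here is a minimal sketch. It makes several assumptions that are not from the thread: it uses Python 3 syntax (where byte strings are `bytes` and Unicode strings are `str`), it uses the standard library's `xml.etree.ElementTree` rather than cElementTree on Python 2.5 or lxml, and all variable names are made up for illustration. It only shows why non-ASCII bytes are ambiguous without a known encoding, and that XML passed as bytes can carry its own encoding declaration.

```python
import xml.etree.ElementTree as ET

# The same two bytes mean different things under different encodings,
# which is why a library cannot guess what a non-ASCII byte string means.
raw = b"\xc3\xa9"
as_utf8 = raw.decode("utf-8")      # one character: U+00E9
as_latin1 = raw.decode("latin-1")  # two characters: U+00C3, U+00A9

# Pure ASCII bytes are unambiguous, so accepting them is safe.
ascii_text = b"abc".decode("ascii")

# Non-ASCII bytes are rejected when only ASCII is allowed.
try:
    raw.decode("ascii")
    ascii_rejected = False
except UnicodeDecodeError:
    ascii_rejected = True

# XML passed as bytes carries its own encoding declaration, so the parser
# can decode the content without guessing.
doc = b'<?xml version="1.0" encoding="iso-8859-1"?><a>\xe9</a>'
root = ET.fromstring(doc)
parsed_text = root.text  # a Unicode string produced by the parser
```

The design point is the same one the reply makes: the declared encoding belongs with the bytes. Once the text has been decoded to a Unicode string, a byte-encoding declaration no longer describes it.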