On Mon, Jun 3, 2013 at 4:46 PM, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> Then, when
> you try to read the file names in UTF-8, you hit an illegal byte, half of
> a surrogate pair perhaps, and everything blows up.

Minor quibble: Surrogates are an artifact of UTF-16, so they're 16-bit values like 0xD808 or 0xDF45. Possibly what you're talking about here is a continuation byte, which in UTF-8 is used only after a lead byte. For instance, 0xF0 0x92 0x8D 0x85 is valid UTF-8, but 0x41 0x92 is not.

There is one other really annoying thing to deal with, and that's the theoretical UTF-8 encoding of a UTF-16 surrogate. (I say "theoretical" because strictly, these are invalid; UTF-8 does not encode invalid codepoints.) 0xED 0xA0 0x88 and 0xED 0xBD 0x85 encode the two I mentioned above. Depending on what's reading the filename, these might throw errors, or they might not. Python's decoder is correctly strict:

>>> str(b'\xed\xa0\x88','utf-8')
Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    str(b'\xed\xa0\x88','utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-2: invalid continuation byte

Actually, I'm not sure here, but I think that error message may be wrong, or at least unclear. It's perfectly possible to decode those bytes using the UTF-8 algorithm; you end up with the value 0xD808, which you then reject because it's a surrogate. But maybe the Python UTF-8 decoder simplifies some of that.

ChrisA
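
For anyone who wants to check the byte arithmetic, here is a minimal sketch, assuming Python 3. The helper names (decode_three_byte, is_surrogate, strict_decode_fails) are made up for illustration and are not part of any library. It applies the plain three-byte UTF-8 bit layout with no validity checks, then shows that the resulting values fall in the surrogate range and that the strict decoder rejects the same bytes.

```python
def decode_three_byte(seq):
    """Apply the raw three-byte UTF-8 bit layout, with no validity checks.

    Layout: 1110xxxx 10xxxxxx 10xxxxxx (hypothetical helper, illustration only).
    """
    b0, b1, b2 = seq  # iterating over bytes yields ints in Python 3
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_surrogate(code_point):
    """True if the value lies in the UTF-16 surrogate range."""
    return 0xD800 <= code_point <= 0xDFFF


def strict_decode_fails(data):
    """True if the default (strict) UTF-8 decoder rejects the bytes."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return True
    return False


high = decode_three_byte(b'\xed\xa0\x88')
low = decode_three_byte(b'\xed\xbd\x85')
print(hex(high), hex(low))                    # 0xd808 0xdf45
print(is_surrogate(high), is_surrogate(low))  # True True

# The strict decoder refuses the encoded surrogate and the stray
# continuation byte, but accepts the proper four-byte sequence.
print(strict_decode_fails(b'\xed\xa0\x88'))      # True
print(strict_decode_fails(b'\x41\x92'))          # True
print(strict_decode_fails(b'\xf0\x92\x8d\x85'))  # False
```

The four-byte sequence 0xF0 0x92 0x8D 0x85 decodes to U+12345, whose UTF-16 surrogate pair is exactly 0xD808 0xDF45, which is why those two values make a convenient example.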