On Wed, May 27, 2015, at 07:15, anatoly techtonik wrote:
> The solution is to have filter preprocess the binary string to escape all
> non-unicode symbols so that the following lossless transformation
> becomes possible:
>
>    binary -> escaped utf-8 string -> unicode -> binary
>
> I want to know if that's real? I need to accomplish that with
> Python 2.x, but the use case is probably valid for Python 3 as well.

In Python 3, you could *in principle* use surrogateescape (this would be
more of a binary -> escaped unicode workflow), but see below.

It is worth noting that when you *read* POSIX filenames in unicode form
(e.g. listdir with a unicode argument), they are decoded with
surrogateescape, and can be converted back to bytes with
fn.encode(sys.getfilesystemencoding(), errors='surrogateescape').
However, keep in mind that on *Windows* the native filename format is a
sequence of 16-bit WCHAR values, not a sequence of bytes.

> This stuff is critical to port SCons to Python 3.x and I expect for other
> similar tools that have to deal with unknown ascii-binary strings too.

Even if your filename *is* valid UTF-8 (or whatever other encoding), it
might contain invisible control characters that make it difficult to
read. You'd probably be better off working directly with the binary
representation, iterating over it and replacing every byte that is not
printable *ASCII* with an escaped representation. As it happens, the
repr() function should work well for doing exactly this. (Note: repr()
on a *unicode* string in Python 3 passes printable non-ASCII characters
through unescaped, but ideally you're working with byte strings.)

There's no real need to go beyond this unless you're working in a
problem domain where filenames are likely to legitimately include
non-ASCII characters (e.g. documents belonging to non-technical users
who write in languages other than English).
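
For what it's worth, here is a rough sketch of the two approaches above.
It is Python 3 only (surrogateescape does not exist in Python 2), it
assumes UTF-8 as the encoding, and the helper names bytes_to_text,
text_to_bytes and printable are made up for illustration rather than
taken from SCons or the standard library.

```python
def bytes_to_text(raw, encoding="utf-8"):
    # Bytes that cannot be decoded are mapped to lone surrogates
    # (U+DC80..U+DCFF) instead of raising UnicodeDecodeError.
    return raw.decode(encoding, errors="surrogateescape")


def text_to_bytes(text, encoding="utf-8"):
    # Inverse of bytes_to_text: lone surrogates turn back into the
    # original undecodable bytes.
    return text.encode(encoding, errors="surrogateescape")


def printable(raw):
    # repr() of a bytes object escapes everything that is not
    # printable ASCII, including control characters.
    return repr(raw)


# A name with a Latin-1 byte (invalid as UTF-8) and two control characters.
raw = b"caf\xe9\x00name\n.txt"

text = bytes_to_text(raw)
assert text_to_bytes(text) == raw   # binary -> unicode -> binary is lossless

print(printable(raw))               # prints b'caf\xe9\x00name\n.txt'
```

Keep in mind that the str returned by bytes_to_text may contain lone
surrogates, so it is fine as an internal carrier but will raise an
error if you try to encode it strictly (e.g. when printing to a UTF-8
terminal). For display, the repr() form is the safer choice.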