On Thu, 8 Dec 2005, "Martin v. Löwis" wrote:
> utabintarbo wrote:
>
>> Fredrik, you are a God! Thank You^3. I am unworthy </ass-kiss-mode>
>
> For all those who followed this thread, here is some more explanation:
>
> Apparently, utabintarbo managed to get U+2592 (MEDIUM SHADE, a filled
> 50% grayish square) and U+2524 (BOX DRAWINGS LIGHT VERTICAL AND LEFT, a
> vertical line in the middle, plus a line from that going left) into a
> file name. How he managed to do that, I can only guess: most likely, the
> Samba installation assumes that the file system encoding on the Solaris
> box is some IBM code page (say, CP 437 or CP 850). If so, the byte on
> disk would be \xb4. Where this came from, I have to guess further:
> perhaps it is ACUTE ACCENT from ISO-8859-*.
>
> Anyway, when he used listdir() to get the contents of the directory,
> Windows applies the CP_ACP encoding (known as "mbcs" in Python). For
> reasons unknown to me, the US and several European versions of XP map
> this to \xa6, VERTICAL BAR (I can somewhat see that as meaningful for
> U+2524, but not for U+2592).
>
> So when he then applies isfile to that file name, \xa6 is mapped to
> U+00A6, which then isn't found on the Samba side.
>
> So while Unicode here is the solution, the problem is elsewhere; most
> likely in a misconfiguration of the Samba server (which assumes some
> encoding for the files on disk, yet the AIX application uses a different
> encoding).

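The byte values in the explanation above are easy to check with Python's
bundled codecs. This is only a sketch: it looks at what the two characters
encode to in CP 437 and CP 850, and what those same bytes mean when read as
ISO-8859-1. It says nothing about what Samba was actually configured to do,
and the variable names are mine.

```python
import unicodedata

# The two characters that turned up in the file name.
shade = "\u2592"
tee = "\u2524"
names = [unicodedata.name(c) for c in (shade, tee)]

# The bytes those characters occupy in the two IBM code pages mentioned.
cp437_bytes = (shade + tee).encode("cp437")
cp850_bytes = (shade + tee).encode("cp850")

# The same bytes, read as ISO-8859-1 instead.
as_latin1 = [unicodedata.name(c) for c in cp437_bytes.decode("iso-8859-1")]

print(names, cp437_bytes, cp850_bytes, as_latin1)
```

So \xb4 is indeed ACUTE ACCENT in ISO-8859-1, and the \xb1 that goes with
U+2592 is PLUS-MINUS SIGN.
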
Isn't the key thing that Windows is applying a non-roundtrippable
character encoding? If I've understood this right, Samba and Windows are
talking in Unicode, with these (probably quite spurious, but never mind)
U+25xx characters, and Samba is presenting a quite consistent view of the
world: there's a file called "double bucky backslash grey box" in the
directory listing, and if you ask for a file called "double bucky backslash
grey box", you get it. Windows, however, maps that name to the 8-bit
string "double bucky backslash vertical bar", but when you pass *that*
back to it, it gets decoded as the Unicode string "double bucky backslash
vertical bar", which Samba then doesn't recognise.
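
To make the round-trip failure concrete, here is a minimal sketch in Python.
The real substitution table lives inside Windows; the two-entry BEST_FIT dict
below is an assumption reconstructed from the behaviour reported in this
thread, and to_ansi/from_ansi are hypothetical helpers, not any real API. I am
also assuming the ANSI code page is cp1252, where \xa6 decodes to U+00A6
(which Unicode calls BROKEN BAR).

```python
# Assumed stand-in for the lossy conversion Windows applied; NOT the real table.
BEST_FIT = {"\u2592": b"\xa6", "\u2524": b"\xa6"}


def to_ansi(name):
    """Unicode name -> 8-bit name, substituting lookalikes (lossy)."""
    out = bytearray()
    for ch in name:
        out += BEST_FIT[ch] if ch in BEST_FIT else ch.encode("cp1252")
    return bytes(out)


def from_ansi(raw):
    """8-bit name -> Unicode name, as done when the name is passed back in."""
    return raw.decode("cp1252")


original = "\u2592\u2524"      # the name Samba knows the file by
listed = to_ansi(original)     # what an 8-bit listdir() hands back
reopened = from_ansi(listed)   # what an 8-bit isfile() ends up asking for

print(repr(listed), repr(reopened), reopened == original)
```

The name that comes back is two U+00A6 characters, which is not the name that
went in, so the lookup on the Samba side fails.
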
I don't know what Windows *should* do here. I know it shouldn't do this:
it breaks some very basic invariants about files and directories, hence
the kind of confusion utabintarbo suffered. The solution is either to
apply an information-preserving encoding (UTF-8, say), or to refuse to do
it at all (i.e., raise an error if there are unencodable characters),
neither of which is a particularly beautiful solution. I think Windows is
in a bit of a rock/hard place situation here, poor thing.
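
For what it's worth, both alternatives can be sketched with Python's own
codecs. This only illustrates the two behaviours; it is not what Windows does,
and cp1252 is again my assumption for the 8-bit code page.

```python
name = "\u2592\u2524"

# Option 1: an information-preserving encoding. UTF-8 can represent every
# Unicode character, so encoding and then decoding gives the name back.
utf8_round_trip = name.encode("utf-8").decode("utf-8")

# Option 2: refuse. With the default errors="strict", Python's cp1252 codec
# raises instead of substituting a lookalike character.
try:
    name.encode("cp1252")
    refused = False
except UnicodeEncodeError:
    refused = True

print(utf8_round_trip == name, refused)
```

Either way, a name that cannot be represented faithfully never gets silently
turned into a different name.
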
Incidentally, for those who haven't come across CP_ACP before, it's not
yet another character encoding, it's a pseudovalue which means 'the
system's current default character set'.
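
If you want to see what CP_ACP currently resolves to on a given machine,
something like the following should do. It calls the Win32 GetACP function
through ctypes, so it only means anything on Windows; the function name
current_ansi_code_page is my own invention, and returning None elsewhere is
just a convenience for this sketch.

```python
import sys


def current_ansi_code_page():
    """Return the active ANSI code page as a codec-style name such as
    'cp1252', or None when not running on Windows."""
    if sys.platform != "win32":
        return None
    import ctypes  # ctypes.windll only exists on Windows
    return "cp%d" % ctypes.windll.kernel32.GetACP()


print(current_ansi_code_page())
```
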
tom
--
Women are monsters, men are clueless, everyone fights and no-one ever
wins. -- cleanskies