On Thu, 8 Dec 2005, "Martin v. Löwis" wrote:

utabintarbo wrote:

Fredrik, you are a God! Thank You^3. I am unworthy </ass-kiss-mode>

For all those who followed this thread, here is some more explanation:

Apparently, utabintarbo managed to get U+2592 (MEDIUM SHADE, a filled 50% grayish square) and U+2524 (BOX DRAWINGS LIGHT VERTICAL AND LEFT, a vertical line in the middle, plus a line from that going left) into a file name. How he managed to do that, I can only guess: most likely, the Samba installation assumes that the file system encoding on the Solaris box is some IBM code page (say, CP 437 or CP 850). If so, the byte on disk would be \xb4. Where this came from, I have to guess further: perhaps it is ACUTE ACCENT from ISO-8859-*.
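For anyone who wants to check the byte values above, here is a minimal sketch using Python's standard codecs. The byte \xb4 comes from the message. The companion byte \xb1 for MEDIUM SHADE is an assumption added for illustration, based on the CP 437 table and not on anything seen on the actual disk.

```python
import unicodedata

# \xb4 is the byte named in the thread; \xb1 is an assumed companion byte
# (the CP 437 position of MEDIUM SHADE), added only for illustration.
raw = b"\xb1\xb4"

# How the bytes look if the server assumes an IBM code page.
as_cp437 = raw.decode("cp437")
as_cp850 = raw.decode("cp850")

# How the same bytes look if the application meant ISO-8859-1.
as_latin1 = raw.decode("latin-1")

for label, text in [("cp437", as_cp437), ("cp850", as_cp850), ("latin-1", as_latin1)]:
    codepoints = [f"U+{ord(ch):04X}" for ch in text]
    names = [unicodedata.name(ch) for ch in text]
    print(label, codepoints, names)
```

The same two bytes come out as box-drawing characters under either IBM code page and as ordinary Latin-1 punctuation otherwise, which is consistent with the guess that the server and the application disagree about the encoding.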

Anyway, when he used listdir() to get the contents of the directory, Windows applied the CP_ACP encoding (known as "mbcs" in Python). For reasons unknown to me, the US and several European versions of XP map both characters to \xa6, BROKEN BAR (a vertical bar with a gap; I can somewhat see that as meaningful for U+2524, but not for U+2592).

So when he then applies isfile to that file name, \xa6 is mapped to U+00A6, which then isn't found on the Samba side.

So while Unicode here is the solution, the problem is elsewhere; most likely in a misconfiguration of the Samba server (which assumes some encoding for the files on disk, yet the AIX application uses a different encoding).

Isn't the key thing that Windows is applying a non-roundtrippable character encoding? If I've understood this right, Samba and Windows are talking in Unicode, with these (probably quite spurious, but never mind) U+25xx characters, and Samba is presenting a quite consistent view of the world: there's a file called "double bucky backslash grey box" in the directory listing, and if you ask for a file called "double bucky backslash grey box", you get it. Windows, however, maps that name to the 8-bit string "double bucky backslash vertical bar", but when you pass *that* back to it, it gets encoded as the Unicode string "double bucky backslash vertical bar", which Samba then doesn't recognise.
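The broken round trip can be imitated in a few lines. This is only a sketch under stated assumptions: Python's own "cp1252" codec does not do this kind of lossy mapping, so the BEST_FIT table below is a hypothetical stand-in for whatever Windows does internally, and the file name is made up.

```python
# Hypothetical stand-in for the lossy conversion described in the thread:
# both box-drawing characters collapse onto byte 0xA6.
BEST_FIT = {0x2592: 0xA6, 0x2524: 0xA6}


def lossy_encode(name: str) -> bytes:
    """Encode to an 8-bit string, substituting instead of failing."""
    out = bytearray()
    for ch in name:
        if ord(ch) in BEST_FIT:
            out.append(BEST_FIT[ord(ch)])
        else:
            out.extend(ch.encode("cp1252"))
    return bytes(out)


on_server = "report\u2592\u2524.txt"   # made-up name as the server knows it
listed = lossy_encode(on_server)       # what an 8-bit listdir() hands back
asked_for = listed.decode("cp1252")    # what gets sent back for isfile()

print(repr(on_server), repr(asked_for), asked_for == on_server)
```

The name that comes back from the listing is no longer the name the server has, so a lookup with it cannot succeed.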

I don't know what Windows *should* do here. I know it shouldn't do this - this leads to breaking of some very basic invariants about files and directories, and so to the kind of confusion utabintarbo suffered. The solution is either to apply an information-preserving encoding (UTF-8, say), or to refuse to do it at all (i.e., raise an error if there are unencodable characters), neither of which is a particularly beautiful solution. I think Windows is in a bit of a rock/hard place situation here, poor thing.
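Both alternatives are easy to show with Python's standard codecs. This is a rough sketch, not a claim about what Windows could actually adopt. The helper names and the file name are made up for illustration.

```python
def roundtrips_utf8(name: str) -> bool:
    """Option 1: an information-preserving encoding gives the name back intact."""
    return name.encode("utf-8").decode("utf-8") == name


def strict_encode_refuses(name: str, codec: str = "cp1252") -> bool:
    """Option 2: a strict 8-bit encode raises instead of guessing."""
    try:
        name.encode(codec)  # the default error handler is 'strict'
    except UnicodeEncodeError:
        return True
    return False


name = "report\u2592\u2524.txt"  # made-up file name with the two characters
print(roundtrips_utf8(name), strict_encode_refuses(name))
```

Either way the invariant survives: the caller gets back either a name that works or an error, never a name that silently points at nothing.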

Incidentally, for those who haven't come across CP_ACP before, it's not yet another character encoding, it's a pseudovalue which means 'the system's current default character set'.
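To see which real code page hides behind CP_ACP on a given machine, something like the following should work. It is a sketch that assumes a Windows box with ctypes available; the helper name is made up, and on any other platform it simply returns None.

```python
import sys


def ansi_code_page():
    """Return the numeric ANSI code page on Windows, or None elsewhere."""
    if sys.platform != "win32":
        return None
    import ctypes
    # GetACP() is the Win32 call that reports the current ANSI code page.
    return ctypes.windll.kernel32.GetACP()


print(ansi_code_page())
```

On a Western European or US installation this would typically report 1252, but the value depends on the system's settings, which is the whole point of CP_ACP being a pseudovalue.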

tom

--
Women are monsters, men are clueless, everyone fights and no-one ever
wins. -- cleanskies