Martin,

> Yes. It does so when it fails to decode the byte string according to the
> file system encoding (which, in turn, bases on the locale).

That's at least one way I can weed out filenames that are going to give me trouble; if Python itself can't figure out how to decode a name, then I can also fail with honour.
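A rough sketch of that weeding-out, under some assumptions: it is written for current Python 3, where names are handled explicitly as bytes, whereas in the Python 2 of this thread os.listdir() given a unicode path hands back the undecodable names as plain byte strings. The helper name split_decodable and the sample file names are made up for illustration.

```python
import sys


def split_decodable(raw_names, encoding=None):
    """Split byte-string file names into (decoded, undecodable) lists.

    `encoding` defaults to the file system encoding; it is a parameter
    only so the behaviour can be checked with a fixed codec.
    """
    encoding = encoding or sys.getfilesystemencoding()
    decoded, undecodable = [], []
    for raw in raw_names:
        try:
            # Strict decoding: any byte sequence invalid in `encoding` raises.
            decoded.append(raw.decode(encoding))
        except UnicodeDecodeError:
            # Keep the raw bytes so the caller can report or skip the file.
            undecodable.append(raw)
    return decoded, undecodable


# Hypothetical names: the second holds a Latin-1 byte that is not valid UTF-8.
good, bad = split_decodable([b"Arial.ttf", b"M\xd6gul.ttf"], encoding="utf-8")
```

With UTF-8 as the assumed encoding, the first name decodes and the second lands in the undecodable list, which is where I would "fail with honour".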
> > I will try the technique given
> > on: http://www.pyzine.com/Issue008/Section_Articles/article_Encodings.html#guessing-the-encoding
> > Perhaps that will help.
> I would advise against such a strategy. Instead, you should first
> understand what the encodings of the file names actually *are*, on
> a real system, and draw conclusions from that.

I don't follow you here. File names *on* a real system are (for Linux) byte strings in potentially *any* encoding, and os.listdir() may even fail to grok some of them. So I will have a few elements in the list that are not unicode, and I can't ask the O/S for any help. I should therefore be able to pass such a byte string to a function like the one suggested in the article, to at least take one last stab at identifying it. Or is that a waste of time because os.listdir() has already tried something similar (and probably better)?

> I notice that this doesn't include "to allow the user to enter file
> names", so it seems there is no input of file names, only output.

I forgot to mention the command-line interface... I actually had trouble with that too. The user can start the app like this:

    fontypython /some/folder/

or

    fontypython SomeFileName

And that introduces input in some kind of encoding. I hope that locale.getpreferredencoding() will be the right one to handle that. Is such input (passed in via sys.argv) byte strings or unicode? I can find out with type(), I guess.

As to the rest, no, there's no other keyboard input for filenames. There *is* a 'filter' which is used as a regex to filter 'bold', 'italic' or whatever. I fully expect that to give me a hard time too.

> Then I suggest this technique of keeping bytestring/unicode string
> pairs. Use the Unicode string for display, and the byte string for
> accessing the disc.

Thanks, that's a good idea - I think I'll implement a dictionary to keep both and work things that way.
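Something like the following is what I have in mind for that dictionary. It is only a sketch, written for current Python 3 rather than the Python 2 discussed here, and the name build_name_table, the UTF-8 default and the sample names are my own assumptions. It keys on the byte string, since that is what is unique on disc, and stores a Unicode string that is only ever used for display.

```python
def build_name_table(raw_names, encoding="utf-8"):
    """Map each on-disc byte name to a Unicode string meant for display only."""
    table = {}
    for raw in raw_names:
        # errors="replace" puts U+FFFD where a byte cannot be decoded, so
        # the display string may be lossy; the key keeps the exact bytes.
        table[raw] = raw.decode(encoding, "replace")
    return table


# Hypothetical names: the second holds a byte that is invalid in UTF-8.
table = build_name_table([b"Arial.ttf", b"M\xd6gul.ttf"])

for raw_name, display_name in table.items():
    # Show display_name to the user; hand raw_name to open() or os.stat().
    pass
```

Keying on the display string instead would risk collisions, because two different byte names can decode to the same lossy text.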
> I see no problem with that:
> >>> u"M\xd6gul".encode("ascii","ignore")
> 'Mgul'
> >>> u"M\xd6gul".encode("ascii","replace")
> 'M?gul'

Well, that was what I expected to see too. I must have been doing something stupid.

\d

-- 
http://mail.python.org/mailman/listinfo/python-list