> I have found that os.listdir() does not always return unicode objects when
> passed a unicode path. Sometimes "byte strings" are returned in the list,
> mixed-in with unicodes.

Yes. It does so when it fails to decode the byte string according to the
file system encoding (which, in turn, is based on the locale).

> I will try the technique given
> on: http://www.pyzine.com/Issue008/Section_Articles/article_Encodings.html#guessing-the-encoding
> Perhaps that will help.

I would advise against such a strategy. Instead, you should first
understand what the encodings of the file names actually *are*, on
a real system, and draw conclusions from that.

> I gather you mean that I should get a unicode path, encode it to a byte
> string
> and then pass that to os.listdir
> Then, I suppose, I will have to decode each resulting byte string (via the
> detect routines mentioned in the link above) back into unicode - passing
> those I simply cannot interpret.

That's what I meant, yes. Again, you have a number of options - passing
those that you cannot interpret is but one option. Another option is
to accept mojibake.

>> Then, if the locale's encoding cannot decode the file names, you have
>> several options
>> a) don't try to interpret the file names as character strings, i.e.
>>    don't decode them. Not sure why you need the file names - if it's
>>    only to open the files, and never to present the file name to the
>>    user, not decoding them might be feasible
> So, you reckon I should stick to byte-strings for the low-level file open
> stuff? It's a little complicated by my using Python Imaging to access the
> font files. It hands it all over to Freetype and really leaves my sphere of
> savvy.
> I'll do some testing with PIL and byte-string filenames. I wish my memory was
> better, I'm pretty sure I've been down that road and all my results kept
> pushing me to stick to unicode objects as far as possible.

I would be surprised if PIL/freetype did not support byte string
file names when you read those directly from the disk. OTOH, if
the user has selected/typed a string at a GUI, and you encode
that - I can easily see how that might have failed.
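
As a rough sketch of the encode/listdir/decode approach discussed above:
the helper names decode_name and list_dir_pairs are made up for
illustration, and returning None for a name that cannot be interpreted is
just one of the options, not the only one.

```python
import os
import sys


def decode_name(raw, encoding):
    """Return (raw, text); text is None if raw cannot be decoded."""
    try:
        return raw, raw.decode(encoding)
    except UnicodeDecodeError:
        # Leave the decision (skip, mojibake, ...) to the caller.
        return raw, None


def list_dir_pairs(path, encoding=None):
    """List a directory by byte-string path and decode each name ourselves."""
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    raw_path = path.encode(encoding)
    # With a byte-string path, os.listdir returns byte strings only,
    # so the result is never a mix of the two types.
    return [decode_name(raw, encoding) for raw in os.listdir(raw_path)]
```

The byte string in each pair is what came from the disk, so it can be
handed back to open() or os.path.join (together with raw_path) unchanged.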
>> That's correct, and there is no solution (not in Python, not in any
>> other programming language). You have to make trade-offs. For that,
>> you need to analyze precisely what your requirements are.
> I would say the requirements are:
> 1. To open font files from any source (locale.)
> 2. To display their filename on the gui and the console.
> 3. To fetch some text meta-info (family etc.) via PIL/Freetype and display
> same.
> 4. To write the path and filename to text files.
> 5. To make soft links (path + filename) to another path.
>
> So, there's a lot of unicode + unicode and os.path.join and so forth going on.

I notice that this doesn't include "to allow the user to enter file names",
so it seems there is no input of file names, only output.

Then I suggest this technique of keeping bytestring/unicode string
pairs. Use the Unicode string for display, and the byte string for
accessing the disk.

>>> I went through this exercise recently and had no joy. It seems the string
>>> I chose to use simply would not render - even under 'ignore' and
>>> 'replace'.
>> I don't understand what "would not render" means.
> I meant it would not print the name, but constantly throws ascii related
> errors.

That cannot be. Both the ignore and the replace error handlers will
silence all decoding errors.

> I don't know if the character will survive this email, but the text I was
> trying to display (under LANG=C) in a python script (not the immediate-mode
> interpreter) was: "MÖgul". The second character is a capital O with an umlaut
> (double-dots I think) above it. For some reason I could not get that to
> display as "M?gul" or "Mgul".

I see no problem with that:

>>> u"M\xd6gul".encode("ascii","ignore")
'Mgul'
>>> u"M\xd6gul".encode("ascii","replace")
'M?gul'

Regards,
Martin
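
P.S. A minimal sketch of the pair technique, in case it helps. The names
make_pair and console_text are hypothetical, and using the "replace"
error handler for the display string is an assumption on my part; you may
prefer a different fallback.

```python
def make_pair(raw_name, encoding):
    """Return (raw_name, display_name) for a file name read from disk."""
    # "replace" never raises: undecodable bytes become U+FFFD.
    return raw_name, raw_name.decode(encoding, "replace")


def console_text(display_name):
    """ASCII-only rendering of the display string, e.g. under LANG=C."""
    # Same call as in the interpreter session above, decoded back to text.
    return display_name.encode("ascii", "replace").decode("ascii")
```

Keep raw_name for open(), os.symlink and friends, and use only
display_name (or console_text(display_name)) for the GUI and the console.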