On 06/25/2017 06:19 AM, Rod Person wrote:
> But doing a simple ls of that directory show it is unicode but the
> replacement of the offending character.
>
> http://rodperson.com/graphics/uc/ls.png

Now that is really strange. Your OS does not seem to recognize that the
filename is in UTF-8. I suspect this has something to do with the NAS
file sharing protocol (SMB), though I'm pretty sure that Samba can
handle UTF-8 filenames correctly.

> I am in fact using Python 3.5. I may be lacking in unicode skills but I
> do have the sense enough to know the version of Python I am invoking.
> So I included this screenshot of that so the version of Python and the
> files list returned by os.walk
>
> http://rodperson.com/graphics/uc/files.png

If I create a file that has the U+2019 character in its name on my Linux
machine (Btrfs) and run os.walk on it, I see the character in the string
properly. So it looks like Python does the right thing, automatically
decoding from UTF-8.

In your situation I think the problem is the file sharing protocol that
your NAS is using. Somehow some information is being lost, so your OS
does not know that the filenames are in UTF-8 and just treats them as
bytes. Python therefore doesn't know how to decode the name, and you end
up with each byte being converted to a Unicode code point and shoved
into the string.

How to get around this issue I don't know. Maybe there's a way to
convert the unicode string to bytes using the value of each character,
and then decode that back to unicode.
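
For what it's worth, here is a minimal sketch of that last idea. It assumes the damage really is one code point per original UTF-8 byte, which I have not verified against your NAS. The latin-1 codec maps code points 0-255 straight onto byte values 0-255, so encoding with it recovers the raw bytes, which can then be decoded as UTF-8. The function name repair_mojibake and the sample filename are made up for illustration.

```python
def repair_mojibake(name):
    """Re-interpret a str whose code points are really UTF-8 byte values.

    Hypothetical helper: returns the name unchanged if it does not look
    like byte-per-code-point damage.
    """
    try:
        # latin-1 maps code points 0-255 one-to-one onto byte values 0-255
        raw = name.encode("latin-1")
        # Decode the recovered bytes the way they should have been decoded
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the name holds code points above 255 (already proper
        # text) or the bytes are not valid UTF-8; leave it alone.
        return name


# Simulate the assumed damage: each UTF-8 byte of a name containing
# U+2019 becomes its own code point. The filename is invented.
damaged = "Don\u2019t Stop.mp3".encode("utf-8").decode("latin-1")
print(repr(damaged))
print(repr(repair_mojibake(damaged)))
```

If the names coming back from os.walk turn out to contain lone surrogates instead (which is how Python on Linux represents bytes it could not decode), this will not help, and os.fsencode would be the thing to try to get the original bytes back.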