On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa <ma...@pacujo.net> wrote:
>> There are two things happening here:
>>
>> 1) The underlying file system is not UTF-8, and you can't depend on
>> that,
>
> Correct. Linux pathnames are octet strings regardless of the locale.
>
> That's why Linux developers should refer to filenames using bytes.
> Unfortunately, Python itself violates that principle by having
> os.listdir() return str objects (to mention one example).

Only because you gave it a str with the path name. If you want to
refer to file names using bytes, then be consistent and refer to ALL
file names using bytes. As I demonstrated, that works just fine.

>> 2) You forgot to put the path on that, so it failed to find the file.
>> Here's my version of your demo:
>>
>> >>> open("/tmp/xyz/"+os.listdir('/tmp/xyz')[0])
>> <_io.TextIOWrapper name='/tmp/xyz/\udc80' mode='r' encoding='UTF-8'>
>>
>> Looks fine to me.
>
> I stand corrected.
>
> Then we have:
>
> >>> os.listdir()[0].encode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in
> position 0: surrogates not allowed

So?

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
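
P.S. For anyone following along, here is a minimal sketch of both points
in the thread: bytes in gives bytes out, and a str file name carrying a
lone surrogate can be turned back into the original bytes. It assumes a
Linux-style file system that accepts arbitrary octets in file names, and
a UTF-8 (or ASCII) file system encoding. The function name
listdir_round_trip, the scratch directory and the one-byte file name
b"\x80" are made up for illustration; they are not from the demo quoted
in the thread.

```python
import os
import tempfile


def listdir_round_trip():
    """Create a file whose name is not valid UTF-8 and list it both ways."""
    # Hypothetical scratch directory, removed again on exit.
    with tempfile.TemporaryDirectory() as tmp:
        raw_dir = os.fsencode(tmp)   # the same directory path, as bytes
        raw_name = b"\x80"           # a lone 0x80 byte is not valid UTF-8

        with open(os.path.join(raw_dir, raw_name), "wb") as f:
            f.write(b"hello")

        # bytes path in, bytes names out: no decoding happens at all.
        names_as_bytes = os.listdir(raw_dir)

        # str path in, str names out: the undecodable byte is smuggled
        # through as a lone surrogate by the surrogateescape handler.
        names_as_str = os.listdir(tmp)

        # os.fsencode() reverses that and recovers the original octets.
        recovered = os.fsencode(names_as_str[0])

        # The str form is still a perfectly usable file name.
        with open(os.path.join(tmp, names_as_str[0]), "rb") as f:
            data = f.read()

    return names_as_bytes, names_as_str, recovered, data


def strict_encode_fails(name):
    """Return True if a plain UTF-8 encode rejects the name."""
    try:
        name.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


if __name__ == "__main__":
    as_bytes, as_str, recovered, data = listdir_round_trip()
    print(as_bytes, [ascii(n) for n in as_str], recovered, data)
    print(strict_encode_fails("\udc80"))
```

So the UnicodeEncodeError above is just the strict codec refusing a lone
surrogate. If you need the bytes back, use os.fsencode(), or pass
errors='surrogateescape' to encode(), rather than a bare
encode('utf-8').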