On *nix, file names are bytes. In real life, we prefer to think of file names as strings. How non-ASCII file names are created is determined by the locale, and on most systems these days, every locale uses UTF-8 and everybody's happy. Of course this doesn't mean you'll never run into an old directory tree from the pre-UTF-8 age that uses some other encoding, and it doesn't prevent people from doing silly things in file names.
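To make the locale point concrete, here is a small sketch of how one and the same name becomes different bytes under two encodings. It is only an illustration with made-up variable names; nothing is written to disk.

```python
# Illustrative sketch: the same name, two encodings, two different byte strings.
name = '€.py'

utf8_bytes = name.encode('utf-8')      # bytes a UTF-8 locale would produce
latin9_bytes = name.encode('latin9')   # bytes an ISO-8859-15 locale would produce

print(utf8_bytes)    # b'\xe2\x82\xac.py'
print(latin9_bytes)  # b'\xa4.py'

# The Latin-9 bytes are not valid UTF-8, so a strict decode fails.
try:
    latin9_bytes.decode('utf-8')
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

print(strict_decode_ok)  # False
```

A file created under the second encoding is exactly the kind of name that a UTF-8 system can store but cannot decode.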
Python deals with this tolerably well: by convention, file names are strings, but you can use bytes for file names if you wish. The docs [1] warn you about the situation.

[1] https://docs.python.org/3/library/os.path.html

If Python runs into a non-UTF8 (better: non-decodable) file name and has to return a str, it uses surrogate escape codes. So far so good. Right?

This leads to the unfortunate situation that you can't always print() file names, as print() is strict and refuses to toy with surrogates.

To be more explicit, the script

    print(__file__)

will fail depending on the file name. This feels wrong... (though every bit of behaviour is correct)

(The situation can't arise on Windows, and Python 2 will pretend nothing happened in true UNIX style)

Demo script to try at home below.

-- Thomas

# -*- coding: UTF-8 -*-
from __future__ import unicode_literals, print_function

import sys
import os.path
import subprocess
import tempfile
import shutil

script = 'print(__file__)\n'
file_names = ['🐪.py', '€.py', '€.py'.encode('latin9')]

PY = sys.executable

tmpdir = tempfile.mkdtemp()
for fn in file_names:
    if isinstance(fn, bytes):
        path = os.path.join(tmpdir.encode('ascii'), fn)
    else:
        path = os.path.join(tmpdir, fn)
    print('► creating', path)
    with open(path, 'w') as fp:
        fp.write(script)
    print('► running', PY, path)
    status = subprocess.call([PY, path])
    print('► exited with status', status)

print('► cleaning up')
shutil.rmtree(tmpdir)

# End of script

#######################################################################
# Output from Python 3.6.5 on Linux (Ubuntu 18.04)::
#
#   ► creating /tmp/tmp_a4h5n22/🐪.py
#   ► running /usr/bin/python3 /tmp/tmp_a4h5n22/🐪.py
#   /tmp/tmp_a4h5n22/🐪.py
#   ► exited with status 0
#   ► creating /tmp/tmp_a4h5n22/€.py
#   ► running /usr/bin/python3 /tmp/tmp_a4h5n22/€.py
#   /tmp/tmp_a4h5n22/€.py
#   ► exited with status 0
#   ► creating b'/tmp/tmp_a4h5n22/\xa4.py'
#   ► running /usr/bin/python3 b'/tmp/tmp_a4h5n22/\xa4.py'
#   Traceback (most recent call last):
#     File "/tmp/tmp_a4h5n22/\udca4.py", line 1, in <module>
#       print(__file__)
#   UnicodeEncodeError: 'utf-8' codec can't encode character '\udca4' in position 17: surrogates not allowed
#   ► exited with status 1
#   ► cleaning up
#
# Python 2.7.15rc1 on Linux (Ubuntu):
#
#   ► creating /tmp/tmp_U_LPp/🐪.py
#   ► running /usr/bin/python2 /tmp/tmp_U_LPp/🐪.py
#   /tmp/tmp_U_LPp/🐪.py
#   ► exited with status 0
#   ► creating /tmp/tmp_U_LPp/€.py
#   ► running /usr/bin/python2 /tmp/tmp_U_LPp/€.py
#   /tmp/tmp_U_LPp/€.py
#   ► exited with status 0
#   ► creating /tmp/tmp_U_LPp/�.py
#   ► running /usr/bin/python2 /tmp/tmp_U_LPp/�.py
#   /tmp/tmp_U_LPp/�.py
#   ► exited with status 0
#   ► cleaning up
#
# Python 3.7.0 on Windows 10::
#
#   ► creating C:\Users\tjol\AppData\Local\Temp\tmpzprwnyc2\🐪.py
#   ► running C:\Users\tjol\AppData\Local\Programs\Python\Python37\python.exe C:\Users\tjol\AppData\Local\Temp\tmpzprwnyc2\�
#   �.py
#   C:\Users\tjol\AppData\Local\Temp\tmpzprwnyc2\🐪.py
#   ► exited with status 0
#   ► creating C:\Users\tjol\AppData\Local\Temp\tmpzprwnyc2\€.py
#   ► running C:\Users\tjol\AppData\Local\Programs\Python\Python37\python.exe C:\Users\tjol\AppData\Local\Temp\tmpzprwnyc2\€
#   .py
#   C:\Users\tjol\AppData\Local\Temp\tmpzprwnyc2\€.py
#   ► exited with status 0
#   ► creating b'C:\\Users\\tjol\\AppData\\Local\\Temp\\tmpzprwnyc2\\\xa4.py'
#   Traceback (most recent call last):
#     File ".\bytes_file_names2.py", line 25, in <module>
#       with open(path, 'w') as fp:
#   UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa4 in position 45: invalid start byte
#
# Python 2.7.15 on Windows 10:
#
#   Traceback (most recent call last):
#     File ".\bytes_file_names2.py", line 24, in <module>
#       print('Ôû║ creating', path)
#     File "C:\Python27\lib\encodings\cp850.py", line 12, in encode
#       return codecs.charmap_encode(input,errors,encoding_map)
#   UnicodeEncodeError: 'charmap' codec can't encode character u'\u25ba' in position 0: character maps to <undefined>
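For anyone who wants to poke at the surrogate escapes without creating files, here is a minimal Python 3 sketch. It assumes a UTF-8 file system encoding and decodes by hand with the 'surrogateescape' error handler, which mirrors what Python 3 does with undecodable file names on such a system. printable() is a hypothetical helper name, not something from the standard library.

```python
raw = b'\xa4.py'                                 # '€.py' encoded as Latin-9
name = raw.decode('utf-8', 'surrogateescape')    # the str form: '\udca4.py'

# The original bytes survive the round trip.
assert name.encode('utf-8', 'surrogateescape') == raw

# A strict encode, as print() does with default settings, rejects the surrogate.
try:
    name.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

def printable(file_name):
    """Hypothetical helper: escape lone surrogates so the result can be printed."""
    return file_name.encode('utf-8', 'backslashreplace').decode('utf-8')

print(printable(name))   # \udca4.py
```

If quotes around the name are acceptable, print(ascii(name)) is a shorter way to get something printable, though it escapes every non-ASCII character and not just the surrogates.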