On Sat, Mar 19, 2016 at 8:28 AM, Marko Rauhamaa <ma...@pacujo.net> wrote:
> Chris Angelico <ros...@gmail.com>:
>
>> The problem is not Python's Unicode strings, then. The problem is the
>> notion that path names are text. If they're text, they should be
>> exclusively text (although, for low-level efficiency, they're more
>> likely to be defined as "valid UTF-8 sequences" rather than "sequences
>> of Unicode codepoints"); since they're not, they are fundamentally
>> bytes. But that's not a problem with Python - it's a problem with the
>> file system.
>
> The file system does not have a problem. Python has a problem because it
> tries to present pathnames as Unicode strings, which isn't always
> possible.

But what does a file name *mean*? If it has no meaning, we should simply use a hierarchical tree of IDs. The point of a file *name* is that it has meaning to a human, which implies that it carries text, not bytes. So I maintain that the problem here is with the file system; it permits (for historical reasons) arbitrary byte sequences.

If I were building an entire OS ecosystem from scratch today, I'd probably do a lot of things with a hybrid system of documented meaning atop implementation-detail APIs. In this particular case, I would define the API in terms of byte sequences, but clearly document that these byte sequences are to be understood to mean text strings, and thus must be valid UTF-8. It's still efficient (moving bytes around the kernel is easier than having heaps of text<->bytes transitions), but it allows future changes to depend on all non-broken usage fitting this pattern.

ChrisA
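
For illustration only, here is a rough Python sketch of the "bytes at the API, documented as UTF-8" idea described above. The helper name validate_name is hypothetical, not an existing API, and the ban on '/' and NUL is simply the usual POSIX rule assumed for the example. The last few lines show, for contrast, how a non-UTF-8 name can be round-tripped through a str using the surrogateescape error handler.

```python
def validate_name(raw: bytes) -> str:
    """Hypothetical check: accept a file name only if it is valid UTF-8.

    The name stays bytes at the API boundary; decoding is just the
    validity test, and the decoded text is returned for display.
    """
    # Assumption: same reserved bytes as a POSIX path component.
    if b"/" in raw or b"\0" in raw:
        raise ValueError("name may not contain '/' or NUL")
    try:
        return raw.decode("utf-8")  # strict: rejects malformed sequences
    except UnicodeDecodeError as exc:
        raise ValueError(f"not valid UTF-8: {raw!r}") from exc


good = "r\u00e9sum\u00e9.txt".encode("utf-8")
bad = b"caf\xe9.txt"  # Latin-1 encoded, so not valid UTF-8

print(ascii(validate_name(good)))

try:
    validate_name(bad)
except ValueError as err:
    print("rejected:", err)

# Contrast: with arbitrary bytes permitted, the undecodable byte has to be
# smuggled through a str as a lone surrogate so it can be restored later.
smuggled = bad.decode("utf-8", "surrogateescape")
assert smuggled.encode("utf-8", "surrogateescape") == bad
```

Under a rule like this, the second name would be refused when it is created, so nothing downstream would need the surrogate trick.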