On Sat, Mar 7, 2015 at 1:50 AM, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> Rustom Mody wrote:
>
>> On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:
>
> [snip example of an analogous situation with NULs]
>
>> Strawman.
>
> Sigh. If I had a dollar for every time somebody cried "Strawman!" when what
> they really should say is "Yes, that's a good argument, I'm afraid I can't
> argue against it, at least not without considerable thought", I'd be a
> wealthy man...

If I had a dollar for every time anyone said "If I had <insert
currency unit here> for every time...", I'd go meta all day long and
profit from it... :)

> - If you are writing your own file system layer, it's 2015 fer fecks sake,
> file names should be Unicode strings, not bytes! (That's one part of the
> Unix model that needs to die.) You can use UTF-8 or UTF-16 in the file
> system, whichever you please, but again remember that both are
> variable-width formats.

I agree that that part of the Unix model needs to change, but there
are two viable ways to move forward:

1) Keep file names as bytes, but mandate that they be valid UTF-8
streams, and recommend that they be decoded UTF-8 for display to a
human
2) Change the entire protocol stack from the file system upwards so
that file names become Unicode strings.

Trouble with #2 is that file names need to be passed around somehow,
which means bytes in memory. So ultimately, #2 really means "keep file
names as bytes, and mandate an encoding all the way up the stack"...
so it's a massive documentation change that really comes down to the
same thing as #1.

This is one area where, as I understand it, Mac OS got it right. It's
time for other Unix variants to adopt the same policy. The bulk of
file names will be ASCII-only anyway, so requiring UTF-8 won't affect
them; a lot of others are already UTF-8; so all we need is a
transition scheme for the remaining ones. If there's a known FS
encoding, it ought to be possible to have a file system conversion
tool that goes through everything, decodes, re-encodes UTF-8, and then
flags the file system as UTF-8 compliant. All that'd be left would be
the file names that are broken already - ones that don't decode in the
FS encoding - and there's nothing to be done with them but wrap them
up into something probably-meaningless-but reversible.

When can we start doing this? ext5?

ChrisA
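
For what it's worth, the per-name step of the conversion tool described above might be sketched roughly like this in Python. This is only an illustration under stated assumptions: `convert_name`, `wrap_broken`, `unwrap_broken` and the `%bytes-` prefix are made-up names, and hex-wrapping is just one possible reversible scheme, not something any existing file system does.

```python
# Hypothetical marker for names that could not be decoded; not a real convention.
WRAP_PREFIX = b"%bytes-"


def wrap_broken(raw):
    """Wrap undecodable bytes as an ASCII (hence valid UTF-8) name."""
    return WRAP_PREFIX + raw.hex().encode("ascii")


def unwrap_broken(wrapped):
    """Recover the original bytes from a name produced by wrap_broken()."""
    return bytes.fromhex(wrapped[len(WRAP_PREFIX):].decode("ascii"))


def convert_name(raw, fs_encoding):
    """Return (new_name, was_clean) for a single file name.

    Names that decode in the known FS encoding are re-encoded as UTF-8.
    Names that do not decode are wrapped into something meaningless
    but reversible.
    """
    try:
        text = raw.decode(fs_encoding)  # strict decoding: broken names raise
    except UnicodeDecodeError:
        return wrap_broken(raw), False
    return text.encode("utf-8"), True
```

ASCII-only names come out unchanged, which matches the point that the bulk of file names would not be affected. A real tool would also have to handle collisions (two old names mapping to the same new name, or a legitimate name that already starts with the prefix) and perform the renames on the file system itself; the sketch ignores both.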