Cameron Simpson wrote:
> On 22Apr2009 08:50, Martin v. Löwis <mar...@v.loewis.de> wrote:
> | File names, environment variables, and command line arguments are
> | defined as being character data in POSIX;
>
> Specific citation please? I'd like to check the specifics of this.

For example, on environment variables:

http://opengroup.org/onlinepubs/007908799/xbd/envvar.html

# For values to be portable across XSI-conformant systems, the value
# must be composed of characters from the portable character set (except
# NUL and as indicated below).
# Environment variable names used by the utilities in the XCU
# specification consist solely of upper-case letters, digits and the "_"
# (underscore) from the characters defined in Portable Character Set.
# Other characters may be permitted by an implementation;

Or, on command line arguments:

http://opengroup.org/onlinepubs/007908799/xsh/execve.html

# The arguments represented by arg0, ... are pointers to null-terminated
# character strings

where a character string is "A contiguous sequence of characters
terminated by and including the first null byte.", and a character
is

# A sequence of one or more bytes representing a single graphic symbol
# or control code. This term corresponds to the ISO C standard term
# multibyte character (multi-byte character), where a single-byte
# character is a special case of a multi-byte character. Unlike the
# usage in the ISO C standard, character here has no necessary
# relationship with storage space, and byte is used when storage space
# is discussed.

> So you're proposing that all POSIX OS interfaces (which use byte strings)
> interpret those byte strings into Python3 str objects, with a codec
> that will accept arbitrary byte sequences losslessly and is totally
> reversible, yes?

Correct.

> And, I hope, that the os.* interfaces silently use it by default.

Correct.

> | Applications that need to process the original byte
> | strings can obtain them by encoding the character strings with the
> | file system encoding, passing "python-escape" as the error handler
> | name.
>
> -1
>
> This last sentence kills the idea for me, unless I'm missing something.
> Which I may be, of course.
>
> POSIX filesystems _do_not_ have a file system encoding.

Why is that a problem for the PEP?

> If I'm writing a general purpose UNIX tool like chmod or find, I expect
> it to work reliably on _any_ UNIX pathname. It must be totally encoding
> blind. If I speak to the os.* interface to open a file, I expect to hand
> it bytes and have it behave.

See the other messages. If you want to do that, you can continue to.

> I'm very much in favour of being able to work in strings for most
> purposes, but if I use the os.* interfaces on a UNIX system it is
> necessary to be _able_ to work in bytes, because UNIX file pathnames
> are bytes.

Please re-read the PEP. It provides a way of being able to access any
POSIX file name correctly, and still pass strings.

> If there isn't a byte-safe os.* facility in Python3, it will simply be
> unsuitable for writing low level UNIX tools.

Why is that? The mechanism in the PEP is precisely defined to allow
writing low level UNIX tools.

> Finally, I have a small python program whose whole purpose in life
> is to transcode UNIX filenames before transfer to a MacOSX HFS
> directory, because of HFS's enforced particular encoding. What approach
> should a Python app take to transcode UNIX pathnames under your scheme?

Compute the corresponding character strings, and use them.

Regards,
Martin
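
A minimal sketch of the round trip discussed in this message, not taken from the PEP text. The draft quoted here names the error handler "python-escape"; the handler that PEP 383 ultimately specified, and that released Python 3 versions ship, is named "surrogateescape", so that is the name used below. The UTF-8 codec, the sample file name, and the helper transcode_name (including its choice of decomposed UTF-8 as the target form) are assumptions made purely for illustration.

```python
import unicodedata

# Hypothetical raw POSIX file name; the byte 0xFF never occurs in valid UTF-8.
raw_name = b"report-\xff.txt"

# bytes -> str: each undecodable byte becomes a lone surrogate (0xFF -> U+DCFF).
text_name = raw_name.decode("utf-8", "surrogateescape")

# str -> bytes: the same codec and handler restore the original bytes exactly.
assert text_name.encode("utf-8", "surrogateescape") == raw_name


def transcode_name(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode a file name for a target that wants decomposed UTF-8.

    Hypothetical helper: the NFD/UTF-8 target is an assumption of this
    sketch, not something stated in the thread.
    """
    # Compute the corresponding character string first ...
    text = raw.decode(source_encoding, "surrogateescape")
    # ... then work on characters, here by decomposing accented letters.
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("utf-8", "surrogateescape")


# Latin-1 "é" (0xE9) becomes "e" followed by U+0301 (bytes 0xCC 0x81).
assert transcode_name(b"caf\xe9", "latin-1") == b"cafe\xcc\x81"
```

On POSIX systems, os.fsdecode() and os.fsencode() in current Python 3 apply this handler together with the file system encoding, so a tool can hold names as str and still recover the exact bytes when it needs them.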