Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread norseman
Martin v. Löwis wrote: How do get a printable unicode version of these path strings if they contain none unicode data? Define "printable". One way would be to use a regular expression, replacing all codes in a certain range with a question mark. What I mean by printable is that the string must

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Barry Scott
On 30 Apr 2009, at 21:06, Martin v. Löwis wrote: How do get a printable unicode version of these path strings if they contain none unicode data? Define "printable". One way would be to use a regular expression, replacing all codes in a certain range with a question mark. What I mean by pr

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Martin v. Löwis
>>> How do get a printable unicode version of these path strings if they >>> contain none unicode data? >> >> Define "printable". One way would be to use a regular expression, >> replacing all codes in a certain range with a question mark. > > What I mean by printable is that the string must be va

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Barry Scott
On 30 Apr 2009, at 05:52, Martin v. Löwis wrote: How do get a printable unicode version of these path strings if they contain none unicode data? Define "printable". One way would be to use a regular expression, replacing all codes in a certain range with a question mark. What I mean by prin

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
> How do get a printable unicode version of these path strings if they > contain none unicode data? Define "printable". One way would be to use a regular expression, replacing all codes in a certain range with a question mark. > I'm guessing that an app has to understand that filenames come in tw
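A minimal sketch of the regular-expression approach suggested above. The snippet only says "a certain range"; the range U+DC80..U+DCFF used here is an assumption taken from the UTF-8b description elsewhere in this thread, and the names `ESCAPED_BYTE` and `printable` are hypothetical.

```python
import re

# Assumption: undecodable bytes were mapped to the lone surrogates
# U+DC80..U+DCFF, as the UTF-8b description in this thread states.
ESCAPED_BYTE = re.compile("[\udc80-\udcff]")


def printable(name):
    """Return a copy of name with each escaped byte replaced by '?'.

    The result is for display only: it can no longer be converted
    back to the original bytes.
    """
    return ESCAPED_BYTE.sub("?", name)


display = printable("caf\udce9.txt")
```

The original string should still be the one passed back to the operating system; the replaced copy is lossy by design.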

Re: PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Cameron Simpson
On 29Apr2009 23:41, Barry Scott wrote: > On 22 Apr 2009, at 07:50, Martin v. Löwis wrote: >> If the locale's encoding is UTF-8, the file system encoding is set to >> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes >> (which must be >= 0x80) into half surrogate codes U+DC80..U

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Barry Scott
On 22 Apr 2009, at 07:50, Martin v. Löwis wrote: If the locale's encoding is UTF-8, the file system encoding is set to a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF. Forgive me if this has been covered.
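The mapping quoted above can be tried out in released Python 3, where this mechanism is available as the `surrogateescape` error handler (that name does not appear in the draft quoted here, which speaks of "utf-8b"). The following is a rough sketch; the helper names `decode_name` and `encode_name` are hypothetical.

```python
def decode_name(raw):
    """Decode bytes as UTF-8, mapping each undecodable byte 0xNN
    (always >= 0x80) to the lone surrogate U+DCNN."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_name(name):
    """Reverse decode_name: lone surrogates U+DC80..U+DCFF turn back
    into the single bytes they stand for."""
    return name.encode("utf-8", errors="surrogateescape")


raw = b"caf\xe9.txt"            # Latin-1 encoded name, not valid UTF-8
name = decode_name(raw)         # the byte 0xE9 becomes U+DCE9
restored = encode_name(name)    # identical to the original bytes
```

Valid UTF-8 input is unaffected, so only names that would otherwise fail to decode carry the escape code points.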

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-26 Thread Martin v. Löwis
> How about another str-like type, a sequence of char-or-bytes? That would be a different PEP. I personally like my own proposal more, but feel free to propose something different. Regards, Martin -- http://mail.python.org/mailman/listinfo/python-list

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-26 Thread Adrian
How about another str-like type, a sequence of char-or-bytes? Could be called strbytes or stringwithinvalidcharacters. It would support whatever subset of str functionality makes sense / is easy to implement plus a to_escaped_str() method (that does the escaping the PEP talks about) for people who

Re: PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Cameron Simpson
On 25Apr2009 14:07, "Martin v. Löwis" wrote: | Cameron Simpson wrote: | > On 22Apr2009 08:50, Martin v. Löwis wrote: | > | File names, environment variables, and command line arguments are | > | defined as being character data in POSIX; | > | > Specific citation please? I'd like to check the spe

Re: PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Zooko O'Whielacronx
Thanks for writing this PEP 383, MvL. I recently ran into this problem in Python 2.x in the Tahoe project [1]. The Tahoe project should be considered a good use case showing what some people need. For example, the assumption that a file will later be written back into the same local file

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
> If the bytes are mapped to single half surrogate codes instead of the > normal pairs (low+high), then I can see that decoding could never be > ambiguous and encoding could produce the original bytes. I was confused by Markus Kuhn's original UTF-8b specification. I have now changed the PEP to avo

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
Cameron Simpson wrote: > On 22Apr2009 08:50, Martin v. Löwis wrote: > | File names, environment variables, and command line arguments are > | defined as being character data in POSIX; > > Specific citation please? I'd like to check the specifics of this. For example, on environment variables: h

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-24 Thread Martin v. Löwis
> Why not use U+DCxx for non-UTF-8 encodings too? I thought of that, and was tricked into believing that only U+DC8x is a half surrogate. Now I see that you are right, and have fixed the PEP accordingly. Regards, Martin

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-24 Thread Lino Mastrodomenico
2009/4/22 "Martin v. Löwis" : > To convert non-decodable bytes, a new error handler "python-escape" is > introduced, which decodes non-decodable bytes using into a private-use > character U+F01xx, which is believed to not conflict with private-use > characters that currently exist in Python codecs.

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-23 Thread MRAB
Martin v. Löwis wrote: MRAB wrote: Martin v. Löwis wrote: [snip] To convert non-decodable bytes, a new error handler "python-escape" is introduced, which decodes non-decodable bytes using into a private-use character U+F01xx, which is believed to not conflict with private-use characters that cu

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-23 Thread James Y Knight
On Apr 22, 2009, at 2:50 AM, Martin v. Löwis wrote: I'm proposing the following PEP for inclusion into Python 3.1. Please comment. +1. Even if some people still want a low-level bytes API, it's important that the easy case be easy. That is: the majority of Python applications should *just

Re: PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-23 Thread Cameron Simpson
On 24Apr2009 09:27, I wrote: | If I'm writing a general purpose UNIX tool like chmod or find, I expect | it to work reliably on _any_ UNIX pathname. It must be totally encoding | blind. If I speak to the os.* interface to open a file, I expect to hand | it bytes and have it behave. As an explicit e
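For comparison, the encoding-blind style described above is available in Python 3, whose `os` functions accept `bytes` paths and return `bytes` results for them. This is only an illustrative sketch: the function name `list_raw_names` and the file name `example.txt` are made up for the example.

```python
import os
import tempfile


def list_raw_names():
    """Create a file via a bytes path and list its directory as bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_bytes = os.fsencode(tmp)                 # str path -> bytes path
        path = os.path.join(tmp_bytes, b"example.txt")
        with open(path, "wb") as handle:             # open() accepts bytes paths
            handle.write(b"data")
        # A bytes argument makes os.listdir return bytes names, undecoded.
        return os.listdir(tmp_bytes)


names = list_raw_names()
```

No decoding step is involved at any point, so the names come back exactly as the file system stores them.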

Re: PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-23 Thread Cameron Simpson
On 22Apr2009 08:50, Martin v. Löwis wrote: | File names, environment variables, and command line arguments are | defined as being character data in POSIX; Specific citation please? I'd like to check the specifics of this. | the C APIs however allow | passing arbitrary bytes - whether these confo

Re: PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread v+python
On Apr 21, 11:50 pm, "Martin v. Löwis" wrote: > I'm proposing the following PEP for inclusion into Python 3.1. > Please comment. Basically the scheme doesn't work. Aside from that, it is very close. There are tons of encoding schemes that could work... they don't have to include half-surrogate

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Martin v. Löwis
>> The python-escape codec is only used/meaningful if the env encoding >> is not UTF-8. For any other encoding, it is assumed that no character >> actually maps to the private-use characters. > > Which should be true for any encoding from the pre-unicode era, but not > for UTF-16/32 and variants.

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread M.-A. Lemburg
On 2009-04-22 22:06, Walter Dörwald wrote: > Martin v. Löwis wrote: >>> "correct" -> "corrected" >> Thanks, fixed. >> To convert non-decodable bytes, a new error handler "python-escape" is introduced, which decodes non-decodable bytes using into a private-use character U+F01xx, which

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Walter Dörwald
Martin v. Löwis wrote: >> "correct" -> "corrected" > > Thanks, fixed. > >>> To convert non-decodable bytes, a new error handler "python-escape" is >>> introduced, which decodes non-decodable bytes using into a private-use >>> character U+F01xx, which is believed to not conflict with private-use >

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Martin v. Löwis
MRAB wrote: > Martin v. Löwis wrote: > [snip] >> To convert non-decodable bytes, a new error handler "python-escape" is >> introduced, which decodes non-decodable bytes using into a private-use >> character U+F01xx, which is believed to not conflict with private-use >> characters that currently exi

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Martin v. Löwis
> "correct" -> "corrected" Thanks, fixed. >> To convert non-decodable bytes, a new error handler "python-escape" is >> introduced, which decodes non-decodable bytes using into a private-use >> character U+F01xx, which is believed to not conflict with private-use >> characters that currently exist

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Walter Dörwald
Martin v. Löwis wrote: > I'm proposing the following PEP for inclusion into Python 3.1. > Please comment. > > Regards, > Martin > > PEP: 383 > Title: Non-decodable Bytes in System Character Interfaces > Version: $Revision: 71793 $ > Last-Modified: $Date: 2009-04-22 08:42:06 +0200 (Mi, 22. Apr 20

Re: PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread MRAB
Martin v. Löwis wrote: [snip] To convert non-decodable bytes, a new error handler "python-escape" is introduced, which decodes non-decodable bytes using into a private-use character U+F01xx, which is believed to not conflict with private-use characters that currently exist in Python codecs. The

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Nick Coghlan
Martin v. Löwis wrote: > I'm proposing the following PEP for inclusion into Python 3.1. > Please comment. That seems like a much nicer solution than having parallel bytes/Unicode APIs everywhere. When the locale encoding is UTF-8, would UTF-8b also be used for the command line decoding and enviro

PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-21 Thread Martin v. Löwis
I'm proposing the following PEP for inclusion into Python 3.1. Please comment. Regards, Martin PEP: 383 Title: Non-decodable Bytes in System Character Interfaces Version: $Revision: 71793 $ Last-Modified: $Date: 2009-04-22 08:42:06 +0200 (Mi, 22. Apr 2009) $ Author: Martin v. Löwis Status: Draft