Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Nick Coghlan
Martin v. Löwis wrote:
> I'm proposing the following PEP for inclusion into Python 3.1.
> Please comment.

That seems like a much nicer solution than having parallel bytes/Unicode
APIs everywhere.

When the locale encoding is UTF-8, would UTF-8b also be used for the
command line decoding and environment variable encoding/decoding? (the
PEP currently only states that the encoding switch will be done for the
file system encoding - it is silent regarding the other two system
interfaces).

Cheers,
Nick.

-- 
Nick Coghlan   |   [email protected]   |   Brisbane, Australia
---
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread glyph

On 06:50 am, [email protected] wrote:

> I'm proposing the following PEP for inclusion into Python 3.1.
> Please comment.

> To convert non-decodable bytes, a new error handler "python-escape" is
> introduced, which decodes non-decodable bytes using into a private-use
> character U+F01xx, which is believed to not conflict with private-use
> characters that currently exist in Python codecs.


-1.  On UNIX, character data is not sufficient to represent paths.  We 
must, must, must continue to have a simple bytes interface to these 
APIs.  Covering it up in layers of obscure encoding hacks will not make 
the problem go away, it will just make it harder to understand.


To make matters worse, Linux and GNOME use the PUA for some printable 
characters.  If you open up charmap on an Ubuntu system and select "view 
by unicode character block", then click on "private use area", you'll 
see many of these.  I know that Apple uses at least a few PUA codepoints 
for the Apple logo and the propeller/option icons as well.


I am still -1 on any turn-non-decodable-bytes-into-text, because it 
makes life harder for those of us trying to keep bytes and text 
straight, but if you absolutely must represent POSIX filenames as 
mojibake rather than bytes, the only workable solution is to use NUL as 
your escape character.  That's the only code point which _actually_ 
can't show up in a filename somehow.  As we discussed last time, this is 
what Mono does with System.IO.Path.  As a bonus, it's _much_ easier to 
detect a NUL from random application code than to try to figure out if a 
string has any half-surrogates or magic PUA characters which shouldn't 
be interpreted according to platform PUA rules.
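
To make the detection argument concrete, here is a minimal sketch of both checks. The helper names are hypothetical, the surrogate range is the U+DC80..U+DCFF range from the draft's utf-8b description, and the NUL variant only tests for the escape character; it does not attempt to reproduce how Mono encodes the escaped bytes.

```python
def has_nul_escape(name: str) -> bool:
    # A NUL can never be part of a real POSIX file name, so its presence
    # in a decoded name would unambiguously signal an escape.
    return "\x00" in name


def has_surrogate_escape(name: str) -> bool:
    # Scan for lone low surrogates in U+DC80..U+DCFF, the range the draft
    # assigns to undecodable bytes under utf-8b.
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in name)


print(has_nul_escape("a\x00b"), has_surrogate_escape("a\udcffb"))
```

Both checks are a single scan of the string; the difference being argued is that a PUA code point may also be legitimate text, whereas a NUL cannot be.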



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Walter Dörwald
Martin v. Löwis wrote:

> I'm proposing the following PEP for inclusion into Python 3.1.
> Please comment.
> 
> Regards,
> Martin
> 
> PEP: 383
> Title: Non-decodable Bytes in System Character Interfaces
> Version: $Revision: 71793 $
> Last-Modified: $Date: 2009-04-22 08:42:06 +0200 (Mi, 22. Apr 2009) $
> Author: Martin v. Löwis 
> Status: Draft
> Type: Standards Track
> Content-Type: text/x-rst
> Created: 22-Apr-2009
> Python-Version: 3.1
> Post-History:
> 
> Abstract
> 
> 
> File names, environment variables, and command line arguments are
> defined as being character data in POSIX; the C APIs however allow
> passing arbitrary bytes - whether these conform to a certain encoding
> or not. This PEP proposes a means of dealing with such irregularities
> by embedding the bytes in character strings in such a way that allows
> recreation of the original byte string.
> 
> Rationale
> =
> 
> The C char type is a data type that is commonly used to represent both
> character data and bytes. Certain POSIX interfaces are specified and
> widely understood as operating on character data, however, the system
> call interfaces make no assumption on the encoding of these data, and
> pass them on as-is. With Python 3, character strings use a
> Unicode-based internal representation, making it difficult to ignore
> the encoding of byte strings in the same way that the C interfaces can
> ignore the encoding.
> 
> On the other hand, Microsoft Windows NT has correct the original

"correct" -> "corrected"

> design limitation of Unix, and made it explicit in its system
> interfaces that these data (file names, environment variables, command
> line arguments) are indeed character data, by providing a
> Unicode-based API (keeping a C-char-based one for backwards
> compatibility).
> 
> [...]
> 
> Specification
> =
> 
> On Windows, Python uses the wide character APIs to access
> character-oriented APIs, allowing direct conversion of the
> environmental data to Python str objects.
> 
> On POSIX systems, Python currently applies the locale's encoding to
> convert the byte data to Unicode. If the locale's encoding is UTF-8,
> it can represent the full set of Unicode characters, otherwise, only a
> subset is representable. In the latter case, using private-use
> characters to represent these bytes would be an option. For UTF-8,
> doing so would create an ambiguity, as the private-use characters may
> regularly occur in the input also.
> 
> To convert non-decodable bytes, a new error handler "python-escape" is
> introduced, which decodes non-decodable bytes using into a private-use
> character U+F01xx, which is believed to not conflict with private-use
> characters that currently exist in Python codecs.

Would this mean that real private use characters in the file name would
raise an exception? How? The UTF-8 decoder doesn't pass those bytes to
any error handler.

> The error handler interface is extended to allow the encode error
> handler to return byte strings immediately, in addition to returning
> Unicode strings which then get encoded again.

Then the error callback for encoding would become specific to the target
encoding. Would this mean that the handler checks which encoding is used
and behaves like "strict" if it doesn't recognize the encoding?

> If the locale's encoding is UTF-8, the file system encoding is set to
> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.

Is this done by the codec, or the error handler? If it's done by the
codec I don't see a reason for the "python-escape" error handler.

> Discussion
> ==
> 
> While providing a uniform API to non-decodable bytes, this interface
> has the limitation that chosen representation only "works" if the data
> get converted back to bytes with the python-escape error handler
> also.

I thought the error handler would be used for decoding.

> Encoding the data with the locale's encoding and the (default)
> strict error handler will raise an exception, encoding them with UTF-8
> will produce non-sensical data.
> 
> For most applications, we assume that they eventually pass data
> received from a system interface back into the same system
> interfaces. For example, and application invoking os.listdir() will

"and" -> "an"

> likely pass the result strings back into APIs like os.stat() or
> open(), which then encodes them back into their original byte
> representation. Applications that need to process the original byte
> strings can obtain them by encoding the character strings with the
> file system encoding, passing "python-escape" as the error handler
> name.

Servus,
   Walter


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread MRAB

Martin v. Löwis wrote:
[snip]

> To convert non-decodable bytes, a new error handler "python-escape" is
> introduced, which decodes non-decodable bytes using into a private-use
> character U+F01xx, which is believed to not conflict with private-use
> characters that currently exist in Python codecs.
>
> The error handler interface is extended to allow the encode error
> handler to return byte strings immediately, in addition to returning
> Unicode strings which then get encoded again.
>
> If the locale's encoding is UTF-8, the file system encoding is set to
> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.


If the byte stream happens to include a sequence which decodes to
U+F01xx, shouldn't that raise an exception?


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Dirkjan Ochtman

On 22/04/2009 14:20, [email protected] wrote:

> -1. On UNIX, character data is not sufficient to represent paths. We
> must, must, must continue to have a simple bytes interface to these
> APIs. Covering it up in layers of obscure encoding hacks will not make
> the problem go away, it will just make it harder to understand.


As a hg developer, I have to concur. Keeping bytes-based APIs intact 
would make porting hg to py3k much, much easier. You may be able to 
imagine that dealing with paths correctly cross-platform on a VCS is a 
major PITA, and py3k is currently not helping the situation.


Cheers,

Dirkjan


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Benjamin Peterson
2009/4/22 Dirkjan Ochtman :
> On 22/04/2009 14:20, [email protected] wrote:
>>
>> -1. On UNIX, character data is not sufficient to represent paths. We
>> must, must, must continue to have a simple bytes interface to these
>> APIs. Covering it up in layers of obscure encoding hacks will not make
>> the problem go away, it will just make it harder to understand.
>
> As a hg developer, I have to concur. Keeping bytes-based APIs intact would
> make porting hg to py3k much, much easier. You may be able to imagine that
> dealing with paths correctly cross-platform on a VCS is a major PITA, and
> py3k is currently not helping the situation.

Your concerns are valid, but I don't see anything in the PEP about
removing the bytes APIs.



-- 
Regards,
Benjamin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Antoine Pitrou
Dirkjan Ochtman writes:
> 
> As a hg developer, I have to concur. Keeping bytes-based APIs intact 
> would make porting hg to py3k much, much easier. You may be able to 
> imagine that dealing with paths correctly cross-platform on a VCS is a 
> major PITA, and py3k is currently not helping the situation.

bytes-based APIs are certainly more bullet-proof under Unix, but it's the
reverse under Windows. Martin's proposal aims to bridge the gap and propose
something that makes text-based APIs as bullet-proof under Unix as they already
are under Windows.

Regards

Antoine.




Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Martin v. Löwis
> "correct" -> "corrected"

Thanks, fixed.

>> To convert non-decodable bytes, a new error handler "python-escape" is
>> introduced, which decodes non-decodable bytes using into a private-use
>> character U+F01xx, which is believed to not conflict with private-use
>> characters that currently exist in Python codecs.
> 
> Would this mean that real private use characters in the file name would
> raise an exception? How? The UTF-8 decoder doesn't pass those bytes to
> any error handler.

The python-escape error handler is only used/meaningful if the env
encoding is not UTF-8. For any other encoding, it is assumed that no
character actually maps to the private-use characters.

>> The error handler interface is extended to allow the encode error
>> handler to return byte strings immediately, in addition to returning
>> Unicode strings which then get encoded again.
> 
> Then the error callback for encoding would become specific to the target
> encoding.

Why would it become specific? It can work the same way for any encoding:
take U+F01xx, and generate the byte xx.

>> If the locale's encoding is UTF-8, the file system encoding is set to
>> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
>> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
> 
> Is this done by the codec, or the error handler? If it's done by the
> codec I don't see a reason for the "python-escape" error handler.

utf-8b is a new codec. However, the utf-8b codec is only used if the
env encoding would otherwise be utf-8. For utf-8b, the error handler
is indeed unnecessary.
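
The U+DC80..U+DCFF mapping described for utf-8b can be tried out in released Python 3 versions, where the standard `surrogateescape` error handler performs the same byte-to-lone-surrogate mapping. The sketch below uses that handler as a stand-in for the utf-8b codec proposed in this draft; the sample file name is made up.

```python
raw = b"abc\xff\x80.txt"  # 0xFF and a stray 0x80 are not valid UTF-8

# Each undecodable byte 0xXX becomes the lone surrogate U+DCXX.
text = raw.decode("utf-8", "surrogateescape")
print(ascii(text))

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", "surrogateescape")
print(restored == raw)

# A strict UTF-8 encode refuses lone surrogates instead of guessing.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```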

>> While providing a uniform API to non-decodable bytes, this interface
>> has the limitation that chosen representation only "works" if the data
>> get converted back to bytes with the python-escape error handler
>> also.
> 
> I thought the error handler would be used for decoding.

It's used in both directions: for decoding, it converts \xXX to
U+F01XX. For encoding, U+F01XX will trigger an error, which is then
handled by the handler to produce \xXX.
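
The two directions can be sketched with `codecs.register_error`. This is only an illustration of the scheme as described in this message, not the handler that was eventually implemented: the handler name `pua-escape-sketch` and the file name are hypothetical, and U+F0100 is taken as the base of the "U+F01xx" range.

```python
import codecs

PUA_BASE = 0xF0100  # assumed start of the U+F01xx range from the draft


def pua_escape(exc):
    if isinstance(exc, UnicodeDecodeError):
        # Decoding: map each undecodable byte 0xXX to U+F01XX.
        bad = exc.object[exc.start:exc.end]
        return "".join(chr(PUA_BASE + b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        # Encoding: map U+F01XX back to the raw byte 0xXX. Returning bytes
        # from an encode handler is the extension the PEP asks for.
        out = bytearray()
        for ch in exc.object[exc.start:exc.end]:
            cp = ord(ch)
            if not PUA_BASE <= cp <= PUA_BASE + 0xFF:
                raise exc  # a genuinely unencodable character
            out.append(cp - PUA_BASE)
        return bytes(out), exc.end
    raise exc


codecs.register_error("pua-escape-sketch", pua_escape)

raw = b"caf\xe9.txt"  # Latin-1 bytes seen under an ASCII locale
text = raw.decode("ascii", "pua-escape-sketch")
roundtrip = text.encode("ascii", "pua-escape-sketch")
print(ascii(text), roundtrip == raw)
```

Encoding the same string with plain UTF-8 does not invoke the handler at all, because U+F01XX is an ordinary encodable code point there; the result is a four-byte sequence rather than the original byte.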

> "and" -> "an"

Thanks, fixed.

Regards,
Martin



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread R. David Murray

On Wed, 22 Apr 2009 at 13:29, Benjamin Peterson wrote:

> 2009/4/22 Dirkjan Ochtman:
>> On 22/04/2009 14:20, [email protected] wrote:
>>> -1. On UNIX, character data is not sufficient to represent paths. We
>>> must, must, must continue to have a simple bytes interface to these
>>> APIs. Covering it up in layers of obscure encoding hacks will not make
>>> the problem go away, it will just make it harder to understand.
>>
>> As a hg developer, I have to concur. Keeping bytes-based APIs intact would
>> make porting hg to py3k much, much easier. You may be able to imagine that
>> dealing with paths correctly cross-platform on a VCS is a major PITA, and
>> py3k is currently not helping the situation.
>
> Your concerns are valid, but I don't see anything in the PEP about
> removing the bytes APIs.


Yeah, but IIRC a complete set of bytes APIs doesn't exist yet in py3k.

--David


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Martin v. Löwis
> -1.  On UNIX, character data is not sufficient to represent paths.  We
> must, must, must continue to have a simple bytes interface to these
> APIs.

I'd like to respond to this concern in three ways:

1. The PEP doesn't remove any of the existing interfaces. So if the
   interfaces for byte-oriented file names in 3.0 work fine for you,
   feel free to continue to use them.

2. Even if they were taken away (which the PEP does not propose to do),
   it would be easy to emulate them for applications that want them.
   For example, listdir could be wrapped as

   import os
   import sys

   def listdir_b(bytestring):
       # "python-escape" is the error handler proposed in this PEP.
       fse = sys.getfilesystemencoding()
       string = bytestring.decode(fse, "python-escape")
       for fn in os.listdir(string):
           yield fn.encode(fse, "python-escape")

3. I still disagree that we must, must, must continue to provide these
   interfaces. I don't understand from the rest of your message what
   would *actually* break if people would use the proposed interfaces.
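
A version of the wrapper from point 2 that runs on released Python 3 can be sketched with `os.fsdecode` and `os.fsencode`, which on POSIX apply the file system encoding together with the `surrogateescape` handler. Using them in place of the draft's "python-escape" handler is an assumption made for illustration; the temporary directory and file name exist only for the demo.

```python
import os
import tempfile


def listdir_b(path_bytes):
    # Decode the bytes path, list the directory through the str API,
    # then encode every returned name back to bytes.
    path = os.fsdecode(path_bytes)
    for name in os.listdir(path):
        yield os.fsencode(name)


with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "example.txt"), "w").close()
    names = list(listdir_b(os.fsencode(tmp)))

print(names)
```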

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Martin v. Löwis
Dirkjan Ochtman wrote:
> On 22/04/2009 14:20, [email protected] wrote:
>> -1. On UNIX, character data is not sufficient to represent paths. We
>> must, must, must continue to have a simple bytes interface to these
>> APIs. Covering it up in layers of obscure encoding hacks will not make
>> the problem go away, it will just make it harder to understand.
> 
> As a hg developer, I have to concur. Keeping bytes-based APIs intact
> would make porting hg to py3k much, much easier. You may be able to
> imagine that dealing with paths correctly cross-platform on a VCS is a
> major PITA, and py3k is currently not helping the situation.

I find these statements contradictory:
py3k *is* keeping the byte-based APIs for file names intact, so
why is it not helping the situation, when this is what is needed
to make porting much, much easier?

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Martin v. Löwis
> Yeah, but IIRC a complete set of bytes APIs doesn't exist yet in py3k.

Define complete. I'm not aware of any interfaces wrt. file IO that are
lacking, so which ones were you thinking of?

Python doesn't currently provide a way to access environment variables
and command line arguments as bytes. With the PEP, such a way would
actually become available for applications that desire it.
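
What "become available" could look like in application code is sketched below: the str values are encoded back with the environment encoding and the escaping handler. The helper names are hypothetical, and `surrogateescape` (the handler in released Python 3) stands in for the draft's "python-escape"; the sample argument and variable are invented.

```python
def args_as_bytes(args, encoding, errors="surrogateescape"):
    # Recover the original byte strings behind a list of str arguments.
    return [arg.encode(encoding, errors) for arg in args]


def environ_as_bytes(env, encoding, errors="surrogateescape"):
    # Same idea for a str-to-str environment mapping.
    return {
        key.encode(encoding, errors): value.encode(encoding, errors)
        for key, value in env.items()
    }


# Simulate what the interpreter would hand to the application for an
# argument whose bytes are not valid UTF-8.
decoded = b"f\xfco".decode("utf-8", "surrogateescape")
argv_bytes = args_as_bytes([decoded], "utf-8")
env_bytes = environ_as_bytes({"NAME": decoded}, "utf-8")
print(argv_bytes, env_bytes)
```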

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Martin v. Löwis
MRAB wrote:
> Martin v. Löwis wrote:
> [snip]
>> To convert non-decodable bytes, a new error handler "python-escape" is
>> introduced, which decodes non-decodable bytes using into a private-use
>> character U+F01xx, which is believed to not conflict with private-use
>> characters that currently exist in Python codecs.
>>
>> The error handler interface is extended to allow the encode error
>> handler to return byte strings immediately, in addition to returning
>> Unicode strings which then get encoded again.
>>
>> If the locale's encoding is UTF-8, the file system encoding is set to
>> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
>> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
>>
> If the byte stream happens to include a sequence which decodes to
> U+F01xx, shouldn't that raise an exception?

I apparently have not expressed it clearly, so please help me improve
the text. What I mean is this:

- if the environment encoding (for lack of better name) is UTF-8,
  Python stops using the utf-8 codec under this PEP, and switches
  to the utf-8b codec.
- otherwise (env encoding is not utf-8), undecodable bytes get decoded
  with the error handler. In this case, U+F01xx will not occur
  in the byte stream, since no other codec ever produces this PUA
  character (this is not fully true - UTF-16 may also produce PUA
  characters, but they can't appear as env encodings).
So the case you are referring to should not happen.

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread R. David Murray

On Wed, 22 Apr 2009 at 21:21, "Martin v. Löwis" wrote:
>> Yeah, but IIRC a complete set of bytes APIs doesn't exist yet in py3k.
>
> Define complete. I'm not aware of any interfaces wrt. file IO that are
> lacking, so which ones were you thinking of?
>
> Python doesn't currently provide a way to access environment variables
> and command line arguments as bytes. With the PEP, such a way would
> actually become available for applications that desire it.


Those are the two that I'm thinking of.

I think I understand your proposal better now after your example of
implementing listdir(bytes).  Putting it in the PEP would probably
be a good idea.  I personally don't have enough practice in actually
working with various encodings (or any understanding of unicode escapes)
to comment further.

--David


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Walter Dörwald
Martin v. Löwis wrote:
>> "correct" -> "corrected"
> 
> Thanks, fixed.
> 
>>> To convert non-decodable bytes, a new error handler "python-escape" is
>>> introduced, which decodes non-decodable bytes using into a private-use
>>> character U+F01xx, which is believed to not conflict with private-use
>>> characters that currently exist in Python codecs.
>> Would this mean that real private use characters in the file name would
>> raise an exception? How? The UTF-8 decoder doesn't pass those bytes to
>> any error handler.
> 
> The python-escape codec is only used/meaningful if the env encoding
> is not UTF-8. For any other encoding, it is assumed that no character
> actually maps to the private-use characters.

Which should be true for any encoding from the pre-unicode era, but not
for UTF-16/32 and variants.

>>> The error handler interface is extended to allow the encode error
>>> handler to return byte strings immediately, in addition to returning
>>> Unicode strings which then get encoded again.
>> Then the error callback for encoding would become specific to the target
>> encoding.
> 
> Why would it become specific? It can work the same way for any encoding:
> take U+F01xx, and generate the byte xx.

If any error callback emits bytes these byte sequences must be legal in
the target encoding, which depends on the target encoding itself.

However for the normal use of this error handler this might be
irrelevant, because those filenames that get encoded were constructed in
such a way that reencoding them regenerates the original byte sequence.

>>> If the locale's encoding is UTF-8, the file system encoding is set to
>>> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
>>> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
>> Is this done by the codec, or the error handler? If it's done by the
>> codec I don't see a reason for the "python-escape" error handler.
> 
> utf-8b is a new codec. However, the utf-8b codec is only used if the
> env encoding would otherwise be utf-8. For utf-8b, the error handler
> is indeed unnecessary.

Wouldn't it make more sense to be consistent in how non-decodable bytes
get decoded? I.e. should the utf-8b codec decode those bytes to PUA
characters too (and refuse to encode them, so the error handler outputs
them)?

>>> While providing a uniform API to non-decodable bytes, this interface
>>> has the limitation that chosen representation only "works" if the data
>>> get converted back to bytes with the python-escape error handler
>>> also.
>> I thought the error handler would be used for decoding.
> 
> It's used in both directions: for decoding, it converts \xXX to
> U+F01XX. For encoding, U+F01XX will trigger an error, which is then
> handled by the handler to produce \xXX.

But only for non-UTF8 encodings?

Servus,
   Walter


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread M.-A. Lemburg
On 2009-04-22 22:06, Walter Dörwald wrote:
> Martin v. Löwis wrote:
>>> "correct" -> "corrected"
>> Thanks, fixed.
>>
>>>> To convert non-decodable bytes, a new error handler "python-escape" is
>>>> introduced, which decodes non-decodable bytes using into a private-use
>>>> character U+F01xx, which is believed to not conflict with private-use
>>>> characters that currently exist in Python codecs.
>>> Would this mean that real private use characters in the file name would
>>> raise an exception? How? The UTF-8 decoder doesn't pass those bytes to
>>> any error handler.
>> The python-escape codec is only used/meaningful if the env encoding
>> is not UTF-8. For any other encoding, it is assumed that no character
>> actually maps to the private-use characters.
> 
> Which should be true for any encoding from the pre-unicode era, but not
> for UTF-16/32 and variants.

Actually it's not even true for the pre-Unicode codecs. It was and is common
for Asian companies to use company specific symbols in private areas
or extended versions of CJK character sets.

Microsoft even published an editor for Asian users to create their
own glyphs as needed:

http://msdn.microsoft.com/en-us/library/cc194861.aspx

Here's an overview for some US companies using such extensions:


http://scripts.sil.org/cms/SCRIPTs/page.php?site_id=nrsi&item_id=VendorUseOfPUA
(it's no surprise that most of these actually defined their own charsets)

SIL even started a registry for the private use areas (PUAs):

http://scripts.sil.org/cms/SCRIPTs/page.php?site_id=nrsi&cat_id=UnicodePUA

This is their current list of assignments:


http://scripts.sil.org/cms/SCRIPTs/page.php?site_id=nrsi&item_id=SILPUAassignments

and here's how to register:


http://scripts.sil.org/cms/SCRIPTs/page.php?site_id=nrsi&cat_id=UnicodePUA#404a261e

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Martin v. Löwis
>> The python-escape codec is only used/meaningful if the env encoding
>> is not UTF-8. For any other encoding, it is assumed that no character
>> actually maps to the private-use characters.
> 
> Which should be true for any encoding from the pre-unicode era, but not
> for UTF-16/32 and variants.

Right. However, these can't appear as environment/file system encodings,
because they use null bytes.

>> Why would it become specific? It can work the same way for any encoding:
>> take U+F01xx, and generate the byte xx.
> 
> If any error callback emits bytes these byte sequences must be legal in
> the target encoding, which depends on the target encoding itself.

No. The whole process started with data having an *invalid* encoding
in the source encoding (which, after the roundtrip, is now the
target encoding). So the python-escape error handler deliberately
produces byte sequences that are invalid in the environment encoding
(hence the additional permission of having it produce bytes instead
of characters).

> However for the normal use of this error handler this might be
> irrelevant, because those filenames that get encoded were constructed in
> such a way that reencoding them regenerates the original byte sequence.

Exactly so. The error handler is not of much use outside this specific
scenario.

>> utf-8b is a new codec. However, the utf-8b codec is only used if the
>> env encoding would otherwise be utf-8. For utf-8b, the error handler
>> is indeed unnecessary.
> 
> Wouldn't it make more sense to be consistent how non-decodable bytes get
> decoded? I.e. should the utf-8b codec decode those bytes to PUA
> characters too (and refuse to encode then, so the error handler outputs
> them)?

Unfortunately, that won't work. If the original encoding is UTF-8, and
uses PUA characters, then, on re-encoding, it's not possible to tell
whether to encode as a PUA character, or as an invalid byte.

This was my original proposal a year ago, and people immediately
suggested that it is not at all acceptable if there is the slightest
chance of information loss. Hence the current PEP.
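
The ambiguity can be shown in a few lines. The sketch below registers a hypothetical decode-only handler (`pua-decode-sketch`) that maps undecodable bytes into U+F01xx, with U+F0100 assumed as the base, and applies it to UTF-8 input: an invalid byte and a genuine PUA character end up as the same string.

```python
import codecs

PUA_BASE = 0xF0100  # assumed start of the U+F01xx range


def pua_on_decode(exc):
    # Decode-only illustration: undecodable byte 0xXX becomes U+F01XX.
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return "".join(chr(PUA_BASE + b) for b in bad), exc.end
    raise exc


codecs.register_error("pua-decode-sketch", pua_on_decode)

invalid_name = b"\xff"                              # not valid UTF-8
genuine_pua = chr(PUA_BASE + 0xFF).encode("utf-8")  # valid UTF-8 for U+F01FF

from_invalid = invalid_name.decode("utf-8", "pua-decode-sketch")
from_genuine = genuine_pua.decode("utf-8", "pua-decode-sketch")

# Two different file names on disk collapse into one str, so no encoder
# can know which bytes to write back.
print(from_invalid == from_genuine, invalid_name == genuine_pua)
```

With lone surrogates as the target instead, the collision goes away for well-formed input, because valid UTF-8 never decodes to a surrogate code point.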

>>> I thought the error handler would be used for decoding.
>> It's used in both directions: for decoding, it converts \xXX to
>> U+F01XX. For encoding, U+F01XX will trigger an error, which is then
>> handled by the handler to produce \xXX.
> 
> But only for non-UTF8 encodings?

Right. For ease of use, the implementation will specify the error
handler regardless, and the recommended use for applications will
be to use the error handler regardless. For utf-8b, the error
handler will never be invoked, since all input can be converted
always.

Regards,
Martin


Re: [Python-Dev] Issue5434: datetime.monthdelta

2009-04-22 Thread Jess Austin
On Thu, Apr 16, 2009 at 8:01 PM, Jess Austin  wrote:
> These operations are useful in particular contexts.  What I've
> submitted is also useful, and currently isn't easy in core,
> batteries-included python.  While I would consider the foregoing
> interpretation of the Zen to be backwards (this doesn't add another
> way to do something that's already possible, it makes possible
> something that currently encourages one to pull her hair out), I
> suppose it doesn't matter.  If adding a class and a function to a
> module will require extended advocacy on -ideas and c.l.p, I'm
> probably not the person for the job.
>
> If, on the other hand, one of the committers wants to toss this in at
> some point, whether now or 3 versions down the road, the patch is up
> at bugs.python.org (and I'm happy to make any suggested
> modifications).  I'm glad to have written this; I learned a bit about
> CPython internals and scraped a layer of rust off my C skills.  I will
> go ahead and backport the python-coded version to 2.3.  I'll continue
> this conversation with whomever for however long, but I suspect this
> topic will soon have worn out its welcome on python-dev.


I've uploaded the backported python version source distribution to
PyPI, http://pypi.python.org/pypi?name=MonthDelta&:action=display with
better-formatted documentation at
http://packages.python.org/MonthDelta/

"easy_install MonthDelta" works too.

cheers,
Jess


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread glyph


On 07:17 pm, [email protected] wrote:

>> -1.  On UNIX, character data is not sufficient to represent paths.  We
>> must, must, must continue to have a simple bytes interface to these
>> APIs.
>
> I'd like to respond to this concern in three ways:
>
> 1. The PEP doesn't remove any of the existing interfaces. So if the
>    interfaces for byte-oriented file names in 3.0 work fine for you,
>    feel free to continue to use them.


It's good to know this.  It would be good if the PEP made it clear that 
it is proposing an additional way to work with undecodable bytes, not 
replacing the existing one.


For me, this PEP isn't an acceptable substitute for direct bytes-based 
access to command-line arguments and environment variables on UNIX.  To 
my knowledge *those* APIs still don't exist yet.  I would like it if 
this PEP were not used as an excuse to avoid adding them.
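
As a rough sketch of what bytes-level access could look like, assuming a
later Python (3.2 or newer) where os.fsencode(), os.environb and
os.supports_bytes_environ exist. None of these are part of the PEP as
discussed here, and the helper names argv_bytes and environ_bytes are
made up for illustration:

```python
import os
import sys


def argv_bytes():
    """Return the command-line arguments as bytes.

    Assumption: sys.argv was decoded with the file system encoding, so
    os.fsencode() should reverse that decoding.
    """
    return [os.fsencode(arg) for arg in sys.argv]


def environ_bytes():
    """Return a snapshot of the environment as a dict of bytes -> bytes."""
    if os.supports_bytes_environ:
        # Available on POSIX: a bytes view of the same environment.
        return dict(os.environb)
    # Fallback for platforms without os.environb: re-encode the str view.
    return {os.fsencode(k): os.fsencode(v) for k, v in os.environ.items()}
```

On POSIX the values should come back as the bytes the process was
started with; on other platforms the fallback only re-encodes the str
values, so it is an approximation.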

> 2. Even if they were taken away (which the PEP does not propose to do),
>    it would be easy to emulate them for applications that want them.


I think this is a pretty clear abstraction inversion.  Luckily nobody is 
proposing it :).

> 3. I still disagree that we must, must, must continue to provide these
>    interfaces.


You do have a point; if there is a clean, defined mapping between str 
and bytes in terms of all path/argv/environ APIs, then we don't *need* 
those APIs, since we can just implement them in terms of characters. 
But I still think that's a bad idea, since mixing the returned strings 
with *other* APIs remains problematic.  However, I still think the 
mapping you propose is problematic...

>    I don't understand from the rest of your message what
>    would *actually* break if people would use the proposed interfaces.


As for more concrete problems: the UTF-8 codec currently in Python 2.5, 
2.6, and 3.0 will happily encode half-surrogates, at least in the builds 
I have.


   >>> '\udc81'.encode('utf-8').decode('utf-8')
   '\udc81'

So there's an ambiguity when passing U+DC81 to this codec: do you mean 
\xed\xb2\x81 or do you just mean \x81?  Of course it would be possible 
to make UTF-8B consistent in this regard, but it is still going to 
interact with code that thinks in terms of actual UTF-8, and the failure 
mode here is very difficult to inspect.
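
For comparison, the handler eventually shipped under the name
"surrogateescape" (Python 3.1 and later), together with a UTF-8 codec
that rejects lone surrogates unless asked otherwise. A minimal sketch of
how the two readings of U+DC81 are kept apart there, assuming such an
interpreter; the file name is made up for illustration:

```python
# Assumes Python 3.1+, where the error handlers "surrogateescape" and
# "surrogatepass" exist. The byte string is a hypothetical file name.
raw = b"caf\x81.txt"  # 0x81 on its own is not valid UTF-8

# Decoding maps the undecodable byte 0x81 to the lone surrogate U+DC81.
name = raw.decode("utf-8", "surrogateescape")
assert name == "caf\udc81.txt"

# Reading 1: U+DC81 stands for the raw byte 0x81 (round trip).
assert name.encode("utf-8", "surrogateescape") == raw

# Reading 2: U+DC81 is encoded as a code point in its own right.
assert name.encode("utf-8", "surrogatepass") == b"caf\xed\xb2\x81.txt"

# The strict codec refuses to choose between the two readings.
try:
    name.encode("utf-8")
    strict_accepts_surrogate = True
except UnicodeEncodeError:
    strict_accepts_surrogate = False

# The three-byte form is itself treated as undecodable input, so it also
# survives a round trip instead of collapsing into U+DC81.
three = b"\xed\xb2\x81"
assert three.decode("utf-8", "surrogateescape").encode(
    "utf-8", "surrogateescape") == three
```

The caller has to say which reading is meant by naming the error
handler, which is the kind of explicitness the paragraph above is asking
for.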


A major problem here is that it's very difficult to puzzle out whether 
anything *will* actually break.  I might be wrong about the above for 
some subtlety of unicode that I don't quite understand, but I don't want 
to spend all day experimenting with every possible set of build options, 
python versions, and unicode specifications.  Neither, I wager, do most 
people who want to call listdir().


Another specific problem: looking at the Character Map application on my 
desktop, U+F0126 and U+F0127 are considered printable characters.  I'm 
not sure what they're supposed to be, exactly, but there are glyphs 
there.  This is running Ubuntu 8.04; there may be more of these in use 
in more recent versions of GNOME.


There is nothing "private" about the "private use" area; Python can 
never use any of these characters for *anything*, except possibly 
internally in ways which are never exposed to application code, because 
the operating system (or window system, or libraries) might use them. 
If I pass a string with those printable PUA/A characters in it to 
listdir(), what happens?  Do they get turned into bytes, do they only 
get turned into bytes if my filesystem encoding happens to be something 
other than UTF-8...?


The PEP seems a bit ambiguous to me as far as how the PUA hack and the 
half-surrogate hack interact.  I could be wrong, but it seems to me to 
be an either-or proposition, in which case there would be *four* bytes 
types in python 3.1: bytes, bytearray, str-with-PUA/A-junk, and 
str-with-half-surrogate-junk.  Detecting the difference would be an 
expensive and subtle affair; the simplest solution I could think of 
would be to use an error-prone regex.  If the encoding hack used were 
simply NULL, then the detection would be straightforward: 
"if '\u0000' in thingy:".
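
A sketch of what such a check might look like without a regex, under two
assumptions that are not spelled out in the PEP text quoted here: that
the half-surrogate scheme maps byte 0xNN to U+DCNN (as in the U+DC81
example), and that the PUA scheme uses U+F0100 through U+F01FF. The
function name and the range constants are hypothetical:

```python
# Assumed code point ranges for the two escaping schemes (not confirmed
# by the PEP text in this thread).
HALF_SURROGATE_RANGE = (0xDC80, 0xDCFF)
PUA_RANGE = (0xF0100, 0xF01FF)


def has_escaped_bytes(text, ranges=(HALF_SURROGATE_RANGE, PUA_RANGE)):
    """Return True if text contains a code point from any given range.

    This is a linear scan over the string. It cannot tell an escaped
    byte from a PUA character that some other program put there on
    purpose, which is the ambiguity raised above.
    """
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in ranges)
```

For example, has_escaped_bytes("caf\udc81.txt") is true while
has_escaped_bytes("plain.txt") is false. A string containing U+F0126 is
also flagged, whether or not it ever came from an undecodable byte.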


Ultimately I think I'm only -0 on all of this now, as long as we get 
bytes versions of environ and argv.  Even if these corner-case issues 
aren't fixed, those of us who want to have correct handling of 
undecodable filenames can do so.
