Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Martin v. Löwis wrote:
> I'm proposing the following PEP for inclusion into Python 3.1.
> Please comment.

That seems like a much nicer solution than having parallel bytes/Unicode APIs everywhere.

When the locale encoding is UTF-8, would UTF-8b also be used for the command line decoding and environment variable encoding/decoding? (The PEP currently only states that the encoding switch will be done for the file system encoding - it is silent regarding the other two system interfaces.)

Cheers,
Nick.

--
Nick Coghlan | [email protected] | Brisbane, Australia
---
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 06:50 am, [email protected] wrote:
> I'm proposing the following PEP for inclusion into Python 3.1.
> Please comment.

> To convert non-decodable bytes, a new error handler "python-escape" is
> introduced, which decodes non-decodable bytes using into a private-use
> character U+F01xx, which is believed to not conflict with private-use
> characters that currently exist in Python codecs.

-1. On UNIX, character data is not sufficient to represent paths. We must, must, must continue to have a simple bytes interface to these APIs. Covering it up in layers of obscure encoding hacks will not make the problem go away, it will just make it harder to understand.

To make matters worse, Linux and GNOME use the PUA for some printable characters. If you open up charmap on an Ubuntu system and select "view by unicode character block", then click on "private use area", you'll see many of these. I know that Apple uses at least a few PUA codepoints for the Apple logo and the propeller/option icons as well.

I am still -1 on any turn-non-decodable-bytes-into-text, because it makes life harder for those of us trying to keep bytes and text straight, but if you absolutely must represent POSIX filenames as mojibake rather than bytes, the only workable solution is to use NUL as your escape character. That's the only code point which _actually_ can't show up in a filename somehow. As we discussed last time, this is what Mono does with System.IO.Path.

As a bonus, it's _much_ easier to detect a NUL from random application code than to try to figure out if a string has any half-surrogates or magic PUA characters which shouldn't be interpreted according to platform PUA rules.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Martin v. Löwis wrote:
> I'm proposing the following PEP for inclusion into Python 3.1.
> Please comment.
>
> Regards,
> Martin
>
> PEP: 383
> Title: Non-decodable Bytes in System Character Interfaces
> Version: $Revision: 71793 $
> Last-Modified: $Date: 2009-04-22 08:42:06 +0200 (Mi, 22. Apr 2009) $
> Author: Martin v. Löwis
> Status: Draft
> Type: Standards Track
> Content-Type: text/x-rst
> Created: 22-Apr-2009
> Python-Version: 3.1
> Post-History:
>
> Abstract
> ========
>
> File names, environment variables, and command line arguments are
> defined as being character data in POSIX; the C APIs however allow
> passing arbitrary bytes - whether these conform to a certain encoding
> or not. This PEP proposes a means of dealing with such irregularities
> by embedding the bytes in character strings in such a way that allows
> recreation of the original byte string.
>
> Rationale
> =========
>
> The C char type is a data type that is commonly used to represent both
> character data and bytes. Certain POSIX interfaces are specified and
> widely understood as operating on character data, however, the system
> call interfaces make no assumption on the encoding of these data, and
> pass them on as-is. With Python 3, character strings use a
> Unicode-based internal representation, making it difficult to ignore
> the encoding of byte strings in the same way that the C interfaces can
> ignore the encoding.
>
> On the other hand, Microsoft Windows NT has correct the original

"correct" -> "corrected"

> design limitation of Unix, and made it explicit in its system
> interfaces that these data (file names, environment variables, command
> line arguments) are indeed character data, by providing a
> Unicode-based API (keeping a C-char-based one for backwards
> compatibility).
>
> [...]
>
> Specification
> =============
>
> On Windows, Python uses the wide character APIs to access
> character-oriented APIs, allowing direct conversion of the
> environmental data to Python str objects.
>
> On POSIX systems, Python currently applies the locale's encoding to
> convert the byte data to Unicode. If the locale's encoding is UTF-8,
> it can represent the full set of Unicode characters, otherwise, only a
> subset is representable. In the latter case, using private-use
> characters to represent these bytes would be an option. For UTF-8,
> doing so would create an ambiguity, as the private-use characters may
> regularly occur in the input also.
>
> To convert non-decodable bytes, a new error handler "python-escape" is
> introduced, which decodes non-decodable bytes using into a private-use
> character U+F01xx, which is believed to not conflict with private-use
> characters that currently exist in Python codecs.

Would this mean that real private use characters in the file name would raise an exception? How? The UTF-8 decoder doesn't pass those bytes to any error handler.

> The error handler interface is extended to allow the encode error
> handler to return byte strings immediately, in addition to returning
> Unicode strings which then get encoded again.

Then the error callback for encoding would become specific to the target encoding. Would this mean that the handler checks which encoding is used and behaves like "strict" if it doesn't recognize the encoding?

> If the locale's encoding is UTF-8, the file system encoding is set to
> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.

Is this done by the codec, or the error handler? If it's done by the codec I don't see a reason for the "python-escape" error handler.

> Discussion
> ==========
>
> While providing a uniform API to non-decodable bytes, this interface
> has the limitation that chosen representation only "works" if the data
> get converted back to bytes with the python-escape error handler
> also.

I thought the error handler would be used for decoding.

> Encoding the data with the locale's encoding and the (default)
> strict error handler will raise an exception, encoding them with UTF-8
> will produce non-sensical data.
>
> For most applications, we assume that they eventually pass data
> received from a system interface back into the same system
> interfaces. For example, and application invoking os.listdir() will

"and" -> "an"

> likely pass the result strings back into APIs like os.stat() or
> open(), which then encodes them back into their original byte
> representation. Applications that need to process the original byte
> strings can obtain them by encoding the character strings with the
> file system encoding, passing "python-escape" as the error handler
> name.

Servus,
Walter
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Martin v. Löwis wrote:
[snip]
> To convert non-decodable bytes, a new error handler "python-escape" is
> introduced, which decodes non-decodable bytes using into a private-use
> character U+F01xx, which is believed to not conflict with private-use
> characters that currently exist in Python codecs.
>
> The error handler interface is extended to allow the encode error
> handler to return byte strings immediately, in addition to returning
> Unicode strings which then get encoded again.
>
> If the locale's encoding is UTF-8, the file system encoding is set to
> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.

If the byte stream happens to include a sequence which decodes to U+F01xx, shouldn't that raise an exception?
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 22/04/2009 14:20, [email protected] wrote:
> -1. On UNIX, character data is not sufficient to represent paths. We
> must, must, must continue to have a simple bytes interface to these
> APIs. Covering it up in layers of obscure encoding hacks will not make
> the problem go away, it will just make it harder to understand.

As a hg developer, I have to concur. Keeping bytes-based APIs intact would make porting hg to py3k much, much easier. You may be able to imagine that dealing with paths correctly cross-platform on a VCS is a major PITA, and py3k is currently not helping the situation.

Cheers,
Dirkjan
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
2009/4/22 Dirkjan Ochtman:
> On 22/04/2009 14:20, [email protected] wrote:
>>
>> -1. On UNIX, character data is not sufficient to represent paths. We
>> must, must, must continue to have a simple bytes interface to these
>> APIs. Covering it up in layers of obscure encoding hacks will not make
>> the problem go away, it will just make it harder to understand.
>
> As a hg developer, I have to concur. Keeping bytes-based APIs intact would
> make porting hg to py3k much, much easier. You may be able to imagine that
> dealing with paths correctly cross-platform on a VCS is a major PITA, and
> py3k is currently not helping the situation.

Your concerns are valid, but I don't see anything in the PEP about removing the bytes APIs.

--
Regards,
Benjamin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Dirkjan Ochtman writes:
>
> As a hg developer, I have to concur. Keeping bytes-based APIs intact
> would make porting hg to py3k much, much easier. You may be able to
> imagine that dealing with paths correctly cross-platform on a VCS is a
> major PITA, and py3k is currently not helping the situation.

Bytes-based APIs are certainly more bullet-proof under Unix, but it's the reverse under Windows. Martin's proposal aims to bridge the gap and propose something that makes text-based APIs as bullet-proof under Unix as they already are under Windows.

Regards

Antoine.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> "correct" -> "corrected"

Thanks, fixed.

>> To convert non-decodable bytes, a new error handler "python-escape" is
>> introduced, which decodes non-decodable bytes using into a private-use
>> character U+F01xx, which is believed to not conflict with private-use
>> characters that currently exist in Python codecs.
>
> Would this mean that real private use characters in the file name would
> raise an exception? How? The UTF-8 decoder doesn't pass those bytes to
> any error handler.

The python-escape codec is only used/meaningful if the env encoding is not UTF-8. For any other encoding, it is assumed that no character actually maps to the private-use characters.

>> The error handler interface is extended to allow the encode error
>> handler to return byte strings immediately, in addition to returning
>> Unicode strings which then get encoded again.
>
> Then the error callback for encoding would become specific to the target
> encoding.

Why would it become specific? It can work the same way for any encoding: take U+F01xx, and generate the byte xx.

>> If the locale's encoding is UTF-8, the file system encoding is set to
>> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
>> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
>
> Is this done by the codec, or the error handler? If it's done by the
> codec I don't see a reason for the "python-escape" error handler.

utf-8b is a new codec. However, the utf-8b codec is only used if the env encoding would otherwise be utf-8. For utf-8b, the error handler is indeed unnecessary.

>> While providing a uniform API to non-decodable bytes, this interface
>> has the limitation that chosen representation only "works" if the data
>> get converted back to bytes with the python-escape error handler
>> also.
>
> I thought the error handler would be used for decoding.

It's used in both directions: for decoding, it converts \xXX to U+F01XX. For encoding, U+F01XX will trigger an error, which is then handled by the handler to produce \xXX.

> "and" -> "an"

Thanks, fixed.

Regards,
Martin
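The two directions described in the message above can be sketched with a custom codec error handler. This is only an illustration under stated assumptions, not the PEP's implementation: the handler name "pua-escape-sketch" and the function `pua_escape` are made up for this example, ASCII stands in for a non-UTF-8 environment encoding, and "U+F01xx" is read as the range U+F0100..U+F01FF. It relies on `codecs.register_error`, which in released Python 3 lets an encode handler return bytes.

```python
import codecs

PUA_BASE = 0xF0100  # assumed reading of "U+F01xx": byte xx maps to U+F0100 + xx


def pua_escape(exc):
    """Hypothetical handler sketching the draft's "python-escape" behaviour."""
    if isinstance(exc, UnicodeDecodeError):
        # Decoding: each undecodable byte \xXX becomes U+F01XX.
        bad = exc.object[exc.start:exc.end]
        return "".join(chr(PUA_BASE + b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        # Encoding: U+F01XX is unencodable, so turn it back into byte \xXX.
        out = bytearray()
        for ch in exc.object[exc.start:exc.end]:
            offset = ord(ch) - PUA_BASE
            if not 0 <= offset <= 0xFF:
                raise exc  # some other unencodable character: stay strict
            out.append(offset)
        return bytes(out), exc.end
    raise exc


codecs.register_error("pua-escape-sketch", pua_escape)

raw = b"caf\xe9.txt"                              # \xe9 is not valid ASCII
text = raw.decode("ascii", "pua-escape-sketch")
assert text == "caf\U000F01E9.txt"
assert text.encode("ascii", "pua-escape-sketch") == raw
```

The handler never inspects the target encoding, which matches the claim that it can work the same way for any encoding; whether the emitted bytes are valid in that encoding is a separate question raised later in the thread.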
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Wed, 22 Apr 2009 at 13:29, Benjamin Peterson wrote:
> 2009/4/22 Dirkjan Ochtman:
>> On 22/04/2009 14:20, [email protected] wrote:
>>> -1. On UNIX, character data is not sufficient to represent paths. We
>>> must, must, must continue to have a simple bytes interface to these
>>> APIs. Covering it up in layers of obscure encoding hacks will not make
>>> the problem go away, it will just make it harder to understand.
>>
>> As a hg developer, I have to concur. Keeping bytes-based APIs intact would
>> make porting hg to py3k much, much easier. You may be able to imagine that
>> dealing with paths correctly cross-platform on a VCS is a major PITA, and
>> py3k is currently not helping the situation.
>
> Your concerns are valid, but I don't see anything in the PEP about
> removing the bytes APIs.

Yeah, but IIRC a complete set of bytes APIs doesn't exist yet in py3k.

--David
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> -1. On UNIX, character data is not sufficient to represent paths. We
> must, must, must continue to have a simple bytes interface to these
> APIs.

I'd like to respond to this concern in three ways:

1. The PEP doesn't remove any of the existing interfaces. So if the interfaces for byte-oriented file names in 3.0 work fine for you, feel free to continue to use them.

2. Even if they were taken away (which the PEP does not propose to do), it would be easy to emulate them for applications that want them. For example, listdir could be wrapped as

    import os
    import sys

    def listdir_b(bytestring):
        # Decode the directory name with the file system encoding.
        fse = sys.getfilesystemencoding()
        string = bytestring.decode(fse, "python-escape")
        for fn in os.listdir(string):
            # Re-encoding with the same handler restores the original bytes.
            yield fn.encode(fse, "python-escape")

3. I still disagree that we must, must, must continue to provide these interfaces. I don't understand from the rest of your message what would *actually* break if people would use the proposed interfaces.

Regards,
Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Dirkjan Ochtman wrote:
> On 22/04/2009 14:20, [email protected] wrote:
>> -1. On UNIX, character data is not sufficient to represent paths. We
>> must, must, must continue to have a simple bytes interface to these
>> APIs. Covering it up in layers of obscure encoding hacks will not make
>> the problem go away, it will just make it harder to understand.
>
> As a hg developer, I have to concur. Keeping bytes-based APIs intact
> would make porting hg to py3k much, much easier. You may be able to
> imagine that dealing with paths correctly cross-platform on a VCS is a
> major PITA, and py3k is currently not helping the situation.

I find these statements contradictory: py3k *is* keeping the byte-based APIs for file names intact, so why is it not helping the situation, when this is what is needed to make porting much, much easier?

Regards,
Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Yeah, but IIRC a complete set of bytes APIs doesn't exist yet in py3k.

Define complete. I'm not aware of any interfaces wrt. file IO that are lacking, so which ones were you thinking of?

Python doesn't currently provide a way to access environment variables and command line arguments as bytes. With the PEP, such a way would actually become available for applications that desire it.

Regards,
Martin
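As a rough sketch of what "such a way" could look like in practice: released Python 3 provides `os.fsencode`, which encodes a str with the file system encoding and the escaping error handler, so the original bytes of an environment variable or command line argument can be recovered from the str API. The helper names `environ_bytes` and `argv_bytes` are hypothetical, and the thread itself only describes encoding with the file system encoding plus the handler, not these functions.

```python
import os
import sys


def environ_bytes(name):
    """Recover the raw bytes of an environment variable read as str."""
    return os.fsencode(os.environ[name])


def argv_bytes():
    """Recover the raw bytes of every command line argument."""
    return [os.fsencode(arg) for arg in sys.argv]


# Demo variable name is made up for this example.
os.environ["PEP383_DEMO"] = "plain"
assert environ_bytes("PEP383_DEMO") == b"plain"
```

On POSIX builds where `os.supports_bytes_environ` is true, a value set through `os.environb` with undecodable bytes should come back unchanged through `environ_bytes`.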
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
MRAB wrote:
> Martin v. Löwis wrote:
> [snip]
>> To convert non-decodable bytes, a new error handler "python-escape" is
>> introduced, which decodes non-decodable bytes using into a private-use
>> character U+F01xx, which is believed to not conflict with private-use
>> characters that currently exist in Python codecs.
>>
>> The error handler interface is extended to allow the encode error
>> handler to return byte strings immediately, in addition to returning
>> Unicode strings which then get encoded again.
>>
>> If the locale's encoding is UTF-8, the file system encoding is set to
>> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
>> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
>>
> If the byte stream happens to include a sequence which decodes to
> U+F01xx, shouldn't that raise an exception?

I apparently have not expressed it clearly, so please help me improve the text. What I mean is this:

- if the environment encoding (for lack of better name) is UTF-8, Python stops using the utf-8 codec under this PEP, and switches to the utf-8b codec.
- otherwise (env encoding is not utf-8), undecodable bytes get decoded with the error handler. In this case, U+F01xx will not occur in the byte stream, since no other codec ever produces this PUA character (this is not fully true - UTF-16 may also produce PUA characters, but they can't appear as env encodings).

So the case you are referring to should not happen.

Regards,
Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Wed, 22 Apr 2009 at 21:21, "Martin v. Löwis" wrote:
>> Yeah, but IIRC a complete set of bytes APIs doesn't exist yet in py3k.
>
> Define complete. I'm not aware of any interfaces wrt. file IO that are
> lacking, so which ones were you thinking of?
>
> Python doesn't currently provide a way to access environment variables
> and command line arguments as bytes. With the PEP, such a way would
> actually become available for applications that desire it.

Those are the two that I'm thinking of.

I think I understand your proposal better now after your example of implementing listdir(bytes). Putting it in the PEP would probably be a good idea.

I personally don't have enough practice in actually working with various encodings (or any understanding of unicode escapes) to comment further.

--David
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Martin v. Löwis wrote:
>> "correct" -> "corrected"
>
> Thanks, fixed.
>
>>> To convert non-decodable bytes, a new error handler "python-escape" is
>>> introduced, which decodes non-decodable bytes using into a private-use
>>> character U+F01xx, which is believed to not conflict with private-use
>>> characters that currently exist in Python codecs.
>> Would this mean that real private use characters in the file name would
>> raise an exception? How? The UTF-8 decoder doesn't pass those bytes to
>> any error handler.
>
> The python-escape codec is only used/meaningful if the env encoding
> is not UTF-8. For any other encoding, it is assumed that no character
> actually maps to the private-use characters.

Which should be true for any encoding from the pre-unicode era, but not for UTF-16/32 and variants.

>>> The error handler interface is extended to allow the encode error
>>> handler to return byte strings immediately, in addition to returning
>>> Unicode strings which then get encoded again.
>> Then the error callback for encoding would become specific to the target
>> encoding.
>
> Why would it become specific? It can work the same way for any encoding:
> take U+F01xx, and generate the byte xx.

If any error callback emits bytes these byte sequences must be legal in the target encoding, which depends on the target encoding itself.

However for the normal use of this error handler this might be irrelevant, because those filenames that get encoded were constructed in such a way that reencoding them regenerates the original byte sequence.

>>> If the locale's encoding is UTF-8, the file system encoding is set to
>>> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
>>> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
>> Is this done by the codec, or the error handler? If it's done by the
>> codec I don't see a reason for the "python-escape" error handler.
>
> utf-8b is a new codec. However, the utf-8b codec is only used if the
> env encoding would otherwise be utf-8. For utf-8b, the error handler
> is indeed unnecessary.

Wouldn't it make more sense to be consistent how non-decodable bytes get decoded? I.e. should the utf-8b codec decode those bytes to PUA characters too (and refuse to encode them, so the error handler outputs them)?

>>> While providing a uniform API to non-decodable bytes, this interface
>>> has the limitation that chosen representation only "works" if the data
>>> get converted back to bytes with the python-escape error handler
>>> also.
>> I thought the error handler would be used for decoding.
>
> It's used in both directions: for decoding, it converts \xXX to
> U+F01XX. For encoding, U+F01XX will trigger an error, which is then
> handled by the handler to produce \xXX.

But only for non-UTF8 encodings?

Servus,
Walter
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 2009-04-22 22:06, Walter Dörwald wrote:
> Martin v. Löwis wrote:
>>> "correct" -> "corrected"
>> Thanks, fixed.
>>
>>>> To convert non-decodable bytes, a new error handler "python-escape" is
>>>> introduced, which decodes non-decodable bytes using into a private-use
>>>> character U+F01xx, which is believed to not conflict with private-use
>>>> characters that currently exist in Python codecs.
>>> Would this mean that real private use characters in the file name would
>>> raise an exception? How? The UTF-8 decoder doesn't pass those bytes to
>>> any error handler.
>> The python-escape codec is only used/meaningful if the env encoding
>> is not UTF-8. For any other encoding, it is assumed that no character
>> actually maps to the private-use characters.
>
> Which should be true for any encoding from the pre-unicode era, but not
> for UTF-16/32 and variants.

Actually it's not even true for the pre-Unicode codecs. It was and is common for Asian companies to use company specific symbols in private areas or extended versions of CJK character sets.

Microsoft even published an editor for Asian users to create their own glyphs as needed:

http://msdn.microsoft.com/en-us/library/cc194861.aspx

Here's an overview for some US companies using such extensions:

http://scripts.sil.org/cms/SCRIPTs/page.php?site_id=nrsi&item_id=VendorUseOfPUA

(it's no surprise that most of these actually defined their own charsets)

SIL even started a registry for the private use areas (PUAs):

http://scripts.sil.org/cms/SCRIPTs/page.php?site_id=nrsi&cat_id=UnicodePUA

This is their current list of assignments:

http://scripts.sil.org/cms/SCRIPTs/page.php?site_id=nrsi&item_id=SILPUAassignments

and here's how to register:

http://scripts.sil.org/cms/SCRIPTs/page.php?site_id=nrsi&cat_id=UnicodePUA#404a261e

--
Marc-Andre Lemburg
eGenix.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
>> The python-escape codec is only used/meaningful if the env encoding
>> is not UTF-8. For any other encoding, it is assumed that no character
>> actually maps to the private-use characters.
>
> Which should be true for any encoding from the pre-unicode era, but not
> for UTF-16/32 and variants.

Right. However, these can't appear as environment/file system encodings, because they use null bytes.

>> Why would it become specific? It can work the same way for any encoding:
>> take U+F01xx, and generate the byte xx.
>
> If any error callback emits bytes these byte sequences must be legal in
> the target encoding, which depends on the target encoding itself.

No. The whole process started with data having an *invalid* encoding in the source encoding (which, after the roundtrip, is now the target encoding). So the python-escape error handler deliberately produces byte sequences that are invalid in the environment encoding (hence the additional permission of having it produce bytes instead of characters).

> However for the normal use of this error handler this might be
> irrelevant, because those filenames that get encoded were constructed in
> such a way that reencoding them regenerates the original byte sequence.

Exactly so. The error handler is not of much use outside this specific scenario.

>> utf-8b is a new codec. However, the utf-8b codec is only used if the
>> env encoding would otherwise be utf-8. For utf-8b, the error handler
>> is indeed unnecessary.
>
> Wouldn't it make more sense to be consistent how non-decodable bytes get
> decoded? I.e. should the utf-8b codec decode those bytes to PUA
> characters too (and refuse to encode then, so the error handler outputs
> them)?

Unfortunately, that won't work. If the original encoding is UTF-8, and uses PUA characters, then, on re-encoding, it's not possible to tell whether to encode as a PUA character, or as an invalid byte.

This was my original proposal a year ago, and people immediately suggested that it is not at all acceptable if there is the slightest chance of information loss. Hence the current PEP.

>>> I thought the error handler would be used for decoding.
>> It's used in both directions: for decoding, it converts \xXX to
>> U+F01XX. For encoding, U+F01XX will trigger an error, which is then
>> handled by the handler to produce \xXX.
>
> But only for non-UTF8 encodings?

Right. For ease of use, the implementation will specify the error handler regardless, and the recommended use for applications will be to use the error handler regardless. For utf-8b, the error handler will never be invoked, since all input can be converted always.

Regards,
Martin
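The reason half surrogates avoid the PUA ambiguity under UTF-8 can be illustrated with released Python 3, where the combination of the "utf-8" codec and the "surrogateescape" error handler behaves like the draft's utf-8b description (invalid bytes >= 0x80 become U+DC80..U+DCFF). Treating that combination as a stand-in for "utf-8b" is an assumption of this sketch; the draft names a separate codec.

```python
# A file name holding a genuine PUA character (valid UTF-8) followed by a
# stray byte \x81 that is not valid UTF-8 on its own.
raw = "\U000F0126".encode("utf-8") + b"\x81"

text = raw.decode("utf-8", "surrogateescape")

# The real PUA character is decoded normally; only the invalid byte is
# escaped, and it lands in the half-surrogate range, not in the PUA.
assert text == "\U000F0126\udc81"

# Re-encoding is unambiguous: the PUA character becomes its UTF-8 bytes,
# the half surrogate becomes the original invalid byte.
assert text.encode("utf-8", "surrogateescape") == raw
```

Had the invalid byte been mapped to U+F0181 instead, the encoder could not tell it apart from a real U+F0181 that was present in the file name, which is the information loss described above.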
Re: [Python-Dev] Issue5434: datetime.monthdelta
On Thu, Apr 16, 2009 at 8:01 PM, Jess Austin wrote:
> These operations are useful in particular contexts. What I've
> submitted is also useful, and currently isn't easy in core,
> batteries-included python. While I would consider the foregoing
> interpretation of the Zen to be backwards (this doesn't add another
> way to do something that's already possible, it makes possible
> something that currently encourages one to pull her hair out), I
> suppose it doesn't matter. If adding a class and a function to a
> module will require extended advocacy on -ideas and c.l.p, I'm
> probably not the person for the job.
>
> If, on the other hand, one of the committers wants to toss this in at
> some point, whether now or 3 versions down the road, the patch is up
> at bugs.python.org (and I'm happy to make any suggested
> modifications). I'm glad to have written this; I learned a bit about
> CPython internals and scraped a layer of rust off my C skills. I will
> go ahead and backport the python-coded version to 2.3. I'll continue
> this conversation with whomever for however long, but I suspect this
> topic will soon have worn out its welcome on python-dev.

I've uploaded the backported python version source distribution to PyPI,

http://pypi.python.org/pypi?name=MonthDelta&:action=display

with better-formatted documentation at

http://packages.python.org/MonthDelta/

"easy_install MonthDelta" works too.

cheers,
Jess
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 07:17 pm, [email protected] wrote:
>> -1. On UNIX, character data is not sufficient to represent paths. We
>> must, must, must continue to have a simple bytes interface to these
>> APIs.

> I'd like to respond to this concern in three ways:
>
> 1. The PEP doesn't remove any of the existing interfaces. So if the
> interfaces for byte-oriented file names in 3.0 work fine for you,
> feel free to continue to use them.

It's good to know this. It would be good if the PEP made it clear that it is proposing an additional way to work with undecodable bytes, not replacing the existing one.

For me, this PEP isn't an acceptable substitute for direct bytes-based access to command-line arguments and environment variables on UNIX. To my knowledge *those* APIs still don't exist yet. I would like it if this PEP were not used as an excuse to avoid adding them.

> 2. Even if they were taken away (which the PEP does not propose to do),
> it would be easy to emulate them for applications that want them.

I think this is a pretty clear abstraction inversion. Luckily nobody is proposing it :).

> 3. I still disagree that we must, must, must continue to provide these
> interfaces.

You do have a point; if there is a clean, defined mapping between str and bytes in terms of all path/argv/environ APIs, then we don't *need* those APIs, since we can just implement them in terms of characters. But I still think that's a bad idea, since mixing the returned strings with *other* APIs remains problematic. However, I still think the mapping you propose is problematic...

> I don't understand from the rest of your message what
> would *actually* break if people would use the proposed interfaces.

As far as more concrete problems: the utf-8 codec currently in python 2.5 and 2.6, and 3.0 will happily encode half-surrogates, at least in the builds I have.

    >>> '\udc81'.encode('utf-8').decode('utf-8')
    '\udc81'

So there's an ambiguity when passing U+DC81 to this codec: do you mean \xed\xb2\x81 or do you just mean \x81? Of course it would be possible to make UTF-8B consistent in this regard, but it is still going to interact with code that thinks in terms of actual UTF-8, and the failure mode here is very difficult to inspect.

A major problem here is that it's very difficult to puzzle out whether anything *will* actually break. I might be wrong about the above for some subtlety of unicode that I don't quite understand, but I don't want to spend all day experimenting with every possible set of build options, python versions, and unicode specifications. Neither, I wager, do most people who want to call listdir().

Another specific problem: looking at the Character Map application on my desktop, U+F0126 and U+F0127 are considered printable characters. I'm not sure what they're supposed to be, exactly, but there are glyphs there. This is running Ubuntu 8.04; there may be more of these in use in more recent versions of GNOME.

There is nothing "private" about the "private use" area; Python can never use any of these characters for *anything*, except possibly internally in ways which are never exposed to application code, because the operating system (or window system, or libraries) might use them. If I pass a string with those printable PUA/A characters in it to listdir(), what happens? Do they get turned into bytes, do they only get turned into bytes if my filesystem encoding happens to be something other than UTF-8...?

The PEP seems a bit ambiguous to me as far as how the PUA hack and the half-surrogate hack interact. I could be wrong, but it seems to me to be an either-or proposition, in which case there would be *four* bytes types in python 3.1: bytes, bytearray, str-with-PUA/A-junk, str-with-half-surrogate-junk. Detecting the difference would be an expensive and subtle affair; the simplest solution I could think of would be to use an error-prone regex. If the encoding hack used were simply NUL, then the detection would be straightforward: "if '\u0000' in thingy:".

Ultimately I think I'm only -0 on all of this now, as long as we get bytes versions of environ and argv. Even if these corner-case issues aren't fixed, those of us who want to have correct handling of undecodable filenames can do so.
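The detection and ambiguity concerns raised in this message can be made concrete with a small sketch. The helper `has_escaped_bytes` is hypothetical, and the behaviour shown is that of released Python 3 (3.1 and later), where the strict utf-8 codec rejects lone surrogates; the 3.0 builds quoted above did not. The handler names "surrogatepass" and "surrogateescape" are the ones shipped in released Python, not names used in the draft.

```python
def has_escaped_bytes(s):
    """Return True if s contains a half surrogate in U+DC80..U+DCFF."""
    return any("\udc80" <= ch <= "\udcff" for ch in s)


name = "abc\udc81"
assert has_escaped_bytes(name)
assert not has_escaped_bytes("abc")

# The two readings of U+DC81 that the message asks about:
assert "\udc81".encode("utf-8", "surrogatepass") == b"\xed\xb2\x81"
assert "\udc81".encode("utf-8", "surrogateescape") == b"\x81"

# Strict UTF-8 refuses the lone surrogate instead of picking one reading.
try:
    "\udc81".encode("utf-8")
except UnicodeEncodeError:
    strict_rejects = True
else:
    strict_rejects = False
assert strict_rejects
```

A scan like `has_escaped_bytes` is linear in the string length, so it is not as cheap as a single membership test for NUL, but it avoids the regex the message worries about.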
