Re: [Python-Dev] Bytes path related questions for Guido
On 24 August 2014 14:44, Nick Coghlan wrote:
> 2. Should we add some additional helpers to the string module for
> dealing with surrogate escaped bytes and other techniques for
> smuggling arbitrary binary data as text?
>
> My proposal [3] is to add:
>
> * string.escaped_surrogates (constant with the 128 escaped code points)
> * string.clean(s): replaces surrogates with '\ufffd' or another
>   specified code point
> * string.redecode(s, encoding): encodes a string back to bytes and
>   then decodes it again using the specified encoding (the old encoding
>   defaults to 'latin-1' to match the assumptions in WSGI)

Serhiy & Ezio convinced me to scale this one back to a proposal for
"codecs.clean_surrogate_escapes(s)", which replaces surrogates that may
be produced by surrogateescape (that's what string.clean() above was
supposed to be, but my description was not correct, and the name was
too vague for that error to be obvious to the reader).

"s != codecs.clean_surrogate_escapes(s)" would then become the check
for "does this string contain any surrogate escaped bytes?"

Regards,
Nick.

--
Nick Coghlan | [email protected] | Brisbane, Australia
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
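A minimal sketch of what the scaled-back helper might look like. The
name clean_surrogate_escapes comes from the message above, but it is
only a proposal there, not an existing codecs function; the
"replacement" parameter and the regex-based implementation are
assumptions made for illustration.

```python
import re

# U+DC80..U+DCFF: the 128 lone surrogates that the "surrogateescape"
# error handler uses to stand in for undecodable bytes 0x80..0xFF.
_ESCAPED = re.compile("[\udc80-\udcff]")


def clean_surrogate_escapes(s, replacement="\ufffd"):
    """Hypothetical helper: replace surrogate-escaped bytes in s."""
    return _ESCAPED.sub(replacement, s)


raw = b"caf\xe9.txt"                           # Latin-1 bytes, invalid as UTF-8
name = raw.decode("utf-8", "surrogateescape")  # the bad byte becomes U+DCE9
cleaned = clean_surrogate_escapes(name)

# The check described above: the string changes only if it held escapes.
has_escapes = name != cleaned
```

Only the range produced by surrogateescape is touched here; other lone
surrogates (U+D800..U+DC7F, U+DD00..U+DFFF) are left alone.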
Re: [Python-Dev] Bytes path related questions for Guido
On 24/08/2014 09:04, Nick Coghlan wrote:
> On 24 August 2014 14:44, Nick Coghlan wrote:
>> 2. Should we add some additional helpers to the string module for
>> dealing with surrogate escaped bytes and other techniques for
>> smuggling arbitrary binary data as text?
>>
>> My proposal [3] is to add:
>>
>> * string.escaped_surrogates (constant with the 128 escaped code points)
>> * string.clean(s): replaces surrogates with '\ufffd' or another
>>   specified code point
>> * string.redecode(s, encoding): encodes a string back to bytes and
>>   then decodes it again using the specified encoding (the old encoding
>>   defaults to 'latin-1' to match the assumptions in WSGI)
>
> Serhiy & Ezio convinced me to scale this one back to a proposal for
> "codecs.clean_surrogate_escapes(s)", which replaces surrogates that
> may be produced by surrogateescape (that's what string.clean() above
> was supposed to be, but my description was not correct, and the name
> was too vague for that error to be obvious to the reader)

"clean" conveys the wrong meaning. It should use a scary word such as
"trap". "Cleaning" surrogates is unlikely to be the right procedure when
dealing with surrogates produced by undecodable byte sequences.

Regards

Antoine.
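One possible reading of the objection above, sketched under
assumptions: instead of silently substituting a character, fail loudly
when escaped bytes are present. The function name
trap_surrogate_escapes and the choice of ValueError are hypothetical;
the message only suggests a scarier name, not this behaviour.

```python
def trap_surrogate_escapes(s):
    """Hypothetical helper: raise if s carries surrogate-escaped bytes."""
    for index, ch in enumerate(s):
        if "\udc80" <= ch <= "\udcff":
            # surrogateescape maps byte 0xNN to U+DCNN, so the original
            # byte value can be recovered for the error message.
            original_byte = ord(ch) - 0xDC00
            raise ValueError(
                f"undecodable byte 0x{original_byte:02x} at index {index}"
            )
    return s


name = b"caf\xe9.txt".decode("utf-8", "surrogateescape")
try:
    trap_surrogate_escapes(name)
except ValueError as exc:
    message = str(exc)
```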
Re: [Python-Dev] Bytes path related questions for Guido
On 25 August 2014 00:23, Antoine Pitrou wrote:
> On 24/08/2014 09:04, Nick Coghlan wrote:
>> Serhiy & Ezio convinced me to scale this one back to a proposal for
>> "codecs.clean_surrogate_escapes(s)", which replaces surrogates that
>> may be produced by surrogateescape (that's what string.clean() above
>> was supposed to be, but my description was not correct, and the name
>> was too vague for that error to be obvious to the reader)
>
> "clean" conveys the wrong meaning. It should use a scary word such as
> "trap". "Cleaning" surrogates is unlikely to be the right procedure when
> dealing with surrogates produced by undecodable byte sequences.

"purge_surrogate_escapes" was the other term that occurred to me.

Either way, my use case is to filter them out when I *don't* want to
pass them along to other software, but would prefer the Unicode
replacement character to the ASCII question mark created by using the
"replace" error handler when encoding.

Cheers,
Nick.

--
Nick Coghlan | [email protected] | Brisbane, Australia
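The difference described above can be seen with the standard codecs
alone. This is a small illustration, assuming a UTF-8 target encoding;
the variable names are made up for the example.

```python
name = b"caf\xe9.txt".decode("utf-8", "surrogateescape")

# errors="replace" on encode turns the escaped byte into an ASCII "?".
with_question_mark = name.encode("utf-8", "replace")

# Substituting U+FFFD before encoding keeps a visible marker that the
# original byte could not be decoded.
purged = "".join(
    "\ufffd" if "\udc80" <= ch <= "\udcff" else ch for ch in name
)
with_replacement_char = purged.encode("utf-8")
```

In the first case the receiving software cannot tell a substituted
byte from a literal question mark; in the second it sees U+FFFD.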
Re: [Python-Dev] Bytes path related questions for Guido
Yes on #1 -- making the low-level functions more usable for edge cases
by supporting bytes seems fine (as long as the support for strings,
where it exists, is not compromised).

The status of pathlib is a little unclear to me -- is there a plan to
eventually support bytes or not?

For #2 I think you should probably just work with the others you have
mentioned.

On Sat, Aug 23, 2014 at 9:44 PM, Nick Coghlan wrote:
> At Guido's request, splitting out two specific questions from Serhiy's
> thread where I believe we could do with an explicit "yes or no" from
> him.
>
> 1. Should we accept patches adding support for the direct use of bytes
> paths in lower level filesystem manipulation APIs? (i.e. everything
> that isn't pathlib)
>
> This was Serhiy's original question (due to some open issues [1,2]). I
> think the answer is yes, as we already do in some cases, and the
> "pathlib doesn't support binary paths" design decision is a high level
> platform independent API vs low level potentially platform dependent
> API one rather than being about disallowing the use of bytes paths in
> general.
>
> [1] http://bugs.python.org/issue19997
> [2] http://bugs.python.org/issue20797
>
> 2. Should we add some additional helpers to the string module for
> dealing with surrogate escaped bytes and other techniques for
> smuggling arbitrary binary data as text?
>
> My proposal [3] is to add:
>
> * string.escaped_surrogates (constant with the 128 escaped code points)
> * string.clean(s): replaces surrogates with '\ufffd' or another
>   specified code point
> * string.redecode(s, encoding): encodes a string back to bytes and
>   then decodes it again using the specified encoding (the old encoding
>   defaults to 'latin-1' to match the assumptions in WSGI)
>
> "s != string.clean(s)" would then serve as a check for "does this
> string contain any surrogate escaped bytes?"
>
> [3] http://bugs.python.org/issue18814#msg225791
>
> Regards,
> Nick.

--
--Guido van Rossum (python.org/~guido)
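A rough sketch of the redecode idea from the quoted proposal, for the
WSGI-style case where UTF-8 bytes were decoded as Latin-1. No such
function exists in the string module; the keyword name old_encoding is
an assumption, since the proposal only says the old encoding defaults
to 'latin-1'.

```python
def redecode(s, encoding, old_encoding="latin-1"):
    """Hypothetical helper: undo a decode done with the wrong encoding."""
    # Encode back to the original bytes, then decode them properly.
    return s.encode(old_encoding).decode(encoding)


# UTF-8 bytes that were decoded as Latin-1, producing mojibake.
wsgi_value = "caf\u00e9".encode("utf-8").decode("latin-1")
fixed = redecode(wsgi_value, "utf-8")
```

Latin-1 works as the default because it maps every byte 0x00..0xFF to
exactly one code point, so the encode step recovers the original bytes
without loss.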
Re: [Python-Dev] Bytes path related questions for Guido
On 25 Aug 2014 03:55, "Guido van Rossum" wrote:
>
> Yes on #1 -- making the low-level functions more usable for edge cases
> by supporting bytes seems fine (as long as the support for strings,
> where it exists, is not compromised).

Thanks!

> The status of pathlib is a little unclear to me -- is there a plan to
> eventually support bytes or not?

It's text only and Antoine plans to keep it that way - the concatenation
operations, etc., are really only safe if you decode first.

>
> For #2 I think you should probably just work with the others you have
> mentioned.

Yes, that sounds like a good idea. There's been some good progress on
the issue tracker, so I think we can thrash out some workable (and
comprehensible!) utilities that will be useful in their own right while
also serving as aids to understanding for the underlying mechanisms.

Cheers,
Nick.
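The split discussed above (bytes accepted by the low-level functions,
text only in pathlib) can be sketched as follows. This assumes a POSIX
system, where os.fsdecode and os.fsencode use the surrogateescape
error handler; the file name is made up for the example.

```python
import os
import pathlib

raw = b"caf\xe9.txt"    # a file name that is not valid UTF-8

# Low-level helpers such as os.path.join accept bytes and return bytes.
joined = os.path.join(b"data", raw)

# pathlib is text only: bytes arguments are rejected with TypeError.
try:
    pathlib.PurePosixPath(raw)
except TypeError:
    rejected = True
else:
    rejected = False

# Decode first, then use pathlib; undecodable bytes survive the trip
# through str and come back unchanged from os.fsencode.
text = os.fsdecode(raw)
path = pathlib.PurePosixPath("data") / text
round_trip = os.fsencode(str(path))
```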
