Re: [Python-Dev] Bytes path related questions for Guido

2014-08-24 Thread Nick Coghlan
On 24 August 2014 14:44, Nick Coghlan wrote:
> 2. Should we add some additional helpers to the string module for
> dealing with surrogate escaped bytes and other techniques for
> smuggling arbitrary binary data as text?
>
> My proposal [3] is to add:
>
> * string.escaped_surrogates (constant with the 128 escaped code points)
> * string.clean(s): replaces surrogates with '\ufffd' or another
> specified code point
> * string.redecode(s, encoding): encodes a string back to bytes and
> then decodes it again using the specified encoding (the old encoding
> defaults to 'latin-1' to match the assumptions in WSGI)
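
As a rough illustration of the redecode idea quoted above: this is only a sketch, and the function name, its signature, and the `old_encoding` keyword are assumptions rather than an API that exists in the standard library. It reinterprets a string that was decoded as latin-1 (the WSGI convention) as UTF-8:

```python
def redecode(s, encoding, old_encoding="latin-1"):
    """Hypothetical helper: undo one decoding and apply another."""
    return s.encode(old_encoding).decode(encoding)


# UTF-8 bytes for "café" that were mis-decoded as latin-1.
mangled = "caf\u00e9".encode("utf-8").decode("latin-1")
fixed = redecode(mangled, "utf-8")
print(ascii(mangled), ascii(fixed))
```

The latin-1 default works as a transport encoding because it maps every byte 0x00..0xFF to the code point with the same number, so the encode step always recovers the original bytes.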


Serhiy & Ezio convinced me to scale this one back to a proposal for
"codecs.clean_surrogate_escapes(s)", which replaces surrogates that
may be produced by surrogateescape (that's what string.clean() above
was supposed to be, but my description was not correct, and the name
was too vague for that error to be obvious to the reader)

"s != codecs.clean_surrogate_escapes(s)" would then become the check
for "does this string contain any surrogate escaped bytes?"
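
A minimal sketch of what such a function and check might look like. The function does not exist in the codecs module; the name comes from the proposal, and the `replacement` parameter and the `has_surrogate_escapes` wrapper are assumptions made for illustration:

```python
import re

# surrogateescape maps undecodable bytes 0x80..0xFF to the lone
# surrogates U+DC80..U+DCFF, so only that range is targeted.
_ESCAPED = re.compile("[\udc80-\udcff]")


def clean_surrogate_escapes(s, replacement="\ufffd"):
    """Hypothetical helper: replace escaped surrogates in s."""
    return _ESCAPED.sub(replacement, s)


def has_surrogate_escapes(s):
    """The check described above: did cleaning change anything?"""
    return s != clean_surrogate_escapes(s)


# Latin-1 bytes that are not valid UTF-8.
name = b"caf\xe9.txt".decode("utf-8", "surrogateescape")
print(ascii(name))
print(ascii(clean_surrogate_escapes(name)))
print(has_surrogate_escapes(name), has_surrogate_escapes("plain"))
```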

Regards,
Nick.

-- 
Nick Coghlan   |   [email protected]   |   Brisbane, Australia
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Bytes path related questions for Guido

2014-08-24 Thread Antoine Pitrou

On 24/08/2014 09:04, Nick Coghlan wrote:
> On 24 August 2014 14:44, Nick Coghlan wrote:
>> 2. Should we add some additional helpers to the string module for
>> dealing with surrogate escaped bytes and other techniques for
>> smuggling arbitrary binary data as text?
>>
>> My proposal [3] is to add:
>>
>> * string.escaped_surrogates (constant with the 128 escaped code points)
>> * string.clean(s): replaces surrogates with '\ufffd' or another
>> specified code point
>> * string.redecode(s, encoding): encodes a string back to bytes and
>> then decodes it again using the specified encoding (the old encoding
>> defaults to 'latin-1' to match the assumptions in WSGI)
>
> Serhiy & Ezio convinced me to scale this one back to a proposal for
> "codecs.clean_surrogate_escapes(s)", which replaces surrogates that
> may be produced by surrogateescape (that's what string.clean() above
> was supposed to be, but my description was not correct, and the name
> was too vague for that error to be obvious to the reader)


"clean" conveys the wrong meaning. It should use a scary word such as 
"trap". "Cleaning" surrogates is unlikely to be the right procedure when 
dealing with surrogates produced by undecodable byte sequences.


Regards

Antoine.




Re: [Python-Dev] Bytes path related questions for Guido

2014-08-24 Thread Nick Coghlan
On 25 August 2014 00:23, Antoine Pitrou wrote:
> On 24/08/2014 09:04, Nick Coghlan wrote:
>> Serhiy & Ezio convinced me to scale this one back to a proposal for
>> "codecs.clean_surrogate_escapes(s)", which replaces surrogates that
>> may be produced by surrogateescape (that's what string.clean() above
>> was supposed to be, but my description was not correct, and the name
>> was too vague for that error to be obvious to the reader)
>
>
> "clean" conveys the wrong meaning. It should use a scary word such as
> "trap". "Cleaning" surrogates is unlikely to be the right procedure when
> dealing with surrogates produced by undecodable byte sequences.

"purge_surrogate_escapes" was the other term that occurred to me.

Either way, my use case is to filter them out when I *don't* want to
pass them along to other software, but would prefer the Unicode
replacement character to the ASCII question mark created by using the
"replace" filter when encoding.

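The difference between the two outcomes can be shown with existing built-ins only. This is a sketch of the use case, not the proposed function; the variable names are illustrative:

```python
# One undecodable byte smuggled through as U+DCE9.
name = b"caf\xe9".decode("utf-8", "surrogateescape")

# Encoding with errors="replace" substitutes an ASCII question mark.
with_question_mark = name.encode("utf-8", "replace")

# Substituting U+FFFD first keeps an unambiguous replacement character.
purged = "".join(
    "\ufffd" if "\udc80" <= ch <= "\udcff" else ch for ch in name
)
with_replacement_char = purged.encode("utf-8")

print(with_question_mark)
print(with_replacement_char)
```
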
Cheers,
Nick.

-- 
Nick Coghlan   |   [email protected]   |   Brisbane, Australia


Re: [Python-Dev] Bytes path related questions for Guido

2014-08-24 Thread Guido van Rossum
Yes on #1 -- making the low-level functions more usable for edge cases by
supporting bytes seems fine (as long as the support for strings, where it
exists, is not compromised).
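
For context, several low-level APIs already accept bytes paths and return bytes results when given them. A small sketch using only existing os functions (the file name is made up for the example):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    tmp_b = os.fsencode(tmp)                    # str path -> bytes path
    path_b = os.path.join(tmp_b, b"data.txt")   # bytes in, bytes out
    with open(path_b, "wb") as f:
        f.write(b"hello")

    names_b = os.listdir(tmp_b)   # bytes argument gives bytes names
    names_s = os.listdir(tmp)     # str argument gives str names

print(names_b, names_s)
```

The str behaviour is unchanged by the bytes support, which is the condition stated above.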

The status of pathlib is a little unclear to me -- is there a plan to
eventually support bytes or not?

For #2 I think you should probably just work with the others you have
mentioned.


On Sat, Aug 23, 2014 at 9:44 PM, Nick Coghlan wrote:

> At Guido's request, splitting out two specific questions from Serhiy's
> thread where I believe we could do with an explicit "yes or no" from
> him.
>
> 1. Should we accept patches adding support for the direct use of bytes
> paths in lower level filesystem manipulation APIs? (i.e. everything
> that isn't pathlib)
>
> This was Serhiy's original question (due to some open issues [1,2]). I
> think the answer is yes, as we already do in some cases, and the
> "pathlib doesn't support binary paths" design decision is a high level
> platform independent API vs low level potentially platform dependent
> API one rather than being about disallowing the use of bytes paths in
> general.
>
> [1] http://bugs.python.org/issue19997
> [2] http://bugs.python.org/issue20797
>
> 2. Should we add some additional helpers to the string module for
> dealing with surrogate escaped bytes and other techniques for
> smuggling arbitrary binary data as text?
>
> My proposal [3] is to add:
>
> * string.escaped_surrogates (constant with the 128 escaped code points)
> * string.clean(s): replaces surrogates with '\ufffd' or another
> specified code point
> * string.redecode(s, encoding): encodes a string back to bytes and
> then decodes it again using the specified encoding (the old encoding
> defaults to 'latin-1' to match the assumptions in WSGI)
>
> "s != string.clean(s)" would then serve as a check for "does this
> string contain any surrogate escaped bytes?"
>
> [3] http://bugs.python.org/issue18814#msg225791
>
> Regards,
> Nick.
>
> --
> Nick Coghlan   |   [email protected]   |   Brisbane, Australia
>



-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Bytes path related questions for Guido

2014-08-24 Thread Nick Coghlan
On 25 Aug 2014 03:55, "Guido van Rossum" wrote:
>
> Yes on #1 -- making the low-level functions more usable for edge cases by
> supporting bytes seems fine (as long as the support for strings, where it
> exists, is not compromised).

Thanks!

> The status of pathlib is a little unclear to me -- is there a plan to
> eventually support bytes or not?

It's text only and Antoine plans to keep it that way - the concatenation
operations, etc, are really only safe if you decode first.
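
A short sketch of the "decode first" approach, using existing APIs; the path itself is made up for the example:

```python
import os
import pathlib

raw = b"logs/app.log"

# pathlib rejects bytes outright.
try:
    pathlib.PurePosixPath(raw)
except TypeError:
    rejected = True
else:
    rejected = False

# Decode with the filesystem encoding, then do the path arithmetic on text.
path = pathlib.PurePosixPath(os.fsdecode(raw))
moved = path.parent / "archive" / path.name
print(rejected, moved)
```

os.fsdecode uses the surrogateescape error handler on POSIX systems, so undecodable bytes survive the trip and os.fsencode can turn the result back into the original bytes.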

>
> For #2 I think you should probably just work with the others you have
> mentioned.

Yes, that sounds like a good idea. There's been some good progress on the
issue tracker, so I think we can thrash out some workable (and
comprehensible!) utilities that will be useful in their own right while
also serving as aids to understanding for the underlying mechanisms.
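
As an aid to the underlying mechanism, here is a sketch of the surrogateescape round trip using only built-in codecs. The `escaped_surrogates` name echoes the constant in the original proposal and is not an existing attribute of the string module:

```python
# The 128 code points surrogateescape can produce: U+DC80..U+DCFF.
escaped_surrogates = "".join(chr(cp) for cp in range(0xDC80, 0xDD00))

# Every possible byte value, decoded as ASCII with surrogateescape.
all_bytes = bytes(range(256))
text = all_bytes.decode("ascii", "surrogateescape")

# Bytes 0x80..0xFF come back as the escaped surrogates, in order,
# and encoding with the same handler restores the original bytes.
round_tripped = text.encode("ascii", "surrogateescape")
print(len(escaped_surrogates), text[128:] == escaped_surrogates)
print(round_tripped == all_bytes)
```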

Cheers,
Nick.
