[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

2015-05-09 Thread Nick Coghlan
Nick Coghlan added the comment: surrogateescape and surrogatepass data *already* can't be inverted back to bytes reliably without knowing the original encoding - if you encode them as something else when they contain surrogates, you'll either get an exception (the default) or mojibake (if you us
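
The behaviour described in the comment above can be reproduced with a short sketch. The sample bytes and variable names are assumptions chosen for illustration; only the standard "surrogateescape", "strict" and "replace" error handlers are used.

```python
# Assumed sample data: "caf" followed by one Latin-1 byte that is not valid UTF-8.
raw = b'caf\xe9'

# Decoding with surrogateescape smuggles the undecodable byte as U+DCE9.
text = raw.decode('utf-8', 'surrogateescape')

# The original bytes come back only when the same error handler is used again.
round_trip = text.encode('utf-8', 'surrogateescape')

# The default (strict) handler refuses to encode the lone surrogate.
try:
    text.encode('utf-8')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# A lossy handler succeeds but discards the smuggled byte.
lossy = text.encode('utf-8', 'replace')
```

Here round_trip equals raw, while lossy no longer contains the original byte, which is the kind of silent data loss the comment warns about.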

[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

2015-05-09 Thread Stephen J. Turnbull
Stephen J. Turnbull added the comment: Please do not add the "rehandle" functions to codecs. They do not change the (duck-typed) representation of data while maintaining the semantics, they change the semantics of data while retaining the representation. I suggest a "validation" submodule of

[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

2015-03-17 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Note that the provided Python implementations are only a proof of concept. After discussion I'll provide more efficient C implementations, which should be one to two orders of magnitude faster (and infinitely fast for the common case of ASCII strings).

[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

2015-03-17 Thread Nick Coghlan
Nick Coghlan added the comment: Oh, and yes, I agree a python-dev discussion would be a good idea. From my perspective, "rehandle_surrogateescape" is the key function for making it easier to check for malformed input data from operating system interfaces. The other items I don't personally h

[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

2015-03-17 Thread Nick Coghlan
Nick Coghlan added the comment: I'd wondered about that with respect to rehandle_surrogatepass. The current implementation looks like it processes *all* surrogates (even valid surrogate pairs), so "handle_surrogates" might be a suitable name. If the intent is for it to be "handle_lone_surrogat
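
The distinction drawn above, between processing every surrogate and processing only lone ones, can be sketched as follows. The helper name find_lone_surrogates is hypothetical and is not taken from the patch; the sketch assumes that a high surrogate immediately followed by a low surrogate counts as a valid pair.

```python
import re

# First alternative: a well-formed pair (high surrogate followed by low surrogate).
# Second alternative: any surrogate that was not consumed as part of a pair.
_PAIR_OR_LONE = re.compile('([\ud800-\udbff][\udc00-\udfff])|([\ud800-\udfff])')


def find_lone_surrogates(text):
    """Return the indexes of surrogates in text that are not part of a pair."""
    return [m.start() for m in _PAIR_OR_LONE.finditer(text) if m.group(2)]
```

For the input 'a\ud83d\ude00b\udca4' the pair at indexes 1 and 2 is skipped and only index 4 is reported.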

[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

2015-03-17 Thread Serhiy Storchaka
Changes by Serhiy Storchaka: Added file: http://bugs.python.org/file38520/codecs_convert_escapes_2.patch

[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

2015-03-17 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: I uploaded the patch just before your comment, Nick. Here is the updated patch. The functions are renamed as Nick suggested, and two more functions are added: decompose_astrals() and compose_surrogate_pairs(). They are mainly here as examples; they can be committed in other

[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

2015-03-16 Thread Nick Coghlan
Nick Coghlan added the comment: (Serhiy, did you miss uploading the new patch?) Regarding the names, we may need to think about the use cases a bit more explicitly to clarify that in terms of the Python codecs API rather than expecting folks to understand the underlying representation. In the

[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

2015-03-16 Thread Serhiy Storchaka
Changes by Serhiy Storchaka: -- keywords: +patch Added file: http://bugs.python.org/file38506/codecs_convert_escapes.patch

[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

2015-03-16 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: The proposed preliminary patch adds three functions to the codecs module:

convert_surrogates(data, errors) -- handle lone surrogates with the specified error handler.

>>> codecs.convert_surrogates('a\u20ac\udca4', 'backslashreplace')
'a€\\udca4'

convert_surrogatees

[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

2015-03-15 Thread Serhiy Storchaka
Changes by Serhiy Storchaka: -- dependencies: +Add support of UnicodeTranslateError in standard error handlers

[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

2014-09-23 Thread Nick Coghlan
Nick Coghlan added the comment: Ah, Serhiy's approach of avoiding the encode/decode dance entirely is an even better idea - replacing the lone surrogates directly with the output of the alternative error handler avoids any need to worry about the original encoding.
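
A minimal sketch of the direct-replacement idea follows. It is not the code from the attached patch; the helper names are hypothetical, and it assumes that only the U+DC80..U+DCFF range produced by the "surrogateescape" handler needs treatment.

```python
import re

# Code points that the surrogateescape handler uses for bytes 0x80..0xFF.
_ESCAPED_BYTE = re.compile('[\udc80-\udcff]')


def replace_escaped_bytes(text, replacement='\ufffd'):
    """Swap every smuggled byte for a fixed replacement string."""
    # A callable is used so the replacement is never parsed as a regex template.
    return _ESCAPED_BYTE.sub(lambda m: replacement, text)


def backslash_escaped_bytes(text):
    """Render every smuggled byte as a visible \\xNN escape."""
    return _ESCAPED_BYTE.sub(
        lambda m: '\\x%02x' % (ord(m.group()) - 0xDC00), text)
```

Because the string is never encoded back to bytes, the result does not depend on which codec originally decoded the data.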

[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

2014-09-23 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Good catch, Antoine! Here is a sample of a more complicated implementation. -- title: Add a convert_surrogates function to "clean" surrogate escaped strings -> Add codecs.convert_surrogateescape to "clean" surrogate escaped strings Added file: http://bu

[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

2014-09-23 Thread R. David Murray
R. David Murray added the comment: Oh, wait, I forgot that the context for this was dealing with unix filenames and/or stdio. So, a function that just uses the fsencoding to do the replace might indeed be appropriate, but in that case should probably live in the os module. os.convert_surroga
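
One way to read the filesystem-encoding suggestion above is sketched here. The helper name display_filename is hypothetical and not an existing os function; the sketch relies only on os.fsencode() and sys.getfilesystemencoding(), and its result for names containing escaped bytes depends on the platform and locale.

```python
import os
import sys


def display_filename(name):
    """Return a version of name that is safe to print or encode strictly."""
    # os.fsencode() applies the filesystem encoding and its error handler,
    # which turns any smuggled surrogates back into the original bytes.
    raw = os.fsencode(name)
    # Decoding again with 'replace' turns undecodable bytes into U+FFFD.
    return raw.decode(sys.getfilesystemencoding(), 'replace')
```

Names that were valid in the filesystem encoding pass through unchanged, so the helper can be applied unconditionally before displaying a path.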

[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

2014-09-23 Thread R. David Murray
R. David Murray added the comment: And indeed my use case for this has instances of both cases: originally decoded using ASCII and the non-ascii bytes must end up as replaced characters, and originally decoded using utf-8. I'm also not sure that it is worth adding this. If you know what you a

[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

2014-09-23 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: On 23.09.2014 13:12, Nick Coghlan wrote:
> Nick Coghlan added the comment:
>
> Draft docstring for that version
>
> def convert_surrogates(data, errors='replace'):
>     """Convert escaped surrogates by applying a different error handler
>

[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

2014-09-23 Thread Antoine Pitrou
Antoine Pitrou added the comment: The encoding used impacts the result:

>>> s = 'abc\udcc3\udca9'
>>> s.encode('ascii', 'surrogateescape').decode('ascii', 'replace')
'abc��'
>>> s.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace')
'abcé'

The original string ('abc\udcc3\udca9') was ob

[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

2014-09-23 Thread Nick Coghlan
Nick Coghlan added the comment: Antoine: what would be the use case for using a different encoding for the temporary bytes object? It's discarded anyway, so the encoding used isn't externally visible.

[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

2014-09-23 Thread Nick Coghlan
Nick Coghlan added the comment: Draft docstring for that version

    def convert_surrogates(data, errors='replace'):
        """Convert escaped surrogates by applying a different error handler

        Uses the "replace" error handler by default, but any input error handler may be specif

[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

2014-09-23 Thread Antoine Pitrou
Antoine Pitrou added the comment: On 23/09/2014 12:57, Nick Coghlan wrote:
> The function definition again, this time with a draft docstring:
>
> def convert_surrogateescape(data, errors='replace'):
>     """Convert escaped raw bytes by applying a different error handler
>

[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

2014-09-23 Thread Nick Coghlan
Nick Coghlan added the comment: Note I would also be OK with "convert_surrogates", as that's the term that appears in the relevant error message:

>>> b'\xe9'.decode('ascii', 'surrogateescape').encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec c

[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

2014-09-23 Thread Nick Coghlan
Nick Coghlan added the comment: The function definition again, this time with a draft docstring:

    def convert_surrogateescape(data, errors='replace'):
        """Convert escaped raw bytes by applying a different error handler

        Uses the "replace" error handler by default, but any input

[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

2014-09-23 Thread Nick Coghlan
Nick Coghlan added the comment: The error handler is called "surrogateescape". That means "convert_surrogateescape" is always only a single step away from thinking "I want to remove the smuggled bytes from a surrogateescape'd string", without needing to assume any knowledge on the part of the

[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

2014-09-23 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: Don't like the function name :-) How about codecs.filter_non_utf8_data(), since that's closer to what the function is really doing and doesn't require knowledge about what "surrogateescape" is. -- nosy: +lemburg

[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

2014-09-23 Thread Nick Coghlan
Nick Coghlan added the comment: Updated issue title to reflect current proposal. -- title: Add tools for "cleaning" surrogate escaped strings -> Add codecs.convert_surrogateescape to "clean" surrogate escaped strings