Nick Coghlan added the comment:

My main use case is for passing data to other applications that *don't* have 
their Unicode handling in order - I want to be able to use Python to do the 
data scrubbing, but at the moment doing so requires intimate knowledge of the 
codec error handling system. (I had never even heard of surrogatepass until 
this evening.)

Situation:

What I have: data decoded with surrogateescape
What I want: that same data with all the surrogates gone, replaced with either 
the Unicode replacement character or an ASCII question mark (which one I want 
depends on the exact situation)

Assume I am largely clueless about the codec system. I know nothing beyond the 
fact that Python 3 strings may have smuggled bytes in them and I want to get 
rid of them because they confuse the application I'm passing them to.

The concrete example that got me thinking about this again was the task of 
writing filenames into a UTF-8 encoded email, and wanting to scrub the output 
from os.listdir before writing the list into the email (s/email/web page/ also 
works).
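
To illustrate that situation, here is a minimal sketch of scrubbing a directory 
listing before it is written into UTF-8 output. The helper names 
(scrub_surrogates, scrubbed_listing) are hypothetical and are not an existing 
or proposed API; the sketch just walks each string and swaps every surrogate 
code point for a replacement character.

```python
import os


def scrub_surrogates(text, replacement="\ufffd"):
    """Hypothetical helper: replace each surrogate code point in text.

    Surrogates occupy U+D800..U+DFFF; bytes smuggled in by surrogateescape
    land in the U+DC80..U+DCFF part of that range.
    """
    return "".join(
        replacement if 0xD800 <= ord(ch) <= 0xDFFF else ch
        for ch in text
    )


def scrubbed_listing(path=".", replacement="\ufffd"):
    """Hypothetical helper: os.listdir() with surrogates scrubbed out.

    On POSIX systems os.listdir(str) decodes undecodable filename bytes
    with surrogateescape, so names may contain lone surrogates.
    """
    return [scrub_surrogates(name, replacement) for name in os.listdir(path)]


# A name as it might come back for a Latin-1 filename on a UTF-8 system.
name = b"caf\xe9.txt".decode("utf-8", "surrogateescape")
print(ascii(scrub_surrogates(name)))   # U+FFFD variant
print(scrub_surrogates(name, "?"))     # ASCII question mark variant
```

With no surrogates left, the scrubbed names can be encoded as UTF-8 without 
any special error handler.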

For issue #22016 I actually suggested doing this as *another* codec error 
handler ("surrogatereplace"), but Stephen Turnbull convinced me this original 
idea was better: it should just be a pure data transformation pass on the 
string, clearing the surrogates out, and leaving me with data that is identical 
to what I would have had if "surrogatereplace" had been used instead of 
"surrogateescape" in the first place.

As "errors='replace'" already covers the "ASCII ?" replacement case, that means 
your proposed "redecode" based solution would cover the rest of my use case.
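
For comparison, both replacement styles can be approximated today with the 
existing error handlers, assuming the data was originally decoded as UTF-8 
with surrogateescape. The function names below are hypothetical and are not 
the API proposed in this issue; they only show the encode/decode round trips 
involved.

```python
def redecode_with_replacement(text, encoding="utf-8"):
    """Hypothetical helper: recover the original bytes, then decode again
    with errors='replace' so invalid bytes become U+FFFD."""
    raw = text.encode(encoding, "surrogateescape")
    return raw.decode(encoding, "replace")


def surrogates_to_question_marks(text, encoding="utf-8"):
    """Hypothetical helper: encoding with errors='replace' turns each
    unencodable surrogate into an ASCII '?'."""
    return text.encode(encoding, "replace").decode(encoding)


# Example input: an invalid UTF-8 byte smuggled in as a surrogate.
name = b"caf\xe9.txt".decode("utf-8", "surrogateescape")
print(ascii(redecode_with_replacement(name)))
print(surrogates_to_question_marks(name))
```

Both round trips require knowing the original encoding, which is part of the 
codec knowledge this proposal is trying to make unnecessary.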

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue18814>
_______________________________________