Kang-Hao (Kenny) Lu <kennyl...@csail.mit.edu> added the comment:

> The followings are on my TODO list, although this patch doesn't depend
> on any of these and can be reviewed and landed separately:
>  * make the surrogatepass error handler work for utf-16 and utf-32. (I
>    should be able to finish this by today)

Unfortunately this took longer than I expected, but here is the patch.

>>  * fix an error in the error handler for utf-16-le. (In, Python3.2 
>> b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00" 
>> instead of "A" for some reason)
>
> This should probably be done on a separate patch that will be applied
> to 3.2/3.3 (assuming that it can go to 3.2).  Rejecting surrogates will
> go in 3.3 only.  (Note that lot of Unicode-related code changed between
> 3.2 and 3.3.)

This turned out to be just a two-line fix, so I made it along the way. I can
create a separate patch with a separate test for 3.2 (certainly doable) and even
for 3.3, but since the test is now part of test_lone_surrogates, I am less
inclined to do that for 3.3.
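
For reference, the case quoted above can be checked with a short snippet. This
is only a sketch of the expected post-fix behaviour, assuming an interpreter
that includes the fix; on an unpatched 3.2 the result was reported as "\x00".

```python
# A lone low surrogate (U+DC80) followed by "A", both as UTF-16-BE code units.
data = b'\xdc\x80\x00\x41'

# With the 'ignore' handler only the invalid two-byte unit should be skipped,
# so the trailing "A" must survive.
decoded = data.decode('utf-16-be', 'ignore')
print(repr(decoded))
```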

You might notice the codec naming inconsistency (utf-16-be and utf16be for 
encoding and decoding respectively). I have filed issue #13913 for this.

Also, the strcmp calls are rather ugly. I am working on issue #13916 (disallow
the "surrogatepass" handler for non utf-* encodings). Once we have that, we
can examine individual characters instead.
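
To illustrate what restricting "surrogatepass" to utf-* codecs looks like from
the Python side, here is a small sketch. It assumes a recent CPython 3, where,
as far as I know, the handler re-raises the original error for encodings it
does not recognise; it is not meant to describe how #13916 itself was resolved.

```python
lone = '\udc80'  # a lone low surrogate

# utf-8 accepts the handler and emits the three-byte form of U+DC80.
utf8_bytes = lone.encode('utf-8', 'surrogatepass')

# A non utf-* codec such as ascii is expected to reject it.
try:
    lone.encode('ascii', 'surrogatepass')
    rejected = False
except UnicodeEncodeError:
    rejected = True

print(utf8_bytes, rejected)
```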

In this patch, the "encoding" attribute of UnicodeDecodeError now returns
utf16(be|le) for utf-16. This information is necessary for "surrogatepass" to
work, although admittedly it is rather ugly. Any better ideas? A new attribute
on Unicode(Decode|Encode)Error might help, but utf-16/32 are fairly uncommon
encodings anyway, and we should not add more overhead for, say, utf-8.
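
A rough sketch of what the patch is aiming for, seen from Python: a lone
surrogate should round-trip through utf-16 with "surrogatepass", and the
exception raised by a strict decode carries the "encoding" attribute that the
handler has to inspect. This assumes an interpreter where "surrogatepass"
already works for utf-16; the exact string stored in "encoding" may differ
between versions, so the snippet only prints it.

```python
lone = '\udc80'  # a lone low surrogate

# Round-trip through utf-16-be using the surrogatepass handler.
encoded = lone.encode('utf-16-be', 'surrogatepass')
roundtrip = encoded.decode('utf-16-be', 'surrogatepass')

# A strict decode rejects the lone surrogate; the exception exposes the
# codec name that an error handler would have to look at.
reported_encoding = None
try:
    encoded.decode('utf-16-be')
except UnicodeDecodeError as exc:
    reported_encoding = exc.encoding

print(encoded, repr(roundtrip), reported_encoding)
```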

>> Should we really reject lone surrogates for UTF-7?
>
> No, I meant only UTF-8/16/32; UTF-7 is fine as is.

Good to know.

----------
Added file: http://bugs.python.org/file24384/surrogatepass_for_utf-16&32.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12892>
_______________________________________