Kang-Hao (Kenny) Lu <kennyl...@csail.mit.edu> added the comment:

> The followings are on my TODO list, although this patch doesn't depend
> on any of these and can be reviewed and landed separately:
> * make the surrogatepass error handler work for utf-16 and utf-32. (I
> should be able to finish this by today)

Unfortunately this took longer than I thought, but here is the patch.

>> * fix an error in the error handler for utf-16-le. (In, Python3.2
>> b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00"
>> instead of "A" for some reason)
>
> This should probably be done on a separate patch that will be applied
> to 3.2/3.3 (assuming that it can go to 3.2). Rejecting surrogates will
> go in 3.3 only. (Note that lot of Unicode-related code changed between
> 3.2 and 3.3.)

This turned out to be just a two-line change, so I fixed it along the
way. I can create a separate patch with a separate test for 3.2
(certainly doable) and even for 3.3, but since the test is now part of
test_lone_surrogates, I am less inclined to do that for 3.3.

You might notice the codec naming inconsistency (utf-16-be and utf16be
for encoding and decoding respectively). I have filed issue #13913 for
this. Also, the strcmp calls are quite crude. I am working on issue
#13916 (disallow the "surrogatepass" handler for non utf-* encodings).
Once we have that, we can examine individual characters instead...

In this patch, the "encoding" attribute of UnicodeDecodeError is changed
to return utf16(be|le) for utf-16. This information is necessary for
"surrogatepass" to work, although admittedly it is rather ugly. Any
better ideas? A new attribute on Unicode(Decode|Encode)Error might help,
but utf-16/32 are fairly uncommon encodings anyway and we should not add
more burden for, say, utf-8.

>> Should we really reject lone surrogates for UTF-7?
>
> No, I meant only UTF-8/16/32; UTF-7 is fine as is.

Good to know.

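For reviewers who want to try the behaviour discussed above, here is a minimal sketch. It assumes an interpreter in which the utf-16/utf-32 codecs already reject lone surrogates and accept "surrogatepass" (Python 3.4 or later, going by the codecs documentation); it will not behave this way on 3.2. The helper names roundtrip and strict_encode_fails are made up for illustration and are not part of the patch.

```python
# Illustrative only: assumes Python 3.4+, where utf-16/utf-32 reject lone
# surrogates by default and support the "surrogatepass" error handler.
LONE = "\udc80"  # a lone low surrogate


def roundtrip(codec):
    """Encode and decode LONE using the 'surrogatepass' error handler."""
    raw = LONE.encode(codec, "surrogatepass")
    return raw, raw.decode(codec, "surrogatepass")


def strict_encode_fails(codec):
    """Return True if the default (strict) handler rejects LONE."""
    try:
        LONE.encode(codec)
    except UnicodeEncodeError:
        return True
    return False


for codec in ("utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    raw, text = roundtrip(codec)
    print(codec, strict_encode_fails(codec), raw, text == LONE)

# The 'ignore' case from the quoted TODO item: the lone surrogate is
# skipped and decoding continues with the following code unit.
print(ascii(b"\xdc\x80\x00\x41".decode("utf-16-be", "ignore")))
```

The round trip goes through the public str.encode/bytes.decode API only, so it does not depend on how the handler identifies the codec internally (the "encoding" attribute question raised above).
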
----------
Added file: http://bugs.python.org/file24384/surrogatepass_for_utf-16&32.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12892>
_______________________________________