[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2015-08-11 Thread tmp12342
tmp12342 added the comment: Serhiy, I understand the first reason, but https://docs.python.org/3/library/codecs.html says > applicable to text encodings: > [...] > This code will then be turned back into the same byte when the > 'surrogateescape' error handler is used when encoding the data. Sh

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2015-08-10 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: There are two reasons: 1. UTF-16 and UTF-32 are based on 2- and 4-byte units. If the surrogateescape error handler supported UTF-16 and UTF-32, encoding could produce data that can't be decoded back correctly. For example '\udcac \udcac' -> b'\xac\x2
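The round-trip problem described above can be sketched in Python. The helper `hypothetical_escape_encode_utf16le` is an illustration only and is not part of CPython: it assumes a surrogateescape-style encoder that emits each escaped surrogate (U+DC80..U+DCFF) as a single byte between ordinary 2-byte UTF-16-LE units.

```python
def hypothetical_escape_encode_utf16le(text):
    """Hypothetical encoder (not a CPython API): escaped surrogates become
    single bytes, everything else is encoded as normal UTF-16-LE."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append(cp - 0xDC00)        # one byte, breaks 2-byte alignment
        else:
            out += ch.encode('utf-16-le')  # regular 2-byte (or 4-byte) unit
    return bytes(out)

original = '\udcac \udcac'
data = hypothetical_escape_encode_utf16le(original)

# The four bytes form two valid UTF-16-LE code units, so decoding raises
# no error and the escaped bytes are silently reinterpreted.
decoded = data.decode('utf-16-le', 'surrogateescape')

print(ascii(original), '->', data, '->', ascii(decoded))
```

Under these assumptions the decoder never sees an error, so the surrogateescape handler has no chance to restore the original string.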

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2015-08-10 Thread Martijn Pieters
Martijn Pieters added the comment: I don't understand why encoding with `surrogateescape` still isn't supported; is it the fact that a surrogate would have to produce *single bytes* rather than double? E.g. b'\x80' -> '\udc80' -> b'\x80' doesn't work because that would mean the UTF-16 and UTF-
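A small sketch of the round trip in question, assuming a CPython release that includes the fix for this issue (3.4 or later). With UTF-8 the escaped byte round-trips; with UTF-16-LE the single replacement byte cannot form a 2-byte code unit, and to my knowledge the encoder raises instead.

```python
raw = b'\x80'

# UTF-8: the undecodable byte is smuggled through as U+DC80 and restored.
text = raw.decode('utf-8', 'surrogateescape')
back = text.encode('utf-8', 'surrogateescape')

# UTF-16-LE: the handler would have to supply a single byte for a codec
# that works in 2-byte units.
try:
    text.encode('utf-16-le', 'surrogateescape')
    utf16_escape_failed = False
except UnicodeEncodeError:
    utf16_escape_failed = True

print(ascii(text), back, utf16_escape_failed)
```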

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-11-21 Thread STINNER Victor
STINNER Victor added the comment: Thanks Ezio and Serhiy for fixing the UTF-16 and UTF-32 codecs!

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-11-19 Thread Roundup Robot
Roundup Robot added the comment: New changeset 130597102dac by Serhiy Storchaka in branch 'default': Remove dead code committed in issue #12892. http://hg.python.org/cpython/rev/130597102dac

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-11-19 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Ezio has approved the patch and I have committed it. Thank you Victor and Kang-Hao for your patches. Thanks all for the reviews. -- resolution: -> fixed stage: patch review -> committed/rejected status: open -> closed

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-11-19 Thread Roundup Robot
Roundup Robot added the comment: New changeset 0d9624f2ff43 by Serhiy Storchaka in branch 'default': Issue #12892: The utf-16* and utf-32* codecs now reject (lone) surrogates. http://hg.python.org/cpython/rev/0d9624f2ff43 -- nosy: +python-dev
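The effect of this changeset can be checked with a short script. This is a sketch assuming Python 3.4 or later; the helper name `rejects_lone_surrogate` is made up for the illustration.

```python
def rejects_lone_surrogate(encoding):
    """Return True if `encoding` refuses to encode a lone surrogate
    with the default (strict) error handler."""
    try:
        '\ud800'.encode(encoding)
    except UnicodeEncodeError:
        return True
    return False

results = {name: rejects_lone_surrogate(name)
           for name in ('utf-16', 'utf-16-le', 'utf-16-be',
                        'utf-32', 'utf-32-le', 'utf-32-be')}

# A surrogate code point stored as a UTF-32 unit is rejected on decoding too.
try:
    b'\x00\xd8\x00\x00'.decode('utf-32-le')
    utf32_decode_rejected = False
except UnicodeDecodeError:
    utf32_decode_rejected = True

print(results, utf32_decode_rejected)
```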

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-11-18 Thread Serhiy Storchaka
Changes by Serhiy Storchaka: -- assignee: ezio.melotti -> serhiy.storchaka

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-18 Thread Serhiy Storchaka
Changes by Serhiy Storchaka: Added file: http://bugs.python.org/file32202/utf_16_32_surrogates_6.patch

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-18 Thread Serhiy Storchaka
Changes by Serhiy Storchaka: Removed file: http://bugs.python.org/file32201/utf_16_32_surrogates_6.patch

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-18 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Changed the documentation as discussed with Ezio on IRC. Ezio, do you want to commit this patch? Feel free to reword the documentation if you feel it can be better. -- Added file: http://bugs.python.org/file32201/utf_16_32_surrogates_6.patch

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-11 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Updated patch addresses Victor's comments on Rietveld. Thank you Victor. The "surrogatepass" error handler now works with different spellings of encodings ("utf_32le", "UTF-32-LE", etc). > I tested utf_16_32_surrogates_4.patch: surrogateescape with as encode
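As an illustration of the spelling point, the sketch below passes a lone surrogate through "surrogatepass" using several spellings of the same codec. It assumes Python 3.4 or later and that all four names resolve to the UTF-32-LE codec.

```python
lone = '\udc80'
spellings = ['utf-32-le', 'UTF-32-LE', 'utf_32_le', 'utf_32le']

# Encode the lone surrogate with each spelling, then decode it back.
encoded = [lone.encode(name, 'surrogatepass') for name in spellings]
roundtrip = [data.decode(name, 'surrogatepass')
             for name, data in zip(spellings, encoded)]

for name, data in zip(spellings, encoded):
    print(name, data)
```

Each spelling should yield the same 4-byte little-endian unit and decode back to the original string.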

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-10 Thread STINNER Victor
STINNER Victor added the comment: > Could you please review this not so simple patch instead? I did a first review of your code on rietveld.

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-10 Thread STINNER Victor
STINNER Victor added the comment: I tested utf_16_32_surrogates_4.patch: surrogateescape with as encoder does not work as expected. >>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'ignore') '[]' >>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'replace') '[�]' >>> b'[\x00\x80\xdc]\x00'.decode('utf-1
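For comparison, this is how the same input decodes on a release that contains the final fix. It is a sketch assuming Python 3.4 or later and covers only the 'ignore', 'replace', 'surrogatepass' and strict handlers, not the behaviour of the patch version tested above.

```python
# '[', the lone low surrogate U+DC80, ']' as UTF-16-LE code units
data = b'[\x00\x80\xdc]\x00'

ignored = data.decode('utf-16-le', 'ignore')         # surrogate dropped
replaced = data.decode('utf-16-le', 'replace')       # surrogate -> U+FFFD
passed = data.decode('utf-16-le', 'surrogatepass')   # surrogate kept as-is

try:
    data.decode('utf-16-le')                         # strict mode
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(ascii(ignored), ascii(replaced), ascii(passed), strict_failed)
```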

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Martin v . Löwis
Martin v. Löwis added the comment: Marc-Andre: please don't confuse "use in major operating systems" with "major use in operating systems". I agree with Antoine that UTF-16 isn't widely used on Windows, despite notepad and Office supporting it. Most users on Windows using notepad continue to

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Here is my idea: http://permalink.gmane.org/gmane.comp.python.ideas/23521. I see that the discussion about how fast the UTF-16 codec should be is already larger than the discussion about the patches. Could you please review this not so simple patch instead? Yet one help whi

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread STINNER Victor
STINNER Victor added the comment: I don't think that performance on a microbenchmark is the right question. The right question is: does Python conform to Unicode? The answer is simple and explicit: no. Encoding lone surrogates may lead to bugs and even security vulnerabilities. Please open a new

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: On 08.10.2013 12:30, Antoine Pitrou wrote: > >> UTF-8, UTF-16 and UTF-32 codecs need to be as fast as possible >> in Python to not create performance problems when converting >> between platform Unicode data and the internal formats >> used in Python. > > "

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: On 08.10.2013 11:42, Serhiy Storchaka wrote: > > The UTF-16 codec is still fast enough. Let's first make it correct and then try to > optimize it. I have an idea how to restore 3.3 performance (if it is worth it; the > code is already complicated enough). That's a good plan

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Antoine Pitrou
Antoine Pitrou added the comment: > UTF-8, UTF-16 and UTF-32 codecs need to be as fast as possible > in Python to not create performance problems when converting > between platform Unicode data and the internal formats > used in Python. "As fast as possible" is a platonic dream. They only need t

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: I repeat myself. Even with the patch, the UTF-16 codec is faster than the UTF-8 codec (except for ASCII-only data). This is the fastest Unicode codec in Python (perhaps UTF-32 can be made faster, but this is another issue). > The real question is: Can the UTF-16/32 codecs b

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: On 08.10.2013 11:33, Antoine Pitrou wrote: > > Antoine Pitrou added the comment: > >> MS Notepad and MS Office save Unicode text files in UTF-16-LE, >> unless you explicitly specify UTF-8, just like many other Windows >> applications that support Unicode te

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: The UTF-16 codec is still fast enough. Let's first make it correct and then try to optimize it. I have an idea how to restore 3.3 performance (if it is worth it; the code is already complicated enough). Conversion to/from wchar_t* uses different code.

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Antoine Pitrou
Antoine Pitrou added the comment: > MS Notepad and MS Office save Unicode text files in UTF-16-LE, > unless you explicitly specify UTF-8, just like many other Windows > applications that support Unicode text files: I'd be curious to know if people actually edit *text files* using Microsoft Word

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: On 08.10.2013 11:03, Antoine Pitrou wrote: > >>> utf-16 isn't that widely used, so it's probably fine if it becomes >>> a bit slower. >> >> It's the default encoding for Unicode text files and APIs on Windows, >> so I'd say it *is* widely used :-) > > I've

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Antoine Pitrou
Antoine Pitrou added the comment: > On 08.10.2013 10:46, Antoine Pitrou wrote: > > > > utf-16 isn't that widely used, so it's probably fine if it becomes > > a bit slower. > > It's the default encoding for Unicode text files and APIs on Windows, > so I'd say it *is* widely used :-) I've never

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: On 08.10.2013 10:46, Antoine Pitrou wrote: > > utf-16 isn't that widely used, so it's probably fine if it becomes a bit > slower. It's the default encoding for Unicode text files and APIs on Windows, so I'd say it *is* widely used :-) http://en.wikipedia.

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Antoine Pitrou
Antoine Pitrou added the comment: utf-16 isn't that widely used, so it's probably fine if it becomes a bit slower. -- nosy: +pitrou

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-07 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Updated whatsnew and Misc/ files. -- Added file: http://bugs.python.org/file31984/utf_16_32_surrogates_4.patch

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-01 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Could you please review this, Ezio?

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-09-02 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: No, it isn't faster. I tested this variant, it is 1.5x slower. And simple range checking actually is slower.

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-09-02 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: On 02.09.2013 18:56, Serhiy Storchaka wrote: > > Oh, I was blind. Thank you Marc-Andre. Here is the corrected patch. > Unfortunately it is 1.4-1.5 times slower on UTF-16 encoding of UCS2 strings than the > previous wrong patch. I think it would be faster to do this in

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-09-02 Thread Serhiy Storchaka
Changes by Serhiy Storchaka: Removed file: http://bugs.python.org/file31555/utf_16_32_surrogates_2.patch

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-09-02 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Oh, I was blind. Thank you Marc-Andre. Here is the corrected patch. Unfortunately it is 1.4-1.5 times slower on UTF-16 encoding of UCS2 strings than the previous wrong patch. -- Added file: http://bugs.python.org/file31557/utf_16_32_surrogates_3.patch

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-09-02 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: You should be able to squeeze out some extra cycles by avoiding the bit calculations using a simple range check for ch >= 0xd800: +# if STRINGLIB_MAX_CHAR >= 0xd800 +if (((ch1 ^ 0xd800) & + (ch1 ^ 0xd800) & + (ch1

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-09-02 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Here is a patch which combines both of Kang-Hao's patches, synchronized with tip, fixed and optimized. Unfortunately, even optimized, this patch slows down encoding/decoding of some data. Here are some benchmark results (benchmarking tools are here: https://bitbucket

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2012-11-04 Thread Serhiy Storchaka
Changes by Serhiy Storchaka: -- stage: test needed -> patch review versions: +Python 3.4 -Python 3.3

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2012-04-24 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: > * fix an error in the error handler for utf-16-le. (In, Python3.2 > b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00" instead of > "A" for some reason) The patch for issue14579 fixes this in Python 3.2. The patch for issue14624 fixes this
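The quoted case is easy to re-check on a fixed interpreter. A minimal sketch, assuming a Python version that includes the fixes mentioned above:

```python
# 0xDC80 (a lone low surrogate) followed by 0x0041 ('A'), both big-endian
data = b'\xdc\x80\x00\x41'

# 'ignore' should skip only the invalid code unit and keep the 'A'.
result = data.decode('utf-16-be', 'ignore')
print(ascii(result))
```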

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2012-01-31 Thread Kang-Hao (Kenny) Lu
Kang-Hao (Kenny) Lu added the comment: > The followings are on my TODO list, although this patch doesn't depend > on any of these and can be reviewed and landed separately: > * make the surrogatepass error handler work for utf-16 and utf-32. (I >should be able to finish this by today) Unfo

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2012-01-30 Thread Ezio Melotti
Ezio Melotti added the comment: Thanks for the patch! > * fix an error in the error handler for utf-16-le. (In, Python3.2 > b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00" > instead of "A" for some reason) This should probably be done on a separate patch that will be applie

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2012-01-29 Thread Kang-Hao (Kenny) Lu
Kang-Hao (Kenny) Lu added the comment: Attached patch does the following beyond what the patch from haypo does: * call the error handler * reject 0xd800~0xdfff when decoding utf-32 The following are on my TODO list, although this patch doesn't depend on any of these and can be reviewed an

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2011-11-29 Thread STINNER Victor
STINNER Victor added the comment: Hum, my patch doesn't call the error handler.

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2011-11-29 Thread STINNER Victor
STINNER Victor added the comment: Patch rejecting surrogates in UTF-16 and UTF-32 encoders. I don't think that Python 2.7 and 3.2 should be changed in a minor version. -- dependencies: -Refactor code using unicode_encode_call_errorhandler() in unicodeobject.c keywords: +patch Added f

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2011-11-29 Thread STINNER Victor
STINNER Victor added the comment: Python 3.3 has a strange behaviour: >>> '\uDBFF\uDFFF'.encode('utf-16-le').decode('utf-16-le') '\U0010ffff' >>> '\U0010ffff'.encode('utf-16-le').decode('utf-16-le') '\U0010ffff' I would expect text.decode(encoding).encode(encoding)==text or an encode or decod
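On a release with the fix (a sketch assuming Python 3.4 or later), the two-surrogate string is rejected in strict mode, while the real astral character still round-trips. The merge of the pair into one character only happens when "surrogatepass" is requested explicitly.

```python
pair = '\udbff\udfff'     # two separate surrogate code points
astral = '\U0010ffff'     # one code point outside the BMP

# Strict encoding of the surrogate string now fails.
try:
    pair.encode('utf-16-le')
    strict_rejected = False
except UnicodeEncodeError:
    strict_rejected = True

# The genuine astral character round-trips unchanged.
astral_roundtrip = astral.encode('utf-16-le').decode('utf-16-le')

# Opting in with surrogatepass yields the same bytes as the astral
# character, so decoding them produces a single code point.
raw = pair.encode('utf-16-le', 'surrogatepass')
merged = raw.decode('utf-16-le')

print(strict_rejected, raw, ascii(merged))
```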

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2011-10-25 Thread Ezio Melotti
Changes by Ezio Melotti: -- dependencies: +Refactor code using unicode_encode_call_errorhandler() in unicodeobject.c

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2011-09-04 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: Ezio Melotti wrote: > > New submission from Ezio Melotti: > > From Chapter 03 of the Unicode Standard 6[0], D91: > """ > • UTF-16 encoding form: The Unicode encoding form that assigns each Unicode > scalar value in the ranges U+0000..U+D7FF and U+E000..U

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2011-09-04 Thread Ezio Melotti
New submission from Ezio Melotti: From Chapter 03 of the Unicode Standard 6[0], D91: """ • UTF-16 encoding form: The Unicode encoding form that assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the same numeric value a
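The encoding form in D91 can be sketched as a small Python function. The name `utf16_code_units` is hypothetical and this is not how CPython implements the codec. It only restates the definition: scalar values in the BMP map to one code unit, supplementary values map to a surrogate pair, and surrogate code points are not scalar values at all.

```python
def utf16_code_units(cp):
    """Return the UTF-16 code units for the Unicode scalar value `cp`."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError('not a Unicode code point: %#x' % cp)
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogates are code points but not scalar values, so the
        # encoding form assigns them no code unit sequence.
        raise ValueError('surrogate code point: %#x' % cp)
    if cp <= 0xFFFF:
        return [cp]                          # single 16-bit code unit
    offset = cp - 0x10000                    # 20-bit value
    return [0xD800 + (offset >> 10),         # high (lead) surrogate
            0xDC00 + (offset & 0x3FF)]       # low (trail) surrogate

print([hex(u) for u in utf16_code_units(0x41)])
print([hex(u) for u in utf16_code_units(0x10FFFF)])
```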