On Sun, 22 Jan 2017 07:34 pm, Marko Rauhamaa wrote:

> Steve D'Aprano <steve+pyt...@pearwood.info>:
>
>> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
>>> Also, [surrogates] don't exist as Unicode code points. Python
>>> shouldn't allow surrogate characters in strings.
>>
>> Not quite. This is where it gets a bit messy and confusing. The bottom
>> line is: surrogates *are* code points, but they aren't *characters*.
>
> All animals are equal, but some animals are more equal than others.

Huh?

>> Strings which contain surrogates are strictly speaking illegal,
>> although some programming languages (including Python) allow them.
>
> Python shouldn't allow them.

That's one opinion.

>> The Unicode standard defines surrogates as follows:
>> [...]
>>
>> - Surrogate Code Point. A Unicode code point in the range
>>   U+D800..U+DFFF. Reserved for use by UTF-16,
>
> The writer of the standard is playing word games, maybe to offer a fig
> leaf to Windows, Java et al.

Seriously?

>> By the letter of the Unicode standard, [Python] should not do this,
>> but nevertheless it does and it appears to do no real harm and have
>> some benefit.
>
> I'm afraid Python's choice may lead to exploitable security holes in
> Python programs.

Feel free to back that up with an actual demonstration of an exploit,
rather than just FUD.

>>>> py> low = '\uDC37'
>>>
>>> That should raise a SyntaxError exception.
>>
>> If Python was strictly conforming, that is correct, but it turns out
>> there are some useful things you can do with strings if you allow
>> surrogates.
>
> Conceptual confusion is a high price to pay for such tricks.

There's a lot to comprehend about Unicode. I don't see that Python's
non-strict implementation is harder to understand than the strict version.


--
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

--
https://mail.python.org/mailman/listinfo/python-list
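
For readers who want to try the behaviour being argued about, here is a
minimal sketch, assuming CPython 3.x. It shows that a lone surrogate is
accepted in a str, that the default strict UTF-8 encoder refuses it, and
one thing lone surrogates are used for: the "surrogateescape" error
handler (PEP 383). Picking surrogateescape as the illustration is an
assumption and not necessarily the use meant in the thread, and
encodes_strictly is a hypothetical helper name introduced only here.

```python
# A lone low surrogate is accepted in a str literal (the non-strict behaviour).
low = '\udc37'
assert len(low) == 1
assert 0xD800 <= ord(low) <= 0xDFFF  # inside the surrogate code point range


def encodes_strictly(s):
    """Return True if s encodes to UTF-8 with the default 'strict' handler.

    Hypothetical helper, used only for this illustration.
    """
    try:
        s.encode('utf-8')
    except UnicodeEncodeError:
        return False
    return True


# surrogateescape: a byte that is not valid UTF-8 is carried through the
# str as a lone low surrogate, and restored unchanged when encoding back.
raw = b'caf\xe9'  # Latin-1 bytes, not valid UTF-8
text = raw.decode('utf-8', errors='surrogateescape')
roundtrip = text.encode('utf-8', errors='surrogateescape')
```

So the string can exist in memory, but it cannot leave the program as
strict UTF-8 unless the caller explicitly opts in with an error handler.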