Steve D'Aprano <steve+pyt...@pearwood.info>:

> On Sun, 22 Jan 2017 07:34 pm, Marko Rauhamaa wrote:
>
>> Steve D'Aprano <steve+pyt...@pearwood.info>:
>>
>>> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
>>>> Also, [surrogates] don't exist as Unicode code points. Python
>>>> shouldn't allow surrogate characters in strings.
>>>
>>> Not quite. This is where it gets a bit messy and confusing. The
>>> bottom line is: surrogates *are* code points, but they aren't
>>> *characters*.
>>
>> All animals are equal, but some animals are more equal than others.
>
> Huh?

There is no difference between 0xD800 and 0xD8000000. They are both
numbers that don't--and won't--represent anything in Unicode. It's
pointless to call one a "code point" and not the other one. A code
point that isn't code for anything can barely be called a code point.

I'm guessing 0xD800 is called a code point because it was always called
that. It was dropped out when UTF-16 was invented but they didn't want
to "demote" the number retroactively, especially since Windows and Java
already were allowing them in strings.

>>> By the letter of the Unicode standard, [Python] should not do this,
>>> but nevertheless it does and it appears to do no real harm and have
>>> some benefit.
>>
>> I'm afraid Python's choice may lead to exploitable security holes in
>> Python programs.
>
> Feel free to back up that with an actual demonstration of an exploit,
> rather than just FUD.

It might come as a surprise to programmers that pathnames cannot be
UTF-encoded or displayed. Also, those situations might not show up
during testing but only with appropriately crafted input.


Marko
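
For illustration, here is a minimal sketch of the pathname situation
described above. It is not an exploit, only a demonstration that a str
carrying a lone surrogate cannot be strictly UTF-8 encoded. The
filename bytes are made up for the example, and the decode call assumes
a POSIX system whose filesystem encoding is UTF-8, where Python applies
the "surrogateescape" error handler (PEP 383) to filenames.

```python
# Hypothetical filename bytes; 0xFF can never appear in valid UTF-8.
raw_name = b"report-\xff.txt"

# Assumption: mimic how Python decodes POSIX filenames under a UTF-8
# filesystem encoding. Undecodable bytes become lone surrogates.
name = raw_name.decode("utf-8", errors="surrogateescape")

# The resulting str contains a code point in the surrogate range.
has_surrogate = any(0xD800 <= ord(ch) <= 0xDFFF for ch in name)

# A strict UTF-8 encode of that str is refused by the codec.
try:
    name.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# Only the same error handler recovers the original bytes.
round_trip = name.encode("utf-8", errors="surrogateescape")

print(has_surrogate, strict_encode_ok, round_trip == raw_name)
```

Code that logs, prints, or serializes such a name with a strict
encoder raises UnicodeEncodeError, which ordinary test data with
well-formed names would not trigger.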