On Mon, 23 Jan 2017 02:19 am, Marko Rauhamaa wrote:

> Steve D'Aprano <steve+pyt...@pearwood.info>:
>
>> On Sun, 22 Jan 2017 07:34 pm, Marko Rauhamaa wrote:
>>
>>> Steve D'Aprano <steve+pyt...@pearwood.info>:
>>>
>>>> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
>>>>> Also, [surrogates] don't exist as Unicode code points. Python
>>>>> shouldn't allow surrogate characters in strings.
>>>>
>>>> Not quite. This is where it gets a bit messy and confusing. The
>>>> bottom line is: surrogates *are* code points, but they aren't
>>>> *characters*.
>>>
>>> All animals are equal, but some animals are more equal than others.
>>
>> Huh?
>
> There is no difference between 0xD800 and 0xD8000000.
Arithmetic disagrees:

py> 0xD800 == 0xD8000000
False

> They are both
> numbers that don't--and won't--represent anything in Unicode.

Your use of hex notation 0x... indicates that you're talking about code
units rather than U+... code points.

The first one, 0xD800, could be:

- a Little Endian double-byte code unit for 'Ø' in either UCS-2 or UTF-16;

- a Big Endian double-byte code unit that has no special meaning in UCS-2;

- one half of a surrogate pair (two double-byte code units) in Big Endian
  UTF-16, encoding some unknown supplementary code point.

The second one, 0xD8000000, could be:

- a C long (four-byte int) 3623878656, which is out of range for Big
  Endian UCS-4 or UTF-32;

- the Little Endian four-byte code unit for 'Ø' in either UCS-4 or UTF-32.

> It's pointless to call one a "code point" and not the other one.

Neither of them is a code point. You're confusing the concrete
representation with the abstract character.

Perhaps you meant to compare the code point U+D800 to, well, there's no
comparison to be made, because "U+D8000000" is not valid and is completely
out of range. The largest code point is U+10FFFF.

> A code point
> that isn't code for anything can barely be called a code point.

It does have a purpose. Or even more than one.

- It ensures that there is a one-to-one mapping between code points and
  code units in any specific encoding and byte-order.

- By reserving those code points, it ensures that they cannot be
  accidentally used by the standard for something else.

- It makes it easier to talk about the entities: "U+D800 is a surrogate
  code point reserved for UTF-16 surrogates", as opposed to "U+D800 isn't
  anything, but if it was something, it would be a code point reserved for
  UTF-16 surrogates".

- Or worse, forcing us to talk in terms of code units (implementation)
  instead of abstract characters, which is painfully verbose: "0xD800 in
  Big Endian UTF-16, or 0x00D8 in Little Endian UTF-16, or 0x0000D800 in
  Big Endian UTF-32, or 0x00D80000 in Little Endian UTF-32, doesn't map to
  any code point but is reserved for UTF-16 surrogate pairs."

And, an entirely unforeseen purpose:

- It allows languages like Python to (ab)use surrogate code points for
  round-tripping file names which aren't valid Unicode.

[...]

>>> I'm afraid Python's choice may lead to exploitable security holes in
>>> Python programs.
>>
>> Feel free to back up that with an actual demonstration of an exploit,
>> rather than just FUD.
>
> It might come as a surprise to programmers that pathnames cannot be
> UTF-encoded or displayed.

Many things come as surprises to programmers, and many pathnames cannot be
UTF-encoded.

To be precise, Mac OS requires pathnames to be both valid and normalised
UTF-8, and it would be nice if that practice spread. But Windows only
requires pathnames to consist of UCS-2 code points, and Linux pathnames
are arbitrary bytes that may include characters which are illegal on
Windows. So you don't need to involve surrogates to have undecodable
pathnames.

> Also, those situations might not show up
> during testing but only with appropriately crafted input.

I'm not seeing a security exploit here.


-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list
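
The code-unit readings of 0xD800 above are easy to check at the
interactive prompt, since bytes.decode() lets you read the same byte
sequence under different encodings and byte orders. A quick sketch,
assuming a recent CPython 3.x, with tracebacks trimmed:

py> b'\xd8\x00'.decode('utf-16-le')   # LE double-byte code unit: U+00D8
'Ø'
py> b'\xd8\x00\x00\x00'.decode('utf-32-le')   # LE four-byte unit: U+00D8
'Ø'
py> b'\xd8\x00'.decode('utf-16-be')   # BE: a lone high surrogate
Traceback (most recent call last):
  ...
UnicodeDecodeError: ...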
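
Likewise, the difference between "a code point" and "a character" shows up
directly in chr() and str.encode(). A minimal sketch, again assuming a
recent CPython 3.x (exact messages may vary slightly between versions):

py> chr(0x10FFFF)       # the largest code point is fine
'\U0010ffff'
py> chr(0x110000)       # beyond the largest code point is not
Traceback (most recent call last):
  ...
ValueError: chr() arg not in range(0x110000)
py> s = '\ud800'        # a surrogate code point is allowed in a str...
py> s.encode('utf-8')   # ...but it cannot be encoded as a character
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed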
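
The round-tripping of not-quite-Unicode file names mentioned above is the
'surrogateescape' error handler (PEP 383), which os.fsdecode and
os.fsencode rely on for POSIX pathnames. A small sketch, with a made-up
byte string standing in for a file name:

py> name = b'caf\xe9'                       # Latin-1 bytes, not valid UTF-8
py> s = name.decode('utf-8', 'surrogateescape')
py> s                                       # the bad byte becomes a lone surrogate
'caf\udce9'
py> s.encode('utf-8', 'surrogateescape')    # ...and round-trips back exactly
b'caf\xe9'
py> s.encode('utf-8')                       # but it still isn't encodable Unicode
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in
position 3: surrogates not allowed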