On 16/09/2005, at 5:12 AM, Joseph S. Myers wrote:
> On Fri, 16 Sep 2005, Geoffrey Keating wrote:
>
>> What this means in practice, I think, is that the structure that represents a token, 'struct cpp_token', will grow from 16 bytes to 20 bytes, which makes it 2 cache lines rather than 1, with a consequent increase in memory use and decrease in compiler performance. It might be that someone will think of some clever way to avoid this, but I couldn't think of any that would be likely to be a win overall, since a significant proportion of tokens are identifiers. (I especially didn't like the alternative that required a second hash lookup for every identifier.)
>
> There are plenty of spare bits in cpp_token to flag extended identifiers and handle them specially (as a slow path, marked as such with __builtin_expect). There's one bit in the flags byte, two unused bytes after it, and a whole word not used in the case of identifiers (identifiers use a cpp_hashnode * where strings and numbers use a struct cpp_string, which is bigger) which could store a canonical form of an identifier (or could store the noncanonical spelling for the use of the specific places which care about the original spelling).
Yes, I think this can be made to work efficiently.
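To make the spare-bits idea concrete, here is a minimal sketch of the approach being agreed to. The structure layout, the NODE_EXTENDED bit, and the function names are illustrative assumptions, not libcpp's actual definitions; the point is that one spare flag bit plus __builtin_expect keeps ordinary identifiers on the fast path:

```c
#include <assert.h>

/* Hypothetical spare bit in the flags byte marking an extended
   identifier (assumed value, not GCC's actual flag). */
#define NODE_EXTENDED 0x40

/* Illustrative token layout: a flags byte with a spare bit, the two
   unused padding bytes mentioned above, and a value word that is a
   pointer for identifiers but a larger struct for strings/numbers. */
struct cpp_token_sketch {
    unsigned char type;
    unsigned char flags;     /* one spare bit flags extended identifiers */
    unsigned char pad[2];    /* the two unused bytes after the flags byte */
    union {
        void *node;          /* identifiers: pointer into the hash table */
        struct { const unsigned char *text; unsigned int len; } str;
    } val;
};

/* The rarely-taken branch is marked unlikely with __builtin_expect,
   so plain identifiers pay almost nothing for the check. */
static int is_extended_identifier(const struct cpp_token_sketch *tok)
{
    return (int) __builtin_expect((tok->flags & NODE_EXTENDED) != 0, 0);
}
```

The slow path (canonicalisation, or storing the noncanonical spelling) would hang off the unlikely branch, leaving the common case untouched.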
>> Adding salt to the wound, of course, is that for C the only difference between an (A) or (B) and a (C) implementation is that a (C) implementation is less expressive: there are some programs, all of which are erroneous and require a diagnostic, that can't be written. So you lose compiler performance just so users have another bullet to shoot their feet with.
>
> C++ requires (A)
This is true, but only in the sense that C requires (B). Either language can be supported by any of the three implementations with an appropriate phase 1 rule.
> Implementation of (A) could start by a (slow path, if there are extended characters present) conversion of the whole input to UCNs, or a more efficient conversion that avoids the need to convert within comments.
Although UCNs would be the most convenient form for the preprocessor, the backend would like strings to be in UTF-8, to avoid the need for conversion when outputting names to the assembler.
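The reason UTF-8 avoids a conversion step can be sketched as follows. This is an illustrative helper, not libcpp code; it decodes the four hex digits of a \uXXXX spelling into the UTF-8 bytes the assembler would receive, handling only code points below U+0800 (two-byte UTF-8), which is enough for the example:

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative sketch, not GCC's actual implementation: convert a
   "\uXXXX" UCN spelling into UTF-8 bytes, returning the byte count.
   Only code points below U+0800 are handled here. */
static int ucn_to_utf8(const char *ucn, unsigned char *out)
{
    unsigned long cp = strtoul(ucn + 2, NULL, 16);  /* skip the "\u" */
    if (cp < 0x80) {                 /* ASCII: one byte */
        out[0] = (unsigned char) cp;
        return 1;
    }
    out[0] = 0xC0 | (unsigned char)(cp >> 6);    /* leading byte */
    out[1] = 0x80 | (unsigned char)(cp & 0x3F);  /* continuation byte */
    return 2;
}
```

If identifiers are kept in UTF-8 internally, names can be emitted to the assembler as-is, and this conversion only runs on the slow path when a UCN spelling is actually encountered.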
> But if any normalisation of UCNs is documented for C++ it does need to be documented in the form of transforming UCNs to other UCNs (not to UTF-8).
Yes; but this is not a difficult problem. For C++, you would just say (following my proposed wording) that after they're converted to UTF-8, they are converted back to some canonical form of UCN ('the version with the most lower-case characters', for instance). Then, when stringifying, you would convert UTF-8 characters in identifiers to that canonical UCN.
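A hedged sketch of that stringification step, under the assumed canonical form of lowercase hex digits (the helper names are made up for illustration): decode one UTF-8 sequence from an identifier and spell it as a canonical \uXXXX.

```c
#include <stdio.h>

/* Illustrative UTF-8 decoder: handles 1-, 2-, and 3-byte sequences,
   which covers the Basic Multilingual Plane. */
static unsigned long utf8_decode(const unsigned char *s, int *len)
{
    if (s[0] < 0x80) { *len = 1; return s[0]; }
    if ((s[0] & 0xE0) == 0xC0) {
        *len = 2;
        return ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
    }
    *len = 3;
    return ((unsigned long)(s[0] & 0x0F) << 12)
           | ((unsigned long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
}

/* Spell one extended character as the assumed canonical UCN:
   "\u" followed by lowercase hex digits. */
static void spell_canonical_ucn(const unsigned char *utf8, char *out)
{
    int len;
    unsigned long cp = utf8_decode(utf8, &len);
    sprintf(out, "\\u%04lx", cp);  /* %04lx gives the lowercase form */
}
```

For example, the UTF-8 bytes 0xC3 0xA9 (U+00E9) would stringify as \u00e9 regardless of whether the source spelled it \u00E9, \u00e9, or as a literal extended character.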