On Fri, 16 Sep 2005, Geoffrey Keating wrote:

> What this means in practise, I think, is that the structure that
> represents a token, 'struct cpp_token' will grow from 16 bytes to 20
> bytes, which makes it 2 cache lines rather than 1, and a subsequent
> memory use increase and compiler performance decrease.  It might be
> that someone will think of some clever way to avoid this, but I
> couldn't think of any that would be likely to be a win overall, since
> a significant proportion of tokens are identifiers.  (I especially
> didn't like the alternative that required a second hash lookup for
> every identifier.)

There are plenty of spare bits in cpp_token to flag extended identifiers
and handle them specially (as a slow path, marked as such with
__builtin_expect).  There's one bit in the flags byte, two unused bytes
after it and a whole word not used in the case of identifiers
(identifiers use a cpp_hashnode * where strings and numbers use a struct
cpp_string which is bigger) which could store a canonical form of an
identifier (or could store the noncanonical spelling for the use of the
specific places which care about the original spelling).

> Adding salt to the wound, of course, is that for C the only difference
> between an (A) or (B) and a (C) implementation is that a (C)
> implementation is less expressive: there are some programs, all of
> which are erroneous and require a diagnostic, that can't be written.
> So you lose compiler performance just so users have another bullet
> to shoot their feet with.

C++ requires (A) and provides examples of valid programs where it can be
told whether a normalisation of UCNs is part of the
implementation-defined phase 1 transformation.  As I gave in a previous
discussion,

#include <stdlib.h>
#include <string.h>
#define h(s) #s
#define str(s) h(s)
int
main()
{
  if (strcmp(str(str(\u00c1)), "\"\\u00c1\""))
    abort ();
  if (strcmp(str(str(\u00C1)), "\"\\u00C1\""))
    abort ();
}

Implementation of (A) could start by a (slow path, if there are extended
characters present) conversion of the whole input to UCNs, or a more
efficient conversion that avoids the need to convert within comments.
But if any normalisation of UCNs is documented for C++ it does need to
be documented in the form of transforming UCNs to other UCNs (not to
UTF-8).

-- 
Joseph S. Myers               http://www.srcf.ucam.org/~jsm28/gcc/
    [EMAIL PROTECTED] (personal mail)
    [EMAIL PROTECTED] (CodeSourcery mail)
    [EMAIL PROTECTED] (Bugzilla assignments and CCs)
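
As a rough illustration of the spare-space argument above, an identifier
token could carry its original spelling in the word that the string case
already pays for, with the lookup kept on a slow path.  This is a sketch
only: every sketch_* name, the field layout and the flag value are
hypothetical stand-ins chosen for the example, not the real libcpp
declarations.

```c
/* Hypothetical, simplified stand-ins; NOT the real libcpp types.  */
struct sketch_hashnode { const char *name; };
struct sketch_string { unsigned int len; const unsigned char *text; };

/* Assumed free bit in the flags byte, marking an extended identifier.  */
#define SKETCH_EXTENDED_ID 0x80

/* Baseline token: identifiers use only one word of the union.  */
struct sketch_token_plain {
  unsigned int src_loc;
  unsigned char type;
  unsigned char flags;
  union {
    struct sketch_string str;       /* strings and numbers: two words */
    struct sketch_hashnode *node;   /* identifiers: one word */
  } val;
};

/* Same token, with the otherwise unused second word of the identifier
   case holding the noncanonical (original) spelling.  */
struct sketch_token {
  unsigned int src_loc;
  unsigned char type;
  unsigned char flags;
  union {
    struct sketch_string str;
    struct {
      struct sketch_hashnode *node;     /* canonical identifier */
      const unsigned char *spelling;    /* original spelling, slow path only */
    } ident;
  } val;
};

#if defined(__GNUC__)
# define SKETCH_UNLIKELY(x) __builtin_expect (!!(x), 0)
#else
# define SKETCH_UNLIKELY(x) (x)
#endif

/* Return the spelling as written in the source.  Ordinary identifiers
   take the fast path and never look at the spare word.  */
static const char *
sketch_original_spelling (const struct sketch_token *tok)
{
  if (SKETCH_UNLIKELY (tok->flags & SKETCH_EXTENDED_ID))
    return (const char *) tok->val.ident.spelling;
  return tok->val.ident.node->name;
}
```

On typical ILP32 and LP64 targets the two structures have the same size
in this sketch, since the identifier case only fills space the string
case already occupies, so the extra pointer costs nothing per token.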