Re: proposed Opengroup action for c99 command (XCU ERN 76)

Joseph S. Myers Fri, 16 Sep 2005 05:13:00 -0700

On Fri, 16 Sep 2005, Geoffrey Keating wrote:

> What this means in practice, I think, is that the structure that
> represents a token, 'struct cpp_token' will grow from 16 bytes to 20
> bytes, which makes it 2 cache lines rather than 1, and a subsequent
> memory use increase and compiler performance decrease.  It might be
> that someone will think of some clever way to avoid this, but I
> couldn't think of any that would be likely to be a win overall, since
> a significant proportion of tokens are identifiers.  (I especially
> didn't like the alternative that required a second hash lookup for
> every identifier.)


There are plenty of spare bits in cpp_token to flag extended identifiers 
and handle them specially (as a slow path, marked as such with 
__builtin_expect).  There's one bit in the flags byte, two unused bytes 
after it, and a whole word not used in the case of identifiers 
(identifiers use a cpp_hashnode * where strings and numbers use a struct 
cpp_string, which is bigger).  That word could store a canonical form of 
an identifier (or could store the noncanonical spelling for the use of 
the specific places which care about the original spelling).
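
To make the layout argument concrete, here is a minimal sketch in C.  It 
is not cpplib's real code: struct token, struct hashnode, struct strval, 
TOK_EXTENDED_ID and token_spelling are all hypothetical names invented 
for illustration, and the field order is only an assumption modelled on 
the description above.  It shows a spare flag bit selecting a slow path 
that reads the otherwise unused second word of the identifier case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, NOT the real cpplib declarations. */
struct hashnode { const char *canonical; };          /* plays cpp_hashnode */
struct strval   { unsigned len; const char *text; }; /* plays cpp_string   */

#define TOK_EXTENDED_ID 0x80u  /* assumed spare bit in the flags byte */
#define UNLIKELY(x) __builtin_expect(!!(x), 0)       /* GCC/Clang builtin */

union token_value {
  struct strval str;           /* strings and numbers use both words */
  struct {
    struct hashnode *node;     /* identifiers only need this word ... */
    const char *spelling;      /* ... so this one is free for the
                                  original (noncanonical) spelling */
  } id;
};

/* On common ABIs the extra pointer does not enlarge the union. */
_Static_assert(sizeof(union token_value) == sizeof(struct strval),
               "identifier case fits in the space strings already need");

struct token {
  uint32_t src_loc;
  uint8_t type;
  uint8_t flags;               /* followed by two unused padding bytes */
  union token_value val;
};

/* Spelling of an identifier token: the common case is one flag test and
   one load; only extended identifiers take the slow path. */
const char *token_spelling(const struct token *tok)
{
  assert(tok != NULL);
  if (UNLIKELY(tok->flags & TOK_EXTENDED_ID))
    return tok->val.id.spelling;        /* slow path: original spelling */
  return tok->val.id.node->canonical;   /* fast path: hash node only */
}
```

The point of the sketch is only that the token does not have to grow: 
the flag lives in an existing byte and the second pointer reuses space 
the string case already pays for.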

> Adding salt to the wound, of course, is that for C the only difference
> between an (A) or (B) and a (C) implementation is that a (C)
> implementation is less expressive: there are some programs, all of
> which are erroneous and require a diagnostic, that can't be written.
> So you lose compiler performance just so users have another bullet
> to shoot their feet with.

C++ requires (A), and there are valid programs that can tell whether a 
normalisation of UCNs is part of the implementation-defined phase 1 
transformation.  As I gave in a previous discussion,

#include <stdlib.h>
#include <string.h>
#define h(s) #s
#define str(s) h(s)
int
main()
{
  if (strcmp(str(str(\u00c1)), "\"\\u00c1\"")) abort ();
  if (strcmp(str(str(\u00C1)), "\"\\u00C1\"")) abort ();
}

An implementation of (A) could start with a conversion of the whole input 
to UCNs (a slow path, taken only if extended characters are present), or 
with a more efficient conversion that avoids converting within comments.  
But if any normalisation of UCNs is documented for C++, it needs to be 
documented as transforming UCNs to other UCNs (not to UTF-8).
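
The simple whole-input conversion could look roughly like the following 
C sketch.  The assumptions are mine: the input is taken to be UTF-8, the 
function names utf8_len and utf8_to_ucns are hypothetical, comments are 
not skipped, and the choice of upper-case hex digits is arbitrary (which 
is exactly the kind of spelling choice the program above can observe).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Length of the UTF-8 sequence introduced by LEAD, or 0 if invalid. */
static size_t utf8_len(unsigned char lead)
{
  if (lead < 0x80) return 1;
  if ((lead & 0xE0) == 0xC0) return 2;
  if ((lead & 0xF0) == 0xE0) return 3;
  if ((lead & 0xF8) == 0xF0) return 4;
  return 0;
}

/* Copy IN to OUT, replacing each non-ASCII UTF-8 character with a UCN
   (\uXXXX or \UXXXXXXXX).  Returns the output length, or (size_t)-1 if
   the input is malformed or OUT is too small. */
size_t utf8_to_ucns(const char *in, char *out, size_t outsize)
{
  size_t n = strlen(in), i, o = 0;
  int has_extended = 0;

  for (i = 0; i < n; i++)
    if ((unsigned char)in[i] >= 0x80) { has_extended = 1; break; }

  if (!__builtin_expect(has_extended, 0)) {
    /* Fast path: pure ASCII input is copied unchanged. */
    if (n + 1 > outsize) return (size_t)-1;
    memcpy(out, in, n + 1);
    return n;
  }

  /* Slow path: decode each character and re-spell it as a UCN. */
  for (i = 0; i < n; ) {
    const unsigned char *s = (const unsigned char *)in + i;
    size_t len = utf8_len(s[0]), k;
    unsigned long cp;
    int w;

    if (len == 0 || len > n - i) return (size_t)-1;
    if (len == 1) {
      if (o + 1 >= outsize) return (size_t)-1;
      out[o++] = in[i++];
      continue;
    }
    cp = s[0] & (0xFFu >> (len + 1));      /* payload bits of lead byte */
    for (k = 1; k < len; k++) {
      if ((s[k] & 0xC0) != 0x80) return (size_t)-1;
      cp = (cp << 6) | (s[k] & 0x3Fu);
    }
    w = snprintf(out + o, outsize - o,
                 cp <= 0xFFFF ? "\\u%04lX" : "\\U%08lX", cp);
    if (w < 0 || (size_t)w >= outsize - o) return (size_t)-1;
    o += (size_t)w;
    i += len;
  }
  out[o] = '\0';
  return o;
}
```

Given the bytes of "int Á;" in UTF-8, this produces "int \u00C1;", 
while input with no extended characters takes the fast path and is 
copied as-is.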

-- 
Joseph S. Myers               http://www.srcf.ucam.org/~jsm28/gcc/
    [EMAIL PROTECTED] (personal mail)
    [EMAIL PROTECTED] (CodeSourcery mail)
    [EMAIL PROTECTED] (Bugzilla assignments and CCs)
