On 16/09/2005, at 5:12 AM, Joseph S. Myers wrote:
> On Fri, 16 Sep 2005, Geoffrey Keating wrote:
>
>> What this means in practice, I think, is that the structure that represents a token, 'struct cpp_token', will grow from 16 bytes to 20 bytes, which makes it 2 cache lines rather than 1, with a consequent increase in memory use and decrease in compiler performance. It might be that someone will think of some clever way to avoid this, but I couldn't think of any that would be likely to be a win overall, since a significant proportion of tokens are identifiers. (I especially didn't like the alternative that required a second hash lookup for every identifier.)
>
> There are plenty of spare bits in cpp_token to flag extended identifiers and handle them specially (as a slow path, marked as such with __builtin_expect). There's one bit in the flags byte, two unused bytes after it, and a whole word not used in the case of identifiers (identifiers use a cpp_hashnode * where strings and numbers use a struct cpp_string, which is bigger) which could store a canonical form of an identifier (or could store the noncanonical spelling for the use of the specific places which care about the original spelling).
Yes, I think this can be made to work efficiently.
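To make the spare-bits idea concrete, here is a minimal sketch of the approach being agreed to. The structure layout, the NODE_EXTENDED bit, and the function names are illustrative assumptions, not libcpp's actual definitions; the point is that one spare flag bit plus __builtin_expect keeps ordinary identifiers on the fast path:

```c
#include <assert.h>

/* Hypothetical spare bit in the flags byte marking an extended
   identifier (assumed value, not GCC's actual flag). */
#define NODE_EXTENDED 0x40

/* Illustrative token layout: a flags byte with a spare bit, the two
   unused padding bytes mentioned above, and a value word that is a
   pointer for identifiers but a larger struct for strings/numbers. */
struct cpp_token_sketch {
    unsigned char type;
    unsigned char flags;     /* one spare bit flags extended identifiers */
    unsigned char pad[2];    /* the two unused bytes after the flags byte */
    union {
        void *node;          /* identifiers: pointer into the hash table */
        struct { const unsigned char *text; unsigned int len; } str;
    } val;
};

/* The rarely-taken branch is marked unlikely with __builtin_expect,
   so plain identifiers pay almost nothing for the check. */
static int is_extended_identifier(const struct cpp_token_sketch *tok)
{
    return (int) __builtin_expect((tok->flags & NODE_EXTENDED) != 0, 0);
}
```

The slow path (canonicalisation, or storing the noncanonical spelling) would hang off the unlikely branch, leaving the common case untouched.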
>> Adding salt to the wound, of course, is that for C the only difference between an (A) or (B) and a (C) implementation is that a (C) implementation is less expressive: there are some programs, all of which are erroneous and require a diagnostic, that can't be written. So you lose compiler performance just so users have another bullet to shoot their feet with.
>
> C++ requires (A)
This is true, but only in the sense that C requires (B). Either language can be supported by any of the three implementations with an appropriate phase 1 rule.
> Implementation of (A) could start by a (slow path, if there are extended characters present) conversion of the whole input to UCNs, or a more efficient conversion that avoids the need to convert within comments.
Although UCNs would be the most convenient form for the preprocessor, the backend would like strings to be in UTF-8, to avoid the need for conversion when outputting names to the assembler.
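The reason UTF-8 avoids a conversion step can be sketched as follows. This is an illustrative helper, not libcpp code; it decodes the four hex digits of a \uXXXX spelling into the UTF-8 bytes the assembler would receive, handling only code points below U+0800 (two-byte UTF-8), which is enough for the example:

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative sketch, not GCC's actual implementation: convert a
   "\uXXXX" UCN spelling into UTF-8 bytes, returning the byte count.
   Only code points below U+0800 are handled here. */
static int ucn_to_utf8(const char *ucn, unsigned char *out)
{
    unsigned long cp = strtoul(ucn + 2, NULL, 16);  /* skip the "\u" */
    if (cp < 0x80) {                 /* ASCII: one byte */
        out[0] = (unsigned char) cp;
        return 1;
    }
    out[0] = 0xC0 | (unsigned char)(cp >> 6);    /* leading byte */
    out[1] = 0x80 | (unsigned char)(cp & 0x3F);  /* continuation byte */
    return 2;
}
```

If identifiers are kept in UTF-8 internally, names can be emitted to the assembler as-is, and this conversion only runs on the slow path when a UCN spelling is actually encountered.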
> But if any normalisation of UCNs is documented for C++ it does need to be documented in the form of transforming UCNs to other UCNs (not to UTF-8).
Yes; but this is not a difficult problem. For C++, you would just say (following my proposed wording) that after they're converted to UTF-8, they are converted back to some canonical form of UCN ('the version with the most lower-case characters', for instance). Then, when stringifying, you would convert UTF-8 characters in identifiers to that canonical UCN.
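A hedged sketch of that stringification step, under the assumed canonical form of lowercase hex digits (the helper names are made up for illustration): decode one UTF-8 sequence from an identifier and spell it as a canonical \uXXXX.

```c
#include <stdio.h>

/* Illustrative UTF-8 decoder: handles 1-, 2-, and 3-byte sequences,
   which covers the Basic Multilingual Plane. */
static unsigned long utf8_decode(const unsigned char *s, int *len)
{
    if (s[0] < 0x80) { *len = 1; return s[0]; }
    if ((s[0] & 0xE0) == 0xC0) {
        *len = 2;
        return ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
    }
    *len = 3;
    return ((unsigned long)(s[0] & 0x0F) << 12)
           | ((unsigned long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
}

/* Spell one extended character as the assumed canonical UCN:
   "\u" followed by lowercase hex digits. */
static void spell_canonical_ucn(const unsigned char *utf8, char *out)
{
    int len;
    unsigned long cp = utf8_decode(utf8, &len);
    sprintf(out, "\\u%04lx", cp);  /* %04lx gives the lowercase form */
}
```

For example, the UTF-8 bytes 0xC3 0xA9 (U+00E9) would stringify as \u00e9 regardless of whether the source spelled it \u00E9, \u00e9, or as a literal extended character.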