Paul Eggert <[EMAIL PROTECTED]> writes:

Hi Paul,

> I proposed to insert the following paragraph after XCU page 213 line
> 8366 (i.e, at the end of the INPUT FILES section of the c99 spec
> <http://www.opengroup.org/onlinepubs/009695399/utilities/c99.html>):
> 
>    It is implementation-defined whether trailing white-space characters
>    in each C-language source line are ignored.  Otherwise, the
>    multibyte characters of each source line are mapped on a one-to-one
>    basis to the C source character set.
> 
> In response Joseph S. Myers pointed out that this action would require
> c99 to use interpretation B of section 5.2.1 (page 20) of the C99 Rationale
> <http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf>.
> The Rationale says C preprocessors can be implemented in three ways:
> 
>   A.  Convert everything to UCNs in basic source characters as soon
>       as possible, that is, in translation phase 1.  (This is what
>       C++ requires, apparently.)
> 
>   B.  Use native encodings where possible, UCNs otherwise.
> 
>   C.  Convert everything to wide characters as soon as possible
>       using an internal encoding that encompasses the entire source
>       character set and all UCNs.
> 
> The C99 standardizers chose (B), but said implementations could also
> use (A) or (C) because the C99 standard gives almost unlimited freedom
> in translation phase 1 for compilers to do whatever transformations
> they like.
> 
> However, the proposed action for the c99 command would close this
> escape hatch, forcing interpretation (B) for c99 implementations.
> 
> So my question is: Is it a burden on GCC to require interpretation (B)?
> 
> My understanding is that GCC already uses (B), and that the answer is
> "no, it's no problem", but if I'm wrong please let me know.

I believe that GCC currently uses (C), from a language standpoint.

Following the discussion in
<http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9449>, you can tell the
difference between an (A) or (B) and a (C) implementation by writing:

#define foo \u00c1
#define foo \u00C1
int x;

in a text file and seeing whether the compiler accepts it.  \u00c1 and
\u00C1 are two spellings of the same character, so an implementation
that converts early sees two identical definitions of foo, while one
that preserves the spellings sees a non-identical redefinition and
must issue a diagnostic.  If the compiler accepts the testcase, it's a
type (C) implementation.  GCC presently accepts it.

The actual internal implementation does not exactly follow any of
these models.  (For instance, it really uses UTF-8 rather than wide
characters; and strings aren't handled in quite the same way as
identifiers.)

There is a difference in terms of implementation cost.  A type (C)
implementation can immediately unify all identifiers that refer to the
same name, even if they are spelt differently.  An (A) or (B)
implementation, by contrast, must keep the differently-spelt forms of
the same identifier distinct until the end of preprocessing; yet when
determining whether a token has a macro replacement it must already
treat those forms as the same identifier, and it must eventually unify
them so that the rest of the compiler doesn't have to do string
comparisons.
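
To make that bookkeeping concrete, here is a rough sketch of the kind
of data an (A) or (B) implementation would have to carry around.  The
names are invented for illustration; this is not how libcpp is
actually laid out:

/* Illustrative sketch only -- hypothetical names, not libcpp's real
   data structures.  */

struct macro_def;                /* whatever represents a #define */

struct canonical_ident           /* one node per identifier, however spelt */
{
  const unsigned char *name;     /* normalized name, e.g. in UTF-8 */
  struct macro_def *macro;       /* macro replacement, if any */
};

struct spelling_ident            /* one node per distinct spelling */
{
  const unsigned char *spelling;       /* as written: "\u00c1" or "\u00C1" */
  struct canonical_ident *canonical;   /* the unified node */
};

/* Macro lookup has to go through the canonical node so that both
   spellings above find the same definition.  */
static struct macro_def *
lookup_macro (const struct spelling_ident *id)
{
  return id->canonical->macro;
}

A type (C) implementation needs only the canonical node, because the
distinct spellings have already collapsed by the time identifiers are
hashed.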

What this means in practice, I think, is that the structure that
represents a token, 'struct cpp_token', will grow from 16 bytes to 20
bytes; since a 20-byte token can straddle a cache-line boundary where
a 16-byte one cannot, a single token may then occupy 2 cache lines
rather than 1, with a corresponding increase in memory use and
decrease in compiler performance.  It might be
that someone will think of some clever way to avoid this, but I
couldn't think of any that would be likely to be a win overall, since
a significant proportion of tokens are identifiers.  (I especially
didn't like the alternative that required a second hash lookup for
every identifier.)
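
For what it's worth, here is a back-of-the-envelope illustration of
the cache-line point.  It assumes 64-byte cache lines and densely
packed tokens of the sizes quoted above; it is only a sketch of the
arithmetic, not a measurement of cpplib itself:

#include <stdio.h>

/* Count how many of COUNT densely packed SIZE-byte tokens cross a
   LINE-byte cache-line boundary.  */
static unsigned int
straddles (unsigned int size, unsigned int count, unsigned int line)
{
  unsigned int i, n = 0;
  for (i = 0; i < count; i++)
    if ((i * size) / line != (i * size + size - 1) / line)
      n++;
  return n;
}

int
main (void)
{
  printf ("16-byte tokens crossing a 64-byte line: %u of 1000\n",
          straddles (16, 1000, 64));
  printf ("20-byte tokens crossing a 64-byte line: %u of 1000\n",
          straddles (20, 1000, 64));
  return 0;
}

With 16-byte tokens none cross a boundary; with 20-byte tokens about a
quarter of them do.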

Rubbing salt in the wound, of course, is that for C the only difference
between an (A) or (B) and a (C) implementation is that a (C)
implementation is less expressive: there are some programs, all of
which are erroneous and require a diagnostic, that can't be written.
So you lose compiler performance just so users have another bullet
to shoot their feet with.
