I recently proposed to the Open Group an action that would modify the POSIX specification for the c99 command, which is often implemented using GCC. I thought the action would not affect GCC's conformance, but Joseph S. Myers raised the issue of UCNs and multibyte characters, and I'd like to double-check that GCC is OK. If the action does affect GCC, I'd like to modify the action before it's too late.
Here's the problem. Currently, POSIX places almost no requirements on how c99 transforms the physical source file into C source-language characters. For example, c99 is free to treat CR as LF, ignore trailing white space, convert tabs to spaces, or even (perversely) require that input files all start with line numbers that are otherwise ignored. This lack of specification was not intended, and I'm trying to help nail down the intent of what c99 is allowed to do.

I proposed to insert the following paragraph after XCU page 213 line 8366 (i.e., at the end of the INPUT FILES section of the c99 spec <http://www.opengroup.org/onlinepubs/009695399/utilities/c99.html>):

    It is implementation-defined whether trailing white-space characters in each C-language source line are ignored. Otherwise, the multibyte characters of each source line are mapped on a one-to-one basis to the C source character set.

In response Joseph S. Myers pointed out that this action would require c99 to use interpretation B of section 5.2.1 (page 20) of the C99 Rationale <http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf>. The Rationale says C preprocessors can be implemented in three ways:

  A. Convert everything to UCNs in basic source characters as soon as possible, that is, in translation phase 1. (This is what C++ requires, apparently.)

  B. Use native encodings where possible, UCNs otherwise.

  C. Convert everything to wide characters as soon as possible using an internal encoding that encompasses the entire source character set and all UCNs.

The C99 standardizers chose (B), but said implementations could also use (A) or (C) because the C99 standard gives almost unlimited freedom in translation phase 1 for compilers to do whatever transformations they like. However, the proposed action for the c99 command would close this escape hatch, forcing interpretation (B) for c99 implementations.

So my question is: Is it a burden on GCC to require interpretation (B)?
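
To make the difference between (A) and (B) concrete, here is a rough sketch of what an interpretation-(A) phase 1 might do to a single source line. This is only an illustration, not GCC code: the function name line_to_ucns is made up, and I am assuming the physical source file is UTF-8. It also does not reject overlong encodings or surrogates, which a real implementation would have to consider.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not taken from GCC): rewrite one UTF-8 source line
   so that every non-ASCII character becomes a UCN (\uXXXX or \UXXXXXXXX),
   roughly what interpretation (A) would do in translation phase 1.
   Returns the output length, or -1 on malformed input or if the output
   buffer is too small. */
static int line_to_ucns(const char *in, char *out, size_t outsz)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t o = 0;

    while (*p) {
        unsigned long cp;
        int extra;

        /* Decode the lead byte and find how many continuation bytes follow. */
        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return -1;
        p++;

        /* Fold in the continuation bytes (10xxxxxx). */
        for (; extra > 0; extra--, p++) {
            if ((*p & 0xC0) != 0x80)
                return -1;
            cp = (cp << 6) | (*p & 0x3F);
        }

        if (cp < 0x80) {
            /* ASCII characters pass through unchanged. */
            if (o + 1 >= outsz)
                return -1;
            out[o++] = (char)cp;
        } else {
            /* Everything else is spelled as a UCN. */
            if (outsz - o < 11)      /* room for \UXXXXXXXX plus NUL */
                return -1;
            o += (size_t)sprintf(out + o,
                                 cp <= 0xFFFF ? "\\u%04lX" : "\\U%08lX", cp);
        }
    }

    if (o >= outsz)
        return -1;
    out[o] = '\0';
    return (int)o;
}
```

With this sketch, a line containing the UTF-8 bytes for "café" comes out as "caf\u00E9". Under (B), as I understand it, the é would instead stay in its native multibyte encoding, and a UCN would appear only where the programmer wrote one or where no native encoding exists.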
My understanding is that GCC already uses (B), and that the answer is "no, it's no problem", but if I'm wrong please let me know. For more details, please see Shell and Utilities Enhancement Request Number 76 (XCU ERN 76), which you can find in <http://www.opengroup.org/austin/aardvark/latest/xcubug2.txt>. Also please see the followup email discussion at <http://www.opengroup.org/austin/mailarchives/ag/> (look for messages whose subject lines contain "XCU ERN 76"). Thanks.