[PATCH] D106577: [clang] Define __STDC_ISO_10646__

Aaron Ballman via Phabricator via cfe-commits Fri, 23 Jul 2021 04:40:32 -0700

aaron.ballman requested changes to this revision.
aaron.ballman added a comment.
This revision now requires changes to proceed.

In D106577#2899574 <https://reviews.llvm.org/D106577#2899574>, @cor3ntin wrote:

> In D106577#2898967 <https://reviews.llvm.org/D106577#2898967>, 
> @hubert.reinterpretcast wrote:
>
>> Every character in the Unicode required set encoded in what way? To say that 
>> such a character is stored in an object of type `wchar_t` means that 
>> interpreting the `wchar_t` yields that stored character. Methods to 
>> determine the interpretation of the stored `wchar_t` value include 
>> locale-sensitive functions such as `wcstombs` (and thus is tied to libc).
>
> "has the same value as the short identifier of that character." implies 
> UTF-32.
> There is no mention of interpretation here, the *value* is the same. As in, 
> when casting to an integer type you get the code point value.

This is how I interpret the words from the standard as well. I think it's 
purely about the bit width of `wchar_t` and whether it's wide enough to hold 
all Unicode code points as of a particular Unicode standard release.
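
A minimal sketch of the "is `wchar_t` wide enough" reading, assuming the test is simply whether `WCHAR_MAX` reaches the highest code point, U+10FFFF. The helper name `wchar_can_hold_all_code_points` and the macro `MAX_CODE_POINT` are hypothetical and only serve the illustration:

```c
#include <stdint.h>
#include <wchar.h>

/* Highest code point defined by Unicode / ISO 10646 (U+10FFFF). */
#define MAX_CODE_POINT 0x10FFFFUL

/* Hypothetical helper: returns 1 if every code point fits in wchar_t on
 * this target, 0 otherwise (for example, with a 16-bit wchar_t). */
static int wchar_can_hold_all_code_points(void) {
    return (uintmax_t)WCHAR_MAX >= (uintmax_t)MAX_CODE_POINT;
}
```

Under this reading the check says nothing about how libc interprets the stored value; it only covers the range of the type.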

I tried to do some archeology to see how this predefined macro came into 
existence. It was added in C99, at a time before we seemed to be collecting 
editor's reports, and there are no obvious papers on the topic, so I don't know 
what proposal added the feature. The C99 rationale document does not mention 
the macro at all, but from my reading of the rationale, it seems possible that 
this macro is related to the introduction of UCNs and whether `\Unnnnnnnn` can 
be stored in a `wchar_t`.
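
To make the possible UCN connection concrete, here is a small sketch that checks whether a wide literal written as `\Unnnnnnnn` has the code point as its integer value. It assumes the macro is visible after including `<wchar.h>` (true with glibc, not guaranteed elsewhere); the function name `ucn_value_matches_code_point` is made up for this example:

```c
#include <wchar.h>

/* Hypothetical helper: returns 1 if the wide literal for U+1F600 has the
 * integer value 0x1F600, 0 if it does not, and -1 if
 * __STDC_ISO_10646__ is not defined, in which case no assumption about
 * the wide character representation is made. */
static int ucn_value_matches_code_point(void) {
#ifdef __STDC_ISO_10646__
    wchar_t wc = L'\U0001F600';
    return (unsigned long)wc == 0x1F600UL;
#else
    return -1;
#endif
}
```

The literal sits inside the `#ifdef` so that targets with a narrower `wchar_t`, which do not define the macro, never have to compile it.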

One thing I did find when doing my research though was: 
https://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html 
which says, in part,

  The standard defines at least a macro __STDC_ISO_10646__ that is only defined 
  on systems where the wchar_t type encodes ISO 10646 characters. If this symbol 
  is not defined one should avoid making assumptions about the wide character 
  representation.

This matches the interpretation that the libc encoding is salient, but we 
still need to define what happens in freestanding environments where there is 
no libc with character conversion functions, so it also sounds like it's 
partially in the realm of the compiler.
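
For reference, the way user code would typically follow the glibc manual's advice can be sketched as below. The standard specifies the macro's value as an integer constant of the form `yyyymmL`; the helper names `iso_10646_version` and `code_point_of` are hypothetical:

```c
#include <wchar.h>

/* Returns the yyyymmL value of __STDC_ISO_10646__, or 0 if the macro is
 * not defined once <wchar.h> has been included. */
static long iso_10646_version(void) {
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0L;
#endif
}

/* Treats a wchar_t as a code point only when the macro licenses that
 * assumption; returns -1 otherwise. */
static long code_point_of(wchar_t wc) {
    if (iso_10646_version() == 0)
        return -1;
    return (long)wc;
}
```

In a freestanding build with no libc headers, whether `iso_10646_version()` returns nonzero depends entirely on the compiler predefining the macro, which is the case this review is about.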

> *Storing* that value might involve either assigning from a wide-character 
> literal or `mbrtowc`.
> Both methods imply some transcoding,  the latter of which could be affected 
> by locale such that it would store a different character, but again, is it 
> related to this wording?
>
> Note that by virtue of being a macro this cannot possibly be affected by 
> locale.
>
> A few scenarios
>
> - The encoding of wide literal as determined by clang is not utf-32, the 
> macro should be defined by neither the compiler nor the library
> - The encoding of wide literals as determined by the compiler is utf-32, libc 
> agrees... this works as intended
> - The encoding of wide literals as determined by the compiler is utf-32, libc 
> disagrees... nothing good can come of that.
>
> The compiler and the libc have to agree here.
> The library cannot (should not) define this macro without knowing the wide 
> literal encoding.

I agree that the compiler and libc need to agree on the encoding.
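
One way to observe agreement or disagreement at run time is to decode a known UTF-8 sequence with `mbrtowc` and compare the result with the compiler's wide literal. This is only a sketch: the locale names `C.UTF-8` and `en_US.UTF-8` are assumptions about what the system provides, and `libc_agrees_with_wide_literal` is a hypothetical name:

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Returns 1 if libc decodes UTF-8 U+20AC to the same wchar_t value the
 * compiler assigns to the wide literal, 0 if decoding fails or the values
 * differ, and -1 if no UTF-8 locale could be selected. */
static int libc_agrees_with_wide_literal(void) {
    static const char euro_utf8[] = "\xE2\x82\xAC"; /* U+20AC in UTF-8 */
    char saved[128];
    const char *current = setlocale(LC_CTYPE, NULL);
    mbstate_t state;
    wchar_t wc = 0;
    size_t n;

    /* Remember the current locale so it can be restored afterwards. */
    if (current == NULL || strlen(current) >= sizeof saved)
        return -1;
    strcpy(saved, current);

    /* Assumed locale names; either may be missing on a given system. */
    if (setlocale(LC_CTYPE, "C.UTF-8") == NULL &&
        setlocale(LC_CTYPE, "en_US.UTF-8") == NULL)
        return -1;

    memset(&state, 0, sizeof state);
    n = mbrtowc(&wc, euro_utf8, sizeof euro_utf8 - 1, &state);

    setlocale(LC_CTYPE, saved);

    if (n != 3)
        return 0;
    return wc == L'\u20AC';
}
```

Because the result depends on the locale selected at run time, a check like this cannot stand in for the macro, which is fixed at compile time.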

> Note that both standards imply that these macros should be defined when 
> relevant independently of the environment which includes hosted and 
> non-Linux+glibc platforms. So relying on a specific glibc implementation
> seems dubious. Especially as glibc will *always* define that macro

I think the point was more about "who is generally responsible for defining 
this macro, the compiler or the library" as opposed to it being a glibc thing 
specifically. I notice that musl also defines the macro 
(https://git.musl-libc.org/cgit/musl/tree/include/stdc-predef.h#n4).

> Now, I agree that the compiler and the library should ideally expose the same 
> *value* for this macro (although I struggle to find code that actually relies 
> on the value)
>
> When D34158 <https://reviews.llvm.org/D34158> as mentioned by @jyknight 
> lands, the value will be set to that of the library version thereby 
> overriding the compiler default.
> On other systems, the value will be set to the library version whenever the 
> library is included.

I think that's the correct behavior. The compiler says "my wchar_t encodes ISO 
10646" and the library has the chance to say "my wide char functions expect 
something else" if need be.

Given that there are two people who think this macro relates to the standard 
library, I'm going to mark the review as needing changes so we don't 
accidentally land it. I think we should ask for an interpretation on the WG14 
reflectors and come back once we have more information.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D106577/new/

https://reviews.llvm.org/D106577

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
