On Fri, Aug 13, 2021 at 7:42 PM Aaron Ballman via Phabricator < revi...@reviews.llvm.org> wrote:
> aaron.ballman added a comment.
>
> In D106577#2943837 <https://reviews.llvm.org/D106577#2943837>, @jyknight wrote:
>
> > In D106577#2904960 <https://reviews.llvm.org/D106577#2904960>, @rsmith wrote:
> >
> >>> One specific example I'd like to be considered:
> >>> Suppose the C standard library implementation's mbstowcs converts a
> >>> certain multi-byte character C to somewhere in the Unicode private use
> >>> area, because Unicode version N doesn't have a corresponding character.
> >>> Suppose further that the compiler is aware of Unicode version N+1, in
> >>> which a character corresponding to C was added. Is an implementation
> >>> formed by that combination of compiler and standard library, that
> >>> defines `__STDC_ISO_10646__` to N+1, conforming? Or is it non-conforming
> >>> because it represents character C as something other than the
> >>> corresponding short name from Unicode version N+1?
> >>
> >> And David Keaton (long-time WG14 member and current convener) replied:
> >>
> >>> Yikes! It does indeed sound like the library would affect the value
> >>> of `__STDC_ISO_10646__` in that case. Thanks for clarifying the details.
> >>
> >> There was no further discussion after that point, so I think the
> >> unofficial WG14 stance is that the compiler and the library need to
> >> collude on setting the value of that macro.
> >
> > I don't think that scenario is valid. MBCS-to-unicode mappings are a
> > part of the definition of the MBCS (sometimes officially, sometimes
> > de-facto defined by major vendors), not in the definition of Unicode.
>
> Isn't that scenario basically the one we're in today where the compiler is
> unaware of what mappings the library provides?
>
> > And in fact, we have a real-life example of this: the GB18030 encoding.
> > That standard specifies 24 characters mappings to private-use-area unicode
> > codepoints in the most recent version, GB18030-2005. (Which is down from 80
> > PUA mappings in its predecessor encoding GBK, and 25 in GB18030-2000.)
> > Yet, a new version of Unicode coming out will not affect that. Rather, I
> > should say, DID NOT affect that -- all of those 24 characters mapped to
> > PUAs in GB18030-2005 were actually assigned official unicode codepoints by
> > 2005 (some in Unicode 3.1, some in Unicode 4.1). But no matter -- GB18030
> > still maps those to PUA code-points. The only way that can change is if
> > GB18030 gets updated.
> >
> > I do note that some implementations (e.g. glibc) have taken it upon
> > themselves to modify the official GB18030 character mapping table, and to
> > decode those 24 codepoints to the newly-defined unicode characters, instead
> > of the specified PUA codepoints. But there's no way that can be described
> > as a requirement -- it's not even technically correct!
>
> Does that imply that an implementation supporting that encoding can't
> define __STDC_ISO_10646__ because it doesn't meet the "has the same value
> as the short identifier" requirement?

FYI, there should be a revision of GB18030 this year that will no longer use
the PUA. In general, the PUA is considered "not for interchange", so a system
that interprets PUA codepoints differently at different points in time is
outside of any guarantees provided by Unicode. GB18030-2005 is a weird
exception: in general, the standard library should never transcode to the
PUA, as this is not portable. GB18030, despite having a 1-1 mapping to
Unicode, has to be considered a distinct character set from Unicode. As such,
a system where wide literals are GB18030-encoded should not define
__STDC_ISO_10646__.

> @jyknight, are you on the WG14 reflectors btw? Would you like to carry on
> with this discussion over there (or would you like me to convey your
> viewpoints on your behalf)?
>
>
> Repository:
>   rG LLVM Github Monorepo
>
> CHANGES SINCE LAST ACTION
>   https://reviews.llvm.org/D106577/new/
>
> https://reviews.llvm.org/D106577
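
To make the point in my reply above about the PUA and __STDC_ISO_10646__
concrete, here is a minimal C sketch. It is only an illustration, not anything
from the patch under review: the helper names `is_private_use` and
`iso10646_version` are hypothetical, and the code assumes the caller already
has a decoded code point in hand (for example, a value produced by mbstowcs on
a platform where wchar_t holds ISO/IEC 10646 values). The ranges are the three
private-use ranges defined by Unicode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if cp lies in one of the three Unicode
   private-use ranges:
     U+E000..U+F8FF      (BMP private use area)
     U+F0000..U+FFFFD    (plane 15)
     U+100000..U+10FFFD  (plane 16)
   Precondition: cp is at most U+10FFFF. */
bool is_private_use(uint32_t cp) {
    assert(cp <= 0x10FFFFu);
    return (cp >= 0xE000u && cp <= 0xF8FFu) ||
           (cp >= 0xF0000u && cp <= 0xFFFFDu) ||
           (cp >= 0x100000u && cp <= 0x10FFFDu);
}

/* Hypothetical helper: the yyyymm value of __STDC_ISO_10646__ if the
   implementation defines it, or 0 if it does not. */
long iso10646_version(void) {
#ifdef __STDC_ISO_10646__
    return (long)__STDC_ISO_10646__;
#else
    return 0L;
#endif
}
```

A caller could treat a nonzero `iso10646_version()` combined with a
transcoding result for which `is_private_use` returns true as a hint that the
library's mapping table, not the compiler, is deciding what the wide character
values mean. That is the GB18030-2005 situation discussed above.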
_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits