On 07/20/2016 01:38 PM, David Malcolm wrote:
On Fri, 2016-07-08 at 17:49 -0400, David Malcolm wrote:
[...]
Also, this patch currently makes the assumption (in charset.c)
that there's a 1:1 correspondence between bytes in the source
character set and bytes in the execution character set. This can
be the case if both are, say, UTF-8, but might not hold in
general.
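For concreteness, a case where that 1:1 assumption would break down (the charset choices below are only illustrative):

/* Illustrative only: a literal whose source bytes are not 1:1 with its
   execution bytes once a conversion is involved.  In UTF-8 source,
   "é" occupies two bytes (0xC3 0xA9); compiled with
   -fexec-charset=IBM1047 it becomes a single EBCDIC byte, and in a
   wide literal it typically becomes four bytes (assuming a UTF-32
   wide execution charset).  */
#include <wchar.h>

const char *narrow = "é";   /* 2 source bytes -> 1 execution byte  */
const wchar_t *wide = L"é"; /* 2 source bytes -> 4 execution bytes */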
The source char set is UTF-8 or UTF-EBCDIC, and safe-ctype.c has:
#if HOST_CHARSET == HOST_CHARSET_EBCDIC
  #error "FIXME: write tables for EBCDIC"
so presumably we don't actually have any hosts that support EBCDIC
(do we?); as far as I can tell, we only currently support UTF-8
as the source char set.
Similarly, do we support any targets for which the execution
character set is *not* UTF-8?
I brought this up in this thread on the gcc mailing list:
"gcc/libcpp: non-UTF-8 source or execution encodings?"
https://gcc.gnu.org/ml/gcc/2016-07/msg00091.html
and in particular:
https://gcc.gnu.org/ml/gcc/2016-07/msg00106.html
it's possible to select the execution char set at the command line
for C-family frontends using:
-fexec-charset=
-fwide-exec-charset=
e.g. "-fexec-charset=IBM1047" will give one of the variants of EBCDIC.
Given that the internal interface already has a failure mode, I'm
thinking that a reasonable restriction is to only support locations
within string literals for the case where source character set ==
execution character set, and hence we have "convert_no_conversion" as
the converter. Does that sound sane? (I can write test coverage for
this).
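Concretely, the restriction amounts to a check along these lines (a
rough sketch only, not the actual libcpp interface):

/* A minimal sketch of the proposed check: only hand out per-byte
   locations within a string literal when the converter is the identity
   conversion, i.e. when source and execution character sets agree.
   Any other converter may change byte counts, so offsets into the
   execution string could not be mapped reliably back to source bytes.  */

typedef void (*convert_fn) (const unsigned char *src, unsigned long len,
                            unsigned char *dst);

/* Stand-in for libcpp's convert_no_conversion: execution bytes are the
   source bytes, unchanged.  */
static void
convert_no_conversion (const unsigned char *src, unsigned long len,
                       unsigned char *dst)
{
  while (len--)
    *dst++ = *src++;
}

/* The proposed restriction: support locations within string literals
   only for the identity converter.  */
static int
string_literal_locations_supported_p (convert_fn converter)
{
  return converter == convert_no_conversion;
}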
I think this is sane. We can always revisit later if we change our
minds, particularly if folks want to do something crazy like self-host
on an EBCDIC system.
jeff