On 8/25/23 16:49, Jakub Jelinek wrote:
Hi!
This paper voted in as DR makes some multi-character literals ill-formed.
'abcd' stays valid, but e.g. 'รก' is newly invalid in UTF-8 exec charset
while valid e.g. in ISO-8859-1, because it is a single character which needs
2 bytes to be encoded.
The following patch does that by checking (only pedantically, especially
because it is a DR) if we'd emit a -Wmultichar warning because character
constant has more than one byte in it whether the number of bytes in the
narrow string matches number of bytes in CPP_STRING32 divided by char32_t
size in bytes. If it is, it is normal multi-character literal constant
and is diagnosed normally with -Wmultichar, if the number of bytes is
larger, at least one of the c-chars in the sequence was encoded as 2+
bytes.
Now, doing this way has 2 drawbacks, some of the diagnostics which doesn't
result in cpp_interpret_string_1 failures can be printed twice, once
when calling cpp_interpret_string_1 for CPP_CHAR, once for CPP_STRING32.
And, functionally I think it must work 100% correctly if host source
character set is UTF-8 (because all valid UTF-8 chars are encodable in
UTF-32), but might not work for some control codes in UTF-EBCDIC if
that is the source character set (though I don't know if we really actually
support it, e.g. Linux iconv certainly doesn't).
All we actually need is count the number of c-chars in the literal,
alternative would be to write custom character counter which would quietly
interpret/skip over + count escape sequences and decode UTF-8 characters
in between those escape sequences. But we'd need to have something similar
also for UTF-EBCDIC if it works at all, and from what I've looked, we don't
have anyything like that implemented in libcpp nor anywhere else in GCC.
Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
Or ok with some tweaks to avoid the second round of diagnostics from
cpp_interpret_string_1/convert_escape? Or reimplement that second time and
count manually?
2023-08-25 Jakub Jelinek <ja...@redhat.com>
PR c++/110341
libcpp/
* charset.cc: Implement C++ 26 P1854R4 - Making non-encodable string
literals ill-formed.
(narrow_str_to_charconst): Change last type from cpp_ttype to
const cpp_token *. For C++ if pedantic and i > 1 in CPP_CHAR
interpret token also as CPP_STRING32 and if number of characters
in the CPP_STRING32 is larger than number of bytes in CPP_CHAR,
pedwarn on it.
(cpp_interpret_charconst): Adjust narrow_str_to_charconst caller.
gcc/testsuite/
* g++.dg/cpp26/literals1.C: New test.
* g++.dg/cpp26/literals2.C: New test.
* g++.dg/cpp23/wchar-multi1.C (c, d): Expect an error rather than
warning.
--- gcc/testsuite/g++.dg/cpp26/literals1.C.jj 2023-08-25 17:23:06.662878355
+0200
+++ gcc/testsuite/g++.dg/cpp26/literals1.C 2023-08-25 17:37:03.085132304
+0200
@@ -0,0 +1,65 @@
+// C++26 P1854R4 - Making non-encodable string literals ill-formed
+// { dg-do compile { target c++11 } }
+// { dg-require-effective-target int32 }
+// { dg-options "-pedantic-errors -finput-charset=UTF-8 -fexec-charset=UTF-8" }
+
+int d = '๐'; // { dg-error "character too
large for character literal type" }
...
+char16_t m = u'๐'; // { dg-error "character
constant too long for its type" }
Why are these different diagnostics? Why doesn't the first line already
hit the existing diagnostic that the second gets?
Both could be clearer that the problem is that the single source
character can't be encoded as a single execution character.
Jason