On 11/2/23 03:53, Jakub Jelinek wrote:
On Fri, Oct 27, 2023 at 07:05:34PM -0400, Jason Merrill wrote:
--- gcc/testsuite/g++.dg/cpp26/literals1.C.jj   2023-08-25 17:23:06.662878355 
+0200
+++ gcc/testsuite/g++.dg/cpp26/literals1.C      2023-08-25 17:37:03.085132304 
+0200
@@ -0,0 +1,65 @@
+// C++26 P1854R4 - Making non-encodable string literals ill-formed
+// { dg-do compile { target c++11 } }
+// { dg-require-effective-target int32 }
+// { dg-options "-pedantic-errors -finput-charset=UTF-8 -fexec-charset=UTF-8" }
+
+int d = '😁';                                           // { dg-error "character too 
large for character literal type" }
...
+char16_t m = u'😁';                                     // { dg-error "character 
constant too long for its type" }

Why are these different diagnostics?  Why doesn't the first line already hit
the existing diagnostic that the second gets?

Both could be clearer that the problem is that the single source character
can't be encoded as a single execution character.

The first diagnostics is the newly added in the patch which takes precedence
over the existing diagnostics (and wouldn't actually trigger without the
patch).  Sure, I could make that new diagnostics more specific, but all
I generally know is that (str2.len / nbwc) c-chars are encodable in str.len
execution character set code units.
So, would you like 2 different messages, one for str2.len / nbwb == 1
"single character not encodable in a single execution character code unit"

Sounds good, but let's drop "single".

and otherwise
"%d characters need %d execution character code units"
or
"at least one character not encodable in a single execution character code unit"
or something different?

The latter sounds good.  Maybe adding "in multicharacter literal"?

Everything else (i.e. u8 case in narrow_str_to_charconst and L, u and U
cases in wide_str_to_charconst) is already covered by existing diagnostics
which has the "character constant too long for its type"
wording and covers for both C and C++ both the cases where there are more
than one c-chars in the literal (allowed in the L case for < C++23) and
when one c-char encodes in more than one code units (but this time
it isn't execution character set, but UTF-8 character set for u8,
wide execution character set for L, UTF-16 character set for u and
UTF-32 for U).
Plus the same "character constant too long for its type" diagnostics
is emitted if normal narrow literal has several c-chars encodable all as
single execution character code units, but more than can fit into int.

So, do you want to change just the new diagnostics (and what is your
preferred wording), or use the old diagnostics wording also for the
new one, or do you want to change the preexisting diagnostics as well
and e.g. differentiate there between the single c-char cases which need
more than one code unit and different wording for more than one c-char?
Note, if we differentiate between those, we'd need to count how many
c-chars we have even for the u8, L, u and U cases if we see more than
one code unit, similarly how the patch does that (and also the follow-up
patch tweaks).

Under the existing diagnostic I'd like to distinguish the different cases more, e.g.

"multicharacter literal with %d characters exceeds 'int' size of %d bytes"
"multicharacter literal cannot have an encoding prefix"

Jason

Reply via email to