Re: Bug in clang?

Ingo Schwarze Wed, 20 Aug 2025 07:34:36 -0700

Hello Walter,

Walter Alejandro Iglesias wrote on Wed, Aug 20, 2025 at 09:18:52AM +0200:
> On Tue, Aug 19, 2025 at 05:39:13PM +0200, Ingo Schwarze wrote:
>> Walter Alejandro Iglesias wrote on Mon, Aug 18, 2025 at 06:40:04PM +0200:


>>> #define period      0x2e
>>> #define question    0x3f
>>> #define exclam      0x21
>>> #define ellipsis    L'\u2026'
>>> const wchar_t p[] = { period, question, exclam, ellipsis };

>> In addition to what otto@ said, this is bad style for more than one
>> reason.
>> 
>> First of all, the data type of the constant "0x2e" is "int",
>> see for example C11 6.4.4.1 (Integer constants).  Casting "int"
>> to "wchar_t" doesn't really make sense.  On OpenBSD, it only
>> works because UTF-8 is the only supported character encoding *and*
>> wchar_t stores Unicode codepoints.  But neither of these choices
>> is portable.  What you want is (C11 6.4.4.4 Character constants):
>> 
>>   #define period     L'.'
>>   #define question   L'?'
>>   #define exclam     L'!'

> As I made this change to my code (https://en.roquesor.com/fmtroff.html)
> the following reminded me why, at some point, I decided to switch to
> hexadecimal notation.
> 
>   #define backslash   L'\\'
>   #define apostrophe  L'\''
> 
> It isn't very confusing there, but among the arguments of a function or
> a conditional...

Making code look nice is nice to have and can even make code more
readable and hence reduce the likelihood of bugs.  But even if you
are coding with narrow strings for ASCII only, whether

  char mychar = 0x5c;
  char mychar = 92;
  char mychar = 0134;

is more readable than 

  char mychar = '\\';

is debatable; at least i would find reading the latter easier than
any of the former, even in a conditional or function call argument.
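
To make the readability point concrete, here is a minimal sketch of a
conditional that tests for a backslash with the character constant.
The function name count_backslashes is made up for this example and is
not taken from fmtroff.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the backslashes in a narrow string.
 * Hypothetical helper, for illustration only.
 */
size_t
count_backslashes(const char *s)
{
	size_t n = 0;

	assert(s != NULL);
	for (; *s != '\0'; s++) {
		/* The constant says what is meant; 0x5c would need a comment. */
		if (*s == '\\')
			n++;
	}
	return n;
}
```

Called as count_backslashes("a\\b"), the loop sees exactly one
backslash, because the compiler turns the doubled backslash in the
string literal into a single character.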

For narrow characters, the portability argument is weak; writing
code that is portable to EBCDIC machines is the kind of excessive
portability that provokes bugs rather than preventing them.  But still,
i'd recommend against specifying narrow characters numerically.
Even mandoc_char(7) says:

  NUMBERED CHARACTERS
     For backward compatibility with existing manuals, mandoc(1)
     also supports the
           \N'number' and \[charnumber]
     escape sequences, inserting the character number from the
     current character set into the output.  Of course, this is
     inherently non-portable and is already marked as deprecated
     in the Heirloom roff manual; on top of that, the second form
     is a GNU extension.  For example, do not use \N'34' or
     \[char34], use \(dq, or even the plain `"' character where
     possible.

A similar recommendation makes sense for C code.

What *is* portable is specifying wide characters by Unicode
codepoint numbers, for example:

  wchar_t mywide = L'\u2026';  /* horizontal ellipsis */

But note that the C standard (C11 6.4.3 Universal character names,
paragraph 2) explicitly requires the argument to \u to be at least 00A0,
with only three exceptions:

  L'\u0024' == L'$'
  L'\u0040' == L'@'
  L'\u0060' == L'`'

Being so specific is a weird quirk of the standard, but it means
you had better not abuse \u to obfuscate ASCII codepoints -
apart from being very ugly, it may not even work.  For example,
current base clang dies like this:

  error: character 'A' cannot be specified by a universal character name
    13 |         wchar_t mywide = L'\u0041';
  1 error generated.
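
Putting the pieces together, the following is a sketch of how the
definitions from the start of this thread might look when ASCII
characters are spelled as wide character constants and \u is used only
where the standard allows it.  The helper ends_sentence is hypothetical,
and the static_assert lines assume a C11 compiler.

```c
#include <assert.h>
#include <wchar.h>

/* Printable ASCII: plain wide character constants (C11 6.4.4.4). */
#define period		L'.'
#define question	L'?'
#define exclam		L'!'
/* Non-ASCII: a universal character name is fine here. */
#define ellipsis	L'\u2026'

/* The three universal character names below 00A0 that C11 allows. */
static_assert(L'\u0024' == L'$', "dollar sign");
static_assert(L'\u0040' == L'@', "commercial at");
static_assert(L'\u0060' == L'`', "grave accent");

/* Hypothetical helper: does wc end a sentence? */
int
ends_sentence(wchar_t wc)
{
	return wc == period || wc == question || wc == exclam ||
	    wc == ellipsis;
}
```

Nothing in this sketch depends on the numeric value that the
implementation assigns to any of the characters.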

So there is no real alternative to L'\\'.  While L'\x5c' and L'\134'
work for UTF-8 (and hence on OpenBSD), even that is not guaranteed
to be portable, and what those two produce may depend both on the
implementation and on the locale.
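
If code really has to know whether numeric wide character values can be
trusted, the standard macro __STDC_ISO_10646__ is one thing to look at:
an implementation defines it to state that wchar_t values are ISO/IEC
10646 codepoints.  The sketch below is only an illustration; the
function names wchar_is_unicode and find_backslash are invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Report whether the implementation states that wchar_t values
 * are ISO/IEC 10646 (Unicode) codepoints.  Without that statement,
 * the standard does not promise that 0x5c means the backslash.
 */
int
wchar_is_unicode(void)
{
#ifdef __STDC_ISO_10646__
	return 1;
#else
	return 0;
#endif
}

/* Hypothetical helper: first backslash in a wide string, or NULL. */
const wchar_t *
find_backslash(const wchar_t *s)
{
	assert(s != NULL);
	/* L'\\' is correct whatever wchar_is_unicode() reports. */
	return wcschr(s, L'\\');
}
```

The search itself never needs the macro, which is the point: with
L'\\' the question of the wide character encoding does not arise.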

Yours,
  Ingo
