Hi Ralph,

On 11/14/22 14:56, Ralph Corderoy wrote:
> Hi Alejandro,
>
> > > > C doesn't _really_ have strings, except at the library level.
> > > > It has character arrays and one grain of syntactic sugar for
> > > > encoding "string literals", which should not have been called
> > > > that because whether they get the null terminator is
> > > > context-dependent.
> > > >
> > > >     char a[5] = "fooba";
> > > >     char *b = "bazqux";
> > > >
> > > > I see some Internet sources claim that C is absolutely
> > > > reliable about null-terminating such literals, but I can't
> > > > agree.  The assignment to `b` above adds a null terminator,
> > > > and the one to `a` does not.  This is the opposite of absolute
> > > > reliability.  Since I foresee someone calling me a liar for
> > > > saying that, I'll grant that if you carry a long enough list
> > > > of exceptional cases for the syntax in your head, both are
> > > > predictable.  But it's simply a land mine for the everyday
> > > > programmer.
> > >
> > > - C defines both string literals and strings at the language
> > >   level, e.g. main()'s argv[] is defined to contain strings.
> >
> > I must disagree.  The string concept is very broad, and you can
> > define your own string, as for example:
> >
> >     struct str_s {
> >         size_t  len;
> >         u_char  *s;
> >     };
>
> The point under discussion was whether the language specification of
> C has strings or just character arrays, and whether string literals
> should have been called that because whether they have a terminating
> NUL is ‘context-dependent’.  To contradict what I've written, you're
> widening the discussion to arbitrary data structures which can be
> used to implement a string.  That is not relevant.
I just made that point to make sure that when we talk about strings we
talk about a concrete type of string, which, as we agree, is any number
of non-NUL characters followed by a NUL.
> > However, assuming that the concept of string is a NUL-terminated
> > char array, there's little in the core language about it.
>
> But little is not nothing, and so the C language does have both
> strings, as the specification states that is what is sitting in
> main()'s argv[], and string literals.
I can't argue argv[] is not part of the language, since it's certainly
documented in the standard.  However, it's more of a side effect of the
interfaces provided by the kernel (mainly exec(3), to which it's
impossible to pass a sequence of chars with a NUL that's not terminating
the string, since it will just reinterpret it as the end of the string).
If argv[] is the only valid array of strings in the language, I'd say we're in a bad position to say that the language has strings.
> > Sure, string literals are the only true strings in the language
>
> Your ‘Sure’ implies you're agreeing with someone.  If so, it's not
> me.  You're wrong on this point.
I think I was kind-of agreeing with Branden, but I don't remember what I was thinking. Let's say it was a thinko of mine. Something not uncommon.
> > You can prove that string literals are really strings (i.e.,
> > NUL-terminated char arrays) by applying sizeof to them, and then
> > looping over their contents to see that there's exactly one NUL
> > byte, at the last position.
>
> Your definitions are wrong.  Proving "foo\0bar" ends with a NUL does
> not make it a C string, because a NUL-terminated char array is not a
> C string if it contains a NUL before that.  A C string is zero or
> more non-NUL chars followed by a NUL.
Yes, I like your definition better.
> > > - In C, "foo" is a string literal.  That is the correct name, as
> > >   it is not a C string, because a string literal may contain
> > >   explicit NUL bytes within it, which a string may not:
> > >   "foo\0bar".
> >
> > I wouldn't discard them as string literals only for that.
Sorry, I meant s/string literals/strings/.
> I'm not discarding them as anything.  I am pointing out that,
> according to the language definition, "foo\0bar" is a string literal
> but not a C string because of the embedded NUL; thus the distinction
> is necessary and terms are needed for each.
>
> > Writing by accident a NUL byte is not usual, anyway.
>
> I didn't claim it was.  I was arguing why ‘they should not have been
> called string literal’ is wrong, and that whether they get a NUL
> terminator is not ‘context dependent’.
So, we could argue that string literals, most of the time, are strings,
conforming to the common idea of any number of non-NUL characters
followed by a NUL.
> > > - A character array may be initialised by a string literal.
> > >   Successive elements of the array are set to the string
> > >   literal's characters, including the implicit NUL if there is
> > >   room.
> > >
> > >     char two[2] = "foo";      // 'f' 'o'
> > >     char three[3] = "foo";    // 'f' 'o' 'o'
> > >     char four[4] = "foo";     // 'f' 'o' 'o' '\0'
> > >     char five[5] = "foo";     // 'f' 'o' 'o' '\0' '\0'
> > >     char implicit[] = "foo";  // 'f' 'o' 'o' '\0'
> >
> > Ahh, my friend, you're too used to some dialect of C that allows
> > this, I believe.  ISO C11 doesn't, and I'm guessing any older ISO
> > C versions behave in the same way:
> >
> >     $ cat str.c
> >     char two[2] = "foo";      // 'f' 'o'
> >     char three[3] = "foo";    // 'f' 'o' 'o'
> >     char four[4] = "foo";     // 'f' 'o' 'o' '\0'
> >     char five[5] = "foo";     // 'f' 'o' 'o' '\0' '\0'
> >     char implicit[] = "foo";  // 'f' 'o' 'o' '\0'
> >     $ cc str.c -Wpedantic -pedantic-errors
> >     str.c:1:23: error: initializer-string for array of ‘char’ is too long
> >         1 | char two[2] = "foo";      // 'f' 'o'
> >           |               ^~~~~
>
> You are showing compiler output and claiming its error proves the
> standard.
I actually did ask the compiler to warn about violations of the standard, and only about them. See:
- The default is '-std=gnu17'.  It uses GNU extensions, but I'll show
  why this doesn't matter too much, with quotes from the gcc(1) manual
  page:
  [ The -ansi option does not cause non-ISO programs to be rejected
    gratuitously.  For that, -Wpedantic is required in addition to
    -ansi. ]

  [ The compiler can accept several base standards, such as c90 or
    c++98, and GNU dialects of those standards, such as gnu90 or
    gnu++98.  When a base standard is specified, the compiler accepts
    all programs following that standard plus those using GNU
    extensions that do not contradict it.  For example, -std=c90 turns
    off certain features of GCC that are incompatible with ISO C90,
    such as the "asm" and "typeof" keywords, but not other GNU
    extensions that do not have a meaning in ISO C90, such as omitting
    the middle term of a "?:" expression.  On the other hand, when a
    GNU dialect of a standard is specified, all features supported by
    the compiler are enabled, even when those features change the
    meaning of the base standard.  As a result, some strict-conforming
    programs may be rejected.  The particular standard is used by
    -Wpedantic to identify which features are GNU extensions given
    that version of the standard.  For example -std=gnu90 -Wpedantic
    warns about C++ style // comments, while -std=gnu99 -Wpedantic
    does not. ]

  [ Where the standard specified with -std represents a GNU extended
    dialect of C, such as gnu90 or gnu99, there is a corresponding
    base standard, the version of ISO C on which the GNU extended
    dialect is based.  Warnings from -Wpedantic are given where they
    are required by the base standard.  (It does not make sense for
    such warnings to be given only for features not in the specified
    GNU C dialect, since by definition the GNU dialects of C include
    all features the compiler supports with the given option, and
    there would be nothing to warn about.) ]

  [ -pedantic-errors
        Give an error whenever the base standard (see -Wpedantic)
        requires a diagnostic, in some cases where there is undefined
        behavior at compile-time and in some other cases that do not
        prevent compilation of programs that are valid according to
        the standard.  This is not equivalent to -Werror=pedantic,
        since there are errors enabled by this option and not enabled
        by the latter and vice versa. ]
It would be handier to have a reference to the standard.
The standard is silent about it.  Maybe they didn't even consider this
important enough to standardize.  The relevant section is C17::6.7.9,
but I didn't find anything there.
However, everything not allowed by the standard is undefined behaviour,
so this is UB under ISO C, and therefore GCC is right to warn about it.
> Here's a compiler which has been told I want C11.
You told it you want C11.
>     $ gcc -std=c11 -c str.c
But you didn't tell it to warn about non-conforming code.  Moreover, you
asked it to warn about things that may or may not have anything to do
with ISO C11:
  [ -Wall
        This enables all the warnings about constructions that some
        users consider questionable, and that are easy to avoid (or
        modify to prevent the warning), even in conjunction with
        macros.  This also enables some language-specific warnings
        described in C++ Dialect Options and Objective-C and
        Objective-C++ Dialect Options. ]
>     str.c:1:19: warning: initializer-string for array of chars is too long
>      char two[2] = "foo"; // 'f' 'o'
>                    ^~~~~
>     $ objdump -sj .data str.o
>
>     str.o:     file format elf64-x86-64
>
>     Contents of section .data:
>      0000 666f666f 6f666f6f 00666f6f 0000666f  fofoofoo.foo..fo
>      0010 6f00                                 o.
>     $
>
> Note .data starts with two[]'s ‘fo’.
Undefined behaviour can result in many different things, including the
expected result.  Moreover, since this behaviour is probably an
extension by GCC (although I didn't care enough to check), it's probably
implementation-defined to behave that way.
Remember that -std=c11 doesn't disable extensions that don't conflict with the standard (i.e., ones that define what would otherwise be undefined behaviour).
Again, quotation needed:

  [ -std=
        Determine the language standard.  This option is currently
        only supported when compiling C or C++.  The compiler can
        accept several base standards, such as c90 or c++98, and GNU
        dialects of those standards, such as gnu90 or gnu++98.  When a
        base standard is specified, the compiler accepts all programs
        following that standard plus those using GNU extensions that
        do not contradict it.  For example, -std=c90 turns off certain
        features of GCC that are incompatible with ISO C90, such as
        the "asm" and "typeof" keywords, but not other GNU extensions
        that do not have a meaning in ISO C90, such as omitting the
        middle term of a "?:" expression.  On the other hand, when a
        GNU dialect of a standard is specified, all features supported
        by the compiler are enabled, even when those features change
        the meaning of the base standard.  As a result, some
        strict-conforming programs may be rejected.  The particular
        standard is used by -Wpedantic to identify which features are
        GNU extensions given that version of the standard.  For
        example -std=gnu90 -Wpedantic warns about C++ style //
        comments, while -std=gnu99 -Wpedantic does not. ]
> > - ISO C doesn't allow 'two'.
>
> Reference needed.
The absence of permission makes it UB, IIRC.  There's no possible
quotation for the absence of permission.  As for something not specified
by the standard being UB, I don't remember/can't find the paragraph
about it.  Feel free to correct me here, since I can't quote it.
> > - It does, however, allow 'five', and forces initialization to the
> >   same as objects that have static storage duration (i.e., 0).
> >   See C2x::6.7.10/22.
>
> Yes, I know that, showed it above, and this is nothing to do with
> initialising a char array but just generally what happens, e.g.
> ‘int a[42] = {3, 1, 4}’.
There's actually a paragraph in the standard that specifies it
specifically for char arrays.  But, like you, I think this was already
covered by the normal array rules.
> > - It does allow 'three', 'four', and 'implicit', per
> >   C2x::6.7.10/15 (I believe it's that paragraph).  I admit that
> >   the wording is not so clear as to reject 'two'; however, GCC
> >   seems to interpret it that way, in pedantic mode.
>
> We've moved from C11 to a future C, C2x.  Paragraph 6.7.10.15 in C2x
> is the same as 6.7.9.14 in C11.
Sorry, I had the C2x document more handy, since I had been discussing some features in it (or to be possibly included for C3x) these days. Now I've quoted C17.
>   An array of character type may be initialized by a character
>   string literal or UTF-8 string literal, optionally enclosed in
>   braces.  Successive bytes of the string literal (including the
>   terminating null character if there is room or if the array is of
>   unknown size) initialize the elements of the array.
>
> It describes the behaviour shown by str.c above: successive bytes
> initialise the array.  It is not rejected by the compiler.  More
> importantly, I can't see where it is rejected by the standard.
>
> > > - The string literal is reliably terminating by a NUL.
> >
> > Terminated, yes.  "terminating", hmmm, I'd say no.
>
> Sorry, that's a typo, I meant ‘terminated’.
No problem. We all do them :)
> > > - It is not context dependent whether a string literal has a
> > >   terminating NUL.
> >
> > Sure.
>
> Good.
>
> > And guns are just machines that make holes, context-independently.
> > However, they can kill, depending on the context.  Especially if
> > they have no safety, like Glocks, or string literals.
> >
> >     $ cat str.c
> >     #include <stdio.h>
> >
> >     int main(void)
> >     {
> >         printf("%zu\n", sizeof(1 ? "foo" : "bar"));
> >         printf("%zu\n", 1 ? sizeof("foo") : sizeof("bar"));
> >     }
> >     $ cc str.c -Wpedantic -pedantic-errors
> >     $ ./a.out
> >     8
> >     4
>
> Yes, I recall this from elsewhere in the thread, where I asked you
> to explain why switching to nitems() fixed the problem, because I
> couldn't see it given the code samples shown.
> https://lists.gnu.org/archive/html/groff/2022-11/msg00030.html
Sorry, I missed it.
> But it is nothing to do with the language C defining what a string
> is and having string literals as distinct things worthy of a
> separate name.
Hmm, makes sense.  It's rather an issue of the ternary operator in this
case.  My point was that C has dangerous features which, combined, can
be very dangerous.
Some trivial constructs can help you get the compiler on your side, like the sizeof division.
> > See for example some (part of a) change that I did for optimizing
> > some code, where I transformed pointers to char into char arrays
> > (following Ulrich Drepper's article about libraries).  The global
> > change of using arrays instead of pointers reduced the code size
> > by a couple of KiB, IIRC, which for cache misses might be an
> > important thing.
> >
> >     -static const char *log_levels[] = {
> >     +static const char log_levels[][8] = {
> >              "alert",
> >              "error",
> >              "warn",
> >              "notice",
> >              "info",
> >              "debug",
> >      };
> >
> > As a note, I used 8 for better alignment, but 7 would have been
> > fine.  Now, let's imagine that I append the following element to
> > the array: "messages"?  Values of beta will give rise to dom!
>
> That's because robust code has become fragile.  The original was
> better because it allowed the addition of a longer string.  The
> couple of KiB saved is probably irrelevant compared with the human
> time of dealing with any error which might arise.
Not really.  It's a bug in the compiler.  It's only the compiler that
decides which code is fragile or not.  Since the new code is undoubtedly
better in terms of performance, and is perfectly supported by the
compiler, I consider it a bug that the compiler doesn't make it as safe
as the worse version.
So, I'm working on improving the compiler, to make it as safe as the
worse construct.
> > Wouldn't it be nice to use -Wunterminated-strings and let the
> > compiler yell at me if I write a string literal with 8 letters?
>
> If the compiler doesn't do that, then I expect there is a linter
> that will, or a different compiler.
I don't know; maybe.  I didn't care enough to try.  Since the project
where I use this doesn't have any static analyzers embedded in the build
system, and I don't want to run one manually, I'll work on improving
gcc(1), which is the simplest option for me.
> But it sounds like some of the projects you work on could do with a
> project-specific linter which understands the conventions the code
> must follow.  That might not be too hard given the LLVM framework
> and all the tools it provides these days.
Yeah, maybe. Maybe clang-tidy(1) already warns about it. Didn't check, since it's not useful for me right now.
Having the warning in gcc(1) is valuable, so I'll add it.

Cheers,
Alex

-- 
<http://www.alejandro-colomar.es/>