Hi Ralph,

On 11/14/22 14:56, Ralph Corderoy wrote:
> Hi Alejandro,
>
> > > > C doesn't _really_ have strings, except at the library level.
> > > > It has character arrays and one grain of syntactic sugar for
> > > > encoding "string literals", which should not have been called
> > > > that because whether they get the null terminator is
> > > > context-dependent.
> > > >
> > > >     char a[5] = "fooba";
> > > >     char *b = "bazqux";
> > > >
> > > > I see some Internet sources claim that C is absolutely
> > > > reliable about null-terminating such literals, but I can't
> > > > agree.  The assignment to `b` above adds a null terminator,
> > > > and the one to `a` does not.  This is the opposite of absolute
> > > > reliability.  Since I foresee someone calling me a liar for
> > > > saying that, I'll grant that if you carry a long enough list
> > > > of exceptional cases for the syntax in your head, both are
> > > > predictable.  But it's simply a land mine for the everyday
> > > > programmer.
> > >
> > > - C defines both string literals and strings at the language
> > >   level, e.g. main()'s argv[] is defined to contain strings.
> >
> > I must disagree.  The string concept is very broad, and you can
> > define your own string, as for example:
> >
> >     struct str_s {
> >         size_t  len;
> >         u_char  *s;
> >     };
>
> The point under discussion was whether the language specification of
> C has strings or just character arrays, and whether string literals
> should have been called that because whether they have a terminating
> NUL is ‘context-dependent’.  To contradict what I've written, you're
> widening the discussion to arbitrary data structures which can be
> used to implement a string.  That is not relevant.
I just made that point to make sure that when we talk about strings we
talk about a concrete type of string, which, as we agree, is any number
of non-NUL characters followed by a NUL.
> > However, assuming that the concept of string is a NUL-terminated
> > char array, there's little in the core language about it.
>
> But little is not nothing, and so the C language does have both
> strings, as the specification states that is what is sitting in
> main()'s argv[], and string literals.
I can't argue argv[] is not part of the language, since it's certainly
documented in the standard.  However, it's more of a side effect of the
interfaces provided by the kernel (mainly exec(3), to which it's
impossible to pass a sequence of chars with a NUL that's not terminating
the string, since it will just reinterpret it as the end of the string).
If argv[] is the only valid array of strings in the language, I'd say we're in a bad position to say that the language has strings.
> > Sure, string literals are the only true strings in the language
>
> Your ‘Sure’ implies you're agreeing with someone.  If so, it's not
> me.  You're wrong on this point.
I think I was kind-of agreeing with Branden, but I don't remember what I was thinking. Let's say it was a thinko of mine. Something not uncommon.
> > You can prove that string literals are really strings (i.e.,
> > NUL-terminated char arrays) by applying sizeof to them, and then
> > looping over their contents to see that there's exactly one NUL
> > byte, at the last position.
>
> Your definitions are wrong.  Proving "foo\0bar" ends with a NUL does
> not make it a C string, because a NUL-terminated char array is not a
> C string if it contains a NUL before that.  A C string is zero or
> more non-NUL chars followed by a NUL.
Yes, I like your definition better.
> > > - In C, "foo" is a string literal.  That is the correct name, as
> > >   it is not a C string, because a string literal may contain
> > >   explicit NUL bytes within it, which a string may not:
> > >   "foo\0bar".
> >
> > I wouldn't discard them as string literals only for that.
Sorry, I meant s/string literals/strings/.
> I'm not discarding them as anything.  I am pointing out that,
> according to the language definition, "foo\0bar" is a string literal
> but not a C string because of the embedded NUL; thus the distinction
> is necessary and terms are needed for each.
>
> > Writing by accident a NUL byte is not usual, anyway.
>
> I didn't claim it was.  I was arguing why ‘they should not have been
> called string literal’ is wrong, and that whether they get a NUL
> terminator is not ‘context dependent’.
So, we could argue that string literals, most of the time, are strings,
conforming to the common idea of any number of non-NUL characters
followed by a NUL.
> > > - A character array may be initialised by a string literal.
> > >   Successive elements of the array are set to the string
> > >   literal's characters, including the implicit NUL if there is
> > >   room.
> > >
> > >     char two[2] = "foo";      // 'f' 'o'
> > >     char three[3] = "foo";    // 'f' 'o' 'o'
> > >     char four[4] = "foo";     // 'f' 'o' 'o' '\0'
> > >     char five[5] = "foo";     // 'f' 'o' 'o' '\0' '\0'
> > >     char implicit[] = "foo";  // 'f' 'o' 'o' '\0'
> >
> > Ahh, my friend, you're too used to some dialect of C that allows
> > this, I believe.  ISO C11 doesn't, and I'm guessing any older ISO
> > C versions behave in the same way:
> >
> >     $ cat str.c
> >     char two[2] = "foo";      // 'f' 'o'
> >     char three[3] = "foo";    // 'f' 'o' 'o'
> >     char four[4] = "foo";     // 'f' 'o' 'o' '\0'
> >     char five[5] = "foo";     // 'f' 'o' 'o' '\0' '\0'
> >     char implicit[] = "foo";  // 'f' 'o' 'o' '\0'
> >     $ cc str.c -Wpedantic -pedantic-errors
> >     str.c:1:23: error: initializer-string for array of ‘char’ is too long
> >         1 | char two[2] = "foo";      // 'f' 'o'
> >           |               ^~~~~
>
> You are showing compiler output and claiming its error proves the
> standard.
I actually did ask the compiler to warn about violations of the standard, and only about them. See:
- The default is '-std=gnu17'.  It uses GNU extensions, but I'll show
  why this doesn't matter too much, with quotes from the gcc(1) manual
  page:
  [ The -ansi option does not cause non-ISO programs to be rejected
    gratuitously.  For that, -Wpedantic is required in addition to
    -ansi. ]

  [ The compiler can accept several base standards, such as c90 or
    c++98, and GNU dialects of those standards, such as gnu90 or
    gnu++98.  When a base standard is specified, the compiler accepts
    all programs following that standard plus those using GNU
    extensions that do not contradict it.  For example, -std=c90 turns
    off certain features of GCC that are incompatible with ISO C90,
    such as the "asm" and "typeof" keywords, but not other GNU
    extensions that do not have a meaning in ISO C90, such as omitting
    the middle term of a "?:" expression.  On the other hand, when a
    GNU dialect of a standard is specified, all features supported by
    the compiler are enabled, even when those features change the
    meaning of the base standard.  As a result, some strict-conforming
    programs may be rejected.  The particular standard is used by
    -Wpedantic to identify which features are GNU extensions given
    that version of the standard.  For example -std=gnu90 -Wpedantic
    warns about C++ style // comments, while -std=gnu99 -Wpedantic
    does not. ]

  [ Where the standard specified with -std represents a GNU extended
    dialect of C, such as gnu90 or gnu99, there is a corresponding
    base standard, the version of ISO C on which the GNU extended
    dialect is based.  Warnings from -Wpedantic are given where they
    are required by the base standard.  (It does not make sense for
    such warnings to be given only for features not in the specified
    GNU C dialect, since by definition the GNU dialects of C include
    all features the compiler supports with the given option, and
    there would be nothing to warn about.) ]

  [ -pedantic-errors
        Give an error whenever the base standard (see -Wpedantic)
        requires a diagnostic, in some cases where there is undefined
        behavior at compile-time and in some other cases that do not
        prevent compilation of programs that are valid according to
        the standard.  This is not equivalent to -Werror=pedantic,
        since there are errors enabled by this option and not enabled
        by the latter and vice versa. ]
It would be handier to have a reference to the standard.
The standard is silent about it.  Maybe they didn't even consider this
important enough to standardize.  The relevant section is C17::6.7.9,
but I didn't find anything there.
However, everything not allowed by the standard is undefined behaviour,
so this is UB under ISO C, and therefore GCC is right to warn about it.
> Here's a compiler which has been told I want C11.
You told it you want C11.
>     $ gcc -std=c11 -c str.c
But you didn't tell it to warn about non-conforming code.  Moreover, you
asked it to warn about things that may or may not have anything to do
with ISO C11:
  [ -Wall
        This enables all the warnings about constructions that some
        users consider questionable, and that are easy to avoid (or
        modify to prevent the warning), even in conjunction with
        macros.  This also enables some language-specific warnings
        described in C++ Dialect Options and Objective-C and
        Objective-C++ Dialect Options. ]
>     str.c:1:19: warning: initializer-string for array of chars is too long
>      char two[2] = "foo"; // 'f' 'o'
>                    ^~~~~
>     $ objdump -sj .data str.o
>
>     str.o:     file format elf64-x86-64
>
>     Contents of section .data:
>      0000 666f666f 6f666f6f 00666f6f 0000666f  fofoofoo.foo..fo
>      0010 6f00                                 o.
>     $
>
> Note .data starts with two[]'s ‘fo’.
Undefined behaviour can result in many different things, including the
expected result.  Moreover, since this behaviour is probably an
extension by GCC (although I didn't care enough to check), it's probably
implementation-defined to behave that way.
Remember that -std=c11 doesn't disable extensions that don't conflict with the standard (i.e., ones that define what would otherwise be undefined behaviour).
Again, quotation needed:

  [ -std=
        Determine the language standard.  This option is currently
        only supported when compiling C or C++.  The compiler can
        accept several base standards, such as c90 or c++98, and GNU
        dialects of those standards, such as gnu90 or gnu++98.  When a
        base standard is specified, the compiler accepts all programs
        following that standard plus those using GNU extensions that
        do not contradict it.  For example, -std=c90 turns off certain
        features of GCC that are incompatible with ISO C90, such as
        the "asm" and "typeof" keywords, but not other GNU extensions
        that do not have a meaning in ISO C90, such as omitting the
        middle term of a "?:" expression.  On the other hand, when a
        GNU dialect of a standard is specified, all features supported
        by the compiler are enabled, even when those features change
        the meaning of the base standard.  As a result, some
        strict-conforming programs may be rejected.  The particular
        standard is used by -Wpedantic to identify which features are
        GNU extensions given that version of the standard.  For
        example -std=gnu90 -Wpedantic warns about C++ style //
        comments, while -std=gnu99 -Wpedantic does not. ]
> > - ISO C doesn't allow 'two'.
>
> Reference needed.
The absence of permission makes it UB, IIRC.  There's no possible
quotation for the absence of permission.  As for something not specified
by the standard being UB, I don't remember/can't find the paragraph
about it.  Feel free to correct me here, since I can't quote it.
> > - It does, however, allow 'five', and forces initialization to the
> >   same as objects that have static storage duration (i.e., 0).
> >   See C2x::6.7.10/22.
>
> Yes, I know that, showed it above, and this is nothing to do with
> initialising a char array but just generally what happens, e.g.
> ‘int a[42] = {3, 1, 4}’.
There's actually a paragraph in the standard that specifies it
specifically for char arrays.  But, like you, I think this was already
covered by the normal array rules.
> > - It does allow 'three', 'four', and 'implicit', per
> >   C2x::6.7.10/15 (I believe it's that paragraph).  I admit that
> >   the wording is not so clear as to reject 'two'; however, GCC
> >   seems to interpret it that way, in pedantic mode.
>
> We've moved from C11 to a future C, C2x.  Paragraph 6.7.10.15 in C2x
> is the same as 6.7.9.14 in C11.
Sorry, I had the C2x document more handy, since I had been discussing some features in it (or to be possibly included for C3x) these days. Now I've quoted C17.
>   An array of character type may be initialized by a character
>   string literal or UTF-8 string literal, optionally enclosed in
>   braces.  Successive bytes of the string literal (including the
>   terminating null character if there is room or if the array is of
>   unknown size) initialize the elements of the array.
>
> It describes the behaviour shown by str.c above: successive bytes
> initialise the array.  It is not rejected by the compiler.  More
> importantly, I can't see where it is rejected by the standard.
>
> > > - The string literal is reliably terminating by a NUL.
> >
> > Terminated, yes.  "terminating", hmmm, I'd say no.
>
> Sorry, that's a typo, I meant ‘terminated’.
No problem. We all do them :)
> > > - It is not context dependent whether a string literal has a
> > >   terminating NUL.
> >
> > Sure.
>
> Good.
>
> > And guns are just machines that make holes, context-independently.
> > However, they can kill, depending on the context.  Especially if
> > they have no safety, like Glocks, or string literals.
> >
> >     $ cat str.c
> >     #include <stdio.h>
> >
> >     int main(void)
> >     {
> >         printf("%zu\n", sizeof(1 ? "foo" : "bar"));
> >         printf("%zu\n", 1 ? sizeof("foo") : sizeof("bar"));
> >     }
> >     $ cc str.c -Wpedantic -pedantic-errors
> >     $ ./a.out
> >     8
> >     4
>
> Yes, I recall this from elsewhere in the thread, where I asked you
> to explain why switching to nitems() fixed the problem, because I
> couldn't see it given the code samples shown.
> https://lists.gnu.org/archive/html/groff/2022-11/msg00030.html
Sorry, I missed it.
> But it is nothing to do with the language C defining what a string
> is and having string literals as distinct things worthy of a
> separate name.
Hmm, makes sense.  It's rather an issue of the ternary operator in this
case.  My point was that C has dangerous features which, combined, can
be very dangerous.
Some trivial constructs can help you get the compiler on your side, like the sizeof division.
> > See for example some (part of a) change that I did for optimizing
> > some code, where I transformed pointers to char into char arrays
> > (following Ulrich Drepper's article about libraries).  The global
> > change of using arrays instead of pointers reduced the code size
> > by a couple of KiB, IIRC, which for cache misses might be an
> > important thing.
> >
> >     -static const char *log_levels[] = {
> >     +static const char log_levels[][8] = {
> >              "alert",
> >              "error",
> >              "warn",
> >              "notice",
> >              "info",
> >              "debug",
> >      };
> >
> > As a note, I used 8 for better alignment, but 7 would have been
> > fine.  Now, let's imagine that I append the following element to
> > the array: "messages"?  Values of beta will give rise to dom!
>
> That's because robust code has become fragile.  The original was
> better because it allowed the addition of a longer string.  The
> couple of KiB saved is probably irrelevant compared with the human
> time of dealing with any error which might arise.
Not really.  It's a bug in the compiler.  It's only the compiler that
decides which code is fragile or not.  Since the new code is undoubtedly
better in terms of performance, and is perfectly supported by the
compiler, I consider it a bug that the compiler doesn't make it as safe
as the worse version.
So, I'm working on improving the compiler, to make it as safe as the
worse construct.
> > Wouldn't it be nice to use -Wunterminated-strings and let the
> > compiler yell at me if I write a string literal with 8 letters?
>
> If the compiler doesn't do that, then I expect there is a linter
> that will, or a different compiler.
I don't know; maybe.  I didn't care enough to try.  Since the project
where I use this doesn't have any static analyzers embedded in the build
system, and I don't want to run one manually, I'll work on improving
gcc(1), which is the simplest option for me.
> But it sounds like some of the projects you work on could do with a
> project-specific linter which understands the conventions the code
> must follow.  That might not be too hard given the LLVM framework
> and all the tools it provides these days.
Yeah, maybe. Maybe clang-tidy(1) already warns about it. Didn't check, since it's not useful for me right now.
Having the warning in gcc(1) is valuable, so I'll add it.

Cheers,
Alex

-- 
<http://www.alejandro-colomar.es/>