Hi,
this is an updated version of my STRING_CSTs checking in varasm.c patch. It tries to answer the questions that were raised in th GO string literals thread. The answers are: a) strings with TYPE_SIZE_UNIT == NULL do exist, but only for STRING_CSTs in constructors of flexible array struct members. Not for string literals. b) In all cases where the STRING_CSTs do have a TYPE_SIZE_UNIT, the DECL_SIZE_UNIT has the same value. c) When STRING_CSTs do not have a TYPE_SIZE_UNIT, that is in the constructor of a flexible array member. In that case the TREE_STRING_LENGTH determines the flexible array size. It changes varasm's get_constant_size to return the TYPE_SIZE_UNIT of a STRING_CST literal as it is, without increasing the storage size to TREE_STRING_LENGTH. I added an assertion to make sure that all STRING_CSTs have a type size; size == 0 can happen for empty Ada strings. Furthermore it adds code to compare_constant to also compare the STRING_CSTs TYPE_SIZE_UNIT if available. So I want to remove that from get_constant_size in order to not change the memory layout of GO and Ada strings, while still having them look mostly like C string literals. Furthermore I added one more consistency check to check_string_literal, that makes sure that all STRING_CSTs that do have a TYPE_SIZE_UNIT, the size matches the DECL_SIZE_UNIT. Those newly discovered properties of string literals make the code in c_strlen and friends a lot simpler. Bootstrapped and reg-tested on x86_64-pc-linux-gnu. Is it OK for trunk? Thanks Bernd. On 08/17/18 15:53, Bernd Edlinger wrote: > On 08/17/18 15:38, Richard Biener wrote: >> On Fri, 17 Aug 2018, Bernd Edlinger wrote: >> >>> On 08/17/18 14:19, Richard Biener wrote: >>>> On Fri, 17 Aug 2018, Bernd Edlinger wrote: >>>> >>>>> Richard Biener wrote: >>>>>> +embedded @code{NUL} characters. However, the >>>>>> +@code{TREE_STRING_LENGTH} always includes a trailing @code{NUL} that >>>>>> +is not part of the language string literal but appended by the front >>>>>> end. >>>>>> +If the string shall not be @code{NUL}-terminated the @code{TREE_TYPE} >>>>>> +is one character shorter than @code{TREE_STRING_LENGTH}. >>>>>> +Excess caracters other than one trailing @code{NUL} character are not >>>> >>>> characters btw. >>>> >>> >>> thanks, updated. >>> >>>> I read the above that the string literal for >>>> >>>> char x[2] = "1"; >>>> >>>> is actually "1\0\0" - there's one NUL that is not part of the language >>>> string literal. The second sentence then suggests that both \0 >>>> are removed because 2 is less than 3? >>>> >>> >>> maybe 2 is a bad example, lets consider: >>> char x[2000000000] = "1"; >>> >>> That is a string_cst with STRING_LENGTH = 2, content = "2\0\0" >>> the array_type is used on both x, and the string_cst, >>> I was assuming that both tree objects refer to the same type object. >>> which is char[0..2000000000-1] >> >> Oh, didn't realize we use char[200000000] for the STRING_CST. Makes >> my suggestion to use char[] instead not (very) much worse than the >> existing practice then. >> >>> varasm assembles the bytes that are given by STRING_LENGTH >>> and appends zeros as appropriate. >>> >>>> As said, having this extra semantics of a STRING_CST tied to >>>> another tree node (its TREE_TYPE) looks ugly. >>>> >>>>>> +permitted. >>>>>> >>>>>> I find this very confusing and oppose to that change. Can we get >>>>>> back to the drawing board please? If we want an easy way to >>>>>> see whether a string is "properly" terminated then maybe we can >>>>>> simply use a flag that gets set by build_string? >>>>>> >>>>> >>>>> What I mean with that is the case like >>>>> char x[2] = "123456"; >>>>> >>>>> which is build_string(7, "123456"), but with a type char[2], >>>>> so varasm throws away "3456\0". >>>> >>>> I think varasm throws away chars not because of the type of >>>> the STRING_CST but because of the available storage in x. >>>> >>> >>> But at other places we look at the type of the string_cst, don't we? >>> Shouldn't those be the same? >> >> I think most (all?) places look at TREE_TYPE (TREE_TYPE (string)) >> only. I'm not aware of users of the array domain of the array type >> of a string - but I'm far from knowing GCC inside-out ;) >> > > Yes, I know, that happens to me as well on the first day after my holidays ;) > >>>>> I want to say that this is not okay, the excess precision >>>>> should only be used to strip the nul termination, in cases >>>>> where it is intended to be a assembled as a not zero terminated >>>>> string. But maybe the wording could be improved? >>>> >>>> ISTR we always assemble a NUL in .strings to get string merging >>>> working. >>>> >>> >>> String merging is not working when the string is not explicitly >>> NUL terminated, my followup patch here tries to fix that: >>> >>> [PATCH] Handle not explicitly zero terminated strings in merge sections >>> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00481.html >> >> I'd have expected sth as simple as >> >> if (merge_strings && str[thissize - 1] != '\0') >> thissize++; >> >> being appended in output_constant. >> > > Yes, but that can only be done in the .merge.str section, > otherwise it would happen in structure initializers as well. > And I would like to undo the case when Ada programs do > > Process ("ABCD" & Ascii.NUL); > > but not for embedded NUL in the string constant. > like: > > Process ("ABCD" & Acii.NUL & "EFGH"); > > > Bernd.
2018-08-22 Bernd Edlinger <bernd.edlin...@hotmail.de> * doc/generic.texi (STRING_CST): Update. * varasm.c (compare_constant): Compare type size of STRING_CSTs. (get_constant_size): Don't make STRING_CSTs larget than they are. (check_string_literal): New check function for STRING_CSTs. (output_constant): Use it. diff -Npur gcc/doc/generic.texi gcc/doc/generic.texi --- gcc/doc/generic.texi 2018-08-16 17:28:11.000000000 +0200 +++ gcc/doc/generic.texi 2018-08-22 10:17:29.852681466 +0200 @@ -1162,9 +1162,13 @@ These nodes represent string-constants. returns the length of the string, as an @code{int}. The @code{TREE_STRING_POINTER} is a @code{char*} containing the string itself. The string may not be @code{NUL}-terminated, and it may contain -embedded @code{NUL} characters. Therefore, the -@code{TREE_STRING_LENGTH} includes the trailing @code{NUL} if it is -present. +embedded @code{NUL} characters. However, the +@code{TREE_STRING_LENGTH} always includes a trailing @code{NUL} that +is not part of the language string literal but appended by the front end. +If the string shall not be @code{NUL}-terminated the @code{TREE_TYPE} +is one character shorter than @code{TREE_STRING_LENGTH}. +Excess characters other than one trailing @code{NUL} character are not +permitted. For wide string constants, the @code{TREE_STRING_LENGTH} is the number of bytes in the string, and the @code{TREE_STRING_POINTER} @@ -1173,6 +1177,11 @@ target system (that is, as integers in t non-wide string constants are distinguished only by the @code{TREE_TYPE} of the @code{STRING_CST}. +String constants may also be used for other purpose like assember +constraints or attributes. These have no @code{TREE_TYPE} and +need no explicit @code{NUL}-termination. Note there is always +another @code{NUL}-byte beyond @code{TREE_STRING_LENGTH}. + FIXME: The formats of string constants are not well-defined when the target system bytes are not the same width as host system bytes. diff -Npur gcc/varasm.c gcc/varasm.c --- gcc/varasm.c 2018-08-16 17:28:11.000000000 +0200 +++ gcc/varasm.c 2018-08-22 11:42:13.259002521 +0200 @@ -3146,7 +3146,9 @@ compare_constant (const tree t1, const t return FIXED_VALUES_IDENTICAL (TREE_FIXED_CST (t1), TREE_FIXED_CST (t2)); case STRING_CST: - if (TYPE_MODE (TREE_TYPE (t1)) != TYPE_MODE (TREE_TYPE (t2))) + if (TYPE_MODE (TREE_TYPE (t1)) != TYPE_MODE (TREE_TYPE (t2)) + || int_size_in_bytes (TREE_TYPE (t1)) + != int_size_in_bytes (TREE_TYPE (t2))) return 0; return (TREE_STRING_LENGTH (t1) == TREE_STRING_LENGTH (t2) @@ -3303,8 +3305,7 @@ get_constant_size (tree exp) HOST_WIDE_INT size; size = int_size_in_bytes (TREE_TYPE (exp)); - if (TREE_CODE (exp) == STRING_CST) - size = MAX (TREE_STRING_LENGTH (exp), size); + gcc_checking_assert (size >= 0); return size; } @@ -4774,6 +4775,32 @@ initializer_constant_valid_for_bitfield_ return false; } +/* Check if a STRING_CST fits into the field. + Tolerate only the case when the NUL termination + does not fit into the field. */ + +static bool +check_string_literal (tree string, unsigned HOST_WIDE_INT size) +{ + tree eltype = TREE_TYPE (TREE_TYPE (string)); + unsigned HOST_WIDE_INT elts = tree_to_uhwi (TYPE_SIZE_UNIT (eltype)); + const char *p = TREE_STRING_POINTER (string); + int len = TREE_STRING_LENGTH (string); + HOST_WIDE_INT mems = int_size_in_bytes (TREE_TYPE (string)); + + if (elts != 1 && elts != 2 && elts != 4) + return false; + if (len <= 0 || len % elts != 0) + return false; + if (size < (unsigned)len && size != len - elts) + return false; + if (memcmp (p + len - elts, "\0\0\0\0", elts) != 0) + return false; + if (mems >= 0 && mems != (HOST_WIDE_INT)size) + return false; + return true; +} + /* output_constructor outer state of relevance in recursive calls, typically for nested aggregate bitfields. */ @@ -4942,6 +4969,7 @@ output_constant (tree exp, unsigned HOST case STRING_CST: thissize = MIN ((unsigned HOST_WIDE_INT)TREE_STRING_LENGTH (exp), size); + gcc_checking_assert (check_string_literal (exp, size)); assemble_string (TREE_STRING_POINTER (exp), thissize); break; case VECTOR_CST: