Hi,

this is an updated version of my STRING_CSTs checking in varasm.c
patch.

It tries to answer the questions that were raised in th GO string literals
thread.

The answers are:
a) strings with TYPE_SIZE_UNIT == NULL do exist, but only for STRING_CSTs
in constructors of flexible array struct members.  Not for string literals.
b) In all cases where the STRING_CSTs do have a TYPE_SIZE_UNIT, the
DECL_SIZE_UNIT has the same value.
c) When STRING_CSTs do not have a TYPE_SIZE_UNIT, that is in the constructor
of a flexible array member.  In that case the TREE_STRING_LENGTH
determines the flexible array size.


It changes varasm's get_constant_size to return the TYPE_SIZE_UNIT of
a STRING_CST literal as it is, without increasing the storage size
to TREE_STRING_LENGTH.  I added an assertion to make sure that
all STRING_CSTs have a type size; size == 0 can happen for empty Ada
strings.

Furthermore it adds code to compare_constant to also compare the
STRING_CSTs TYPE_SIZE_UNIT if available.

So I want to remove that from get_constant_size in order to not change
the memory layout of GO and Ada strings, while still having them look
mostly like C string literals.

Furthermore I added one more consistency check to check_string_literal,
that makes sure that all STRING_CSTs that do have a TYPE_SIZE_UNIT,
the size matches the DECL_SIZE_UNIT.

Those newly discovered properties of string literals make the code in
c_strlen and friends a lot simpler.



Bootstrapped and reg-tested on x86_64-pc-linux-gnu.
Is it OK for trunk?


Thanks
Bernd.




On 08/17/18 15:53, Bernd Edlinger wrote:
> On 08/17/18 15:38, Richard Biener wrote:
>> On Fri, 17 Aug 2018, Bernd Edlinger wrote:
>>
>>> On 08/17/18 14:19, Richard Biener wrote:
>>>> On Fri, 17 Aug 2018, Bernd Edlinger wrote:
>>>>
>>>>> Richard Biener wrote:
>>>>>> +embedded @code{NUL} characters.  However, the
>>>>>> +@code{TREE_STRING_LENGTH} always includes a trailing @code{NUL} that
>>>>>> +is not part of the language string literal but appended by the front 
>>>>>> end.
>>>>>> +If the string shall not be @code{NUL}-terminated the @code{TREE_TYPE}
>>>>>> +is one character shorter than @code{TREE_STRING_LENGTH}.
>>>>>> +Excess caracters other than one trailing @code{NUL} character are not
>>>>
>>>> characters btw.
>>>>
>>>
>>> thanks, updated.
>>>
>>>> I read the above that the string literal for
>>>>
>>>> char x[2] = "1";
>>>>
>>>> is actually "1\0\0" - there's one NUL that is not part of the language
>>>> string literal.  The second sentence then suggests that both \0
>>>> are removed because 2 is less than 3?
>>>>
>>>
>>> maybe 2 is a bad example, lets consider:
>>> char x[2000000000] = "1";
>>>
>>> That is a string_cst with STRING_LENGTH = 2, content = "2\0\0"
>>> the array_type is used on both x, and the string_cst,
>>> I was assuming that both tree objects refer to the same type object.
>>> which is char[0..2000000000-1]
>>
>> Oh, didn't realize we use char[200000000] for the STRING_CST.  Makes
>> my suggestion to use char[] instead not (very) much worse than the
>> existing practice then.
>>
>>> varasm assembles the bytes that are given by STRING_LENGTH
>>> and appends zeros as appropriate.
>>>
>>>> As said, having this extra semantics of a STRING_CST tied to
>>>> another tree node (its TREE_TYPE) looks ugly.
>>>>
>>>>>> +permitted.
>>>>>>
>>>>>> I find this very confusing and oppose to that change.  Can we get
>>>>>> back to the drawing board please?  If we want an easy way to
>>>>>> see whether a string is "properly" terminated then maybe we can
>>>>>> simply use a flag that gets set by build_string?
>>>>>>
>>>>>
>>>>> What I mean with that is the case like
>>>>> char x[2] = "123456";
>>>>>
>>>>> which is build_string(7, "123456"), but with a type char[2],
>>>>> so varasm throws away "3456\0".
>>>>
>>>> I think varasm throws away chars not because of the type of
>>>> the STRING_CST but because of the available storage in x.
>>>>
>>>
>>> But at other places we look at the type of the string_cst, don't we?
>>> Shouldn't those be the same?
>>
>> I think most (all?) places look at TREE_TYPE (TREE_TYPE (string))
>> only.  I'm not aware of users of the array domain of the array type
>> of a string - but I'm far from knowing GCC inside-out ;)
>>
> 
> Yes, I know, that happens to me as well on the first day after my holidays ;)
> 
>>>>> I want to say that this is not okay, the excess precision
>>>>> should only be used to strip the nul termination, in cases
>>>>> where it is intended to be a assembled as a not zero terminated
>>>>> string.  But maybe the wording could be improved?
>>>>
>>>> ISTR we always assemble a NUL in .strings to get string merging
>>>> working.
>>>>
>>>
>>> String merging is not working when the string is not explicitly
>>> NUL terminated, my followup patch here tries to fix that:
>>>
>>> [PATCH] Handle not explicitly zero terminated strings in merge sections
>>> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00481.html
>>
>> I'd have expected sth as simple as
>>
>>    if (merge_strings && str[thissize - 1] != '\0')
>>      thissize++;
>>
>> being appended in output_constant.
>>
> 
> Yes, but that can only be done in the .merge.str section,
> otherwise it would happen in structure initializers as well.
> And I would like to undo the case when Ada programs do
> 
> Process ("ABCD" & Ascii.NUL);
> 
> but not for embedded NUL in the string constant.
> like:
> 
> Process ("ABCD" & Acii.NUL & "EFGH");
> 
> 
> Bernd.
2018-08-22  Bernd Edlinger  <bernd.edlin...@hotmail.de>

	* doc/generic.texi (STRING_CST): Update.
	* varasm.c (compare_constant): Compare type size of STRING_CSTs.
	(get_constant_size): Don't make STRING_CSTs larget than they are.
	(check_string_literal): New check function for STRING_CSTs.
	(output_constant): Use it.

diff -Npur gcc/doc/generic.texi gcc/doc/generic.texi
--- gcc/doc/generic.texi	2018-08-16 17:28:11.000000000 +0200
+++ gcc/doc/generic.texi	2018-08-22 10:17:29.852681466 +0200
@@ -1162,9 +1162,13 @@ These nodes represent string-constants.
 returns the length of the string, as an @code{int}.  The
 @code{TREE_STRING_POINTER} is a @code{char*} containing the string
 itself.  The string may not be @code{NUL}-terminated, and it may contain
-embedded @code{NUL} characters.  Therefore, the
-@code{TREE_STRING_LENGTH} includes the trailing @code{NUL} if it is
-present.
+embedded @code{NUL} characters.  However, the
+@code{TREE_STRING_LENGTH} always includes a trailing @code{NUL} that
+is not part of the language string literal but appended by the front end.
+If the string shall not be @code{NUL}-terminated the @code{TREE_TYPE}
+is one character shorter than @code{TREE_STRING_LENGTH}.
+Excess characters other than one trailing @code{NUL} character are not
+permitted.
 
 For wide string constants, the @code{TREE_STRING_LENGTH} is the number
 of bytes in the string, and the @code{TREE_STRING_POINTER}
@@ -1173,6 +1177,11 @@ target system (that is, as integers in t
 non-wide string constants are distinguished only by the @code{TREE_TYPE}
 of the @code{STRING_CST}.
 
+String constants may also be used for other purpose like assember
+constraints or attributes.  These have no @code{TREE_TYPE} and
+need no explicit @code{NUL}-termination.  Note there is always
+another @code{NUL}-byte beyond @code{TREE_STRING_LENGTH}.
+
 FIXME: The formats of string constants are not well-defined when the
 target system bytes are not the same width as host system bytes.
 
diff -Npur gcc/varasm.c gcc/varasm.c
--- gcc/varasm.c	2018-08-16 17:28:11.000000000 +0200
+++ gcc/varasm.c	2018-08-22 11:42:13.259002521 +0200
@@ -3146,7 +3146,9 @@ compare_constant (const tree t1, const t
       return FIXED_VALUES_IDENTICAL (TREE_FIXED_CST (t1), TREE_FIXED_CST (t2));
 
     case STRING_CST:
-      if (TYPE_MODE (TREE_TYPE (t1)) != TYPE_MODE (TREE_TYPE (t2)))
+      if (TYPE_MODE (TREE_TYPE (t1)) != TYPE_MODE (TREE_TYPE (t2))
+	  || int_size_in_bytes (TREE_TYPE (t1))
+	     != int_size_in_bytes (TREE_TYPE (t2)))
 	return 0;
 
       return (TREE_STRING_LENGTH (t1) == TREE_STRING_LENGTH (t2)
@@ -3303,8 +3305,7 @@ get_constant_size (tree exp)
   HOST_WIDE_INT size;
 
   size = int_size_in_bytes (TREE_TYPE (exp));
-  if (TREE_CODE (exp) == STRING_CST)
-    size = MAX (TREE_STRING_LENGTH (exp), size);
+  gcc_checking_assert (size >= 0);
   return size;
 }
 
@@ -4774,6 +4775,32 @@ initializer_constant_valid_for_bitfield_
   return false;
 }
 
+/* Check if a STRING_CST fits into the field.
+   Tolerate only the case when the NUL termination
+   does not fit into the field.   */
+
+static bool
+check_string_literal (tree string, unsigned HOST_WIDE_INT size)
+{
+  tree eltype = TREE_TYPE (TREE_TYPE (string));
+  unsigned HOST_WIDE_INT elts = tree_to_uhwi (TYPE_SIZE_UNIT (eltype));
+  const char *p = TREE_STRING_POINTER (string);
+  int len = TREE_STRING_LENGTH (string);
+  HOST_WIDE_INT mems = int_size_in_bytes (TREE_TYPE (string));
+
+  if (elts != 1 && elts != 2 && elts != 4)
+    return false;
+  if (len <= 0 || len % elts != 0)
+    return false;
+  if (size < (unsigned)len && size != len - elts)
+    return false;
+  if (memcmp (p + len - elts, "\0\0\0\0", elts) != 0)
+    return false;
+  if (mems >= 0 && mems != (HOST_WIDE_INT)size)
+    return false;
+  return true;
+}
+
 /* output_constructor outer state of relevance in recursive calls, typically
    for nested aggregate bitfields.  */
 
@@ -4942,6 +4969,7 @@ output_constant (tree exp, unsigned HOST
 	case STRING_CST:
 	  thissize
 	    = MIN ((unsigned HOST_WIDE_INT)TREE_STRING_LENGTH (exp), size);
+	  gcc_checking_assert (check_string_literal (exp, size));
 	  assemble_string (TREE_STRING_POINTER (exp), thissize);
 	  break;
 	case VECTOR_CST:

Reply via email to