On Mon, 20 Aug 2018, Bernd Edlinger wrote: > On 08/20/18 15:19, Richard Biener wrote: > > On Mon, 20 Aug 2018, Bernd Edlinger wrote: > > > >> On 08/20/18 13:01, Richard Biener wrote: > >>> On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlin...@hotmail.de> > >>> wrote: > >>>> > >>>> > >>>> > >>>> On 08/01/18 11:29, Richard Biener wrote: > >>>>> > >>>>> Hmm. I think it would be nice if TREE_STRING_LENGTH would > >>>>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient > >>>>> for your check above. Because the '\0' doesn't belong to the > >>>>> string. Then build_string internally appends a '\0' outside > >>>>> of TREE_STRING_LENGTH. > >>>>> > >>>> > >>>> Hmm. Yes, but the outside-0 byte is just one byte, not a wide > >>>> character. > >>> > >>> That could be fixed though (a wide 0 is just N 0s). Add a elsz = 1 > >>> parameter to build_string and allocate as many extra 0s as needed. > >>> > >>> There are STRING_CSTs which are not string literals, > >>>> for instance attribute tags, Pragmas, asm constrants, etc. > >>>> They use the '\0' outside, and have probably no TREE_TYPE. > >>>> > >>>>> > >>>>>> So I would like to be able to assume that the STRING_CST objects > >>>>>> are internally always generated properly by the front end. > >>>>> > >>>>> Yeah, I guess we need to define what "properly" is ;) > >>>>> > >>>> Yes. > >>>> > >>>>>> And that the ARRAY_TYPE of the string literal either has the > >>>>>> same length than the TREE_STRING_LENGTH or if it is shorter, > >>>>>> this is always exactly one (wide) character size less than > >>>>>> TREE_STRING_LENGTH > >>>>> > >>>>> I think it should be always the same... > >>>>> > >>>> > >>>> One could not differentiate between "\0" without zero-termination > >>>> and "" with zero-termination, theoretically. > >>> > >>> Is that important? Doesn't the C standard say how to parse string > >>> literals? > >>> > >>>> We also have char x[100] = "ab"; > >>>> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100. > >>>> Of course one could create it with a TREE_STRING_LENGTH = 100, > >>>> but imagine char x[100000000000] = "ab" > >>> > >>> The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I > >>> hope matches "ab" and not 'x'. If it matches 'x' then I'd rather have it > >>> unconditionally be [], thus incomplete (because the literals "size" > >>> depends > >>> on the context/LHS it is used on). > >>> > >> > >> Sorry, but I must say, it is not at all like that. > >> > >> If I compile x.c: > >> const char x[100] = "ab"; > >> > >> and set a breakpoint at output_constant: > >> > >> Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256, > >> reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821 > >> 4821 if (size == 0 || flag_syntax_only) > >> (gdb) p size > >> $1 = 100 > >> (gdb) call debug(exp) > >> "ab" > >> (gdb) p *exp > >> $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1, > >> (gdb) p exp->typed.type->type_common.size_unit > >> $5 = (tree) 0x7ffff6ff9d80 > >> (gdb) call debug(exp->typed.type->type_common.size_unit) > >> 100 > >> (gdb) p exp->string.length > >> $6 = 3 > >> (gdb) p exp->string.str[0] > >> $8 = 97 'a' > >> (gdb) p exp->string.str[1] > >> $9 = 98 'b' > >> (gdb) p exp->string.str[2] > >> $10 = 0 '\000' > >> (gdb) p exp->string.str[3] > >> $11 = 0 '\000' > >> > >> > >> This is an important property of string_cst objects, that is used in > >> c_strlen: > >> > >> It folds c_strlen(&x[4]) directly to 0, because every byte beyond > >> TREE_STRING_LENGTH > >> is guaranteed to be zero up to the type size. > >> > >> I would not have spent one thought on implementing an optimization like > >> that, > >> but that's how it is right now. > > > > Huh. So somebody interpreted STRING_CSTs similar to CONSTRUCTORs aka > > they have zero-padding up to its type size. I don't see this documented > > anywhere and it would suggest to "optimize" "ab\0\0\0\0" to "ab\0" > > with appropriate TYPE_SIZE. > > > > This is also a relatively new thing on trunk (coming out of the added > > mem_size parameter of string_constant). That this looks at the STRING_CST > > type like > > > > if (TREE_CODE (array) == STRING_CST) > > { > > *ptr_offset = fold_convert (sizetype, offset); > > if (mem_size) > > *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array)); > > return array; > > > > I'd call simply a bug. As said, interpretation of STRING_CSTs should > > depend on the context. And for > > > > This use of the TYPE_SIZE_UNIT was pre-existing to r263607 > before that (but not in the gcc-8-branch) we had this in c_strlen: > > HOST_WIDE_INT maxelts = strelts; > tree type = TREE_TYPE (src); > if (tree size = TYPE_SIZE_UNIT (type)) > if (tree_fits_shwi_p (size)) > { > maxelts = tree_to_uhwi (size); > maxelts = maxelts / eltsize - 1; > } > > which I moved to string_constant and return the result through memsize. > > It seems to be working somehow, and I'd bet removing that will immediately > break twenty or thirty of the strlenopt tests. > > Obviously the tree string objects allow way too much variations, > and it has to be restricted in some way or another. > > In the moment my approach is: look at what most string constants do > and add assertions to ensure that there are no exceptions. > > > > char a[4] = "abc"; > > char b[5] = "abc"; > > > > we should better be able to share STRING_CSTs [you can see LTO > > sharing the nodes if you do b[4] but not when b[5] I suppose]. > > > >> All I want to do, is make sure that all string constants have the same > >> look and feel > >> in the middle-end, and restrict the variations that are allowed by the > >> current > >> implementation. > > > > Sure, I understand that. But I'd like to simplify things and not add > > complications like looking at TYPE_SIZE vs. TREE_STRING_LENGTH to decide > > whether sth is 0-terminated. > > > > This is not only about 0-terminated. (*) > > It is also about when you get a STRING_CST with a TREE_STRING_LENGTH, > there are places in the middle-end that assume that the object contains > _all_ bytes up to TREE_STRING_LENGTH without looking at TYPE_SIZE_UINT. > Those I want to protect.
Well, but string_constant handles &STRING_CST just fine but in that context there's no "object" we look at. IMHO whenever string_constant ends up with a DECL, looking at ctor_for_folding and we end up with a STRING_CST that doesn't fit in the DECL we looked at we have a bug (non-truncated STRING_CST) and either should do the truncation or simply return NULL. So we should look at DECL_SIZE_UNIT and compare that with TREE_STRING_LENGTH. Otherwise you are relying on TYPE_SIZE_UNIT of the STRING_CST being "correct" (fitting the decl context). > Bernd. > > > *: I called the parameter memsize, and yes, I wrote: "If MEM_SIZE is zero, > the string is only returned when it is properly zero terminated.", > but maybe I should have written: > "If MEM_SIZE is zero, the string is only returned when the storage size > of the string object is at least TREE_STRING_LENGTH." > That's at least exactly what the check does. Well, as said above if (TREE_CODE (array) == STRING_CST) { *ptr_offset = fold_convert (sizetype, offset); if (mem_size) *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array)); return array; } simply assumes the "storage size" of a STRING_CST is determined by its TYPE_SIZE_UNIT. That may be true as you noted in the folloup, but clearly in the above case there's nothing to protect? And in the case we pulled the STRING_CST from some DECL_INITIAL it doesn't protect from overflow of the FIELD_DECL unless frontends never generate "wrong" STRING_CSTs? Btw, get_constant_size / mergeable_string_section suggsts that we may view STRING_CST as general target-encoded byte blob. That may be useful to compress CONSTRUCTORs memory use. It looks like mergeable_string_section wrongly would reject a char[] typed string because int_size_in_bytes returns -1 for incomplete types. I still believe using char[] for most STRING_CSTs generated by FEs would be a good thing for IL memory use. Shouldn't the mem_size initializations use sth like get_constant_size as well? Also if (mem_size) *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (init)); else if (compare_tree_int (array_size, length + 1) < 0) return NULL_TREE; the latter check doesn't seem to honor 'offset' and *mem_size is again looking at the STRING_CST, not at the FIELD_DECL or wherever we got the STRING_CST from. Richard. > > > Richard. > > > >> > >> Bernd. > >> > >> > >>>>>> The idea is to use this property of string literals where needed, > >>>>>> and check rigorously in varasm.c. > >>>>>> > >>>>>> Does that make sense? > >>>>> > >>>>> So if it is not the same then the excess character needs to be > >>>>> a (wide) NUL in your model? ISTR your varasm.c patch didn't verify > >>>>> that. > >>>>> > >>>> > >>>> I think it does. > >>>> > >>>> > >>>> Bernd. > >> > > > -- Richard Biener <rguent...@suse.de> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)