On 08/06/2018 10:32 AM, Martin Sebor wrote:
> On 08/06/2018 09:12 AM, Jeff Law wrote:
>> On 08/04/2018 03:56 PM, Martin Sebor wrote:
>>> On 08/03/2018 01:00 AM, Jeff Law wrote:
>>>> On 07/24/2018 05:18 PM, Bernd Edlinger wrote:
>>>>> On 07/24/18 23:46, Jeff Law wrote:
>>>>>> On 07/24/2018 01:59 AM, Bernd Edlinger wrote:
>>>>>>> Hi!
>>>>>>>
>>>>>>> This patch makes strlen range computations more conservative.
>>>>>>>
>>>>>>> First, if there is a visible type cast from type A to B before
>>>>>>> passing the value to strlen, don't expect the type layout of B
>>>>>>> to restrict the possible return value range of strlen.
>>>>>> Why do you think this is the right thing to do?  ie, is there
>>>>>> language
>>>>>> in the standards that makes you think the code as it stands today is
>>>>>> incorrect from a conformance standpoint?  Is there a significant
>>>>>> body of
>>>>>> code that is affected in an adverse way by the current code?  If so,
>>>>>> what code?
>>>>>>
>>>>>>
>>>>>
>>>>> I think if you have an object of an effective type A, say
>>>>> char[100], then you can cast the address of A to B, say
>>>>> typedef char (*B)[2] for instance, and then to const char *,
>>>>> say for use in strlen.  I may be wrong, but I think we should
>>>>> at least try not to pick up char[2] from B, and instead use A
>>>>> for strlen ranges, or leave the range open.  Currently the
>>>>> range info for strlen is [0..1] in this case, even if we see
>>>>> the type cast in the GENERIC tree.
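>>>>>
>>>>> Something like this minimal sketch of the scenario (names are
>>>>> made up; at run time strlen returns 11 here, so a [0..1] range
>>>>> derived from B's layout would be wrong):
>>>>>
>>>>>   #include <string.h>
>>>>>
>>>>>   typedef char (*B)[2];
>>>>>
>>>>>   size_t g (void)
>>>>>   {
>>>>>     static char a[100] = "hello world"; /* effective type char[100] */
>>>>>     B p = (B) a;                        /* visible cast to char(*)[2] */
>>>>>     return strlen ((const char *) *p);  /* reads well past char[2] */
>>>>>   }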
>>>> ISTM that you're essentially saying that the cast to const char *
>>>> destroys any type information we can exploit here.  But if that's the
>>>> case, then I don't think we can even derive a range of [0,99].  What's
>>>> to say that "A" didn't result from a similar cast of some object that
>>>> was char[200] that happened out of the scope of what we could see
>>>> during
>>>> the strlen range computation?
>>>>
>>>> If that is what you're arguing, then I think there's a re-evaluation
>>>> that needs to happen WRT strlen range computation.
>>>>
>>>> And just to be clear, I do see this as a significant correctness
>>>> question.
>>>>
>>>> Martin, thoughts?
>>>
>>> The argument is that given:
>>>
>>>   #include <assert.h>
>>>   #include <string.h>
>>>
>>>   struct S { char a[4], b; };
>>>
>>>   char a[8] = "1234567";
>>>
>>> this is valid and should pass:
>>>
>>>   __attribute__ ((noipa))
>>>   void f (struct S *p)
>>>   {
>>>     assert (7 == strlen (p->a));
>>>   }
>>>
>>>   int main (void)
>>>   {
>>>     f ((struct S*)a);
>>>   }
>>>
>>> (This is the basic premise behind pr86259.)
>>>
>>> This argument is wrong and the code is invalid.  For the access
>>> to p->a to be valid p must point to an object of struct S (it
>>> doesn't) and the p->a array must hold a nul-terminated string
>>> (it also doesn't).
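>>>
>>> In fact, with a valid argument the assertion could never hold:
>>> if p pointed to an actual struct S whose p->a held a nul-
>>> terminated string, strlen (p->a) would be at most 3.  A sketch
>>> (s is made up for illustration):
>>>
>>>   struct S s = { "123", 0 };  /* s.a nul-terminated, strlen == 3 */
>>>   f (&s);                     /* valid pointer, but the assert fails */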
>> I agree with you for C/C++, but I think it's been shown elsewhere in
>> this thread that GIMPLE semantics do not respect subobject
>> boundaries.  That's a sad reality.
>>
>> [ ... ]
>>
>>>
>>> I care less about the optimization than I do about the basic
>>> premise that it's essential to respect subobject boundaries(*).
>> I understand, but the semantics of GIMPLE do not respect them.  We can
>> argue about whether or not those should change and what it would take to
>> fix that. But right now the existing semantics do not respect those
>> boundaries.
> 
> They don't respect them in all cases (i.e., when MEM_REF loses
> information about the structure of an access) but in a good
> number of them GCC can still derive useful information from
> the access.  This is relied on to great effect by _FORTIFY_SOURCE.
> I think it would be a welcome enhancement if, besides out-of-
> bounds writes, _FORTIFY_SOURCE also prevented out-of-bounds
> reads.
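> 
> The subobject information in question is what
> __builtin_object_size reports in mode 1, which is what
> _FORTIFY_SOURCE=2 builds on.  A sketch (the function name is
> made up):
> 
>   #include <stddef.h>
> 
>   struct S { char a[4], b; };
> 
>   size_t subobj_size (struct S *p)
>   {
>     /* Mode 1 bounds the access by the closest surrounding
>        subobject rather than by the whole object.  */
>     return __builtin_object_size (p->a, 1);   /* 4 */
>   }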
> 
>>> It would make little sense to undo the strlen optimization
>>> without also undoing the optimization for the direct array
>>> access case.  Undoing either would raise the question about
>>> the validity of the _FORTIFY_SOURCE=2 behavior.  That would
>>> be a huge step backwards in terms of code security.  If we
>>> did some of these but not others, it would make the behavior
>>> inconsistent and surprising, all to accommodate one instance
>>> of invalid code.
>> In the direct array access case I think (and I'm sure Jakub, Richi and
>> others will correct me if I'm wrong), we can use the object's type
>> because the dereferences are actually using the array's type.
> 
> Subscripting and pointer access are identical both in C/C++
> and in GCC's _FORTIFY_SOURCE.  The absence of a distinction
> between the two is essential for preventing writes past
> the end by string functions like strcpy (_FORTIFY_SOURCE).
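> 
> That is, for any index i these two forms denote the very same
> access, per the language definition of subscripting:
> 
>   p->a[i]        /* subscript form */
>   *(p->a + i)    /* equivalent pointer form */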
> 
>>> If we had a valid test case where the strlen optimization
>>> leads to invalid code, or even if there were a significant
>>> number of bug reports showing that it breaks an invalid
>>> but common idiom, I would certainly feel compelled to
>>> make it right somehow.  But there has been just one bug
>>> report with clearly invalid code that should be easily
>>> corrected.
>> Again, I think you're too narrowly focused on C/C++ semantics here.
>> What matters are the semantics in GIMPLE.
> 
> I don't get that.  GCC is a C/C++ compiler (among other
> languages), but not a GIMPLE compiler.  The only reason this
> came up at all is a bug report with an invalid C test case that
> reads past the end.  The only reason in my mind to consider
> relaxing an assumption/restriction would be a valid test case
> (in any supported language) that the optimization invalidates.
> 
> But as I said, far more essential than the optimization is
> the ability to detect these invalid accesses (both reads and
> writes), such as in:
> 
>   #include <stdio.h>
>   #include <string.h>
>
>   struct S { char a[4], b[2], c; };
> 
>   void f (struct S *p)
>   {
>     strcpy (p->a, "1234");        // certain buffer overflow
> 
>     sprintf (p->b, "%s", p->a);   // potential buffer overflow
> 
>     // ...but, to avoid false positives:
>     sprintf (p->a, "%s", p->b);   // no buffer overflow here
>   }
> 
> You've recently commented elsewhere that you wish GCC would
> be more aggressive in detecting and preventing undefined behavior
> by inserting traps.  I agree but I don't see how we can aim
> for both looser and stricter UB detection at the same time.
Certainly we can't insert a trap unless we are absolutely certain the
code in question is undefined behavior.  In fact, if a language comes
along that says division by zero is well defined and should (for
example) saturate to INT_MAX, then trapping after the division by zero
is clearly going to be wrong and we'd have to conditionalize the trap
insertion code.

In cases where we are certain an array dereference is out of bounds and
undefined, I'd fully support using __builtin_trap to halt the
program immediately after the array dereference.
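
For instance, for a dereference like the one below, the inserted
trap would amount to something like this (a sketch; the array and
index are made up):

  int a[4];

  int g (void)
  {
    int x = a[7];        /* provably out of bounds and undefined */
    __builtin_trap ();   /* halts the program right after the access */
    return x;
  }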

But in the cases we're arguing about we don't have that level of
certainty because we don't have good type information on the strlen
call.  It's lost and not really recoverable and thus we have to be
conservative in those cases.

Jeff
