On 08/06/2018 09:12 AM, Jeff Law wrote:
On 08/04/2018 03:56 PM, Martin Sebor wrote:
On 08/03/2018 01:00 AM, Jeff Law wrote:
On 07/24/2018 05:18 PM, Bernd Edlinger wrote:
On 07/24/18 23:46, Jeff Law wrote:
On 07/24/2018 01:59 AM, Bernd Edlinger wrote:
Hi!
This patch makes strlen range computations more conservative.
Firstly, if there is a visible type cast from type A to B before
passing the value to strlen, don't expect the type layout of B to
restrict the possible return value range of strlen.
Why do you think this is the right thing to do? I.e., is there
language in the standards that makes you think the code as it
stands today is incorrect from a conformance standpoint? Is there
a significant body of code that is affected in an adverse way by
the current code? If so, what code?
I think if you have an object of effective type A, say char[100],
then you can cast the address of A to B, say typedef char (*B)[2]
for instance, and then to const char *, say for use in strlen.
I may be wrong, but I think that we should at least try not to
pick up char[2] from B, but instead use A for strlen ranges, or
leave this range open. Currently the range info for strlen is
[0..1] in this case, even if we see the type cast in the generic
tree.
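Concretely, a minimal sketch of what I mean (array contents and
names made up for illustration):

#include <string.h>

char A[100] = "hello world";   /* effective type of the object: char[100] */
typedef char (*B)[2];

size_t len_via_cast (void)
{
  B b = (B) &A;                        /* visible cast to char(*)[2] */
  return strlen ((const char *) b);    /* returns 11, outside [0..1] */
}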
ISTM that you're essentially saying that the cast to const char *
destroys any type information we can exploit here. But if that's the
case, then I don't think we can even derive a range of [0,99]. What's
to say that "A" didn't result from a similar cast of some object that
was char[200] that happened out of the scope of what we could see during
the strlen range computation?
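For example (a made-up caller the range computation never sees):

/* Hypothetical caller, outside the scope of the strlen range
   computation: the char[100] object is itself just a view into
   a char[200] object.  */
void use_100 (char (*)[100]);

char big[200];

void caller (void)
{
  use_100 ((char (*)[100]) &big);   /* callee only ever sees char[100] */
}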
If that is what you're arguing, then I think there's a re-evaluation
that needs to happen WRT strlen range computation.
And just to be clear, I do see this as a significant correctness
question.
Martin, thoughts?
The argument is that given:

#include <assert.h>
#include <string.h>

struct S { char a[4], b; };
char a[8] = "1234567";

this is valid and should pass:

__attribute__ ((noipa))
void f (struct S *p)
{
  assert (7 == strlen (p->a));
}

int main (void)
{
  f ((struct S*)a);
}
(This is the basic premise behind pr86259.)
This argument is wrong and the code is invalid. For the access
to p->a to be valid, p must point to an object of type struct S
(it doesn't) and the p->a array must hold a nul-terminated string
(it also doesn't).
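For contrast, a sketch of what a valid argument would have to
look like (my example):

struct S { char a[4], b; };

struct S s = { "123", 'x' };   /* s.a holds "123", nul-terminated in char[4] */

/* strlen (s.a) is 3, consistent with the [0, 3] range implied by
   the char[4] subobject.  */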
I agree with you for C/C++, but I think it's been shown elsewhere
in this thread that GIMPLE semantics do not respect the subobject
boundaries. That's a sad reality.
[ ... ]
I care less about the optimization than I do about the basic
premise that it's essential to respect subobject boundaries(*).
I understand, but the semantics of GIMPLE do not respect them. We can
argue about whether or not those should change and what it would take to
fix that. But right now the existing semantics do not respect those
boundaries.
They don't respect them in all cases (i.e., when MEM_REF loses
information about the structure of an access) but in a good
number of them GCC can still derive useful information from
the access. It's relied on to great effect by _FORTIFY_SOURCE.
I think it would be a welcome enhancement if, besides out-of-
bounds writes, _FORTIFY_SOURCE also prevented out-of-bounds
reads.
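To illustrate (the struct here is my example): with
-D_FORTIFY_SOURCE=2 -O2, glibc routes strcpy through
__builtin___strcpy_chk, and the bound comes from
__builtin_object_size with type 1, i.e., the enclosing subobject
rather than the whole object:

#include <string.h>

struct S { char a[4], b; };

void g (struct S *p)
{
  /* __builtin_object_size (p->a, 1) is 4: the char[4] member, not
     the whole struct.  The fortified __strcpy_chk therefore aborts
     this 5-byte copy at run time instead of letting it overflow
     into p->b.  */
  strcpy (p->a, "1234");
}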
It would make little sense to undo the strlen optimization
without also undoing the optimization for the direct array
access case. Undoing either would raise the question about
the validity of the _FORTIFY_SOURCE=2 behavior. That would
be a huge step backwards in terms of code security. If we
did some of these but not others, it would make the behavior
inconsistent and surprising, all to accommodate one instance
of invalid code.
In the direct array access case I think (and I'm sure Jakub, Richi and
others will correct me if I'm wrong), we can use the object's type
because the dereferences are actually using the array's type.
Subscripting and pointer access are identical both in C/C++
and in GCC's _FORTIFY_SOURCE. The absence of a distinction
between the two is essential for preventing writes past
the end by string functions like strcpy (_FORTIFY_SOURCE).
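To make that concrete (my example): C defines p->a[5] to mean
*(p->a + 5), so the two forms below are one and the same
out-of-bounds access:

struct S { char a[4], b; };

char get (struct S *p)
{
  char c1 = p->a[5];        /* subscript form */
  char c2 = *(p->a + 5);    /* pointer form: by definition the same access */
  return c1 + c2;
}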
If we had a valid test case where the strlen optimization
leads to invalid code, or even if there were a significant
number of bug reports showing that it breaks an invalid
but common idiom, I would certainly feel compelled to
make it right somehow. But there has been just one bug
report with clearly invalid code that should be easily
corrected.
Again, I think you're too narrowly focused on C/C++ semantics here.
What matters are the semantics in GIMPLE.
I don't get that. GCC is a C/C++ compiler (among other
languages), but not a GIMPLE compiler. The only reason this
came up at all is a bug report with an invalid C test case that
reads past the end. The only reason in my mind to consider
relaxing an assumption/restriction would be a valid test case
(in any supported language) that the optimization invalidates.
But as I said, far more essential than the optimization is
the ability to detect these invalid accesses (both reads and
writes), such as in:
#include <stdio.h>
#include <string.h>

struct S { char a[4], b[2], c; };

void f (struct S *p)
{
  strcpy (p->a, "1234");       // certain buffer overflow
  sprintf (p->b, "%s", p->a);  // potential buffer overflow
  // ...but, to avoid false positives:
  sprintf (p->a, "%s", p->b);  // no buffer overflow here
}
You've recently commented elsewhere that you wish GCC would
be more aggressive in detecting and preventing undefined
behavior by inserting traps. I agree, but I don't see how we
can aim for both looser and stricter UB detection at the same
time.
In any event, it seems clear to me that I've lost the argument.
If you undo the optimization then please retain the functionality
for the warnings. Otherwise you might as well remove those too.
Martin