[PATCH] D136811: [-Wunsafe-buffer-usage] WIP: RFC: NFC: User documentation.

Aaron Ballman via Phabricator via cfe-commits Tue, 29 Nov 2022 05:54:30 -0800

aaron.ballman added inline comments.


================
Comment at: clang/docs/SafeBuffers.rst:92
+
+However, no automatic code modernizer for plain C is not provided,
+and the hardened C++ standard library cannot benefit C code, which limits
----------------



================
Comment at: clang/docs/SafeBuffers.rst:40-41
+  - Finally, in order to avoid bugs in newly converted code, the
+    Clang static analyzer provides a checker to find misconstructed
+    ``std::span`` objects.
+
----------------
NoQ wrote:
> aaron.ballman wrote:
> > xazax.hun wrote:
> > > jkorous wrote:
> > > > NoQ wrote:
> > > > > xazax.hun wrote:
> > > > > > Isn't this overly specific? What about `string_view`s? 
> > > > > Yeah I think we can cover string views as well!
> > > > This is based on our current plans for near future.
> > > > While we are positive we will use `std::span` in the FixIts it is very 
> > > > unlikely we'd use `string_view`. That's why having a checker for 
> > > > `std::span` is more important for us now.
> > > Does this mean you'd recommend using a `span<char>` over `string_view` in 
> > > all cases? I think that might surprise some users, so I wonder if it is 
> > > worth documenting. 
> > Oh, I was sort of assuming we'd be suggesting `string_view` over 
> > `span<char>` when appropriate. So agreed that we should clearly document 
> > this as it may surprise folks.
> > 
> > Out of curiosity, why the strong preference for `span`?
> `string_view` is a bit harder to recommend automatically because whether a 
> certain sequence of bytes constitutes a "string" is often a matter of opinion 
> / user's point of view. `string_view`'s methods such as `find` will not 
> necessarily make sense over original data, or may turn out to have 
> counter-intuitive behavior when users try to modify the code further.
> 
> Additionally, `string_view` doesn't offer write access, so at best we'll be 
> able to recommend it for `const` pointers.
> 
> So I think we should gather more data before considering it.
> string_view is a bit harder to recommend automatically because whether a 
> certain sequence of bytes constitutes a "string" is often a matter of opinion 
> / user's point of view.

But why does that matter for `string_view`? It's not a null-terminated 
interface, so it explicitly encodes a pointer/length interface, same as `span`.


================
Comment at: clang/docs/SafeBuffers.rst:128
+  ``+``, ``-``, ``+=``, ``-=``,
+      - unless that number is a compile time constant ``0``;
+      - subtraction between two pointers is also fine;
----------------
NoQ wrote:
> jkorous wrote:
> > aaron.ballman wrote:
> > > One case I think this is going to trip up is: `this + 1` (which is not 
> > > uncommon in practice: 
> > > https://sourcegraph.com/search?q=context:global+-file:.*test.*+file:.*%5C.cpp+this+%2B+1&patternType=standard).
> > >  How do users fix up that pointer arithmetic and do they ever get any 
> > > safety or security benefits from it? (Note, this doesn't apply to `+=`, 
> > > just `+`.)
> > That's a fair point and if we don't find a good solution we should ignore 
> > arithmetic with `this` and a constant.
> I'll leave this unaddressed for now. It looks like these constructs typically 
> *do* indicate that the remaining part of the object is a buffer. C++ doesn't 
> seem to provide a good "safe" way to represent such objects. So, yeah, we 
> need more data.
Fair enough -- it's worth documenting explicitly as an open question. FWIW, I 
agree that the construct is mostly used for accessing trailing objects as a 
buffer. The closest thing I've seen to a safe way to handle this is how LLVM 
does it with the `TrailingObjects` mixin.


================
Comment at: clang/docs/SafeBuffers.rst:132
+    attribute ``[[unsafe_buffer_usage]]``,
+      - unless the pointer is a compile time constant ``0`` or ``nullptr``;
+      - a number of C/C++ standard library buffer manipulation functions
----------------
NoQ wrote:
> aaron.ballman wrote:
> > NoQ wrote:
> > > aaron.ballman wrote:
> > > > `NULL` as well, I presume? (This matters for C where `NULL` often 
> > > > expands to `(void *)0`.)
> > > Hmm, I was thinking that `((void *)0)` is a compile-time constant `0`. 
> > > Like, I didn't say it's a *literal*. Is there a better term?
> > Are you okay with constant expressions? e.g.,
> > ```
> > consteval int foo() { return 0; }
> > 
> > void func(char *ptr) {
> >   char *offset_ptr = ptr + foo(); // Is this fine?
> > }
> > ```
> Yes sure, why not? If it ever becomes unsafe, we'll immediately warn.
Ah, C terminology is biting me here -- we use the term "constant" the same way 
C++ uses the term "literal", so when I read "constant 0" my brain interpreted 
that as literally just `0`.


================
Comment at: clang/docs/SafeBuffers.rst:213
+
+The attribute is NOT warranted when the function has runtime protection against
+overflows, assuming hardened libc++, assuming that containers constructed
----------------
NoQ wrote:
> aaron.ballman wrote:
> > jkorous wrote:
> > > NoQ wrote:
> > > > aaron.ballman wrote:
> > > > > jkorous wrote:
> > > > > > aaron.ballman wrote:
> > > > > > > One thing I think is worth pointing out about these docs is that 
> > > > > > > the first example effectively says "do add the attribute because 
> > > > > > > the size passed in to the function could be wrong" and the second 
> > > > > > > example says "don't add the attribute on the assumption that the 
> > > > > > > container has the correct size information". The advice feels a 
> > > > > > > bit conflicting to me because in one case we're assuming callers 
> > > > > > > pass in incorrect values and in the other case we're assuming 
> > > > > > > callers pass in correct values and the only distinction between 
> > > > > > > them is a "container was used".  But a significant portion of our 
> > > > > > > users don't use libc++ (they use libstdc++ or MS STL for example).
> > > > > > > 
> > > > > > > I think we should have more details on why the STL used at 
> > > > > > > runtime doesn't matter, or if it still really does matter, we may 
> > > > > > > want to reconsider the advice we give.
> > > > > > > 
> > > > > > > Also, we don't give similar advice for use of the pragmas. Should 
> > > > > > > we? (Maybe split the advice as to when to use markings in general 
> > > > > > > out into its own section and reference it from both the pragma 
> > > > > > > and attribute sections?)
> > > > > > > One thing I think is worth pointing out about these docs is that 
> > > > > > > the first example effectively says "do add the attribute because 
> > > > > > > the size passed in to the function could be wrong" and the second 
> > > > > > > example says "don't add the attribute on the assumption that the 
> > > > > > > container has the correct size information". The advice feels a 
> > > > > > > bit conflicting to me because in one case we're assuming callers 
> > > > > > > pass in incorrect values and in the other case we're assuming 
> > > > > > > callers pass in correct values and the only distinction between 
> > > > > > > them is a "container was used". But a significant portion of our 
> > > > > > > users don't use libc++ (they use libstdc++ or MS STL for example).
> > > > > > 
> > > > > > Our model depends on Standard Library used implementing the bounds 
> > > > > > checks.
> > > > > > With that the "container was used" distinction is absolutely 
> > > > > > crucial - that is what adds the bounds checks and already mitigates 
> > > > > > certain classes of bugs (even if the `span` passed in still has the 
> > > > > > same wrong size).
> > > > > > 
> > > > > > Let's assume we're starting with this buggy code:
> > > > > > 
> > > > > > ```
> > > > > > void caller() {
> > > > > >   int buffer[10];
> > > > > >   callee(buffer, 20);
> > > > > > }
> > > > > > 
> > > > > > void callee(int* ptr, unsigned size) {
> > > > > >   ptr[size-1] = 42;
> > > > > >   ptr[size] = 42;
> > > > > >   ptr[100] = 42;
> > > > > > }
> > > > > > ```
> > > > > > and transform it to (this is just a part of what we actually plan 
> > > > > > but hopefully illustrates the point):
> > > > > > ```
> > > > > > void caller() {
> > > > > >   int buffer[10];
> > > > > >   callee(std::span<int>(buffer, 20), 20);
> > > > > > }
> > > > > > 
> > > > > > void callee(std::span<int> ptr, unsigned size) {
> > > > > >   ptr[size-1] = 42; // still unmitigated
> > > > > >   ptr[size] = 42; // mitigated at runtime by 
> > > > > > std::span::operator[]() in hardened libc++
> > > > > >   ptr[100] = 42; // mitigated at runtime by std::span::operator[]() 
> > > > > > in hardened libc++
> > > > > > }
> > > > > > ```
> > > > > > Our model depends on Standard Library used implementing the bounds 
> > > > > > checks.
> > > > > 
> > > > > So will the model consider things like libstdc++ that don't implement 
> > > > > the bounds checks? Or MSVC STL with Safe Library Support 
> > > > > (https://learn.microsoft.com/en-us/cpp/standard-library/safe-libraries-cpp-standard-library?view=msvc-170)
> > > > >  which seems to have some of the hardening you're talking about, but 
> > > > > not all of it?
> > > > > 
> > > > > > ```
> > > > > > void caller() {
> > > > > >   int buffer[10];
> > > > > >   callee(std::span<int>(buffer, 20), 20);
> > > > > > }
> > > > > > 
> > > > > > void callee(std::span<int> ptr, unsigned size) {
> > > > > >   ptr[size-1] = 42; // still unmitigated
> > > > > >   ptr[size] = 42; // mitigated at runtime by 
> > > > > > std::span::operator[]() in hardened libc++
> > > > > >   ptr[100] = 42; // mitigated at runtime by std::span::operator[]() 
> > > > > > in hardened libc++
> > > > > > }
> > > > > > ```
> > > > > 
> > > > > Understood; my point was that the comments in `callee()` are 
> > > > > incorrect in other STL implementations. The model seems to presume 
> > > > > all STLs are hardened (or will be in the relatively near future) but 
> > > > > I don't think that's a valid presumption (for example, STLs exist 
> > > > > which are focused solely on performance, like EASTL). So that makes 
> > > > > me question whether the model *should* presume that container state 
> > > > > is valid when it comes from an untrusted source.
> > > > > One thing I think is worth pointing out about these docs is that the 
> > > > > first example effectively says "do add the attribute because the size 
> > > > > passed in to the function could be wrong" and the second example says 
> > > > > "don't add the attribute on the assumption that the container has the 
> > > > > correct size information". The advice feels a bit conflicting to me 
> > > > > because in one case we're assuming callers pass in incorrect values 
> > > > > and in the other case we're assuming callers pass in correct values 
> > > > > and the only distinction between them is a "container was used".
> > > > 
> > > > Yeah that's an important thing for us to properly formulate, let me try 
> > > > to elaborate on what I meant.
> > > > 
> > > > So we trust containers and spans, like we trust smart pointers: if you 
> > > > simply use them "normally", they're correct by construction. You have 
> > > > to go out of your way to get them wrong. Like, you take an array, it 
> > > > knows the size, you implicitly convert it to span, it's now 
> > > > well-formed, you pass it from function to function, it's still 
> > > > well-formed as it still carries the original bounds information, you 
> > > > subspan it... well this needs a runtime bounds check yeah, but other 
> > > > than that it's hard to get it wrong. And our tool is supposed to 
> > > > transform the entire program to do exactly this. And once we're there, 
> > > > none of these operations require manual intervention from the user, so 
> > > > there's very little room for error.
> > > > 
> > > > On the contrary, if bounds information is passed separately, there's a 
> > > > lot of room for error every single time it's passed from function to 
> > > > function. There's literally nothing that prevents such information from 
> > > > going out of sync. There's nothing that prevents the user from passing 
> > > > a size of data he *intends to be written* into the buffer, instead of 
> > > > the size of the buffer. So we don't trust that.
> > > > 
> > > > Does that make sense?
> > > The warnings about unsafe buffer usage would be valid no matter what 
> > > Standard Library implementation is used and I can imagine the FixIts to 
> > > exist in different flavors.
> > > We don't plan to target any other Standard Library implementation but the 
> > > machinery we'll bring up should allow that to be implemented in the 
> > > future.
> > > Does that make sense?
> > 
> > It does, but the situation I'm thinking of is where the user is migrating 
> > their unsafe code to this new programming mode and aren't able to use 
> > libc++ to get the extra hardening checks. Consider this contrived example:
> > ```
> > extern "C" void func(int *array, size_t count) {
> >   // Do buffer things with array and count
> > }
> > ```
> > The user can't (easily) modify this code to accept a span directly because 
> > that could potentially break ABI. So they wrap the data in a span:
> > ```
> > extern "C" void func(int *array, size_t count) {
> >   std::span<int> sp(array, count);
> >   // Do buffer things with sp
> > }
> > ```
> > In both cases, the user is taking it on faith that the data passed into the 
> > function is correct. But you want the user to believe the second case is 
> > more trustworthy because it uses `span` -- but more trustworthy based on 
> > what? There's no hardened checks in the STL they're using and the 
> > potentially-wrong information is the same flavor of potentially-wrong in 
> > either case.
> > 
> > My fear is that we push users to make changes like that by telling them 
> > it's safer when it's actually functionally equivalent instead, but when 
> > they migrates their code they accidentally *introduce* a new bug where 
> > there wasn't one before.
> > 
> > I'm used to thinking about this sort of stuff as a kind of taint analysis. 
> > We need to know where that data comes from in order to know whether we can 
> > trust it or not. Did the "count" come from user input on the command 
> > line... or did it come from a call to `.size()` on some other container? 
> > These have different levels of trust (to me).
> > 
> > Btw, this brings up a different question about fix-it hints. Are you 
> > planning to add fix-it hints to interface boundaries? If so, are you 
> > planning to pay attention to things like `extern "C"` or the function 
> > linkage to try to reduce the amount of ABI breaks suggested?
> > In both cases, the user is taking it on faith that the data passed into the 
> > function is correct. But you want the user to believe the second case is 
> > more trustworthy because it uses span
> 
> No, that's definitely not our intention. That's why I'm explicitly specifying 
> that this assumption only applies to containers constructed *outside* the 
> function. In your example both functions should wear the attribute. And 
> there's no way to produce a safe alternative (which wouldn't need 
> `[[unsafe_buffer_usage]]`) under `extern "C"` linkage.
> 
> But on the C side, there's no expectation of having a safe alternative in the 
> first place. Calls to this function become just another unsafe operation, 
> which we can hope to isolate and reduce, but never to eliminate. So it's fine 
> that a safe alternative isn't provided at all.
> 
> > Btw, this brings up a different question about fix-it hints. Are you 
> > planning to add fix-it hints to interface boundaries? If so, are you 
> > planning to pay attention to things like extern "C" or the function linkage 
> > to try to reduce the amount of ABI breaks suggested?
> 
> So, yes, we want fixes on interface boundaries, but we only want fixes that 
> don't break source or binary compatibility. So our fixes will always preserve 
> the original interface, possibly annotate it as unsafe, and then provide a 
> safe alternative that call sites can, but *don't have to*, transition to. So 
> in this case we're aiming for something like this:
> 
> ```lang=c++
> [[clang::unsafe_buffer_usage]]
> extern "C" void func(int *array, size_t count) {
>   func(std::span<int>(array, count));
> }
> 
> void func(std::span<int> sp) {
>   // Do buffer things with sp
> }
> ```
> where the original function is preserved, the code is not duplicated, and the 
> safe function doesn't have `extern "C"`. In this case the original function 
> is still unsafe (wears the attribute) because the container isn't coming from 
> outside.
>>In both cases, the user is taking it on faith that the data passed into the 
>>function is correct. But you want the user to believe the second case is more 
>>trustworthy because it uses span
>No, that's definitely not our intention. That's why I'm explicitly specifying 
>that this assumption only applies to containers constructed *outside* the 
>function. In your example both functions should wear the attribute. And 
>there's no way to produce a safe alternative (which wouldn't need 
>[[unsafe_buffer_usage]]) under extern "C" linkage.

My point is that the assumption seems invalid to me -- it's a matter of 
garbage-in, garbage-out, but you're saying "this potentially garbage-in 
situation is assumed to not happen" and then building analysis checks around 
that assumption. I realize we have to make some heuristic decisions for the 
analysis, but this one seems dangerous for something attempting to improve 
buffer safety because nothing validates that assumption. Or do you have plans 
for the future to add an analysis to cover the situation of shoving garbage 
into the span?

To be clear, this is the situation I'm concerned about:
```
// SomeLibraryTU.cpp
void func(int *buffer, size_t count) { // Old interface for compatibility
  func(std::span<int>(buffer, count));
}

void func(std::span<int> sp) {  // New, safe interface
  if (!sp.empty()) {
    int *data = sp.data();
    // Do stuff on data (other than pointer arithmetic, only looking at first 
element)
  }
}

// SomeApplicationTU.cpp
int main() {
  func(0, 1); // Oops, wrong count!
}
```
My complaint is that the library code is perfectly safe but the application 
code is not, and it sounds like this situation is purposefully not considered 
as part of this analysis... but it seems like this situation is the common case 
we'd *want* to catch.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D136811/new/

https://reviews.llvm.org/D136811

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D136811: [-Wunsafe-buffer-usage] WIP: RFC: NFC: User documentation.

Reply via email to