Request for comments on language extension: Safe arrays and pointers for C.
We have proposed an extension to C (primarily) and C++ (possibly) to address buffer overflow prevention. Buffer overflows are still a huge practical problem in C, and much important code is still written in C. This is a new approach that may actually work.

The proposal, "Safe arrays and pointers for C, round 2", is here:

http://www.animats.com/papers/languages/safearraysforc41.pdf

This has been discussed on Lambda the Ultimate and comp.std.c, and the flaws found in those discussions have been dealt with. So far, no one has found a killer reason why this can't work.

The proposal is to combine the concepts of C variable length array formal parameters and C++ references. This allows passing arrays as arrays, rather than as pointers. For "strict mode" translation units, arrays must be passed in this way. For compatibility with old code, strict mode code can call non-strict mode code, and vice versa. When strict mode code calls strict mode code, there is checking to ensure that sizes match. This approach doesn't require array descriptors or "fat pointers".

Example: Standard UNIX/Linux/POSIX read, new strict form:

    int read(size_t n; int fd, void_space(&buf)[n], size_t n);

The array parameter is a sized reference to an array of size n. "void_space" is a new type, like "void *" for type matching purposes, but like "char" for space allocation. The initial "size_t n;" is an existing GCC extension, a forward parameter declaration, needed because the array parameter precedes the size parameter.

In non-strict code, this can be called with the good old form:

    char inbuf[512];
    int stat = read(somefd, inbuf, 512);

The size is not checked in non-strict code. In strict code, the compiler would generate a size check, based on the prototype, that the size of the actual parameter matches the size of the variable length formal parameter.

In strict code, arrays generally have to be passed around as references, to keep the associated size information. There's also a way to do this for structs. So this goes beyond C variable length arrays. Again, there are no array descriptors; declarations tell the language where to find the size of an array. So code is compatible at the object level.

There's more, but that's the main idea.

Programs would be migrated to strict mode from the bottom up: first standard libraries, then security-critical libraries, then security-critical applications.

What I'd like for now is an estimate of how hard this would be to implement in GCC. Most of the necessary features, or something close to them, are already implemented in GCC. Implementors, please comment. Thanks.

John Nagle
Animats
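
The call-site check described in the message above can be approximated by hand in today's C. What follows is only a rough sketch under stated assumptions: fill_bytes, FILL_CHECKED and check_fill are hypothetical names, not part of the proposal or of any library, and the rule used here (the requested count must not exceed the size of the actual array) is an assumption about what "sizes match" means. The paper remains the authority on the exact rule and syntax.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a read()-like routine: writes n bytes into buf. */
static int fill_bytes(void *buf, size_t n)
{
    memset(buf, 'x', n);
    return (int)n;
}

/*
 * Hand-written imitation of a compiler-generated call-site check.
 * Only meaningful when `arr` is a real array (not a pointer), because
 * sizeof must still see the array type. Note that `n` is evaluated twice.
 * Yields -1 without calling fill_bytes if n would overrun arr.
 */
#define FILL_CHECKED(arr, n) \
    (((size_t)(n) <= sizeof(arr)) ? fill_bytes((arr), (size_t)(n)) : -1)

/* Usage: the same 512-byte buffer as in the read() example. */
static void check_fill(void)
{
    char inbuf[512];

    assert(FILL_CHECKED(inbuf, 512) == 512);  /* fits: the call goes through */
    assert(inbuf[511] == 'x');
    assert(FILL_CHECKED(inbuf, 513) == -1);   /* would overrun: refused */
}
```

The macro only works while the caller still holds the array itself. Once inbuf has been passed on as a plain pointer, sizeof no longer reports 512, which is the loss of size information the proposal is meant to prevent.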
Re: Request for comments on language extension: Safe arrays and pointers for C.
On 8/31/2012 3:32 PM, Joseph S. Myers wrote:
> My comments are:
>
> * TL;DR.

Then, perhaps, commenting is premature.

> * No specification given at the level of edits to C11.

That's correct. This is still an informal proposal.

> * At a glance, no or inadequate explanation of why library solutions
> and appropriate coding practices (such as the use of managed string
> libraries) are inadequate.

If that approach was going to work, it would have succeeded by now. Safer C string libraries date back to the 1980s. New ones are still being proposed.

> * How does this compare to the array size checking you get with
> _FORTIFY_SOURCE in glibc (and associated GCC extensions)?

There's a long history of guard-word schemes for detecting heap overruns, but none have been enormously successful. The FORTIFY_SOURCE mechanism is interesting, but can't check the cases where size information has been lost before the point of checking. With C arrays, the size information is always known at array creation, but can be lost as the array is passed around. This proposal is about not losing size information.

> * How does this relate to various cases in the secure coding rules
> draft TS (a specification for static analyzers, but should still be
> relevant here if you can point to examples of bad code therein that
> would be detected reliably through use of your proposals)?

The "Arrays" section of the CERT guide, https://www.securecoding.cert.org/confluence/pages/viewpage.action?pageId=263 is a style guide, not a spec for hard checking. See "ARR30-C. Do not form or use out of bounds pointers or array subscripts".

> * Why hasn't this been done before - what is so novel that avoids the
> pitfalls encountered by previous related work?

That's a good question. It was tough to fit this into the existing world of C, C++, and existing code. It turns out, however, to be possible.

It's the addition of C++ references to C that makes this work. C++ references provide a way to reference arrays without losing size information. This proposal merely generalizes C++ references, along the lines of C variable length arrays, to handle cases where size information is non-constant. That provides, at last, a way to pass arrays around in C without losing size information.

> An insightful analysis of such work and the issues - not necessarily
> technical - with it is needed to demonstrate there is a genuine
> difference here.

Reading onward to page 17 of the paper, where SAL, Cyclone, and the Safe C compiler are discussed, may be helpful. The alternatives either lead to a new language (like Cyclone), or heavy run time overhead (like the Safe C compiler).

Microsoft's Structured Annotation Language provides syntax for specifying length. But those are just annotations; they're not used by the actual code or for checking.

C99 variable length array parameters also provide syntax for associating dimension information with arrays passed to a function. But, during conversion to a pointer, the length of the first dimension is lost. So the dimension information passed is just a comment; it's not used, checked, or accessible within the program. The only use for that feature is multidimensional array indexing.

> * Is this really in accordance with the Spirit of C?

There is a school of thought that celebrates the freedom of the C programmer to write bad code. The fact that we have millions of machines exploited by buffer overflows on a regular basis perhaps indicates that such freedom can be misused.

> * In general we're skeptical of new language extensions given the
> problems historically associated with past ones. Assessing what
> pitfalls there might be in a proposal and the work required to
> implement it is itself a substantial amount of work (I'd guess
> several hours at least for this document); it's more likely to happen
> if there's something to excite people about the proposal (as well as
> if all the other issues I list are addressed), and I don't see
> anything particularly exciting here. That's especially the case
> given how many previous attempts there have been at addressing this
> sort of issue.

There's certainly a history of failure in this area. That's why it's worth looking at something that might work.

> * If proposals are written by people with substantial experience in C
> compiler implementation they are more likely to be sound - what such
> experience has gone into writing this document?
>
> * Consider attending a WG14 meeting and presenting the proposals in
> person there (having had them included in a pre-meeting mailing), if
> you want a wider range of implementer opinions.

That may happen, but I'm still getting comments informally at this point. I'd like to see enough of this implemented in GCC as an extension that people could try it out.

John Nagle
Animats
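
The remark above that C99 variable length array parameters are useful only for multidimensional indexing can be seen in a small standard-C sketch. The function names sum_matrix, row_bytes and check_matrix are hypothetical and chosen for illustration; the code uses only C99 features and assumes a compiler that supports variable length arrays (optional in C11, available in GCC).

```c
#include <assert.h>
#include <stddef.h>

/*
 * m is adjusted to "double (*m)[cols]": the row count is not part of the
 * type seen inside the function, but the column count is, which is what
 * makes m[i][j] index correctly.
 */
static double sum_matrix(size_t rows, size_t cols, double m[rows][cols])
{
    double total = 0.0;

    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            total += m[i][j];
    return total;
}

/* Size of one row as seen by the callee: cols * sizeof(double). */
static size_t row_bytes(size_t rows, size_t cols, double m[rows][cols])
{
    (void)rows;             /* the first dimension carries no type information */
    return sizeof m[0];
}

/* Usage: a 2 x 3 matrix passed with its dimensions. */
static void check_matrix(void)
{
    double m[2][3] = { {1, 2, 3}, {4, 5, 6} };

    assert(sum_matrix(2, 3, m) == 21.0);
    assert(row_bytes(2, 3, m) == 3 * sizeof(double));
}
```

Nothing in the callee can confirm that rows really is 2. If the caller passed a larger value, the loop would simply run off the end of the array, which is the gap the message describes.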
Re: Request for comments on language extension: Safe arrays and pointers for C.
On 9/1/2012 9:59 AM, James Dennett wrote:
> On Fri, Aug 31, 2012 at 2:55 PM, John Nagle wrote:
>> We have proposed an extension to C (primarily) and C++ (possibly)
>> to address buffer overflow prevention. Buffer overflows are still
>> a huge practical problem in C, and much important code is still
>> written in C. This is a new approach that may actually work.
...
> Could you say a little more of why it appears necessary to introduce
> references into C for this? The reason I'm puzzled is that C already
> has the ability to pass arrays in a way that preserves their size
> (just pass the address of the array) -- what is it that references
> change in this picture that justifies such a radical change? Could
> we just permit pointer-to-array-of-n elements to convert to
> pointer-to-array-of-(n-1) elements, and/or provide some way to slice
> explicitly?

That's an important point. C99 already has variable-length array parameters:

    int fn(size_t n, float vec[n]);

Unfortunately, when the parameter is received in the function body, per N1570 §6.7.6.3p7: 'A declaration of a parameter as "array of _type_" shall be adjusted to "qualified pointer to _type_", where the type qualifiers (if any) are those specified within the [ and ] of the array type derivation.'

What this means is that, in the body of the function, "vec" has type "float *", and "sizeof vec" is the size of a pointer. The standard currently requires losing the size of the array.

While C99 variable-length array parameters aren't used much (searches of open-source code have failed to find any use cases, Microsoft refuses to implement them, and N1570 makes them optional), these semantics also apply to passing fixed-length arrays:

    int fn(float vec[4]);

As before, "vec" is delivered as "float* vec". The constant case is widely used, and changing the semantics there might silently break existing code that uses "sizeof". We had a go-round on this on comp.std.c, and the conclusion was that changing the semantics of C array passing would break too much.

The real reason for using references is that size information is needed in other places than parameters. It's needed in return types, on the left side of assignments, in casts, and in structures. References to arrays have associated information; pointers don't.

As for slicing, see "array_slice" in the paper. It's not a built-in; it's a macro that uses "decltype" and a cast to generate the appropriate result type. Personally, I'd like to have a Python-like slicing notation:

    arr[start:endplus1]

but that's not essential to the proposal, so I'm not suggesting it.

> Of course to make this succeed you'll need buy-in from implementors
> and of the standards committee(s), who will need to trust that the
> other (and therefore that users) will find this worth the cost. It
> generally takes a lot of work (in terms of robust specification and
> possibly implementation in a fork of an open source compiler or two)
> to generate the consensus necessary for a proposal to succeed.
> Something that might ultimately seek to change or even disallow much
> existing C code has an even higher bar -- getting an ISO committee to
> remove existing support is no small achievement (e.g., look at how
> long gets() persisted). I'd love to see a reduction in the number of
> buffer overruns that are present in code, but it's an uphill
> struggle.

Of course.

Support may come from the security community. CERT still reports buffer overflows, usually in C/C++ code, as the single biggest source of vulnerabilities. Vulnerabilities in software are now a public policy level issue. In the last week, software attacks have taken down Saudi Aramco and RasGas, two of the world's largest energy producers. This issue is growing in importance as "info-war" moves from a potential threat to reality. It's now something that has to be fixed.

John Nagle
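
The exchange above turns on what size information survives each way of passing an array in current C. The sketch below compares the two forms under discussion: an array parameter, which is adjusted to a pointer, and a pointer to an array (the "just pass the address of the array" form from the quoted question), with both a constant and a variable length. All function names are hypothetical, and only standard C99 features are used.

```c
#include <assert.h>
#include <stddef.h>

/* Array parameter: adjusted to "float *", so the 4 is not part of the type. */
static size_t bytes_via_array_param(float vec[4])
{
    float **p = &vec;       /* compiles only because vec really is a float * */
    return sizeof *p;       /* size of a pointer, not of four floats */
}

/* Pointer to array: the element count stays in the type. */
static size_t bytes_via_array_pointer(float (*vec)[4])
{
    return sizeof *vec;     /* 4 * sizeof(float) */
}

/* Pointer to variable length array: the count comes from the parameter n. */
static size_t bytes_via_vla_pointer(size_t n, float (*vec)[n])
{
    return sizeof *vec;     /* n * sizeof(float), computed at run time */
}

/* Usage: the same four-element array passed three ways. */
static void check_sizes(void)
{
    float v[4] = {0};

    assert(bytes_via_array_param(v) == sizeof(float *));
    assert(bytes_via_array_pointer(&v) == 4 * sizeof(float));
    assert(bytes_via_vla_pointer(4, &v) == 4 * sizeof(float));
}
```

In the pointer-to-VLA form the callee trusts whatever n the caller supplies. Standard C does not require the compiler to compare it with the real array, so the size is carried but not verified.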
Re: Request for comments on language extension: Safe arrays and pointers for C.
On 9/2/2012 1:12 AM, Florian Weimer wrote:
> * John Nagle:
>
>> We have proposed an extension to C (primarily) and C++ (possibly)
>> to address buffer overflow prevention. Buffer overflows are still
>> a huge practical problem in C, and much important code is still
>> written in C. This is a new approach that may actually work.
>
> Would you please state publicly if you have any IPR claims necessarily
> infringed by an implementation, or if you aware of any such claims by
> others?

I have no IPR claims in this area. At the language level, I doubt that anyone does, or could.

However, there is the potential for a static analysis tool that automatically retrofits sized declarations to existing code, turning non-strict code into strict code. Some of the commercial static analysis systems may have IP in this area. It would probably be narrow, though; proof of correctness systems have been around for a while. I worked on one decades ago, the Pascal-F Verifier.

> I'm not sure if the proposed extension would actually help. At work,
> we have a coded corpus of vulnerabilities to answer such questions.
> For memory safety vulnerabilities related to buffer overflows, the
> coding is not yet very accurate, mainly due to lack of a widely
> agreed-upon taxonomy, but at least it can serve as a starting point
> for a review.

There is an "official taxonomy", in “Information Technology — Programming Languages — Guidance to Avoiding Vulnerabilities in Programming Languages through Language Selection and Use”, ISO/IEC TR 24772. The current draft is at

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1583.pdf

It's a list of features in programming languages that offer attack points. It can be used to classify programming language vulnerabilities, to classify attacks which exploit those vulnerabilities, or to classify program bugs which relate to those vulnerabilities. At the language level, for example, C has vulnerability 6.10, "Unchecked array indexing".

> That being said, it's certainly a very interesting topic!

This is a problem that should have been solved a long time ago. There are programmers who think it's inherent that a fast, low-level language must be unsafe. In fact, many of the vulnerabilities in C simply reflect what could be crammed into a compiler that had to run on a PDP-11 with a 64K address space.

Others have approached this problem. They either came up with a new language (Modula, Ada, Java), changed C so much it became a new language (Cyclone, Microsoft SAL), or had to add extra run time data to carry around size information (Safe C Compiler, GCC fat pointers.) The problem today is coming up with a backwards compatible solution that can be applied to the huge legacy C code base. This is tough, but not impossible.

I'm proposing an optional "strict mode" for C, in which array sizes have to match in all the places where a mismatch creates a vulnerability. In non-strict mode, today's loose rules apply. Non-strict mode code can call strict mode code and vice versa. Non-strict mode code can use the strict mode features. This provides a migration path to converting security-critical code to strict mode.

There are other C "tightening up" proposals around. Most are on the library side, and address the usual suspects - memcpy, strcat, etc. This is broader, deals with arrays in general, and covers the usual suspects as well.

John Nagle
Re: Request for comments on language extension: Safe arrays and pointers for C.
On 9/3/2012 8:29 AM, Andrew Haley wrote:
> On 09/03/2012 04:20 PM, Joseph S. Myers wrote:
>> On Mon, 3 Sep 2012, Andrew Haley wrote:
>>
>>> This isn't the only way to proceed. I'd encourage someone wanting to
>>> do this to branch GCC and implement a rough cut of the feature. That
>>
>> That would very likely be "build one to throw away" - features built
>> without a clear definition of how they interact with other language
>> features have been particularly problematic in the past. So have
>> extensions built based on "take this feature from another language, and
>> put it in GNU C".
>
> The alternative is worse: to design and fully specify a language
> feature and suggest that people adopt it without at any point trying
> that feature in real applications.
>
>>> will provide useful information about the amount of work likely to be
>>> needed to complete the task. Also, it will provide the opportunity to
>>> try out the language feature to see how well it works in practice.
>>
>> Whether people *will* use it is probably the more significant question
>> than whether it *can* be used to address particular issues.
>
> Well, of course. But the only way to find out is by an iterative
> process: design something, try it, and refine. Supporting that is one
> of GCC's primary goals, and has been since the beginning of the
> project.
>
> Andrew.
>

Exactly. That's why I'm raising this issue on the GCC list. GCC already has many of the necessary extensions, such as forward parameter declarations. It has VLAs on the C side, and references on the C++ side. So most of the necessary machinery is already implemented within GCC.

A first step would be a GCC version which allowed variable length arrays in references and structures, but only made the array parameter size checks, not full subscript checks. That would allow trying to port some code over to strict mode, and would wring out the concept.

Think of it as FORTIFY on steroids. It can do the parameter checks FORTIFY does, but for any function with an array parameter and a size. It's not limited to a built-in list of the usual suspect functions.

John Nagle
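
The message above mentions variable length arrays in structures, where a declaration tells the compiler which field holds an array's size. The closest idiom in current C is a flexible array member paired with a length field and a hand-written check. The sketch below shows that idiom only. It is not the syntax from the paper, and struct packet, packet_new, packet_store and check_packet are hypothetical names; the comparison in packet_store is an assumption about the kind of check a strict-mode compiler would generate from the declaration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record: `len` says how many bytes `data` holds. */
struct packet {
    size_t len;
    unsigned char data[];   /* C99 flexible array member */
};

/* Allocates a zero-filled packet with room for len bytes; NULL on failure. */
static struct packet *packet_new(size_t len)
{
    struct packet *p;

    if (len > SIZE_MAX - sizeof *p)     /* avoid overflow in the size sum */
        return NULL;
    p = calloc(1, sizeof *p + len);
    if (p != NULL)
        p->len = len;
    return p;
}

/* Copies n bytes into the packet only if they fit; 0 on success, -1 if not. */
static int packet_store(struct packet *p, const void *src, size_t n)
{
    if (n > p->len)
        return -1;          /* manual version of the size check */
    memcpy(p->data, src, n);
    return 0;
}

/* Usage: a 4-byte packet accepts 4 bytes and refuses 5. */
static void check_packet(void)
{
    struct packet *p = packet_new(4);

    assert(p != NULL);
    assert(packet_store(p, "abcd", 4) == 0);
    assert(memcmp(p->data, "abcd", 4) == 0);
    assert(packet_store(p, "abcde", 5) == -1);
    free(p);
}
```

In this version the link between len and data exists only by convention, and every function that touches data has to repeat the comparison by hand.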
Re: Request for comments on language extension: Safe arrays and pointers for C, September draft.
Here's the September 2012 draft of my "Safe arrays and pointers for C" proposal:

http://www.animats.com/papers/languages/safearraysforc43.pdf

This addresses most of the substantive issues raised in previous discussions.

Brief summary:

- Optional "strict mode" via pragma which prohibits some unsafe pointer usages.
- Prevents buffer overflows in strict mode.
- Bring C++ references into C, so programmers can talk about arrays.
- Expressions allowed in array dimensions (like VLA params, but in a few more contexts.)
- Strict code can call non-strict code, and vice versa.
- Libraries and APIs with array params can be given strict declarations, and can be called from strict code (safely) and non-strict code (unsafely), allowing gradual conversion.

The goal is to eliminate buffer overflows in strict mode code, providing a substantial improvement in security and reliability for security-critical C programs.

I'm proposing this as an enhancement to GCC, in two phases.

Phase 1: Add language mode flag for this feature set. Support new language features. No bounds checking in this phase.

Phase 2: Add optional bounds checking.

I'd appreciate comments on how difficult phase 1 would be.

John Nagle