Request for comments on language extension: Safe arrays and pointers for C.

2012-08-31 Thread John Nagle
   We have proposed an extension to C (primarily) and C++ (possibly)
to address buffer overflow prevention.  Buffer overflows are still
a huge practical problem in C, and much important code is still
written in C.  This is a new approach that may actually work.

The proposal,
"Safe arrays and pointers for C, round 2", is here:

http://www.animats.com/papers/languages/safearraysforc41.pdf

This has been discussed on Lambda the Ultimate and comp.std.c,
and the flaws found in those discussions have been dealt with.
So far, no one has found a killer reason why this can't work.

The proposal is to combine the concepts of C variable length
array formal parameters and C++ references.  This allows
passing arrays as arrays, rather than pointers.
For "strict mode" translation units, arrays must be passed in
this way.  For compatibility with old code, strict mode
code can call non-strict mode code, and vice versa.
When strict mode code calls strict mode code, there is
checking to ensure that sizes match.  This approach
doesn't require array descriptors or "fat pointers".

Example: Standard UNIX/Linux/Posix read, new strict form:

  int read(size_t n; int fd, void_space(&buf)[n], size_t n);

The array parameter is a sized reference, an array of size n.

"void_space" is a new type, like "void *" for type matching
purposes, but like "char" for space allocation.

The initial "size_t n;" is an existing GCC extension, a forward
parameter declaration, needed because the array parameter
precedes the size parameter.

In non-strict code, this can be called with the good old form:

char inbuf[512];
int stat = read(somefd, inbuf, 512);

The size is not checked in non-strict code.  In strict code,
the compiler would generate a size check, based on the
prototype, that the size of the actual parameter matched the
size of the variable length formal parameter.
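
The check described above can be approximated by hand in today's C, as long
as the caller still has the real array in scope.  This is only a rough
sketch, not the proposal's mechanism: "fill_buffer", "SIZE_MATCHES" and
"checked_fill_512" are hypothetical names, and the comparison is written out
manually instead of being generated by the compiler from the prototype.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callee in the style of read(): writes n bytes to buf. */
static size_t fill_buffer(char *buf, size_t n)
{
    memset(buf, 'x', n);
    return n;
}

/* Hand-written stand-in for the check strict mode would generate:
   true when the claimed size equals the real size of the array.
   Only meaningful while arr is still an array, not a pointer. */
#define SIZE_MATCHES(arr, n) (sizeof(arr) == (size_t)(n))

/* Takes a pointer to a 512-byte array so the size is part of the type.
   Returns bytes written, or 0 when the size check fails. */
static size_t checked_fill_512(char (*buf)[512], size_t n)
{
    if (!SIZE_MATCHES(*buf, n))
        return 0;               /* mismatch: refuse the call */
    return fill_buffer(*buf, n);
}
```

Calling checked_fill_512(&inbuf, 512) on a "char inbuf[512]" goes through;
passing 1024 as the size is rejected before any byte is written.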

In strict code, arrays generally have to be passed around as references,
to keep the associated size information.  There's also a way to
do this for structs. So this goes beyond C variable length arrays.
Again, there are no array descriptors; declarations tell the
language where to find the size of an array.  So code is
compatible at the object level.
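
For structs, the nearest thing in standard C is a flexible array member
whose length lives in a sibling field.  The sketch below uses that layout
with a hand-written bounds check.  "struct packet", "packet_new" and
"packet_set" are hypothetical; in the proposal it would be the declaration
itself, not a helper function, that tells the compiler that "len" gives the
size of "data".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct packet {
    size_t len;            /* number of elements in data */
    unsigned char data[];  /* flexible array member; its size lives in len */
};

/* Allocate a zeroed packet with room for len bytes; NULL on failure. */
static struct packet *packet_new(size_t len)
{
    struct packet *p = calloc(1, sizeof *p + len);
    if (p != NULL)
        p->len = len;
    return p;
}

/* Store value at index i; returns false instead of writing out of bounds. */
static bool packet_set(struct packet *p, size_t i, unsigned char value)
{
    if (i >= p->len)
        return false;
    p->data[i] = value;
    return true;
}
```

The object layout is the same one existing code already uses, which is the
sense in which no separate array descriptor is involved.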

There's more, but that's the main idea.  Programs would
be migrated to strict mode from the bottom up.  First
standard libraries, then security-critical libraries,
then security-critical applications.

What I'd like for now is an estimate of how hard this would
be to implement in GCC.  Most of the necessary features, or
something close to them, are already implemented in GCC.
Implementors, please comment.  Thanks.

John Nagle
Animats


Re: Request for comments on language extension: Safe arrays and pointers for C.

2012-08-31 Thread John Nagle
On 8/31/2012 3:32 PM, Joseph S. Myers wrote:
> My comments are:
>
> * TL;DR.

   Then, perhaps, commenting is premature.

> * No specification given at the level of edits to C11.

   That's correct.  This is still an informal proposal.

> * At a glance, no or inadequate explanation of why library solutions
> and appropriate coding practices (such as the use of managed string
> libraries) are inadequate.

   If that approach were going to work, it would have succeeded by now.
Safer C string libraries date back to the 1980s.  New ones are
still being proposed.

> * How does this compare to the array size checking you get with
> _FORTIFY_SOURCE in glibc (and associated GCC extensions)?

   There's a long history of guard-word schemes for detecting
heap overruns, but none have been enormously successful.  The
FORTIFY_SOURCE mechanism is interesting, but can't check the
cases where size information has been lost before the point of
checking.

With C arrays, the size information is always known at array
creation, but can be lost as the array is passed around.
This proposal is about not losing size information.

> * How does this relate to various cases in the secure coding rules
> draft TS (a specification for static analyzers, but should still be
> relevant here if you can point to examples of bad code therein that
> would be detected reliably through use of your proposals)?

   The "Arrays" section of the CERT guide,
https://www.securecoding.cert.org/confluence/pages/viewpage.action?pageId=263
is a style guide, not a spec for hard checking. See
"ARR30-C. Do not form or use out of bounds pointers or array subscripts".

> * Why hasn't this been done before - what is so novel that avoids the
>  pitfalls encountered by previous related work?

   That's a good question.  It was tough to fit this into the
existing world of C, C++, and existing code.  It turns out,
however, to be possible.

   It's the addition of C++ references to C that makes this work.
C++ references provide a way to reference arrays without losing
size information.  This proposal merely generalizes C++ references,
along the lines of C variable length arrays, to handle cases where
size information is non-constant.  That provides, at last, a way
to pass arrays around in C without losing size information.

> An insightful analysis of such work and the issues - not necessarily
> technical - with it is needed to demonstrate there is a genuine
> difference here.

Reading onward to page 17 of the paper, where SAL, Cyclone, and
the Safe C compiler are discussed, may be helpful.  The alternatives
either lead to a new language (like Cyclone), or heavy run time
overhead (like the Safe C compiler).

Microsoft's Structured Annotation Language provides syntax
for specifying length.  But those are just annotations; they're
not used by the actual code or for checking.  C99 variable
length array parameters also provide syntax for associating
dimension information with arrays passed to functions.
But, during conversion to a pointer, the length of the
first dimension is lost.  So the dimension information
passed is just a comment; it's not used, checked, or
accessible within the program.  The only use for that
feature is multidimensional array indexing.
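
As an illustration of that last point, here is a small C99 function with a
two-dimensional VLA parameter ("matrix_sum" is a hypothetical name).  The
inner dimension is really used, to compute the row stride; the outer
dimension is only trusted, never checked.

```c
#include <assert.h>
#include <stddef.h>

/* C99 VLA parameter: cols is used to compute the row stride for m[i][j].
   rows is accepted syntactically, but m is adjusted to double (*)[cols],
   so the first dimension cannot be recovered from m itself. */
static double matrix_sum(size_t rows, size_t cols, double m[rows][cols])
{
    double total = 0.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            total += m[i][j];
    return total;
}
```

Passing a "rows" value larger than the real array compiles without
complaint and reads past the end, which is the gap being described.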

> * Is this really in accordance with the Spirit of C?

There is a school of thought that celebrates the freedom
of the C programmer to write bad code.  The fact that we have
millions of machines exploited by buffer overflows on a regular
basis perhaps indicates that such freedom can be misused.

> * In general we're skeptical of new language extensions given the
> problems historically associated with past ones.  Assessing what
> pitfalls there might be in a proposal and the work required to
> implement it is itself a substantial amount of work (I'd guess
> several hours at least for this document); it's more likely to happen
> if there's something to excite people about the proposal (as well as
> if all the other issues I list are addressed), and I don't see
> anything particularly exciting here.  That's especially the case
> given how many previous attempts there have been at addressing this
> sort of issue.

   There's certainly a history of failure in this area. That's
why it's worth looking at something that might work.

> * If proposals are written by people with substantial experience in C
>  compiler implementation they are more likely to be sound - what such
>  experience has gone into writing this document?
>
> * Consider attending a WG14 meeting and presenting the proposals in
> person there (having had them included in a pre-meeting mailing), if
> you want a wider range of implementer opinions.

That may happen, but I'm still getting comments informally at
this point.  I'd like to see enough of this implemented in GCC
as an extension that people could try it out.

John Nagle
Animats



Re: Request for comments on language extension: Safe arrays and pointers for C.

2012-09-01 Thread John Nagle
On 9/1/2012 9:59 AM, James Dennett wrote:
> On Fri, Aug 31, 2012 at 2:55 PM, John Nagle wrote:
>> We have proposed an extension to C (primarily) and C++ (possibly) 
>> to address buffer overflow prevention.  Buffer overflows are still 
>> a huge practical problem in C, and much important code is still 
>> written in C.  This is a new approach that may actually work.
...
> Could you say a little more of why it appears necessary to introduce 
> references into C for this?  The reason I'm puzzled is that C already
> has the ability to pass arrays in a way that preserves their size
> (just pass the address of the array) -- what is it that references
> change in this picture that justifies such a radical change?  Could
> we just permit pointer-to-array-of-n elements to convert to
> pointer-to-array-of-(n-1) elements, and/or provide some way to slice
> explicitly?

   That's an important point.  C99 already has variable-length
array parameters:

int fn(size_t n, float vec[n]);

Unfortunately, when the parameter is received in the function body,
per N1570 §6.7.6.3p7: 'A declaration of a parameter as "array of _type_"
shall be adjusted to "qualified pointer to _type_", where the type
qualifiers (if any) are those specified within the [ and ] of
the array type derivation.'

What this means is that, in the body of the function,
"vec" has type "float *", and "sizeof vec" is the size of
a pointer.  The standard currently requires losing the size
of the array.

While C99 variable-length array parameters aren't used much
(searches of open-source code have failed to find any use
cases, Microsoft refuses to implement them, and N1570 makes
them optional), these semantics also apply to passing
fixed-length arrays:

int fn(float vec[4]);

As before, "vec" is delivered as "float* vec".  The constant
case is widely used, and changing the semantics there might silently
break existing code that uses "sizeof".  We had a go-round on
this on comp.std.c, and the conclusion was that changing the
semantics of C array passing would break too much.
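
The adjustment, and the pointer-to-array alternative raised in the question,
can both be seen in a few lines of C11.  The function names are
hypothetical.  "_Generic" is used to inspect the parameter's type because
applying "sizeof" to an array parameter draws a warning from GCC.

```c
#include <assert.h>
#include <stddef.h>

/* Declared as an array, but adjusted to float *: _Generic sees a pointer. */
static int param_is_pointer(float vec[4])
{
    return _Generic(vec, float *: 1, default: 0);
}

/* Pointer to an array of 4 floats: the pointed-to type keeps its size. */
static size_t pointee_size(float (*vec)[4])
{
    return sizeof *vec;
}
```

Given "float v[4]", param_is_pointer(v) reports a pointer, while
pointee_size(&v) still sees all four elements.  The pointer-to-array form
covers the parameter case for constant sizes; it does not by itself address
return types, casts, or structures.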

The real reason for using references is that size information
is needed in other places than parameters.  It's needed in
return types, on the left side of assignments, in casts, and
in structures.  References to arrays have associated information;
pointers don't.

As for slicing, see "array_slice" in the paper.  It's not a
built-in; it's a macro that uses "decltype" and a cast to
generate the appropriate result type.  Personally, I'd
like to have a Python-like slicing notation:

arr[start:endplus1]

but that's not essential to the proposal, so I'm not suggesting it.
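
For readers without the paper at hand, the general idea of a slice with a
known length can be approximated in C99 with a cast to a pointer to a
variably sized array.  This is not the paper's "array_slice" macro, which
relies on "decltype"; "slice_sum" is a hypothetical function, and nothing
here checks that the slice lies inside the original array.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the elements arr[start] .. arr[endplus1 - 1] through a pointer to a
   variable length array, so the slice length travels with the type. */
static float slice_sum(float *arr, size_t start, size_t endplus1)
{
    assert(endplus1 > start);          /* a VLA needs a positive length */
    size_t n = endplus1 - start;
    float (*s)[n] = (float (*)[n]) &arr[start];

    float total = 0.0f;
    /* sizeof *s is evaluated at run time and equals n * sizeof(float). */
    for (size_t i = 0; i < sizeof *s / sizeof (*s)[0]; i++)
        total += (*s)[i];
    return total;
}
```

The length is recoverable from "sizeof *s" inside the function, but it is
lost again as soon as "s" is converted back to a plain "float *".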

> Of course to make this succeed you'll need buy-in from implementors 
> and of the standards committee(s), who will need to trust that the 
> other (and therefore that users) will find this worth the cost.  It 
> generally takes a lot of work (in terms of robust specification and 
> possibly implementation in a fork of an open source compiler or two) 
> to generate the consensus necessary for a proposal to succeed. 
> Something that might ultimately seek to change or even disallow much 
> existing C code has an even higher bar -- getting an ISO committee to
> remove existing support is no small achievement (e.g., look at how 
> long gets() persisted). I'd love to see a reduction in the number of 
> buffer overruns that are present in code, but it's an uphill 
> struggle.

Of course.  Support may come from the security community.  CERT
still reports buffer overflows, usually in C/C++ code, as the single
biggest source of vulnerabilities.  Vulnerabilities in software are now
a public policy level issue.  In the last week, software attacks have
taken down Saudi Aramco and RasGas, two of the world's largest energy
producers.  This issue is growing in importance as "info-war" moves
from a potential threat to reality.  It's now something that has to
be fixed.

John Nagle



Re: Request for comments on language extension: Safe arrays and pointers for C.

2012-09-02 Thread John Nagle
On 9/2/2012 1:12 AM, Florian Weimer wrote:
> * John Nagle:
> 
>> We have proposed an extension to C (primarily) and C++ (possibly)
>> to address buffer overflow prevention.  Buffer overflows are still
>> a huge practical problem in C, and much important code is still
>> written in C.  This is a new approach that may actually work.
> 
> Would you please state publicly if you have any IPR claims necessarily
> infringed by an implementation, or if you are aware of any such claims by
> others?

   I have no IPR claims in this area.

   At the language level, I doubt that anyone does, or could.
However, there is the potential for a static analysis tool that
automatically retrofits sized declarations to existing code,
turning non-strict code into strict code.  Some of the commercial
static analysis systems may have IP in this area.  It would probably
be narrow, though; proof of correctness systems have been around
for a while.  I worked on one decades ago, the Pascal-F Verifier.

> I'm not sure if the proposed extension would actually help.  At work,
> we have a coded corpus of vulnerabilities to answer such questions.
> For memory safety vulnerabilities related to buffer overflows, the
> coding is not yet very accurate, mainly due to lack of a widely
> agreed-upon taxonomy, but at least it can serve as a starting point
> for a review.

There is an "official taxonomy", in  “Information Technology —
Programming Languages — Guidance to Avoiding Vulnerabilities in
Programming Languages through Language Selection and Use”,
ISO/IEC TR 24772.  The current draft is at

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1583.pdf

It's a list of features in programming languages that offer
attack points.  It can be used to classify programming
language vulnerabilities, to classify attacks which
exploit those vulnerabilities, or to classify program
bugs which relate to those vulnerabilities.

At the language level, for example, C has vulnerability
6.10, "Unchecked array indexing".

> That being said, it's certainly a very interesting topic!

   This is a problem that should have been solved a long time ago.
There are programmers who think it's inherent that a fast,
low-level language must be unsafe.  In fact, many of
the vulnerabilities in C simply reflect what could
be crammed into a compiler that had to run on a PDP-11
with a 64K address space.

   Others have approached this problem.  They either came up with
a new language (Modula, Ada, Java), changed C so much it became a
new language (Cyclone, Microsoft SAL), or had to add extra run
time data to carry around size information (Safe C Compiler,
GCC fat pointers.)  The problem today is coming up with a
backwards compatible solution that can be applied to the
huge legacy C code base.  This is tough, but not impossible.

   I'm proposing an optional "strict mode" for C,
in which array sizes have to match in all the places where a
mismatch creates a vulnerability.  In non-strict mode, today's
loose rules apply.  Non-strict mode code can call strict mode
code and vice versa.  Non-strict mode code can use the strict
mode features.  This provides a migration path to converting
security-critical code to strict mode.

   There are other C "tightening up" proposals around.
Most are on the library side, and address the usual suspects
- memcpy, strcat, etc.  This is broader, deals with arrays
in general, and covers the usual suspects as well.

John Nagle




Re: Request for comments on language extension: Safe arrays and pointers for C.

2012-09-03 Thread John Nagle
On 9/3/2012 8:29 AM, Andrew Haley wrote:
> On 09/03/2012 04:20 PM, Joseph S. Myers wrote:
>> On Mon, 3 Sep 2012, Andrew Haley wrote:
>>
>>> This isn't the only way to proceed.  I'd encourage someone wanting to
>>> do this to branch GCC and implement a rough cut of the feature.  That
>>
>> That would very likely be "build one to throw away" - features built 
>> without a clear definition of how they interact with other language 
>> features have been particularly problematic in the past.  So have 
>> extensions built based on "take this feature from another language, and 
>> put it in GNU C".
> 
> The alternative is worse: to design and fully specify a language
> feature and suggest that people adopt it without at any point trying
> that feature in real applications.
> 
>>> will provide useful information about the amount of work likely to be
>>> needed to complete the task.  Also, it will provide the opportunity to
>>> try out the language feature to see how well it works in practice.
>>
>> Whether people *will* use it is probably the more significant question 
>> than whether it *can* be used to address particular issues.
> 
> Well, of course.  But the only way to find out is by an iterative
> process: design something, try it, and refine.  Supporting that is one
> of GCC's primary goals, and has been since the beginning of the
> project.
> 
> Andrew.
> 
Exactly. That's why I'm raising this issue on the GCC list.
GCC already has many of the necessary extensions, such as
forward parameter declarations.  It has VLAs on the C side,
and references on the C++ side.  So most of the necessary
machinery is already implemented within GCC.

A first step would be a GCC version which allowed variable
length arrays in references and structures, but only made
the array parameter size checks, not full subscript checks.
That would allow trying to port some code over to strict mode,
and would wring out the concept.

   Think of it as FORTIFY on steroids.  It can do the parameter
checks FORTIFY does, but for any function with an array parameter
and a size.  It's not limited to a built-in list of the usual
suspect functions.

John Nagle



Re: Request for comments on language extension: Safe arrays and pointers for C, September draft.

2012-10-12 Thread John Nagle
Here's the September 2012 draft of my "Safe arrays and pointers for C"
proposal:

http://www.animats.com/papers/languages/safearraysforc43.pdf

This incorporates most of the substantive issues raised in
previous discussions.

Brief summary:

- Optional "strict mode" via pragma which prohibits some unsafe
  pointer usages.
- Prevents buffer overflows in strict mode.
- Bring C++ references into C, so programmers can talk about arrays.
- Expressions allowed in array dimensions (like VLA params, but
  in a few more contexts.)
- Strict code can call non-strict code, and vice versa.
- Libraries and APIs with array params can be given strict
  declarations, and can be called from strict code (safely) and
  non-strict code (unsafely), allowing gradual conversion.

The goal is to eliminate buffer overflows in strict mode code,
providing a substantial improvement in security and reliability
for security-critical C programs.

I'm proposing this as an enhancement to GCC, in two phases.

Phase 1: Add language mode flag for this feature set.
 Support new language features.  No bounds checking
 in this phase.

Phase 2: Add optional bounds checking.

I'd appreciate comments on how difficult phase 1 would be.

John Nagle