On 14 Mar 2025, at 15:18, Martin Uecker wrote:
Am Freitag, dem 14.03.2025 um 14:42 -0400 schrieb John McCall:
On 14 Mar 2025, at 14:13, Martin Uecker wrote:
Am Freitag, dem 14.03.2025 um 10:11 -0700 schrieb David Tarditi:
Hi Martin,
The C design of VLAs misunderstood dependent typing.
They probably did not care about theory, but the design is
not inconsistent with theory.
This is almost true, but for bad reasons. The theory of dependent
types is heavily concerned with deciding whether two types are the
same, and C simply sidesteps this question because type identity is
largely meaningless in C. Every value of variably-modified type is
(or decays to) a pointer, and all pointers in C freely convert to one
another (within the object/function categories). _Generic is based on
type compatibility, not equality. So in that sense, the standard
doesn’t say anything inconsistent with theory because it doesn’t
even try to say anything.
The reason it is not quite true is that C does have rules for
compatible and composite types, and alas, those rules for
variably-modified types are not consistent with theory. Two VLA types
of compatible element type are always statically considered
compatible, and it’s simply UB if the sizes aren’t the same. The
composite type of a VLA and a fixed-size array type is always the
fixed-size array type. The standard is literally incomplete about the
composite type of two VLAs; if you use a ternary operator where both
operands are casts to VLA types, the standard just says it’s
straight-up just undefined behavior (because one of the types has a
bound that’s unevaluated) and doesn’t even bother telling us what
the static type is supposed to be.
Yes, I guess this is all true.
But let's rephrase my point a bit more precisely: One could take
a strict subset of C that includes variably modified types but
obviously has to forbid a lot other things (e.g. arbitrary pointer
conversions or unsafe down-casts and much more) and make this a
memory-safe language with dependent types. This would also
require adding run-time checks at certain places where there
is now UB, in particular where two VLA types need to be compatible.
Mmm. You can certainly subset C to the point that it’s memory-safe,
but
it wouldn’t really be anything like C anymore. As long as C has a
heap,
I don’t see any path to achieving temporal safety without significant
extensions to the language. But if we’re just talking about spatial
safety,
then sure, that could be a lot closer to C today.
Is that your vision, then, that you’d like to see the same sort of
checks
that `-fbounds-safety` does, but you want them based firmly in the
language
as a dynamic check triggered by pointer type conversion, with bounds
specified using variably-modified types? It’s a pretty elegant vision,
and
I can see the attraction. It has some real merits, which I’ll get to
below.
I do see at least two significant challenges, though.
The first and biggest problem is that, in general, array bounds can only
be
expressed on a pointer value if it’s got pointer to array type. Most C
array
code today works primarily with pointers to elements; programmers just
use
array types to create concrete arrays, and they very rarely use pointers
to
array type at all. There are a bunch of reasons for that:
- Pointers to arrays have to be dereferenced twice: `(*ptr)[idx]`
instead
of `ptr[idx]`.
- That makes them more error-prone, because it is easy to do pointer
arithmetic at the wrong level, e.g. by writing `ptr[idx]`, which will
stride by multiples of the entire array size. That may even pass the
compiler without complaint because of C’s laxness about
conversions.
- Keeping the bound around in the pointer type is more work and
doesn’t do
anything useful right now.
- A lot of C programmers dislike nested declarator syntax and can’t
remember
how it works. Those of us who can write it off the top of our heads
are
quite atypical.
Now, there is an exception: you can write a parameter using an array
type,
and it actually declares a pointer parameter. You could imagine using
this
as a syntax for an enforceable array bound for arguments, although the
committee did already decide that these bounds were meaningless without
`static`. Unfortunately, you can’t do this in any other position and
still
end up with just a pointer, so it’s not helpful as a general syntax
for
associating bounds with pointers.
The upshot is that this isn’t really something people can just adopt
by
adding annotations. It’s not just a significant rewrite, it’s a
rewrite that
programmers will have very legitimate objections to. I think that makes
this
at best a complement to the “sidecar” approach taken by
`-fbounds-safety`
where we can track top-level bounds to a specific pointer value.
The second problem is that there are some extralingual problems that
`-fbounds-safety` has to solve around bounds that *aren’t* just local
evaluations of bounds expressions, and a type-conversion-driven approach
doesn’t help with any of them.
As you mentioned, the design of variably-modified types is based on
evaluating the bounds expression at some specific point in the program
execution. Since these types can only be written locally, the evaluation
point is obvious. If we wanted to dynamically enforce bounds during
initialization, it would simply be another use of the same computed
bound:
```
int count = ...;
int (*ptr)[count * 10] = source_ptr;
```
Here we would evaluate `count * 10` exactly once and use it both as (1)
part
of the destination type when initializing `ptr` with `source_ptr` and
(2)
part of the type of `ptr` for all uses of it. For example, if
`source_ptr`
were of type `int (*)[100]`, we would dynamically check that
`count * 10 <= 100`. This all works perfectly with an arbitrary bounds
expression; it could even contain an opaque function call.
Note that we don’t need any special behavior specifically for
initialization. If we later assign a new value into `ptr`, we will still
be
converting the new value to the type `int (*)[< count * 10 >]`, using
the
value computed at the time of declaration of the variable. This model
would
simply require that conversion to validate the bounds during assignment
just
as it would during initialization.
Now, with nested arrays, variance does become a problem. Let’s reduce
bounds expression to their evaluated bounds to make this easier to
write.
- `int (*)[11]` can be converted to `int(*)[10]` because we’re simply
allowing fewer elements to be used.
- By the same token, `int (*(*)[11])[5]` can be converted to
`int (*(*)[10])[5]`. This is the same logic as the above, just with
an
element type that happens to be a pointer to array type.
- But `int (*(*)[11])[5]` cannot be safely converted to `int
(*(*)[11])[4]`,
because while it’s safe to *read* an `int (*)[4]` from this array,
it’s
not safe to assign one into it.
- `int (* const (*)[11])[5]` can be safely converted to
`int (* const (*)[11])[4]`, but only if this dialect also enforces
const-
correctness, at least on array pointers.
Anyway, a lot of this changes if we want to use the same concept for
non-local pointers to arrays, because we no longer have an obvious point
of
execution at which to evaluate the bounds expression. Instead, we are
forced
into re-evaluating it every time we access the variable holding the
array.
Consider:
```
struct X {
int count;
int (*ptr)[count * 10]; // using my preferred syntax
};
void test(struct X *xp) {
// For the purposes of the conversion check here, the
// source type is int (*)[< xp->count * 10 >], freshly
// evaluated as part of the member access.
int (*local)[100] = xp->ptr;
}
```
This has several immediate consequences.
Firstly, we need to already be able to compute the correct bound when we
do
the dynamic checks for assignments into this field. For local variably-
modified types, everything in the expression was already in scope and
presumably initialized, so this wasn’t a problem. Here, we’re not
helped
by scope, and we are dependent on the `count` field already having been
initialized.
Secondly, we must be very concerned about anything that could change the
result of this evaluation. So we cannot allow an arbitrary expression;
it must be something that we can fully analyze for what could change it.
And if refers to variables or fields (which it presumably always will),
we
must prevent assignments to those, or at least validate that any
assignments aren’t causing unsound changes to the bound expression.
Thirdly, that concern must apply non-locally: if we allow the address of
the
pointer field to be taken (which is totally fine in the local case!),
we can no directly reason about mutations through that pointer, so we
have to prevent changes to the bounds variables/fields while the pointer
is
outstanding.
And finally, we must be able to recognize *combinations* of assignments,
because when we’re initializing (or completely rewriting) this
structure,
we will need to able to assign to both `count` and `ptr` and not have
the
same restrictions in place that we would for separate assignments.
None of this falls out naturally from separate, local language rules; it
all has to be invented for the purpose of serving this dynamic check.
And
in fact, `-fbounds-safety` has to do all of this already just to make
basic checks involving pointers in structs work.
If that can all be established, though, I think the
type-conversion-based
approach using variably-modified types has some very nice properties as
a
complement to what we’re doing in `-fbounds-safety`.
For one, it interacts with the `-fbounds-safety` analysis very cleanly.
If
bounds in types are dynamically enforced (which is not true in normal C,
but could be in this dialect), then the type becomes a source for
reliable
reliable information for the bounds-safety analysis. Conversely, if
a pointer is converted to a variably-modified type, the analysis done
by `-bounds-safety` could be used as an input to the conversion check.
For another, I think it may lead towards an cleaner story for arrays of
pointers to arrays than `-fbounds-safety` can achieve today, as long as
the inner arrays are of uniform length.
But ultimately, I think it’s still at best a complement to the
attributes
we need for `-fbounds-safety`.
John.