Re: [RFC] [C]New syntax for the argument of counted_by attribute for C language

Qing Zhao Fri, 28 Mar 2025 09:05:42 -0700


> On Mar 28, 2025, at 08:51, Yeoul Na <yeoul...@apple.com> wrote:
> 
> 
> 
>> On Mar 27, 2025, at 9:17 AM, Qing Zhao <qing.z...@oracle.com> wrote:
>> 
>> Yeoul,
>> 
>> Thanks for the writeup.
>> 
>> So, basically, this writeup insists on introducing a new “structure scope” 
>> (similar to the instance scope in C++) into the C language ONLY for the 
>> counted_by attribute:
>> 
>> 1. Inside counted_by attribute, the name lookup starts:
>> 
>>    A. Inside the current structure first (the NEW structure scope added to 
>> C);
>>    B. Then outside the structure; (other current C scopes, local scope or 
>> global scope)
>> 
>> 2. When trying to reference a variable outside of the structure scope whose 
>> name conflicts with
>>    a structure member, a new builtin function “__builtin_global_ref” is 
>> introduced for that 
>>    purpose.
>> 
>>   (I think that the name __builtin_global_ref might not be accurate, because 
>> the outer scope might be either global scope or local scope)
> 
> 
> Clarification: __builtin_global_ref will see the global scope directly. This 
> is similar to global scope resolution syntax (‘::’) in C++.


Yes, that’s my thought too. 

Then you still need another builtin to refer to a local variable with the 
same name as the structure member. For example, 
in the example below, if the “len” inside counted_by is meant to refer to 
“const int len = 20”, how do you specify this?
> 
> constexpr int len = 10;
> 
> void foo (void)
> {
>   const int len = 20;
> 
>   struct s {
>     int len;
>     int *__counted_by(__builtin_global_ref(len)) buf; // refers to global ‘len’
>   };
> }
> 
> Here are some reasons why we chose to provide a global scope resolution 
> builtin, not a builtin to see an outer scope or just a local scope:
> 
> 1) The builtin is a substitute for some “scope resolution specifier”. Scope 
> specifiers are typically meant to choose a “specific” scope.
> 2) To the best of my knowledge there is no precedent in any other C family 
> language for providing scope resolution for local scopes.

However, there is a possibility that in the above example, the “len” is meant 
to refer to the local variable len, not the global one. How do you specify that?
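
To make the question concrete, here is a minimal sketch in standard C (not from the writeup) of how a block-scope ‘len’ hides the file-scope one. The function names are hypothetical, and plain “const int” is used instead of constexpr so that it builds before C23. The second function shows the one standard way I know of to reach the hidden file-scope object, a block-scope extern declaration; it says nothing about how counted_by itself should behave.

```c
const int len = 10;             /* file-scope 'len'; external linkage in C */

/* A plain use of the name finds the nearest enclosing declaration. */
int len_by_plain_name(void)
{
    const int len = 20;         /* shadows the file-scope 'len' */
    return len;                 /* the block-scope object */
}

/* A block-scope extern declaration names the file-scope object again,
   even though a local 'len' is visible at that point. */
int len_via_block_extern(void)
{
    const int len = 20;
    (void)len;                  /* silence the unused-variable warning */
    {
        extern const int len;   /* refers to the file-scope 'len' */
        return len;
    }
}
```

There is no corresponding way to go from a structure member back to the local ‘len’, which is the gap a local-scope builtin would have to fill.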

> 3) Name conflicts with local variables can be easily renamed.

Then more source code changes in different places are needed, and I am not sure 
whether this is easy to do in some cases. 

> 4) If we provide a builtin that selects outer scope instead, there is no way 
> to choose a global ‘len' if it’s shadowed by a local variable, so then the 
> member name has to be renamed anyway in order to choose a global `len`. 

Yes, that’s true. So maybe two builtins are needed?

> 5) This way, code can be written compatibly both in C and C++.
> 
>> 
>> 3. When there is a conflict between counted_by and VLA such as:
>> 
>> constexpr int len = 10;
>> 
>> struct s {
>>  int len;
>>  int *__counted_by(len) buf; // refers to struct member `len`.
>>  int arr[len]; // refers to global constexpr `len`
>> };
>> 
>> Issue a compiler warning asking the user to use __builtin_global_ref 
>> to disambiguate. 
> 
> Additionally, our proposal suggests __builtin_member_ref to explicitly use a 
> member in a similar situation.
> The builtin could be replaced by ‘__self' or some other syntax once the 
> standard committee decides in the future, but earlier in the thread JeanHeyd 
> pointed out that:
> 
> "I would like to gently push back about __self__, or __self, or self, because 
> all of these identifiers are fairly common identifiers in code. When I 
> was writing the paper for __self_func ( 
> https://thephd.dev/_vendor/future_cxx/papers/C%20-%20__self_func.html ), I 
> searched GitHub and other source code indexing and repository services: 
> __self, __self__, and self has a substantial amount of uses. If there's an 
> alternative spelling to consider, I think that would be helpful."
> 
> Thus, I think instead of trying to stick to a certain syntax right now, using 
> some builtin will allow us to easily migrate to a new syntax by guarding the 
> current usage under a macro.
> 
> Writing the builtin could be cumbersome but this shall be written only when 
> there is an ambiguity. Btw, I’m open to any other name suggestions for the 
> builtins!
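
The macro-guard idea can be sketched with what compilers accept today. The proposed builtins are not available, so this sketch (an assumption of mine, not part of the proposal) guards the existing counted_by attribute on a flexible array member instead; COUNTED_BY, struct packet and the helper functions are hypothetical names. The same pattern would let a later spelling be swapped in by changing only the macro.

```c
#include <stdlib.h>

/* Expand to the attribute only where the compiler reports support for it. */
#if defined(__has_attribute)
#  if __has_attribute(counted_by)
#    define COUNTED_BY(member) __attribute__((counted_by(member)))
#  endif
#endif
#ifndef COUNTED_BY
#  define COUNTED_BY(member)    /* unsupported: expands to nothing */
#endif

struct packet {
    size_t len;                 /* number of elements in data */
    int data[] COUNTED_BY(len); /* flexible array member bounded by len */
};

/* Allocates a packet with n zeroed elements; len is set before data is used.
   Overflow of the size computation is not checked in this sketch. */
struct packet *packet_new(size_t n)
{
    struct packet *p = malloc(sizeof *p + n * sizeof p->data[0]);
    if (p != NULL) {
        p->len = n;
        for (size_t i = 0; i < n; i++)
            p->data[i] = 0;
    }
    return p;
}

/* Returns the length stored in a fresh packet, or (size_t)-1 on failure. */
size_t packet_len_roundtrip(size_t n)
{
    struct packet *p = packet_new(n);
    size_t seen = (p != NULL) ? p->len : (size_t)-1;
    free(p);
    return seen;
}
```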

I think that it’s better to stick to one approach:

A. Add a new keyword “__self” or a builtin such as __builtin_self() to 
explicitly refer to the member variable, and keep everything else unchanged. 

OR:

B. Add one new instance scope to C and look up the name inside the new scope 
first, then in the outer scopes. To refer to variables outside the instance 
scope, use a newly added “scope resolution specifier”, such as __builtin_global_… 
or __builtin_local_…, for that purpose.
     For B, fix the VLA inside the structure to have the same lookup rule as 
counted_by. 


Anything mixing these two is not good to me...
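
For reference, the current C lookup rule for an array bound inside a structure can be checked with a small sketch. It uses an enum constant rather than constexpr so that it builds before C23, and arr_member_elems is a hypothetical helper name. Members live in the structure's own name space, so the bound resolves to the file-scope ‘len’; as far as I know a C++ compiler would instead find the member and reject the declaration.

```c
#include <stddef.h>

enum { len = 10 };      /* ordinary identifier at file scope */

struct s {
    int len;            /* member: not visible to ordinary name lookup */
    int arr[len];       /* standard C: 'len' is the enum constant above */
};

/* Number of elements in the arr member of struct s. */
size_t arr_member_elems(void)
{
    struct s obj;
    return sizeof obj.arr / sizeof obj.arr[0];
}
```

With a structure scope for counted_by only, the same spelling ‘len’ would name the member in one declaration and the file-scope constant in the next.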
> 
>> 
>> Are the above the correct understanding of your writeup?
> 
> Yes, it’s mostly correct, except some clarifications I made above. Thank you!

Thank you for the clarifications.

Qing
> 
>> 
>> 
>> From my understanding:
>> 
>> 1. This design started from the C++’s point of view by adding a new 
>> “structure scope” to C;
>> 2. This design conflicts with the current VLA default scope rule (which 
>> is based on the default C scopes) in C.
>>     In the above example that mixes counted_by and VLA, it’s weird that 
>> there are two different name
>>     lookup rules inside the same structure. 
>>     It’s clearly a design bug. Either VLA or counted_by needs to be fixed to 
>> make them consistent. 
>> 
>> 
>> I personally do not completely object to introducing a new “structure scope” 
>> into C, but it’s so hard for me to accept
>> that there are two different name lookup rules inside the same structure: 
>> one rule for VLA, another rule for counted_by
>> attribute.  (If introducing a new “structure scope” to C,  I think it’s 
>> better to change VLA to “structure scope” too, not sure
>> whether this is feasible or not)
>> 
>> I still think that introducing a new keyword “__self” for referring to a 
>> member variable inside the structure, without adding 
>> a new “structure scope”, would be the best approach to resolve this issue in 
>> C. 
>> 
>> However, I am really hoping that the discussion can converge soon. So, I 
>> am okay with adding a new “structure scope”
>> if most people agree on that approach. 
> 
> Thanks for the flexibility!
> 
>> 
>> Qing
>> 
>> 
>>> On Mar 26, 2025, at 12:59, Yeoul Na <yeoul...@apple.com> wrote:
>>> 
>>> Hi all,
>>> 
>>> Thanks for all the discussions.
>>> 
>>> I posted the design rationale for our current approach in 
>>> https://discourse.llvm.org/t/rfc-forward-referencing-a-struct-member-within-bounds-annotations/85510.
>>>  This clarifies some of the questions that were asked in this thread. The 
>>> document also proposes diagnostics to mitigate potential ambiguity, and 
>>> proposes new builtins that can be used as a suppression and disambiguation 
>>> mechanism.
>>> 
>>> Best regards,
>>> Yeoul
>>> 
>>>> On Mar 26, 2025, at 9:11 AM, Yeoul Na <yeoul...@apple.com> wrote:
>>>> 
>>>> Sorry for the delay.
>>>> 
>>>> I’m planning on sending out our design rationale of the current approach 
>>>> without the new syntax today.
>>>> 
>>>> - Yeoul
>>>> 
>>>>> On Mar 14, 2025, at 9:22 PM, John McCall <rjmcc...@apple.com> wrote:
>>>>> 
>>>>> On 14 Mar 2025, at 15:18, Martin Uecker wrote:
>>>>> On Friday, 14 March 2025 at 14:42 -0400, John McCall wrote:
>>>>> On 14 Mar 2025, at 14:13, Martin Uecker wrote:
>>>>> On Friday, 14 March 2025 at 10:11 -0700, David Tarditi wrote:
>>>>> Hi Martin,
>>>>> The C design of VLAs misunderstood dependent typing.
>>>>> They probably did not care about theory, but the design is 
>>>>> not inconsistent with theory.
>>>>> This is almost true, but for bad reasons. The theory of dependent types 
>>>>> is heavily concerned with deciding whether two types are the same, and C 
>>>>> simply sidesteps this question because type identity is largely 
>>>>> meaningless in C. Every value of variably-modified type is (or decays to) 
>>>>> a pointer, and all pointers in C freely convert to one another (within 
>>>>> the object/function categories). _Generic is based on type compatibility, 
>>>>> not equality. So in that sense, the standard doesn’t say anything 
>>>>> inconsistent with theory because it doesn’t even try to say anything.
>>>>> The reason it is not quite true is that C does have rules for compatible 
>>>>> and composite types, and alas, those rules for variably-modified types 
>>>>> are not consistent with theory. Two VLA types of compatible element type 
>>>>> are always statically considered compatible, and it’s simply UB if the 
>>>>> sizes aren’t the same. The composite type of a VLA and a fixed-size array 
>>>>> type is always the fixed-size array type. The standard is literally 
>>>>> incomplete about the composite type of two VLAs; if you use a ternary 
>>>>> operator where both operands are casts to VLA types, the standard just 
>>>>> says it’s straight-up just undefined behavior (because one of the types 
>>>>> has a bound that’s unevaluated) and doesn’t even bother telling us what 
>>>>> the static type is supposed to be.
>>>>> Yes, I guess this is all true.
>>>>> But let's rephrase my point a bit more precisely: One could take 
>>>>> a strict subset of C that includes variably modified types but 
>>>>> obviously has to forbid a lot other things (e.g. arbitrary pointer 
>>>>> conversions or unsafe down-casts and much more) and make this a 
>>>>> memory-safe language with dependent types. This would also 
>>>>> require adding run-time checks at certain places where there 
>>>>> is now UB, in particular where two VLA types need to be compatible.
>>>>> Mmm. You can certainly subset C to the point that it’s memory-safe, but
>>>>> it wouldn’t really be anything like C anymore. As long as C has a heap,
>>>>> I don’t see any path to achieving temporal safety without significant
>>>>> extensions to the language. But if we’re just talking about spatial safety,
>>>>> then sure, that could be a lot closer to C today.
>>>>> Is that your vision, then, that you’d like to see the same sort of checks
>>>>> that -fbounds-safety does, but you want them based firmly in the language
>>>>> as a dynamic check triggered by pointer type conversion, with bounds
>>>>> specified using variably-modified types? It’s a pretty elegant vision, and
>>>>> I can see the attraction. It has some real merits, which I’ll get to below.
>>>>> I do see at least two significant challenges, though.
>>>>> The first and biggest problem is that, in general, array bounds can only be
>>>>> expressed on a pointer value if it’s got pointer to array type. Most C array
>>>>> code today works primarily with pointers to elements; programmers just use
>>>>> array types to create concrete arrays, and they very rarely use pointers to
>>>>> array type at all. There are a bunch of reasons for that:
>>>>>    • Pointers to arrays have to be dereferenced twice: (*ptr)[idx] instead
>>>>> of ptr[idx].
>>>>>    • That makes them more error-prone, because it is easy to do pointer
>>>>> arithmetic at the wrong level, e.g. by writing ptr[idx], which will
>>>>> stride by multiples of the entire array size. That may even pass the
>>>>> compiler without complaint because of C’s laxness about conversions.
>>>>>    • Keeping the bound around in the pointer type is more work and doesn’t do
>>>>> anything useful right now.
>>>>>    • A lot of C programmers dislike nested declarator syntax and can’t remember
>>>>> how it works. Those of us who can write it off the top of our heads are
>>>>> quite atypical.
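
A small sketch of the first two points in the list above, using hypothetical helper names and a fixed bound of 4. It shows the double dereference needed to read through a pointer to array, and that arithmetic on such a pointer moves by the size of the whole array rather than by one element.

```c
#include <stddef.h>

enum { ROW_LEN = 4 };

/* Element access through a pointer to array needs two steps:
   *ptr yields the array, [idx] then selects the element. */
int read_through_array_pointer(int (*ptr)[ROW_LEN], size_t idx)
{
    return (*ptr)[idx];
}

/* Distance in bytes covered by ptr + 1: one whole array, not one int.
   Writing ptr[idx] by mistake would stride in these units. */
size_t array_pointer_stride(int (*ptr)[ROW_LEN])
{
    return (size_t)((char *)(ptr + 1) - (char *)ptr);
}

/* Small drivers so the helpers can be exercised without extra setup. */
int demo_read(size_t idx)
{
    int row[ROW_LEN] = { 10, 20, 30, 40 };
    return read_through_array_pointer(&row, idx);
}

size_t demo_stride(void)
{
    int row[ROW_LEN] = { 0 };
    return array_pointer_stride(&row);
}
```
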
>>>>> Now, there is an exception: you can write a parameter using an array type,
>>>>> and it actually declares a pointer parameter. You could imagine using this
>>>>> as a syntax for an enforceable array bound for arguments, although the
>>>>> committee did already decide that these bounds were meaningless without
>>>>> static. Unfortunately, you can’t do this in any other position and still
>>>>> end up with just a pointer, so it’s not helpful as a general syntax for
>>>>> associating bounds with pointers.
>>>>> The upshot is that this isn’t really something people can just adopt by
>>>>> adding annotations. It’s not just a significant rewrite, it’s a rewrite that
>>>>> programmers will have very legitimate objections to. I think that makes this
>>>>> at best a complement to the “sidecar” approach taken by -fbounds-safety
>>>>> where we can track top-level bounds to a specific pointer value.
>>>>> The second problem is that there are some extralingual problems that
>>>>> -fbounds-safety has to solve around bounds that aren’t just local
>>>>> evaluations of bounds expressions, and a type-conversion-driven approach
>>>>> doesn’t help with any of them.
>>>>> As you mentioned, the design of variably-modified types is based on
>>>>> evaluating the bounds expression at some specific point in the program
>>>>> execution. Since these types can only be written locally, the evaluation
>>>>> point is obvious. If we wanted to dynamically enforce bounds during
>>>>> initialization, it would simply be another use of the same computed bound:
>>>>> int count = ...;
>>>>> int (*ptr)[count * 10] = source_ptr;
>>>>> 
>>>>> Here we would evaluate count * 10 exactly once and use it both as (1) part
>>>>> of the destination type when initializing ptr with source_ptr and (2)
>>>>> part of the type of ptr for all uses of it. For example, if source_ptr
>>>>> were of type int (*)[100], we would dynamically check that
>>>>> count * 10 <= 100. This all works perfectly with an arbitrary bounds
>>>>> expression; it could even contain an opaque function call.
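
The check described above is not performed by any C compiler today, so the following is only a hand-written sketch of what it would amount to, assuming a compiler that supports variably-modified types. The function names and the fixed source size of 100 are assumptions for illustration. The bound is evaluated once, validated against the source array, and then used as part of the pointer's type.

```c
#include <stddef.h>

enum { SOURCE_ELEMS = 100 };

/* Sums the first count * 10 ints of *source_ptr into *out.
   Returns 1 on success, 0 if the requested bound does not fit. */
int sum_with_checked_bound(int (*source_ptr)[SOURCE_ELEMS], int count, long *out)
{
    /* Manual stand-in for the dynamic check count * 10 <= 100.
       A bound of zero is rejected because array types need a positive size. */
    if (count < 1 || count > SOURCE_ELEMS / 10)
        return 0;

    const size_t bound = (size_t)count * 10;        /* evaluated exactly once */
    int (*ptr)[bound] = (int (*)[bound])source_ptr; /* variably-modified type */

    long sum = 0;
    for (size_t i = 0; i < bound; i++)
        sum += (*ptr)[i];
    *out = sum;
    return 1;
}

/* Driver: fills a 100-element array with ones and returns the sum,
   or -1 when the bound check rejects the request. */
long demo_sum(int count)
{
    int source[SOURCE_ELEMS];
    long sum = 0;
    for (int i = 0; i < SOURCE_ELEMS; i++)
        source[i] = 1;
    return sum_with_checked_bound(&source, count, &sum) ? sum : -1;
}
```
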
>>>>> Note that we don’t need any special behavior specifically for
>>>>> initialization. If we later assign a new value into ptr, we will still be
>>>>> converting the new value to the type int (*)[< count * 10 >], using the
>>>>> value computed at the time of declaration of the variable. This model would
>>>>> simply require that conversion to validate the bounds during assignment just
>>>>> as it would during initialization.
>>>>> Now, with nested arrays, variance does become a problem. Let’s reduce
>>>>> bounds expressions to their evaluated bounds to make this easier to write.
>>>>>    • int (*)[11] can be converted to int(*)[10] because we’re simply
>>>>> allowing fewer elements to be used.
>>>>>    • By the same token, int (*(*)[11])[5] can be converted to
>>>>> int (*(*)[10])[5]. This is the same logic as the above, just with an
>>>>> element type that happens to be a pointer to array type.
>>>>>    • But int (*(*)[11])[5] cannot be safely converted to int (*(*)[11])[4],
>>>>> because while it’s safe to read an int (*)[4] from this array, it’s
>>>>> not safe to assign one into it.
>>>>>    • int (* const (*)[11])[5] can be safely converted to
>>>>> int (* const (*)[11])[4], but only if this dialect also enforces const-
>>>>> correctness, at least on array pointers.
>>>>> Anyway, a lot of this changes if we want to use the same concept for
>>>>> non-local pointers to arrays, because we no longer have an obvious point of
>>>>> execution at which to evaluate the bounds expression. Instead, we are forced
>>>>> into re-evaluating it every time we access the variable holding the array.
>>>>> Consider:
>>>>> struct X {
>>>>>   int count;
>>>>>   int (*ptr)[count * 10]; // using my preferred syntax
>>>>> };
>>>>> 
>>>>> void test(struct X *xp) {
>>>>>   // For the purposes of the conversion check here, the
>>>>>   // source type is int (*)[< xp->count * 10 >], freshly
>>>>>   // evaluated as part of the member access.
>>>>>   int (*local)[100] = xp->ptr;
>>>>> }
>>>>> 
>>>>> This has several immediate consequences.
>>>>> Firstly, we need to already be able to compute the correct bound when we do
>>>>> the dynamic checks for assignments into this field. For local
>>>>> variably-modified types, everything in the expression was already in scope and
>>>>> presumably initialized, so this wasn’t a problem. Here, we’re not helped
>>>>> by scope, and we are dependent on the count field already having been
>>>>> initialized.
>>>>> Secondly, we must be very concerned about anything that could change the
>>>>> result of this evaluation. So we cannot allow an arbitrary expression;
>>>>> it must be something that we can fully analyze for what could change it.
>>>>> And if it refers to variables or fields (which it presumably always will), we
>>>>> must prevent assignments to those, or at least validate that any
>>>>> assignments aren’t causing unsound changes to the bound expression.
>>>>> Thirdly, that concern must apply non-locally: if we allow the address of the
>>>>> pointer field to be taken (which is totally fine in the local case!),
>>>>> we can no longer directly reason about mutations through that pointer, so we
>>>>> have to prevent changes to the bounds variables/fields while the pointer is
>>>>> outstanding.
>>>>> And finally, we must be able to recognize combinations of assignments,
>>>>> because when we’re initializing (or completely rewriting) this structure,
>>>>> we will need to be able to assign to both count and ptr and not have the
>>>>> same restrictions in place that we would for separate assignments.
>>>>> None of this falls out naturally from separate, local language rules; it
>>>>> all has to be invented for the purpose of serving this dynamic check. And
>>>>> in fact, -fbounds-safety has to do all of this already just to make
>>>>> basic checks involving pointers in structs work.
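
The requirements above can be approximated by hand in plain C, which may help show what a compiler would have to enforce automatically. This is a sketch under my own assumptions, not how -fbounds-safety is implemented: struct counted_ints and its helpers are hypothetical, and the caller supplies the real capacity. The pointer and its count are only ever updated together, and every read is checked against the stored count.

```c
#include <stddef.h>

struct counted_ints {
    size_t count;   /* number of valid elements reachable through ptr */
    int *ptr;
};

/* Assigns ptr and count as one combined update. Returns 0 and leaves the
   structure unchanged if count exceeds the capacity the caller vouches for. */
int counted_ints_assign(struct counted_ints *x, int *ptr, size_t count,
                        size_t capacity)
{
    if (ptr == NULL ? count != 0 : count > capacity)
        return 0;
    x->ptr = ptr;
    x->count = count;
    return 1;
}

/* Bounds-checked read: returns 0 instead of reading out of range. */
int counted_ints_read(const struct counted_ints *x, size_t idx, int *out)
{
    if (idx >= x->count)
        return 0;
    *out = x->ptr[idx];
    return 1;
}

/* Driver: returns the element, -1 for a rejected read,
   or -2 for a rejected assignment. */
int demo_counted_read(size_t count, size_t idx)
{
    static int storage[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
    struct counted_ints x = { 0, NULL };
    int value = 0;
    if (!counted_ints_assign(&x, storage, count, 8))
        return -2;
    if (!counted_ints_read(&x, idx, &value))
        return -1;
    return value;
}
```
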
>>>>> If that can all be established, though, I think the type-conversion-based
>>>>> approach using variably-modified types has some very nice properties as a
>>>>> complement to what we’re doing in -fbounds-safety.
>>>>> For one, it interacts with the -fbounds-safety analysis very cleanly. If
>>>>> bounds in types are dynamically enforced (which is not true in normal C,
>>>>> but could be in this dialect), then the type becomes a source for reliable
>>>>> information for the bounds-safety analysis. Conversely, if
>>>>> a pointer is converted to a variably-modified type, the analysis done
>>>>> by -fbounds-safety could be used as an input to the conversion check.
>>>>> For another, I think it may lead towards a cleaner story for arrays of
>>>>> pointers to arrays than -fbounds-safety can achieve today, as long as
>>>>> the inner arrays are of uniform length.
>>>>> But ultimately, I think it’s still at best a complement to the attributes
>>>>> we need for -fbounds-safety.
>>>>> John.
>>>> 
>>> 
>> 
> 
> Yeoul
