On 17/06/2024 10:54 am, Jan Beulich wrote:
> On 14.06.2024 19:07, Andrew Cooper wrote:
>> More fallout from looking at the code generation...
>>
>> for_each_set_bit() forces it's bitmap parameter out into memory.  For an
>> arbitrary sized bitmap, this is fine - and likely preferable as it's an
>> in-memory to begin with.
>>
>> However, more than half the current users of for_each_set_bit() are
>> operating over an single int/long, and this too is spilled to the
>> stack.  Worse, x86 seems to be the only architecture which (tries, but
>> not very well) to optimise find_{first,next}_bit() for GPR-sized
>> quantities, meaning that for_each_set_bit() hides 2 backing function calls.
>>
>> The ARM (v)GIC code in particular suffers horribly because of this.
>>
>> We also have several interesting opencoded forms:
>> * evtchn_check_pollers() is a (preprocessor identical) opencoding.
>> * hvm_emulate_writeback() is equivalent.
>> * for_each_vp() exists just to hardcode a constant and swap the other
>> two parameters.
>>
>> and several others forms which I think could be expressed more cleanly
>> as for_each_set_bit().
> I agree.
>
>> We also have the while()/ffs() forms which are "just" for_each_set_bit()
>> and some even manage to not spill their main variable to memory.
>>
>>
>> I want to get to a position where there is one clear API to use, and
>> that the compiler will handle nicely.  Xen's code generation will
>> definitely improve as a consequence.
>>
>>
>> Sadly, transforming the ideal while()/ffs() form into a for() loop is a
>> bit tricky.  This works:
>>
>> for ( unsigned int v = (val), (bit);
>>       v;
>>       v &= v - 1 )
>> if ( 1 )
>> {
>>     (bit) = ffs(v) - 1;
>>     goto body;
>> }
>> else
>>     body:
>>
>> which is a C metaprogramming trick borrowed from PuTTY to make:
>>
>> for_each_BLAH ( bit, val )
>> {
>>     // nice loop body
>> }
>>
>> work, while having the ffs() calculated logically within the loop body.
> What's wrong with
>
> #define for_each_set_bit(iter, val) \
>     for ( unsigned int v_ = (val), iter; \
>           v_ && ((iter) = ffs(v_) - 1, true); \
>           v_ &= v_ - 1 )
>
> ? I'll admit though that it's likely a matter of taste which one is
> "uglier". Yet I'd be in favor of avoiding the scope trickery.

Oh, of course.

FWIW, Frediano pointed out a form without the goto, but this is better
still.

It's certainly less bad than having to also explain the metaprogramming
to get scope working nicely.


>> The first issue I expect people to have with the above is the raw 'body'
>> label, although with a macro that can be fixed using body_ ## __COUNTER__.
>>
>> A full example is https://godbolt.org/z/oMGfah696 although a real
>> example in Xen is going to have to be variadic for at least ffs() and
>> ffsl().
> How would variadic-ness help with this? Unless we play some type
> trickery (like typeof((val) + 0U), thus yielding at least an unsigned,
> but an unsigned long if the incoming value is such, followed by a
> compile-time conditional operator to select between ffs() and ffsl()),
> I don't think we'd get away with just a single construct for both the
> int and long (for Arm32: long long) cases.

Ideally we want _Generic() but we can't have that yet.

In lieu of that, I was thinking a chain of __builtin_choose_expr()
instantiating variants for int/long/long-long.

The complication is that we need a double for() loop for the
long/long-long in order to declare the iterator as strictly unsigned
int.Without this, a CLTQ gets emitted to extend the result of the ffs*()
call. This https://godbolt.org/z/Px88EWdPb ought to do. I'll probably
end up expressing that as __for_each_set_bit() taking in a type derived
from typeof(), and an ffs*() variant to use.
>> Now, from an API point of view, it would be lovely if we could make a
>> single for_each_set_bit() which covers both cases, and while I can
>> distinguish the two forms by whether there are 2 or 3 args,
> With the 3-argument form specifying the number of bits in the 3rd arg?
> I'd fear such mixed uses may end up confusing.
>
>> I expect
>> MISRA is going to have a fit at that.  Also there's a difference based
>> on the scope of 'bit' and also whether modifications to 'val' in the
>> loop body take effect on the loop condition (they don't because a copy
>> is taken).
>>
>> So I expect everyone is going to want a new API to use here.  But what
>> to call it?
>>
>> More than half of the callers in Xen really want the GPR form, so we
>> could introduce a new bitmap_for_each_set_bit(), move all the callers
>> over, then introduce a "new" for_each_set_bit() which is only of the GPR
>> form.
>>
>> Or does anyone want to suggest an alternative name?
> I'd be okay-ish with those, maybe with slight shortening to bitmap_for_each()
> or bitmap_for_each_set().

Lets go with bitmap_for_each().  While we've got some examples looking
for the first clear bit, I don't believe we've got any examples wanting
to loop over all clear bits.

Are you happy repurposing for_each_set_bit() to be a 2-argument macro
operating strictly on a single GPR?

~Andrew

Reply via email to