On 17.06.2024 13:18, Andrew Cooper wrote: > On 17/06/2024 10:54 am, Jan Beulich wrote: >> On 14.06.2024 19:07, Andrew Cooper wrote: >>> More fallout from looking at the code generation... >>> >>> for_each_set_bit() forces it's bitmap parameter out into memory. For an >>> arbitrary sized bitmap, this is fine - and likely preferable as it's an >>> in-memory to begin with. >>> >>> However, more than half the current users of for_each_set_bit() are >>> operating over an single int/long, and this too is spilled to the >>> stack. Worse, x86 seems to be the only architecture which (tries, but >>> not very well) to optimise find_{first,next}_bit() for GPR-sized >>> quantities, meaning that for_each_set_bit() hides 2 backing function calls. >>> >>> The ARM (v)GIC code in particular suffers horribly because of this. >>> >>> We also have several interesting opencoded forms: >>> * evtchn_check_pollers() is a (preprocessor identical) opencoding. >>> * hvm_emulate_writeback() is equivalent. >>> * for_each_vp() exists just to hardcode a constant and swap the other >>> two parameters. >>> >>> and several others forms which I think could be expressed more cleanly >>> as for_each_set_bit(). >> I agree. >> >>> We also have the while()/ffs() forms which are "just" for_each_set_bit() >>> and some even manage to not spill their main variable to memory. >>> >>> >>> I want to get to a position where there is one clear API to use, and >>> that the compiler will handle nicely. Xen's code generation will >>> definitely improve as a consequence. >>> >>> >>> Sadly, transforming the ideal while()/ffs() form into a for() loop is a >>> bit tricky. This works: >>> >>> for ( unsigned int v = (val), (bit); >>> v; >>> v &= v - 1 ) >>> if ( 1 ) >>> { >>> (bit) = ffs(v) - 1; >>> goto body; >>> } >>> else >>> body: >>> >>> which is a C metaprogramming trick borrowed from PuTTY to make: >>> >>> for_each_BLAH ( bit, val ) >>> { >>> // nice loop body >>> } >>> >>> work, while having the ffs() calculated logically within the loop body. >> What's wrong with >> >> #define for_each_set_bit(iter, val) \ >> for ( unsigned int v_ = (val), iter; \ >> v_ && ((iter) = ffs(v_) - 1, true); \ >> v_ &= v_ - 1 ) >> >> ? I'll admit though that it's likely a matter of taste which one is >> "uglier". Yet I'd be in favor of avoiding the scope trickery. > > Oh, of course. > > FWIW, Frediano pointed out a form without the goto, but this is better > still. > > It's certainly less bad than having to also explain the metaprogramming > to get scope working nicely. > > >>> The first issue I expect people to have with the above is the raw 'body' >>> label, although with a macro that can be fixed using body_ ## __COUNTER__. >>> >>> A full example is https://godbolt.org/z/oMGfah696 although a real >>> example in Xen is going to have to be variadic for at least ffs() and >>> ffsl(). >> How would variadic-ness help with this? Unless we play some type >> trickery (like typeof((val) + 0U), thus yielding at least an unsigned, >> but an unsigned long if the incoming value is such, followed by a >> compile-time conditional operator to select between ffs() and ffsl()), >> I don't think we'd get away with just a single construct for both the >> int and long (for Arm32: long long) cases. > > Ideally we want _Generic() but we can't have that yet. > > In lieu of that, I was thinking a chain of __builtin_choose_expr() > instantiating variants for int/long/long-long.
Is __builtin_choose_expr() going to be a win over the conditional operator? All forms of ffs*() return unsigned int, so the main difference between the two is not interesting here. > The complication is that we need a double for() loop for the > long/long-long in order to declare the iterator as strictly unsigned > int.Without this, a CLTQ gets emitted to extend the result of the ffs*() > call. This https://godbolt.org/z/Px88EWdPb ought to do. I'll probably > end up expressing that as __for_each_set_bit() taking in a type derived > from typeof(), and an ffs*() variant to use. The double for isn't very nice, but what do you do. If it's to be kept as a helper, then for for_each_set_bit_uint() I'd suggest, however, to avoid typeof(), and use unsigned int directly. I'm not sure though that helper constructs are really going to be needed, other than to perhaps help readability. >>> Now, from an API point of view, it would be lovely if we could make a >>> single for_each_set_bit() which covers both cases, and while I can >>> distinguish the two forms by whether there are 2 or 3 args, >> With the 3-argument form specifying the number of bits in the 3rd arg? >> I'd fear such mixed uses may end up confusing. >> >>> I expect >>> MISRA is going to have a fit at that. Also there's a difference based >>> on the scope of 'bit' and also whether modifications to 'val' in the >>> loop body take effect on the loop condition (they don't because a copy >>> is taken). >>> >>> So I expect everyone is going to want a new API to use here. But what >>> to call it? >>> >>> More than half of the callers in Xen really want the GPR form, so we >>> could introduce a new bitmap_for_each_set_bit(), move all the callers >>> over, then introduce a "new" for_each_set_bit() which is only of the GPR >>> form. >>> >>> Or does anyone want to suggest an alternative name? >> I'd be okay-ish with those, maybe with slight shortening to bitmap_for_each() >> or bitmap_for_each_set(). > > Lets go with bitmap_for_each(). While we've got some examples looking > for the first clear bit, I don't believe we've got any examples wanting > to loop over all clear bits. > > Are you happy repurposing for_each_set_bit() to be a 2-argument macro > operating strictly on a single GPR? As to re-purposing - yes, I think so. The construct is semantically different enough from what we have right now that, during backporting, I believe the compiler will flag any unconverted uses. The "single GPR" aspect worries me, though: I think we want it to be "scalar", permitting Arm32 to also operate on long long / uint64_t. Eventually, if need be, we could even introduce a uint128_t form (backed by a suitable ffs128()). Jan