On 17/06/2024 10:54 am, Jan Beulich wrote: > On 14.06.2024 19:07, Andrew Cooper wrote: >> More fallout from looking at the code generation... >> >> for_each_set_bit() forces it's bitmap parameter out into memory. For an >> arbitrary sized bitmap, this is fine - and likely preferable as it's an >> in-memory to begin with. >> >> However, more than half the current users of for_each_set_bit() are >> operating over an single int/long, and this too is spilled to the >> stack. Worse, x86 seems to be the only architecture which (tries, but >> not very well) to optimise find_{first,next}_bit() for GPR-sized >> quantities, meaning that for_each_set_bit() hides 2 backing function calls. >> >> The ARM (v)GIC code in particular suffers horribly because of this. >> >> We also have several interesting opencoded forms: >> * evtchn_check_pollers() is a (preprocessor identical) opencoding. >> * hvm_emulate_writeback() is equivalent. >> * for_each_vp() exists just to hardcode a constant and swap the other >> two parameters. >> >> and several others forms which I think could be expressed more cleanly >> as for_each_set_bit(). > I agree. > >> We also have the while()/ffs() forms which are "just" for_each_set_bit() >> and some even manage to not spill their main variable to memory. >> >> >> I want to get to a position where there is one clear API to use, and >> that the compiler will handle nicely. Xen's code generation will >> definitely improve as a consequence. >> >> >> Sadly, transforming the ideal while()/ffs() form into a for() loop is a >> bit tricky. This works: >> >> for ( unsigned int v = (val), (bit); >> v; >> v &= v - 1 ) >> if ( 1 ) >> { >> (bit) = ffs(v) - 1; >> goto body; >> } >> else >> body: >> >> which is a C metaprogramming trick borrowed from PuTTY to make: >> >> for_each_BLAH ( bit, val ) >> { >> // nice loop body >> } >> >> work, while having the ffs() calculated logically within the loop body. > What's wrong with > > #define for_each_set_bit(iter, val) \ > for ( unsigned int v_ = (val), iter; \ > v_ && ((iter) = ffs(v_) - 1, true); \ > v_ &= v_ - 1 ) > > ? I'll admit though that it's likely a matter of taste which one is > "uglier". Yet I'd be in favor of avoiding the scope trickery.
Oh, of course. FWIW, Frediano pointed out a form without the goto, but this is better still. It's certainly less bad than having to also explain the metaprogramming to get scope working nicely. >> The first issue I expect people to have with the above is the raw 'body' >> label, although with a macro that can be fixed using body_ ## __COUNTER__. >> >> A full example is https://godbolt.org/z/oMGfah696 although a real >> example in Xen is going to have to be variadic for at least ffs() and >> ffsl(). > How would variadic-ness help with this? Unless we play some type > trickery (like typeof((val) + 0U), thus yielding at least an unsigned, > but an unsigned long if the incoming value is such, followed by a > compile-time conditional operator to select between ffs() and ffsl()), > I don't think we'd get away with just a single construct for both the > int and long (for Arm32: long long) cases. Ideally we want _Generic() but we can't have that yet. In lieu of that, I was thinking a chain of __builtin_choose_expr() instantiating variants for int/long/long-long. The complication is that we need a double for() loop for the long/long-long in order to declare the iterator as strictly unsigned int.Without this, a CLTQ gets emitted to extend the result of the ffs*() call. This https://godbolt.org/z/Px88EWdPb ought to do. I'll probably end up expressing that as __for_each_set_bit() taking in a type derived from typeof(), and an ffs*() variant to use. >> Now, from an API point of view, it would be lovely if we could make a >> single for_each_set_bit() which covers both cases, and while I can >> distinguish the two forms by whether there are 2 or 3 args, > With the 3-argument form specifying the number of bits in the 3rd arg? > I'd fear such mixed uses may end up confusing. > >> I expect >> MISRA is going to have a fit at that. Also there's a difference based >> on the scope of 'bit' and also whether modifications to 'val' in the >> loop body take effect on the loop condition (they don't because a copy >> is taken). >> >> So I expect everyone is going to want a new API to use here. But what >> to call it? >> >> More than half of the callers in Xen really want the GPR form, so we >> could introduce a new bitmap_for_each_set_bit(), move all the callers >> over, then introduce a "new" for_each_set_bit() which is only of the GPR >> form. >> >> Or does anyone want to suggest an alternative name? > I'd be okay-ish with those, maybe with slight shortening to bitmap_for_each() > or bitmap_for_each_set(). Lets go with bitmap_for_each(). While we've got some examples looking for the first clear bit, I don't believe we've got any examples wanting to loop over all clear bits. Are you happy repurposing for_each_set_bit() to be a 2-argument macro operating strictly on a single GPR? ~Andrew