More fallout from looking at the code generation... for_each_set_bit() forces it's bitmap parameter out into memory. For an arbitrary sized bitmap, this is fine - and likely preferable as it's an in-memory to begin with.
However, more than half the current users of for_each_set_bit() are operating over an single int/long, and this too is spilled to the stack. Worse, x86 seems to be the only architecture which (tries, but not very well) to optimise find_{first,next}_bit() for GPR-sized quantities, meaning that for_each_set_bit() hides 2 backing function calls. The ARM (v)GIC code in particular suffers horribly because of this. We also have several interesting opencoded forms: * evtchn_check_pollers() is a (preprocessor identical) opencoding. * hvm_emulate_writeback() is equivalent. * for_each_vp() exists just to hardcode a constant and swap the other two parameters. and several others forms which I think could be expressed more cleanly as for_each_set_bit(). We also have the while()/ffs() forms which are "just" for_each_set_bit() and some even manage to not spill their main variable to memory. I want to get to a position where there is one clear API to use, and that the compiler will handle nicely. Xen's code generation will definitely improve as a consequence. Sadly, transforming the ideal while()/ffs() form into a for() loop is a bit tricky. This works: for ( unsigned int v = (val), (bit); v; v &= v - 1 ) if ( 1 ) { (bit) = ffs(v) - 1; goto body; } else body: which is a C metaprogramming trick borrowed from PuTTY to make: for_each_BLAH ( bit, val ) { // nice loop body } work, while having the ffs() calculated logically within the loop body. The first issue I expect people to have with the above is the raw 'body' label, although with a macro that can be fixed using body_ ## __COUNTER__. A full example is https://godbolt.org/z/oMGfah696 although a real example in Xen is going to have to be variadic for at least ffs() and ffsl(). Now, from an API point of view, it would be lovely if we could make a single for_each_set_bit() which covers both cases, and while I can distinguish the two forms by whether there are 2 or 3 args, I expect MISRA is going to have a fit at that. Also there's a difference based on the scope of 'bit' and also whether modifications to 'val' in the loop body take effect on the loop condition (they don't because a copy is taken). So I expect everyone is going to want a new API to use here. But what to call it? More than half of the callers in Xen really want the GPR form, so we could introduce a new bitmap_for_each_set_bit(), move all the callers over, then introduce a "new" for_each_set_bit() which is only of the GPR form. Or does anyone want to suggest an alternative name? Thoughts? ~Andrew