> Another thought: for non-x86 platforms, the SIMD nodes degenerate to
> "simple loop", and looping over up to 32 elements is not great
> (although possibly okay). We could do binary search, but that has bad
> branch prediction.

I am not convinced that SIMD / vector instructions would go unused on the relevant non-x86 platforms (though it would be a good idea to verify). Do you know of any modern platforms that do not have SIMD?

I would definitely test before assuming binary search is better. Other approaches, such as a counting search, are often much faster on such small vectors when the vector fits in cache (or even in a single cache line). Because you always visit all items, this completely avoids branch mispredictions and lets the compiler vectorize and/or unroll the loop as needed.

Cheers
Hannu
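
To illustrate the counting search mentioned above, here is a minimal sketch. It assumes a sorted array of one-byte keys with at most 32 entries; the function names and the key type are hypothetical and not taken from any patch. In a sorted array, the number of elements smaller than the key is also the index of the first element that is not smaller, so the loop needs no data-dependent branch or early exit.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: index of the first element >= key in a sorted
 * array, computed by counting the elements that are < key.
 * Every element is visited, so the loop body has no data-dependent
 * branch and the compiler is free to unroll or vectorize it.
 */
static inline size_t
counting_lower_bound(const uint8_t *keys, size_t n, uint8_t key)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        count += (keys[i] < key);   /* comparison yields 0 or 1 */

    return count;
}

/*
 * Hypothetical helper: exact-match lookup built on the count above.
 * Returns the index of key, or -1 if it is not present.
 */
static inline int
counting_find(const uint8_t *keys, size_t n, uint8_t key)
{
    size_t idx = counting_lower_bound(keys, n, key);

    return (idx < n && keys[idx] == key) ? (int) idx : -1;
}
```

Whether this actually beats binary search or a plain early-exit loop on a given platform would still have to be measured.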