On Fri, 5 Jan 2024 07:03:34 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:

>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5307:
>>
>>> 5305:   assert(bt == T_LONG || bt == T_DOUBLE, "");
>>> 5306:   vmovmskpd(rtmp, mask, vec_enc);
>>> 5307:   shlq(rtmp, 5);
>>
>> Might this need to be 6? If I understand it right, you want a 64-bit stride, hence 2^6, right?
>> If that is correct, then this did not show up in your tests, and you need a regression test anyway.
>
> This computes the byte offset from the start of the table; both the integer and the long permute table have the same row size, 8 int elements vs. 4 long elements.

Ah, I understand now: both rows are 32 bytes (8 × 4 bytes and 4 × 8 bytes), hence the shift by 5. Maybe leave a comment for that? (A sketch of the arithmetic follows after the links below.)

>> test/micro/org/openjdk/bench/jdk/incubator/vector/ColumnFilterBenchmark.java line 76:
>>
>>> 74:     longinCol = new long[size];
>>> 75:     longoutCol = new long[size];
>>> 76:     lpivot = size / 2;
>>
>> I'd be interested to see what happens if you move the "density" of accepted elements up or down. Would simple branch prediction be faster if the density is low enough, i.e. if we accept almost no elements?
>>
>> Though maybe that is not a compiler problem but a user problem?
>
> Included fuzzy filter micro with varying mask density.
>
> [image]

You are using `VectorMask<Integer> pred = VectorMask.fromLong(ispecies, maskctr++);`. That systematically iterates over all masks, which is nice for a correctness test. But it means the density varies within a single test run, right? And the average over the loop is still at `50%`, correct? I was thinking more of a run where the percentage over the whole loop stays below maybe `1%`. That would get us to the point where the branch prediction of the non-vectorized code might be faster, what do you think? (A sketch of such a low-density setup follows after the links below.)

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1442670411
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1442676633
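
For reference on the first comment, a minimal sketch of the offset arithmetic in plain Java rather than the HotSpot macro assembler; the class and method names are made up for illustration, not from the PR:

```java
// Sketch only: models the byte-offset computation done by `shlq(rtmp, 5)`.
public class PermuteTableOffset {
    // Both permute tables use 32-byte rows:
    //   int  table: 8 elements * 4 bytes = 32 bytes
    //   long table: 4 elements * 8 bytes = 32 bytes
    // so the byte offset of row `mask` is mask * 32 = mask << 5.
    // A shift by 6 (a 64-byte stride) would only be right if a row
    // held 8 long elements instead of 4.
    static long rowByteOffset(long mask) {
        return mask << 5;
    }

    public static void main(String[] args) {
        // mask 0b0101 selects row 5, which starts at byte 5 * 32 = 160
        System.out.println(rowByteOffset(0b0101)); // prints 160
    }
}
```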
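
And on the density question, a rough sketch of how a fixed-density variant of the micro could pre-generate its mask words; the names (`LowDensityMasks`, `maskWords`, `density`) are hypothetical and not from the benchmark in the PR. Running it needs `--add-modules jdk.incubator.vector`:

```java
import java.util.Random;

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

// Sketch: pre-generate mask bit patterns where each lane is set with a
// fixed probability (e.g. 0.01), instead of counting through all mask
// values, which averages out to 50% set lanes over the loop.
public class LowDensityMasks {
    static final VectorSpecies<Integer> ISPECIES = IntVector.SPECIES_PREFERRED;

    static long[] maskWords(int count, double density, long seed) {
        Random rnd = new Random(seed); // seeded, so runs are reproducible
        long[] words = new long[count];
        for (int i = 0; i < count; i++) {
            long w = 0;
            for (int lane = 0; lane < ISPECIES.length(); lane++) {
                if (rnd.nextDouble() < density) {
                    w |= 1L << lane; // set this lane with probability `density`
                }
            }
            words[i] = w;
        }
        return words;
    }

    public static void main(String[] args) {
        long[] words = maskWords(1024, 0.01, 42L); // ~1% of lanes accepted
        // In the benchmark loop one would then build the predicate from a
        // pre-generated word instead of an incrementing counter:
        VectorMask<Integer> pred = VectorMask.fromLong(ISPECIES, words[0]);
        System.out.println(pred.trueCount() + " lanes set in the first mask");
    }
}
```

With the density pinned near 1% over the whole run, the branchy scalar loop and the vectorized filter could be compared at exactly the operating point where branch prediction should do best.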