> Hm. These results are so similar that I'm tempted to suggest we just
> remove the section of code dedicated to alignment. Is there any reason not
> to do that?

It seems that the double-load overhead from unaligned memory access isn't too
taxing, even on larger inputs. We can remove the alignment code to simplify
things.

> Does this hand-rolled loop unrolling offer any particular advantage? What
> do the numbers look like if we don't do this or if we process, say, 4
> vectors at a time?

The version unrolled x2 outperforms the non-unrolled one, by close to 2x for
buf values of 1024 and larger, but processing four vectors per iteration
provides no additional benefit. The numbers, and the loop used for the x4
case, are given below.

 buf  | Not Unrolled | Unrolled x2 | Unrolled x4
------+--------------+-------------+-------------
   16 |        4.774 |       4.759 |       5.634
   32 |        6.872 |       6.486 |       7.348
   64 |       11.070 |      10.249 |      10.617
  128 |       20.003 |      16.205 |      16.764
  256 |       40.234 |      28.377 |      29.108
  512 |       83.825 |      53.420 |      53.658
 1024 |      191.181 |     101.677 |     102.727
 2048 |      389.160 |     200.291 |     201.544
 4096 |      785.742 |     404.593 |     399.134
 8192 |     1587.226 |     811.314 |     810.961

/* Process 4 vectors */
for (; i < loop_bytes; i += vec_len * 4)
{
    vec64_1 = svld1(pred, (const uint64 *) (buf + i));
    accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64_1));
    vec64_2 = svld1(pred, (const uint64 *) (buf + i + vec_len));
    accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64_2));
    vec64_3 = svld1(pred, (const uint64 *) (buf + i + 2 * vec_len));
    accum3 = svadd_x(pred, accum3, svcnt_x(pred, vec64_3));
    vec64_4 = svld1(pred, (const uint64 *) (buf + i + 3 * vec_len));
    accum4 = svadd_x(pred, accum4, svcnt_x(pred, vec64_4));
}

-Chiranmoy