> Hm.  These results are so similar that I'm tempted to suggest we just
> remove the section of code dedicated to alignment.  Is there any reason not
> to do that?

It seems that the double-load overhead from unaligned memory access isn’t
too taxing, even on larger inputs. We can remove the alignment section to
simplify the code.

> Does this hand-rolled loop unrolling offer any particular advantage?  What
> do the numbers look like if we don't do this or if we process, say, 4
> vectors at a time?

Unrolling to two vectors per iteration performs better than the
non-unrolled loop (nearly twice as fast at the larger buffer sizes), but
processing four vectors per iteration provides no additional benefit. The
numbers and the four-vector loop used are given below.

 buf  | Not Unrolled | Unrolled x2 | Unrolled x4
------+--------------+-------------+-------------
   16 |        4.774 |       4.759 |       5.634
   32 |        6.872 |       6.486 |       7.348
   64 |       11.070 |      10.249 |      10.617
  128 |       20.003 |      16.205 |      16.764
  256 |       40.234 |      28.377 |      29.108
  512 |       83.825 |      53.420 |      53.658
 1024 |      191.181 |     101.677 |     102.727
 2048 |      389.160 |     200.291 |     201.544
 4096 |      785.742 |     404.593 |     399.134
 8192 |     1587.226 |     811.314 |     810.961

/* Process 4 vectors per iteration, one accumulator per vector */
for (; i < loop_bytes; i += vec_len * 4)
{
    vec64_1 = svld1(pred, (const uint64 *) (buf + i));
    accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64_1));
    vec64_2 = svld1(pred, (const uint64 *) (buf + i + vec_len));
    accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64_2));

    vec64_3 = svld1(pred, (const uint64 *) (buf + i + 2 * vec_len));
    accum3 = svadd_x(pred, accum3, svcnt_x(pred, vec64_3));
    vec64_4 = svld1(pred, (const uint64 *) (buf + i + 3 * vec_len));
    accum4 = svadd_x(pred, accum4, svcnt_x(pred, vec64_4));
}
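
For anyone who wants to experiment with the unrolling pattern without SVE
hardware, here is a minimal scalar sketch of the two-accumulator variant. It
is not the patch code: the function name popcount_unrolled2 is hypothetical,
it works on 8-byte words instead of SVE vectors, and it assumes a GCC- or
Clang-compatible compiler for __builtin_popcountll. It also adds a
byte-at-a-time tail loop, which the snippet above does not show.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): count set bits in buf using
 * two independent accumulators, mirroring the "Unrolled x2" loop shape.
 */
static uint64_t
popcount_unrolled2(const char *buf, size_t bytes)
{
    const size_t step = 2 * sizeof(uint64_t);   /* two words per iteration */
    const size_t loop_bytes = bytes - bytes % step;
    uint64_t     accum1 = 0;
    uint64_t     accum2 = 0;
    size_t       i = 0;

    for (; i < loop_bytes; i += step)
    {
        uint64_t w1;
        uint64_t w2;

        /* memcpy keeps the loads valid for unaligned buffers */
        memcpy(&w1, buf + i, sizeof(w1));
        memcpy(&w2, buf + i + sizeof(w1), sizeof(w2));

        /* separate accumulators avoid one long dependency chain */
        accum1 += (uint64_t) __builtin_popcountll(w1);
        accum2 += (uint64_t) __builtin_popcountll(w2);
    }

    /* handle the remaining bytes one at a time */
    for (; i < bytes; i++)
        accum1 += (uint64_t) __builtin_popcount((unsigned char) buf[i]);

    return accum1 + accum2;
}
```

The separate accumulators are the point of the exercise: each add depends
only on its own previous value, so the two chains can overlap. Whether that
helps a scalar loop on a given machine is something to measure, not assume.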

-Chiranmoy
