Re: [PATCH] SVE popcount support

Nathan Bossart Wed, 05 Feb 2025 08:11:26 -0800

On Tue, Feb 04, 2025 at 09:01:33AM +0000, [email protected] wrote:
>> +    /*
>> +     * For smaller inputs, aligning the buffer degrades the performance.
>> +     * Therefore, the buffers only when the input size is sufficiently large.
>> +     */
> 
>> Is the inverse true, i.e., does aligning the buffer improve performance for
>> larger inputs?  I'm also curious what level of performance degradation you
>> were seeing.
> 
> Here is a comparison of all three cases. Alignment is marginally better for
> inputs above 1024B, but the difference is small. Unaligned performs better
> for smaller inputs.
> Aligned After 128B => the current implementation "if (aligned != buf && bytes > 4 * vec_len)"
> Always Aligned => condition "bytes > 4 * vec_len" is removed.
> Unaligned => the whole if block was removed
> 
>  buf    | Always Aligned | Aligned After 128B | Unaligned
> --------+---------------+--------------------+------------
>    16   |       37.851  |           38.203   |     34.971
>    32   |       37.859  |           38.187   |     34.972
>    64   |       37.611  |           37.405   |     34.121
>   128   |       45.357  |           45.897   |     41.890
>   256   |       62.440  |           63.454   |     58.666
>   512   |      100.120  |          102.767   |     99.861
>  1024   |      159.574  |          158.594   |    164.975
>  2048   |      282.354  |          281.198   |    283.937
>  4096   |      532.038  |          531.068   |    533.699
>  8192   |     1038.973  |         1038.083   |   1039.206
> 16384   |     2028.604  |         2025.843   |   2033.940


Hm.  These results are so similar that I'm tempted to suggest we just
remove the section of code dedicated to alignment.  Is there any reason not
to do that?

+       /* Process 2 complete vectors */
+       for (; i < loop_bytes; i += vec_len * 2)
+       {
+               vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i)), mask64);
+               accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+               vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i + vec_len)), mask64);
+               accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+       }

Does this hand-rolled loop unrolling offer any particular advantage?  What
do the numbers look like if we don't do this or if we process, say, 4
vectors at a time?
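
For reference, the 4-at-a-time shape in question can be sketched in portable
scalar C. This is only an illustration, not the patch's SVE code: the name
popcount_unrolled4 is hypothetical, 64-bit words stand in for SVE vectors, and
__builtin_popcountll assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical scalar stand-in for a 4-way unrolled popcount loop.
 * Each 64-bit word plays the role of one vector; the four accumulators
 * are independent so the additions do not form a single dependency chain.
 */
static uint64_t
popcount_unrolled4(const char *buf, size_t bytes)
{
	const size_t word_len = sizeof(uint64_t);
	/* Largest prefix that holds a whole number of 4-word groups. */
	size_t		loop_bytes = bytes - bytes % (4 * word_len);
	uint64_t	accum1 = 0, accum2 = 0, accum3 = 0, accum4 = 0;
	size_t		i = 0;

	/* Process 4 complete words per iteration */
	for (; i < loop_bytes; i += 4 * word_len)
	{
		uint64_t	w[4];

		/* memcpy avoids any alignment requirement on buf */
		memcpy(w, buf + i, sizeof(w));
		accum1 += (uint64_t) __builtin_popcountll(w[0]);
		accum2 += (uint64_t) __builtin_popcountll(w[1]);
		accum3 += (uint64_t) __builtin_popcountll(w[2]);
		accum4 += (uint64_t) __builtin_popcountll(w[3]);
	}

	/* Process any remaining bytes one at a time */
	for (; i < bytes; i++)
		accum1 += (uint64_t) __builtin_popcount((unsigned char) buf[i]);

	return accum1 + accum2 + accum3 + accum4;
}
```

Whether the extra accumulators help depends on the hardware and compiler, so
the 1-, 2-, and 4-vector variants would each need to be benchmarked.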

-- 
nathan
