On Tue, Feb 04, 2025 at 09:01:33AM +0000, chiranmoy.bhattacha...@fujitsu.com wrote:
>> +	/*
>> +	 * For smaller inputs, aligning the buffer degrades the performance.
>> +	 * Therefore, the buffers only when the input size is sufficiently large.
>> +	 */
>
>> Is the inverse true, i.e., does aligning the buffer improve performance for
>> larger inputs?  I'm also curious what level of performance degradation you
>> were seeing.
>
> Here is a comparison of all three cases. Alignment is marginally better for
> inputs above 1024B, but the difference is small. Unaligned performs better
> for smaller inputs.
>
> Aligned After 128B => the current implementation "if (aligned != buf && bytes > 4 * vec_len)"
> Always Aligned     => condition "bytes > 4 * vec_len" is removed.
> Unaligned          => the whole if block was removed
>
>  buf    | Always Aligned | Aligned After 128B | Unaligned
> --------+----------------+--------------------+-----------
>  16     |         37.851 |             38.203 |    34.971
>  32     |         37.859 |             38.187 |    34.972
>  64     |         37.611 |             37.405 |    34.121
>  128    |         45.357 |             45.897 |    41.890
>  256    |         62.440 |             63.454 |    58.666
>  512    |        100.120 |            102.767 |    99.861
>  1024   |        159.574 |            158.594 |   164.975
>  2048   |        282.354 |            281.198 |   283.937
>  4096   |        532.038 |            531.068 |   533.699
>  8192   |       1038.973 |           1038.083 |  1039.206
>  16384  |       2028.604 |           2025.843 |  2033.940

Hm.  These results are so similar that I'm tempted to suggest we just
remove the section of code dedicated to alignment.  Is there any reason not
to do that?

+	/* Process 2 complete vectors */
+	for (; i < loop_bytes; i += vec_len * 2)
+	{
+		vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i)), mask64);
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+		vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i + vec_len)), mask64);
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+	}

Does this hand-rolled loop unrolling offer any particular advantage?  What
do the numbers look like if we don't do this or if we process, say, 4
vectors at a time?

-- 
nathan
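
To make the unrolling question concrete, here is a minimal sketch of the three variants being compared (1, 2, and 4 words per iteration), written in portable scalar C rather than SVE so it builds anywhere. It is not the code from the patch: the function names (popcount_masked_x1/x2/x4), the 64-bit word standing in for an SVE vector, and the byte-at-a-time tail are all assumptions made for illustration. It only shows the loop shapes and that they agree; it says nothing about which is faster.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable SWAR popcount of one 64-bit word (no compiler builtins assumed). */
static inline uint64_t
popcount64(uint64_t x)
{
	x = x - ((x >> 1) & 0x5555555555555555ULL);
	x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
	x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
	return (x * 0x0101010101010101ULL) >> 56;
}

/* Load 8 bytes without assuming the pointer is aligned. */
static inline uint64_t
load64(const unsigned char *p)
{
	uint64_t	w;

	memcpy(&w, p, sizeof(w));
	return w;
}

/* Hypothetical helper: count the remaining bytes one at a time. */
static uint64_t
popcount_tail(const unsigned char *buf, size_t i, size_t bytes,
			  unsigned char mask)
{
	uint64_t	accum = 0;

	for (; i < bytes; i++)
		accum += popcount64((uint64_t) (buf[i] & mask));
	return accum;
}

/* No unrolling: one word per iteration, one accumulator. */
uint64_t
popcount_masked_x1(const unsigned char *buf, size_t bytes, unsigned char mask)
{
	uint64_t	mask64 = ~UINT64_C(0) / 0xFF * mask;	/* mask in every byte */
	uint64_t	accum = 0;
	size_t		i = 0;

	assert(buf != NULL || bytes == 0);
	for (; i + 8 <= bytes; i += 8)
		accum += popcount64(load64(buf + i) & mask64);
	return accum + popcount_tail(buf, i, bytes, mask);
}

/* Two words per iteration with independent accumulators. */
uint64_t
popcount_masked_x2(const unsigned char *buf, size_t bytes, unsigned char mask)
{
	uint64_t	mask64 = ~UINT64_C(0) / 0xFF * mask;
	uint64_t	accum1 = 0, accum2 = 0;
	size_t		i = 0;

	assert(buf != NULL || bytes == 0);
	for (; i + 16 <= bytes; i += 16)
	{
		accum1 += popcount64(load64(buf + i) & mask64);
		accum2 += popcount64(load64(buf + i + 8) & mask64);
	}
	return accum1 + accum2 + popcount_tail(buf, i, bytes, mask);
}

/* Four words per iteration with independent accumulators. */
uint64_t
popcount_masked_x4(const unsigned char *buf, size_t bytes, unsigned char mask)
{
	uint64_t	mask64 = ~UINT64_C(0) / 0xFF * mask;
	uint64_t	a1 = 0, a2 = 0, a3 = 0, a4 = 0;
	size_t		i = 0;

	assert(buf != NULL || bytes == 0);
	for (; i + 32 <= bytes; i += 32)
	{
		a1 += popcount64(load64(buf + i) & mask64);
		a2 += popcount64(load64(buf + i + 8) & mask64);
		a3 += popcount64(load64(buf + i + 16) & mask64);
		a4 += popcount64(load64(buf + i + 24) & mask64);
	}
	return a1 + a2 + a3 + a4 + popcount_tail(buf, i, bytes, mask);
}
```

Each variant keeps one accumulator per word processed in the loop body, mirroring the accum1/accum2 split in the quoted hunk, so the additions in one iteration do not depend on each other. Whether that matters in practice would have to be measured on the target hardware with the actual SVE code.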