On 31/03/2024 13:12, Pádraig Brady wrote:
On 31/03/2024 00:18, Evgeny Nizhibitsky wrote:
Here is a proposed patch that both simplifies and consistently speeds up
the AVX version of wc -l by 10% in scenarios of up to 1 billion rows on a
7800X3D (it should probably be tested on different data samples and CPUs).

The patch was mangled, but I manually applied it.
Probably best to attach rather than pasting any further patches.
Attaching here in case others want to try.

This is good as it simplifies the code,
and it should be just as portable across machines and compilers.
I'll adjust the configure.ac check to be more aligned.

As for performance, I tested on my laptop with no change:

    # on an i7-5600U with 1 billion short lines
    $ yes | head -n1000000000 > /dev/shm/yes

    $ time src/wc-old -l /dev/shm/yes
    1000000000 /dev/shm/yes
    real    0m0.351s
    user    0m0.060s
    sys     0m0.288s

    $ time src/wc-new -l /dev/shm/yes
    1000000000 /dev/shm/yes
    real    0m0.356s
    user    0m0.098s
    sys     0m0.255s

Since you change the I/O size from 16 to 256 KiB,
it's more aligned with the recent I/O size adjustment in:
https://github.com/coreutils/coreutils/commit/fcfba90d0
In fact perhaps much of the speedup is just from that change.
Can you test on your system with the buffer reduced back to 16KiB
to see how much that impacts the performance?
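
For anyone wanting to isolate the buffer-size effect on its own, here is a
minimal sketch of a newline counter with a caller-chosen read size. This is
not the wc code and not the patch: the names count_newlines and
count_lines_fd are hypothetical, and it uses plain memchr rather than the
AVX path.

```c
/* Hypothetical sketch, not the coreutils implementation:
   count newlines from a file descriptor with a configurable read size.  */
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Count '\n' bytes in buf[0..len) using memchr.  */
uintmax_t
count_newlines (const char *buf, size_t len)
{
  uintmax_t lines = 0;
  const char *p = buf;
  const char *end = buf + len;

  while ((p = memchr (p, '\n', (size_t) (end - p))) != NULL)
    {
      lines++;
      p++;                      /* resume just past the newline found */
    }
  return lines;
}

/* Read FD to EOF in chunks of at most BUFSIZE bytes and store the
   newline count in *LINES_OUT.  Return 0 on success, -1 on error.  */
int
count_lines_fd (int fd, size_t bufsize, uintmax_t *lines_out)
{
  if (bufsize == 0)
    return -1;

  char *buf = malloc (bufsize);
  if (!buf)
    return -1;

  uintmax_t lines = 0;
  ssize_t n;
  for (;;)
    {
      n = read (fd, buf, bufsize);
      if (n < 0 && errno == EINTR)
        continue;               /* interrupted by a signal: retry */
      if (n <= 0)
        break;                  /* EOF (0) or a real error (-1) */
      lines += count_newlines (buf, (size_t) n);
    }

  free (buf);
  if (n < 0)
    return -1;
  *lines_out = lines;
  return 0;
}
```

Calling count_lines_fd (STDIN_FILENO, 16 * 1024, &n) and then again with
256 * 1024 on the same input should give a rough view of how much the read
size alone matters, independent of the counting logic in wc.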

Oh I see you commented in the code that the 10-15% speed-up
was due to the buffer size change.

Further testing on my i7-5600U laptop shows that the 16 -> 256 KiB change
improves performance by about 5%.  On the other hand, the new logic
is about 5% slower on my laptop, cancelling out any win.

The new code is simpler though, so still a win in that regard.
I'll test on a few more platforms for comparison.

cheers,
Pádraig
