Yes, it's true that the simplification and the speed-up from the larger buffer size are two different things, although the former is what enabled the latter.

I have now run more tests with hyperfine for several configurations, covering the current master version and the new approach with buffer sizes from 16 KiB up to 1 MiB. The inputs were 1 billion lines produced by yes, as in your test (1by), a file generated for the recent 1 Billion Row Challenge (1brc, with entries like "<station name>;<temperature:0.2f>"), and the first 100 million rows of each (100my and 100mrc, respectively), all placed in /dev/shm, again on the 7800X3D.
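Roughly, the setup can be reproduced as follows (a minimal sketch: the file names are illustrative, the 1brc input comes from a separate 1BRC-style generator, and the hyperfine invocation is just one reasonable choice rather than the exact command I used):

  # keep the inputs in tmpfs so the disk stays out of the measurements
  $ yes | head -n 1000000000 > /dev/shm/1by
  $ head -n 100000000 /dev/shm/1by > /dev/shm/100my
  # 1brc / 100mrc: "<station name>;<temperature>" lines from a 1BRC-style generator

  # compare two builds on one input; hyperfine prints mean ± σ per command
  $ hyperfine --warmup 3 'src/wc-old -l /dev/shm/1by' 'src/wc-new -l /dev/shm/1by'

The mean ± σ values from runs like these are what the tables below quote.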
The reported timings are as follows:

| version | 100my | 100mrc | 1by | 1brc |
| ------- | ------- | ------- | ------- | ------- |
| master | 21.3 ms ± 1.0 ms | 163.1 ms ± 1.5 ms | 197.1 ms ± 3.0 ms | 1.680 s ± 0.010 s |
| 16 KiB | 21.0 ms ± 1.1 ms | 163.7 ms ± 2.1 ms | 194.3 ms ± 2.5 ms | 1.658 s ± 0.015 s |
| 32 KiB | 20.2 ms ± 0.7 ms | 158.9 ms ± 3.0 ms | 194.6 ms ± 6.4 ms | 1.620 s ± 0.023 s |
| 64 KiB | 19.8 ms ± 0.6 ms | 154.0 ms ± 5.3 ms | 187.5 ms ± 7.2 ms | 1.553 s ± 0.013 s |
| 128 KiB | 18.8 ms ± 0.6 ms | 148.9 ms ± 5.4 ms | 178.4 ms ± 1.3 ms | 1.530 s ± 0.013 s |
| 256 KiB | 19.2 ms ± 0.8 ms | 145.8 ms ± 1.5 ms | 176.4 ms ± 1.6 ms | 1.522 s ± 0.017 s |
| 512 KiB | 19.6 ms ± 0.7 ms | 146.4 ms ± 1.0 ms | 183.0 ms ± 5.0 ms | 1.512 s ± 0.014 s |
| 1 MiB | 19.3 ms ± 0.7 ms | 145.7 ms ± 1.8 ms | 188.4 ms ± 6.2 ms | 1.499 s ± 0.012 s |

And the corresponding speed-ups relative to master are as follows:

| version | 100my | 100mrc | 1by | 1brc |
| ------- | ------- | ------- | ------- | ------- |
| master | 0% | 0% | 0% | 0% |
| 16 KiB | 1% | 0% | 1% | 1% |
| 32 KiB | 5% | 3% | 1% | 4% |
| 64 KiB | 8% | 6% | 5% | 8% |
| 128 KiB | 13% | 10% | 10% | 10% |
| 256 KiB | 11% | 12% | 12% | 10% |
| 512 KiB | 9% | 11% | 8% | 11% |
| 1 MiB | 10% | 12% | 5% | 12% |

So again, in my case the new approach at the original 16 KiB buffer is on par with master, while a buffer size of around 256 KiB looks like the sweet spot. More testing on different CPUs and sample files should probably still be done.

Regards,
Evgeny

On Sun, 31 Mar 2024 at 18:17, Pádraig Brady <p...@draigbrady.com> wrote:
> On 31/03/2024 13:12, Pádraig Brady wrote:
> > On 31/03/2024 00:18, Evgeny Nizhibitsky wrote:
> >> Here is the proposed patch for both simplifying and consistently
> >> speeding up the avx version of wc -l by 10% in up to 1 billion rows
> >> scenarios on 7800X3D (probably should be tested on different data
> >> samples and CPUs).
> >
> > The patch was mangled, but I manually applied it.
> > Probably best to attach rather than pasting any further patches.
> > Attaching here in case others want to try.
> >
> > This is good as it simplifies the code,
> > and should have the same portability, to machines and compilers.
> > I'll adjust the configure.ac check to be more aligned.
> >
> > As for performance, I tested on my laptop with no change:
> >
> > # on an i7-5600U with 1 billion short lines
> > $ yes | head -n1000000000 > /dev/shm/yes
> >
> > $ time src/wc-old -l /dev/shm/yes
> > 1000000000 /dev/shm/yes
> > real 0m0.351s
> > user 0m0.060s
> > sys 0m0.288s
> >
> > $ time src/wc-new -l /dev/shm/yes
> > 1000000000 /dev/shm/yes
> > real 0m0.356s
> > user 0m0.098s
> > sys 0m0.255s
> >
> > Since you change the I/O size from 16 to 256 KiB,
> > it's more aligned with the recent I/O size adjustment in:
> > https://github.com/coreutils/coreutils/commit/fcfba90d0
> > In fact perhaps much of the speedup is just from that change.
> > Can you test on your system with the buffer reduced back to 16KiB
> > to see how much that impacts the performance?
>
> Oh I see you commented in the code that the 10-15% speed-up
> was due to the buffer size change.
>
> In testing more on my i7-5600U laptop, shows the 16 -> 256 KiB change
> improved performance by about 5%. On the other hand, the new logic
> is about 5% slower on my laptop, cancelling out any win.
>
> The new code is simpler though, so still a win in that regard.
> I'll test on a few more platforms for comparison.
>
> cheers,
> Pádraig
>