Dear GNU coreutils maintainers,

While experimenting with the wc -l AVX code, I seem to have found a way to both speed it up (~10%) and simplify it (13 insertions, 43 deletions), at least on the files of several million to 1 billion rows that I tested on my CPU.
The change mostly consists of using _mm256_movemask_epi8 and __builtin_popcount in place of the two-accumulator handling, which also allowed me to increase the buffer size.

I also have a further ~10% improvement from using 2 separate threads instead of 1 to mitigate the user-time overhead, although that is naturally more complicated.

Whom should I discuss this potential contribution with?

Best wishes,
Evgeny
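P.S. To make the idea concrete, here is a minimal sketch of the movemask + popcount approach. It is not the actual patch: the function names count_newlines and count_newlines_avx2 are made up for illustration, it assumes GCC or Clang (for the target attribute and the __builtin_* functions), and it leaves out the buffer handling and the threading entirely.

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical helper: count '\n' bytes 32 at a time with AVX2.
   The target attribute lets this compile without -mavx2. */
__attribute__((target("avx2")))
static size_t count_newlines_avx2(const char *buf, size_t len)
{
  const __m256i newline = _mm256_set1_epi8('\n');
  size_t lines = 0;
  size_t i = 0;

  for (; i + 32 <= len; i += 32) {
    __m256i chunk = _mm256_loadu_si256((const __m256i *)(buf + i));
    /* Each byte equal to '\n' becomes 0xFF, all others 0x00. */
    __m256i eq = _mm256_cmpeq_epi8(chunk, newline);
    /* Collect the top bit of every byte into a 32-bit mask... */
    unsigned int mask = (unsigned int)_mm256_movemask_epi8(eq);
    /* ...and count the set bits, one per newline in this chunk. */
    lines += (size_t)__builtin_popcount(mask);
  }

  /* Scalar tail for the last len % 32 bytes. */
  for (; i < len; i++)
    lines += (buf[i] == '\n');

  return lines;
}
#endif

/* Hypothetical entry point: use AVX2 when the CPU has it,
   otherwise fall back to a plain byte loop. */
size_t count_newlines(const char *buf, size_t len)
{
#if defined(__x86_64__) || defined(__i386__)
  if (__builtin_cpu_supports("avx2"))
    return count_newlines_avx2(buf, len);
#endif
  size_t lines = 0;
  for (size_t i = 0; i < len; i++)
    lines += (buf[i] == '\n');
  return lines;
}
```

Since the popcount result goes straight into a size_t total, there are no per-lane vector counters to flush periodically, which is the simplification I had in mind above.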