> Is that from some kind of rigorous measurement under perf? As you
> surely know, 0.6% wall-clock time can be from boost clock variation
> or just run-to-run noise on x86.

Yes, I compared it using hyperfine, which does rigorous measurements
over repeated runs. The difference was well above the run-to-run
variability.

I had some other patches that didn't meet that bar, e.g. I've been
experimenting with more modern hashes for inchash and with multiple
ggc free lists, but so far nothing has shown results above the noise.

> 
> I have looked at this code before. When AVX2 is available, so is SSSE3,
> and then a much more efficient approach is available: instead of comparing
> against \r \n \\ ? one-by-one, build a vector
> 
>   0  1  2  3  4  5  6  7  8  9    a   b    c     d   e   f
> { 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, '\n', 0, '\\', '\r', 0, '?' }
> 
> where each character C we're seeking is at position (C % 16). Then
> you can match against them all at once using PSHUFB:
> 
>   t = _mm_shuffle_epi8 (lut, data);
>   t = t == data;

I thought the PSHUFB trick only worked for some bit patterns?

At least according to this paper: https://arxiv.org/pdf/1902.08318

But yes, if it applies here, it's a good idea.
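
As a quick sanity check on that question, here is a scalar C model of
the lookup, not the real SIMD loop and not code from libcpp. It assumes
the usual per-byte PSHUFB behaviour (a set top bit in the index gives 0,
otherwise the low four bits select a table byte). The names pshufb_lane,
lut_match, direct_match and count_mismatches are made up for this
sketch. It compares the table-based test against the one-by-one
comparison for all 256 byte values:

```c
#include <stdint.h>

/* Lookup table from the quoted mail: each sought character C sits at
   index (C % 16).  Slot 0 holds 1 so that a NUL byte does not match.  */
static const uint8_t lut[16] = {
  1, 0, 0, 0, 0, 0, 0, 0, 0, 0, '\n', 0, '\\', '\r', 0, '?'
};

/* Scalar model of one PSHUFB byte lane (assumed semantics): a set top
   bit in the index yields 0, otherwise the low four bits select a
   table byte.  */
static uint8_t
pshufb_lane (const uint8_t *table, uint8_t index)
{
  return (index & 0x80) ? 0 : table[index & 0x0f];
}

/* Model of "t = _mm_shuffle_epi8 (lut, data); t = t == data;"
   for a single byte.  */
static int
lut_match (uint8_t c)
{
  return pshufb_lane (lut, c) == c;
}

/* Reference: the one-by-one comparison against the four characters.  */
static int
direct_match (uint8_t c)
{
  return c == '\n' || c == '\r' || c == '\\' || c == '?';
}

/* Count the byte values on which the two tests disagree.  */
static int
count_mismatches (void)
{
  int bad = 0;
  for (int c = 0; c < 256; c++)
    bad += lut_match ((uint8_t) c) != direct_match ((uint8_t) c);
  return bad;
}
```

In this model the single-table form works whenever the sought
characters have distinct low nibbles, which holds here (0xa, 0xc, 0xd,
0xf). Two characters sharing a low nibble would need a different
scheme.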


> 
> As you might recognize this handily beats the fancy SSE4.1 loop as well.
> I did not pursue this because I did not measure a substantial improvement
> (we're way into the land of diminishing returns here) and it seemed like
> maintainers might not like to be distracted with that, but if we are
> touching this code, might as well use the more efficient algorithm.
> I'll be happy to propose a patch if people think it's worthwhile.

Yes, makes sense.

(Of course it would be even better to teach the vectorizer about it,
although that will require fixing some other issues first, see PR116126.)

-Andi
