On Tue, 30 Jul 2024, Andi Kleen wrote:
> > I have looked at this code before. When AVX2 is available, so is SSSE3,
> > and then a much more efficient approach is available: instead of comparing
> > against \r \n \\ ? one-by-one, build a vector
> > 
> >   0  1  2  3  4  5  6  7  8  9    a   b    c     d   e   f
> > { 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, '\n', 0, '\\', '\r', 0, '?' }
> > 
> > where each character C we're seeking is at position (C % 16). Then
> > you can match against them all at once using PSHUFB:
> > 
> >   t = _mm_shuffle_epi8 (lut, data);
> >   t = t == data;
> 
> I thought the PSHUFB trick only worked for some bit patterns?
> 
> At least according to this paper: https://arxiv.org/pdf/1902.08318
> 
> But yes if it applies here it's a good idea.

I wouldn't mention it if it did not apply.
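
It applies because the four bytes we are looking for have distinct low
nibbles (0xa, 0xc, 0xd, 0xf), so each one gets its own slot in the table.
For concreteness, here is a minimal sketch of the lookup quoted above. It
is not the libcpp routine: the helper name match_mask_16 is made up for
illustration, the target attribute assumes GCC or Clang on x86, and it
reads one unaligned 16-byte block and returns a bitmask rather than
scanning a whole buffer.

```c
#include <stddef.h>
#include <string.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */

/* Hypothetical helper for illustration only.  Returns a 16-bit mask with
   bit i set when p[i] is one of '\n', '\r', '\\' or '?'.  */
__attribute__ ((target ("ssse3")))
static unsigned
match_mask_16 (const unsigned char *p)
{
  /* Each sought character C sits at index (C % 16).  Index 0 holds 1 so
     that a NUL byte (which selects index 0) does not compare equal.  */
  const __m128i lut = _mm_setr_epi8 (1, 0, 0, 0, 0, 0, 0, 0,
                                     0, 0, '\n', 0, '\\', '\r', 0, '?');
  __m128i data = _mm_loadu_si128 ((const __m128i *) p);

  /* PSHUFB: for each byte of DATA, pick lut[byte & 15], or 0 if the
     byte has its top bit set.  */
  __m128i t = _mm_shuffle_epi8 (lut, data);

  /* A byte matches only if the table entry equals the byte itself.  */
  t = _mm_cmpeq_epi8 (t, data);

  return (unsigned) _mm_movemask_epi8 (t);
}
```

Bytes with the top bit set yield 0 from PSHUFB and therefore never compare
equal to themselves, and bytes that merely share a low nibble with a sought
character (0x1a, say) fail the final comparison.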

> > As you might recognize this handily beats the fancy SSE4.1 loop as well.
> > I did not pursue this because I did not measure a substantial improvement
> > (we're way into the land of diminishing returns here) and it seemed like
> > maintainers might not like to be distracted with that, but if we are
> > touching this code, might as well use the more efficient algorithm.
> > I'll be happy to propose a patch if people think it's worthwhile.
> 
> Yes makes sense.

Okay, so what are the next steps here? Can someone who could eventually
supply a review indicate their buy-in for switching our SSE4.1 routine
for the SSSE3 PSHUFB-based one? And then for the 256-bit variant, assuming
it still brings an improvement over the faster PSHUFB scanner?

> (of course it would be even better to teach the vectorizer about it,
> although this will require fixing some other issues first, see PR116126)

(I disagree, FWIW)

(and you trimmed the part about XGETBV)

Alexander
