> On 30.07.2024 at 19:22, Alexander Monakov <amona...@ispras.ru> wrote:
> 
> 
> On Tue, 30 Jul 2024, Andi Kleen wrote:
>>> I have looked at this code before. When AVX2 is available, so is SSSE3,
>>> and then a much more efficient approach is available: instead of comparing
>>> against \r \n \\ ? one-by-one, build a vector
>>> 
>>>  0  1  2  3  4  5  6  7  8  9    a   b    c     d   e   f
>>> { 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, '\n', 0, '\\', '\r', 0, '?' }
>>> 
>>> where each character C we're seeking is at position (C % 16). Then
>>> you can match against them all at once using PSHUFB:
>>> 
>>>  t = _mm_shuffle_epi8 (lut, data);
>>>  t = t == data;
>> 
>> I thought the PSHUFB trick only worked for some bit patterns?
>> 
>> At least according to this paper: https://arxiv.org/pdf/1902.08318
>> 
>> But yes if it applies here it's a good idea.
> 
> I wouldn't mention it if it did not apply.
> 
>>> As you might recognize this handily beats the fancy SSE4.1 loop as well.
>>> I did not pursue this because I did not measure a substantial improvement
>>> (we're way into the land of diminishing returns here) and it seemed like
>>> maintainers might not like to be distracted with that, but if we are
>>> touching this code, might as well use the more efficient algorithm.
>>> I'll be happy to propose a patch if people think it's worthwhile.
>> 
>> Yes makes sense.
> 
> Okay, so what are the next steps here? Can someone who could eventually
> supply a review indicate their buy-in for switching our SSE4.1 routine
> for the SSSE3 PSHUFB-based one? And then for the 256-bit variant, assuming
> it still brings an improvement over the faster PSHUFB scanner?

I’ll happily approve such a change.

>> (of course it would be even better to teach the vectorizer about it,
>> although this will require fixing some other issues first, see PR116126)
> 
> (I disagree, FWIW)

I also think writing optimized code with intrinsics is fine.

Richard 

> (and you trimmed the part about XGETBV)
> 
> Alexander
