> On 30.07.2024 at 19:22, Alexander Monakov <amona...@ispras.ru> wrote:
>
>
> On Tue, 30 Jul 2024, Andi Kleen wrote:
>>> I have looked at this code before. When AVX2 is available, so is SSSE3,
>>> and then a much more efficient approach is available: instead of comparing
>>> against \r \n \\ ? one-by-one, build a vector
>>>
>>>   0  1  2  3  4  5  6  7  8  9  a     b  c     d     e  f
>>> { 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, '\n', 0, '\\', '\r', 0, '?' }
>>>
>>> where each character C we're seeking is at position (C % 16). Then
>>> you can match against them all at once using PSHUFB:
>>>
>>> t = _mm_shuffle_epi8 (lut, data);
>>> t = t == data;
>>
>> I thought the PSHUFB trick only worked for some bit patterns?
>>
>> At least according to this paper: https://arxiv.org/pdf/1902.08318
>>
>> But yes if it applies here it's a good idea.
>
> I wouldn't mention it if it did not apply.
>
>>> As you might recognize this handily beats the fancy SSE4.1 loop as well.
>>> I did not pursue this because I did not measure a substantial improvement
>>> (we're way into the land of diminishing returns here) and it seemed like
>>> maintainers might not like to be distracted with that, but if we are
>>> touching this code, might as well use the more efficient algorithm.
>>> I'll be happy to propose a patch if people think it's worthwhile.
>>
>> Yes makes sense.
>
> Okay, so what are the next steps here? Can someone who could eventually
> supply a review indicate their buy-in for switching our SSE4.1 routine
> for the SSSE3 PSHUFB-based one? And then for the 256-bit variant, assuming
> it still brings an improvement over the faster PSHUFB scanner?
I’ll happily approve such a change.
>> (of course it would be even better to teach the vectorizer about it,
>> although this will require fixing some other issues first, see PR116126)
>
> (I disagree, FWIW)
I also think writing optimized code with intrinsics is fine.
Richard
> (and you trimmed the part about XGETBV)
>
> Alexander
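
For readers following along, the PSHUFB lookup quoted above can be sketched
with SSSE3 intrinsics roughly as follows. This is only an illustration, not
the libcpp routine or the patch under discussion: the helper name
find_special_16 and its single 16-byte-block interface are assumptions, and
the target attribute and __builtin_ctz assume GCC or Clang on x86.

```c
#include <stddef.h>
#include <tmmintrin.h> /* SSSE3 intrinsics, including _mm_shuffle_epi8 (PSHUFB) */

/* Hypothetical helper for illustration: returns the index of the first
   '\n', '\r', '\\' or '?' among the 16 bytes at p, or 16 if none occurs.  */
__attribute__ ((target ("ssse3")))
static unsigned
find_special_16 (const unsigned char *p)
{
  /* lut[C % 16] == C for each sought character C.  Slot 0 holds 1 so that
     a zero byte does not match itself; the value 1 has low nibble 1, so it
     can never produce a match either.  */
  const __m128i lut = _mm_setr_epi8 (1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                                     '\n', 0, '\\', '\r', 0, '?');
  __m128i data = _mm_loadu_si128 ((const __m128i *) p);

  /* PSHUFB selects lut[byte & 15] for each data byte and yields 0 when the
     byte's top bit is set, so bytes >= 0x80 never compare equal below.  */
  __m128i t = _mm_shuffle_epi8 (lut, data);
  t = _mm_cmpeq_epi8 (t, data);

  /* One mask bit per byte; the lowest set bit marks the first match.  */
  unsigned mask = (unsigned) _mm_movemask_epi8 (t);
  return mask ? (unsigned) __builtin_ctz (mask) : 16;
}
```

A real scanner would loop this over the buffer and handle alignment and the
tail; the point here is only that one shuffle plus one compare replaces four
separate compares against '\n', '\r', '\\' and '?'.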