On Tue, 30 Jul 2024, Andi Kleen wrote: > > I have looked at this code before. When AVX2 is available, so is SSSE3, > > and then a much more efficient approach is available: instead of comparing > > against \r \n \\ ? one-by-one, build a vector > > > > 0 1 2 3 4 5 6 7 8 9 a b c d e f > > { 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, '\n', 0, '\\', '\r', 0, '?' } > > > > where each character C we're seeking is at position (C % 16). Then > > you can match against them all at once using PSHUFB: > > > > t = _mm_shuffle_epi8 (lut, data); > > t = t == data; > > I thought the PSHUFB trick only worked for some bit patterns? > > At least according to this paper: https://arxiv.org/pdf/1902.08318 > > But yes if it applies here it's a good idea.
I wouldn't mention it if it did not apply. > > As you might recognize this handily beats the fancy SSE4.1 loop as well. > > I did not pursue this because I did not measure a substantial improvement > > (we're way into the land of diminishing returns here) and it seemed like > > maintainers might not like to be distracted with that, but if we are > > touching this code, might as well use the more efficient algorithm. > > I'll be happy to propose a patch if people think it's worthwhile. > > Yes makes sense. Okay, so what are the next steps here? Can someone who could eventually supply a review indicate their buy-in for switching our SSE4.1 routine for the SSSE3 PSHUFB-based one? And then for the 256-bit variant, assuming it still brings an improvement over the faster PSHUFB scanner? > (of course it would be even better to teach the vectorizer about it, > although this will require fixing some other issues first, see PR116126) (I disagree, FWIW) (and you trimmed the part about XGETBV) Alexander