> On 30 Jul 2024, at 19:01, Andi Kleen <a...@firstfloor.org> wrote:
>
> External email: Use caution opening links or attachments
>
>
>> Is that from some kind of rigorous measurement under perf? As you
>> surely know, 0.6% wall-clock time can be from boost clock variation
>> or just run-to-run noise on x86.
>
> I compared it using hyperfine which does rigorous measurements yes.
> It was well above the run-to-run variability.
FWIW when I was experimenting with these paths I found that an -fsyntax-only
compilation helps make the changes more pronounced.
Thanks,
Kyrill
>
> I had some other patches that didn't meet that bar, e.g.
> i've been experimenting with more modern hashes for inchash
> and multiple ggc free lists, but so far no above noise
> results.
>
>>
>> I have looked at this code before. When AVX2 is available, so is SSSE3,
>> and then a much more efficient approach is available: instead of comparing
>> against \r \n \\ ? one-by-one, build a vector
>>
>> 0 1 2 3 4 5 6 7 8 9 a b c d e f
>> { 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, '\n', 0, '\\', '\r', 0, '?' }
>>
>> where each character C we're seeking is at position (C % 16). Then
>> you can match against them all at once using PSHUFB:
>>
>> t = _mm_shuffle_epi8 (lut, data);
>> t = t == data;
>
> I thought the PSHUFB trick only worked for some bit patterns?
>
> At least according to this paper: https://arxiv.org/pdf/1902.08318
>
> But yes if it applies here it's a good idea.
>
>
>>
>> As you might recognize this handily beats the fancy SSE4.1 loop as well.
>> I did not pursue this because I did not measure a substantial improvement
>> (we're way into the land of diminishing returns here) and it seemed like
>> maintainers might not like to be distracted with that, but if we are
>> touching this code, might as well use the more efficient algorithm.
>> I'll be happy to propose a patch if people think it's worthwhile.
>
> Yes makes sense.
>
> (of course it would be even better to teach the vectorizer about it,
> although this will require fixing some other issues first, see PR116126)
>
> -Andi