https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809

--- Comment #7 from Qing Zhao <qing.zhao at oracle dot com> ---
I have studied the inlining of memcmp and str(n)cmp in GCC; the following is a
summary of my study so far:

        1. memcmp differs from str(n)cmp as follows:

        • strcmp compares null-terminated C strings
        • strncmp compares at most N characters of null-terminated C strings
        • memcmp compares binary byte buffers of N bytes.

Of these three, both strcmp and strncmp might stop early at the NUL terminator
of the compared strings.  As a result, we have to compare the chars one by one
for str(n)cmp.

On the other hand, memcmp does NOT stop early at a terminator; it compares
exactly N bytes of both buffers.  As a result, the compiler can compare
multiple bytes at one time.

So, memcmp should be much easier for the compiler to optimize than str(n)cmp.
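
To illustrate the difference, here is a minimal sketch. The helper names
eq8_wordwise and cmp_bytewise are hypothetical (they are not GCC or glibc
functions), and the code only shows the shape of the two strategies, not what
GCC actually emits:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: memcmp (a, b, 8) == 0 done as one 8-byte compare.
   Both buffers are known to hold 8 readable bytes, so a wide load is safe.  */
static int eq8_wordwise (const void *a, const void *b)
{
  uint64_t x, y;
  memcpy (&x, a, sizeof x);   /* memcpy avoids alignment and aliasing issues */
  memcpy (&y, b, sizeof y);
  return x == y;
}

/* Hypothetical helper: strcmp-style loop.  Each byte has to be checked for
   the NUL terminator before the next byte may be read.  */
static int cmp_bytewise (const char *s1, const char *s2)
{
  const unsigned char *a = (const unsigned char *) s1;
  const unsigned char *b = (const unsigned char *) s2;
  while (*a != 0 && *a == *b)
    {
      a++;
      b++;
    }
  return (int) *a - (int) *b;
}
```

The wide load in eq8_wordwise is only valid because the length is known up
front; cmp_bytewise cannot do the same without risking a read past the
terminator.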

        2. when the result of memcmp is only compared against zero, it is
easier to optimize than when its result is compared with other values.
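
A minimal sketch of why the equality-only case is simpler, for a fixed length
of 4 bytes. The names eq4, load_be32 and cmp4 are hypothetical and chosen only
for illustration: for equality any byte order works, while an ordered result
needs the first byte to be the most significant one.

```c
#include <stdint.h>
#include <string.h>

/* memcmp (a, b, 4) == 0: only equality matters, so the bytes can be loaded
   in whatever order the target uses.  */
static int eq4 (const void *a, const void *b)
{
  uint32_t x, y;
  memcpy (&x, a, sizeof x);
  memcpy (&y, b, sizeof y);
  return x == y;
}

/* Assemble 4 bytes so that the first byte is the most significant one.  */
static uint32_t load_be32 (const unsigned char *p)
{
  return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16)
         | ((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/* Sign of memcmp (a, b, 4): the first differing byte has to decide the
   result, so the values are compared in big-endian order.  */
static int cmp4 (const void *a, const void *b)
{
  uint32_t x = load_be32 (a);
  uint32_t y = load_be32 (b);
  return (x > y) - (x < y);
}
```

On a little-endian target the ordered case therefore costs an extra byte swap
(or byte-wise assembly as above) on top of the loads.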

        3. the implementation of
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52171 has been in the latest GCC,
and it optimizes the following cases on ALL platforms:

memcmp (p, "fishi", 5) != 0
__builtin_memcmp (p, "fishiiii", 8) == 0

and the implementation is target independent.  When the length of the constant
string is a multiple of the word size, it is optimized by the "strlen" pass;
when it is NOT a multiple of the word size, it is optimized by the "expand"
pass.

        4. However, the implementation of PR52171 does NOT optimize the
following cases:

A.   memcmp’s result is NOT compared to zero

memcmp (p, "fish", 4);                   
__builtin_memcmp (p, "fishi", 5);

B.  strncmp and strcmp, whether or not the result is compared to zero

strncmp (p, "fi", 2) != 0
__builtin_strncmp (p, "fi", 2) != 0

strcmp (p, "fish") != 0
__builtin_strcmp (p, "fish") != 0

strncmp (p, "fi", 2)                         
__builtin_strncmp (p, "fi", 2)        

strcmp (p, "fish")
__builtin_strcmp (p, "fish")

For the reason I mentioned in 1, strncmp and strcmp can ONLY compare the chars
one by one, so the implementation of PR52171 can NOT be extended to
str(n)cmp.


        5. currently, glibc optimizes strcmp with a constant string up to size
3 in the header, as Wilco mentioned in the Description of the PR.  The
optimization is as follows:

strcmp (p, "fis")

will be transformed to:

D.2234 = __s2_len = 3;, __s2_len <= 3; ? TARGET_EXPR <D.2233, {
      const unsigned char * __s1 = (const unsigned char *) p;
      register int __result = (int) *__s1 - (int) *(const unsigned char *) "fis";

            const unsigned char * __s1 = (const unsigned char *) p;
            register int __result = (int) *__s1 - (int) *(const unsigned char *) "fis";
      {
        if (__s2_len != 0 && __result == 0)
          {
            __result = (int) *(__s1 + 1) - (int) *((const unsigned char *) "fis" + 1);
            if (__s2_len > 1 && __result == 0)
              {
                __result = (int) *(__s1 + 2) - (int) *((const unsigned char *) "fis" + 2);
                if (__s2_len > 2 && __result == 0)
                  {
                    __result = (int) *(__s1 + 3) - (int) *((const unsigned char *) "fis" + 3);
                  }
              }
          }
      }
      D.2233 = __result;
    }> : __builtin_strcmp ((const char *) p, (const char *) "fis");

I.e., each character is compared one by one, and the result of the previous
comparison controls whether the next char is compared.  As a result, comparing
one char needs one load, one compare and one branch, which are not very cheap
operations.  Therefore, we cannot apply the above optimization to longer
strings even though the string is constant.
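
For readability, the expansion above is roughly equivalent to the following
hand-written function. This is only a sketch: the name strcmp_fis is
hypothetical, and the code is a simplification of the dump, not the glibc
macro itself.

```c
/* Hypothetical hand-written equivalent of the expansion of
   strcmp (p, "fis").  Every step is one load, one compare and one branch,
   and each step depends on the result of the previous one.  */
static int strcmp_fis (const char *p)
{
  const unsigned char *s1 = (const unsigned char *) p;
  int result = (int) s1[0] - 'f';
  if (result == 0)
    {
      result = (int) s1[1] - 'i';
      if (result == 0)
        {
          result = (int) s1[2] - 's';
          if (result == 0)
            result = (int) s1[3] - '\0';   /* compare the terminators */
        }
    }
  return result;
}
```

The chain of dependent branches grows by one level per character, which is why
the header limits the transformation to very short constant strings.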

The request of this PR is to move the above transformation from glibc into GCC.

However, my run-time performance data, shown in Comments 5 and 6, shows that
the glibc optimization is SLOWER than the direct call to strcmp on aarch64.

      6.  On the other hand, on a platform that has a hardware strcmp insn, for
example x86, expanding str(n)cmp to use the hardware strcmp insn will be very
fast.  GCC has already done this for several platforms.

          However, for a platform that does NOT have hardware strcmp insns, for
example aarch64, inlining str(n)cmp might NOT be a good idea, I think.
