https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809
--- Comment #7 from Qing Zhao <qing.zhao at oracle dot com> ---

I have studied the inlining of memcmp and str(n)cmp in GCC. The following is a
summary of my study so far:

1. memcmp differs from str(n)cmp as follows:

   • strcmp compares null-terminated C strings
   • strncmp compares at most N characters of null-terminated C strings
   • memcmp compares binary byte buffers of N bytes.

   Among these three, both strcmp and strncmp might stop early at the NUL
   terminator of the compared strings. As a result, we have to compare the
   chars one by one for str(n)cmp.

   On the other hand, memcmp will NOT stop early at a terminator; it compares
   exactly N bytes of both buffers. As a result, the compiler can compare
   multiple bytes at a time.

   So memcmp should be much easier for the compiler to optimize than
   str(n)cmp.

2. When memcmp's result is only compared against zero, it is easier to
   optimize than when its result is compared with other values.

3. The implementation of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52171
   is in the latest GCC, and it optimizes the following cases on ALL
   platforms:

     memcmp (p, "fishi", 5) != 0
     __builtin_memcmp (p, "fishiiii", 8) == 0

   The implementation is target independent. When the length of the constant
   string is a multiple of the word size, it is optimized by the "strlen"
   pass; when it is NOT a multiple of the word size, it is optimized by the
   "expand" pass.

4. However, this implementation of PR52171 does NOT optimize the following
   cases:

   A. memcmp's result is NOT compared to zero:

        memcmp (p, "fish", 4);
        __builtin_memcmp (p, "fishi", 5);

   B. strncmp and strcmp, whether or not the result is compared to zero:

        strncmp (p, "fi", 2) != 0
        __builtin_strncmp (p, "fi", 2) != 0
        strcmp (p, "fish") != 0
        __builtin_strcmp (p, "fish") != 0
        strncmp (p, "fi", 2)
        __builtin_strncmp (p, "fi", 2)
        strcmp (p, "fish")
        __builtin_strcmp (p, "fish")

   For the reason I mentioned in 1, strncmp and strcmp can ONLY compare the
   chars one by one, so the implementation of PR52171 can NOT be extended to
   str(n)cmp.

5. Currently, glibc optimizes strcmp with a constant string of up to size 3
   in the header, as Wilco mentioned in the Description of the PR. The
   optimization is as follows:

     strcmp (p, "fis")

   will be transformed to:

     D.2234 = __s2_len = 3;, __s2_len <= 3; ? TARGET_EXPR <D.2233,
       {
         const unsigned char * __s1 = (const unsigned char *) p;
         register int __result = (int) *__s1 - (int) *(const unsigned char *) "fis";

           const unsigned char * __s1 = (const unsigned char *) p;
           register int __result = (int) *__s1 - (int) *(const unsigned char *) "fis";
         {
           if (__s2_len != 0 && __result == 0)
             {
               __result = (int) *(__s1 + 1) - (int) *((const unsigned char *) "fis" + 1);
               if (__s2_len > 1 && __result == 0)
                 {
                   __result = (int) *(__s1 + 2) - (int) *((const unsigned char *) "fis" + 2);
                   if (__s2_len > 2 && __result == 0)
                     {
                       __result = (int) *(__s1 + 3) - (int) *((const unsigned char *) "fis" + 3);
                     }
                 }
             }
         }
         D.2233 = __result;
       }> : __builtin_strcmp ((const char *) p, (const char *) "fis");

   I.e., each character is compared one by one, and the result of the
   previous comparison controls whether the next char is compared at all. As
   a result, comparing one char needs one load, one compare and one branch,
   which are not very cheap operations. Therefore, we cannot apply the above
   optimization to longer strings, even when the string is constant.

   The request of this PR is to move the above transformation from glibc into
   GCC. However, the run-time performance data I showed in Comments 5 and 6
   indicates that the glibc optimization is SLOWER than a direct call to
   strcmp on aarch64.

6. On the other hand, on a platform that has a hardware strcmp insn, for
   example x86, expanding str(n)cmp to use the hardware strcmp insn will be
   very fast. GCC already does this for several platforms. However, for a
   platform that does NOT have hardware strcmp insns, for example aarch64,
   inlining str(n)cmp might NOT be a good idea, I think.
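
To make the contrast between points 1, 3 and 5 concrete, here is a minimal C sketch written by hand. It is not GCC's or glibc's actual code, and the helper names memeq8_fishiiii and strcmp_fis_unrolled are hypothetical. The first function shows roughly what an equality-only memcmp against an 8-byte constant can reduce to (one wide load and one compare), and the second shows the shape of the char-by-char expansion for strcmp against "fis" (a load, compare and branch per char).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: memcmp (p, "fishiiii", 8) == 0 done as a single
   8-byte comparison.  This works only for the equality test, and it
   assumes, as memcmp itself does, that p points to at least 8 readable
   bytes.  */
static int
memeq8_fishiiii (const void *p)
{
  uint64_t a, b;
  memcpy (&a, p, sizeof a);           /* alignment-safe 8-byte load */
  memcpy (&b, "fishiiii", sizeof b);  /* the constant, as one word */
  return a == b;
}

/* Hypothetical helper: strcmp (p, "fis") unrolled char by char.  Each
   step may only run if the previous chars were equal, because p might
   end at any earlier NUL; reading further would be out of bounds.  */
static int
strcmp_fis_unrolled (const char *p)
{
  const unsigned char *s1 = (const unsigned char *) p;
  const unsigned char *s2 = (const unsigned char *) "fis";
  int result = s1[0] - s2[0];
  if (result == 0)
    {
      result = s1[1] - s2[1];
      if (result == 0)
        {
          result = s1[2] - s2[2];
          if (result == 0)
            result = s1[3] - s2[3];   /* compare against the final NUL */
        }
    }
  return result;
}
```

Note that the wide compare only answers "equal or not"; recovering the sign of the first differing byte needs extra work, which is one way to see why case A in point 4 is harder than the cases in point 3.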