Hi, This is the first time I am asking for a design review for fixing a GCC enhancement request, Let me know if I need to send this email to other mailing list as well.
I have been studying PR 78809 for some time https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809> with a lot of help from Wilco and other people, and detailed study about the previous discussion and current GCC behavior, I was able to come up with the following writeup (basically serve as a design doc), and ready for implementation. Please take a look at it, and let me know any comments and suggestions: thanks a lot. Qing ======================================== str(n)cmp and memcmp optimization in gcc -- A design document for PR78809 11/01/2017 Qing Zhao =========================================================== 0. Summary: For PR 78809 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809, Will add the following str(n)cmp and memcmp optimizations into GCC 8: A. for strncmp (s1, s2, n) if one of "s1" or "s2" is a constant string, "n" is a constant, and larger than the length of the constant string: change strncmp (s1, s2, n) to strcmp (s1, s2); B. for strncmp (s1, s2, n) (!)= 0 or strcmp (s1, s2) (!)= 0 if the result is ONLY used to do a simple equality test against zero, one of "s1" or "s2" is a small constant string, n is a constant, and the other non-constant string is guaranteed to not read beyond the end of the string: change strncmp (s1, s2, n) or strcmp (s1, s2) to corresponding memcmp (s1, s2, n); (NOTE, currently, memcmp(s1, s2, N) (!)=0 has been optimized to a simple sequence to access all bytes and accumulate the overall result in GCC by https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52171 ) as a result, such str(n)cmp call would like to be replaced by simple sequences to access all types and accumulate the overall results. C. for strcmp (s1, s2), strncmp (s1, s2, n), and memcmp (s1, s2, n) if the result is NOT used to do simple equality test against zero, one of "s1" or "s2" is a small constant string, n is a constant, and the Min value of the length of the constant string and "n" is smaller than a predefined threshold T, inline the call by a byte-to-byte comparision sequence to avoid calling overhead. A would like to be put in gimple fold phase (in routine "gimple_fold_builtin_string_compare" of gimple-fold.c) B would like to be put in strlen phase (add new "handle_builtin_str(n)cmp" routines in tree-ssa-strlen.c) C would like to be put in expand phase (tree-ssa-strlen.c or builtins.c): run-time performance testing is needed to decide the predefined threshold T. The reasons to put the optimizations in the above order are: * A needs very simple information from source level, and A should be BEFORE B to simplify algorithm in B, gimple-fold phase should be a good place to do it. (Currently, similar transformation such as replacing strncat with strcat is put in this phase); * B needs type information (size and alignment) for the safety checking, and B should be BEFORE expand phase to utilize the available memcmp optimization in expand phase, strlen phase is a good place to do it. * C would like to replace the call to byte-to-byte comparision, since we want some high-level optimization been applied first on the calls, the inlining of the call might be better to be done in the late stage of the tree optimization stage. expand phase is a good place to do it. These 3 optimization can be implemented seperated, 3 patches might be needed. the following are details to explain the above. 1. some background: #include <string.h> int strcmp(const char *s1, const char *s2); int strncmp(const char *s1, const char *s2, size_t n); int memcmp(const void *s1, const void *s2, size_t n); • strcmp compares null-terminated C strings • strncmp compares at most N characters of null-terminated C strings • memcmp compares binary byte buffers of N bytes. The major common part among these three is: * they all return an integer less than, equal to, or greater than zero if s1 is found, respectively, to be less than, to match, or be greater than s2. The major different part among these three is: * both strcmp and strncmp might early stop at NULL terminator of the compared strings. but memcmp will NOT early stop, it means to compare exactly N bytes of both buffers. * strcmp compare the whole string, but strncmp only compare the first n chars (or fewer, if the string ends sooner) of the string. So, when optimizing memcmp and str(n)cmp, we need to consider the following: * The compiler can compare multiple bytes at the same time and doesn't have to worry about beyond the end of a string for memcmp, but have to worry about read beyond the end of a string for str(n)cmp when comparing multiple bytes at the same time because it might early stop at Null terminator. * when the "n" of strncmp is sufficiently large, strncmp means to compare the whole strings, its behavior is same as strcmp. however, strncmp is likely to be moderately slower because it has to keep track of a counter. therefore, using strcmp to replace strncmp when "n" is big enough might be better for the performance. 2. Optimization: Three possible optimization applied on calls to str(n)cmp and memcmp: A. When the "n" of strncmp is larger than the length of constant string "s1" or "s2", it can be replaced with corresponding strcmp. B. When the result of the str(n)cmp or memcmp is ONLY used to do a simple equality test against zero, the compiler is allowed to generate a simpler sequence to access all bytes and accumulate the overall result when guarantee that no read beyond the end of a string. C. When the result of the str(n)cmp or memcmp is NOT used to do simple equality test against zero, if one of the s1 or s2 is a small constant string, it might be better for the compiler to inline the call by a byte-to-byte comparision sequence to avoid calling overhead. 3. The current situation of GCC: For 2.A strncmp is replaced by strcmp when "n" is large enough: * strncmp(p, "fishi", 6) * __builtin_strncmp("fishi", q, 6) GLIBC used to replace the strncmp with strcmp when "n" is large enough, but this optimization was removed in: https://sourceware.org/git/?p=glibc.git;a=commit;h=f7db120f67d853e0cfa2 The reasons to delete it from glibc and move to GCC are: ** when doing it in glibc, the user cannot disable it by using -fno-builtin-strcmp/strncmp; ** __builtin_strncmp cannot be replaced similarly; ** when compile C++ or not using glibc, the transformation is not applied. For 2.B when the result is used to do equality test against zero: * memcmp (!)= 0 is currently optimized by GCC to use a simple sequcence to access all bytes and accumulate the overall result. all the following patterns are optimized: memcmp (pa, qa, 5) != 0 __builtin_memcmp (pa, qa, 29) != 0 memcmp (p, "fishi", 5) != 0 __builtin_memcmp (p, "fishiiii", 8) != 0 * strncmp (!)= 0 the following pattersn are optimized ONLY when size is 1 strncmp (pa, qa, 1) != 0 strncmp (p, "f", 1) != 0 __builtin_strncmp (pa, qa, 1) != 0 __builtin_strncmp (p, "f", 1) != 0 * strcmp (!)== 0 the following patterns is optimized Only when size is 0 (by gimple) __builtin_strcmp (p, "") != 0 the following patterns is optimized when size <= 3 (by glibc), this is for 2.B. strcmp (p, "abc") != 0 See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52171 For 2.C when the result is Not used to do equality test against zero, but one of the s1 or s2 is small constant string, it might be better to inline the call by a byte-to-byte comparision sequence to avoid calling overhead. * memcmp is currently optimized by glibc ONLY when size is 1; * strncmp is currently optimized by gimple-fold phase ONLY when size is 1; * strcmp was inlined by byte-to-byte comparison by glibc when size <= 3, but this functionality was removed from the recent Glibc in: https://sourceware.org/git/?p=glibc.git;a=commit;h=f7db120f67d853e0cfa2 the reasons to delete it from glibc and move to GCC are: ** when doing it in glibc, the user cannot disable it by using -fno-builtin-strcmp/strncmp; ** __builtin_str(n)cmp cannot be expanded similarly; ** when compile C++ or not using glibc, the expansion is not applied. 4. Missing optimization for str(n)cmp and memcmp: For 2.A, we miss the whole optimization in GCC. For 2.B, we miss the following optimization: str(n)cmp compare with constant string, when the result is used to do equality test against zero, if satisfy safety checking to read beyond the end of the string, we can transform them to corresponding memcmp, which has been optimized well in GCC. For 2.C, we miss the whole optimization in GCC. 5. Implementation: * for 2.A strncmp is replaced by strcmp when "n" is large enough strncmp (s, STR, C) in which, s is a pointer to a string, STR is a constant string, C is a constant. if (strlen(STR) < C) { it can be safely transformed to strcmp (s, STR) } * for 2.B str(n)cmp (!)= 0 is replaced by memcmp (!)= 0 when safety checking is passed. strcmp (s, STR) (!)= 0 in which, s is a pointer to a string, STR is a constant string. if (max(sizeof_array(s), alignments(s)) > strlen(STR)) { it can be safely transformed to memcmp (s, STR, strlen(STR)+1) (!)= 0 } strncmp (s, STR, C) (!)= 0 in which, s is a pointer to a string, STR is a constant string, C is a constant. if (C <= strlen(STR) && max(sizeof_array(s), alignments(s)) > C) { it can be safely transformed to memcmp (s, STR, C) (!)= 0 } (when C > strlen(STR), the strncmp has been replaced with corresponding strcmp already). * for 2.C strcmp is expanded when n is small than threshold T. strcmp (s, STR) in which, s is a pointer to a string, STR is a constant string if ((n = strlen(STR)) <= T) /* T might be 3, we can do experiment to decide it */ { result = s[0] - STR[0]; if (result == 0) { result = s[1] - STR[1]; ... ... if (result == 0) result = s[n]; } } strncmp (s, STR, C) in which, s is a pointer to a string, STR is a constant string, C is a constant. if ((C <= strlen(STR) && C <= T) /* T might be 3, do experiments to decide it */ { result = s[0] - STR[0]; if (result == 0) { result = s[1] - STR[1]; ... ... if (result == 0) result = s[C]; } } memcmp (s, STR, C) in which, s is a pointer to a string, STR is a constant string, C is a constant similar as strncmp.