On Thu, Jun 04, 2015 at 09:57:59PM +0200, Jakub Jelinek wrote:
> On Thu, Jun 04, 2015 at 06:36:33PM +0200, Ondřej Bílka wrote:
> > On Thu, Jun 04, 2015 at 04:01:50PM +0000, Joseph Myers wrote:
> > > On Thu, 4 Jun 2015, Richard Earnshaw wrote:
> > > 
> > > > > Change that into
> > > > > 
> > > > > int foo(char *s)
> > > > > {
> > > > >   int l = strlen (s);
> > > > >   char *p = memchr (s, 'a', l);
> > > > >   return p+l;
> > > > > }
> 
> > And Joseph you shouldn't restrict yourself only to values that are
> > present in variables to cover case where its implicit one from strcpy
> > converted to stpcpy.
> 
> memchr isn't handled in that pass right now at all, of course it could be
> added, shouldn't be really hard.  Feel free to file a PR and/or write
> a patch.
> 
> As for e.g. the inlining of the first (or a few more) iterations of strcmp
> etc., that is certainly something that can be done in the compiler too and
> the compiler should have much better information whether to do it or not,
> as it shouldn't be done for -Os, or for basic blocks or functions predicted
> cold, because it enlarges the code size quite a lot.
> For strcmp in particular, right now we handle the cast of two string
> literals at compile time, or e.g. strcmp (str, "").  For anything further,
> supposedly it should check if the target has inline expansion of strcmp
> (e.g. s390 or sh have them), then inlining one iteration is hardly
> beneficial.
For strcmp I said that you need some userspace counters. The best expansion
is determined by the frequency distribution of where the first difference
occurs. As I mentioned, checking just the first byte will likely be
beneficial. Then, depending on the probability of a mismatch in the second
byte, you might want to add a second bytewise check. After that you look at
the next eight bytes and decide whether to do a libcall or check them
inline. Or, if a first-byte mismatch is less likely but a mismatch somewhere
in the first eight bytes is relatively probable, you want to check the first
eight bytes at once. (A sketch of the first-byte expansion is at the end of
this mail.)

As far as strcmp is concerned, the current inline expansion from gcc is the
source of a quite big performance regression: it expands to rep cmpsb, which
is two times slower than the libcall on the attached benchmark. See the
following bug:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052

I don't know how good the s390 and sh expansions are; these need to be
tested. Expanding a variable number of strcmp iterations makes sense only on
x86-64 and other architectures with an instruction to test for a zero byte,
otherwise the expansion is too large.

With a constant string argument, converting strcmp to memcmp is OK; it is
the memcmp expansion that isn't. Most of the memcmp expansion size comes
from having to branch on the given n and from deciding whether the result is
less or greater. If you only care about equality it becomes a lot simpler.
You could expand the following recursive implementation, called memeq, on
64-bit little-endian archs with unaligned loads. For simplicity I omitted
special-casing sizes 1..7.

As entry points there could be savings from introducing streq and memeq in
libc, which only check equality, and having gcc transform calls into these
when the result is only used as a boolean. (A small illustration of that
transformation is also at the end of this mail.)

#include <stdint.h>
#include <string.h>

/* Nonzero when reading s bytes starting at p crosses a page boundary.  */
#define CROSS_PAGE(p, s) \
  __builtin_expect (((uintptr_t) (p)) % 4096 > 4096 - (s), 0)
#define LOADU(x) (*((uint64_t *) (x)))

/* Nonzero iff the first n bytes of x and y differ; n >= 1.  */
uint64_t
memeq2 (char *x, char *y, size_t n)
{
  uint64_t dif = LOADU (x) ^ LOADU (y);
  if (n > 8)
    return dif | memeq2 (x + 8, y + 8, n - 8);
  else if (n == 8)
    return dif;
  else
    /* Only the first n bytes matter; on little endian the trailing bytes
       are the high bytes of the word, so shift them out.  */
    return dif << (8 * (8 - n));
}

uint64_t
memeq (char *x, char *y, size_t n)
{
  /* The loads above work in 8-byte units and may read up to 7 bytes past
     the end of the buffers, so they must not cross into a possibly
     unmapped page.  */
  if (CROSS_PAGE (x, (n + 7) & ~(size_t) 7)
      || CROSS_PAGE (y, (n + 7) & ~(size_t) 7))
    return memcmp (x, y, n);
  return memeq2 (x, y, n);
}

If you want the sign, the best way that I know of is the following bit
manipulation:

/* memcmp-like sign of the first n bytes.  Like memeq2 this reads whole
   8-byte words, so a caller is expected to do the same CROSS_PAGE check
   (or fall back to memcmp) first.  */
int64_t
memcmp2 (char *a, char *b, int n)
{
  uint64_t va = LOADU (a), vb = LOADU (b);
  uint64_t dif = va ^ vb;
  /* Mask of all bits up to and including the lowest differing bit.  */
  dif = dif ^ (dif - 1);
  uint64_t mask = 0x0101010101010101ULL;
  if (n < 8)
    mask = mask >> (8 * (8 - n));
  /* Byte mask covering everything up to and including the first differing
     byte within the first n bytes.  */
  dif = 255 * (dif & mask);
  if (n < 8)
    return (int64_t) ((va & dif) - (vb & dif));
  if ((va & dif) > (vb & dif))
    return 1;
  if ((va & dif) < (vb & dif))
    return -1;
  if (n == 8)
    return 0;
  return memcmp2 (a + 8, b + 8, n - 8);
}

For x86-64 you could also use SSE2 checks:

# include <stdint.h>
# include <string.h>
# include <emmintrin.h>

# define __LOADU(x) _mm_loadu_si128 ((__tp_vector *) (x))
# define __get_mask(x) ((uint64_t) _mm_movemask_epi8 (x))
# define __EQ _mm_cmpeq_epi8
typedef __m128i __tp_vector;
typedef uint64_t __tp_mask;

/* memcmp of the first n bytes, 1 <= n <= 16, with one 16-byte load per
   argument.  CROSS_PAGE is the macro from above; the full 16-byte loads
   must not cross into a possibly unmapped page.  */
static inline __attribute__ ((always_inline)) int
__memcmp_16 (char *s, char *c, int n)
{
  if (CROSS_PAGE (s, 16) || CROSS_PAGE (c, 16))
    return memcmp (s, c, n);
  /* Bits of __m are set where the bytes differ; the sentinel bit at
     position n - 1 stops the search at the last byte we care about.  */
  __tp_mask __m = (~__get_mask (__EQ (__LOADU (s), __LOADU (c))))
                  | 1UL << (n - 1);
  int found = __builtin_ctzl (__m);
  return (unsigned char) s[found] - (unsigned char) c[found];
}
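Here is the sketch of the first-byte strcmp expansion mentioned above. It is
only an illustration of the shape of code gcc could emit, not a patch; the
function name is a placeholder, and whether to also check further bytes
inline would be decided from the mismatch-position counters.

#include <string.h>

/* Illustrative only: inlining the first strcmp iteration and falling back
   to the library call when the first byte does not decide the result.  */
static inline int
strcmp_first_byte (const char *a, const char *b)
{
  unsigned char ca = (unsigned char) a[0], cb = (unsigned char) b[0];
  /* The first byte settles the comparison on a mismatch or when both
     strings are empty; counters would tell how often this path is hit.  */
  if (ca != cb || ca == '\0')
    return ca - cb;
  /* Otherwise do the libcall for the remaining bytes.  */
  return strcmp (a + 1, b + 1);
}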
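And a small illustration of the streq/memeq entry-point idea. streq does not
exist in libc today; the name is only a placeholder for an equality-only
entry point, here trivially wrapping strcmp.

#include <stdbool.h>
#include <string.h>

/* Hypothetical equality-only entry point; a real one could use cheaper
   code than strcmp since it never has to produce a sign.  */
static inline bool
streq (const char *a, const char *b)
{
  return strcmp (a, b) == 0;
}

/* The transformation gcc would do when only the boolean result is used:

     if (strcmp (name, "foo") == 0)   ->   if (streq (name, "foo"))
     if (memcmp (a, b, n) == 0)       ->   if (memeq (a, b, n) == 0)   */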
[Attachment: memcmp_builtin.tar.bz2 (binary data)]