On Thu, Jun 04, 2015 at 09:57:59PM +0200, Jakub Jelinek wrote:
> On Thu, Jun 04, 2015 at 06:36:33PM +0200, Ondřej Bílka wrote:
> > On Thu, Jun 04, 2015 at 04:01:50PM +0000, Joseph Myers wrote:
> > > On Thu, 4 Jun 2015, Richard Earnshaw wrote:
> > > 
> > > > > Change that into
> > > > > 
> > > > > int foo(char *s)
> > > > > {
> > > > >   int l = strlen (s);
> > > > >   char *p = memchr (s, 'a', l);
> > > > >   return p+l;
> > > > > }
> 
> > And Joseph you shouldn't restrict yourself only to values that are
> > present in variables to cover case where its implicit one from strcpy
> > converted to stpcpy.
> 
> memchr isn't handled in that pass right now at all, of course it could be
> added, shouldn't be really hard.  Feel free to file a PR and/or write
> a patch.
> 
> As for e.g. the inlining of the first (or a few more) iterations of strcmp
> etc., that is certainly something that can be done in the compiler too and
> the compiler should have much better information whether to do it or not,
> as it shouldn't be done for -Os, or for basic blocks or functions predicted
> cold, because it enlarges the code size quite a lot.
> For strcmp in particular, right now we handle the cast of two string
> literals at compile time, or e.g. strcmp (str, "").  For anything further,
> supposedly it should check if the target has inline expansion of strcmp
> (e.g. s390 or sh have them), then inlining one iteration is hardly
> beneficial.

For strcmp I said that you need some userspace counters. The best
expansion is determined by the frequency distribution of the first
differing byte. As I mentioned, checking just the first byte will likely
be beneficial. Then, depending on the probability of a mismatch in the
second byte, you might want to add a bytewise check for it as well. After
that you look at the next eight bytes and decide whether to do a libcall
or check them inline. Or, if a mismatch in the first byte is less likely
but one somewhere in the first seven bytes is relatively probable, you
want to check the first eight bytes at once.
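
To make the first-byte case concrete, something like the sketch below is
the kind of expansion I have in mind (just an illustration, with the name
strcmp_first_byte made up here and the ordinary strcmp libcall as the
fallback):

#include <string.h>

/* Sketch only: inline the first-byte comparison, which the counters say
   resolves most calls, and leave the rest to the libcall.  */
static inline int
strcmp_first_byte (const char *a, const char *b)
{
  unsigned char ca = (unsigned char) a[0], cb = (unsigned char) b[0];
  if (ca != cb || ca == '\0')
    return ca - cb;
  return strcmp (a + 1, b + 1);
}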

As far as strcmp is concerned, the inline expansion from gcc is a source
of a quite big performance regression. It expands to rep cmpsb, which is
two times slower than a libcall on the attached benchmark. See the
following bug.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052

I don't know how good the s390 and sh expansions are. These need to be tested.

Variable strcmp iterations make sense only on x64 and other architectures
with a cheap way to test for a zero byte; otherwise the expansion is too large.
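
For reference, this is what one word-sized iteration needs: a cheap
zero-byte test. On x64 you get it from pcmpeqb/pmovmskb; elsewhere you are
stuck with the bit trick below and the per-iteration expansion grows
quickly. A sketch only (names HAS_ZERO, LOADU8 and strcmp_one_word are
made up; little endian, unaligned loads, page-crossing check omitted):

#include <stdint.h>
#include <string.h>

/* Nonzero iff some byte of v is zero (the usual bit trick).  */
#define HAS_ZERO(v) \
  ((((v) - 0x0101010101010101ULL) & ~(v) & 0x8080808080808080ULL) != 0)
#define LOADU8(p) (*((const uint64_t *) (p)))

/* One 8-byte iteration of a hypothetical inlined strcmp: if the first
   word is uninteresting, continue out of line; otherwise resolve it
   byte by byte.  */
static int
strcmp_one_word (const char *a, const char *b)
{
  uint64_t va = LOADU8 (a), vb = LOADU8 (b);
  if (va == vb && !HAS_ZERO (va))
    return strcmp (a + 8, b + 8);
  while (*a == *b && *a != '\0')
    a++, b++;
  return (unsigned char) *a - (unsigned char) *b;
}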

With a constant string argument the conversion of strcmp to memcmp is ok;
it is the memcmp expansion that isn't.

Most of the memcmp expansion size comes from needing to determine the
branch for a given n and decide whether the result is less or greater. If
you only care about equality it becomes a lot simpler. You could expand
the following recursive implementation, called memeq, for 64-bit
little-endian archs with unaligned loads. For simplicity I omitted the
special-casing of sizes 1..7.

As for entry points, there could be savings from introducing streq and
memeq in libc, which only check equality, with gcc transforming calls into
them when the result is only used as a boolean (see the example after the
memeq sketch below).

#include <stdint.h>
#include <string.h>

#define CROSS_PAGE(p, s) \
  __builtin_expect ((((uintptr_t) (p)) % 4096) > 4096 - (s), 0)
#define LOADU(x) (*((uint64_t *) (x)))

uint64_t memeq2 (char *x, char *y, size_t n)
{
  /* Returns 0 iff the first n bytes are equal; n > 0 assumed.  */
  if (n > 8)
    return (LOADU (x) ^ LOADU (y)) | memeq2 (x + 8, y + 8, n - 8);
  else if (n == 8)
    return LOADU (x) ^ LOADU (y);
  else
    /* Keep only the low n bytes of the difference (little endian);
       this reads up to 7 bytes past the buffer, see memeq below.  */
    return (LOADU (x) ^ LOADU (y)) << (8 * (8 - n));
}

uint64_t memeq (char *x, char *y, size_t n)
{
  /* memeq2 reads whole 8-byte words, so guard the over-read by checking
     the size rounded up to a multiple of 8 and fall back to memcmp near
     a page boundary.  */
  size_t rn = (n + 7) & ~(size_t) 7;
  if (CROSS_PAGE (x, rn) || CROSS_PAGE (y, rn))
    return memcmp (x, y, n);
  return memeq2 (x, y, n);
}
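
To illustrate the gcc transformation mentioned above, the rewrite would be
roughly this (hand-written here just as an example; memeq is the sketch
above, lookup_before/lookup_after and the 12-byte key are made-up names):

#include <stddef.h>
#include <string.h>

/* What the user writes: only equality matters.  */
void *
lookup_before (char *key, char *entry)
{
  return memcmp (key, entry, 12) == 0 ? entry : NULL;
}

/* What gcc could emit instead, calling (or expanding) memeq.  */
void *
lookup_after (char *key, char *entry)
{
  return memeq (key, entry, 12) == 0 ? entry : NULL;
}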

If you want the sign, the best way that I know of is the following bit manipulation:

int64_t
memcmp2 (char *a, char *b, int n)
{
  uint64_t va = LOADU (a), vb = LOADU (b);
  uint64_t dif = va ^ vb;
  /* Set all bits up to and including the lowest differing bit.  */
  dif = dif ^ (dif - 1);
  uint64_t mask = 0x0101010101010101ULL;
  if (n < 8)
    mask = mask >> (8 * (8 - n));
  /* Expand to 0xff in every byte up to the first differing byte
     that lies within the first n bytes.  */
  dif = 255 * (dif & mask);
  if (n < 8)
    return (va & dif) - (vb & dif);
  if ((va & dif) > (vb & dif))
    return 1;
  if ((va & dif) < (vb & dif))
    return -1;
  if (n == 8)
    return 0;
  return memcmp2 (a + 8, b + 8, n - 8);
}
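
A quick sanity check of the sign, using the LOADU definition from above
and buffers padded so the 8-byte reads stay in bounds:

#include <assert.h>
#include <string.h>

int
main (void)
{
  char a[16] = "abcdefgh";
  char b[16] = "abcdefgz";
  /* Sign agrees with memcmp: the difference is in the last byte.  */
  assert ((memcmp2 (a, b, 8) < 0) == (memcmp (a, b, 8) < 0));
  assert (memcmp2 (b, a, 8) > 0);
  assert (memcmp2 (a, a, 8) == 0);
  return 0;
}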

For x64 you could also use sse2 checks: 

#  include <stdint.h>
#  include <emmintrin.h>
#  define __LOAD(x) _mm_load_si128 ((__tp_vector *) (x))
#  define __LOADU(x) _mm_loadu_si128 ((__tp_vector *) (x))
#  define __get_mask(x) ((uint64_t) _mm_movemask_epi8 (x))
#  define __EQ  _mm_cmpeq_epi8
#  define __XOR  _mm_xor_si128

typedef __m128i __tp_vector;
typedef uint64_t __tp_mask;


static inline __attribute__ ((always_inline))
int
__memcmp_16 (char *s, char *c, int n)
{
  /* The vector loads always read 16 bytes, so check 16 rather than n.  */
  if (CROSS_PAGE (s, 16) || CROSS_PAGE (c, 16))
    return memcmp (s, c, n);

  /* Bits set where the bytes differ; the forced bit at position n - 1
     bounds the search to the first n bytes.  */
  __tp_mask __m = (~__get_mask (__EQ (__LOADU (s), __LOADU (c))))
                  | 1UL << (n - 1);
  int found = __builtin_ctzl (__m);
  return s[found] - c[found];
}
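
And a matching sanity check (n at most 16 here; assumes the definitions
above, 16-byte buffers so the vector loads stay in bounds):

#include <assert.h>

int
main (void)
{
  char a[16] = "hello world 123";
  char b[16] = "hello world 124";
  assert (__memcmp_16 (a, b, 16) < 0);   /* '3' < '4' in the last byte */
  assert (__memcmp_16 (b, a, 16) > 0);
  assert (__memcmp_16 (a, b, 14) == 0);  /* the difference is past n */
  return 0;
}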

Attachment: memcmp_builtin.tar.bz2
Description: Binary data
