http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48006

           Summary: Inefficient optimization depends on builtin integer
                    type of same size.
           Product: gcc
           Version: 4.4.5
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: ca...@gcc.gnu.org


Created attachment 23561
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=23561
The test file used.

While working on M4RI, I found that changing a typedef from unsigned long long
to unsigned long caused one of the benchmarks to become 50% slower. This is
peculiar, since I'm on a 64-bit box where both types are 8 bytes in size.
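
For anyone who wants to confirm the size claim on their own machine, here is a
minimal sketch. The helper names (bits_of_ulong, bits_of_ullong,
same_representation) are made up for illustration and are not part of M4RI or
of the attached test file. On an LP64 target such as x86_64 GNU/Linux, both
types should report 64 bits.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helpers: storage width, in bits, of each type. */
static size_t bits_of_ulong(void)
{
    return sizeof(unsigned long) * CHAR_BIT;
}

static size_t bits_of_ullong(void)
{
    return sizeof(unsigned long long) * CHAR_BIT;
}

/* Returns 1 when both types have the same size and the same maximum
 * value (as on LP64 targets), 0 otherwise (e.g. on LLP64 targets). */
static int same_representation(void)
{
    return sizeof(unsigned long) == sizeof(unsigned long long)
        && ULONG_MAX == ULLONG_MAX;
}
```

Calling same_representation() from a small main() and printing the two widths
is enough to check whether a given box matches the setup described here.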

After investigating, I reduced the problem to the following function, which
accounts for at least a 25% slowdown and so makes a good test case for this
(compiler) bug (assuming you're willing to call suboptimal compiled code a bug).

===============Start of File=====================
#define RADIX 64
typedef unsigned long word;
typedef unsigned long size_t;

typedef struct _mm_block {
  size_t size;
  void *data;
} mmb_t;

typedef struct {
  mmb_t *blocks;
  size_t nrows;
  size_t ncols;
  size_t width;
  size_t offset;
  word** rows;
} mzd_t;

typedef unsigned char BIT;
#define ONE ((word)1)
#define GET_BIT(w, spot) (((w) >> (RADIX - 1 - (spot))) & ONE)

static inline BIT mzd_read_bit(const mzd_t *M, const size_t row,
                               const size_t col) {
  return GET_BIT(M->rows[row][(col+M->offset)/RADIX], (col+M->offset) % RADIX);
}

void foo(mzd_t* DST, mzd_t const* A, int i, int eol)
{
#ifdef OLDCODE
    unsigned long long* temp = (unsigned long long*)DST->rows[i];
    for (int j = 0; j < eol; j += RADIX, ++temp)
      for (int k = RADIX - 1; k >= 0; --k)
        *temp |= ((unsigned long long)mzd_read_bit(A, j+k, i+A->offset))
                 <<(RADIX-1-k);
#else
    word* temp = DST->rows[i];
    for (int j = 0; j < eol; j += RADIX, ++temp)
      for (int k = RADIX - 1; k >= 0; --k)
        *temp |= ((word)mzd_read_bit(A, j+k, i+A->offset))<<(RADIX-1-k);
#endif
}
===================END OF FILE====================================

Compile this on an x86_64 machine with:

gcc -std=gnu99 -O2 -c transposebody.c -fPIC -DPIC -o transposebody.o -DOLDCODE -save-temps

Compiling once with and once without -DOLDCODE shows a remarkable difference in
the resulting assembly code: the build without OLDCODE uses more registers and
many more instructions.

Note that the only difference is that with OLDCODE defined we cast the unsigned
char returned from mzd_read_bit to an unsigned long long instead of to an
unsigned long, and the type of temp is unsigned long long* instead of unsigned
long*.
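
To illustrate that the two variants compute the same values, here is a small
sketch that isolates the cast-and-shift expression from the inner loop. The
function names (place_bit_ul, place_bit_ull, variants_agree) are made up for
this illustration and do not appear in the attached test file. It assumes
unsigned long is at least 64 bits wide and reports -1 otherwise.

```c
#include <limits.h>

#define RADIX 64
typedef unsigned char BIT;

/* Non-OLDCODE variant: widen the bit to unsigned long before shifting. */
static unsigned long place_bit_ul(BIT bit, int k)
{
    return ((unsigned long)bit) << (RADIX - 1 - k);
}

/* OLDCODE variant: widen the bit to unsigned long long before shifting. */
static unsigned long long place_bit_ull(BIT bit, int k)
{
    return ((unsigned long long)bit) << (RADIX - 1 - k);
}

/* Hypothetical check: returns 1 if both variants agree for every k in
 * [0, RADIX) and both bit values, 0 on a mismatch, and -1 if unsigned
 * long is narrower than RADIX bits (the shift would then be undefined). */
static int variants_agree(void)
{
    if (sizeof(unsigned long) * CHAR_BIT < RADIX)
        return -1;
    for (int k = RADIX - 1; k >= 0; --k)
        for (int bit = 0; bit <= 1; ++bit)
            if (place_bit_ul((BIT)bit, k) != place_bit_ull((BIT)bit, k))
                return 0;
    return 1;
}
```

On an x86_64 GNU/Linux box variants_agree() should return 1, so any difference
in the generated code comes from how the compiler handles the two types, not
from the values being computed.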

$ uname -a
Linux hikaru 2.6.32-5-amd64 #1 SMP Wed Jan 12 03:40:32 UTC 2011 x86_64 GNU/Linux
