http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48006
Summary: Inefficient optimization depends on builtin integer type of same size.
Product: gcc
Version: 4.4.5
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: ca...@gcc.gnu.org

Created attachment 23561
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=23561
The test file used.

Working on M4RI, I found that changing a typedef from unsigned long long
to unsigned long caused one of the benchmarks to become 50% slower.
This is peculiar, since I'm on a 64-bit box where both types are 8 bytes wide.

After investigation I ended up with the following function, which accounts
for at least a 25% slowdown. That makes it a good test case for investigating
this (compiler) bug, assuming you're willing to call suboptimal compiled
code a bug.

===============Start of File=====================
#define RADIX 64

typedef unsigned long word;
typedef unsigned long size_t;

typedef struct _mm_block {
  size_t size;
  void *data;
} mmb_t;

typedef struct {
  mmb_t *blocks;
  size_t nrows;
  size_t ncols;
  size_t width;
  size_t offset;
  word** rows;
} mzd_t;

typedef unsigned char BIT;

#define ONE ((word)1)
#define GET_BIT(w, spot) (((w) >> (RADIX - 1 - (spot))) & ONE)

static inline BIT mzd_read_bit(const mzd_t *M, const size_t row, const size_t col ) {
  return GET_BIT(M->rows[row][(col+M->offset)/RADIX], (col+M->offset) % RADIX);
}

void foo(mzd_t* DST, mzd_t const* A, int i, int eol)
{
#ifdef OLDCODE
  unsigned long long* temp = (unsigned long long*)DST->rows[i];
  for (int j = 0; j < eol; j += RADIX, ++temp)
    for (int k = RADIX - 1; k >= 0; --k)
      *temp |= ((unsigned long long)mzd_read_bit(A, j+k, i+A->offset))<<(RADIX-1-k);
#else
  word* temp = DST->rows[i];
  for (int j = 0; j < eol; j += RADIX, ++temp)
    for (int k = RADIX - 1; k >= 0; --k)
      *temp |= ((word)mzd_read_bit(A, j+k, i+A->offset))<<(RADIX-1-k);
#endif
}
===================END OF FILE====================================

Compile this on an x86_64 machine with:

gcc -std=gnu99 -O2 -c transposebody.c -fPIC -DPIC -o transposebody.o -DOLDCODE -save-temps

Compiling once with and once without -DOLDCODE shows a remarkable
difference in the resulting assembly code: the version without OLDCODE
uses more registers and a lot more instructions.

Note that the only difference is that with OLDCODE defined we cast the
unsigned char returned from mzd_read_bit to an unsigned long long instead
of to an unsigned long, and the type of temp is unsigned long long*
instead of unsigned long*.

$ uname -a
Linux hikaru 2.6.32-5-amd64 #1 SMP Wed Jan 12 03:40:32 UTC 2011 x86_64 GNU/Linux
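
The claim that the two variants differ only in the accumulator type, not in behaviour, can be checked with a reduced sketch like the one below. It is not part of the attached test file: the names pack_bits_ul, pack_bits_ull and check_variants are hypothetical, and the mzd_t bit lookup is replaced by a plain array of 0/1 bytes. It only shows that, when both types are 8 bytes wide, the inner loop computes the same word either way, so any difference in the generated code would come from the optimizer.

```c
#include <assert.h>

#define RADIX 64

/* Compile-time check that works in C99: the array size is negative, and the
   build fails, unless both types are exactly 8 bytes wide as in the report. */
typedef char same_width_check[(sizeof(unsigned long) == 8 &&
                               sizeof(unsigned long long) == 8) ? 1 : -1];

/* Hypothetical stand-in for the non-OLDCODE branch: accumulate in unsigned long. */
unsigned long pack_bits_ul(const unsigned char bits[RADIX])
{
  unsigned long w = 0;
  for (int k = RADIX - 1; k >= 0; --k)
    w |= ((unsigned long)bits[k]) << (RADIX - 1 - k);
  return w;
}

/* Hypothetical stand-in for the OLDCODE branch: accumulate in unsigned long long. */
unsigned long long pack_bits_ull(const unsigned char bits[RADIX])
{
  unsigned long long w = 0;
  for (int k = RADIX - 1; k >= 0; --k)
    w |= ((unsigned long long)bits[k]) << (RADIX - 1 - k);
  return w;
}

/* Aborts (unless NDEBUG is defined) if the two variants disagree. */
void check_variants(const unsigned char bits[RADIX])
{
  assert(pack_bits_ul(bits) == pack_bits_ull(bits));
}
```

Calling check_variants on a few bit patterns from a small main confirms that the two functions agree. Compiling the file with gcc -std=gnu99 -O2 -S then allows the assembly of the two functions to be compared side by side; whether this reduced form shows the same difference as the attached file has not been verified.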