Hey guys, I noticed a strange performance hit in one of our stencil codes, causing it to run twice as long.
To narrow down the cause, I reduced our code to the two attached demo programs. Basically they take two matrices and average each matrix element with its four direct neighbors. Whether the performance hit occurs depends on how these matrices are allocated. Here is the diff of the two files:

    @@ -17,8 +17,7 @@

     void test(double (*grid)[GRID_WIDTH])
     {
    -    double (*gridOld)[GRID_WIDTH] =
    -        malloc(GRID_WIDTH * GRID_HEIGHT * sizeof(double));
    +    double (*gridOld)[GRID_WIDTH] = gridOldArray;
         double (*gridNew)[GRID_WIDTH] = gridNewArray;
         printAddress(&gridNew[0][0]);
         printAddress(&gridOld[0][0]);

where gridOldArray is a statically allocated array. Depending on the machine's processor, the performance hit varies from negligible to dramatic:

    Processor           GCC Version  Time(slow)  Time(fast)  Performance Hit
    ------------------  -----------  ----------  ----------  ---------------
    Core 2 Quad Q9550   4.3.3        12.19s      5.11s       138%
    Athlon 64 X2 3800+  4.3.3        7.34s       6.61s       11%
    Opteron 2378        4.3.2        6.13s       5.60s       9%
    Opteron 2352        4.3.3        8.16s       7.96s       2%
    Xeon 3.00GHz        4.3.3        18.98s      14.67s      29%

Apparently Intel systems are more susceptible to this effect. Can anyone reproduce these results? And could anyone explain why this happens?

Thanks in advance
-Andreas

-- 
Andreas Schäfer
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany
#define GRID_WIDTH 1024
#define GRID_HEIGHT 1024
#define MAX_STEPS 1024

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

double grid[GRID_HEIGHT][GRID_WIDTH];
double gridNewArray[GRID_HEIGHT][GRID_WIDTH];
double gridOldArray[GRID_HEIGHT][GRID_WIDTH];

void printAddress(void *p)
{
    printf("address %p\n", p);
}

void test(double (*grid)[GRID_WIDTH])
{
    double (*gridOld)[GRID_WIDTH] = gridOldArray;
    double (*gridNew)[GRID_WIDTH] = gridNewArray;
    printAddress(&gridNew[0][0]);
    printAddress(&gridOld[0][0]);

    // copy initial state
    for (int y = 0; y < GRID_HEIGHT; ++y) {
        memcpy(&gridOld[y][0], &grid[y][0], GRID_WIDTH * sizeof(double));
        memset(&gridNew[y][0], 0, GRID_WIDTH * sizeof(double));
    }

    // update matrices
    for (int step = 0; step < MAX_STEPS; ++step) {
        for (int y = 1; y < GRID_HEIGHT-1; ++y)
            for (int x = 1; x < GRID_WIDTH-1; ++x)
                gridNew[y][x] =
                    (gridOld[y-1][x  ] +
                     gridOld[y  ][x-1] +
                     gridOld[y  ][x  ] +
                     gridOld[y  ][x+1] +
                     gridOld[y+1][x  ]) * 0.2;

        double (*tmp)[GRID_WIDTH] = gridOld;
        gridOld = gridNew;
        gridNew = tmp;
    }

    // copy result back
    for (int y = 0; y < GRID_HEIGHT; ++y)
        memcpy(&grid[y][0], &gridOld[y][0], GRID_WIDTH * sizeof(double));
}

void setupGrid()
{
    for (int y = 0; y < GRID_HEIGHT; ++y)
        for (int x = 0; x < GRID_WIDTH; ++x)
            grid[y][x] = 0;

    for (int y = 10; y < 20; ++y)
        for (int x = 10; x < 20; ++x)
            grid[y][x] = 1;
}

int main(int argc, char** argv)
{
    setupGrid();
    test(grid);
    printf("res: %f\n", grid[10][10]); // prevent dead code elimination
    return 0;
}
Attachments: slowdown.fast (binary data), test.sh (Bourne shell script)