Hello,

I found a problem in openmpi's ptmalloc2. The TSD (thread-specific data) does not work properly, which can cause performance loss and segfaults. In my case, applications that allocate memory heavily sometimes segfault.

Please see opal/mca/memory/linux/sysdeps/pthread/malloc-machine.h. When USE_TSD_DATA_HACK is defined, which is the default in openmpi, the hacked TSD is used as shown below.

    #if defined(__sgi) || defined(USE_TSD_DATA_HACK)

    typedef void *tsd_key_t[256];
    #define tsd_key_create(key, destr) do { \
      int i; \
      for(i=0; i<256; i++) (*key)[i] = 0; \
    } while(0)
    #define tsd_setspecific(key, data) \
     (key[(unsigned)pthread_self() % 256] = (data))
    #define tsd_getspecific(key, vptr) \
     (vptr = key[(unsigned)pthread_self() % 256])

However, the thread IDs (= pthread_self()) generated by pthread are not consecutive numbers, at least in my environment. Here is an example of the threads created by t-test1, which is included in ptmalloc2:

    [mishima@manage ptmalloc2]$ ./t-test1 4 4
    Using posix threads.
    total=4 threads=4 i_max=10000 size=10000 bins=200
    Created thread 41cb4940.
    Created thread 41eb5940.
    Created thread 420b6940.
    Created thread 422b7940.

Since the spacing between thread IDs is much larger than 256, different threads can map to the same slot of the key array. As the output above shows, pthread_self() % 256 is 64 in most cases (the low byte of every ID listed is 0x40), which means that the hacked TSD does not function at all: the threads overwrite each other's entry.

I think -DUSE_TSD_DATA_HACK=1 should be removed from openmpi's configuration. As far as I have checked, when I use pthread's TSD via "#undef USE_TSD_DATA_HACK", the problem goes away.

One more request concerns the PGI compiler. PGI does not predefine the macro __GNUC__, so it does not use the fast inline mutex_lock written in malloc-machine.h. Please consider adding the following 4 lines near the top of malloc.c.

--- opal/mca/memory/linux/malloc.c.org  2012-08-30 16:15:19.000000000 +0900
+++ opal/mca/memory/linux/malloc.c      2012-08-31 07:57:16.000000000 +0900
@@ -43,6 +43,11 @@
 #define MORECORE opal_memory_linux_free_ptmalloc2_sbrk
 #define munmap(a,b) opal_memory_linux_free_ptmalloc2_munmap(a,b,1)
 
+/* For PGI compiler to activate inline mutex_lock */
+#if defined(__PGI)
+#define __GNUC__ 1
+#endif
+
 /* make some non-GCC compilers happy */
 #ifndef __GNUC__
 #define __const const

P.S. Since the GNU and Intel compilers use the inline mutex_lock, mutex initialization is very fast and the hacked TSD problem does not cause a segfault with them. It can only cause performance loss. The reason is a very long story, so please let me omit it today.

Best regards,
Tetsuya Mishima