Please note that I had the problem with 13.1.0 but not with 13.1.1.


On 28/05/2014 00:47, Ralph Castain wrote:
On May 27, 2014, at 3:32 PM, Alain Miniussi <alain.miniu...@oca.eu> wrote:

Unfortunately, the debug library works like a charm (which makes the
uninitialized variable issue more likely).
Indeed - sounds like there is some optimization occurring that triggers the 
problem.

Still, the stack trace points to mca_btl_openib_add_procs in
ompi/mca/btl/openib/btl_openib.c, and there is only one division in that
function (although not floating point), at the end:

    openib_btl->local_procs += local_procs;
    openib_btl->device->mem_reg_max = calculate_max_reg () /
        openib_btl->local_procs;

Now, I'm not sure how much I would trust the local_procs initialization:

for (i = 0, local_procs = 0 ; i < (int) nprocs; i++) {

I suspect that a compiler could (wrongly) decide to skip the initialization of
local_procs if nprocs = 0 or in a few other corner cases.
Yeah, that could be a source of optimization, I suppose - somewhat troubling 
wrt the expected behavior, but you could sorta see someone doing that.

Anyway, applying the attached patch to btl_openib.c seems to fix the issue on
my small case (but I have no exhaustive test suite to run).

If there is a more serious patch process to follow (based on the dev version?) 
please let me know.
The fact that it resolves the issue would lend credence to the optimizer indeed 
skipping that step for some odd reason. I'll bring it to the attention of the 
folks who maintain that component and see if they can grok the problem.

Thanks!
Ralph

Alain

On 27/05/2014 17:30, Ralph Castain wrote:
Ah, good. On the setup that fails, could you use gdb to find the line number 
where it is dividing by zero? It could be an uninitialized variable that gcc 
inits one way and icc inits another.


On May 27, 2014, at 4:49 AM, Alain Miniussi <alain.miniu...@oca.eu> wrote:

So it's working with a gcc compiled openmpi:

[alainm@gurney mpi]$ /softs/openmpi-1.8.1-gnu447/bin/mpicc --showme
gcc -I/softs/openmpi-1.8.1-gnu447/include -pthread -Wl,-rpath 
-Wl,/softs/openmpi-1.8.1-gnu447/lib -Wl,--enable-new-dtags 
-L/softs/openmpi-1.8.1-gnu447/lib -lmpi
[alainm@gurney mpi]$ /softs/openmpi-1.8.1-gnu447/bin/mpicc ./test.c
[alainm@gurney mpi]$ /softs/openmpi-1.8.1-gnu447/bin/mpiexec -n 2 ./a.out
[alainm@gurney mpi]$ ldd ./a.out
    linux-vdso.so.1 =>  (0x00007fffb47ff000)
    libmpi.so.1 => /softs/openmpi-1.8.1-gnu447/lib/libmpi.so.1 
(0x00002aaee80c1000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003bd9e00000)
    libc.so.6 => /lib64/libc.so.6 (0x0000003bd9200000)
    libopen-rte.so.7 => /softs/openmpi-1.8.1-gnu447/lib/libopen-rte.so.7 
(0x00002aaee83b8000)
    libopen-pal.so.6 => /softs/openmpi-1.8.1-gnu447/lib/libopen-pal.so.6 
(0x00002aaee8630000)
    libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x0000003bd9600000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00002aaee8904000)
    librt.so.1 => /lib64/librt.so.1 (0x0000003bda600000)
    libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003beb000000)
    libutil.so.1 => /lib64/libutil.so.1 (0x0000003bea000000)
    libm.so.6 => /lib64/libm.so.6 (0x0000003bd9a00000)
    /lib64/ld-linux-x86-64.so.2 (0x0000003bd8e00000)
[alainm@gurney mpi]$ ./a.out
[alainm@gurney mpi]$

So it seems to be specific to Intel's compiler.


On 26/05/2014 17:35, Ralph Castain wrote:
If you wouldn't mind, yes - let's see if it is a problem with icc. We know some 
versions have bugs, though this may not be the issue here

On May 26, 2014, at 7:39 AM, Alain Miniussi <alain.miniu...@oca.eu> wrote:

Hi,

Did that too, with the same result:

[alainm@tagir mpi]$ mpirun -n 1 ./a.out
[tagir:05123] *** Process received signal ***
[tagir:05123] Signal: Floating point exception (8)
[tagir:05123] Signal code: Integer divide-by-zero (1)
[tagir:05123] Failing at address: 0x2adb507b3d9f
[tagir:05123] [ 0] /lib64/libpthread.so.0[0x30f920f710]
[tagir:05123] [ 1] 
/softs/openmpi-1.8.1-intel13/lib/openmpi/mca_btl_openib.so(mca_btl_openib_add_procs+0xe9f)[0x2adb507b3d9f]
[tagir:05123] [ 2] 
/softs/openmpi-1.8.1-intel13/lib/openmpi/mca_bml_r2.so(+0x1481)[0x2adb505a7481]
[tagir:05123] [ 3] 
/softs/openmpi-1.8.1-intel13/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xa8)[0x2adb51af02f8]
[tagir:05123] [ 4] 
/softs/openmpi-1.8.1-intel13/lib/libmpi.so.1(ompi_mpi_init+0x9f6)[0x2adb4b78b236]
[tagir:05123] [ 5] 
/softs/openmpi-1.8.1-intel13/lib/libmpi.so.1(MPI_Init+0xef)[0x2adb4b7ad74f]
[tagir:05123] [ 6] ./a.out[0x400dd1]
[tagir:05123] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd)[0x30f8a1ed1d]
[tagir:05123] [ 8] ./a.out[0x400cc9]
[tagir:05123] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 5123 on node tagir exited on signal 
13 (Broken pipe).
--------------------------------------------------------------------------
[alainm@tagir mpi]$


Do you want me to try a gcc build?

Alain

On 26/05/2014 16:09, Ralph Castain wrote:
Strange - I note that you are running these as singletons. Can you try running 
it under mpirun?

mpirun -n 1 ./a.out

just to see if it is the singleton that is causing the problem, or something in 
the openib btl itself.


On May 26, 2014, at 6:59 AM, Alain Miniussi <alain.miniu...@oca.eu> wrote:

Hi,

I have a failure with the following minimalistic testcase:
$: more ./test.c
#include "mpi.h"

int main(int argc, char* argv[]) {
    MPI_Init(&argc,&argv);
    MPI_Finalize();
    return 0;
}
$: mpicc -v
icc version 13.1.1 (gcc version 4.4.7 compatibility)
$: mpicc ./test.c
$: ./a.out
[tagir:02855] *** Process received signal ***
[tagir:02855] Signal: Floating point exception (8)
[tagir:02855] Signal code: Integer divide-by-zero (1)
[tagir:02855] Failing at address: 0x2aef6e5b2d9f
[tagir:02855] [ 0] /lib64/libpthread.so.0[0x30f920f710]
[tagir:02855] [ 1] 
/softs/openmpi-1.8.1-intel13/lib/openmpi/mca_btl_openib.so(mca_btl_openib_add_procs+0xe9f)[0x2aef6e5b2d9f]
[tagir:02855] [ 2] 
/softs/openmpi-1.8.1-intel13/lib/openmpi/mca_bml_r2.so(+0x1481)[0x2aef6e3a6481]
[tagir:02855] [ 3] 
/softs/openmpi-1.8.1-intel13/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xa8)[0x2aef6f8ef2f8]
[tagir:02855] [ 4] 
/softs/openmpi-1.8.1-intel13/lib/libmpi.so.1(ompi_mpi_init+0x9f6)[0x2aef69572236]
[tagir:02855] [ 5] 
/softs/openmpi-1.8.1-intel13/lib/libmpi.so.1(MPI_Init+0xef)[0x2aef6959474f]
[tagir:02855] [ 6] ./a.out[0x400dd1]
[tagir:02855] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd)[0x30f8a1ed1d]
[tagir:02855] [ 8] ./a.out[0x400cc9]
[tagir:02855] *** End of error message ***
$:

Versions info:
$: mpicc -v
icc version 13.1.1 (gcc version 4.4.7 compatibility)
$: ldd ./a.out
    linux-vdso.so.1 =>  (0x00007fffbb197000)
    libmpi.so.1 => /softs/openmpi-1.8.1-intel13/lib/libmpi.so.1 
(0x00002b20262ee000)
    libm.so.6 => /lib64/libm.so.6 (0x00000030f8e00000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000030ff200000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00000030f9200000)
    libc.so.6 => /lib64/libc.so.6 (0x00000030f8a00000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00000030f9600000)
    libopen-rte.so.7 => /softs/openmpi-1.8.1-intel13/lib/libopen-rte.so.7 
(0x00002b202660d000)
    libopen-pal.so.6 => /softs/openmpi-1.8.1-intel13/lib/libopen-pal.so.6 
(0x00002b20268a1000)
    libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00002b2026ba6000)
    librt.so.1 => /lib64/librt.so.1 (0x00000030f9e00000)
    libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003109800000)
    libutil.so.1 => /lib64/libutil.so.1 (0x000000310aa00000)
    libimf.so => 
/softs/intel/composer_xe_2013.3.163/compiler/lib/intel64/libimf.so 
(0x00002b2026db0000)
    libsvml.so => 
/softs/intel/composer_xe_2013.3.163/compiler/lib/intel64/libsvml.so 
(0x00002b202726d000)
    libirng.so => 
/softs/intel/composer_xe_2013.3.163/compiler/lib/intel64/libirng.so 
(0x00002b2027c37000)
    libintlc.so.5 => 
/softs/intel/composer_xe_2013.3.163/compiler/lib/intel64/libintlc.so.5 
(0x00002b2027e3e000)
    /lib64/ld-linux-x86-64.so.2 (0x00000030f8600000)
$:

I tried to google the issue and saw something regarding an old vectorization
bug with the Intel compiler, but that was a long time ago and seemed to be fixed
for 1.6.x.
Also, "make check" went fine.

Any ideas?

Cheers

--
---
Alain

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
<btl_openib-1.8.1.diff>