Hi everyone,

I'm testing a new machine with 32 nodes of 32 cores each using the IMB
benchmark. It works fine with 512 processes, but it crashes with
1024 processes after running for about a minute:

[pax11-17:16978] *** Process received signal ***
[pax11-17:16978] Signal: Bus error (7)
[pax11-17:16978] Signal code: Non-existant physical address (2)
[pax11-17:16978] Failing at address: 0x2b147b785450
[pax11-17:16978] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b1473b13370]
[pax11-17:16978] [ 1] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_btl_vader.so(mca_btl_vader_frag_init+0x8e)[0x2b14794a413e]
[pax11-17:16978] [ 2] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/libmpi.so.12(ompi_free_list_grow+0x199)[0x2b147384f309]
[pax11-17:16978] [ 3] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_btl_vader.so(+0x270d)[0x2b14794a270d]
[pax11-17:16978] [ 4] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x43)[0x2b1479ae3a13]
[pax11-17:16978] [ 5] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x89a)[0x2b1479ad90ca]
[pax11-17:16978] [ 6] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_ring+0x3f1)[0x2b147ad6ec41]
[pax11-17:16978] [ 7] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/libmpi.so.12(MPI_Allreduce+0x17b)[0x2b147387d6bb]
[pax11-17:16978] [ 8] IMB-MPI1[0x40b316]
[pax11-17:16978] [ 9] IMB-MPI1[0x407284]
[pax11-17:16978] [10] IMB-MPI1[0x40250e]
[pax11-17:16978] [11] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b1473d41b35]
[pax11-17:16978] [12] IMB-MPI1[0x401f79]
[pax11-17:16978] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 552 with PID 0 on node pax11-17
exited on signal 7 (Bus error).
--------------------------------------------------------------------------

The program is started from the Slurm batch system using mpirun. The
same application works fine when built and run with MVAPICH2 instead.

Regards, Götz Waschk
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users