Hi everyone, I'm testing a new machine with 32 nodes of 32 cores each using the IMB benchmark. It works fine with 512 processes, but it crashes with 1024 processes after running for about a minute:
[pax11-17:16978] *** Process received signal ***
[pax11-17:16978] Signal: Bus error (7)
[pax11-17:16978] Signal code: Non-existant physical address (2)
[pax11-17:16978] Failing at address: 0x2b147b785450
[pax11-17:16978] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b1473b13370]
[pax11-17:16978] [ 1] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_btl_vader.so(mca_btl_vader_frag_init+0x8e)[0x2b14794a413e]
[pax11-17:16978] [ 2] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/libmpi.so.12(ompi_free_list_grow+0x199)[0x2b147384f309]
[pax11-17:16978] [ 3] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_btl_vader.so(+0x270d)[0x2b14794a270d]
[pax11-17:16978] [ 4] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x43)[0x2b1479ae3a13]
[pax11-17:16978] [ 5] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x89a)[0x2b1479ad90ca]
[pax11-17:16978] [ 6] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_ring+0x3f1)[0x2b147ad6ec41]
[pax11-17:16978] [ 7] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/libmpi.so.12(MPI_Allreduce+0x17b)[0x2b147387d6bb]
[pax11-17:16978] [ 8] IMB-MPI1[0x40b316]
[pax11-17:16978] [ 9] IMB-MPI1[0x407284]
[pax11-17:16978] [10] IMB-MPI1[0x40250e]
[pax11-17:16978] [11] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b1473d41b35]
[pax11-17:16978] [12] IMB-MPI1[0x401f79]
[pax11-17:16978] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 552 with PID 0 on node pax11-17 exited on signal 7 (Bus error).
--------------------------------------------------------------------------

The program is started from the Slurm batch system using mpirun. The same application works fine when using mvapich2 instead.
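For reference, the job is submitted roughly like this. This is only a minimal sketch: the exact #SBATCH options and the module name are assumptions, not copied from my actual script.

```shell
#!/bin/bash
# Hypothetical reconstruction of the batch script; node/task counts match
# the failing case (32 nodes x 32 cores = 1024 ranks), the rest is assumed.
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=32
#SBATCH --ntasks=1024

module load openmpi/1.10.4   # assumed module name on this OpenHPC install

# mpirun picks up the Slurm allocation; IMB-MPI1 runs the MPI-1 benchmark
# suite, including the Allreduce phase where the crash occurs.
mpirun ./IMB-MPI1
```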
Regards, Götz Waschk
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users