Hi Goetz,

Would you mind testing against the 2.1.0 release or the latest from the 1.10.x series (1.10.6)?
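As a quick sanity check after rebuilding, something along these lines will confirm which installation the benchmark actually picks up (the binary name is taken from your trace; the commands are standard Open MPI and glibc tools):

    mpirun --version               # version of the launcher on PATH
    ompi_info | head -n 3          # package and version of the Open MPI build
    ldd ./IMB-MPI1 | grep libmpi   # which libmpi the benchmark links against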
Thanks,

Howard

2017-03-22 6:25 GMT-06:00 Götz Waschk <goetz.was...@gmail.com>:
> Hi everyone,
>
> I'm testing a new machine with 32 nodes of 32 cores each using the IMB
> benchmark. It is working fine with 512 processes, but it crashes with
> 1024 processes after running for a minute:
>
> [pax11-17:16978] *** Process received signal ***
> [pax11-17:16978] Signal: Bus error (7)
> [pax11-17:16978] Signal code: Non-existant physical address (2)
> [pax11-17:16978] Failing at address: 0x2b147b785450
> [pax11-17:16978] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b1473b13370]
> [pax11-17:16978] [ 1] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_btl_vader.so(mca_btl_vader_frag_init+0x8e)[0x2b14794a413e]
> [pax11-17:16978] [ 2] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/libmpi.so.12(ompi_free_list_grow+0x199)[0x2b147384f309]
> [pax11-17:16978] [ 3] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_btl_vader.so(+0x270d)[0x2b14794a270d]
> [pax11-17:16978] [ 4] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x43)[0x2b1479ae3a13]
> [pax11-17:16978] [ 5] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x89a)[0x2b1479ad90ca]
> [pax11-17:16978] [ 6] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_ring+0x3f1)[0x2b147ad6ec41]
> [pax11-17:16978] [ 7] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/libmpi.so.12(MPI_Allreduce+0x17b)[0x2b147387d6bb]
> [pax11-17:16978] [ 8] IMB-MPI1[0x40b316]
> [pax11-17:16978] [ 9] IMB-MPI1[0x407284]
> [pax11-17:16978] [10] IMB-MPI1[0x40250e]
> [pax11-17:16978] [11] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b1473d41b35]
> [pax11-17:16978] [12] IMB-MPI1[0x401f79]
> [pax11-17:16978] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 552 with PID 0 on node pax11-17
> exited on signal 7 (Bus error).
> --------------------------------------------------------------------------
>
> The program is started from the slurm batch system using mpirun. The
> same application is working fine when using mvapich2 instead.
>
> Regards, Götz Waschk
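For reference, the launch being described would look roughly like the sketch below. The node and task counts come from the report; the module name is a placeholder for the local OpenHPC setup (the trace shows /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4):

    #!/bin/bash
    #SBATCH --nodes=32              # 32 nodes...
    #SBATCH --ntasks-per-node=32    # ...x 32 cores each = 1024 MPI ranks
    #SBATCH --time=00:30:00

    # Placeholder module name; adjust to the local install
    module load openmpi-gnu/1.10.4

    # mpirun detects the Slurm allocation, so no -np or hostfile is needed
    mpirun ./IMB-MPI1 Allreduce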