On Tue, Apr 12, 2016 at 07:51:48PM +0900, Gilles Gouaillardet wrote:
what if you ulimit -c unlimited? does orted generate some core dump?
Hi Gilles,

-thanks for your support!-

Nope, no core, just the "orte has lost"...

I have now tested with a simple hello-world MPI program: a printf("rank, processor") in the middle and a printf("before mpi_init")/printf("after mpi_init"). It is started in the batch script with

  mpirun -verbose --mca mtl psm --mca btl vader,self --mca orte_base_help_aggregate 0 ~/mpihw/mpi_hello_world

Results:

  Starting at Tue Apr 12 13:06:38 CEST 2016
  Running on hosts: stek[090-189]
  Running on 100 nodes.
  Current working directory is /export/homelocal/sfriedel/mpihw
  Hello world before mpi_init
  [...]
  Hello world from processor stek150, rank 971 out of 1600 processors
  Program finished with exit code 0 at: Tue Apr 12 13:06:42 CEST 2016

Even with just 100 nodes, some jobs are failing (50/50). Failing jobs: _no output_, _no core dumped_... only "orte has lost"...

Running on >=350 nodes: almost all jobs are failing, but some jobs succeed (similar output: only "orte has lost..." for the failing jobs and the expected output for the other jobs).

Weird.

MfG/Sincerely
Stefan Friedel
--
IWR * 4.317 * INF205 * 69120 Heidelberg
T +49 6221 5414404 * F +49 6221 5414427
stefan.frie...@iwr.uni-heidelberg.de
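
For reference, a minimal sketch of how a batch script could raise the core limit before the launch and report the exit code plus any core files afterwards. The helper name run_and_report and the core* file pattern are assumptions for illustration, not taken from the actual batch script; core file naming is system dependent, and whether the limit reaches orted on remote nodes depends on how the daemons are launched there.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original batch script.
run_and_report() {
    # Lift the core file size limit for this shell and its children.
    # This fails if the hard limit is lower, so only warn in that case.
    ulimit -c unlimited 2>/dev/null || echo "warning: could not raise core limit" >&2

    # Run the launch command passed as arguments and keep its exit status.
    "$@"
    local rc=$?

    # Count regular files named core* in the current directory
    # (assumed naming; the real pattern depends on the system setup).
    local cores
    cores=$(find . -maxdepth 1 -type f -name 'core*' | wc -l)

    echo "Program finished with exit code $rc at: $(date)"
    echo "Core files in $(pwd): $cores"
    return "$rc"
}
```

It would be called with the same launch line as above, e.g. run_and_report mpirun -verbose --mca mtl psm --mca btl vader,self --mca orte_base_help_aggregate 0 ~/mpihw/mpi_hello_world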