On Tue, Apr 12, 2016 at 07:51:48PM +0900, Gilles Gouaillardet wrote:
> what if you
> ulimit -c unlimited
>
> do orted generate some core dump ?

Hi Gilles,

thanks for your support! Nope, no core dump, just the "orte has lost"...

I have now tested with a simple hello-world MPI program: a
printf("rank, processor") in the middle, plus a printf("before mpi_init")
and a printf("after mpi_init").

Starting in the batch script with

mpirun -verbose --mca mtl psm --mca btl vader,self \
    --mca orte_base_help_aggregate 0 ~/mpihw/mpi_hello_world

Results:

Starting at Tue Apr 12 13:06:38 CEST 2016
Running on hosts: stek[090-189]
Running on 100 nodes.
Current working directory is /export/homelocal/sfriedel/mpihw
Hello world before mpi_init
[...]
Hello world from processor stek150, rank 971 out of 1600 processors
Program finished with exit code 0 at: Tue Apr 12 13:06:42 CEST 2016
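
The "Starting at" and "Program finished with exit code" lines above come from
the batch script, which is not shown here. A minimal sketch of such a wrapper,
assuming bash: run_logged is a hypothetical helper name, and this is not the
actual script. It also requests core dumps as suggested. Whether that limit
reaches the orted daemons on remote nodes depends on the launcher and is not
verified here.

```shell
#!/bin/bash
# Hypothetical helper (not the actual batch script): run a command,
# log start and finish times, and pass its exit code through.
run_logged() {
    local rc
    echo "Starting at $(date)"
    echo "Current working directory is $(pwd)"
    "$@"
    rc=$?
    echo "Program finished with exit code $rc at: $(date)"
    return "$rc"
}

# Request core dumps for this shell and its children. Raising the limit can
# be refused if the hard limit is lower, so a failure is only reported.
ulimit -c unlimited 2>/dev/null || echo "could not raise core limit" >&2

# Usage, with the command from the run above:
# run_logged mpirun -verbose --mca mtl psm --mca btl vader,self \
#     --mca orte_base_help_aggregate 0 ~/mpihw/mpi_hello_world
```

Because the helper returns the exit code of the wrapped command, a failing
run can be told apart from a successful one even when it prints nothing.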

Even with just 100 nodes, some jobs fail (50/50). Failing jobs show _no
output_ and _no core dump_, only the "orte has lost"... message.

Running on >=350 nodes: almost all jobs fail, but some succeed (same
pattern: only "orte has lost..." for the failing jobs and the expected
output for the others).

Weird.

MfG/Sincerely
Stefan Friedel
--
IWR * 4.317 * INF205 * 69120 Heidelberg
T +49 6221 5414404 * F +49 6221 5414427
stefan.frie...@iwr.uni-heidelberg.de
