Hi,

I have installed openmpi-v3.x-201705250239-d5200ea on my "SUSE Linux
Enterprise Server 12.2 (x86_64)" with Sun C 5.14 and gcc-7.1.0.
Unfortunately, my rankfiles don't work any longer.


loki rankfiles 136 cat rf_loki_nfs1
rank 0=loki slot=0:0-3;1:0-1
rank 1=loki slot=1:2-5
rank 2=nfs1 slot=0:4
rank 3=nfs1 slot=1:5


loki rankfiles 137 mpiexec -report-bindings -np 4 -rf rf_loki_nfs1 hostname
[nfs1:11461] [[41737,0],1] ORTE_ERROR_LOG: Not found in file ../../../../../openmpi-v3.x-201705250239-d5200ea/orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 408
[nfs1:11461] [[41737,0],1] ORTE_ERROR_LOG: Not found in file ../../../../../openmpi-v3.x-201705250239-d5200ea/orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 162
[nfs1:11461] [[41737,0],1] ORTE_ERROR_LOG: Not found in file ../../../../openmpi-v3.x-201705250239-d5200ea/orte/mca/rmaps/base/rmaps_base_map_job.c at line 370
[nfs1:11461] [[41737,0],1] ORTE_ERROR_LOG: Not found in file ../../../../openmpi-v3.x-201705250239-d5200ea/orte/mca/odls/base/odls_base_default_fns.c at line 425
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[41737,0],0] on node loki
  Remote daemon: [[41737,0],1] on node nfs1

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
loki rankfiles 138




I would be grateful if somebody could fix the problem. Do you need anything
else? Thank you very much in advance for any help.


Kind regards

Siegmar
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
