I am trying to figure out why my jobs aren't getting distributed, and I need some help. I have an install of Sun cluster tools on Rocks Cluster 5.2 (essentially CentOS 4u2). This user's account has its home directory shared via NFS. I am getting some strange errors. Here's an example run:

[jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/mpirun -np 3 -hostfile list ./job2.sh
bash: /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 20362) died unexpectedly with status 127 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished

[jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/
bin/       examples/  instrument/  man/
etc/       include/   lib/         share/
[jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/orte
orte-clean  orted  orte-iof  orte-ps  orterun
[jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted
[therock.slu.loc:20365] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 125
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[therock.slu.loc:20365] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orted/orted_main.c at line 325
[jian@therock ~]$

-- 
Nehemiah I. Dacres
System Administrator
Advanced Technology Group
Saint Louis University