I have a script that launches a bunch of runs on some compute nodes of
a cluster. Once I get through the queue, I query PBS for my machine
file, then I copy that to a local file 'nodes' which I use for mpiexec:
mpiexec -machinefile /home/research/cary/projects/vpall/vptests/nodes \
  -np 6 /home/research/cary/projects/vpall/builds/vorpal/par/vorpal/vorpal \
  -i bathtubAntenna.in -dim 2 -o bathtubAntenna2p -n 100 -d 100
but this fails with
[node47:07004] [[25769,0],0] ORTE_ERROR_LOG: File open failure in file ../../../../../orte/mca/ras/tm/ras_tm_module.c at line 153
[node47:07004] [[25769,0],0] ORTE_ERROR_LOG: File open failure in file ../../../../../orte/mca/ras/tm/ras_tm_module.c at line 87
[node47:07004] [[25769,0],0] ORTE_ERROR_LOG: File open failure in file ../../../../orte/mca/ras/base/ras_base_allocate.c at line 133
[node47:07004] [[25769,0],0] ORTE_ERROR_LOG: File open failure in file ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 72
[node47:07004] [[25769,0],0] ORTE_ERROR_LOG: File open failure in file ../../../../../orte/mca/plm/tm/plm_tm_module.c at line 167
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
launch so we are aborting.
The relevant code snippet is
/* setup the full path to the PBS file */
filename = opal_os_path(false, mca_ras_tm_component.nodefile_dir,
                        pbs_jobid, NULL);
fp = fopen(filename, "r");
if (NULL == fp) {
    ORTE_ERROR_LOG(ORTE_ERR_FILE_OPEN_FAILURE);
    free(filename);
    return ORTE_ERR_FILE_OPEN_FAILURE;
}
which looks like it might be trying to open the PBS node file for the job
instead of the machine file I gave on the command line. I really don't
know; does anyone have any ideas here?
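To check that suspicion outside of Open MPI, a minimal standalone sketch
of the same open attempt might look like the following. This is not the
Open MPI code: join_path and check_nodefile are hypothetical names,
join_path only stands in for opal_os_path, and the directory and job id
passed in are assumptions about where the site's PBS keeps its per-job
node files and how the job is named.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for opal_os_path(): joins dir and name with '/'.
 * The caller owns the returned string and must free() it. */
char *join_path(const char *dir, const char *name)
{
    size_t len = strlen(dir) + 1 + strlen(name) + 1;
    char *path = malloc(len);

    if (NULL != path) {
        snprintf(path, len, "%s/%s", dir, name);
    }
    return path;
}

/* Try to open <nodefile_dir>/<jobid> for reading, as the snippet does.
 * Returns 0 if the file can be opened, -1 otherwise (reason on stderr). */
int check_nodefile(const char *nodefile_dir, const char *jobid)
{
    char *filename;
    FILE *fp;

    if (NULL == nodefile_dir || NULL == jobid) {
        fprintf(stderr, "missing node file directory or job id\n");
        return -1;
    }
    filename = join_path(nodefile_dir, jobid);
    if (NULL == filename) {
        return -1;
    }
    fp = fopen(filename, "r");
    if (NULL == fp) {
        /* Same failure point as in the snippet, but with the path shown. */
        fprintf(stderr, "cannot open %s: %s\n", filename, strerror(errno));
        free(filename);
        return -1;
    }
    fclose(fp);
    free(filename);
    return 0;
}
```

Calling check_nodefile(dir, getenv("PBS_JOBID")) from a small main run
inside the job (assuming the job id is taken from the PBS_JOBID
environment variable) would print the exact path that cannot be opened,
which should show whether it is the PBS file or my 'nodes' file.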
Thx....John Cary