Hi David
You are quite correct. IIRC, we didn't bother checking local_err
because we found it to be unreliable: all that Torque checks is that
the program execs. It doesn't report an error if the program segfaults
instantly, for example, or aborts because it fails to find a required
library. So we added a simple timer that declares the launch a failure
if the daemon(s) fail to report back within a specified time.
However, it can't hurt to check the flag as well. I'll test it out
first just to ensure we don't get false failures.
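As a rough illustration of the timer idea described above, here is a minimal sketch. It is not the actual Open MPI code: launch_watch_t and all of the function names are hypothetical, and the caller is assumed to supply the current time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical launch tracker; none of these names come from Open MPI. */
typedef struct {
    int    expected;  /* daemons we asked the resource manager to start */
    int    reported;  /* daemons that have reported back so far */
    time_t deadline;  /* absolute time after which the launch is abandoned */
} launch_watch_t;

/* Start watching a launch of `expected` daemons with a timeout in seconds. */
void launch_watch_start(launch_watch_t *w, int expected,
                        time_t now, int timeout_sec)
{
    assert(w != NULL && expected >= 0 && timeout_sec >= 0);
    w->expected = expected;
    w->reported = 0;
    w->deadline = now + timeout_sec;
}

/* Record one daemon reporting back; extra reports are ignored. */
void launch_watch_report(launch_watch_t *w)
{
    if (w->reported < w->expected) {
        w->reported++;
    }
}

/* True once every expected daemon has reported. */
bool launch_watch_complete(const launch_watch_t *w)
{
    return w->reported == w->expected;
}

/* True if the deadline has passed while daemons are still missing. */
bool launch_watch_failed(const launch_watch_t *w, time_t now)
{
    return !launch_watch_complete(w) && now >= w->deadline;
}
```

A launcher would call launch_watch_report() whenever a daemon checks in and poll launch_watch_failed() with the current time. A daemon that execs and then segfaults never reports, so it is caught by the deadline even though the spawn itself looked successful.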
Thanks
Ralph
On Aug 12, 2009, at 11:33 PM, David Singleton wrote:
Maybe this should go to the devel list but I'll start here.
In tracking the way the PBS tm API propagates error information
back to clients, I noticed that Open MPI is making an incorrect
assumption. (I'm looking at 1.3.2.) The relevant code in
orte/mca/plm/tm/plm_tm_module.c is:
    /* TM poll for all the spawns */
    for (i = 0; i < launched; ++i) {
        rc = tm_poll(TM_NULL_EVENT, &event, 1, &local_err);
        if (TM_SUCCESS != rc) {
            errno = local_err;
            opal_output(0, "plm:tm: failed to poll for a spawned daemon,"
                        " return status = %d", rc);
            goto cleanup;
        }
    }
My reading of the way the tm API works is that tm_poll() can (will)
return TM_SUCCESS (0) even when the tm_spawn event being waited on
failed, i.e. local_err needs to be checked even if rc=0. It looks like
TM_ errors (rc values) come from tm protocol failures or incorrect calls
to tm. local_err describes why the actual requested action failed
and is usually some sort of internal PBSE_ error code. In fact it's
probably always PBSE_SYSTEM (15010) - I think it is for tm_spawn().
Something like the following is probably closer to what is needed.
    /* TM poll for all the spawns */
    for (i = 0; i < launched; ++i) {
        rc = tm_poll(TM_NULL_EVENT, &event, 1, &local_err);
        if (TM_SUCCESS != rc) {
            errno = local_err;
            opal_output(0, "plm:tm: failed to poll for a spawned daemon,"
                        " return status = %d", rc);
            goto cleanup;
        }
        if (local_err != 0) {
            errno = local_err;
            opal_output(0, "plm:tm: failed to spawn daemon,"
                        " error code = %d", errno);
            goto cleanup;
        }
    }
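To make the distinction between the two error channels explicit, the same logic can be sketched as a standalone helper. This is an illustration only: classify_spawn(), spawn_outcome_t and SKETCH_TM_SUCCESS are hypothetical names, the sketch does not include tm.h, and it assumes only that TM_SUCCESS is 0 as noted above.

```c
#include <assert.h>

/* Stand-in for TM_SUCCESS from tm.h, which this message gives as 0. */
#define SKETCH_TM_SUCCESS 0

typedef enum {
    SPAWN_OK,              /* poll succeeded and the spawn itself succeeded */
    SPAWN_PROTOCOL_ERROR,  /* the poll call itself failed (rc != TM_SUCCESS) */
    SPAWN_ACTION_ERROR     /* poll succeeded, but the spawn event reported an error */
} spawn_outcome_t;

/* Classify the result of one poll: rc first, then local_err. */
spawn_outcome_t classify_spawn(int rc, int local_err)
{
    if (rc != SKETCH_TM_SUCCESS) {
        return SPAWN_PROTOCOL_ERROR;
    }
    if (local_err != 0) {
        return SPAWN_ACTION_ERROR;
    }
    return SPAWN_OK;
}

/* Usage: the three cases. 15010 is PBSE_SYSTEM as quoted in this message;
 * the nonzero rc value 1 is arbitrary. */
void classify_spawn_examples(void)
{
    assert(classify_spawn(0, 0) == SPAWN_OK);
    assert(classify_spawn(1, 0) == SPAWN_PROTOCOL_ERROR);
    assert(classify_spawn(0, 15010) == SPAWN_ACTION_ERROR);
}
```

The third case is the one the current code misses: rc is 0, so only the local_err check catches the failed spawn.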
I checked torque 2.3.3 to confirm that its tm behaviour is the same
as OpenPBS in this respect. No idea about PBSPro.
David