Hi David

You are quite correct. IIRC, we didn't bother checking local_err because we found it to be unreliable: all Torque checks is that the program exec's. It doesn't report back an error if the program segfaults instantly, for example, or aborts because it fails to find a required library. So we added a simple timer that declares the launch a failure if the daemon(s) fail to report back within a specified time.
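
For illustration only, the timer idea can be sketched roughly as below. This is not the actual Open MPI code: launch_watch_t, the launch_watch_* functions and the launch_state_t values are hypothetical names made up for this sketch, and the current time is passed in explicitly so the logic can be exercised without waiting.

```c
#include <time.h>

/* Hypothetical bookkeeping for one launch; not Open MPI's real structures. */
typedef struct {
    int    expected;  /* daemons we asked the resource manager to start */
    int    reported;  /* daemons that have called back so far */
    time_t deadline;  /* absolute time after which the launch is declared failed */
} launch_watch_t;

typedef enum { LAUNCH_PENDING, LAUNCH_OK, LAUNCH_TIMED_OUT } launch_state_t;

/* Start watching a launch of `expected` daemons with a timeout in seconds. */
void launch_watch_start(launch_watch_t *w, int expected, time_t now, int timeout_sec)
{
    w->expected = expected;
    w->reported = 0;
    w->deadline = now + timeout_sec;
}

/* Record that one more daemon has reported back. */
void launch_watch_report(launch_watch_t *w)
{
    if (w->reported < w->expected) {
        ++w->reported;
    }
}

/* Decide the launch state at time `now`: success wins over timeout. */
launch_state_t launch_watch_check(const launch_watch_t *w, time_t now)
{
    if (w->reported >= w->expected) {
        return LAUNCH_OK;
    }
    if (now >= w->deadline) {
        return LAUNCH_TIMED_OUT;
    }
    return LAUNCH_PENDING;
}
```

A caller would invoke launch_watch_report() from whatever callback handles a daemon's first message, and call launch_watch_check(&w, time(NULL)) periodically from its event loop. A daemon that segfaults right after exec never reports, so the launch ends up as LAUNCH_TIMED_OUT even though the spawn itself looked fine.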

However, it can't hurt to check the flag as well. I'll test it out first just to ensure we don't get false failures.

Thanks
Ralph

On Aug 12, 2009, at 11:33 PM, David Singleton wrote:


Maybe this should go to the devel list but I'll start here.

In tracking the way the PBS tm API propagates error information
back to clients, I noticed that Open MPI is making an incorrect
assumption.  (I'm looking at 1.3.2.) The relevant code in
orte/mca/plm/tm/plm_tm_module.c is:

   /* TM poll for all the spawns */
   for (i = 0; i < launched; ++i) {
       rc = tm_poll(TM_NULL_EVENT, &event, 1, &local_err);
       if (TM_SUCCESS != rc) {
           errno = local_err;
           opal_output(0, "plm:tm: failed to poll for a spawned daemon,"
                          " return status = %d", rc);
           goto cleanup;
       }
   }

My reading of the way the tm API works is that tm_poll() can (will)
return TM_SUCCESS(0) even when the tm_spawn event being waited on failed,
i.e. local_err needs to be checked even if rc=0.  It looks like TM_
errors (rc values) are from tm protocol failures or incorrect calls
to tm.  local_err is to do with why the actual requested action failed
and is usually some sort of internal PBSE_ error code.  In fact it's
probably always PBSE_SYSTEM (15010) - I think it is for tm_spawn().

Something like the following is probably closer to what is needed.

   /* TM poll for all the spawns */
   for (i = 0; i < launched; ++i) {
       rc = tm_poll(TM_NULL_EVENT, &event, 1, &local_err);
       if (TM_SUCCESS != rc) {
           errno = local_err;
           opal_output(0, "plm:tm: failed to poll for a spawned daemon,"
                          " return status = %d", rc);
           goto cleanup;
       }
       if (local_err != 0) {
           errno = local_err;
           opal_output(0, "plm:tm: failed to spawn daemon,"
                          " error code = %d", errno);
           goto cleanup;
       }
   }
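
As a rough sketch, the two-stage decision in the loop above can be pulled out into a small helper that can be checked without a PBS installation. The names here (spawn_result_t, classify_poll, SKETCH_TM_SUCCESS) are hypothetical and not part of the tm API or Open MPI. SKETCH_TM_SUCCESS stands in for TM_SUCCESS from tm.h, assumed to be 0 as described above.

```c
/* Stand-in for TM_SUCCESS from tm.h (assumed to be 0, as noted in the text). */
#define SKETCH_TM_SUCCESS 0

/* Hypothetical classification of one tm_poll() result. */
typedef enum {
    SPAWN_OK,             /* poll succeeded and the spawn event reported no error */
    SPAWN_TM_FAILURE,     /* tm protocol failure or incorrect call (rc != 0) */
    SPAWN_ACTION_FAILURE  /* poll succeeded but the requested spawn failed */
} spawn_result_t;

/*
 * rc is the return value of tm_poll(); local_err is the value tm_poll()
 * stored through its last argument. rc is examined first because local_err
 * is only meaningful when the poll itself succeeded.
 */
spawn_result_t classify_poll(int rc, int local_err)
{
    if (rc != SKETCH_TM_SUCCESS) {
        return SPAWN_TM_FAILURE;
    }
    if (local_err != 0) {
        return SPAWN_ACTION_FAILURE;
    }
    return SPAWN_OK;
}
```

With this helper, classify_poll(0, 15010) yields SPAWN_ACTION_FAILURE, which is the case the 1.3.2 code treats as success.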

I checked torque 2.3.3 to confirm that its tm behaviour is the same as
OpenPBS in this respect. No idea about PBSPro.


David