Is your application running on the same machine as mpirun?

How did you configure Open MPI? Note that this program will not work without the FT thread enabled, which would be one reason why it would seem to hang (since the checkpoint request is waiting for the application to enter the MPI library):
  --enable-ft-thread --enable-mpi-threads

I do not think the message that you saw is related. Often orte_checkpoint cannot figure out the jobid on first contact with the HNP/mpirun process, so this is displayed as an INVALID handle.
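
To illustrate the hang: system() blocks the calling process until ompi-checkpoint returns, so without the FT thread the process cannot enter the MPI library while the checkpoint request is pending. Below is a minimal sketch, assuming a POSIX system, of starting the command without blocking the caller. The names spawn_background and poll_child are hypothetical helpers, not part of Open MPI, and this is not a substitute for building with the FT thread enabled:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not an Open MPI API): start `cmd` through /bin/sh
 * in a child process and return immediately with the child's PID,
 * or -1 if fork() fails. */
pid_t spawn_background(const char *cmd)
{
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: hand the command line to the shell so backticks work. */
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);  /* reached only if execl() fails */
    }
    return pid;  /* parent: child's PID, or -1 on error */
}

/* Poll the child without blocking. Returns 1 and stores the exit code in
 * *exit_code once the child has exited normally, 0 while it is still
 * running, and -1 on error or abnormal termination. */
int poll_child(pid_t pid, int *exit_code)
{
    int status;
    pid_t r = waitpid(pid, &status, WNOHANG);

    if (r == 0)
        return 0;
    if (r == pid && WIFEXITED(status)) {
        *exit_code = WEXITSTATUS(status);
        return 1;
    }
    return -1;
}
```

In the program quoted below, trigger_checkpoint() could pass the same command string it currently gives to system() to spawn_background() and return right away, so that the application reaches its next MPI call while ompi-checkpoint is still running. Whether that is enough to let the checkpoint proceed without the FT thread is an assumption I have not verified.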

-- Josh

On Sep 11, 2009, at 9:50 AM, Jean Potsam wrote:


Hi Everyone,
I noticed that it hangs just before displaying the following while trying to checkpoint the application.

############################
[sun06:15252] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
###############################

Can it be related to the above?

Thanks


----------------------------------------------------------------------------------------------------------------------
Hi Everyone,
I wrote a small program with a function to trigger the checkpointing mechanism as follows:

############################################

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>

void trigger_checkpoint(void);

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 10");
    trigger_checkpoint();  /* blocks until ompi-checkpoint returns */
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 10");
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 10");
    printf("bye \n");
    MPI_Finalize();
    return 0;
}

/* Ask the running mpirun to checkpoint this job. */
void trigger_checkpoint(void)
{
    printf("hi\n");
    system("ompi-checkpoint -v `pidof mpirun` ");
}
#############################################


The application works fine on my laptop with Ubuntu as the OS. However, when I tried running it on one of the machines at my uni, with SUSE Linux installed, the application hangs as soon as ompi-checkpoint is triggered. This is what I get:



##########################################################
I am processor no 0 of a total of 1 procs
hi
I am processor no 0 of a total of 1 procs
[sun06:15426] orte_checkpoint: Checkpointing...
[sun06:15426]    PID 15411
[sun06:15426]    Connected to Mpirun [[12727,0],0]
[sun06:15426] orte_checkpoint: notify_hnp: Contact Head Node Process PID 15411
###################################################

Does anyone have any ideas about this?

Thanks a lot

Jean.

