Yes, the scenario is as you described: one of the processes didn't call 
MPI_Init and exited "normally". All the rest of the processes got stuck forever 
in MPI_Init.
Ideally, I would like a time-out setting that bounds how long a process may 
take to call MPI_Init; if it expires, that would indicate a failure to start up 
(as if the processes aborted). The time-out may be indefinite by default, for 
backward compatibility. No extra communication is needed unless the time-out 
expires...
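
A rough application-side sketch of that idea, not an Open MPI feature: each 
process arms a POSIX alarm() before entering its init call and exits with a 
failure status if the call has not returned in time, so a rank stuck in 
MPI_Init bounds its own wait. The names init_with_timeout and stand_in_init 
are hypothetical; stand_in_init only exists so the sketch compiles without an 
MPI installation. It also assumes the MPI library does not use or block 
SIGALRM itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Same shape as MPI_Init(int *argc, char ***argv). */
typedef int (*init_fn_t)(int *argc, char ***argv);

/* SIGALRM handler: report and exit with a failure status.
 * Only async-signal-safe calls (write, _exit) are used here. */
static void on_startup_timeout(int signo)
{
    static const char msg[] = "start-up timed out before init returned\n";
    (void)signo;
    if (write(STDERR_FILENO, msg, sizeof msg - 1) < 0) {
        /* nothing useful can be done about a failed write here */
    }
    _exit(EXIT_FAILURE);
}

/* Hypothetical helper: call init(argc, argv) with a time limit.
 * seconds == 0 arms no alarm, i.e. the wait stays indefinite.
 * Returns init's return value, or -1 if the handler could not be set. */
int init_with_timeout(init_fn_t init, int *argc, char ***argv,
                      unsigned seconds)
{
    struct sigaction sa, old;
    int rc;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_startup_timeout;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, &old) != 0)
        return -1;

    alarm(seconds);                  /* arm the watchdog (0 = none) */
    rc = init(argc, argv);           /* may block, e.g. inside MPI_Init */
    alarm(0);                        /* init returned: disarm */
    sigaction(SIGALRM, &old, NULL);  /* restore the previous handler */
    return rc;
}

/* Stand-in for MPI_Init, used only to exercise the helper without MPI. */
int stand_in_init(int *argc, char ***argv)
{
    (void)argc;
    (void)argv;
    return 0;
}
```

In a real MPI program the call would presumably look like 
init_with_timeout(MPI_Init, &argc, &argv, 60). Nothing extra is communicated 
between processes; each one only watches its own clock.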

--------------------------------
Jacob M. Katz | jacob.k...@intel.com | Work: +972-4-865-5726 | iNet: (8)-465-5726

From: users-boun...@open-mpi.org On Behalf Of Ralph Castain
Sent: Wednesday, December 16, 2009 05:55
To: Open MPI Users
Subject: Re: [OMPI users] How to detect a failure to start-up and MPI_Init()?

Finally got time to look at this - not sure this is a bug, if I understand 
your scenario correctly.

When you say the application exits, do you mean it calls "exit" - or do you 
mean it segfaults or some other such abnormal termination?

Reason I ask: if the process has not yet called MPI_Init and instead calls 
"exit", as far as we are concerned that is a normal termination. So we note 
that it happened, but we don't consider it as having "aborted" - and hence, we 
don't terminate the job.

If that is indeed the scenario, then trying to resolve it is a tad difficult. 
Although we don't advise it, people do frequently have their apps do a bunch of 
stuff prior to calling MPI_Init. So there is no timer I can set that would 
alert me that the job is stuck - could just be waiting for one or more procs to 
reach MPI_Init (e.g., reading a large input file).

Only thing I can think of would be to (a) detect that other procs in the job 
had called MPI_Init, (b) note that this one did -not- call MPI_Init/Finalize 
prior to terminating, and therefore (c) declare the job as having failed.
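
Steps (a) to (c) can be written down as a small decision rule. This is only a 
sketch of the logic, not Open MPI code: the proc_state_t record and the 
function job_failed_at_startup are hypothetical, and it assumes some component 
already has a per-process view of who reached MPI_Init and who has exited.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-process record as a launcher might track it. */
typedef struct {
    bool called_init;  /* process has reached MPI_Init */
    bool exited;       /* process has terminated */
} proc_state_t;

/* Declare the job failed when
 *   (a) at least one process has called MPI_Init, and
 *   (b) at least one process exited without ever calling MPI_Init.
 * A job in which no process calls MPI_Init is left alone, and so is a
 * job whose late process is still running (e.g. reading a large input). */
bool job_failed_at_startup(const proc_state_t *procs, size_t n)
{
    bool any_called_init = false;
    bool any_exited_without_init = false;

    for (size_t i = 0; i < n; i++) {
        if (procs[i].called_init)
            any_called_init = true;
        if (procs[i].exited && !procs[i].called_init)
            any_exited_without_init = true;
    }
    return any_called_init && any_exited_without_init;   /* (c) */
}
```

The rule needs no timer, which is why it sidesteps the slow-start problem 
above, but it does need the init state of every process in one place.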

This might be doable. Tad complicated if, for example, there is only one 
proc/node as now the daemons have to know that other procs (not local to them) 
called MPI_Init.

I'll have to ask the MPI folks on the team if that is something we want to do 
as it could affect scalability by requiring more communication...not sure how 
this fits into the std either.

Ralph


On Dec 15, 2009, at 8:47 AM, Katz, Jacob wrote:


Ralph,
Have you been able to confirm this as a bug?
Thanks!
--------------------------------
Jacob M. Katz | jacob.k...@intel.com | Work: +972-4-865-5726 | iNet: (8)-465-5726

From: users-boun...@open-mpi.org On Behalf Of Ralph Castain
Sent: Sunday, December 06, 2009 19:24
To: Open MPI Users
Subject: Re: [OMPI users] How to detect a failure to start-up and MPI_Init()?

I'll look into it - sounds like a bug

Thanks!
On Sun, Dec 6, 2009 at 9:13 AM, Katz, Jacob <jacob.k...@intel.com> wrote:
I'm using 1.3.3.
The job isn't aborted in my case when the failing process hasn't called 
MPI_Init... It is aborted if the process has called MPI_Init...

--------------------------------
Jacob M. Katz | jacob.k...@intel.com | Work: +972-4-865-5726 | iNet: (8)-465-5726

From: users-boun...@open-mpi.org On Behalf Of Ralph Castain
Sent: Sunday, December 06, 2009 17:44
To: Open MPI Users
Subject: Re: [OMPI users] How to detect a failure to start-up and MPI_Init()?

The system should see that app fail and abort the job - whether it calls 
MPI_Init first or not is irrelevant. What version are you using?
On Sun, Dec 6, 2009 at 8:40 AM, Katz, Jacob <jacob.k...@intel.com> wrote:
Hi,
Is there a way to detect a situation in which one of the processes in an MPI 
application exits without even calling MPI_Init()?
I have a case in which all the processes except one are stuck forever in 
MPI_Init(), and that one exits before being able to call MPI_Init()...
I tried using the mca params that I thought might be related - 
orte_startup_timeout, orte_abort_timeout, but that didn't help.

Thanks!
--------------------------------
Jacob M. Katz | jacob.k...@intel.com | Work: +972-4-865-5726 | iNet: (8)-465-5726


---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


