Unfortunately, a timeout won't work, as MPI does not require an application to 
call MPI_Init before any specific point in its execution. Users would have to 
"guess" the correct timeout by experiment on an application-by-application 
basis - ugly.

I have committed code to the OMPI trunk that fixes this problem for the general 
case. Managed to do it without the extra communication, though it required a 
little more complexity in the launch logic.
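
For anyone following along, the rule discussed further down this thread (some 
other proc has called MPI_Init, while this one terminated without ever calling 
it) can be sketched roughly as below. This is only an illustration, not the 
code committed to the trunk: the names ProcRecord and job_should_abort are 
hypothetical, and the sketch assumes the launcher already has a complete view 
of every proc, which glosses over how the daemons learn about remote procs.

```python
from dataclasses import dataclass


@dataclass
class ProcRecord:
    """Hypothetical per-process bookkeeping kept by the launcher."""
    rank: int
    called_init: bool = False   # proc has entered MPI_Init
    exited: bool = False        # proc has terminated (normally or not)


def job_should_abort(procs):
    """Return True if the job can never get out of MPI_Init.

    (a) at least one proc has called MPI_Init, and
    (b) some proc terminated without ever calling MPI_Init, so
    (c) the procs waiting in MPI_Init would block forever.
    """
    any_in_init = any(p.called_init for p in procs)
    exited_without_init = any(p.exited and not p.called_init for p in procs)
    return any_in_init and exited_without_init


if __name__ == "__main__":
    # Rank 1 called exit() before MPI_Init; ranks 0 and 2 are waiting in it.
    job = [
        ProcRecord(0, called_init=True),
        ProcRecord(1, exited=True),
        ProcRecord(2, called_init=True),
    ]
    print("abort job:", job_should_abort(job))
```

Note that a proc which is merely slow (e.g., still reading a large input file) 
has not exited, so it does not trigger the rule, and a job in which no proc 
ever calls MPI_Init is left alone as well.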

Anyway, the problem appears resolved on that code branch. Given the required 
change, I do not expect this to appear in the 1.4 series - work is progressing 
on the first release of the next feature series (1.5), and it will be in there. 
Meantime, you are welcome to use a nightly tarball from the devel trunk as it 
appears to be in pretty good shape right now in prep for the 1.5 branch.

Thanks
Ralph

On Dec 18, 2009, at 7:06 AM, Katz, Jacob wrote:

> Yes, the scenario is as you described: one of the processes didn’t call 
> MPI_Init and exited “normally”. All the rest of the processes got stuck 
> forever in MPI_Init.
> Ideally, I would like to have a time-out setting for a process to call 
> MPI_Init, which when expired would indicate a failure to start-up (as if the 
> processes aborted). The time-out may be indefinite by default, for backward 
> compatibility. No extra communication if no time-out happens…
>  
> --------------------------------
> Jacob M. Katz | jacob.k...@intel.com | Work: +972-4-865-5726 | iNet: 
> (8)-465-5726
>  
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> Behalf Of Ralph Castain
> Sent: Wednesday, December 16, 2009 05:55
> To: Open MPI Users
> Subject: Re: [OMPI users] How to detect a failure to start-up and MPI_Init()?
>  
> Finally got time to look at this - not sure this is a bug, if I understand 
> your scenario correctly.
>  
> When you say the application exits, do you mean it calls "exit" - or do you 
> mean it segfaults or some other such abnormal termination?
>  
> Reason I ask: if the process has not yet called MPI_Init and instead calls 
> "exit", as far as we are concerned that is a normal termination. So we note 
> that it happened, but we don't consider it as having "aborted" - and hence, 
> we don't terminate the job.
>  
> If that is indeed the scenario, then trying to resolve it is a tad difficult. 
> Although we don't advise it, people do frequently have their apps do a bunch 
> of stuff prior to calling MPI_Init. So there is no timer I can set that would 
> alert me that the job is stuck - could just be waiting for one or more procs 
> to reach MPI_Init (e.g., reading a large input file).
>  
> Only thing I can think of would be to (a) detect that other procs in the job 
> had called MPI_Init, (b) note that this one did -not- call MPI_Init/Finalize 
> prior to terminating, and therefore (c) declare the job as having failed.
>  
> This might be doable. Tad complicated if, for example, there is only one 
> proc/node as now the daemons have to know that other procs (not local to 
> them) called MPI_Init.
>  
> I'll have to ask the MPI folks on the team if that is something we want to do 
> as it could affect scalability by requiring more communication...not sure how 
> this fits into the std either.
>  
> Ralph
>  
>  
> On Dec 15, 2009, at 8:47 AM, Katz, Jacob wrote:
> 
> 
> Ralph,
> Have you been able to confirm this as a bug?
> Thanks!
> --------------------------------
> Jacob M. Katz | jacob.k...@intel.com | Work: +972-4-865-5726 | iNet: 
> (8)-465-5726
>  
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> Behalf Of Ralph Castain
> Sent: Sunday, December 06, 2009 19:24
> To: Open MPI Users
> Subject: Re: [OMPI users] How to detect a failure to start-up and MPI_Init()?
>  
> I'll look into it - sounds like a bug
> 
> Thanks!
> 
> On Sun, Dec 6, 2009 at 9:13 AM, Katz, Jacob <jacob.k...@intel.com> wrote:
> I’m using 1.3.3.
> The job isn’t aborted in my case when the failing process hasn’t called 
> MPI_Init… It is aborted if the process has called MPI_Init…
>  
> --------------------------------
> Jacob M. Katz | jacob.k...@intel.com | Work: +972-4-865-5726 | iNet: 
> (8)-465-5726
>  
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> Behalf Of Ralph Castain
> Sent: Sunday, December 06, 2009 17:44
> To: Open MPI Users
> Subject: Re: [OMPI users] How to detect a failure to start-up and MPI_Init()?
>  
> The system should see that app fail and abort the job - whether it calls 
> MPI_Init first or not is irrelevant. What version are you using?
> 
> On Sun, Dec 6, 2009 at 8:40 AM, Katz, Jacob <jacob.k...@intel.com> wrote:
> Hi,
> Is there a way to detect a situation in which one of the processes in an MPI 
> application exits without even calling MPI_Init()?
> I have a case in which all the processes except one are stuck forever in 
> MPI_Init(), and that one exits before being able to call MPI_Init()…
> I tried using the mca params that I thought might be related - 
> orte_startup_timeout, orte_abort_timeout, but that didn’t help.
>  
> Thanks!
> --------------------------------
> Jacob M. Katz | jacob.k...@intel.com | Work: +972-4-865-5726 | iNet: 
> (8)-465-5726
>  
> ---------------------------------------------------------------------
> Intel Israel (74) Limited
>  
> This e-mail and any attachments may contain confidential material for
> the sole use of the intended recipient(s). Any review or distribution
> by others is strictly prohibited. If you are not the intended
> recipient, please contact the sender and delete all copies.
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
