Unfortunately, a timeout won't work: MPI does not require an application to call MPI_Init by any specific point in its execution. Users would have to "guess" the correct timeout by experiment on an application-by-application basis - ugly.

I have committed code to the OMPI trunk that fixes this problem for the general case. Managed to do it without the extra communication, though it required a little more complexity in the launch logic. Anyway, the problem appears resolved on that code branch.

Given the required change, I do not expect this to appear in the 1.4 series - work is progressing on the first release of the next feature series (1.5), and it will be in there. Meantime, you are welcome to use a nightly tarball from the devel trunk as it appears to be in pretty good shape right now in prep for the 1.5 branch.

Thanks
Ralph

On Dec 18, 2009, at 7:06 AM, Katz, Jacob wrote:

> Yes, the scenario is as you described: one of the processes didn’t call
> MPI_Init and exited “normally”. All the rest of the processes got stuck
> forever in MPI_Init.
> Ideally, I would like to have a time-out setting for a process to call
> MPI_Init, which when expired would indicate a failure to start-up (as if the
> processes aborted). The time-out may be indefinite by default, for backward
> compatibility. No extra communication if no time-out happens…
>
> --------------------------------
> Jacob M. Katz | jacob.k...@intel.com | Work: +972-4-865-5726 | iNet: (8)-465-5726
>
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Wednesday, December 16, 2009 05:55
> To: Open MPI Users
> Subject: Re: [OMPI users] How to detect a failure to start-up and MPI_Init()?
>
> Finally got time to look at this - not sure this is a bug, if I understand
> correctly your scenario.
>
> When you say the application exits, do you mean it calls "exit" - or do you
> mean it segfaults or some other such abnormal termination?
>
> Reason I ask: if the process has not yet called MPI_Init and instead calls
> "exit", as far as we are concerned that is a normal termination. So we note
> that it happened, but we don't consider it as having "aborted" - and hence,
> we don't terminate the job.
>
> If that is indeed the scenario, then trying to resolve it is a tad difficult.
> Although we don't advise it, people do frequently have their apps do a bunch
> of stuff prior to calling MPI_Init. So there is no timer I can set that would
> alert me that the job is stuck - could just be waiting for one or more procs
> to reach MPI_Init (e.g., reading a large input file).
>
> Only thing I can think of would be to (a) detect that other procs in the job
> had called MPI_Init, (b) note that this one did -not- call MPI_Init/Finalize
> prior to terminating, and therefore (c) declare the job as having failed.
>
> This might be doable. Tad complicated if, for example, there is only one
> proc/node as now the daemons have to know that other procs (not local to
> them) called MPI_Init.
>
> I'll have to ask the MPI folks on the team if that is something we want to do
> as it could affect scalability by requiring more communication...not sure how
> this fits into the std either.
>
> Ralph
>
> On Dec 15, 2009, at 8:47 AM, Katz, Jacob wrote:
>
> Ralph,
> Have you been able to confirm this as a bug?
> Thanks!
> --------------------------------
> Jacob M. Katz | jacob.k...@intel.com | Work: +972-4-865-5726 | iNet: (8)-465-5726
>
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Sunday, December 06, 2009 19:24
> To: Open MPI Users
> Subject: Re: [OMPI users] How to detect a failure to start-up and MPI_Init()?
>
> I'll look into it - sounds like a bug
>
> Thanks!
>
> On Sun, Dec 6, 2009 at 9:13 AM, Katz, Jacob <jacob.k...@intel.com> wrote:
> I’m using 1.3.3.
> The job isn’t aborted in my case when the failing process hasn’t called
> MPI_Init… It is aborted if the process has called MPI_Init…
>
> --------------------------------
> Jacob M. Katz | jacob.k...@intel.com | Work: +972-4-865-5726 | iNet: (8)-465-5726
>
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Sunday, December 06, 2009 17:44
> To: Open MPI Users
> Subject: Re: [OMPI users] How to detect a failure to start-up and MPI_Init()?
>
> The system should see that app fail and abort the job - whether it calls
> MPI_Init first or not is irrelevant. What version are you using?
>
> On Sun, Dec 6, 2009 at 8:40 AM, Katz, Jacob <jacob.k...@intel.com> wrote:
> Hi,
> Is there a way to detect a situation in which one of the processes in an MPI
> application exits without even calling MPI_Init()?
> I have a case in which all the processes except one are stuck forever in
> MPI_Init(), and that one exits before being able to call MPI_Init()…
> I tried using the mca params that I thought might be related -
> orte_startup_timeout, orte_abort_timeout, but that didn’t help.
>
> Thanks!
> --------------------------------
> Jacob M. Katz | jacob.k...@intel.com | Work: +972-4-865-5726 | iNet: (8)-465-5726
>
> ---------------------------------------------------------------------
> Intel Israel (74) Limited
>
> This e-mail and any attachments may contain confidential material for
> the sole use of the intended recipient(s). Any review or distribution
> by others is strictly prohibited. If you are not the intended
> recipient, please contact the sender and delete all copies.
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
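
For readers following the thread: the detection rule Ralph outlines in his Dec 16 message, steps (a) through (c), can be sketched as a small decision function. This is only an illustration of the logic as described in the email, not the code committed to the OMPI trunk. The names `Proc` and `job_failed` are hypothetical, and the sketch assumes some launcher-side bookkeeping already records, per process, whether it has called MPI_Init and whether it has exited.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    """Hypothetical per-process record kept by the launcher (an assumption)."""
    rank: int
    called_init: bool = False   # process has reached MPI_Init
    exited: bool = False        # process has terminated, with any exit status


def job_failed(procs):
    """Apply rule (a)-(c) from the thread to a list of Proc records."""
    # (a) some process in the job has called MPI_Init
    someone_called_init = any(p.called_init for p in procs)
    # (b) some process terminated without ever calling MPI_Init
    exited_before_init = any(p.exited and not p.called_init for p in procs)
    # (c) both together: the remaining procs would wait in MPI_Init forever
    return someone_called_init and exited_before_init


# Jacob's scenario: rank 0 waits in MPI_Init, rank 1 exited before calling it.
print(job_failed([Proc(0, called_init=True), Proc(1, exited=True)]))

# A slow starter (e.g., still reading a large input file) is not a failure.
print(job_failed([Proc(0, called_init=True), Proc(1)]))
```

Unlike a timeout, this rule never fires for a process that is merely slow to reach MPI_Init, which is the case Ralph raises against the timer approach. The cost he mentions remains: with one proc per node, each daemon would need to learn that non-local procs called MPI_Init.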