The application is terminated and an error message is reported:

    mpirun has exited due to process rank 0 with PID 72438 on node Ralph
    exiting improperly. There are two reasons this could occur:

    1. this process did not call "init" before exiting, but others in
       the job did. This can cause a job to hang indefinitely while it
       waits for all processes to call "init". By rule, if one process
       calls "init", then ALL processes must call "init" prior to
       termination.

    2. this process called "init", but exited without calling
       "finalize". By rule, all processes that call "init" MUST call
       "finalize" prior to exiting or it will be considered an
       "abnormal termination".

    This may have caused other processes in the application to be
    terminated by signals sent by mpirun (as reported here).

On Dec 18, 2009, at 8:32 AM, Katz, Jacob wrote:

> Thanks for the fix. What will be the exact behavior after your fix?
>
> Re timeouts: the timeout may be indefinite by default, for compliance
> with the standard. However, apps might optionally use it for their
> convenience, as in my case. There would be no need to guess anything,
> but it would prevent stuck apps. Unlike regular communication, where
> one may implement a timeout mechanism at the application level using
> non-blocking communication, there is no way to implement an app-level
> timeout for the bootstrapping process, since MPI_Init blocks.
>
> --------------------------------
> Jacob M. Katz | jacob.k...@intel.com | Work: +972-4-865-5726 | iNet: (8)-465-5726
>
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Friday, December 18, 2009 16:50
> To: Open MPI Users
> Subject: Re: [OMPI users] How to detect a failure to start-up and MPI_Init()?
>
> Unfortunately, the timeout won't work, as there is no MPI requirement
> to call MPI_Init before any specific point in the application. This
> would create an experimental process of "guessing" the correct timeout
> on an application-by-application basis - ugly.
>
> I have committed code to the OMPI trunk that fixes this problem for
> the general case. I managed to do it without the extra communication,
> though it required a little more complexity in the launch logic.
> Anyway, the problem appears resolved on that code branch. Given the
> required change, I do not expect this to appear in the 1.4 series -
> work is progressing on the first release of the next feature series
> (1.5), and it will be in there. Meantime, you are welcome to use a
> nightly tarball from the devel trunk, as it appears to be in pretty
> good shape right now in prep for the 1.5 branch.
>
> Thanks
> Ralph
>
> On Dec 18, 2009, at 7:06 AM, Katz, Jacob wrote:
>
> Yes, the scenario is as you described: one of the processes didn't
> call MPI_Init and exited "normally". All the rest of the processes got
> stuck forever in MPI_Init.
> Ideally, I would like to have a time-out setting for a process to call
> MPI_Init which, when expired, would indicate a failure to start up (as
> if the process had aborted). The time-out could be indefinite by
> default, for backward compatibility - no extra communication if no
> time-out occurs.
>
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Wednesday, December 16, 2009 05:55
> To: Open MPI Users
> Subject: Re: [OMPI users] How to detect a failure to start-up and MPI_Init()?
>
> Finally got time to look at this - I'm not sure it is a bug, if I
> understand your scenario correctly.
>
> When you say the application exits, do you mean it calls "exit" - or
> do you mean it segfaults or suffers some other abnormal termination?
>
> The reason I ask: if the process has not yet called MPI_Init and
> instead calls "exit", as far as we are concerned that is a normal
> termination. So we note that it happened, but we don't consider it as
> having "aborted" - and hence we don't terminate the job.
>
> If that is indeed the scenario, then trying to resolve it is a tad
> difficult. Although we don't advise it, people frequently have their
> apps do a bunch of stuff prior to calling MPI_Init. So there is no
> timer I can set that would alert me that the job is stuck - it could
> just be waiting for one or more procs to reach MPI_Init (e.g., while
> reading a large input file).
>
> The only thing I can think of would be to (a) detect that other procs
> in the job had called MPI_Init, (b) note that this one did -not- call
> MPI_Init/Finalize prior to terminating, and therefore (c) declare the
> job as having failed.
>
> This might be doable. It is a tad complicated if, for example, there
> is only one proc per node, as the daemons would then have to know that
> other procs (not local to them) had called MPI_Init.
>
> I'll have to ask the MPI folks on the team whether that is something
> we want to do, as it could affect scalability by requiring more
> communication... I'm not sure how this fits into the standard either.
>
> Ralph
>
> On Dec 15, 2009, at 8:47 AM, Katz, Jacob wrote:
>
> Ralph,
> Have you been able to confirm this as a bug?
> Thanks!
>
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Sunday, December 06, 2009 19:24
> To: Open MPI Users
> Subject: Re: [OMPI users] How to detect a failure to start-up and MPI_Init()?
>
> I'll look into it - sounds like a bug
>
> Thanks!
>
> On Sun, Dec 6, 2009 at 9:13 AM, Katz, Jacob <jacob.k...@intel.com> wrote:
>
> I'm using 1.3.3.
> The job isn't aborted in my case when the failing process hasn't
> called MPI_Init. It is aborted if the process has called MPI_Init.
>
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Sunday, December 06, 2009 17:44
> To: Open MPI Users
> Subject: Re: [OMPI users] How to detect a failure to start-up and MPI_Init()?
>
> The system should see that app fail and abort the job - whether it
> calls MPI_Init first or not is irrelevant. What version are you using?
>
> On Sun, Dec 6, 2009 at 8:40 AM, Katz, Jacob <jacob.k...@intel.com> wrote:
>
> Hi,
> Is there a way to detect a situation in which one of the processes in
> an MPI application exits without even calling MPI_Init()?
> I have a case in which all the processes except one are stuck forever
> in MPI_Init(), and that one exits before being able to call
> MPI_Init().
> I tried using the mca params that I thought might be related -
> orte_startup_timeout, orte_abort_timeout - but that didn't help.
>
> Thanks!
>
> ---------------------------------------------------------------------
> Intel Israel (74) Limited
>
> This e-mail and any attachments may contain confidential material for
> the sole use of the intended recipient(s). Any review or distribution
> by others is strictly prohibited. If you are not the intended
> recipient, please contact the sender and delete all copies.
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
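As a practical footnote to the thread: MCA parameters such as the ones Jacob tried are passed on the mpirun command line with --mca, or equivalently through OMPI_MCA_-prefixed environment variables. The application name and timeout value below are illustrative, and, as the thread explains, these particular parameters do not catch a process that exits cleanly before calling MPI_Init.

```shell
# Set an MCA parameter on the mpirun command line (value in seconds, illustrative):
mpirun --mca orte_abort_timeout 10 -np 4 ./my_app

# Equivalent environment-variable form (OMPI_MCA_ prefix + parameter name):
export OMPI_MCA_orte_abort_timeout=10
mpirun -np 4 ./my_app
```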
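Jacob's distinction above - an application-level timeout is possible for regular communication via non-blocking calls, but not for MPI_Init, which blocks unconditionally - can be sketched in plain Python. Everything here is an illustrative stand-in, not MPI API: FakeRequest plays the role of a non-blocking request (e.g. one produced by MPI_Irecv), its test() method plays the role of MPI_Test, and a timer thread stands in for the peer rank.

```python
import threading
import time

class FakeRequest:
    """Stand-in for a non-blocking MPI request; the timer thread
    plays the role of a peer rank completing the operation later."""
    def __init__(self, delay):
        self._done = threading.Event()
        threading.Timer(delay, self._done.set).start()

    def test(self):
        # Analogous to MPI_Test: a non-blocking completion check.
        return self._done.is_set()

def wait_with_timeout(request, timeout, poll_interval=0.01):
    """Poll a non-blocking request until it completes or the deadline
    expires; returns True on completion, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if request.test():
            return True
        time.sleep(poll_interval)
    return request.test()

# An operation completing in 0.05 s beats a 1 s timeout...
print(wait_with_timeout(FakeRequest(0.05), 1.0))  # True
# ...while one completing in 1 s loses to a 0.1 s timeout.
print(wait_with_timeout(FakeRequest(1.0), 0.1))   # False
```

The pattern only works because test() returns immediately; since MPI_Init has no non-blocking variant, an application cannot wrap its startup in such a polling loop - which is why the thread concludes the runtime itself must detect the failed launch.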