Thomas,

Here is a quick way to see how a function gets called after MPI_Finalize. In the 
following I will use gdb scripting, but with little effort you can adapt this 
to work with your preferred debugger (lldb, for example; a rough lldb 
equivalent is sketched after Step 1).

The idea is to break on the function that generates the error you see in the 
output, which in Open MPI is ompi_mpi_errors_are_fatal_comm_handler. You will 
need to convince gdb to break on this function, show you the stack, and then 
continue and quit.

Step 1: Prepare the gdb command file (cmd.gdb), which should contain:

# run non-interactively: no confirmation prompts, no pager
set confirm off
set pagination off
# stop where Open MPI raises the fatal-error message
break ompi_mpi_errors_are_fatal_comm_handler
run
# print the offending stack, then let the run finish
bt
disable 1
continue
quit
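
If you prefer lldb, a roughly equivalent command file (say cmd.lldb, passed to 
lldb with -s cmd.lldb) could look like the sketch below; this is untested here 
and only mirrors the gdb script above:

breakpoint set --name ompi_mpi_errors_are_fatal_comm_handler
run
bt
breakpoint disable 1
continue
quit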

Step 2: Prepend a call to the debugger to your application:

mpirun -np 64 <other options> gdb -x cmd.gdb --args <application> <application arguments>
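
As a concrete (hypothetical) example, with an executable ./my_app taking 
input.dat as its argument, and with each rank's output redirected to its own 
file so the 64 gdb sessions do not interleave (assuming your mpirun supports 
the --output-filename option):

mpirun -np 64 --output-filename gdb_out gdb -x cmd.gdb --args ./my_app input.dat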

Step 3: Look for the printed stack in your output to see which function is 
called after MPI_Finalize and from where it is called.
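
If you used the per-rank output files from the example above (the gdb_out 
prefix is just the name assumed there), something like this pulls out the 
backtraces:

grep -A 20 ompi_mpi_errors_are_fatal_comm_handler gdb_out*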

  George.


On Jan 19, 2014, at 17:44, Ralph Castain <r...@open-mpi.org> wrote:

> Hard to say what could be the cause of the problem without a better 
> understanding of the code, but the root cause appears to be some code path 
> that allows you to call an MPI function after you have called MPI_Finalize. 
> From your description, it appears you have a race condition in the code that 
> activates that code path.
> 
> 
> On Jan 19, 2014, at 6:33 AM, thomas.fo...@ulstein.com wrote:
> 
>> Yes. It's a shared NFS partition on the nodes.
>> 
>> Sent from my iPhone
>> 
>> > On 19 Jan 2014, at 13:29, "Reuti" <re...@staff.uni-marburg.de> wrote:
>> > 
>> > Hi,
>> > 
>> > On 18.01.2014 at 22:43, thomas.fo...@ulstein.com wrote:
>> > 
>> > > I have had a cluster running well for a while, and 2 days ago we 
>> > > decided to upgrade it from 128 to 256 cores. 
>> > > 
>> > > Most of my node deployment goes through Cobbler and scripting, and it 
>> > > has worked fine before on the first 8 nodes.
>> > 
>> > The same version of Open MPI is installed also on the new nodes?
>> > 
>> > -- Reuti
>> > 
>> > 
>> > > But after adding the new nodes, everything is broken and I have no idea 
>> > > why :(
>> > > 
>> > > #*** The MPI_Comm_f2c() function was called after MPI_FINALIZE was 
>> > > invoked. 
>> > > *** This is disallowed by the MPI standard. 
>> > > *** Your MPI job will now abort. 
>> > > [dpn10.cfd.local:14994] Local abort after MPI_FINALIZE completed 
>> > > successfully; not able to aggregate error messages, and not able to 
>> > > guarantee that all other processes were killed! 
>> > > *** The MPI_Comm_f2c() function was called after MPI_FINALIZE was 
>> > > invoked. 
>> > > *** This is disallowed by the MPI standard. 
>> > > *** Your MPI job will now abort. 
>> > > # 
>> > > 
>> > > The strange, random issue is that if I launch eight 32-core jobs, 3 end 
>> > > up running while the other 5 die with this error, and the job is even 
>> > > using a few of the new nodes. 
>> > > 
>> > > Any idea what is causing it? It's so random I don't know where to start.
>> > > 
>> > > 
>> > > ./Thomas 
>> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
