Hi Ken

I haven't forgotten you. We've been meeting this week, which has limited my
time, but I am working on a replacement for that entire code block that
should resolve the problem. Hope to have it soon.

Ralph


On Thu, Jun 25, 2015 at 2:12 PM, Leiter, Kenneth W CIV USARMY ARL (US) <
kenneth.w.leiter2....@mail.mil> wrote:

>  Hi Ralph,
>
>  I had some time this afternoon to work on this problem further and
> discovered some more info.
>
>  I used valgrind to attach to orted and collected logs of valgrind
> output.
>
>  I get many uninitialized value errors in pmix_server_process_msgs.c
> beginning at line 378. It appears that reply is never allocated. If I add
> "reply = OBJ_NEW(opal_buffer_t);" before filling reply, I get rid of those
> errors from valgrind. Whether that is the correct fix I do not know.
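>
>  For clarity, the change amounts to allocating the buffer before anything
> is packed into it, roughly as in the sketch below. The surrounding code and
> the packed payload are placeholders, not the actual contents of
> pmix_server_process_msgs.c:
>
>      /* Sketch of the change: allocate the reply buffer before packing
>       * into it. The value packed here is only a placeholder. */
>      opal_buffer_t *reply = OBJ_NEW(opal_buffer_t);   /* the added line */
>      int rc, status = OPAL_SUCCESS;
>
>      if (OPAL_SUCCESS != (rc = opal_dss.pack(reply, &status, 1, OPAL_INT))) {
>          ORTE_ERROR_LOG(rc);
>          OBJ_RELEASE(reply);
>      }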
>
>  Unfortunately this doesn't solve my problem with orted crashing. I now
> consistently get a single error detected by valgrind:
>
>  ==29602== Process terminating with default action of signal 11
> (SIGSEGV): dumping core
>
> ==29602==  Access not within mapped region at address 0x48
>
> ==29602==    at 0x4E6E2FA: orte_util_print_name_args (name_fns.c:142)
>
> ==29602==    by 0xCABE394: orte_rml_oob_send_buffer_nb (rml_oob_send.c:269)
>
> ==29602==    by 0x4ED621E: pmix_server_process_message
> (pmix_server_process_msgs.c:421)
>
> ==29602==    by 0x4EC2606: pmix_server_recv_handler
> (pmix_server_sendrecv.c:446)
>
> ==29602==    by 0x528D31C: opal_libevent2022_event_base_loop (event.c:1321)
>
> ==29602==    by 0x4EA3142: orte_daemon (orted_main.c:864)
>
> ==29602==    by 0x401073: main (orted.c:60)
>
>  From the core file I get from orted, I see that req->proxy is NULL in
> pmix_server_process_msgs.c:421. How this arises, I do not know.
>
>  Thanks,
> Ken Leiter
>
>  ------------------------------
> *From:* users [users-boun...@open-mpi.org] on behalf of Ralph Castain [
> r...@open-mpi.org]
> *Sent:* Thursday, June 11, 2015 4:27 PM
>
> *To:* Open MPI Users
> *Subject:* Re: [OMPI users] orted segmentation fault in pmix on master
>
>  Yeah, I’d put my money on a race condition under that scenario. I don’t
> have anything that large I can test on, but I’ll see what I can do.
>
>  On Jun 11, 2015, at 1:17 PM, Leiter, Kenneth W CIV USARMY ARL (US) <
> kenneth.w.leiter2....@mail.mil> wrote:
>
>  Yes, each parent launches ten children and no other parents participate
> in that spawn (i.e. the spawn uses MPI_COMM_SELF as the communicator).
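>
>  In MPI terms, each parent is doing something like the following (shown in
> C for clarity; the executable name and the use of MPI_ERRCODES_IGNORE here
> are just placeholders):
>
>      /* Sketch: each parent spawns 10 children over its own communicator,
>       * so no other parent participates in the spawn. */
>      MPI_Comm children;
>      MPI_Comm_spawn("cpi-worker", MPI_ARGV_NULL, 10, MPI_INFO_NULL,
>                     0 /* root */, MPI_COMM_SELF, &children,
>                     MPI_ERRCODES_IGNORE);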
>
>  No threading.
>
>  I am using the example from
> https://github.com/bfroehle/mpi4py/tree/master/demo/spawning in lieu of my
> actual application, which has a lot more moving parts.
>
>  After rerunning many times, it sometimes completes successfully and other
> times seg faults the daemon.
>
>  - Ken
>
>   ------------------------------
> *From:* users [users-boun...@open-mpi.org] on behalf of Ralph Castain [
> r...@open-mpi.org]
> *Sent:* Thursday, June 11, 2015 4:09 PM
> *To:* Open MPI Users
> *Subject:* Re: [OMPI users] orted segmentation fault in pmix on master
>
>  So to be clear: each parent launches 10 children, and no other parents
> participate in that spawn?
>
>  And there is no threading in the app, yes?
>
>
>  On Jun 11, 2015, at 12:53 PM, Leiter, Kenneth W CIV USARMY ARL (US) <
> kenneth.w.leiter2....@mail.mil> wrote:
>
>  Howard,
>
>  I do not run into a problem when I have one parent spawning many
> children (tested up to 100 child ranks), but am seeing the problem when
> I have, for example, 8 parents launching 10 children each.
>
>  - Ken
>  ------------------------------
> *From:* users [users-boun...@open-mpi.org] on behalf of Howard Pritchard [
> hpprit...@gmail.com]
> *Sent:* Thursday, June 11, 2015 2:36 PM
> *To:* Open MPI Users
> *Subject:* Re: [OMPI users] orted segmentation fault in pmix on master
>
>   Hi Ken,
>
>  Could you post the output of your ompi_info?
>
>  I have PrgEnv-gnu/5.2.56 and gcc/4.9.2 loaded in my env on the NERSC
> system. I used the following configure line:
>
>  ./configure --enable-mpi-java --prefix=my_favorite_install_location
>
>  The general rule of thumb on Crays with master (though not with older
> versions) is that you should be able to do a plain ./configure with an
> install location and be ready to go; no need for complicated platform
> files, etc. just to build vanilla.
>
>  As you're probably guessing, I'm going to say it works for me, at least
> up to 68 slave ranks.
>
>  I do notice there's some glitch with the mapping of the ranks, though.
> The binding logic seems to think there's oversubscription of cores even
> when there should not be. I had to use the
>
>  --bind-to none
>
>  option on the command line once I asked for more than 22 slave ranks.
> The Edison system has 24 cores/node.
>
>  Howard
>
>
>
> 2015-06-11 12:10 GMT-06:00 Leiter, Kenneth W CIV USARMY ARL (US) <
> kenneth.w.leiter2....@mail.mil>:
>
>> I will try on a non-Cray machine as well.
>>
>> - Ken
>>
>> -----Original Message-----
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Howard
>> Pritchard
>> Sent: Thursday, June 11, 2015 12:21 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] orted segmentation fault in pmix on master
>>
>> Hello Ken,
>>
>> Could you give the details of the allocation request (qsub args) as well
>> as the mpirun command line args? I'm trying to reproduce on the NERSC
>> system.
>>
>> It would also be interesting to know whether you get the same problems if
>> you have access to a similarly sized non-Cray cluster.
>>
>> Howard
>>
>>
>> 2015-06-11 9:13 GMT-06:00 Ralph Castain <r...@open-mpi.org>:
>>
>>
>>         I don’t have a Cray, but let me see if I can reproduce this on
>>         something else.
>>
>>         > On Jun 11, 2015, at 7:26 AM, Leiter, Kenneth W CIV USARMY ARL
>>         > (US) <kenneth.w.leiter2....@mail.mil> wrote:
>>         >
>>         > Hello,
>>         >
>>         > I am attempting to use the Open MPI development master for a
>>         > code that uses dynamic process management (i.e. MPI_Comm_spawn)
>>         > on our Cray XC40 at the Army Research Laboratory. After reading
>>         > through the mailing list, I came to the conclusion that the
>>         > master branch is the only hope for getting this to work on the
>>         > newer Cray machines.
>>         >
>>         > To test I am using the cpi-master.c / cpi-worker.c example. The
>>         > test works when executing on a small number of processors, five
>>         > or fewer, but begins to fail with segmentation faults in orted
>>         > when using more processors. Even with five or fewer processors,
>>         > I am spreading the computation to more than one node. I am using
>>         > the Cray uGNI BTL through the ALPS scheduler.
>>         >
>>         > I get a core file from orted and have the seg fault tracked down
>>         > to pmix_server_process_msgs.c:420 where req->proxy is NULL. I
>>         > have tried reading the code to understand how this happens, but
>>         > am unsure. I do see that in the if statement where I take the
>>         > else branch, the other branch specifically checks
>>         > "if (NULL == req->proxy)" - however, no such check is done in
>>         > the else branch.
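>>         >
>>         > A guard along the lines of the sketch below would at least avoid
>>         > the crash, though it only masks whatever leaves req->proxy NULL
>>         > in the first place. The error handling shown is a placeholder,
>>         > not the actual file contents:
>>         >
>>         >     /* Sketch: defensive check before dereferencing req->proxy.
>>         >      * Placeholder error handling only. */
>>         >     if (NULL == req->proxy) {
>>         >         ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND);
>>         >         return;  /* or whatever cleanup the real code requires */
>>         >     }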
>>         >
>>         > I have debug output dumped for the failing runs. I can provide
>>         > the output along with ompi_info output and config.log to anyone
>>         > who is interested.
>>         >
>>         > - Ken Leiter