Hi,

I used the following options for "configure" in openmpi-1.9a1r27380 and I get "MPI_THREAD_MULTIPLE":

  --enable-cxx-exceptions \
  --enable-mpi-java \
  --enable-heterogeneous \
  --enable-opal-multi-threads \
  --enable-mpi-thread-multiple \
  --with-threads=posix \
  --with-hwloc=internal \
  --without-verbs \
  --without-udapl \
  --with-wrapper-cflags=-m64 \
  --enable-debug

tyr small_prog 125 mpiexec thread_support
I have requested MPI_THREAD_MULTIPLE in "MPI_Init_thread()" and it provides MPI_THREAD_MULTIPLE.
"MPI_Query_thread()" returned MPI_THREAD_MULTIPLE:
  many threads are supported and any thread may call MPI functions at any time.
"MPI_Is_thread_main()" returned: "true".

Kind regards
Siegmar
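For reference, a minimal sketch of the thread-level check described above. This is not Siegmar's actual thread_support program, just an illustration of the three calls it appears to use (MPI_Init_thread, MPI_Query_thread, MPI_Is_thread_main); the file name is made up:

    /* thread_check.c - minimal sketch of querying the provided thread level. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int provided, claimed, is_main;

        /* Request the highest level; MPI may grant something lower. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        MPI_Query_thread(&claimed);     /* should match "provided"                         */
        MPI_Is_thread_main(&is_main);   /* true in the thread that called MPI_Init_thread  */

        printf("provided = %d, queried = %d, main thread = %s\n",
               provided, claimed, is_main ? "true" : "false");
        printf("MPI_THREAD_MULTIPLE = %d\n", MPI_THREAD_MULTIPLE);

        MPI_Finalize();
        return 0;
    }

Built with mpicc and run under mpiexec, the printed values can be compared directly against the MPI_THREAD_* constants to see what the library actually granted.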
> If you ask for thread multiple, I believe we return thread funneled
> or thread serialized. You can check, though - I might be remembering
> wrong, but I'm pretty sure that's true.
>
> Sent from my iPad
>
> On Oct 9, 2012, at 7:09 AM, Brian Budge <brian.bu...@gmail.com> wrote:
>
> > Hi Ralph -
> >
> > Is this really true? I've been using thread_multiple in my openmpi
> > programs for quite some time... There may be known cases where it
> > will not work, but for vanilla MPI use, it seems good to go. That's
> > not to say that you can't create your own deadlock if you're not
> > careful, but those are cases where you'd expect deadlock. What
> > specifically is unsupported about thread_multiple?
> >
> > Brian
> >
> > On Tue, Oct 9, 2012 at 6:30 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >> We don't support thread_multiple, I'm afraid. Only thread_funneled, so
> >> you'll have to architect things so that each process can perform all its
> >> MPI actions inside of a single thread.
> >>
> >> On Tue, Oct 9, 2012 at 6:10 AM, Hodge, Gary C <gary.c.ho...@lmco.com> wrote:
> >>>
> >>> FYI, I implemented the harvesting thread but found out quickly that my
> >>> installation of Open MPI does not have MPI_THREAD_MULTIPLE support.
> >>>
> >>> My worker thread still does MPI_Send calls to move the data to the next
> >>> process.
> >>>
> >>> So I am going to download 1.6.2 today, configure it with
> >>> --enable-thread-multiple, and try again.
> >>>
> >>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> >>> Sent: Thursday, October 04, 2012 8:10 PM
> >>> To: Open MPI Users
> >>> Subject: Re: [OMPI users] EXTERNAL: Re: unacceptable latency in gathering process
> >>>
> >>> Sorry for the delayed response - been on the road all day.
> >>>
> >>> Usually we use the standard NetPIPE, IMB, and other benchmarks to measure
> >>> latency. IIRC, these are all point-to-point measurements - i.e., they
> >>> measure the latency for a single process sending to one other process
> >>> (typically on the order of a couple of microseconds). The tests may have
> >>> multiple processes running, but they don't have one process receiving
> >>> messages from multiple senders.
> >>>
> >>> You will, of course, see increased delays in that scenario just due to
> >>> cycle time - we give you a message, but cannot give you another one until
> >>> you return from our delivery callback. So the longer you spend in the
> >>> callback, the slower we go.
> >>>
> >>> In one use case I recently helped with, we had a "harvesting" thread that
> >>> simply reaped the messages from the MPI callback and stuffed them into a
> >>> multi-threaded processing queue. This minimized the MPI "latency", but of
> >>> course the overall throughput depended on the speed of the follow-on queue.
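A minimal sketch of the harvesting pattern just described, assuming a simple mutex-protected queue: one thread owns every MPI call (so MPI_THREAD_FUNNELED is enough) and does nothing but enqueue what it receives, while worker threads do the expensive processing. MSG_SIZE, MSGS_PER_SENDER, NUM_WORKERS, and the queue itself are illustrative assumptions, not the code from that use case:

    /* harvest.c - sketch of the harvesting pattern: rank 0 runs one thread that
     * does all MPI receives and enqueues the messages; worker threads drain the
     * queue and do the processing. Sizes and counts are illustrative. */
    #include <mpi.h>
    #include <pthread.h>
    #include <stdlib.h>

    #define MSG_SIZE        1024
    #define MSGS_PER_SENDER 100
    #define NUM_WORKERS     4

    typedef struct item { char data[MSG_SIZE]; struct item *next; } item_t;

    static item_t *queue_head = NULL;
    static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
    static int done = 0;

    static void *worker(void *arg)       /* consumes messages off the MPI thread */
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (queue_head == NULL && !done)
                pthread_cond_wait(&ready, &lock);
            if (queue_head == NULL) { pthread_mutex_unlock(&lock); return NULL; }
            item_t *it = queue_head;
            queue_head = it->next;
            pthread_mutex_unlock(&lock);
            /* ... expensive processing of it->data happens here ... */
            free(it);
        }
    }

    int main(int argc, char *argv[])
    {
        int provided, rank, nprocs, i;

        /* Only the main thread calls MPI, so MPI_THREAD_FUNNELED is sufficient. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank != 0) {                 /* senders: just pump messages at rank 0 */
            char buf[MSG_SIZE] = {0};
            for (i = 0; i < MSGS_PER_SENDER; i++)
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            MPI_Finalize();
            return 0;
        }

        pthread_t workers[NUM_WORKERS];
        for (i = 0; i < NUM_WORKERS; i++)
            pthread_create(&workers[i], NULL, worker, NULL);

        /* Harvesting loop: receive, enqueue, go straight back to MPI. */
        for (i = 0; i < MSGS_PER_SENDER * (nprocs - 1); i++) {
            item_t *it = malloc(sizeof(*it));
            MPI_Recv(it->data, MSG_SIZE, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            pthread_mutex_lock(&lock);
            it->next = queue_head;       /* LIFO push for brevity; a real queue would be FIFO */
            queue_head = it;
            pthread_cond_signal(&ready);
            pthread_mutex_unlock(&lock);
        }

        pthread_mutex_lock(&lock);       /* tell the workers no more messages are coming */
        done = 1;
        pthread_cond_broadcast(&ready);
        pthread_mutex_unlock(&lock);
        for (i = 0; i < NUM_WORKERS; i++)
            pthread_join(workers[i], NULL);

        MPI_Finalize();
        return 0;
    }

Compile with mpicc (adding -pthread if needed) and run with several ranks; ranks other than 0 only act as senders, and the number of worker threads can be scaled to the cores available on the gathering node.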
> >>> In our case, we only had one process running on each node (like you), and
> >>> had lots of cores on the node - so we cranked up the threads in the
> >>> processing queue and rammed the data through the pipe.
> >>>
> >>> Your design looks similar, so you might benefit from a similar approach.
> >>> Just don't try to have multiple MPI callbacks each sitting in a separate
> >>> thread, as thread support in MPI isn't good - better to have a single
> >>> thread handling the MPI stuff, and then push it into a queue that multiple
> >>> threads can access.
> >>>
> >>> Anyway, glad that helped diagnose the issue.
> >>> Ralph
> >>>
> >>> On Thu, Oct 4, 2012 at 6:55 AM, Hodge, Gary C <gary.c.ho...@lmco.com> wrote:
> >>>
> >>> Once I read your comment, Ralph, about this being "orders of magnitude
> >>> worse than anything we measure", I knew it had to be our problem.
> >>>
> >>> We already had some debug code in place to measure when we send and when
> >>> we receive over MPI. I turned this code on and ran with 12 slaves instead
> >>> of 4.
> >>>
> >>> Our debug showed that once an SP does a send, it is received at the GP in
> >>> less than 1 ms. I then decided to take a close look at when each SP was
> >>> sending a message.
> >>>
> >>> It turns out that the first 9 slaves send out messages at very regular
> >>> intervals, but the last 3 slaves have 200 - 600 ms delays in sending out a
> >>> message.
> >>>
> >>> It could be that our SPs have a problem when many are running at once. It
> >>> is also interesting to note that the first 9 slaves run on the same blade
> >>> chassis as the GP and the last 3 SPs run on our second blade chassis. I
> >>> will later experiment with the placement of our SPs across chassis to see
> >>> whether this is an important factor.
> >>>
> >>> When I first reported this problem, I had only turned on debug in the
> >>> receiving GP process. The latency I was seeing then was the difference
> >>> between when I received a message from the 10th slave and when I received
> >>> the last message from the 10th slave. The time we use for our debug comes
> >>> from an MPI_Wtime call.
> >>>
> >>> Ralph, for my future reference, could you share how many processes were
> >>> sending to a single process in your testing, and what was the size of the
> >>> messages sent?
> >>>
> >>> Hristo, thanks for your input; I had already spent a few days searching
> >>> the FAQs and tuning guides before posting.
> >>>
> >>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> >>> Sent: Wednesday, October 03, 2012 4:01 PM
> >>> To: Open MPI Users
> >>> Subject: Re: [OMPI users] EXTERNAL: Re: unacceptable latency in gathering process
> >>>
> >>> Hmmm... you probably can't without digging down into the diagnostics.
> >>>
> >>> Perhaps we could help more if we had some idea how you are measuring this
> >>> "latency". I ask because that is orders of magnitude worse than anything we
> >>> measure - so I suspect the problem is in your app (i.e., that the time you
> >>> are measuring is actually how long it takes you to get around to processing
> >>> a message that was received some time ago).
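A minimal sketch of the kind of timestamping described above, assuming MPI_Wtime is taken immediately before the send and immediately after the matching receive completes. The tag and payload layout are made-up assumptions; also note that timestamps from different nodes are only directly comparable when the clocks are synchronized (the predefined MPI_WTIME_IS_GLOBAL attribute indicates whether MPI_Wtime is global):

    /* wtime_stamp.c - sketch of timestamping a send and its receipt with MPI_Wtime.
     * In the real application the timestamps would go to a debug log instead. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank;
        double payload[2];              /* [0] = send timestamp, [1] = dummy data */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1) {
            payload[0] = MPI_Wtime();   /* stamp immediately before the send */
            payload[1] = 42.0;
            MPI_Send(payload, 2, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD);
        } else if (rank == 0) {
            MPI_Recv(payload, 2, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            double recv_time = MPI_Wtime();  /* stamp as soon as the receive completes */
            /* Only meaningful if the two clocks are comparable, e.g. the same node
             * or a cluster with synchronized clocks (check MPI_WTIME_IS_GLOBAL). */
            printf("send-to-receive latency: %.6f s\n", recv_time - payload[0]);
        }

        MPI_Finalize();
        return 0;
    }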
> >>> On Oct 3, 2012, at 11:52 AM, "Hodge, Gary C" <gary.c.ho...@lmco.com> wrote:
> >>>
> >>> How do I tell the difference between when the message was received and
> >>> when the message was picked up in MPI_Test?
> >>>
> >>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> >>> Sent: Wednesday, October 03, 2012 1:00 PM
> >>> To: Open MPI Users
> >>> Subject: EXTERNAL: Re: [OMPI users] unacceptable latency in gathering process
> >>>
> >>> Out of curiosity, have you logged the time when the SP called "send" and
> >>> compared it to the time when the message was received, and when that
> >>> message is picked up in MPI_Test? In other words, have you actually
> >>> verified that the delay is in the MPI library as opposed to in your
> >>> application?
> >>>
> >>> On Oct 3, 2012, at 9:40 AM, "Hodge, Gary C" <gary.c.ho...@lmco.com> wrote:
> >>>
> >>> Hi all,
> >>> I am running on an IBM BladeCenter, using Open MPI 1.4.1 and the opensm
> >>> subnet manager for InfiniBand.
> >>>
> >>> Our application has real-time requirements, and it has recently been
> >>> proven that it does not scale to meet future requirements. Presently, I am
> >>> reorganizing the application to process work in a more parallel manner
> >>> than it does now.
> >>>
> >>> Jobs arrive at the rate of 200 per second and are sub-divided into groups
> >>> of objects by a master process (MP) on its own node. The MP then assigns
> >>> the object groups to 20 slave processes (SP), each running on their own
> >>> node, to do the expensive computational work in parallel. The SPs then
> >>> send their results to a gatherer process (GP) on its own node that merges
> >>> the results for the job and sends it onward for final processing. The
> >>> highest latency for the last 1024 jobs that were processed is then written
> >>> to a log file that is displayed by a GUI.
> >>>
> >>> Each process uses the same controller method for sending and receiving
> >>> messages, as follows:
> >>>
> >>>     for (each CPU that sends us input)
> >>>     {
> >>>         MPI_Irecv(...)
> >>>     }
> >>>
> >>>     while (true)
> >>>     {
> >>>         for (each CPU that sends us input)
> >>>         {
> >>>             MPI_Test(...)
> >>>             if (message was received)
> >>>             {
> >>>                 copy the message
> >>>                 queue the copy to our input queue
> >>>                 MPI_Irecv(...)
> >>>             }
> >>>         }
> >>>         if (there are messages on our input queue)
> >>>         {
> >>>             ... process the FIRST message on the queue
> >>>                 (this may queue messages for output) ...
> >>>
> >>>             for (each message on our output queue)
> >>>             {
> >>>                 MPI_Send(...)
> >>>             }
> >>>         }
> >>>     }
> >>>
> >>> My problem is that I do not meet our application's performance requirements
> >>> for a job (~20 ms) until I reduce the number of SPs from 20 to 4 or fewer.
> >>>
> >>> I added some debug into the GP and found that there are never more than 14
> >>> messages received in the for loop that calls MPI_Test. The messages that
> >>> were sent from the other 6 SPs will eventually arrive at the GP in a long
> >>> stream after experiencing high latency (over 600 ms).
> >>>
> >>> Going forward, we need to handle more objects per job and will need to
> >>> have more than 4 SPs to keep up.
> >>> My thought is that I have to obey this 4-SPs-to-1-GP ratio and create
> >>> intermediate GPs to gather results from every 4 slaves.
> >>>
> >>> Is this a contention problem at the GP?
> >>> Is there debugging or logging I can turn on in the MPI to prove that
> >>> contention is occurring?
> >>> Can I configure MPI receive processing to improve upon the 4-to-1 ratio?
> >>> Can I improve the controller method (listed above) to gain a performance
> >>> improvement?
> >>>
> >>> Thanks for any suggestions.
> >>> Gary Hodge
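For reference, a runnable C sketch of the controller loop from Gary's original message above: one non-blocking receive per sender, polled with MPI_Test and re-posted on completion. MSG_SIZE, TAG, MSGS_PER_SENDER, the sender stand-ins, and the fixed termination condition are assumptions added to make the sketch self-contained; the application's real input and output queues are reduced to a comment:

    /* controller.c - sketch of the poll/test/re-post controller loop. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    #define MSG_SIZE        1024
    #define TAG             7
    #define MSGS_PER_SENDER 100

    int main(int argc, char *argv[])
    {
        int rank, nprocs, i, remaining;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank != 0) {                      /* SP stand-in: just send results to the GP */
            char msg[MSG_SIZE] = {0};
            for (i = 0; i < MSGS_PER_SENDER; i++)
                MPI_Send(msg, MSG_SIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
            MPI_Finalize();
            return 0;
        }

        /* GP: one outstanding non-blocking receive per sender, re-posted every
         * time it completes - the "for each CPU that sends us input" loop. */
        int nsenders = nprocs - 1;
        char (*bufs)[MSG_SIZE] = malloc((size_t)nsenders * MSG_SIZE);
        MPI_Request *reqs = malloc((size_t)nsenders * sizeof(MPI_Request));

        for (i = 0; i < nsenders; i++)
            MPI_Irecv(bufs[i], MSG_SIZE, MPI_CHAR, i + 1, TAG, MPI_COMM_WORLD, &reqs[i]);

        remaining = nsenders * MSGS_PER_SENDER;
        while (remaining > 0) {               /* the original loops forever; this sketch
                                                 stops after a fixed message count */
            for (i = 0; i < nsenders; i++) {
                int flag;
                MPI_Status st;
                MPI_Test(&reqs[i], &flag, &st);
                if (flag) {
                    char copy[MSG_SIZE];
                    memcpy(copy, bufs[i], MSG_SIZE);       /* "copy the message"        */
                    MPI_Irecv(bufs[i], MSG_SIZE, MPI_CHAR, /* re-post before processing */
                              i + 1, TAG, MPI_COMM_WORLD, &reqs[i]);
                    /* ... "queue the copy to our input queue", process it, and
                     * MPI_Send any queued output onward here ... */
                    remaining--;
                }
            }
        }

        /* Clean up the receives that were re-posted after the final messages. */
        for (i = 0; i < nsenders; i++) {
            MPI_Cancel(&reqs[i]);
            MPI_Wait(&reqs[i], MPI_STATUS_IGNORE);
        }

        free(bufs);
        free(reqs);
        MPI_Finalize();
        return 0;
    }

Launched with, e.g., mpiexec -n 5 ./controller (one GP plus four SP stand-ins), this reproduces the poll-test-repost structure of the pseudocode without the application-specific queueing.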