Hi,

I used the following options for "configure" in openmpi-1.9a1r27380 and I get "MPI_THREAD_MULTIPLE":

  --enable-cxx-exceptions \
  --enable-mpi-java \
  --enable-heterogeneous \
  --enable-opal-multi-threads \
  --enable-mpi-thread-multiple \
  --with-threads=posix \
  --with-hwloc=internal \
  --without-verbs \
  --without-udapl \
  --with-wrapper-cflags=-m64 \
  --enable-debug

tyr small_prog 125 mpiexec thread_support
I have requested MPI_THREAD_MULTIPLE in "MPI_Init_thread()" and it provides MPI_THREAD_MULTIPLE.
"MPI_Query_thread()" returned MPI_THREAD_MULTIPLE:
  many threads are supported and any thread may call MPI functions at any time.
"MPI_Is_thread_main()" returned: "true".

Kind regards
Siegmar
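For reference, a minimal sketch of the thread-level check described above. This is not Siegmar's actual thread_support program, just an illustration of the three calls it appears to use (MPI_Init_thread, MPI_Query_thread, MPI_Is_thread_main); the file name is made up:

    /* thread_check.c - minimal sketch of querying the provided thread level. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int provided, claimed, is_main;

        /* Request the highest level; MPI may grant something lower. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        MPI_Query_thread(&claimed);     /* should match "provided"                         */
        MPI_Is_thread_main(&is_main);   /* true in the thread that called MPI_Init_thread  */

        printf("provided = %d, queried = %d, main thread = %s\n",
               provided, claimed, is_main ? "true" : "false");
        printf("MPI_THREAD_MULTIPLE = %d\n", MPI_THREAD_MULTIPLE);

        MPI_Finalize();
        return 0;
    }

Built with mpicc and run under mpiexec, the printed values can be compared directly against the MPI_THREAD_* constants to see what the library actually granted.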
> If you ask for thread multiple, I believe we return thread funneled
> or thread serialized. You can check, though - I might be remembering
> wrong, but I'm pretty sure that's true.
>
> Sent from my iPad
>
> On Oct 9, 2012, at 7:09 AM, Brian Budge <brian.bu...@gmail.com> wrote:
>
> > Hi Ralph -
> >
> > Is this really true? I've been using thread_multiple in my openmpi
> > programs for quite some time... There may be known cases where it
> > will not work, but for vanilla MPI use, it seems good to go. That's
> > not to say that you can't create your own deadlock if you're not
> > careful, but those are cases where you'd expect deadlock. What
> > specifically is unsupported about thread_multiple?
> >
> > Brian
> >
> > On Tue, Oct 9, 2012 at 6:30 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >> We don't support thread_multiple, I'm afraid. Only thread_funneled, so
> >> you'll have to architect things so that each process can perform all its
> >> MPI actions inside of a single thread.
> >>
> >> On Tue, Oct 9, 2012 at 6:10 AM, Hodge, Gary C <gary.c.ho...@lmco.com> wrote:
> >>>
> >>> FYI, I implemented the harvesting thread but found out quickly that my
> >>> installation of Open MPI does not have MPI_THREAD_MULTIPLE support.
> >>>
> >>> My worker thread still does MPI_Send calls to move the data to the next
> >>> process.
> >>>
> >>> So I am going to download 1.6.2 today, configure it with
> >>> --enable-thread-multiple, and try again.
> >>>
> >>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> >>> Sent: Thursday, October 04, 2012 8:10 PM
> >>> To: Open MPI Users
> >>> Subject: Re: [OMPI users] EXTERNAL: Re: unacceptable latency in gathering process
> >>>
> >>> Sorry for the delayed response - been on the road all day.
> >>>
> >>> Usually we use the standard NetPIPE, IMB, and other benchmarks to measure
> >>> latency. IIRC, these are all point-to-point measurements - i.e., they
> >>> measure the latency for a single process sending to one other process
> >>> (typically on the order of a couple of microseconds). The tests may have
> >>> multiple processes running, but they don't have one process receiving
> >>> messages from multiple senders.
> >>>
> >>> You will, of course, see increased delays in that scenario just due to
> >>> cycle time - we give you a message, but cannot give you another one until
> >>> you return from our delivery callback. So the longer you spend in the
> >>> callback, the slower we go.
> >>>
> >>> In one use case I recently helped with, we had a "harvesting" thread that
> >>> simply reaped the messages from the MPI callback and stuffed them into a
> >>> multi-threaded processing queue. This minimized the MPI "latency", but of
> >>> course the overall throughput depended on the speed of the follow-on queue.
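A minimal sketch of the harvesting pattern just described, assuming a simple mutex-protected queue: one thread owns every MPI call (so MPI_THREAD_FUNNELED is enough) and does nothing but enqueue what it receives, while worker threads do the expensive processing. MSG_SIZE, MSGS_PER_SENDER, NUM_WORKERS, and the queue itself are illustrative assumptions, not the code from that use case:

    /* harvest.c - sketch of the harvesting pattern: rank 0 runs one thread that
     * does all MPI receives and enqueues the messages; worker threads drain the
     * queue and do the processing. Sizes and counts are illustrative. */
    #include <mpi.h>
    #include <pthread.h>
    #include <stdlib.h>

    #define MSG_SIZE        1024
    #define MSGS_PER_SENDER 100
    #define NUM_WORKERS     4

    typedef struct item { char data[MSG_SIZE]; struct item *next; } item_t;

    static item_t *queue_head = NULL;
    static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
    static int done = 0;

    static void *worker(void *arg)       /* consumes messages off the MPI thread */
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (queue_head == NULL && !done)
                pthread_cond_wait(&ready, &lock);
            if (queue_head == NULL) { pthread_mutex_unlock(&lock); return NULL; }
            item_t *it = queue_head;
            queue_head = it->next;
            pthread_mutex_unlock(&lock);
            /* ... expensive processing of it->data happens here ... */
            free(it);
        }
    }

    int main(int argc, char *argv[])
    {
        int provided, rank, nprocs, i;

        /* Only the main thread calls MPI, so MPI_THREAD_FUNNELED is sufficient. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank != 0) {                 /* senders: just pump messages at rank 0 */
            char buf[MSG_SIZE] = {0};
            for (i = 0; i < MSGS_PER_SENDER; i++)
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            MPI_Finalize();
            return 0;
        }

        pthread_t workers[NUM_WORKERS];
        for (i = 0; i < NUM_WORKERS; i++)
            pthread_create(&workers[i], NULL, worker, NULL);

        /* Harvesting loop: receive, enqueue, go straight back to MPI. */
        for (i = 0; i < MSGS_PER_SENDER * (nprocs - 1); i++) {
            item_t *it = malloc(sizeof(*it));
            MPI_Recv(it->data, MSG_SIZE, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            pthread_mutex_lock(&lock);
            it->next = queue_head;       /* LIFO push for brevity; a real queue would be FIFO */
            queue_head = it;
            pthread_cond_signal(&ready);
            pthread_mutex_unlock(&lock);
        }

        pthread_mutex_lock(&lock);       /* tell the workers no more messages are coming */
        done = 1;
        pthread_cond_broadcast(&ready);
        pthread_mutex_unlock(&lock);
        for (i = 0; i < NUM_WORKERS; i++)
            pthread_join(workers[i], NULL);

        MPI_Finalize();
        return 0;
    }

Compile with mpicc (adding -pthread if needed) and run with several ranks; ranks other than 0 only act as senders, and the number of worker threads can be scaled to the cores available on the gathering node.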
> >>> In our case, we only had one process running on each node (like you), and
> >>> had lots of cores on the node - so we cranked up the threads in the
> >>> processing queue and rammed the data through the pipe.
> >>>
> >>> Your design looks similar, so you might benefit from a similar approach.
> >>> Just don't try to have multiple MPI callbacks each sitting in a separate
> >>> thread, as thread support in MPI isn't good - better to have a single
> >>> thread handling the MPI stuff, and then push it into a queue that multiple
> >>> threads can access.
> >>>
> >>> Anyway, glad that helped diagnose the issue.
> >>> Ralph
> >>>
> >>> On Thu, Oct 4, 2012 at 6:55 AM, Hodge, Gary C <gary.c.ho...@lmco.com> wrote:
> >>>
> >>> Once I read your comment, Ralph, about this being "orders of magnitude
> >>> worse than anything we measure", I knew it had to be our problem.
> >>>
> >>> We already had some debug code in place to measure when we send and when
> >>> we receive over MPI. I turned this code on and ran with 12 slaves instead
> >>> of 4.
> >>>
> >>> Our debug showed that once an SP does a send, it is received at the GP in
> >>> less than 1 ms. I then decided to take a close look at when each SP was
> >>> sending a message.
> >>>
> >>> It turns out that the first 9 slaves send out messages at very regular
> >>> intervals, but the last 3 slaves have 200 - 600 ms delays in sending out a
> >>> message.
> >>>
> >>> It could be that our SPs have a problem when many are running at once. It
> >>> is also interesting to note that the first 9 slaves run on the same blade
> >>> chassis as the GP and the last 3 SPs run on our second blade chassis. I
> >>> will later experiment with the placement of our SPs across chassis to see
> >>> whether this is an important factor.
> >>>
> >>> When I first reported this problem, I had only turned on debug in the
> >>> receiving GP process. The latency I was seeing then was the difference
> >>> between when I received a message from the 10th slave and when I received
> >>> the last message from the 10th slave. The time we use for our debug comes
> >>> from an MPI_Wtime call.
> >>>
> >>> Ralph, for my future reference, could you share how many processes were
> >>> sending to a single process in your testing, and what was the size of the
> >>> messages sent?
> >>>
> >>> Hristo, thanks for your input; I had already spent a few days searching
> >>> the FAQs and tuning guides before posting.
> >>>
> >>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> >>> Sent: Wednesday, October 03, 2012 4:01 PM
> >>> To: Open MPI Users
> >>> Subject: Re: [OMPI users] EXTERNAL: Re: unacceptable latency in gathering process
> >>>
> >>> Hmmm... you probably can't without digging down into the diagnostics.
> >>>
> >>> Perhaps we could help more if we had some idea how you are measuring this
> >>> "latency". I ask because that is orders of magnitude worse than anything we
> >>> measure - so I suspect the problem is in your app (i.e., that the time you
> >>> are measuring is actually how long it takes you to get around to processing
> >>> a message that was received some time ago).
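A minimal sketch of the kind of timestamping described above, assuming MPI_Wtime is taken immediately before the send and immediately after the matching receive completes. The tag and payload layout are made-up assumptions; also note that timestamps from different nodes are only directly comparable when the clocks are synchronized (the predefined MPI_WTIME_IS_GLOBAL attribute indicates whether MPI_Wtime is global):

    /* wtime_stamp.c - sketch of timestamping a send and its receipt with MPI_Wtime.
     * In the real application the timestamps would go to a debug log instead. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank;
        double payload[2];              /* [0] = send timestamp, [1] = dummy data */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1) {
            payload[0] = MPI_Wtime();   /* stamp immediately before the send */
            payload[1] = 42.0;
            MPI_Send(payload, 2, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD);
        } else if (rank == 0) {
            MPI_Recv(payload, 2, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            double recv_time = MPI_Wtime();  /* stamp as soon as the receive completes */
            /* Only meaningful if the two clocks are comparable, e.g. the same node
             * or a cluster with synchronized clocks (check MPI_WTIME_IS_GLOBAL). */
            printf("send-to-receive latency: %.6f s\n", recv_time - payload[0]);
        }

        MPI_Finalize();
        return 0;
    }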
> >>> On Oct 3, 2012, at 11:52 AM, "Hodge, Gary C" <gary.c.ho...@lmco.com> wrote:
> >>>
> >>> How do I tell the difference between when the message was received and
> >>> when the message was picked up in MPI_Test?
> >>>
> >>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> >>> Sent: Wednesday, October 03, 2012 1:00 PM
> >>> To: Open MPI Users
> >>> Subject: EXTERNAL: Re: [OMPI users] unacceptable latency in gathering process
> >>>
> >>> Out of curiosity, have you logged the time when the SP called "send" and
> >>> compared it to the time when the message was received, and when that
> >>> message is picked up in MPI_Test? In other words, have you actually
> >>> verified that the delay is in the MPI library as opposed to in your
> >>> application?
> >>>
> >>> On Oct 3, 2012, at 9:40 AM, "Hodge, Gary C" <gary.c.ho...@lmco.com> wrote:
> >>>
> >>> Hi all,
> >>> I am running on an IBM BladeCenter, using Open MPI 1.4.1 and the opensm
> >>> subnet manager for InfiniBand.
> >>>
> >>> Our application has real-time requirements, and it has recently been
> >>> proven that it does not scale to meet future requirements. Presently, I am
> >>> reorganizing the application to process work in a more parallel manner
> >>> than it does now.
> >>>
> >>> Jobs arrive at the rate of 200 per second and are sub-divided into groups
> >>> of objects by a master process (MP) on its own node. The MP then assigns
> >>> the object groups to 20 slave processes (SP), each running on their own
> >>> node, to do the expensive computational work in parallel. The SPs then
> >>> send their results to a gatherer process (GP) on its own node that merges
> >>> the results for the job and sends it onward for final processing. The
> >>> highest latency for the last 1024 jobs that were processed is then written
> >>> to a log file that is displayed by a GUI.
> >>>
> >>> Each process uses the same controller method for sending and receiving
> >>> messages, as follows:
> >>>
> >>>     for (each CPU that sends us input)
> >>>     {
> >>>         MPI_Irecv(...)
> >>>     }
> >>>
> >>>     while (true)
> >>>     {
> >>>         for (each CPU that sends us input)
> >>>         {
> >>>             MPI_Test(...)
> >>>             if (message was received)
> >>>             {
> >>>                 copy the message
> >>>                 queue the copy to our input queue
> >>>                 MPI_Irecv(...)
> >>>             }
> >>>         }
> >>>         if (there are messages on our input queue)
> >>>         {
> >>>             ... process the FIRST message on the queue
> >>>                 (this may queue messages for output) ...
> >>>
> >>>             for (each message on our output queue)
> >>>             {
> >>>                 MPI_Send(...)
> >>>             }
> >>>         }
> >>>     }
> >>>
> >>> My problem is that I do not meet our application's performance requirements
> >>> for a job (~20 ms) until I reduce the number of SPs from 20 to 4 or fewer.
> >>>
> >>> I added some debug into the GP and found that there are never more than 14
> >>> messages received in the for loop that calls MPI_Test. The messages that
> >>> were sent from the other 6 SPs will eventually arrive at the GP in a long
> >>> stream after experiencing high latency (over 600 ms).
> >>>
> >>> Going forward, we need to handle more objects per job and will need to
> >>> have more than 4 SPs to keep up.
> >>> My thought is that I have to obey this 4-SPs-to-1-GP ratio and create
> >>> intermediate GPs to gather results from every 4 slaves.
> >>>
> >>> Is this a contention problem at the GP?
> >>> Is there debugging or logging I can turn on in the MPI to prove that
> >>> contention is occurring?
> >>> Can I configure MPI receive processing to improve upon the 4-to-1 ratio?
> >>> Can I improve the controller method (listed above) to gain a performance
> >>> improvement?
> >>>
> >>> Thanks for any suggestions.
> >>> Gary Hodge
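For reference, a runnable C sketch of the controller loop from Gary's original message above: one non-blocking receive per sender, polled with MPI_Test and re-posted on completion. MSG_SIZE, TAG, MSGS_PER_SENDER, the sender stand-ins, and the fixed termination condition are assumptions added to make the sketch self-contained; the application's real input and output queues are reduced to a comment:

    /* controller.c - sketch of the poll/test/re-post controller loop. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    #define MSG_SIZE        1024
    #define TAG             7
    #define MSGS_PER_SENDER 100

    int main(int argc, char *argv[])
    {
        int rank, nprocs, i, remaining;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank != 0) {                      /* SP stand-in: just send results to the GP */
            char msg[MSG_SIZE] = {0};
            for (i = 0; i < MSGS_PER_SENDER; i++)
                MPI_Send(msg, MSG_SIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
            MPI_Finalize();
            return 0;
        }

        /* GP: one outstanding non-blocking receive per sender, re-posted every
         * time it completes - the "for each CPU that sends us input" loop. */
        int nsenders = nprocs - 1;
        char (*bufs)[MSG_SIZE] = malloc((size_t)nsenders * MSG_SIZE);
        MPI_Request *reqs = malloc((size_t)nsenders * sizeof(MPI_Request));

        for (i = 0; i < nsenders; i++)
            MPI_Irecv(bufs[i], MSG_SIZE, MPI_CHAR, i + 1, TAG, MPI_COMM_WORLD, &reqs[i]);

        remaining = nsenders * MSGS_PER_SENDER;
        while (remaining > 0) {               /* the original loops forever; this sketch
                                                 stops after a fixed message count */
            for (i = 0; i < nsenders; i++) {
                int flag;
                MPI_Status st;
                MPI_Test(&reqs[i], &flag, &st);
                if (flag) {
                    char copy[MSG_SIZE];
                    memcpy(copy, bufs[i], MSG_SIZE);       /* "copy the message"        */
                    MPI_Irecv(bufs[i], MSG_SIZE, MPI_CHAR, /* re-post before processing */
                              i + 1, TAG, MPI_COMM_WORLD, &reqs[i]);
                    /* ... "queue the copy to our input queue", process it, and
                     * MPI_Send any queued output onward here ... */
                    remaining--;
                }
            }
        }

        /* Clean up the receives that were re-posted after the final messages. */
        for (i = 0; i < nsenders; i++) {
            MPI_Cancel(&reqs[i]);
            MPI_Wait(&reqs[i], MPI_STATUS_IGNORE);
        }

        free(bufs);
        free(reqs);
        MPI_Finalize();
        return 0;
    }

Launched with, e.g., mpiexec -n 5 ./controller (one GP plus four SP stand-ins), this reproduces the poll-test-repost structure of the pseudocode without the application-specific queueing.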