When I said that I quickly found out that my installation does not have MPI_THREAD_MULTIPLE support, it was because I was getting segfaults (SIGSEGV) inside MPI calls when making MPI calls from two threads at once. I later found that MPI_Init_thread was reporting the provided support level as MPI_THREAD_SINGLE (0).
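For anyone hitting the same symptom, a minimal sketch of checking the thread level the library actually grants before calling MPI from more than one thread; the variable names and the warning text are illustrative, not taken from the original post:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;

        /* Ask for full multithreaded support and see what the library grants. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (provided < MPI_THREAD_MULTIPLE && rank == 0) {
            /* provided is one of MPI_THREAD_SINGLE, _FUNNELED, _SERIALIZED or
               _MULTIPLE; anything below MULTIPLE means concurrent MPI calls
               from several threads are unsafe and may well crash.            */
            fprintf(stderr, "warning: asked for MPI_THREAD_MULTIPLE, got %d\n",
                    provided);
        }

        /* ... application code ... */

        MPI_Finalize();
        return 0;
    }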
My app uses an InfiniBand connection between nodes and I am running Open MPI 1.4.1. Later versions of the MPI_Init_thread man page say that THREAD_MULTIPLE support is lightly tested and may work for some BTLs, openib not being one of them.

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, October 09, 2012 10:40 AM
To: Open MPI Users
Subject: Re: [OMPI users] EXTERNAL: Re: unacceptable latency in gathering process

If you ask for thread multiple, I believe we return thread funneled or thread serialized. You can check, though - I might be remembering wrong, but I'm pretty sure that's true.

Sent from my iPad

On Oct 9, 2012, at 7:09 AM, Brian Budge <brian.bu...@gmail.com> wrote:

> Hi Ralph -
>
> Is this really true? I've been using thread_multiple in my Open MPI programs for quite some time... There may be known cases where it will not work, but for vanilla MPI use it seems good to go. That's not to say that you can't create your own deadlock if you're not careful, but those are cases where you'd expect deadlock. What specifically is unsupported about thread_multiple?
>
> Brian
>
> On Tue, Oct 9, 2012 at 6:30 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> We don't support thread_multiple, I'm afraid. Only thread_funneled, so you'll have to architect things so that each process can perform all its MPI actions inside a single thread.
>>
>> On Tue, Oct 9, 2012 at 6:10 AM, Hodge, Gary C <gary.c.ho...@lmco.com> wrote:
>>> FYI, I implemented the harvesting thread but found out quickly that my installation of Open MPI does not have MPI_THREAD_MULTIPLE support.
>>>
>>> My worker thread still does MPI_Send calls to move the data to the next process.
>>>
>>> So I am going to download 1.6.2 today, configure it with --enable-thread-multiple, and try again.
>>>
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
>>> Sent: Thursday, October 04, 2012 8:10 PM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] EXTERNAL: Re: unacceptable latency in gathering process
>>>
>>> Sorry for the delayed response - been on the road all day.
>>>
>>> Usually we use the standard NetPIPE, IMB, and other benchmarks to measure latency. IIRC, these are all point-to-point measurements - i.e., they measure the latency for a single process sending to one other process (typically on the order of a couple of microseconds). The tests may have multiple processes running, but they don't have one process receiving messages from multiple senders.
>>>
>>> You will, of course, see increased delays in that scenario just due to cycle time - we give you a message, but cannot give you another one until you return from our delivery callback. So the longer you spend in the callback, the slower we go.
>>>
>>> In one use case I recently helped with, we had a "harvesting" thread that simply reaped the messages from the MPI callback and stuffed them into a multi-threaded processing queue. This minimized the MPI "latency", but of course the overall throughput depended on the speed of the follow-on queue. In our case, we only had one process running on each node (like you) and lots of cores per node - so we cranked up the threads in the processing queue and rammed the data through the pipe.
>>>
>>> Your design looks similar, so you might benefit from a similar approach.
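The pattern described here - one thread owning all MPI calls and feeding a thread-safe queue that a pool of workers drains - only needs MPI_THREAD_FUNNELED. A minimal sketch with pthreads follows; the queue capacity, message size, worker count, stop condition, and the blocking MPI_Recv used for harvesting are all illustrative assumptions, not taken from the application discussed in this thread.

    #include <mpi.h>
    #include <pthread.h>
    #include <string.h>

    #define MSG_SIZE    1024      /* bytes per message (illustrative)       */
    #define QUEUE_CAP   256
    #define NUM_WORKERS 4
    #define TOTAL_MSGS  10000     /* stop condition (illustrative)          */

    /* Fixed-size ring buffer protected by a mutex/condvar pair.            */
    static char            queue[QUEUE_CAP][MSG_SIZE];
    static int             q_head, q_tail, q_count, done;
    static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;

    static void enqueue(const char *msg)
    {
        pthread_mutex_lock(&q_lock);
        while (q_count == QUEUE_CAP)            /* crude back-pressure      */
            pthread_cond_wait(&q_cond, &q_lock);
        memcpy(queue[q_tail], msg, MSG_SIZE);
        q_tail = (q_tail + 1) % QUEUE_CAP;
        q_count++;
        pthread_cond_broadcast(&q_cond);
        pthread_mutex_unlock(&q_lock);
    }

    static void *worker(void *arg)
    {
        char msg[MSG_SIZE];
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&q_lock);
            while (q_count == 0 && !done)
                pthread_cond_wait(&q_cond, &q_lock);
            if (q_count == 0 && done) {         /* drained, shutting down   */
                pthread_mutex_unlock(&q_lock);
                return NULL;
            }
            memcpy(msg, queue[q_head], MSG_SIZE);
            q_head = (q_head + 1) % QUEUE_CAP;
            q_count--;
            pthread_cond_broadcast(&q_cond);
            pthread_mutex_unlock(&q_lock);
            /* expensive, MPI-free processing of msg happens here           */
        }
    }

    int main(int argc, char **argv)
    {
        int provided;
        char buf[MSG_SIZE];
        MPI_Status st;
        pthread_t workers[NUM_WORKERS];

        /* Only the main thread ever calls MPI, so FUNNELED is sufficient.  */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        for (int i = 0; i < NUM_WORKERS; i++)
            pthread_create(&workers[i], NULL, worker, NULL);

        /* The "harvesting" loop: reap messages, hand them to the workers.  */
        for (int n = 0; n < TOTAL_MSGS; n++) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, &st);
            enqueue(buf);
        }

        pthread_mutex_lock(&q_lock);
        done = 1;
        pthread_cond_broadcast(&q_cond);
        pthread_mutex_unlock(&q_lock);

        for (int i = 0; i < NUM_WORKERS; i++)
            pthread_join(workers[i], NULL);

        MPI_Finalize();
        return 0;
    }

Because all MPI traffic stays on the one thread that called MPI_Init_thread, this structure works even when the library grants nothing beyond MPI_THREAD_FUNNELED.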
>>> Just don't try to have multiple MPI callbacks each sitting in a separate thread, as thread support in MPI isn't good - better to have a single thread handle the MPI traffic and push it onto a queue that multiple threads can access.
>>>
>>> Anyway, glad that helped diagnose the issue.
>>> Ralph
>>>
>>> On Thu, Oct 4, 2012 at 6:55 AM, Hodge, Gary C <gary.c.ho...@lmco.com> wrote:
>>>
>>> Once I read your comment, Ralph, about this being "orders of magnitude worse than anything we measure", I knew it had to be our problem.
>>>
>>> We already had some debug code in place to measure when we send and when we receive over MPI. I turned this code on and ran with 12 slaves instead of 4.
>>>
>>> Our debug showed that once an SP does a send, it is received at the GP in less than 1 ms. I then decided to take a closer look at when each SP was sending a message.
>>>
>>> It turns out that the first 9 slaves send out messages at very regular intervals, but the last 3 slaves have 200-600 ms delays in sending out a message.
>>>
>>> It could be that our SPs have a problem when many are running at once. It is also interesting to note that the first 9 slaves run on the same blade chassis as the GP and the last 3 SPs run on our second blade chassis. I will experiment later with the placement of our SPs across chassis to see whether this is an important factor.
>>>
>>> When I first reported this problem, I had only turned on debug in the receiving GP process. The latency I was seeing then was the difference between when I received a message from the 10th slave and when I received the last message from the 10th slave. The time we use for our debug comes from an MPI_Wtime call.
>>>
>>> Ralph, for my future reference, could you share how many processes were sending to a single process in your testing, and what the sizes of the messages were?
>>>
>>> Hristo, thanks for your input; I had already spent a few days searching the FAQs and tuning guides before posting.
>>>
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
>>> Sent: Wednesday, October 03, 2012 4:01 PM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] EXTERNAL: Re: unacceptable latency in gathering process
>>>
>>> Hmmm... you probably can't without digging down into the diagnostics.
>>>
>>> Perhaps we could help more if we had some idea how you are measuring this "latency". I ask because that is orders of magnitude worse than anything we measure - so I suspect the problem is in your app (i.e., that the time you are measuring is actually how long it takes you to get around to processing a message that was received some time ago).
>>>
>>> On Oct 3, 2012, at 11:52 AM, "Hodge, Gary C" <gary.c.ho...@lmco.com> wrote:
>>>
>>> How do I tell the difference between when the message was received and when the message was picked up in MPI_Test?
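One way to approach the send-versus-pickup timing being discussed: the sender stamps the first word of the payload with MPI_Wtime() just before MPI_Send, and the gatherer subtracts that stamp from its own MPI_Wtime() at the moment MPI_Test reports completion. This is only a sketch; the tag, payload size, and logging are illustrative, and comparing MPI_Wtime values taken on different nodes is only meaningful if the clocks are synchronized (the MPI_WTIME_IS_GLOBAL attribute says whether the library guarantees that).

    #include <mpi.h>
    #include <stdio.h>

    #define TAG   42     /* illustrative                                   */
    #define COUNT 128    /* payload size in doubles (illustrative)         */

    /* SP side: stamp word 0 of the payload immediately before sending.    */
    void timed_send(double *payload, int gp_rank)
    {
        payload[0] = MPI_Wtime();
        MPI_Send(payload, COUNT, MPI_DOUBLE, gp_rank, TAG, MPI_COMM_WORLD);
    }

    /* GP side: poll one pending MPI_Irecv; when it completes, compare the
       embedded send time with the moment MPI_Test noticed the message.    */
    void check_one_recv(MPI_Request *req, double *recv_buf)
    {
        int flag = 0;
        MPI_Status status;

        MPI_Test(req, &flag, &status);
        if (flag) {
            double picked_up = MPI_Wtime();
            double latency   = picked_up - recv_buf[0];  /* send -> pickup */
            fprintf(stderr, "message from rank %d picked up after %.3f ms\n",
                    status.MPI_SOURCE, latency * 1e3);
        }
    }

Note that this distinguishes "sent" from "picked up by MPI_Test"; it cannot show when the bytes actually arrived at the node, which would need lower-level diagnostics.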
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
>>> Sent: Wednesday, October 03, 2012 1:00 PM
>>> To: Open MPI Users
>>> Subject: EXTERNAL: Re: [OMPI users] unacceptable latency in gathering process
>>>
>>> Out of curiosity, have you logged the time when the SP called "send" and compared it to the time when the message was received, and when that message is picked up in MPI_Test? In other words, have you actually verified that the delay is in the MPI library as opposed to in your application?
>>>
>>> On Oct 3, 2012, at 9:40 AM, "Hodge, Gary C" <gary.c.ho...@lmco.com> wrote:
>>>
>>> Hi all,
>>>
>>> I am running on an IBM BladeCenter, using Open MPI 1.4.1 and the opensm subnet manager for InfiniBand.
>>>
>>> Our application has real-time requirements, and it has recently been proven that it does not scale to meet future requirements. Presently, I am re-organizing the application to process work in a more parallel manner than it does now.
>>>
>>> Jobs arrive at the rate of 200 per second and are sub-divided into groups of objects by a master process (MP) on its own node. The MP then assigns the object groups to 20 slave processes (SP), each running on its own node, to do the expensive computational work in parallel. The SPs then send their results to a gatherer process (GP) on its own node, which merges the results for the job and sends them onward for final processing. The highest latency for the last 1024 jobs processed is then written to a log file that is displayed by a GUI.
>>>
>>> Each process uses the same controller method for sending and receiving messages, as follows:
>>>
>>>     For (each CPU that sends us input)
>>>     {
>>>         MPI_Irecv(...)
>>>     }
>>>
>>>     While (true)
>>>     {
>>>         For (each CPU that sends us input)
>>>         {
>>>             MPI_Test(...)
>>>             If (message was received)
>>>             {
>>>                 Copy the message
>>>                 Queue the copy to our input queue
>>>                 MPI_Irecv(...)
>>>             }
>>>         }
>>>         If (there are messages on our input queue)
>>>         {
>>>             ... process the FIRST message on the queue
>>>                 (this may queue messages for output) ...
>>>
>>>             For (each message on our output queue)
>>>             {
>>>                 MPI_Send(...)
>>>             }
>>>         }
>>>     }
>>>
>>> My problem is that I do not meet our application's performance requirement for a job (~20 ms) until I reduce the number of SPs from 20 to 4 or fewer. I added some debug into the GP and found that there are never more than 14 messages received in the for loop that calls MPI_Test. The messages sent from the other 6 SPs eventually arrive at the GP in a long stream after experiencing high latency (over 600 ms).
>>>
>>> Going forward, we need to handle more objects per job and will need more than 4 SPs to keep up. My thought is that I have to obey this 4-SPs-to-1-GP ratio and create intermediate GPs to gather results from every 4 slaves.
>>>
>>> Is this a contention problem at the GP?
>>> Is there debugging or logging I can turn on in MPI to prove that contention is occurring?
>>> Can I configure MPI receive processing to improve upon the 4-to-1 ratio?
>>> Can I improve the controller method (listed above) to gain a performance improvement?
>>>
>>> Thanks for any suggestions.
>>> Gary Hodge
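For readers who want something concrete to start from, here is a compilable sketch of the polling controller Gary outlines in his original message. It is not the application's actual code: the message size, tag, rank layout (GP on rank 0, SPs on ranks 1..size-2, a downstream consumer on the last rank), queue depth, and stop condition are assumptions made purely for illustration, and only the gatherer's side is shown.

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    #define MSG_SIZE  4096      /* bytes per message (illustrative)         */
    #define TAG_WORK  1
    #define MAX_QUEUE 1024
    #define NUM_JOBS  100000    /* stop condition (illustrative)            */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank != 0) {        /* SP and consumer sides are not shown here */
            MPI_Finalize();
            return 0;
        }

        /* Assumed layout: SPs are ranks 1..size-2, and rank size-1
           consumes the merged results.                                     */
        int nsenders  = size - 2;
        int next_rank = size - 1;

        char (*rbuf)[MSG_SIZE] = malloc((size_t)nsenders * MSG_SIZE);
        MPI_Request *reqs      = malloc((size_t)nsenders * sizeof(MPI_Request));
        static char queue[MAX_QUEUE][MSG_SIZE];
        int q_len = 0, processed = 0;

        /* One outstanding MPI_Irecv per sender.                            */
        for (int i = 0; i < nsenders; i++)
            MPI_Irecv(rbuf[i], MSG_SIZE, MPI_BYTE, i + 1, TAG_WORK,
                      MPI_COMM_WORLD, &reqs[i]);

        while (processed < NUM_JOBS) {
            /* Poll every sender; on arrival, copy, queue, repost at once.  */
            for (int i = 0; i < nsenders; i++) {
                int flag = 0;
                MPI_Status st;
                MPI_Test(&reqs[i], &flag, &st);
                if (flag) {
                    if (q_len < MAX_QUEUE)   /* real code needs back-pressure */
                        memcpy(queue[q_len++], rbuf[i], MSG_SIZE);
                    MPI_Irecv(rbuf[i], MSG_SIZE, MPI_BYTE, i + 1, TAG_WORK,
                              MPI_COMM_WORLD, &reqs[i]);
                }
            }

            if (q_len > 0) {
                /* "Process the FIRST message on the queue" is a placeholder;
                   here the result is simply forwarded downstream.           */
                MPI_Send(queue[0], MSG_SIZE, MPI_BYTE, next_rank, TAG_WORK,
                         MPI_COMM_WORLD);
                memmove(queue[0], queue[1], (size_t)(q_len - 1) * MSG_SIZE);
                q_len--;
                processed++;
            }
        }

        free(rbuf);
        free(reqs);
        MPI_Finalize();
        return 0;
    }

One property of this structure worth noting: the gatherer does not poll again until it has finished processing and sending the current message, which is exactly the "time spent in the callback" effect Ralph describes earlier in the thread.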