One of the coll/ml authors mentioned to me off-list that he has an idea of 
what might have been causing the slowdown.  They're actively working on 
tweaking and improving it.

I told them to ping you -- the whole point is that coll/ml is supposed to be 
*better* than our existing collectives, so if it's not, we should fix that 
before we make it the default.  :-)
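
In the meantime, you can A/B test it yourself from the command line by 
toggling the component's priority.  Here's a quick sketch -- double-check 
the parameter names against your build with ompi_info, and note that 
"./your_app", the -np count, and the priority of 90 (picked to clear 
tuned's default) are just placeholders:

    # show the coll/ml parameters your build actually exposes
    ompi_info --param coll ml

    # run with coll/ml excluded entirely (skips even its setup phase)
    mpirun --mca coll ^ml -np 16 ./your_app

    # run with coll/ml's priority raised above tuned's so it gets selected
    mpirun --mca coll_ml_priority 90 -np 16 ./your_app

If a slowdown only shows up in the last case, that points at the ml 
algorithms themselves rather than the setup cost.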


On Mar 21, 2014, at 9:04 AM, Ralph Castain <r...@open-mpi.org> wrote:

> 
> On Mar 20, 2014, at 5:56 PM, tmish...@jcity.maeda.co.jp wrote:
> 
>> 
>> Hi Ralph, congratulations on releasing the new openmpi-1.7.5.
>> 
>> By the way, openmpi-1.7.5rc3 has been slowing down our application
>> with smaller-sized test data, where the time-consuming part of our
>> application is a so-called sparse solver. The slowdown is negligible
>> with medium- or large-sized data - the more practical cases - so I
>> had been deferring this problem.
>> 
>> However, this slowdown disappears in the final version of
>> openmpi-1.7.5. After some investigation, I found that coll_ml caused
>> the slowdown. The final version seems to set coll_ml_priority back
>> to zero.
>> 
>> Could you briefly explain the advantages of coll_ml? In what kinds
>> of situations is it effective, and so on?
> 
> I'm not really the one to speak about coll/ml, as I wasn't involved in it - 
> Nathan would be the one to ask. It is supposed to be significantly faster for 
> most collectives, but I imagine it would depend on the precise collective 
> being used and the size of the data. We did find and fix a number of problems 
> right at the end (which is why we dropped the priority until we can better 
> test/debug it), and so one of those might have been causing your slowdown.
> 
> 
>> 
>> In addition, I'm not sure why coll_ml is activated in openmpi-1.7.5rc3,
>> even though its priority is lower than tuned's, as described in the
>> message of changeset 30790: "We are initially setting the priority
>> lower than tuned until this has had some time to soak in the trunk."
> 
> Were you actually seeing coll/ml being used? It shouldn't have been. However, 
> coll/ml was getting called during the collective initialization phase so it 
> could set itself up, even if it wasn't being used. One part of its setup is a 
> somewhat expensive connectivity computation - one of our last-minute cleanups 
> was the removal of a static 1MB array in that procedure. Changing the priority to 
> 0 completely disables the coll/ml component, thus removing it from even the 
> initialization phase. My guess is that you were seeing a measurable "hit" by 
> that procedure on your small data tests, which probably ran fairly quickly - 
> and not seeing it on the other tests because the setup time was swamped by 
> the computation time.
> 
> 
>> 
>> Tetsuya
>> 
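
Tetsuya: if you want to quantify how much of the small-case hit was the 
one-time setup Ralph describes (vs. the per-collective cost), a minimal 
timing harness along these lines would separate the two.  Just a sketch, 
not something from our test suite -- the file name and loop count are 
arbitrary:

    /* init_vs_coll.c: separate the one-time MPI setup cost from the
     * per-collective cost.
     * Compile:  mpicc -std=c99 -O2 init_vs_coll.c -o init_vs_coll
     * Run it twice -- e.g. with "--mca coll ^ml" and then with
     * "--mca coll_ml_priority 90" -- and compare the two numbers. */
    #include <mpi.h>
    #include <stdio.h>
    #include <sys/time.h>

    /* Plain wall clock, usable before MPI_Init. */
    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    int main(int argc, char **argv)
    {
        double t_init = now();
        MPI_Init(&argc, &argv);    /* component setup happens in here */
        t_init = now() - t_init;

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Per-collective cost: average over a short allreduce loop. */
        double in = rank, out;
        MPI_Barrier(MPI_COMM_WORLD);
        double t_coll = MPI_Wtime();
        for (int i = 0; i < 1000; i++)
            MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
        t_coll = (MPI_Wtime() - t_coll) / 1000.0;

        if (rank == 0)
            printf("MPI_Init: %.6f s   per-allreduce: %.9f s\n",
                   t_init, t_coll);

        MPI_Finalize();
        return 0;
    }

If the rc3 vs. final difference shows up in the MPI_Init number rather 
than the per-allreduce one, that's consistent with the setup-cost theory.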


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
