Yes, Nathan has a few coll/ml fixes queued up for 1.8.
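
For anyone who wants to experiment while those fixes land: the component's
behavior can be controlled per run through its MCA priority parameter. A
minimal sketch (the executable name and process count are placeholders; 90
and 0 are just the values discussed in this thread):

  # raise coll/ml's priority so it is selected ahead of the default tuned component
  mpirun --mca coll_ml_priority 90 -np 4 ./your_app

  # set the priority to 0, which disables coll/ml entirely,
  # including its setup/connectivity phase
  mpirun --mca coll_ml_priority 0 -np 4 ./your_app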

On Mar 24, 2014, at 10:11 PM, tmish...@jcity.maeda.co.jp wrote:

> 
> 
> I ran our application using the final version of openmpi-1.7.5 again
> with coll_ml_priority = 90.
> 
> Then, coll/ml was actually activated, and I got the error messages
> shown below:
> [manage][[11217,1],0][coll_ml_lmngr.c:265:mca_coll_ml_lmngr_alloc] COLL-ML List manager is empty.
> [manage][[11217,1],0][coll_ml_allocation.c:47:mca_coll_ml_allocate_block] COLL-ML lmngr failed.
> [manage][[11217,1],0][coll_ml_module.c:532:ml_module_memory_initialization] COLL-ML mca_coll_ml_allocate_block exited with error.
> 
> Unfortunately coll/ml seems to still have some problems ...
> 
> It also means coll/ml was not activated in my test run with
> coll_ml_priority = 27. So I guess the slowdown was due to the expensive
> connectivity computation you pointed out.
> 
> Tetsuya
> 
>> On Mar 20, 2014, at 5:56 PM, tmish...@jcity.maeda.co.jp wrote:
>> 
>>> 
>>> Hi Ralph, congratulations on the release of openmpi-1.7.5.
>>> 
>>> By the way, openmpi-1.7.5rc3 has been slowing down our application
>>> with smaller test data, where the time-consuming part of our
>>> application is a so-called sparse solver. The slowdown is negligible
>>> with medium or large data - the more practical cases - so I had been
>>> deferring this problem.
>>> 
>>> However, this slowdown disappears in the final version of
>>> openmpi-1.7.5. After some investigation, I found that coll/ml caused
>>> the slowdown. The final version seems to set coll_ml_priority to zero
>>> again.
>>> 
>>> Could you briefly explain the advantages of coll/ml? In what kinds of
>>> situations is it effective, and so on ...
>> 
>> I'm not really the one to speak about coll/ml, as I wasn't involved in
>> it - Nathan would be the one to ask. It is supposed to be significantly
>> faster for most collectives, but I imagine it would depend on the
>> precise collective being used and the size of the data. We did find and
>> fix a number of problems right at the end (which is why we dropped the
>> priority until we can better test/debug it), and so we might have hit
>> something that was causing your slowdown.
>> 
>> 
>>> 
>>> In addition, I'm not sure why coll/ml is activated in openmpi-1.7.5rc3,
>>> although its priority is lower than tuned's, as described in the
>>> message of changeset 30790: "We are initially setting the priority
>>> lower than tuned until this has had some time to soak in the trunk."
>> 
>> Were you actually seeing coll/ml being used? It shouldn't have been.
>> However, coll/ml was getting called during the collective initialization
>> phase so it could set itself up, even if it wasn't being used. One part
>> of its setup is a somewhat expensive connectivity computation - one of
>> our last-minute cleanups was removal of a static 1MB array in that
>> procedure. Changing the priority to 0 completely disables the coll/ml
>> component, thus removing it from even the initialization phase. My guess
>> is that you were seeing a measurable "hit" by that procedure on your
>> small data tests, which probably ran fairly quickly - and not seeing it
>> on the other tests because the setup time was swamped by the computation
>> time.
>> 
>> 
>>> 
>>> Tetsuya
>>> 
>> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
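
If anyone wants to double-check what priority coll/ml ends up with on a
given build, something like the following should work on 1.7.x (treat the
exact flags as a sketch):

  ompi_info --param coll ml --level 9 | grep priority

Per the discussion above, a reported coll_ml_priority of 0 means the
component is disabled outright and skips even its initialization phase.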


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
