Yes, Nathan has a few coll/ml fixes queued up for 1.8.

On Mar 24, 2014, at 10:11 PM, tmish...@jcity.maeda.co.jp wrote:
> I ran our application using the final version of openmpi-1.7.5 again
> with coll_ml_priority = 90.
>
> This time coll/ml was actually activated, and I got the error messages
> shown below:
>
> [manage][[11217,1],0][coll_ml_lmngr.c:265:mca_coll_ml_lmngr_alloc] COLL-ML
> List manager is empty.
> [manage][[11217,1],0][coll_ml_allocation.c:47:mca_coll_ml_allocate_block]
> COLL-ML lmngr failed.
> [manage][[11217,1],0][coll_ml_module.c:532:ml_module_memory_initialization]
> COLL-ML mca_coll_ml_allocate_block exited with error.
>
> Unfortunately, coll/ml still seems to have some problems ...
>
> It also means coll/ml was not activated on my test run with
> coll_ml_priority = 27. So the slowdown was due to the expensive
> connectivity computation, as you pointed out, I guess.
>
> Tetsuya
>
>> On Mar 20, 2014, at 5:56 PM, tmish...@jcity.maeda.co.jp wrote:
>>
>>> Hi Ralph, congratulations on releasing the new openmpi-1.7.5.
>>>
>>> By the way, openmpi-1.7.5rc3 has been slowing down our application
>>> on smaller test data, where the time-consuming part of our
>>> application is a so-called sparse solver. The effect is negligible
>>> with medium- or large-sized data - the more practical case - so I
>>> had been deferring this problem.
>>>
>>> However, this slowdown disappears in the final version of
>>> openmpi-1.7.5. After some investigation, I found that coll_ml caused
>>> the slowdown. The final version seems to set coll_ml_priority to
>>> zero again.
>>>
>>> Could you briefly explain the advantage of coll_ml? In what kind of
>>> situation is it effective, and so on ...
>>
>> I'm not really the one to speak about coll/ml, as I wasn't involved in
>> it - Nathan would be the one to ask. It is supposed to be significantly
>> faster for most collectives, but I imagine it would depend on the
>> precise collective being used and the size of the data. We did find and
>> fix a number of problems right at the end (which is why we dropped the
>> priority until we can better test/debug it), so we might have hit
>> something that was causing your slowdown.
>>
>>> In addition, I'm not sure why coll_ml was activated in
>>> openmpi-1.7.5rc3, although its priority is lower than tuned's, as
>>> described in the message of changeset 30790:
>>>
>>>     We are initially setting the priority lower than
>>>     tuned until this has had some time to soak in the trunk.
>>
>> Were you actually seeing coll/ml being used? It shouldn't have been.
>> However, coll/ml was getting called during the collective
>> initialization phase so it could set itself up, even if it wasn't being
>> used. One part of its setup is a somewhat expensive connectivity
>> computation - one of our last-minute cleanups was the removal of a
>> static 1MB array in that procedure. Changing the priority to 0
>> completely disables the coll/ml component, removing it even from the
>> initialization phase. My guess is that you were seeing a measurable
>> "hit" from that procedure on your small-data tests, which probably ran
>> fairly quickly - and not seeing it on the other tests because the setup
>> time was swamped by the computation time.
>>
>>>
>>> Tetsuya
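For anyone wanting to experiment with this on their own runs: coll_ml_priority is an ordinary MCA parameter, so it can be inspected and overridden per run without rebuilding Open MPI. A minimal sketch, assuming a stock Open MPI 1.7.x installation (./my_app and the process count -np 16 are placeholders, and default priorities may vary between builds):

    # Show coll/ml's parameters, including its compiled-in default
    # priority (--level 9 exposes parameters hidden at the default
    # verbosity level):
    ompi_info --param coll ml --level 9 | grep priority

    # Disable coll/ml entirely; priority 0 keeps it out of even the
    # collective initialization phase, so the expensive connectivity
    # computation described above is never run:
    mpirun --mca coll_ml_priority 0 -np 16 ./my_app

    # Raise it above coll/tuned (default priority 30 in this series,
    # which is why 27 was never selected) so that coll/ml is actually
    # used, e.g. to reproduce the errors reported above:
    mpirun --mca coll_ml_priority 90 -np 16 ./my_app

To make a setting persistent, the same parameter can go into $HOME/.openmpi/mca-params.conf as a line of the form "coll_ml_priority = 0".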
-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/