There is a project called "MVAPICH2-GPU", which is developed by D. K. Panda's research group at Ohio State University. You will find lots of references on Google... and I have just briefly gone through the slides of "MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters":

http://nowlab.cse.ohio-state.edu/publications/conf-presentations/2011/hao-isc11-slides.pdf

It takes advantage of CUDA 4.0's Unified Virtual Addressing (UVA) to pipeline & optimize cudaMemcpyAsync() & RDMA transfers. (MVAPICH2 1.8a1p1 also supports Device-Device, Device-Host, and Host-Device transfers.)
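To make the pipelining idea concrete, here is a minimal hand-rolled sketch (not MVAPICH2-GPU's actual code): the device buffer is staged through pinned host memory in chunks, so the device-to-host copy of one chunk overlaps with the network send of the previous one. The chunk size, tag, and the send_gpu_buffer() helper are made-up names for illustration, and error checking is omitted:

/* Sketch only: double-buffered staging of a device buffer through pinned
 * host memory, overlapping the D2H copy of chunk i+1 with the MPI_Isend
 * of chunk i.  CHUNK, the tag and send_gpu_buffer() are illustrative.  */
#include <stddef.h>
#include <mpi.h>
#include <cuda_runtime.h>

#define CHUNK (1 << 20)                /* 1 MB pipeline stage (arbitrary) */

void send_gpu_buffer(const char *d_buf, size_t bytes, int dest, MPI_Comm comm)
{
    char *h_stage[2];                  /* two pinned host staging buffers */
    cudaStream_t stream;
    MPI_Request req[2] = { MPI_REQUEST_NULL, MPI_REQUEST_NULL };

    cudaStreamCreate(&stream);
    cudaMallocHost((void **)&h_stage[0], CHUNK);
    cudaMallocHost((void **)&h_stage[1], CHUNK);

    int slot = 0;
    for (size_t off = 0; off < bytes; off += CHUNK, slot ^= 1) {
        size_t len = (bytes - off < CHUNK) ? bytes - off : CHUNK;

        /* don't overwrite a staging buffer that is still being sent */
        MPI_Wait(&req[slot], MPI_STATUS_IGNORE);

        /* copy the next chunk down to pinned host memory; the previous
         * chunk's MPI_Isend can progress on the wire in the meantime   */
        cudaMemcpyAsync(h_stage[slot], d_buf + off, len,
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);

        MPI_Isend(h_stage[slot], (int)len, MPI_BYTE, dest, 99, comm,
                  &req[slot]);
    }
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    cudaFreeHost(h_stage[0]);
    cudaFreeHost(h_stage[1]);
    cudaStreamDestroy(stream);
}

A GPU-aware MPI library does all of this (plus the RDMA side) inside the library, so the application never sees the staging.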
Open MPI also supports similar functionality, but as Open MPI is not an academic project, there are fewer academic papers documenting the internals of the latest developments (not saying that it's bad - many products are not academic in nature and thus have fewer published papers...).
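For completeness, this is roughly what the application side looks like, assuming an MPI library built with CUDA support (e.g. MVAPICH2 1.8 or an Open MPI build with CUDA enabled): the device pointer goes straight into the MPI calls and the library does the staging/RDMA underneath. Buffer size and tag are arbitrary:

/* Minimal sketch assuming a CUDA-aware MPI: device memory is passed
 * directly to MPI_Send/MPI_Recv and the library (via UVA) detects that
 * the pointer is GPU memory.  Error checking omitted; the buffer is
 * left uninitialized because only the call pattern matters here.      */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank;
    double *d_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&d_buf, 1024 * sizeof(double));

    if (rank == 0)
        MPI_Send(d_buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}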
Rayson

=================================
Grid Engine / Open Grid Scheduler
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/


On Mon, Dec 12, 2011 at 11:40 AM, Durga Choudhury <dpcho...@gmail.com> wrote:
> I think this is a *great* topic for discussion, so let me throw some
> fuel to the fire: the mechanism described in the blog (that makes
> perfect sense) is fine for (N)UMA shared memory architectures. But
> will it work for asymmetric architectures such as the Cell BE or
> discrete GPUs where the data between the compute nodes have to be
> explicitly DMA'd in? Is there a middleware layer that makes it
> transparent to the upper layer software?
>
> Best regards
> Durga
>
> On Mon, Dec 12, 2011 at 11:00 AM, Rayson Ho <raysonlo...@gmail.com> wrote:
>> On Sat, Dec 10, 2011 at 3:21 PM, amjad ali <amja...@gmail.com> wrote:
>>> (2) The latest MPI implementations are intelligent enough that they use some
>>> efficient mechanism while executing MPI based codes on shared memory
>>> (multicore) machines. (Please tell me any reference to quote this fact.)
>>
>> Not an academic paper, but from a real MPI library developer/architect:
>>
>> http://blogs.cisco.com/performance/shared-memory-as-an-mpi-transport/
>> http://blogs.cisco.com/performance/shared-memory-as-an-mpi-transport-part-2/
>>
>> Open MPI is used by Japan's K computer (current #1 TOP500 computer)
>> and LANL's RoadRunner (#1 Jun 08 – Nov 09), and "10^16 Flops Can't Be
>> Wrong" and "10^15 Flops Can't Be Wrong":
>>
>> http://www.open-mpi.org/papers/sc-2008/jsquyres-cisco-booth-talk-2up.pdf
>>
>> Rayson
>>
>> =================================
>> Grid Engine / Open Grid Scheduler
>> http://gridscheduler.sourceforge.net/
>>
>> Scalable Grid Engine Support Program
>> http://www.scalablelogic.com/
>>
>>> Please help me in formally justifying this and comment/modify the above two
>>> justifications. Better if you can suggest some reference from any suitable
>>> publication in this regard.
>>>
>>> best regards,
>>> Amjad Ali
>>
>> --
>> Rayson
>>
>> ==================================================
>> Open Grid Scheduler - The Official Open Source Grid Engine
>> http://gridscheduler.sourceforge.net/

--
Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/