The MPI.jl package also supports calling MPI routines directly.  If you are 
transferring arrays of immutables, they can be sent with no overhead 
(serialization or otherwise).  The limitation is that mutable types cannot be 
sent this way (though they can be via the remotecall framework).
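
A minimal sketch of that pattern (a hypothetical standalone script; the exact 
`Bcast!` signature varies between MPI.jl versions, and it would be launched 
with something like `mpirun -np 4 julia script.jl`):

```julia
# Broadcast a Float64 vector from rank 0 to every rank.
# Float64 is an isbits type, so the buffer goes over the wire
# as raw bytes, with no serialization step.
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

n = 125_000                        # ~1 MB of Float64s
buf = rank == 0 ? rand(n) : zeros(n)
MPI.Bcast!(buf, 0, comm)           # every rank now holds rank 0's vector

MPI.Finalize()
```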

    Jared Crean

On Friday, August 12, 2016 at 1:12:50 PM UTC-4, Amit Murthy wrote:
>
> Are the constants the same across iterations? If so, you may find a 
> CachingPool useful - 
> http://docs.julialang.org/en/latest/stdlib/parallel/#Base.CachingPool
>
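For illustration, a minimal sketch of the CachingPool pattern (in recent Julia 
it lives in the Distributed stdlib; the variable names here are made up):

```julia
using Distributed
addprocs(2)

data = rand(10^5)                  # a large constant reused across iterations
pool = CachingPool(workers())      # the mapped closure (and the `data` it
                                   # references) is shipped once per worker,
                                   # then cached there

results = pmap(pool, 1:10) do i    # later calls reuse the cached closure
    sum(data) + i
end
```

Without the pool, each `pmap` batch can re-send the captured data; with it, 
the transfer cost is paid once per worker.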
> >> Is there a way to only incur serialization/preparation costs once on 
> the central worker, when the same data is
> >> transferred to multiple workers?
>
> Time how long it takes to serialize your data to an IOBuffer. 
> That will give you an idea of whether there is any benefit to serializing 
> first to an IOBuffer and then sending those bytes once to every worker.
>
> We don't yet have a network-optimized construct like MPI broadcast 
> (@everywhere serializes separately to each worker).  
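For example, a quick way to measure that one-off cost (Serialization is a 
stdlib in Julia 1.x; in the 0.4/0.5-era Julia of this thread, `serialize` 
lived in Base):

```julia
using Serialization

v = rand(125_000)                   # ~1 MB of Float64s
io = IOBuffer()
t = @elapsed serialize(io, v)       # pay the serialization cost once
bytes = take!(io)                   # a reusable blob: send these same bytes
                                    # to every worker instead of re-serializing
v2 = deserialize(IOBuffer(bytes))   # what the receiving side would do
```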
>
> >> Is it likely to help if I write branching code (1 sends to 2), (1 and 2 
> send to 3 and 4), (1,2,3,4 send to 5,6,7,8)?
>
> How many workers do you have? I doubt you will see much benefit for up to 
> a couple of hundred workers at the sizes you mentioned. 
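For concreteness, a hypothetical sketch of the branching idea over remotecall 
(`tree_send`, `payload`, and `received` are made-up names): each process that 
holds the data forwards it to the head of each half of the remaining ids, so 
the root only ever sends twice.

```julia
using Distributed
addprocs(4)

# Define on all processes so workers can relay onward.
@everywhere function tree_send(data, ids)
    isempty(ids) && return nothing
    mid = cld(length(ids), 2)
    @sync for half in (ids[1:mid], ids[mid+1:end])
        isempty(half) && continue
        head, rest = half[1], half[2:end]
        @async remotecall_wait(head, data, rest) do d, r
            global received = d    # stash the payload on this worker
            tree_send(d, r)        # relay to the rest of this half
        end
    end
end

payload = rand(10^4)
tree_send(payload, workers())      # every worker ends up with `received`
```

Whether the log-depth fan-out beats the flat send depends on worker count and 
vector size, per the point above.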
>
> >> Alternately is there any way of using other, faster technologies from 
> within the REPL? My cluster supports MPI, and also I have GPUs with 
> infiniband connections. 
>
> Package MPI.jl supports using MPI as transport. There is some more work 
> required to optimize the use of MPI transport and support MPI broadcast. 
> This is WIP.
>
> >> My appetite for messing around with this to achieve better performance 
> is quite high.
>
> Good to hear. If you have narrowed down specific performance issues, they 
> can be worked on in the MPI.jl / Julia repos. 
>
>
> On Fri, Aug 12, 2016 at 3:19 PM, Matthew Pearce <[email protected]> wrote:
>
>> Dear Julians
>>
>> I'm trying to speed up the network data transfer in an MCMC algorithm. 
>> Currently I'm using `remotecall` based functions to do the network 
>> communication.
>>
>> Essentially, for every node I include, I incur about 50mb of data 
>> transfer per iteration. The topology is that various 1mb vectors get 
>> computed on worker nodes and transferred back to a central node. The 
>> central node does some work on the vector and sends back a copy of the 
>> resulting vector (same size) to each worker node.
>>
>> Now I'm doing the send and receive transfers asynchronously, but it's 
>> scaling quite badly because the network transfer complexity is 
>> O(nodes*vectors) and the constants are big. This makes me think there's 
>> redundant work going on, such as the same vector being re-serialized on 
>> the central node for each transfer to another node. 
>>
>>    - Is there a way to only incur serialization/preparation costs once 
>>    on the central worker, when the same data is transferred to multiple 
>>    workers?
>>    - Is it likely to help if I write branching code (1 sends to 2), (1 
>>    and 2 send to 3 and 4), (1,2,3,4 send to 5,6,7,8)? 
>>    - Alternately is there any way of using other, faster technologies 
>>    from within the REPL? My cluster supports MPI, and also I have GPUs with 
>>    infiniband connections. 
>>    
>> My appetite for messing around with this to achieve better performance is 
>> quite high.
>>
>> Cheers in advance
>>
>> Matthew 
>>
>
>
