Re: Handling tree reduction algorithm with Spark in parallel

Debasish Das Tue, 30 Sep 2014 15:45:52 -0700

If the tree is too big build it on graphx....but it will need thorough
analysis so that the partitions are well balanced...


On Tue, Sep 30, 2014 at 2:45 PM, Andy Twigg <[email protected]> wrote:

> Hi Boromir,
>
> Assuming the tree fits in memory, and what you want to do is parallelize
> the computation, the 'obvious' way is the following:
>
> * broadcast the tree T to each worker (ok since it fits in memory)
> * construct an RDD for the deepest level - each element in the RDD is
> (parent,data_at_node)
> * aggregate this by key (=parent) -> RDD[parent,data]
> * map each element (p, data) -> (parent(p), data) using T
> * repeat until you have an RDD of size = 1 (assuming T is connected)
>
> If T cannot fit in memory, or is very deep, then there are more exotic
> techniques, but hopefully this suffices.
>
> Andy
>
>
> --
> http://www.cs.ox.ac.uk/people/andy.twigg/
>
> On 30 September 2014 14:12, Boromir Widas <[email protected]> wrote:
>
>> Hello Folks,
>>
>> I have been trying to implement a tree reduction algorithm recently in
>> spark but could not find suitable parallel operations. Assuming I have a
>> general tree like the following -
>>
>>
>>
>> I have to do the following -
>> 1) Do some computation at each leaf node to get an array of doubles.(This
>> can be pre computed)
>> 2) For each non leaf node, starting with the root node compute the sum of
>> these arrays for all child nodes. So to get the array for node B, I need to
>> get the array for E, which is the sum of G + H.
>>
>> ////////////////////// Start Snippet
>> case class Node(name: String, children: Array[Node], values:
>> Array[Double])
>>
>> // read in the tree here
>>
>> def getSumOfChildren(node: Node) : Array[Double] = {
>>     if(node.isLeafNode) {
>>       return node.values
>>    }
>>     foreach(child in node.children) {
>>        // can use an accumulator here
>>        node.values = (node.values,
>> getSumOfChildren(child)).zipped.map(_+_)
>>    }
>>    node.values
>> }
>> ////////////////////////// End Snippet
>>
>> Any pointers to how this can be done in parallel to use all cores will be
>> greatly appreciated.
>>
>> Thanks,
>> Boromir.
>>
>>
>

Re: Handling tree reduction algorithm with Spark in parallel

Reply via email to