Yes, that makes sense. It's similar to the all reduce pattern in vw.

On Wednesday, 1 October 2014, Matei Zaharia <[email protected]> wrote:

> Some of the MLlib algorithms do tree reduction in 1.1:
> http://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html.
> You can check out how they implemented it -- it is a series of reduce
> operations.
>
> Matei
>
> On Oct 1, 2014, at 11:02 AM, Boromir Widas <[email protected]
> <javascript:_e(%7B%7D,'cvml','[email protected]');>> wrote:
>
> Thanks a lot Andy and Debashish, your suggestions were of great help.
>
> On Tue, Sep 30, 2014 at 6:44 PM, Debasish Das <[email protected]
> <javascript:_e(%7B%7D,'cvml','[email protected]');>> wrote:
>
>> If the tree is too big build it on graphx....but it will need thorough
>> analysis so that the partitions are well balanced...
>>
>> On Tue, Sep 30, 2014 at 2:45 PM, Andy Twigg <[email protected]
>> <javascript:_e(%7B%7D,'cvml','[email protected]');>> wrote:
>>
>>> Hi Boromir,
>>>
>>> Assuming the tree fits in memory, and what you want to do is parallelize
>>> the computation, the 'obvious' way is the following:
>>>
>>> * broadcast the tree T to each worker (ok since it fits in memory)
>>> * construct an RDD for the deepest level - each element in the RDD is
>>> (parent,data_at_node)
>>> * aggregate this by key (=parent) -> RDD[parent,data]
>>> * map each element (p, data) -> (parent(p), data) using T
>>> * repeat until you have an RDD of size = 1 (assuming T is connected)
>>>
>>> If T cannot fit in memory, or is very deep, then there are more exotic
>>> techniques, but hopefully this suffices.
>>>
>>> Andy
>>>
>>>
>>> --
>>> http://www.cs.ox.ac.uk/people/andy.twigg/
>>>
>>> On 30 September 2014 14:12, Boromir Widas <[email protected]
>>> <javascript:_e(%7B%7D,'cvml','[email protected]');>> wrote:
>>>
>>>> Hello Folks,
>>>>
>>>> I have been trying to implement a tree reduction algorithm recently in
>>>> spark but could not find suitable parallel operations. Assuming I have a
>>>> general tree like the following -
>>>>
>>>>
>>>>
>>>> I have to do the following -
>>>> 1) Do some computation at each leaf node to get an array of
>>>> doubles.(This can be pre computed)
>>>> 2) For each non leaf node, starting with the root node compute the sum
>>>> of these arrays for all child nodes. So to get the array for node B, I need
>>>> to get the array for E, which is the sum of G + H.
>>>>
>>>> ////////////////////// Start Snippet
>>>> case class Node(name: String, children: Array[Node], values:
>>>> Array[Double])
>>>>
>>>> // read in the tree here
>>>>
>>>> def getSumOfChildren(node: Node) : Array[Double] = {
>>>>     if(node.isLeafNode) {
>>>>       return node.values
>>>>    }
>>>>     foreach(child in node.children) {
>>>>        // can use an accumulator here
>>>>        node.values = (node.values,
>>>> getSumOfChildren(child)).zipped.map(_+_)
>>>>    }
>>>>    node.values
>>>> }
>>>> ////////////////////////// End Snippet
>>>>
>>>> Any pointers to how this can be done in parallel to use all cores will
>>>> be greatly appreciated.
>>>>
>>>> Thanks,
>>>> Boromir.
>>>>
>>>>
>>>
>>
>
>

Reply via email to