Yes, that makes sense. It's similar to the all reduce pattern in vw. On Wednesday, 1 October 2014, Matei Zaharia <[email protected]> wrote:
> Some of the MLlib algorithms do tree reduction in 1.1: > http://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html. > You can check out how they implemented it -- it is a series of reduce > operations. > > Matei > > On Oct 1, 2014, at 11:02 AM, Boromir Widas <[email protected] > <javascript:_e(%7B%7D,'cvml','[email protected]');>> wrote: > > Thanks a lot Andy and Debashish, your suggestions were of great help. > > On Tue, Sep 30, 2014 at 6:44 PM, Debasish Das <[email protected] > <javascript:_e(%7B%7D,'cvml','[email protected]');>> wrote: > >> If the tree is too big build it on graphx....but it will need thorough >> analysis so that the partitions are well balanced... >> >> On Tue, Sep 30, 2014 at 2:45 PM, Andy Twigg <[email protected] >> <javascript:_e(%7B%7D,'cvml','[email protected]');>> wrote: >> >>> Hi Boromir, >>> >>> Assuming the tree fits in memory, and what you want to do is parallelize >>> the computation, the 'obvious' way is the following: >>> >>> * broadcast the tree T to each worker (ok since it fits in memory) >>> * construct an RDD for the deepest level - each element in the RDD is >>> (parent,data_at_node) >>> * aggregate this by key (=parent) -> RDD[parent,data] >>> * map each element (p, data) -> (parent(p), data) using T >>> * repeat until you have an RDD of size = 1 (assuming T is connected) >>> >>> If T cannot fit in memory, or is very deep, then there are more exotic >>> techniques, but hopefully this suffices. >>> >>> Andy >>> >>> >>> -- >>> http://www.cs.ox.ac.uk/people/andy.twigg/ >>> >>> On 30 September 2014 14:12, Boromir Widas <[email protected] >>> <javascript:_e(%7B%7D,'cvml','[email protected]');>> wrote: >>> >>>> Hello Folks, >>>> >>>> I have been trying to implement a tree reduction algorithm recently in >>>> spark but could not find suitable parallel operations. Assuming I have a >>>> general tree like the following - >>>> >>>> >>>> >>>> I have to do the following - >>>> 1) Do some computation at each leaf node to get an array of >>>> doubles.(This can be pre computed) >>>> 2) For each non leaf node, starting with the root node compute the sum >>>> of these arrays for all child nodes. So to get the array for node B, I need >>>> to get the array for E, which is the sum of G + H. >>>> >>>> ////////////////////// Start Snippet >>>> case class Node(name: String, children: Array[Node], values: >>>> Array[Double]) >>>> >>>> // read in the tree here >>>> >>>> def getSumOfChildren(node: Node) : Array[Double] = { >>>> if(node.isLeafNode) { >>>> return node.values >>>> } >>>> foreach(child in node.children) { >>>> // can use an accumulator here >>>> node.values = (node.values, >>>> getSumOfChildren(child)).zipped.map(_+_) >>>> } >>>> node.values >>>> } >>>> ////////////////////////// End Snippet >>>> >>>> Any pointers to how this can be done in parallel to use all cores will >>>> be greatly appreciated. >>>> >>>> Thanks, >>>> Boromir. >>>> >>>> >>> >> > >
