Thanks Matei, will check out the MLLib implementation. On Wed, Oct 1, 2014 at 2:24 PM, Andy Twigg <[email protected]> wrote:
> Yes, that makes sense. It's similar to the all reduce pattern in vw. > > > On Wednesday, 1 October 2014, Matei Zaharia <[email protected]> > wrote: > >> Some of the MLlib algorithms do tree reduction in 1.1: >> http://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html. >> You can check out how they implemented it -- it is a series of reduce >> operations. >> >> Matei >> >> On Oct 1, 2014, at 11:02 AM, Boromir Widas <[email protected]> wrote: >> >> Thanks a lot Andy and Debashish, your suggestions were of great help. >> >> On Tue, Sep 30, 2014 at 6:44 PM, Debasish Das <[email protected]> >> wrote: >> >>> If the tree is too big build it on graphx....but it will need thorough >>> analysis so that the partitions are well balanced... >>> >>> On Tue, Sep 30, 2014 at 2:45 PM, Andy Twigg <[email protected]> >>> wrote: >>> >>>> Hi Boromir, >>>> >>>> Assuming the tree fits in memory, and what you want to do is >>>> parallelize the computation, the 'obvious' way is the following: >>>> >>>> * broadcast the tree T to each worker (ok since it fits in memory) >>>> * construct an RDD for the deepest level - each element in the RDD is >>>> (parent,data_at_node) >>>> * aggregate this by key (=parent) -> RDD[parent,data] >>>> * map each element (p, data) -> (parent(p), data) using T >>>> * repeat until you have an RDD of size = 1 (assuming T is connected) >>>> >>>> If T cannot fit in memory, or is very deep, then there are more exotic >>>> techniques, but hopefully this suffices. >>>> >>>> Andy >>>> >>>> >>>> -- >>>> http://www.cs.ox.ac.uk/people/andy.twigg/ >>>> >>>> On 30 September 2014 14:12, Boromir Widas <[email protected]> wrote: >>>> >>>>> Hello Folks, >>>>> >>>>> I have been trying to implement a tree reduction algorithm recently in >>>>> spark but could not find suitable parallel operations. Assuming I have a >>>>> general tree like the following - >>>>> >>>>> >>>>> >>>>> I have to do the following - >>>>> 1) Do some computation at each leaf node to get an array of >>>>> doubles.(This can be pre computed) >>>>> 2) For each non leaf node, starting with the root node compute the sum >>>>> of these arrays for all child nodes. So to get the array for node B, I >>>>> need >>>>> to get the array for E, which is the sum of G + H. >>>>> >>>>> ////////////////////// Start Snippet >>>>> case class Node(name: String, children: Array[Node], values: >>>>> Array[Double]) >>>>> >>>>> // read in the tree here >>>>> >>>>> def getSumOfChildren(node: Node) : Array[Double] = { >>>>> if(node.isLeafNode) { >>>>> return node.values >>>>> } >>>>> foreach(child in node.children) { >>>>> // can use an accumulator here >>>>> node.values = (node.values, >>>>> getSumOfChildren(child)).zipped.map(_+_) >>>>> } >>>>> node.values >>>>> } >>>>> ////////////////////////// End Snippet >>>>> >>>>> Any pointers to how this can be done in parallel to use all cores will >>>>> be greatly appreciated. >>>>> >>>>> Thanks, >>>>> Boromir. >>>>> >>>>> >>>> >>> >> >>
