Hi Zhan,

That sounds really interesting! Please @ me when you submit the PR. If possible, please also post the performance difference.
Thanks,
Xiao Li

2015-11-11 14:45 GMT-08:00 Zhan Zhang <zzh...@hortonworks.com>:
> Hi Folks,
>
> I did some performance measurements based on TPC-H recently and want to
> bring up two performance issues I observed. Both are related to cartesian
> joins.
>
> 1. CartesianProduct implementation.
>
> Currently CartesianProduct relies on RDD.cartesian, in which the
> computation is realized as follows:
>
>   override def compute(split: Partition, context: TaskContext):
>       Iterator[(T, U)] = {
>     val currSplit = split.asInstanceOf[CartesianPartition]
>     for (x <- rdd1.iterator(currSplit.s1, context);
>          y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
>   }
>
> As the loop shows, if rdd1.count is n, rdd2 is recomputed n times. That is
> very expensive, and the job may never finish if n is large, especially
> when rdd2 comes from a ShuffleRDD.
>
> We could optimize CartesianProduct by caching rightResults. The problem is
> that, AFAIK, we don't have a cleanup hook to unpersist rightResults. I
> think we should add a cleanup hook that runs after query execution. With
> such a hook available, we could easily optimize this kind of cartesian
> join, and I believe it could also benefit other query optimizations.
>
> 2. Unnecessary CartesianProduct joins.
>
> When we have a query similar to the following (I don't remember the exact
> form):
>
>   select * from a, b, c, d
>   where a.key1 = c.key1 and b.key2 = c.key2 and c.key3 = d.key3
>
> there will be a cartesian join between a and b. But if we simply change
> the table order, for example to a, c, b, d, the cartesian join is
> eliminated. Without such manual tuning, the query will never finish if a
> and c are big, and we should not rely on manual optimization.
>
> Please provide your input. If both issues are valid, I will open a JIRA
> for each.
>
> Thanks.
> Zhan Zhang
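For what it's worth, here is a minimal stand-alone sketch of why caching the right side helps. This is plain Scala, not Spark code: the `CartesianSketch` object, `rightIterator()`, and the counter are all made up for illustration, with `rightIterator()` standing in for computing an RDD partition, so each call simulates one recomputation of rdd2.

```scala
object CartesianSketch {
  // Counts how many times the right side is (re)computed.
  var rightComputeCount = 0

  // Stand-in for rdd2.iterator(...): each call simulates recomputing
  // the right partition, as a ShuffleRDD partition would be.
  def rightIterator(): Iterator[Int] = {
    rightComputeCount += 1
    Iterator(10, 20)
  }

  // Mirrors RDD.cartesian's compute: the right side is re-created
  // once per left row.
  def naive(left: Seq[Int]): Seq[(Int, Int)] =
    (for (x <- left.iterator; y <- rightIterator()) yield (x, y)).toSeq

  // Caching variant: materialize the right side once, then reuse it.
  def cached(left: Seq[Int]): Seq[(Int, Int)] = {
    val right = rightIterator().toSeq
    (for (x <- left.iterator; y <- right.iterator) yield (x, y)).toSeq
  }

  def main(args: Array[String]): Unit = {
    rightComputeCount = 0
    naive(Seq(1, 2, 3))
    println(s"naive right-side computations: $rightComputeCount")   // 3

    rightComputeCount = 0
    cached(Seq(1, 2, 3))
    println(s"cached right-side computations: $rightComputeCount")  // 1
  }
}
```

With three left rows, the naive loop recomputes the right side three times while the cached variant computes it once; in real Spark the caching would be `rightResults.persist(...)`, with the cleanup hook Zhan proposes calling `unpersist` after the query finishes.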
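And a toy sketch of the reordering idea in point 2. The `JoinReorder` object and its greedy heuristic are hypothetical (this is not Catalyst code): given the equi-join predicates from the example query as edges, it orders the tables so that each table after the first shares a join key with a table already chosen, falling back to a cartesian join only when no connected table remains.

```scala
object JoinReorder {
  // Join predicates from the example query:
  // a.key1 = c.key1, b.key2 = c.key2, c.key3 = d.key3
  val edges: Set[(String, String)] = Set(("a", "c"), ("b", "c"), ("c", "d"))

  def connected(t1: String, t2: String): Boolean =
    edges.contains((t1, t2)) || edges.contains((t2, t1))

  // Greedily reorder: always pick next a table that joins with one
  // already chosen, so no cartesian join is introduced unnecessarily.
  def reorder(tables: Seq[String]): Seq[String] = {
    val remaining = scala.collection.mutable.ListBuffer(tables: _*)
    val ordered = scala.collection.mutable.ListBuffer(remaining.remove(0))
    while (remaining.nonEmpty) {
      val i = remaining.indexWhere(t => ordered.exists(connected(_, t)))
      // Fall back to a cartesian join (index 0) only if nothing connects.
      ordered += remaining.remove(if (i >= 0) i else 0)
    }
    ordered.toSeq
  }

  def main(args: Array[String]): Unit = {
    println(JoinReorder.reorder(Seq("a", "b", "c", "d")))  // List(a, c, b, d)
  }
}
```

On the original order a, b, c, d this yields a, c, b, d, the same order Zhan found by hand, so every join shares a key with the tables before it and the a-b cartesian join disappears.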