Hi, Zhan,

That sounds really interesting! Please @ me when you submit the PR. If
possible, please also post the performance difference.

Thanks,

Xiao Li


2015-11-11 14:45 GMT-08:00 Zhan Zhang <zzh...@hortonworks.com>:

> Hi Folks,
>
> I did some performance measurement based on TPC-H recently, and want to
> bring up some performance issue I observed. Both are related to cartesian
> join.
>
> 1. CartesianProduct implementation.
>
> Currently CartesianProduct relies on RDD.cartesian, in which the
> computation is realized as follows
>
>   override def compute(split: Partition, context: TaskContext):
> Iterator[(T, U)] = {
>     val currSplit = split.asInstanceOf[CartesianPartition]
>     for (x <- rdd1.iterator(currSplit.s1, context);
>          y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
>   }
>
> From the above loop, if rdd1.count is n, rdd2 needs to be recomputed n
> times, which is really heavy and may never finish if n is large,
> especially when rdd2 comes from a ShuffleRDD.
>
> We should optimize CartesianProduct by caching rightResults. The
> problem is that, AFAIK, we don't have a cleanup hook to unpersist
> rightResults. I think we should add a cleanup hook that runs after
> query execution.
> With such a hook available, we can easily optimize this kind of
> Cartesian join, and I believe it may also benefit other query
> optimizations.
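To make the recomputation cost concrete, here is a plain-Scala sketch (no Spark required) of the behavior of the `compute` loop quoted above: the inner generator of the for-comprehension is re-evaluated once per element of the outer iterator, which is what makes rdd2 get recomputed rdd1.count times. All names below are illustrative, not Spark APIs.

```scala
object CartesianRecompute {
  /** Counts how many times the right side is (re)built: uncached vs cached. */
  def evaluations(leftSize: Int): (Int, Int) = {
    var builds = 0
    def rightSide(): Iterator[String] = {
      builds += 1 // stands in for recomputing rdd2's lineage (e.g. a shuffle)
      Iterator("a", "b", "c")
    }
    val left = 1 to leftSize

    // Uncached: mirrors CartesianRDD.compute -- one rebuild per left element.
    (for (x <- left.iterator; y <- rightSide()) yield (x, y)).size
    val uncached = builds

    // Cached: materialize the right side once, as persisting rightResults would.
    builds = 0
    val cachedRight = rightSide().toVector
    (for (x <- left.iterator; y <- cachedRight.iterator) yield (x, y)).size
    (uncached, builds)
  }

  def main(args: Array[String]): Unit = {
    val (uncached, cached) = evaluations(4)
    println(s"right side built $uncached times uncached, $cached time cached")
  }
}
```

With a left side of 4 elements the right side is built 4 times uncached but only once when materialized first, which is the saving that persisting rightResults (plus an unpersist hook) would buy.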
>
>
> 2. Unnecessary CartesianProduct join.
>
> When we have a query similar to the following (I don't remember the
> exact form):
> select * from a, b, c, d where a.key1 = c.key1 and b.key2 = c.key2 and
> c.key3 = d.key3
>
> There will be a cartesian join between a and b. But if we simply
> change the table order, for example to a, c, b, d, the cartesian join
> is eliminated.
> Without such manual tuning, the query will never finish if a and c are
> big. But we should not rely on such manual optimization.
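As an illustration of the reordering being asked for, here is a toy Scala sketch, not Spark's optimizer: given which table pairs share an equi-join predicate, greedily pick the next table so that it joins with one already chosen, falling back to a cartesian step only when no connected table remains. All names are made up for the example.

```scala
object JoinReorderSketch {
  /** conds: pairs of tables linked by an equi-join predicate. */
  def reorder(tables: Seq[String], conds: Set[(String, String)]): Seq[String] = {
    def connected(t: String, chosen: Seq[String]): Boolean =
      chosen.exists(c => conds((t, c)) || conds((c, t)))

    tables.drop(1).foldLeft(Seq(tables.head)) { (chosen, _) =>
      val remaining = tables.filterNot(chosen.contains)
      // Prefer a table that joins with what we have; otherwise a
      // cartesian product is unavoidable at this step.
      val next = remaining.find(connected(_, chosen)).getOrElse(remaining.head)
      chosen :+ next
    }
  }
}
```

For the query above, with predicates on (a, c), (b, c), and (c, d), this yields the order a, c, b, d: exactly the manual rewrite that eliminates the cartesian join between a and b.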
>
>
> Please provide your input. If both are valid, I will open JIRAs for
> each.
>
> Thanks.
>
> Zhan Zhang
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
