Hi All,

I have a problem with the Cartesian product. I build the Cartesian product of RDDs in a loop and update one of the variables in each iteration. At the end of every iteration the variable is squeezed back to its original size, so I expect the same running time for each iteration, because the result of the Cartesian product always has the same size. Since the Cartesian products are executed one after another, I assume the result from the previous iteration is reused. That does not seem to be the case: the runtime grows exponentially with each iteration.
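To make concrete what I mean by "squeezed back", here is a minimal sketch of a single iteration with toy sizes (the sizes, app name, and kv-style lambda are just for illustration, not my real data): after cartesian + reduceByKey(min) + map, L contains exactly the elements it started with.

    from pyspark import SparkContext

    # assuming no SparkContext exists yet in this session
    sc = SparkContext(appName="cartesian-squeeze-sketch")

    L = sc.parallelize(['a', 'b', 'c'])      # 3 elements
    D = sc.parallelize([3, 1, 2])            # 3 values

    pairs = L.cartesian(D)                   # 9 pairs: ('a', 3), ('a', 1), ...
    squeezed = pairs.reduceByKey(min) \
                    .map(lambda kv: kv[0])   # back to 3 elements

    print(sorted(squeezed.collect()))        # ['a', 'b', 'c']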
Here is a simple code snippet to reproduce it:

    D = sc.parallelize(list(range(1, 100000, 1))).cache()
    L = sc.parallelize(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k']).cache()

    for i in range(1, 6):
        # pair every element of L with every element of D
        L = L.cartesian(D)
        L.unpersist()
        # keep the smallest D value per element of L, then drop it,
        # so L is back to its original ten elements
        L = L.reduceByKey(min) \
             .map(lambda (l, n): l).cache()
        L.collect()

Does somebody have an explanation for that? I run Spark 1.5.0 with seven workers.

Thanks