Hi All,
I have a problem with cartesian products. I build the cartesian product of
two RDDs in a loop and update one of the variables in each iteration. At the
end of each iteration the variable is squeezed back to its original size, so
I expect the same running time for every iteration, because the result of
the cartesian product always has the same size. Since the cartesian products
are executed one after another, I assumed the result from the previous
iteration would be reused. That does not seem to be the case: the runtime
grows exponentially with each iteration.
Here is a simple code snippet to reproduce it:

D = sc.parallelize(list(range(1, 100000))).cache()
L = sc.parallelize(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k']).cache()


for i in range(1, 6):
    old_L = L                      # keep a handle on the previous iteration's RDD
    L = old_L.cartesian(D) \
        .reduceByKey(min) \
        .map(lambda ln: ln[0]) \
        .cache()                   # squeeze L back to its original ten keys
    old_L.unpersist()              # unpersist the previous cached copy, not the new one
L.collect()
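
To make the growth visible per iteration, the same loop can be timed step by
step, forcing evaluation with count() and printing the partition count of L
(a minimal sketch using only the standard RDD API):

import time

L = sc.parallelize(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k']).cache()
for i in range(1, 6):
    start = time.time()
    L = L.cartesian(D) \
        .reduceByKey(min) \
        .map(lambda ln: ln[0]) \
        .cache()
    L.count()  # force evaluation so the timing covers only this iteration
    print(i, L.getNumPartitions(), time.time() - start)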


Does somebody have an explanation for this?

I am running Spark 1.5.0 with seven workers.

Thanks
