Hi All,
I have a problem with the cartesian product. In a loop, I build the
cartesian product of two RDDs and then reduce the result back to the
original size of one of the two inputs. At the end of each iteration, this
result is assigned back to the original variable. I expect the same running
time for every iteration, because the cartesian product always has the same
size. Since the cartesian products are executed one after another, I assumed
that the result of the previous iteration would be reused. That does not
seem to be the case: the runtime grows exponentially with each iteration.
Here is a simple code snippet to reproduce it:
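# sc is the SparkContext (e.g. the one provided by the pyspark shell)
D = sc.parallelize(list(range(1, 100000, 1))).cache()
L = sc.parallelize(['a','b','c','d','e','f','g','h','i','k']).cache()
for i in range(1, 6):
    # pair every element of L with every number in D
    L = L.cartesian(D)
    L.unpersist()
    # keep the smallest number per key, then drop the number again,
    # so L is back to its original 10 elements
    L = L.reduceByKey(min) \
         .map(lambda kv: kv[0]).cache()
    L.collect()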
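
To make the size argument concrete, here is a minimal local sketch (plain
Python, no Spark) of what I expect each pass of the loop to compute. The
helper name one_iteration is hypothetical, and the number range is much
smaller than in the snippet above, only to keep it fast. It models the data
sizes, not Spark's execution or caching behaviour.

```python
from itertools import product

def one_iteration(labels, numbers):
    """Local model of one loop pass: cartesian, min per key, keep the keys."""
    pairs = 0
    smallest = {}
    for label, n in product(labels, numbers):  # stands in for L.cartesian(D)
        pairs += 1
        # stands in for reduceByKey(min)
        if label not in smallest or n < smallest[label]:
            smallest[label] = n
    # stands in for map(lambda kv: kv[0]): only the keys survive
    return sorted(smallest), pairs

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k']
numbers = range(1, 1000)  # assumption: smaller than in the post, for speed

sizes = []
for i in range(1, 6):
    labels, pairs = one_iteration(labels, numbers)
    sizes.append((len(labels), pairs))

print(sizes)
```

In this model every pass produces the same number of pairs and ends with the
same 10 elements, which is why I expected a constant time per iteration.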
Does anybody have an explanation for that?
I am running Spark 1.5.0 with seven workers, using PySpark.
Thanks
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/cartesian-in-the-loop-runtime-grows-tp25303.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.