Hello,
it seems that using a Spark DataFrame that has had limit() applied to it in
further calculations produces very unexpected results. Some people use
limit() as a poor man's sampling and end up debugging for hours to figure
out what is going on. Here is an example:
from pyspark.sql import Row
rdd = sc.parallelize([Row(i=i) for i in range(1000000)], 200)
rdd1 = rdd.toDF().limit(12345).rdd
rdd2 = rdd1.map(lambda x: (x, x))
rdd2.join(rdd2).count()  # result is 10240 despite doing a self-join; expected 12345
This is with PySpark on Spark 2.0.0.
I understand that separate actions on a limit() result may yield different
rows every time, but re-computing the same rdd2 twice within a single DAG is
highly unexpected. If there is no easy fix for limit(), a note in the
documentation would be helpful. I can see that the intent behind limit() may
be that no two limit() paths should ever occur within a single DAG.
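
For what it's worth, one workaround that seems to avoid the problem is to
persist and materialize the limited RDD before branching, so that both join
inputs read the same cached partitions rather than re-running limit(). This
is only a sketch and assumes no cached partitions get evicted and recomputed;
it is not a guaranteed fix:

from pyspark.sql import Row
rdd = sc.parallelize([Row(i=i) for i in range(1000000)], 200)

# Cache the limited rows and force evaluation once, so both sides of the
# self-join below see the same set of rows instead of recomputing limit().
rdd1 = rdd.toDF().limit(12345).rdd.cache()
rdd1.count()

rdd2 = rdd1.map(lambda x: (x, x))
rdd2.join(rdd2).count()  # with the cache in place, this comes out as 12345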
What do you think? What is the correct explanation?
Anton