Hello,
it seems that using a Spark DataFrame that has had limit() applied to it in
further calculations produces very unexpected results. Some people use
limit() as a poor man's sampling and end up debugging for hours to figure
out what is going on. Here is an example:
from pyspark.sql import Row
rdd = sc.parallelize([Row(i=i) for i in range(1000000)], 200)
rdd1 = rdd.toDF().limit(12345).rdd
rdd2 = rdd1.map(lambda x: (x, x))
rdd2.join(rdd2).count()  # result is 10240 despite doing a self-join; expected 12345
This is with PySpark on Spark 2.0.0.
I understand that separate actions on a limit() result may yield different
rows every time, but re-computing the same rdd2 twice within a single DAG is
highly unexpected. If there is no easy fix for limit(), a note in the
documentation would be helpful. I can see that the intent behind limit() may
be that no two limit() paths should ever occur within a single DAG.
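
For what it's worth, one workaround that seems to avoid the problem is to
persist and materialize the limited RDD before branching, so that both join
inputs read the same cached partitions rather than re-running limit(). This
is only a sketch and assumes no cached partitions get evicted and recomputed;
it is not a guaranteed fix:

from pyspark.sql import Row
rdd = sc.parallelize([Row(i=i) for i in range(1000000)], 200)

# Cache the limited rows and force evaluation once, so both sides of the
# self-join below see the same set of rows instead of recomputing limit().
rdd1 = rdd.toDF().limit(12345).rdd.cache()
rdd1.count()

rdd2 = rdd1.map(lambda x: (x, x))
rdd2.join(rdd2).count()  # with the cache in place, this comes out as 12345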
What do you think? What is the correct explanation?
Anton