Hi,

I got the correct answer. Did I miss something?
// maropu

---
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Python version 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015 09:33:12)
SparkSession available as 'spark'.
>>>
>>> from pyspark.sql import Row
>>> rdd=sc.parallelize([Row(i=i) for i in range(1000000)],200)
>>> rdd1=rdd.toDF().limit(12345).rdd
>>> rdd2=rdd1.map(lambda x:(x,x))
>>> rdd2.join(rdd2).count()
12345

On Thu, Jan 12, 2017 at 4:18 PM, Ant <the.gere...@googlemail.com> wrote:
> Hello,
>
> it seems that using a Spark DataFrame which had limit() applied to it in
> further calculations produces very unexpected results. Some people use it
> as a poor man's sampling and end up debugging for hours to see what's
> going on. Here is an example:
>
> from pyspark.sql import Row
> rdd=sc.parallelize([Row(i=i) for i in range(1000000)],200)
> rdd1=rdd.toDF().limit(12345).rdd
> rdd2=rdd1.map(lambda x:(x,x))
> rdd2.join(rdd2).count()
> # result is 10240 despite doing a self-join; expected 12345
>
> This is with PySpark/Spark 2.0.0.
>
> I understand that actions on limit may yield different rows every time,
> but re-computing the same rdd2 within a single DAG is highly unexpected.
>
> Maybe a comment in the documentation would be helpful if there is no easy
> fix for limit. I do see that the intent for limit may be that no two
> limit paths should occur in a single DAG.
>
> What do you think? What is the correct explanation?
>
> Anton

--
---
Takeshi Yamamuro
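
For anyone hitting the original 10240-vs-12345 behaviour: below is a minimal
sketch (not from this thread, assuming Spark 2.x with the usual `spark`/`sc`
shell bindings) of one common workaround. Since limit() without an ordering
is not guaranteed to pick the same rows when its result is recomputed, the
idea is to materialize the limited DataFrame once, e.g. with cache() plus an
action, so both sides of the self-join reuse the same rows. This is a
practical mitigation rather than a hard guarantee, because evicted cached
partitions can still be recomputed.

from pyspark.sql import Row

rdd = sc.parallelize([Row(i=i) for i in range(1000000)], 200)

# Pin the 12345 rows chosen by limit() by caching and forcing materialization
limited = rdd.toDF().limit(12345).cache()
limited.count()  # action that populates the cache before the DataFrame is reused

pairs = limited.rdd.map(lambda x: (x, x))
# Both join branches now read the cached rows, so the self-join count matches
# the limit in practice.
print(pairs.join(pairs).count())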