Hi,

I got the correct answer. Did I miss something?
// maropu

---
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Python version 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015 09:33:12)
SparkSession available as 'spark'.
>>>
>>> from pyspark.sql import Row
>>> rdd=sc.parallelize([Row(i=i) for i in range(1000000)],200)
>>> rdd1=rdd.toDF().limit(12345).rdd
>>> rdd2=rdd1.map(lambda x:(x,x))
>>> rdd2.join(rdd2).count()
12345

On Thu, Jan 12, 2017 at 4:18 PM, Ant <the.gere...@googlemail.com> wrote:
> Hello,
>
> it seems that using a Spark DataFrame which had limit() applied to it in
> further calculations produces very unexpected results. Some people use it
> as a poor man's sampling and end up debugging for hours to see what's
> going on. Here is an example:
>
> from pyspark.sql import Row
> rdd=sc.parallelize([Row(i=i) for i in range(1000000)],200)
> rdd1=rdd.toDF().limit(12345).rdd
> rdd2=rdd1.map(lambda x:(x,x))
> rdd2.join(rdd2).count()
> # result is 10240 despite doing a self-join; expected 12345
>
> This is with PySpark/Spark 2.0.0.
>
> I understand that actions on limit may yield different rows every time,
> but re-computing the same rdd2 within a single DAG is highly unexpected.
>
> Maybe a comment in the documentation would be helpful if there is no easy
> fix for limit. I do see that the intent for limit may be that no two
> limit paths should occur in a single DAG.
>
> What do you think? What is the correct explanation?
>
> Anton

--
---
Takeshi Yamamuro
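
For anyone hitting the original 10240-vs-12345 behaviour: below is a minimal
sketch (not from this thread, assuming Spark 2.x with the usual `spark`/`sc`
shell bindings) of one common workaround. Since limit() without an ordering
is not guaranteed to pick the same rows when its result is recomputed, the
idea is to materialize the limited DataFrame once, e.g. with cache() plus an
action, so both sides of the self-join reuse the same rows. This is a
practical mitigation rather than a hard guarantee, because evicted cached
partitions can still be recomputed.

from pyspark.sql import Row

rdd = sc.parallelize([Row(i=i) for i in range(1000000)], 200)

# Pin the 12345 rows chosen by limit() by caching and forcing materialization
limited = rdd.toDF().limit(12345).cache()
limited.count()  # action that populates the cache before the DataFrame is reused

pairs = limited.rdd.map(lambda x: (x, x))
# Both join branches now read the cached rows, so the self-join count matches
# the limit in practice.
print(pairs.join(pairs).count())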