There is an "isEmpty" method that basically does exactly what your
second version does.
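
For reference, the check could look like this (a minimal sketch;
"filtered" stands in for your filtered RDD, and this assumes isEmpty
is exposed in your SparkR build):

  # Counting forces a full scan of every partition:
  if (count(filtered) > 0) { ... }

  # isEmpty only needs to find a single element:
  if (!isEmpty(filtered)) { ... }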

I have seen it be unusually slow at times because it must copy one
element to the driver, and that copy can be expensive. It still
shouldn't be slow in general, and I'd be surprised if it were slower
than a count in anything but pathological cases.
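
If you want to verify which is cheaper on your data, timing the three
variants side by side is easy enough (a sketch; system.time is base R,
and "filtered" is again a stand-in for your RDD):

  system.time(count(filtered) > 0)
  system.time(length(take(filtered, 1)) > 0)
  system.time(!isEmpty(filtered))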



On Tue, Mar 1, 2016 at 6:03 PM, Dirceu Semighini Filho
<dirceu.semigh...@gmail.com> wrote:
> Hello all.
> I have a script that creates a DataFrame from this operation:
>
> mytable <- sql(sqlContext, "SELECT ID_PRODUCT, ... FROM mytable")
>
> rSparkDf <- createPartitionedDataFrame(sqlContext, myRdataframe)
> dFrame <- join(mytable, rSparkDf, mytable$ID_PRODUCT == rSparkDf$ID_PRODUCT)
>
> To filter this dFrame, I tried to execute the following:
>
> filteredDF <- filterRDD(toRDD(dFrame),
>                         function(row) { row['COLUMN'] %in% c("VALUES", ...) })
> Now I need to know whether the resulting DataFrame is empty, and to do
> that I tried these two approaches:
> if (count(filteredDF) > 0)
> and
> if (length(take(filteredDF, 1)) > 0)
> I thought that the second one, using take, should run faster than
> count, but that didn't happen. The take operation creates one job per
> partition of my RDD (which has 200 partitions), and this makes it run
> slower than the count.
> Is this the expected behaviour?
>
> Regards,
> Dirceu
