Great, I hadn't noticed this isEmpty method. Serialization has been a problem in this project: we have noticed a lot of time being spent serializing and deserializing data sent to and fetched from the cluster.
2016-03-01 15:47 GMT-03:00 Sean Owen <so...@cloudera.com>:

> There is an "isEmpty" method that basically does exactly what your
> second version does.
>
> I have seen it be unusually slow at times because it must copy 1
> element to the driver, and it's possible that's slow. It still
> shouldn't be slow in general, and I'd be surprised if it's slower than
> a count in all but pathological cases.
>
>
> On Tue, Mar 1, 2016 at 6:03 PM, Dirceu Semighini Filho
> <dirceu.semigh...@gmail.com> wrote:
> > Hello all,
> > I have a script that creates a dataframe from this operation:
> >
> > mytable <- sql(sqlContext, "SELECT ID_PRODUCT, ... FROM mytable")
> >
> > rSparkDf <- createPartitionedDataFrame(sqlContext, myRdataframe)
> > dFrame <- join(mytable, rSparkDf, mytable$ID_PRODUCT == rSparkDf$ID_PRODUCT)
> >
> > After that I filtered this dFrame with the following:
> >
> > filteredDF <- filterRDD(toRDD(dFrame), function(row) { row['COLUMN'] %in%
> >   c("VALUES", ...) })
> >
> > Now I need to know if the resulting dataframe is empty, and to do that I
> > tried these two pieces of code:
> >
> > if(count(filteredDF) > 0)
> >
> > and
> >
> > if(length(take(filteredDF, 1)) > 0)
> >
> > I thought that the second one, using take, should run faster than count,
> > but that didn't happen. The take operation creates one job per partition
> > of my RDD (which was 200), and this makes it run slower than the count.
> > Is this the expected behaviour?
> >
> > Regards,
> > Dirceu
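For reference, a minimal sketch in Scala (against the core RDD API rather than SparkR) of the three emptiness checks being compared; the RDD and numbers below are made-up stand-ins for the filtered dataframe above, not the actual job:

import org.apache.spark.{SparkConf, SparkContext}

object EmptyCheckSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("empty-check-sketch").setMaster("local[*]"))

    // Hypothetical stand-in for the filtered RDD with 200 partitions.
    val filtered = sc.parallelize(1 to 1000000, 200).filter(_ % 7 == 0)

    // Full count: scans every partition even though we only need to know
    // whether at least one element exists.
    val nonEmptyByCount = filtered.count() > 0

    // take(1): stops once an element is found, but copies that element
    // to the driver.
    val nonEmptyByTake = filtered.take(1).nonEmpty

    // isEmpty on the core RDD API is essentially the take(1) check,
    // which is why it also ships one element to the driver.
    val nonEmptyByIsEmpty = !filtered.isEmpty()

    println(s"count: $nonEmptyByCount, take: $nonEmptyByTake, isEmpty: $nonEmptyByIsEmpty")
    sc.stop()
  }
}

As Sean notes, isEmpty is basically the take(1) variant with the single-element copy to the driver as its main cost, so it should normally beat a full count except in pathological cases.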