Yeah, one surprising result is that you can't call isEmpty on an RDD of
non-serializable objects. You can't do much with an RDD of
non-serializable objects anyway, but they can exist as an intermediate
stage.
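
For illustration, a rough, untested Scala sketch of the kind of thing
that hits it (the class name here is made up, and sc is assumed to be a
spark-shell SparkContext):

    // A class whose instances can't be shipped back to the driver.
    class NotSerializableThing(val id: Int)  // does not extend Serializable

    val rdd = sc.parallelize(1 to 10).map(i => new NotSerializableThing(i))

    // Fails: isEmpty boils down to a take(1), which has to serialize one
    // element back to the driver.
    rdd.isEmpty()

    // Fine: only Ints cross the wire.
    rdd.map(_.id).isEmpty()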

We could fix that pretty easily with a little copy and paste of the
take() code; right now isEmpty is simple but has this drawback.
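
In the meantime, a crude workaround for the non-serializable case is to
ship only a Boolean per partition instead of an element. It isn't the
incremental take()-style scan (it touches every partition), but as a
rough, untested sketch with a made-up helper name:

    import org.apache.spark.rdd.RDD

    // Hypothetical helper, not part of Spark: an RDD is empty iff no
    // partition has any element. Only Booleans come back to the driver,
    // so the element type never needs to be serializable.
    def isEmptyNoSerialization[T](rdd: RDD[T]): Boolean =
      rdd.mapPartitions(it => Iterator(it.hasNext)).collect().forall(b => !b)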

On Tue, Mar 1, 2016 at 7:18 PM, Dirceu Semighini Filho
<dirceu.semigh...@gmail.com> wrote:
> Great, I hadn't noticed this isEmpty method.
> Well, serialization has been a problem in this project; we have noticed a
> lot of time being spent serializing and deserializing things to send to
> and get back from the cluster.
>
> 2016-03-01 15:47 GMT-03:00 Sean Owen <so...@cloudera.com>:
>>
>> There is an "isEmpty" method that basically does exactly what your
>> second version does.
>>
>> I have seen it be unusually slow at times because it must copy one
>> element to the driver, and that copy itself can be slow. It still
>> shouldn't be slow in general, and I'd be surprised if it were slower
>> than a count in anything but pathological cases.
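>>
>> For reference, in the Scala API isEmpty is roughly just (the exact code
>> varies by version):
>>
>>   def isEmpty(): Boolean = partitions.length == 0 || take(1).length == 0
>>
>> so apart from the short-circuit on an RDD with no partitions, it does
>> the same work as your take(filteredDF, 1) check, including copying that
>> one element to the driver.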
>>
>>
>>
>> On Tue, Mar 1, 2016 at 6:03 PM, Dirceu Semighini Filho
>> <dirceu.semigh...@gmail.com> wrote:
>> > Hello all.
>> > I have a script that creates a DataFrame from this operation:
>> >
>> > mytable <- sql(sqlContext, "SELECT ID_PRODUCT, ... FROM mytable")
>> >
>> > rSparkDf <- createPartitionedDataFrame(sqlContext,myRdataframe)
>> > dFrame <- join(mytable,rSparkDf,mytable$ID_PRODUCT==rSparkDf$ID_PRODUCT)
>> >
>> > After that, to filter this dFrame, I tried to execute the following:
>> > filteredDF <- filterRDD(toRDD(dFrame),function (row) {row['COLUMN'] %in%
>> > c("VALUES", ...)})
>> > Now I need to know whether the resulting DataFrame is empty, and to do
>> > that I tried these two pieces of code:
>> > if(count(filteredDF) > 0)
>> > and
>> > if(length(take(filteredDF,1)) > 0)
>> > I thought that the second one, using take, should run faster than
>> > count, but that didn't happen.
>> > The take operation creates one job per partition of my RDD (which has
>> > 200 partitions), and this makes it run slower than the count.
>> > Is this the expected behaviour?
>> >
>> > Regards,
>> > Dirceu
>
>

