Re: SparkR Count vs Take performance

2016-03-02 Thread Dirceu Semighini Filho
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: Wednesday, March 2, 2016 3:37 AM
> To: Dirceu Semighini Filho
> Cc: user
> Subject: Re: SparkR Count vs Take performance
>
> Yeah one surprising result is that you can't call isEmpty on an RDD of
> nonserializable objects. You can't do much with

RE: SparkR Count vs Take performance

2016-03-02 Thread Sun, Rui
each fetch.
-----Original Message-----
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Wednesday, March 2, 2016 3:37 AM
To: Dirceu Semighini Filho
Cc: user
Subject: Re: SparkR Count vs Take performance

Yeah one surprising result is that you can't call isEmpty on an RDD of nonserializab
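The "scale up the number of partitions tried in each fetch" behavior mentioned above can be sketched outside Spark. The following is a simplified, hypothetical model of that strategy in plain Python, not Spark's actual scheduler code; the function name and the scale-up factor are illustrative assumptions:

```python
def take_across_partitions(partitions, n, scale_up=4):
    """Illustrative sketch (not Spark source): take(n) first scans one
    partition, and if that does not yield n elements, scales up the
    number of partitions tried on each subsequent fetch."""
    results = []
    tried = 0    # partitions scanned so far
    to_try = 1   # partitions to scan this round
    while len(results) < n and tried < len(partitions):
        for part in partitions[tried:tried + to_try]:
            results.extend(part[:n - len(results)])
        tried += to_try
        to_try *= scale_up  # grow the batch on each fetch
    return results

# The loop stops as soon as n elements are found; trailing
# partitions are never touched.
print(take_across_partitions([[], [], [1, 2, 3], [4]], n=2))  # [1, 2]
```

This is why a take-style fetch can be much cheaper than count on a large dataset, but also why it can be slow when the first partitions happen to be empty: several fetch rounds run before any data is found.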

Re: SparkR Count vs Take performance

2016-03-01 Thread Sean Owen
Yeah one surprising result is that you can't call isEmpty on an RDD of nonserializable objects. You can't do much with an RDD of nonserializable objects anyway, but they can exist as an intermediate stage. We could fix that pretty easily with a little copy and paste of the take() code; right now i

Re: SparkR Count vs Take performance

2016-03-01 Thread Dirceu Semighini Filho
Great, I hadn't noticed this isEmpty method. Serialization has been a problem in this project; we have noticed a lot of time being spent serializing and deserializing things sent to and fetched from the cluster. 2016-03-01 15:47 GMT-03:00 Sean Owen : > There is an "isEmpty" method that basicall

Re: SparkR Count vs Take performance

2016-03-01 Thread Sean Owen
There is an "isEmpty" method that basically does exactly what your second version does. I have seen it be unusually slow at times because it must copy 1 element to the driver, and it's possible that's slow. It still shouldn't be slow in general, and I'd be surprised if it's slower than a count in
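The isEmpty behavior described here (fetch at most one element, like take(1), rather than counting everything) can be illustrated outside Spark. This is a minimal plain-Python sketch of the idea, not SparkR or Spark code; `is_empty` stands in for RDD.isEmpty and `count_all` for count():

```python
_SENTINEL = object()

def is_empty(records):
    """Emptiness check in the spirit of RDD.isEmpty (plain-Python
    sketch): pull at most one element instead of scanning everything."""
    return next(iter(records), _SENTINEL) is _SENTINEL

def count_all(records):
    """Full scan, as count() would do over the whole dataset."""
    return sum(1 for _ in records)

# A generator that records how many elements were actually consumed.
pulled = []
def tracked(n):
    for i in range(n):
        pulled.append(i)
        yield i

print(is_empty(tracked(1_000_000)))  # False
print(len(pulled))                   # 1 -- only one element was pulled
```

In a distributed setting the single fetched element must still be copied to the driver, which is the cost noted above; but it is bounded, unlike a count that touches every record.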

SparkR Count vs Take performance

2016-03-01 Thread Dirceu Semighini Filho
Hello all. I have a script that creates a DataFrame from this operation:
mytable <- sql(sqlContext, "SELECT ID_PRODUCT, ... FROM mytable")
rSparkDf <- createPartitionedDataFrame(sqlContext, myRdataframe)
dFrame <- join(mytable, rSparkDf, mytable$ID_PRODUCT == rSparkDf$ID_PRODUCT)
After filtering this