Great, I hadn't noticed this isEmpty method. Serialization has been a problem in this project: we have noticed a lot of time being spent serializing and deserializing data sent to and fetched from the cluster.
2016-03-01 15:47 GMT-03:00 Sean Owen <so...@cloudera.com>:

> There is an "isEmpty" method that basically does exactly what your
> second version does.
>
> I have seen it be unusually slow at times because it must copy 1
> element to the driver, and it's possible that's slow. It still
> shouldn't be slow in general, and I'd be surprised if it's slower than
> a count in all but pathological cases.
>
>
> On Tue, Mar 1, 2016 at 6:03 PM, Dirceu Semighini Filho
> <dirceu.semigh...@gmail.com> wrote:
> > Hello all,
> > I have a script that creates a dataframe from this operation:
> >
> > mytable <- sql(sqlContext, "SELECT ID_PRODUCT, ... FROM mytable")
> >
> > rSparkDf <- createPartitionedDataFrame(sqlContext, myRdataframe)
> > dFrame <- join(mytable, rSparkDf, mytable$ID_PRODUCT == rSparkDf$ID_PRODUCT)
> >
> > After that I filtered this dFrame with the following:
> >
> > filteredDF <- filterRDD(toRDD(dFrame), function(row) { row['COLUMN'] %in%
> >   c("VALUES", ...) })
> >
> > Now I need to know if the resulting dataframe is empty, and to do that I
> > tried these two pieces of code:
> >
> > if(count(filteredDF) > 0)
> >
> > and
> >
> > if(length(take(filteredDF, 1)) > 0)
> >
> > I thought that the second one, using take, should run faster than count,
> > but that didn't happen. The take operation creates one job per partition
> > of my RDD (which was 200), and this makes it run slower than the count.
> > Is this the expected behaviour?
> >
> > Regards,
> > Dirceu
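For reference, a minimal sketch in Scala (against the core RDD API rather than SparkR) of the three emptiness checks being compared; the RDD and numbers below are made-up stand-ins for the filtered dataframe above, not the actual job:

import org.apache.spark.{SparkConf, SparkContext}

object EmptyCheckSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("empty-check-sketch").setMaster("local[*]"))

    // Hypothetical stand-in for the filtered RDD with 200 partitions.
    val filtered = sc.parallelize(1 to 1000000, 200).filter(_ % 7 == 0)

    // Full count: scans every partition even though we only need to know
    // whether at least one element exists.
    val nonEmptyByCount = filtered.count() > 0

    // take(1): stops once an element is found, but copies that element
    // to the driver.
    val nonEmptyByTake = filtered.take(1).nonEmpty

    // isEmpty on the core RDD API is essentially the take(1) check,
    // which is why it also ships one element to the driver.
    val nonEmptyByIsEmpty = !filtered.isEmpty()

    println(s"count: $nonEmptyByCount, take: $nonEmptyByTake, isEmpty: $nonEmptyByIsEmpty")
    sc.stop()
  }
}

As Sean notes, isEmpty is basically the take(1) variant with the single-element copy to the driver as its main cost, so it should normally beat a full count except in pathological cases.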