There is an "isEmpty" method that basically does exactly what your second version does.
I have seen it be unusually slow at times, because it has to copy one element to the driver, and it's possible that's the slow part. It still shouldn't be slow in general, and I'd be surprised if it were slower than a count in anything but pathological cases.

On Tue, Mar 1, 2016 at 6:03 PM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:
> Hello all.
> I have a script that creates a DataFrame from this operation:
>
> mytable <- sql(sqlContext, "SELECT ID_PRODUCT, ... FROM mytable")
>
> rSparkDf <- createPartitionedDataFrame(sqlContext, myRdataframe)
> dFrame <- join(mytable, rSparkDf, mytable$ID_PRODUCT == rSparkDf$ID_PRODUCT)
>
> I then filtered this dFrame by executing the following:
>
> filteredDF <- filterRDD(toRDD(dFrame), function(row) { row['COLUMN'] %in% c("VALUES", ...) })
>
> Now I need to know whether the resulting DataFrame is empty, and to do that I tried these two snippets:
>
> if (count(filteredDF) > 0)
>
> and
>
> if (length(take(filteredDF, 1)) > 0)
>
> I thought that the second one, using take, should run faster than count, but that didn't happen.
> The take operation creates one job per partition of my RDD (which had 200 partitions), and this makes it run slower than the count.
> Is this the expected behaviour?
>
> Regards,
> Dirceu
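Coming back to the take-per-partition behavior quoted above: one way to sidestep it is to stay in the DataFrame API instead of dropping down to an RDD, since a filter() followed by a single-row take() only needs to find one matching row. A sketch, assuming your column and value names as placeholders and that %in% is defined for SparkR Columns in your version:

  # Filter with the DataFrame API rather than toRDD/filterRDD.
  filteredDF <- filter(dFrame, dFrame$COLUMN %in% c("VALUES"))

  # take() pulls back at most one row; an empty result means no match.
  if (nrow(take(filteredDF, 1)) > 0) {
    # ... proceed with the non-empty DataFrame
  }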