Re: New Feature Request

2015-08-05 Thread Sean Owen
I don't think countApprox is appropriate here unless approximation is OK. But more generally, counting everything matching a filter requires applying the filter to the whole data set, which seems like the thing to be avoided here. The take approach is better since it would stop after finding n matches.
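Sean's point about early termination can be sketched in plain Python (a local, single-machine analogue of the distributed filter-then-take idea; the data and `qualifying_function` below are made-up placeholders, not part of the thread):

```python
from itertools import islice

def exists_at_least(data, qualifying_function, n):
    # Lazily filter, then take at most n matches; iteration stops
    # as soon as n matches are found, without scanning the rest.
    matches = islice(filter(qualifying_function, data), n)
    return sum(1 for _ in matches) >= n

# Finds 3 even numbers almost immediately, even for a huge range.
print(exists_at_least(range(10**9), lambda x: x % 2 == 0, 3))  # True
```

The same short-circuit property is what makes take(n) attractive on an RDD: Spark can stop launching tasks once n results have been collected.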

Re: New Feature Request

2015-08-05 Thread Sandeep Giri
Hi Jonathan, does that guarantee a result? I do not see that it is really optimized. Hi Carsten, how does the following code work: data.filter(qualifying_function).take(n).length >= n Also, as per my understanding, in both of the approaches you mentioned the qualifying function will be executed on the whole data set.

Re: New Feature Request

2015-07-31 Thread Jonathan Winandy
Hello! You could try something like this:

def exists[T](rdd: RDD[T])(f: T => Boolean, n: Int): Boolean = {
  rdd.filter(f).countApprox(timeout = 1).getFinalValue().low > n
}

It would work for large datasets and large values of n. Have a nice day, Jonathan
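The approximate-count idea can be illustrated locally with a deterministic sketch: count matches on a systematic sample and extrapolate. This is only an analogue of comparing a countApprox lower bound against n; Spark's actual estimator and confidence bounds work differently, and the function name and `step` parameter below are invented for illustration:

```python
def approx_count_at_least(data, f, n, step=10):
    # Inspect only every `step`-th element, then scale up.
    # A crude stand-in for an approximate count: cheap, but it can
    # be wrong when matches are unevenly distributed through the data.
    sample_matches = sum(1 for x in data[::step] if f(x))
    estimate = sample_matches * step
    return estimate >= n

nums = list(range(1000))
print(approx_count_at_least(nums, lambda x: x % 2 == 0, 400))  # True
```

This shows the trade-off Sean raises in his reply: the answer is fast but only probabilistically correct, so it is unsuitable when an exact existence check is required.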

Re: New Feature Request

2015-07-31 Thread Carsten Schnober
Hi, the RDD class does not have an exists() method (in the Scala API), but the functionality you need seems easy to achieve with the existing methods: val containsNMatchingElements = data.filter(qualifying_function).take(n).length >= n Note: I am not sure whether the intermediate take(n) really makes this more efficient.

New Feature Request

2015-07-31 Thread Sandeep Giri
Dear Spark Dev Community, I am wondering if there is already a function to solve my problem. If not, should I work on one? Say you just want to check if a word exists in a huge text file. I could not find better ways than those mentioned here
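For the single-word case, the key is to stop at the first hit rather than filter and count everything. A minimal local sketch (plain Python streaming a file rather than an RDD; `path` and `word` are placeholders):

```python
def word_exists(path, word):
    # any() short-circuits: the scan stops at the first line
    # containing the word, so a huge file is read only as far
    # as needed.
    with open(path, encoding="utf-8") as fh:
        return any(word in line for line in fh)
```

On an RDD, the closest equivalent of this short-circuit is something like data.filter(qualifying_function).take(1).nonEmpty, since take(1) can stop after the first match is found.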