Community ,
Given a dataset `ds`, what is the recommended way to also keep the records that don't satisfy a filter? For example:

```scala
val ds = Seq(1, 2, 3, 4).toDS
val f = (i: Int) => i < 2
val filtered = ds.filter(f(_))
```

I understand I can do this:

```scala
val filterNotMet = ds.filter(!f(_))
```

But unless I am missing something, this means Spark will iterate over the data and apply the predicate twice, which sounds like unnecessary overhead to me. Please clarify if that is not the case.

Another option I can think of is something like this:

```scala
val fudf = udf(f)
val applyFilterUDF  = ds.withColumn("filtered", fudf($"value"))
val filteredUDF     = applyFilterUDF.where(applyFilterUDF("filtered") === true)
val filterNotMetUDF = applyFilterUDF.where(applyFilterUDF("filtered") === false)
```

But from the physical plans I can't really tell whether it is any better:

```scala
scala> filtered.explain
== Physical Plan ==
*(1) Filter <function1>.apply$mcZI$sp
+- LocalTableScan [value#149]

scala> applyFilterUDF.explain
== Physical Plan ==
LocalTableScan [value#149, filtered#153]

scala> filterNotMet.explain
== Physical Plan ==
*(1) Filter <function1>.apply$mcZI$sp
+- LocalTableScan [value#149]

scala> filterNotMetUDF.explain
== Physical Plan ==
*(1) Project [value#62, UDF(value#62) AS filtered#97]
+- *(1) Filter (UDF(value#62) = false)
   +- LocalTableScan [value#62]
```

Thank you.
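For context, one common way to keep the source from being recomputed for each of the two branches is to cache the dataset before applying both filters. This is only a sketch, not a definitive answer: it assumes a local `SparkSession` (the `master`/`appName` values here are placeholders), and note that each branch still evaluates its predicate against every cached row — caching avoids re-running the *upstream* lineage, which matters when `ds` is expensive to produce, not the predicate itself:

```scala
import org.apache.spark.sql.SparkSession

object SplitByPredicate {
  def main(args: Array[String]): Unit = {
    // Hypothetical local session for illustration; adjust for your environment.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("split-by-predicate")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(1, 2, 3, 4).toDS()
    val f = (i: Int) => i < 2

    // Cache so the source lineage is computed once; both branches then
    // read the cached rows instead of recomputing ds from scratch.
    val cached = ds.cache()
    val met    = cached.filter(f(_))     // rows where the predicate holds
    val notMet = cached.filter(!f(_))    // the complement

    met.show()
    notMet.show()

    cached.unpersist()
    spark.stop()
  }
}
```

Whether caching pays off depends on memory budget and how costly `ds` is to compute; for a trivial `LocalTableScan` like the one in the plans above it will not make a measurable difference.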