Community,

 

Given a dataset ds, what is the recommended way to keep the records that
don't meet a filter?

 

For example:

 

val ds = Seq(1,2,3,4).toDS

 

val f = (i: Int) => i < 2

 

val filtered = ds.filter(f(_)) 

 

I understand I can do this:

 

val filterNotMet = ds.filter(!f(_)) 

 

But unless I am missing something, I believe this means Spark will iterate
over the data and apply the predicate twice, which sounds like overhead to me.
Please clarify if that is not the case.
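One workaround I considered (just a sketch on my part, I have not confirmed it is the recommended practice) is caching the dataset before applying both predicates, so the source is only scanned once, even though each predicate is still evaluated over the cached rows:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Cache so the source is read only once; each filter still
// evaluates its predicate over the cached rows.
val ds = Seq(1, 2, 3, 4).toDS().cache()
val f = (i: Int) => i < 2

val filtered = ds.filter(f(_))      // records meeting the filter
val filterNotMet = ds.filter(!f(_)) // records not meeting it
```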

 

Another option I can think of is to do something like this:

 

val fudf = udf(f)

 

val applyFilterUDF = ds.withColumn("filtered", fudf($"value"))

 

val filteredUDF = applyFilterUDF.where(applyFilterUDF("filtered") === true)

 

val filterNotMetUDF = applyFilterUDF.where(applyFilterUDF("filtered") === false)
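If the cost of evaluating the UDF twice is the concern, I suppose the annotated dataset could be persisted so the UDF runs once per row (again, only a sketch; I am not sure this is the recommended approach):

```scala
// Persist the dataset with the precomputed "filtered" column so the UDF
// is evaluated once per row; both where() calls then read the cached data.
val annotated = applyFilterUDF.cache()

val filteredUDF = annotated.where(annotated("filtered") === true)
val filterNotMetUDF = annotated.where(annotated("filtered") === false)
```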

 

But from the plans I can't really tell whether it is any better:

 

scala> filtered.explain

== Physical Plan ==

*(1) Filter <function1>.apply$mcZI$sp

+- LocalTableScan [value#149]

 

scala> applyFilterUDF.explain

== Physical Plan ==

LocalTableScan [value#149, filtered#153]

 

scala> filterNotMet.explain

== Physical Plan ==

*(1) Filter <function1>.apply$mcZI$sp

+- LocalTableScan [value#149]

 

scala> filterNotMetUDF.explain

== Physical Plan ==

*(1) Project [value#62, UDF(value#62) AS filtered#97]

+- *(1) Filter (UDF(value#62) = false)

   +- LocalTableScan [value#62]

 

 

Thank you.

 

 
