Thanks, Harish.
Mike – this would be a cleaner version for your use case:
df.filter(df("filter_field") === "value").select("field1").show()
Mohammed
From: Harish Butani [mailto:[email protected]]
Sent: Monday, July 20, 2015 5:37 PM
To: Mohammed Guller
Cc: Michael Armbrust; Mike Trienis; [email protected]
Subject: Re: Data frames select and where clause dependency
Yes via: org.apache.spark.sql.catalyst.optimizer.ColumnPruning
See DefaultOptimizer.batches for list of logical rewrites.
You can see the optimized plan by printing: df.queryExecution.optimizedPlan
On Mon, Jul 20, 2015 at 5:22 PM, Mohammed Guller
<[email protected]<mailto:[email protected]>> wrote:
Michael,
How would the Catalyst optimizer optimize this version?
df.filter(df("filter_field") === "value").select("field1").show()
Would it still read all the columns in df or would it read only “filter_field”
and “field1” since only two columns are used (assuming other columns from df
are not used anywhere else)?
Mohammed
From: Michael Armbrust
[mailto:[email protected]<mailto:[email protected]>]
Sent: Friday, July 17, 2015 1:39 PM
To: Mike Trienis
Cc: [email protected]<mailto:[email protected]>
Subject: Re: Data frames select and where clause dependency
Each operation on a dataframe is completely independent and doesn't know what
operations happened before it. When you do a selection, you are removing other
columns from the dataframe and so the filter has nothing to operate on.
On Fri, Jul 17, 2015 at 11:55 AM, Mike Trienis
<[email protected]<mailto:[email protected]>> wrote:
I'd like to understand why the where field must exist in the select clause.
For example, the following select statement works fine
* df.select("field1", "filter_field").filter(df("filter_field") ===
"value").show()
However, the next one fails with the error "in operator !Filter
(filter_field#60 = value);"
* df.select("field1").filter(df("filter_field") === "value").show()
As a work-around, it seems that I can do the following
* df.select("field1", "filter_field").filter(df("filter_field") ===
"value").drop("filter_field").show()
Thanks, Mike.