I was examining the filters passed down to the Data Source API and noticed
that a common subexpression in the SQL Select Where clause was not
eliminated.   Notice the "gender = M" is listed twice in the plan explain.
  Is this expected ?  I haven't tried more complex examples with AND in the
where clause

scala> val df = spark.read.format("json").load("./people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, gender: string ... 2
more fields]

scala> df.show
+---+------+-------+------+
|age|gender|   name|salary|
+---+------+-------+------+
| 80|     M|   Andy|    10|
| 40|     M|Michael|    30|
| 70|     F| Justin|    70|
+---+------+-------+------+


scala> df.registerTempTable("people")
warning: there was one deprecation warning; re-run with -deprecation for
details

scala> val outdf = spark.sql("select * from people where gender = \"M\" or
salary > 30 or gender = \"M\"")
outdf: org.apache.spark.sql.DataFrame = [age: bigint, gender: string ... 2
more fields]

In the output below, you get the "gender = M" clause twice.

scala> outdf.explain
== Physical Plan ==
*Filter (((gender#9 = M) || (salary#11L > 30)) || (gender#9 = M))
+- *FileScan json [age#8L,gender#9,name#10,salary#11L] Batched: false,
Format: JSON, Location:
InMemoryFileIndex[file:/home/sandeep/utils/DB/spark/people.json],
PartitionFilters: [], PushedFilters:
[Or(Or(EqualTo(gender,M),GreaterThan(salary,30)),EqualTo(gender,M))],
ReadSchema: struct<age:bigint,gender:string,name:string,salary:bigint>

Reply via email to