I was examining the filters passed down to the Data Source API and noticed that a common subexpression in the SQL Select Where clause was not eliminated. Notice the "gender = M" is listed twice in the plan explain. Is this expected ? I haven't tried more complex examples with AND in the where clause
scala> val df = spark.read.format("json").load("./people.json") df: org.apache.spark.sql.DataFrame = [age: bigint, gender: string ... 2 more fields] scala> df.show +---+------+-------+------+ |age|gender| name|salary| +---+------+-------+------+ | 80| M| Andy| 10| | 40| M|Michael| 30| | 70| F| Justin| 70| +---+------+-------+------+ scala> df.registerTempTable("people") warning: there was one deprecation warning; re-run with -deprecation for details scala> val outdf = spark.sql("select * from people where gender = \"M\" or salary > 30 or gender = \"M\"") outdf: org.apache.spark.sql.DataFrame = [age: bigint, gender: string ... 2 more fields] In the output below, you get the "gender = M" clause twice. scala> outdf.explain == Physical Plan == *Filter (((gender#9 = M) || (salary#11L > 30)) || (gender#9 = M)) +- *FileScan json [age#8L,gender#9,name#10,salary#11L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/home/sandeep/utils/DB/spark/people.json], PartitionFilters: [], PushedFilters: [Or(Or(EqualTo(gender,M),GreaterThan(salary,30)),EqualTo(gender,M))], ReadSchema: struct<age:bigint,gender:string,name:string,salary:bigint>