Hi all,

I plan to contribute some code for passing filters to a data source during physical planning.
In more detail, as I understand it, when we want filter operations to be pushed down to data sources such as Parquet (so that filtering happens while actually reading the HDFS blocks, rather than in memory with Spark operations), we need to implement PrunedFilteredScan, PrunedScan, or CatalystScan in the package org.apache.spark.sql.sources.

For PrunedFilteredScan and PrunedScan, the filters are passed as Filter objects from org.apache.spark.sql.sources. These objects do not come directly from the query parser; they are built by selectFilters() in DataSourceStrategy. It looks like not all filters (that is, the raw expressions) are passed to the function below for PrunedFilteredScan and PrunedScan:

def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]

The filters passed here are the ones defined in org.apache.spark.sql.sources. In particular, the EqualNullSafe filter in org.apache.spark.sql.catalyst.expressions is not passed down, even though it looks possible to push it to data sources such as Parquet and JSON.

I understand that CatalystScan can receive all the raw expressions from the query planner. However, it is experimental, requires a different interface, and is unstable (for reasons such as binary compatibility). As far as I know, Parquet does not use it either.

In general, this can be an issue when a user sends queries such as:

1. SELECT * FROM table WHERE field = 1;
2. SELECT * FROM table WHERE field <=> 1;

The second query can be hugely slow because of the large network traffic caused by unfiltered data coming from the source RDD.

Also, I could not find an existing issue for this (except for https://issues.apache.org/jira/browse/SPARK-8747, which is about binary support).

Accordingly, I would like to file this issue and open a pull request with my code. Could you please comment on this?

Thanks.
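To make the interface concrete, here is a minimal sketch of a relation implementing PrunedFilteredScan. The relation name, schema, and in-memory data are made up for illustration; only the buildScan signature and the Filter classes come from org.apache.spark.sql.sources:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

// Hypothetical relation used only to show where the pushed-down filters arrive.
class ExampleRelation(val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType =
    StructType(StructField("field", IntegerType) :: Nil)

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    // `filters` contains only predicates that selectFilters() could translate,
    // e.g. EqualTo("field", 1) for `WHERE field = 1`. A `field <=> 1`
    // (EqualNullSafe) predicate never reaches this array, so the source
    // cannot filter those rows early.
    val data = sqlContext.sparkContext.parallelize(Seq(Row(1), Row(2)))
    data.filter { row =>
      filters.forall {
        case EqualTo("field", value) => row.get(0) == value
        case _                       => true // untranslated filters are re-applied by Spark
      }
    }
  }
}
```

Note that even when a source ignores a filter here, the query result stays correct, because Spark re-applies the predicates on top of the returned RDD; the cost is the extra data read and shipped, which is exactly the slowdown described above.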
