[
https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Cheng Lian updated SPARK-17636:
-------------------------------
Description:
The executed plan shows a *PushedFilters* entry when filtering on a top-level
numeric field, but not when filtering on a numeric field nested inside a
struct. It is unclear whether this limitation comes from Parquet or from
Spark alone.
{noformat}
scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", "sale_id")
res5: org.apache.spark.sql.DataFrame = [day_timestamp: struct<timestamp:bigint,timezone:string>, sale_id: bigint]
scala> res5.filter("sale_id > 4").queryExecution.executedPlan
res9: org.apache.spark.sql.execution.SparkPlan =
Filter[23814] [args=(sale_id#86324L > 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]
scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
res10: org.apache.spark.sql.execution.SparkPlan =
Filter[23815] [args=(day_timestamp#86302.timestamp > 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: s3a://some/parquet/file
{noformat}
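To compare the two cases without reading the plans by eye, the plan text can be checked for a PushedFilters entry. This is only a minimal sketch: the object name PushedFilterCheck and the helper pushedFilters are hypothetical, it uses plain Scala with no Spark dependency, and it assumes the plan is printed in the same form as in the transcript above.

```scala
object PushedFilterCheck {
  // Matches the "PushedFilters: [...]" fragment as it is printed in the plans above.
  // The non-greedy group stops at the first ']', so a filter whose own text
  // contains ']' would be truncated; good enough for a quick check.
  private val Pushed = """PushedFilters: \[(.*?)\]""".r

  /** Returns the pushed filter list as text, or None when the plan text
    * has no PushedFilters entry (or an empty one). */
  def pushedFilters(planText: String): Option[String] =
    Pushed.findFirstMatchIn(planText).map(_.group(1)).filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    // Scan lines copied from the two plans in this report.
    val topLevel =
      "+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: " +
        "s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]"
    val nested =
      "+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: " +
        "s3a://some/parquet/file"

    println(pushedFilters(topLevel)) // Some(GreaterThan(sale_id,4))
    println(pushedFilters(nested))   // None
  }
}
```

In a spark-shell session the same helper could presumably be applied to res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan.toString, where a result of None would correspond to the missing pushdown reported here.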
was:
Theres a *PushedFilters* for a simple numeric field, but not for a numeric
field inside a struct. Not sure if this is a Spark limitation because of
Parquet, or only a Spark limitation.
{quote}
scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", "sale_id")
res5: org.apache.spark.sql.DataFrame = [day_timestamp: struct<timestamp:bigint,timezone:string>, sale_id: bigint]
scala> res5.filter("sale_id > 4").queryExecution.executedPlan
res9: org.apache.spark.sql.execution.SparkPlan =
Filter[23814] [args=(sale_id#86324L > 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]
scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
res10: org.apache.spark.sql.execution.SparkPlan =
Filter[23815] [args=(day_timestamp#86302.timestamp > 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: s3a://some/parquet/file
{quote}
> Parquet filter push down doesn't handle struct fields
> -----------------------------------------------------
>
> Key: SPARK-17636
> URL: https://issues.apache.org/jira/browse/SPARK-17636
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 1.6.2, 1.6.3
> Reporter: Mitesh
> Priority: Minor