Spark schema evolution

gtinside Tue, 22 Mar 2016 06:35:34 -0700

Hi,

I have a table sourced from *2 parquet files*, one of which has a few extra
columns. Simple queries work fine, but queries with a predicate on an extra
column don't work and I get a "column not found" error.


*Column resp_party_type exists in just one parquet file*

a) Query that works:
select resp_party_type  from operational_analytics 

b) Query that doesn't work (complains about missing column
*resp_party_type*):
select category as Events, resp_party as Team, count(*) as Total
from operational_analytics
where application = 'PeopleMover' and resp_party_type = 'Team'
group by category, resp_party

*Query Plan for (b)*
== Physical Plan ==
TungstenAggregate(key=[category#30986,resp_party#31006], functions=[(count(1),mode=Final,isDistinct=false)], output=[Events#36266,Team#36267,Total#36268L])
 TungstenExchange hashpartitioning(category#30986,resp_party#31006)
  TungstenAggregate(key=[category#30986,resp_party#31006], functions=[(count(1),mode=Partial,isDistinct=false)], output=[category#30986,resp_party#31006,currentCount#36272L])
   Project [category#30986,resp_party#31006]
    Filter ((application#30983 = PeopleMover) && (resp_party_type#31007 = Team))
     Scan ParquetRelation[snackfs://tst:9042/aladdin_data_beta/operational_analytics/operational_analytics_peoplemover.parquet,snackfs://tst:9042/aladdin_data_beta/operational_analytics/operational_analytics_mis.parquet][category#30986,resp_party#31006,application#30983,resp_party_type#31007]


I have set spark.sql.parquet.mergeSchema = true and
spark.sql.parquet.filterPushdown = true. When I set
spark.sql.parquet.filterPushdown = false, query (b) starts working. This is
the execution plan for query (b) after setting filterPushdown = false:

== Physical Plan ==
TungstenAggregate(key=[category#30986,resp_party#31006], functions=[(count(1),mode=Final,isDistinct=false)], output=[Events#36313,Team#36314,Total#36315L])
 TungstenExchange hashpartitioning(category#30986,resp_party#31006)
  TungstenAggregate(key=[category#30986,resp_party#31006], functions=[(count(1),mode=Partial,isDistinct=false)], output=[category#30986,resp_party#31006,currentCount#36319L])
   Project [category#30986,resp_party#31006]
    Filter ((application#30983 = PeopleMover) && (resp_party_type#31007 = Team))
     Scan ParquetRelation[snackfs://tst:9042/aladdin_data_beta/operational_analytics/operational_analytics_peoplemover.parquet,snackfs://tst:9042/aladdin_data_beta/operational_analytics/operational_analytics_mis.parquet][category#30986,resp_party#31006,application#30983,resp_party_type#31007]
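
One way to picture the symptom is the toy model below. It is plain Python,
not Spark or Parquet code, and the mechanism it shows is an assumption that
the plans above do not confirm: with pushdown on, the predicate is resolved
against each file's own schema, and with pushdown off it is applied after the
rows have been widened to the merged schema. The names FILE_A, FILE_B,
merged_schema and scan are hypothetical, and the row values are made up.

```python
# Toy model only: illustrates the symptom, not Spark's implementation.

FILE_A = {  # stands in for the file that has the extra column
    "columns": ["category", "resp_party", "application", "resp_party_type"],
    "rows": [
        {"category": "Outage", "resp_party": "Ops",
         "application": "PeopleMover", "resp_party_type": "Team"},
        {"category": "Outage", "resp_party": "Jane",
         "application": "PeopleMover", "resp_party_type": "Person"},
    ],
}

FILE_B = {  # stands in for the file without resp_party_type
    "columns": ["category", "resp_party", "application"],
    "rows": [
        {"category": "Change", "resp_party": "Ops",
         "application": "PeopleMover"},
    ],
}


def merged_schema(files):
    """Union of column names across files, in first-seen order."""
    schema = []
    for f in files:
        for col in f["columns"]:
            if col not in schema:
                schema.append(col)
    return schema


def scan(files, column, value, pushdown):
    """Return rows where column == value, widened to the merged schema."""
    schema = merged_schema(files)
    out = []
    for f in files:
        if pushdown:
            # Assumed behaviour: the filter is checked against this file's
            # own schema, so a file lacking the column fails outright.
            if column not in f["columns"]:
                raise KeyError("column not found: " + column)
            kept = [r for r in f["rows"] if r[column] == value]
            out.extend({c: r.get(c) for c in schema} for r in kept)
        else:
            # Rows are first widened (missing columns become None),
            # and only then filtered.
            widened = [{c: r.get(c) for c in schema} for r in f["rows"]]
            out.extend(r for r in widened if r[column] == value)
    return out
```

In this model, scan([FILE_A, FILE_B], "resp_party_type", "Team",
pushdown=False) returns the matching row from the first file, while the same
call with pushdown=True raises KeyError on the second file.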




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-schema-evolution-tp26563.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
