Hi all,

Many thanks for all the responses, but I think I pressed enter too quickly without explaining clearly what I meant. My question is whether the optimizer can optimize the processing when an inner query contains a blocking operator, such as an aggregation, if it knows the partitioning scheme used for the data. For example, suppose we use the following schema:
> cities_temp(id: Int, country_id: Int, name: String, validfrom: Int)

and the query we use is:

> select id, country_id, MAX(validFrom) as validFrom
> from (select * from cities_temp where validFrom < 3) A
> group by id, country_id

The resulting physical plan is:

Aggregate false, [id#0,country_id#1], [id#0,country_id#1,MAX(PartialMax#7) AS validFrom#4]
 Exchange (HashPartitioning [id#0,country_id#1], 200)
  Aggregate true, [id#0,country_id#1], [id#0,country_id#1,MAX(validFrom#3) AS PartialMax#7]
   Project [id#0,country_id#1,validFrom#3]
    Filter (validFrom#3 < 3)
     PhysicalRDD [id#0,country_id#1,name#2,validfrom#3], MapPartitionsRDD[4] at mapPartitions at ExistingRDD.scala:37

In this case there is an "Exchange" stage where the data is hash-partitioned so that the final aggregation can be performed. That is expected here, because the optimizer is not aware of the partitioning scheme already used by the data. So this is my question: is there a way to tell the optimizer what partitioning scheme the data was stored with, so that it can perform the filter, projection, and aggregation without having to repartition? If the data was already stored using a hash-partitioning scheme on the grouping keys, a more optimal plan could reduce (or possibly even avoid) the shuffle, depending on the query executed.
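To make this concrete, here is a small Scala sketch (spark-shell style; the sample data and variable names are made up for illustration, and the repartition-by-columns variant only exists in more recent Spark releases):

import org.apache.spark.sql.functions.max

val cities = Seq(
  (1, 10, "Paris", 1),
  (1, 10, "Paris", 2),
  (2, 20, "Lima",  1)
).toDF("id", "country_id", "name", "validFrom")

// Repartition by the grouping keys: the planner records the resulting
// HashPartitioning as this DataFrame's output partitioning.
val prePartitioned = cities.repartition($"id", $"country_id")

val result = prePartitioned
  .filter($"validFrom" < 3)
  .groupBy($"id", $"country_id")
  .agg(max($"validFrom").as("validFrom"))

// The only Exchange in this plan is the one introduced by repartition
// itself; the groupBy no longer needs a second one, because its required
// distribution is already satisfied.
result.explain()

What I am after is the same effect without the explicit repartition: a way to declare that the data, as it is already stored, is hash-partitioned on (id, country_id), so that the plan contains no Exchange at all.

Thanks again for your help!

Renato M.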