Hi all,

Many thanks for all the responses, but I think I pressed enter too quickly without explaining clearly what I meant. My question is whether the optimizer can optimize the processing when an inner query contains a blocking operator, such as an aggregation, if it knows the partitioning scheme used for the data. For example, suppose we use the following schema:
> cities_temp(id: Int, country_id: Int, name: String, validfrom: Int)

and the query we use is:

> select id, country_id, MAX(validFrom) as validFrom
> from (select * from cities_temp where validFrom < 3) A
> group by id, country_id

The resulting physical plan is:

Aggregate false, [id#0,country_id#1], [id#0,country_id#1,MAX(PartialMax#7) AS validFrom#4]
 Exchange (HashPartitioning [id#0,country_id#1], 200)
  Aggregate true, [id#0,country_id#1], [id#0,country_id#1,MAX(validFrom#3) AS PartialMax#7]
   Project [id#0,country_id#1,validFrom#3]
    Filter (validFrom#3 < 3)
     PhysicalRDD [id#0,country_id#1,name#2,validfrom#3], MapPartitionsRDD[4] at mapPartitions at ExistingRDD.scala:37

In this case there is an "Exchange" stage where the data is hash-partitioned so that the final aggregation can be performed. That is expected here, because the optimizer is not aware of the partitioning scheme already used by the data. So this is my question: is there a way to tell the optimizer what partitioning scheme the data was stored with, so that it can perform the filter, projection, and aggregation without having to repartition? If the data was already stored using a hash-partitioning scheme on the grouping keys, a more optimal plan could reduce (or possibly even avoid) the shuffle, depending on the query executed.
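To make this concrete, here is a small Scala sketch (spark-shell style; the sample data and variable names are made up for illustration, and the repartition-by-columns variant only exists in more recent Spark releases):

import org.apache.spark.sql.functions.max

val cities = Seq(
  (1, 10, "Paris", 1),
  (1, 10, "Paris", 2),
  (2, 20, "Lima",  1)
).toDF("id", "country_id", "name", "validFrom")

// Repartition by the grouping keys: the planner records the resulting
// HashPartitioning as this DataFrame's output partitioning.
val prePartitioned = cities.repartition($"id", $"country_id")

val result = prePartitioned
  .filter($"validFrom" < 3)
  .groupBy($"id", $"country_id")
  .agg(max($"validFrom").as("validFrom"))

// The only Exchange in this plan is the one introduced by repartition
// itself; the groupBy no longer needs a second one, because its required
// distribution is already satisfied.
result.explain()

What I am after is the same effect without the explicit repartition: a way to declare that the data, as it is already stored, is hash-partitioned on (id, country_id), so that the plan contains no Exchange at all.

Thanks again for your help!

Renato M.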