The scenario I am observing is this: I am running TPC-DS query 4 on both
partitioned and non-partitioned data.
In the query, the same CTE year_total is used twice, as first_year_total and
second_year_total, and these two instances are filtered by year using the
d_year column from date_dim, which restricts each instance to the data of a
particular year.
Let me first describe the joins that happen: for each CTE instance there is a
broadcast hash join with the date table, and then the two instances are joined
with each other, let's say with a shuffle hash join.

In the case of partitioned data, where the data is laid out on disk with the
date key as the partition column, Spark uses dynamic partition pruning and
scans only the relevant data, then applies the shuffle hash join and then the
broadcast hash join.
In the case of the non-partitioned data it first scans the full tables, which
I can observe from the Spark SQL plan for the query, and only after scanning
does it do the shuffle hash join and then the broadcast hash join.
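
For reference, here is a minimal sketch of how the two layouts can be set up
and compared (the paths, and the choice of ss_sold_date_sk as the partition
column, are just illustrative assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dpp-vs-flat").getOrCreate()

// Partitioned copy: the date key becomes the on-disk partition column.
spark.table("store_sales")
  .write.partitionBy("ss_sold_date_sk")
  .parquet("/tmp/tpcds/partitioned/store_sales")

// Flat (non-partitioned) copy of the same fact table.
spark.table("store_sales")
  .write
  .parquet("/tmp/tpcds/flat/store_sales")

// Dynamic partition pruning is enabled by default in Spark 3.x, but it can
// only kick in for the partitioned copy.
println(spark.conf.get("spark.sql.optimizer.dynamicPartitionPruning.enabled"))

// Point the query at one copy at a time and compare the physical plans: the
// partitioned copy's scan node carries a PartitionFilters entry with
// dynamicpruningexpression(...), while the flat copy is a plain full scan.
spark.read.parquet("/tmp/tpcds/partitioned/store_sales")
  .createOrReplaceTempView("store_sales")
spark.sql(
  "SELECT SUM(ss_net_paid) FROM store_sales ss " +
  "JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk WHERE d.d_year = 2001")
  .explain("formatted")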

So here I have two questions:

  1. In the case of partitioned data, since the pruning is already happening
at runtime, why does it still do the broadcast hash join with the date table?
  2. The second question, which is the main one I want to ask: predicate
pushdown says Spark will do the filtering as close to the source as possible.
So in the non-partitioned case, why is Spark not applying the filter right
after the scan, so that it can reduce the shuffle data later? I am seeing
twice as much shuffle data in the non-partitioned run as in the partitioned
one. Why this behaviour? (A small sketch of the kind of plan check I mean is
below.)
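
For completeness, this is roughly what I mean by "applying the filter just
after the scan" (the path is just an example): a predicate on a column of the
file being scanned shows up directly in the scan node of the plan.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pushdown-check").getOrCreate()

// A filter on a column of the scanned Parquet file is pushed into the scan
// itself and is visible in the plan, e.g. something like:
//   PushedFilters: [IsNotNull(d_year), EqualTo(d_year,2001)]
spark.read.parquet("/tmp/tpcds/flat/date_dim")
  .filter("d_year = 2001")
  .explain("formatted")

// This kind of early filtering, before any shuffle, is what I expected to see
// for the big fact-table scan in the non-partitioned run as well.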
