Hey Everyone In Spark, suppose i have the following df.
``` df = spark.createDataFrame([['A', 'A06', 'B', 'B02', '202412'], ['A', 'A04', 'B', 'B03', '202501'], ['B', 'B01', 'C', 'C02', '202411'], ['B', 'B03', 'A', 'A06', '202502']], 'entity_code: string, entity_rollup: string, target_entity_code: string, target_entity_rollup: string, period: string') df.show() df.createOrReplaceTempView('v1') +-----------+-------------+------------------+--------------------+------+ |entity_code|entity_rollup|target_entity_code|target_entity_rollup|period| +-----------+-------------+------------------+--------------------+------+ | A| A06| B| B02|202412| | A| A04| B| B03|202501| | B| B01| C| C02|202411| | B| B03| A| A06|202502| +-----------+-------------+------------------+--------------------+------+ ``` I have two queries ```sql SELECT DISTINCT STACK( 2 , entity_code, entity_rollup , target_entity_code, target_entity_rollup ) AS (entity_code, entity_rollup) , period FROM v1 Running 'explain' on the above gives == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- HashAggregate(keys=[entity_code#22625, entity_rollup#22626, period#21414], functions=[]) +- Exchange hashpartitioning(entity_code#22625, entity_rollup#22626, period#21414, 200), ENSURE_REQUIREMENTS, [plan_id=12977] +- HashAggregate(keys=[entity_code#22625, entity_rollup#22626, period#21414], functions=[]) +- Project [entity_code#22625, entity_rollup#22626, period#21414] +- Generate stack(2, entity_code#21410, entity_rollup#21411, target_entity_code#21412, target_entity_rollup#21413), [period#21414], false, [entity_code#22625, entity_rollup#22626] +- LocalTableScan [entity_code#21410, entity_rollup#21411, target_entity_code#21412, target_entity_rollup#21413, period#21414] ``` The second query is ```sql SELECT entity_code, entity_rollup, period FROM v1 UNION SELECT target_entity_code, target_entity_rollup, period FROM v1 Running 'explain' on above query did not have the FileScan, so added that from actual query as well == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- HashAggregate(keys=[entity_source_id#24129, entity_rollup#24130, entity#24131, period#24175], functions=[]) +- Exchange hashpartitioning(entity_source_id#24129, entity_rollup#24130, entity#24131, period#24175, 200), ENSURE_REQUIREMENTS, [plan_id=14134] +- HashAggregate(keys=[entity_source_id#24129, entity_rollup#24130, entity#24131, period#24175], functions=[]) +- Union :- Project [billing_entity_code#24170 AS entity_source_id#24129, billing_entity_region#24171 AS entity_rollup#24130, billing_entity_name#24169 AS entity#24131, period#24175] : +- FileScan ... Batched: true, DataFilters: [], Format: Parquet, Location: ..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<billing_entity_name:string,billing_entity_code:string,billing_entity_region:string,period:... +- Project [receiving_entity_code#24186 AS entity_source_id#24132, receiving_entity_region#24187 AS entity_rollup#24133, receiving_entity_name#24185 AS entity#24134, period#24188] +- FileScan ... Batched: true, DataFilters: [], Format: Parquet, Location: ..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<receiving_entity_name:string,receiving_entity_code:string,receiving_entity_region:string,p... ``` The difference in the two queries is that Union has two file reads, while Stack has one. Rest is solely the difference between union and stack. Which one of the two is faster? Thanks & Regards Dhruv