Re: [I] CometHashJoin always selects BuildRight which causes potential performance regression [datafusion-comet]

via GitHub Mon, 10 Feb 2025 18:10:13 -0800


hayman42 commented on issue #1382:
URL: https://github.com/apache/datafusion-comet/issues/1382#issuecomment-2649656840


   @parthchandra With Comet shuffle disabled, the plan is almost identical to vanilla Spark's, because Comet SHJ is replaced with Spark SHJ, and so Spark's performance is preserved. Here is the plan with Comet shuffle disabled.
   
   <details>
   <summary>comet shuffle disabled</summary>
   
   The image could not be attached, so I am pasting the plan as text instead.
   
   ```
   +- == Final Plan ==
      Execute InsertIntoHadoopFsRelationCommand (66)
      +- WriteFiles (65)
         +- * Sort (64)
            +- AQEShuffleRead (63)
               +- ShuffleQueryStage (62), Statistics(sizeInBytes=9.9 KiB, rowCount=175)
                  +- Exchange (61)
                     +- * HashAggregate (60)
                        +- AQEShuffleRead (59)
                           +- ShuffleQueryStage (58), Statistics(sizeInBytes=878.7 KiB, rowCount=1.37E+4)
                              +- Exchange (57)
                                 +- * HashAggregate (56)
                                    +- * Project (55)
                                       +- * BroadcastHashJoin Inner BuildRight (54)
                                          :- * Project (48)
                                          :  +- * ShuffledHashJoin Inner BuildLeft (47)
                                          :     :- AQEShuffleRead (40)
                                          :     :  +- ShuffleQueryStage (39), Statistics(sizeInBytes=3.4 GiB, rowCount=6.53E+7)
                                          :     :     +- Exchange (38)
                                          :     :        +- * Project (37)
                                          :     :           +- * ShuffledHashJoin Inner BuildLeft (36)
                                          :     :              :- AQEShuffleRead (29)
                                          :     :              :  +- ShuffleQueryStage (28), Statistics(sizeInBytes=3.9 GiB, rowCount=6.53E+7)
                                          :     :              :     +- Exchange (27)
                                          :     :              :        +- * Project (26)
                                          :     :              :           +- * ShuffledHashJoin Inner BuildRight (25)
                                          :     :              :              :- AQEShuffleRead (18)
                                          :     :              :              :  +- ShuffleQueryStage (17), Statistics(sizeInBytes=3.4 GiB, rowCount=6.53E+7)
                                          :     :              :              :     +- Exchange (16)
                                          :     :              :              :        +- * Project (15)
                                          :     :              :              :           +- * ShuffledHashJoin Inner BuildLeft (14)
                                          :     :              :              :              :- AQEShuffleRead (7)
                                          :     :              :              :              :  +- ShuffleQueryStage (6), Statistics(sizeInBytes=33.2 MiB, rowCount=2.18E+6)
                                          :     :              :              :              :     +- Exchange (5)
                                          :     :              :              :              :        +- * CometColumnarToRow (4)
                                          :     :              :              :              :           +- CometProject (3)
                                          :     :              :              :              :              +- CometFilter (2)
                                          :     :              :              :              :                 +- CometScan parquet  (1)
                                          :     :              :              :              +- AQEShuffleRead (13)
                                          :     :              :              :                 +- ShuffleQueryStage (12), Statistics(sizeInBytes=62.6 GiB, rowCount=1.20E+9)
                                          :     :              :              :                    +- Exchange (11)
                                          :     :              :              :                       +- * CometColumnarToRow (10)
                                          :     :              :              :                          +- CometFilter (9)
                                          :     :              :              :                             +- CometScan parquet  (8)
                                          :     :              :              +- AQEShuffleRead (24)
                                          :     :              :                 +- ShuffleQueryStage (23), Statistics(sizeInBytes=45.8 MiB, rowCount=2.00E+6)
                                          :     :              :                    +- Exchange (22)
                                          :     :              :                       +- * CometColumnarToRow (21)
                                          :     :              :                          +- CometFilter (20)
                                          :     :              :                             +- CometScan parquet  (19)
                                          :     :              +- AQEShuffleRead (35)
                                          :     :                 +- ShuffleQueryStage (34), Statistics(sizeInBytes=4.8 GiB, rowCount=1.60E+8)
                                          :     :                    +- Exchange (33)
                                          :     :                       +- * CometColumnarToRow (32)
                                          :     :                          +- CometFilter (31)
                                          :     :                             +- CometScan parquet  (30)
                                          :     +- AQEShuffleRead (46)
                                          :        +- ShuffleQueryStage (45), Statistics(sizeInBytes=6.7 GiB, rowCount=3.00E+8)
                                          :           +- Exchange (44)
                                          :              +- * CometColumnarToRow (43)
                                          :                 +- CometFilter (42)
                                          :                    +- CometScan parquet  (41)
                                          +- BroadcastQueryStage (53), Statistics(sizeInBytes=1024.2 KiB, rowCount=25)
                                             +- BroadcastExchange (52)
                                                +- * CometColumnarToRow (51)
                                                   +- CometFilter (50)
                                                      +- CometScan parquet  (49)
   ```
   
   </details>
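
To make the build sides in this plan easier to compare, here is a minimal sketch that applies a "build on the smaller input" rule to the four ShuffledHashJoin nodes, using the `sizeInBytes` values reported by their child ShuffleQueryStages. The function `choose_build_side` and the `joins` table are hypothetical names for illustration only. The rule is an assumption and a simplification, not Spark's or Comet's actual planner logic, which also depends on join type, hints, and configuration.

```python
# Simplified sketch: pick the smaller input as the hash-join build side.
# This is NOT Spark's or Comet's real planner logic; it only replays the
# Statistics(sizeInBytes=...) values copied from the plan in this comment.

KIB, MIB, GIB = 1024, 1024**2, 1024**3


def choose_build_side(left_bytes: float, right_bytes: float) -> str:
    """Return 'BuildLeft' when the left input is no larger than the right."""
    return "BuildLeft" if left_bytes <= right_bytes else "BuildRight"


# join id -> (left sizeInBytes, right sizeInBytes, build side shown in the plan)
joins = {
    14: (33.2 * MIB, 62.6 * GIB, "BuildLeft"),   # stages (6) and (12)
    25: (3.4 * GIB, 45.8 * MIB, "BuildRight"),   # stages (17) and (23)
    36: (3.9 * GIB, 4.8 * GIB, "BuildLeft"),     # stages (28) and (34)
    47: (3.4 * GIB, 6.7 * GIB, "BuildLeft"),     # stages (39) and (45)
}

for join_id, (left, right, planned) in joins.items():
    picked = choose_build_side(left, right)
    print(
        f"join ({join_id}): smaller-side rule -> {picked}, plan -> {planned}, "
        f"always-BuildRight would hash {right / GIB:.2f} GiB"
    )
```

Under this assumed rule, the picked side matches the build side shown in the plan for all four joins. For join (14), forcing BuildRight would mean building the hash table on the 62.6 GiB input instead of the 33.2 MiB one.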
   
   I have another question regarding your comment. Is it OK to use Comet SMJ on a large dataset? Or did you only disable Comet shuffle? I observed that Comet SMJ is much slower than Spark's, which is why I am trying to use SHJ.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

