hayman42 commented on issue #1382: URL: https://github.com/apache/datafusion-comet/issues/1382#issuecomment-2649656840
@parthchandra With Comet shuffle disabled, the plan is almost identical to vanilla Spark's, because Comet SHJ is replaced with Spark SHJ. As a result, it preserves Spark's performance. Here is the plan with Comet shuffle disabled.

<details>
<summary>Comet shuffle disabled</summary>

The image could not be attached, so here is the plan as text:

```
+- == Final Plan ==
   Execute InsertIntoHadoopFsRelationCommand (66)
   +- WriteFiles (65)
      +- * Sort (64)
         +- AQEShuffleRead (63)
            +- ShuffleQueryStage (62), Statistics(sizeInBytes=9.9 KiB, rowCount=175)
               +- Exchange (61)
                  +- * HashAggregate (60)
                     +- AQEShuffleRead (59)
                        +- ShuffleQueryStage (58), Statistics(sizeInBytes=878.7 KiB, rowCount=1.37E+4)
                           +- Exchange (57)
                              +- * HashAggregate (56)
                                 +- * Project (55)
                                    +- * BroadcastHashJoin Inner BuildRight (54)
                                       :- * Project (48)
                                       :  +- * ShuffledHashJoin Inner BuildLeft (47)
                                       :     :- AQEShuffleRead (40)
                                       :     :  +- ShuffleQueryStage (39), Statistics(sizeInBytes=3.4 GiB, rowCount=6.53E+7)
                                       :     :     +- Exchange (38)
                                       :     :        +- * Project (37)
                                       :     :           +- * ShuffledHashJoin Inner BuildLeft (36)
                                       :     :              :- AQEShuffleRead (29)
                                       :     :              :  +- ShuffleQueryStage (28), Statistics(sizeInBytes=3.9 GiB, rowCount=6.53E+7)
                                       :     :              :     +- Exchange (27)
                                       :     :              :        +- * Project (26)
                                       :     :              :           +- * ShuffledHashJoin Inner BuildRight (25)
                                       :     :              :              :- AQEShuffleRead (18)
                                       :     :              :              :  +- ShuffleQueryStage (17), Statistics(sizeInBytes=3.4 GiB, rowCount=6.53E+7)
                                       :     :              :              :     +- Exchange (16)
                                       :     :              :              :        +- * Project (15)
                                       :     :              :              :           +- * ShuffledHashJoin Inner BuildLeft (14)
                                       :     :              :              :              :- AQEShuffleRead (7)
                                       :     :              :              :              :  +- ShuffleQueryStage (6), Statistics(sizeInBytes=33.2 MiB, rowCount=2.18E+6)
                                       :     :              :              :              :     +- Exchange (5)
                                       :     :              :              :              :        +- * CometColumnarToRow (4)
                                       :     :              :              :              :           +- CometProject (3)
                                       :     :              :              :              :              +- CometFilter (2)
                                       :     :              :              :              :                 +- CometScan parquet (1)
                                       :     :              :              :              +- AQEShuffleRead (13)
                                       :     :              :              :                 +- ShuffleQueryStage (12), Statistics(sizeInBytes=62.6 GiB, rowCount=1.20E+9)
                                       :     :              :              :                    +- Exchange (11)
                                       :     :              :              :                       +- * CometColumnarToRow (10)
                                       :     :              :              :                          +- CometFilter (9)
                                       :     :              :              :                             +- CometScan parquet (8)
                                       :     :              :              +- AQEShuffleRead (24)
                                       :     :              :                 +- ShuffleQueryStage (23), Statistics(sizeInBytes=45.8 MiB, rowCount=2.00E+6)
                                       :     :              :                    +- Exchange (22)
                                       :     :              :                       +- * CometColumnarToRow (21)
                                       :     :              :                          +- CometFilter (20)
                                       :     :              :                             +- CometScan parquet (19)
                                       :     :              +- AQEShuffleRead (35)
                                       :     :                 +- ShuffleQueryStage (34), Statistics(sizeInBytes=4.8 GiB, rowCount=1.60E+8)
                                       :     :                    +- Exchange (33)
                                       :     :                       +- * CometColumnarToRow (32)
                                       :     :                          +- CometFilter (31)
                                       :     :                             +- CometScan parquet (30)
                                       :     +- AQEShuffleRead (46)
                                       :        +- ShuffleQueryStage (45), Statistics(sizeInBytes=6.7 GiB, rowCount=3.00E+8)
                                       :           +- Exchange (44)
                                       :              +- * CometColumnarToRow (43)
                                       :                 +- CometFilter (42)
                                       :                    +- CometScan parquet (41)
                                       +- BroadcastQueryStage (53), Statistics(sizeInBytes=1024.2 KiB, rowCount=25)
                                          +- BroadcastExchange (52)
                                             +- * CometColumnarToRow (51)
                                                +- CometFilter (50)
                                                   +- CometScan parquet (49)
```

</details>

I have another question regarding your comment. Is it OK to use Comet SMJ on a large dataset? Or did you just disable Comet shuffle? I observed that Comet SMJ is much slower than Spark's SMJ, which is why I am trying to use SHJ.
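To make "almost identical to vanilla Spark's" easier to check, one option is to count which operators in a formatted plan are Comet operators and which fall back to Spark. The snippet below is only a rough sketch: `count_operators` and `NODE_RE` are hypothetical helper names, not part of Comet or Spark, and it assumes every operator appears as a name followed by its node id in parentheses, as in the plan above.

```python
import re
from collections import Counter

# An operator is a run of words followed by its node id, e.g.
# "CometFilter (2)" or "ShuffledHashJoin Inner BuildLeft (47)".
# Only the first word (the operator name) is captured.
NODE_RE = re.compile(r"([A-Za-z]+)(?: [A-Za-z]+)* \(\d+\)")


def count_operators(plan_text):
    """Return (comet, spark) Counters of operator names in a formatted plan."""
    names = NODE_RE.findall(plan_text)
    comet = Counter(n for n in names if n.startswith("Comet"))
    spark = Counter(n for n in names if not n.startswith("Comet"))
    return comet, spark


# Excerpt from the plan above (nodes 1-7 and 14).
sample = """
+- * ShuffledHashJoin Inner BuildLeft (14)
   :- AQEShuffleRead (7)
   :  +- ShuffleQueryStage (6), Statistics(sizeInBytes=33.2 MiB, rowCount=2.18E+6)
   :     +- Exchange (5)
   :        +- * CometColumnarToRow (4)
   :           +- CometProject (3)
   :              +- CometFilter (2)
   :                 +- CometScan parquet (1)
"""

comet, spark = count_operators(sample)
print("Comet operators:", dict(comet))
print("Spark operators:", dict(spark))
```

Running this on the full plan text should show Comet operators only at the scan, filter, and project level, with every join, exchange, and aggregate handled by Spark.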