GitHub user vigneshsiva11 added a comment to the discussion: I tried 
benchmarking TPC-DS for Spark vs Datafusion Comet on AWS Glue Catalog Iceberg 
Tables and Spark was faster.

Hi NoahKus, it's common to see results like this. It usually doesn't mean 
Spark's engine is inherently faster; more often, the time goes to hidden 
overhead from moving data between Spark and Comet.

Here are the 3 main reasons this happens:

1. The 'Moving' Tax: Every time data crosses between Comet and Spark, in 
either direction, it has to be converted and copied, which costs time. If your 
query plan has many 'Fallback' nodes (operators that run in Spark instead of 
Comet), these conversions can make Comet slower than just staying in Spark.

2. Small Data Batches: Native engines like DataFusion (Comet's core) work best 
with large batches of data. If your Iceberg scans produce very small batches 
of rows, the fixed cost of handling each batch dominates and Comet cannot use 
its full speed.

3. Cloud Metadata: Since you are using AWS Glue, a lot of time is spent just 
'finding' the data (catalog and table metadata lookups) before the actual 
processing starts. Spark and Comet handle this metadata differently, which can 
hide the native speed gains.
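
One way to check the first point is to look at the physical plan and count how 
many operators ran natively and how often the plan switches engines. The 
Python sketch below is only a rough illustration: it assumes Comet's native 
operators appear with a `Comet` prefix in the explain() text, the 
`summarize_plan` helper is hypothetical, and `SAMPLE_PLAN` is a made-up plan, 
not output from your benchmark. It compares adjacent plan lines, so the switch 
count is only a proxy, most meaningful for mostly linear plans.

```python
import re

# Operator name at the start of a plan line, after tree-drawing characters
# and whole-stage codegen markers such as "+- ", ":  ", or "*(1) ".
_OPERATOR = re.compile(r"^[\s:+\-|*()\d]*([A-Za-z]\w*)")

# Hypothetical plan text, for illustration only.
SAMPLE_PLAN = """\
== Physical Plan ==
*(1) HashAggregate(keys=[k#1], functions=[sum(v#2)])
+- ColumnarToRow
   +- CometProject [k#1, v#2]
      +- CometFilter (v#2 > 0)
         +- CometScan parquet [k#1, v#2]
"""


def summarize_plan(plan_text):
    """Count Comet and non-Comet operators, plus engine switches
    between adjacent plan lines."""
    engines = []
    for line in plan_text.splitlines():
        match = _OPERATOR.match(line)
        if not match:
            continue  # skips headers like "== Physical Plan ==" and blanks
        name = match.group(1)
        # Assumption: native operators are named with a "Comet" prefix.
        engines.append("comet" if name.startswith("Comet") else "spark")
    switches = sum(1 for a, b in zip(engines, engines[1:]) if a != b)
    return {
        "comet": engines.count("comet"),
        "spark": engines.count("spark"),
        "switches": switches,
    }


print(summarize_plan(SAMPLE_PLAN))
```

If you paste the explain() output of a slow TPC-DS query into this, a high 
'spark' count or many switches would suggest that fallback and conversion 
overhead is worth investigating before looking at batch size or Glue metadata.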

GitHub link: 
https://github.com/apache/datafusion-comet/discussions/3199#discussioncomment-15653785
