GitHub user vigneshsiva11 added a comment to the discussion: I tried benchmarking TPC-DS for Spark vs Datafusion Comet on AWS Glue Catalog Iceberg Tables and Spark was faster.
Hi NoahKus,

It's common to see results like this. It usually isn't because Spark is inherently faster, but because there is hidden overhead when moving data between Spark and Comet. Here are the three main reasons this happens:

1. The "moving" tax: Every time Comet has to hand data back to Spark (or vice versa), it costs time to convert and copy that data. If your query plan has many fallback nodes, these copies can make Comet slower than staying in Spark for the whole query.
2. Small data batches: Native engines like DataFusion (Comet’s core) work best with large batches of data. If your Iceberg tables are producing very small batches of rows, Comet cannot run at full speed.
3. Cloud metadata: Since you are using AWS Glue, a lot of time is spent locating the data in the cloud before the actual processing starts. Spark and Comet handle this metadata differently, which can hide the native speed gains.

GitHub link: https://github.com/apache/datafusion-comet/discussions/3199#discussioncomment-15653785
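One rough way to check the first point (fallback nodes and the copies between engines) is to tally the operators in the physical plan text. The sketch below is only an illustration: the sample plan is made up, the helper names `operator_names` and `summarize` are hypothetical, and it assumes that native operators carry a `Comet` prefix and that engine boundaries show up as `ColumnarToRow` or `RowToColumnar` nodes. A real plan from `df.explain()` will look different.

```python
import re

# Hypothetical plan text, invented for illustration only.
SAMPLE_PLAN = """\
*(2) HashAggregate(keys=[i_item_id#12], functions=[sum(ss_net_paid#40)])
+- Exchange hashpartitioning(i_item_id#12, 200)
   +- *(1) HashAggregate(keys=[i_item_id#12], functions=[partial_sum(ss_net_paid#40)])
      +- *(1) ColumnarToRow
         +- CometProject [i_item_id#12, ss_net_paid#40]
            +- CometFilter (isnotnull(ss_net_paid#40))
               +- CometScan parquet [i_item_id#12, ss_net_paid#40]
"""

# Skip tree-drawing characters and codegen markers such as "*(1)",
# then capture the operator name at the start of each plan line.
_OPERATOR = re.compile(r"^[\s:+\-|]*(?:\*\(\d+\)\s*)?([A-Za-z]\w*)")


def operator_names(plan_text):
    """Return the operator name found on each line of a plan string."""
    names = []
    for line in plan_text.splitlines():
        match = _OPERATOR.match(line)
        if match:
            names.append(match.group(1))
    return names


def summarize(plan_text):
    """Count native (Comet-prefixed), Spark, and transition operators."""
    counts = {"native": 0, "spark": 0, "transition": 0}
    for name in operator_names(plan_text):
        if name.endswith(("ColumnarToRow", "RowToColumnar")):
            # Assumed marker of a boundary where data is converted.
            counts["transition"] += 1
        elif name.startswith("Comet"):
            counts["native"] += 1
        else:
            counts["spark"] += 1
    return counts


if __name__ == "__main__":
    print(summarize(SAMPLE_PLAN))
```

A plan where the `spark` and `transition` counts are high relative to `native` is a hint that conversion overhead may be eating the native gains. The counts say nothing about how much time each operator takes, so treat them as a starting point for reading the plan, not as a measurement.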
