mbutrovich commented on issue #4440: URL: https://github.com/apache/datafusion-comet/issues/4440#issuecomment-4662746192
I ran an SF1000 comparison (cluster, Spark 3.5.8, parquet, 3 iterations) of the default native C2R against `spark.comet.exec.columnarToRow.native.enabled=false` (the JVM `CometColumnarToRowExec`). Both runs processed the same input (654.1 GiB). Throughput, GC, and peak memory are within run-to-run variance. Throughput: - Wall clock across 103 queries: native 682s, JVM 710s (4%). - Aggregate task time is nearly identical: native 178.1h, JVM 177.1h. The wall-clock difference does not correspond to a difference in aggregate CPU time. - q14a (-3.0s) and q14b (+1.1s) are the same query and have deltas of opposite sign, which indicates a per-query noise floor of a few seconds. The 4% difference is consistent with run-to-run variance. GC and memory (executor pages and per-execution from the event logs): - Total GC: native 33.7 min, JVM 31.3 min, approximately 0.3% of aggregate task time in both runs. Native is not lower. - Peak JVM memory, on-heap and off-heap, is indistinguishable between the runs. - Paired by execution id, GC deltas are small and approximately symmetric in sign. The largest native-higher delta is +23s on an execution with about 7,850s aggregate runtime. The largest relative differences (about 2x) are around 15s on about 1,900s runs, with native higher. GC is concentrated in the longest-running executions and correlates with shuffle volume rather than with the C2R path. GC is a small fraction of runtime in both runs because Comet executes off-heap (64 GiB off-heap per executor, Arrow buffers off-heap), so most data movement does not touch the JVM heap. This is consistent with the microbenchmark parity in the original post and with @parthchandra's earlier no-improvement result. This measures aggregate and per-execution GC time, not allocation rate. Measuring allocation rate would require JFR on an executor. I expected the JVM path to be faster, because it is `CodegenSupport` and folds into the downstream WholeStageCodegen stage, while native `CometNativeColumnarToRowExec` is a `ColumnarToRowTransition` boundary that materializes an `UnsafeRow` per row. It was not faster, and throughput and GC are equivalent, so the fusion asymmetry does not produce a measurable advantage at this scale. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
