mbutrovich commented on issue #4440:
URL: https://github.com/apache/datafusion-comet/issues/4440#issuecomment-4662746192

   I ran an SF1000 comparison (cluster, Spark 3.5.8, parquet, 3 iterations) of 
the default native C2R against 
`spark.comet.exec.columnarToRow.native.enabled=false` (the JVM 
`CometColumnarToRowExec`). Both runs processed the same input (654.1 GiB). 
Throughput, GC, and peak memory are within run-to-run variance.
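
   As a minimal sketch of how the two configurations differ, the snippet below builds the `spark-submit` `--conf` argument for each run. The config key is the one quoted above; the helper name `c2r_conf_args` is hypothetical, and the sketch assumes that setting the key to `true` explicitly is equivalent to leaving the default in place.

```python
# Sketch only: the config key comes from the comment above,
# the helper name is a hypothetical illustration.
C2R_NATIVE_KEY = "spark.comet.exec.columnarToRow.native.enabled"


def c2r_conf_args(native_enabled: bool) -> list[str]:
    """Return spark-submit arguments selecting the native or JVM C2R path."""
    value = "true" if native_enabled else "false"
    return ["--conf", f"{C2R_NATIVE_KEY}={value}"]


native_args = c2r_conf_args(True)   # native C2R (the default), set explicitly
jvm_args = c2r_conf_args(False)     # JVM CometColumnarToRowExec path
print(" ".join(jvm_args))
```

   Everything else about the two runs (cluster, data, iteration count) is held constant, so this single key is the only intended difference.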
   
   Throughput:
   
   - Wall clock across 103 queries: native 682s, JVM 710s (JVM about 4% slower).
   - Aggregate task time is nearly identical: native 178.1h, JVM 177.1h. The 
wall-clock difference does not correspond to a difference in aggregate CPU time.
   - q14a (-3.0s) and q14b (+1.1s) are two variants of the same query and have 
deltas of opposite sign, which indicates a per-query noise floor of a few 
seconds. The 4% difference is consistent with run-to-run variance.
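
   The arithmetic behind these throughput figures can be checked with a short sketch. The numbers are the ones reported above; the function and variable names are illustrative only.

```python
def relative_diff_pct(a: float, b: float) -> float:
    """Difference of a relative to b, as a percentage of b."""
    return (a - b) / b * 100.0


# Reported totals from the two runs.
wall_native_s, wall_jvm_s = 682.0, 710.0   # wall clock, seconds
task_native_h, task_jvm_h = 178.1, 177.1   # aggregate task time, hours

wall_pct = relative_diff_pct(wall_jvm_s, wall_native_s)
task_pct = relative_diff_pct(task_native_h, task_jvm_h)

print(f"wall clock: JVM {wall_pct:.1f}% higher than native")
print(f"task time:  native {task_pct:.1f}% higher than JVM")
```

   The wall-clock gap comes out at roughly 4.1% in favor of native, while aggregate task time differs by roughly 0.6% in the opposite direction, which is why the wall-clock gap is not attributed to CPU work.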
   
   GC and memory (executor pages and per-execution from the event logs):
   
   - Total GC: native 33.7 min, JVM 31.3 min, approximately 0.3% of aggregate 
task time in both runs. Native is not lower.
   - Peak JVM memory, on-heap and off-heap, is indistinguishable between the 
runs.
   - Paired by execution id, GC deltas are small and approximately symmetric in 
sign. The largest native-higher delta is +23s on an execution with about 7,850s 
aggregate runtime. The largest relative differences (about 2x) are around 15s 
on about 1,900s runs, with native higher. GC is concentrated in the 
longest-running executions and correlates with shuffle volume rather than with 
the C2R path.
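
   The GC fractions quoted above follow directly from the reported totals. A small sketch, using only numbers from this comment and illustrative names:

```python
def gc_fraction_pct(gc_minutes: float, task_hours: float) -> float:
    """GC time as a percentage of aggregate task time."""
    return gc_minutes / (task_hours * 60.0) * 100.0


native_gc_pct = gc_fraction_pct(33.7, 178.1)
jvm_gc_pct = gc_fraction_pct(31.3, 177.1)

# Largest native-higher GC delta, relative to that execution's runtime.
largest_delta_pct = 23.0 / 7850.0 * 100.0

print(f"native GC share: {native_gc_pct:.2f}%")
print(f"JVM GC share:    {jvm_gc_pct:.2f}%")
print(f"largest delta:   {largest_delta_pct:.2f}% of that execution")
```

   Both runs land near 0.3%, and even the largest absolute per-execution delta is about 0.3% of that execution's aggregate runtime.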
   
   GC is a small fraction of runtime in both runs because Comet executes 
off-heap (64 GiB off-heap per executor, Arrow buffers off-heap), so most data 
movement does not touch the JVM heap. This is consistent with the 
microbenchmark parity in the original post and with @parthchandra's earlier 
no-improvement result. These numbers measure aggregate and per-execution GC 
time, not allocation rate. Measuring allocation rate would require JFR on an 
executor.
   
   I expected the JVM path to be faster, because it implements `CodegenSupport` 
and folds into the downstream WholeStageCodegen stage, while native 
`CometNativeColumnarToRowExec` is a `ColumnarToRowTransition` boundary that 
materializes an `UnsafeRow` per row. It was not faster, and throughput and GC 
are equivalent, so the fusion asymmetry does not produce a measurable advantage 
at this scale.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

