andygrove commented on PR #4591:
URL: 
https://github.com/apache/datafusion-comet/pull/4591#issuecomment-4855144317

   I ran `CometInMemoryCacheBenchmark` from this PR locally to see the numbers. 
Release build, Apple M3 Ultra, JDK 17, default Spark profile (3.5), 5M-row 
cached table.
   
   **Repeated full scan** (`SELECT sum(id), sum(k), sum(v)`)
   
   | Case | Best (ms) | Avg (ms) | Rate (M/s) | Relative |
   |---|---|---|---|---|
   | Comet cache disabled | 180 | 201 | 27.7 | 1.0X |
   | Comet cache enabled | 121 | 128 | 41.3 | **1.5X** |
   
   **Selective filter** (`WHERE id >= 4500000 AND id < 4750000`)
   
   | Case | Best (ms) | Avg (ms) | Rate (M/s) | Relative |
   |---|---|---|---|---|
   | Comet cache disabled | 46 | 53 | 108.2 | 1.0X |
   | Comet cache enabled | 42 | 48 | 117.9 | **1.1X** |
   
   The repeated full scan is about 1.5x faster. That is the case this PR 
targets directly, since it drops the `CometSparkColumnarToColumnar` conversion 
on every read.
   
   The selective filter is only about 1.1x faster. That is the workload I'd 
expect to gain the most from the new stats-based pruning, so the small gap is a 
bit surprising. My guess is that the filtered query spends most of its time in 
the aggregate rather than the scan, so the pruning win gets diluted. It might 
be worth adding a benchmark variant that isolates the scan (wider projection, 
more selective predicate, or larger row count) to show the pruning benefit more 
clearly.
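
   To put a rough number on the dilution guess, here is a back-of-the-envelope sketch in Python. It is not taken from the PR or the benchmark. It applies an Amdahl-style model to the "Best (ms)" values in the tables above. The function names are made up for illustration, and the 20x scan speedup is an assumption: it supposes the baseline reads all 5M rows while the pruned path reads only the 250,000 matching rows, which may not hold in practice.

```python
def overall_speedup(scan_fraction, scan_speedup):
    """Amdahl-style estimate: only the scan share of the runtime gets faster."""
    return 1.0 / ((1.0 - scan_fraction) + scan_fraction / scan_speedup)


def implied_scan_fraction(observed_speedup, scan_speedup):
    """Invert the model: the scan share that would explain an observed speedup."""
    return (1.0 - 1.0 / observed_speedup) / (1.0 - 1.0 / scan_speedup)


# Observed speedups, taken from the "Best (ms)" column of the two tables.
full_scan_speedup = 180 / 121
filtered_speedup = 46 / 42

# ASSUMPTION (not measured): perfect pruning, so the scan touches only the
# rows in [4500000, 4750000) instead of the whole 5M-row table.
total_rows = 5_000_000
matching_rows = 4_750_000 - 4_500_000
ideal_scan_speedup = total_rows / matching_rows

scan_share = implied_scan_fraction(filtered_speedup, ideal_scan_speedup)
print(f"full scan speedup:     {full_scan_speedup:.2f}x")
print(f"filtered speedup:      {filtered_speedup:.2f}x")
print(f"implied scan fraction: {scan_share:.1%}")
```

   Under those assumptions the scan would account for only around a tenth of the filtered query's runtime, which would fit the dilution explanation. A variant that raises the scan share should push the ratio up if pruning is working as intended.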
   
   Could you add these numbers (or your own run) to the PR description?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

