andygrove opened a new pull request, #4197:
URL: https://github.com/apache/datafusion-comet/pull/4197

   ## Which issue does this PR close?
   
   Closes #4194 (sub-issue of #4098).
   
   ## Rationale for this change
   
   Three tests in `CometTaskMetricsSuite` were ignored on Spark 4.1+ because Comet's reported `bytesRead` was 6-14× larger than Spark's, breaking the existing 0.7-1.3 ratio assertion.
   
   Investigation: Spark 4.1 changed `ParquetFileFormat` to pre-open the `SeekableInputStream` and read the file footer outside the `FileScanRDD.compute()` thread. Spark's `inputMetrics.bytesRead` is updated from a Hadoop FileSystem thread-local byte counter (`SparkHadoopUtil.getFSBytesReadOnThreadCallback`) that only captures reads on the `compute()` thread, so reads serviced by the pre-opened stream's internal buffer go uncounted.
   
   | File size | Spark 4.1 `bytesRead` vs actual | Comet/Spark ratio |
   |---|---|---|
   | Tiny (whole file pre-buffered) | Near 0 | 10-15× |
   | Small (MB) | Significant under-count | 2-5× |
   | Medium / Large (10+ MB) | Subsequent row-group reads counted | Closer to 1× |
   
   Comet (via DataFusion's `bytes_scanned`) reports actual file IO, which is the only truthful number to compare against. The existing tight ratio is unrecoverable on Spark 4.1+ for small files without changing what Comet reports, and changing Comet to under-report would be wrong.
   
   ## What changes are included in this PR?
   
   - `CometTaskMetricsSuite`: extracted an `assertCometBytesReadInRange` helper and used it in all three affected tests. On Spark <= 4.0 it keeps the tight 0.7-1.3 ratio. On Spark 4.1+ it asserts only that `cometBytes >= sparkBytes` and that both are positive. Records read is still checked exactly.
   - Removed the `assume(!isSpark41Plus, ...)` guards from the three previously-skipped tests.
   - `docs/source/user-guide/latest/metrics.md`: added a section explaining the discrepancy, the file-size dependence, and that this affects observability only: `inputMetrics.bytesRead` is not consumed by the planner, the optimizer, or AQE, so plan choices and correctness are unaffected.
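The version-gated check described above can be sketched roughly as follows. This is a hedged illustration, not the actual suite code: the helper name follows the PR text, but here the Spark-version flag is passed as a parameter rather than detected from the runtime, and the message strings are invented.

```scala
// Sketch of a version-gated bytesRead assertion (illustrative only).
object BytesReadAssertionSketch {
  def assertCometBytesReadInRange(
      cometBytes: Long,
      sparkBytes: Long,
      isSpark41Plus: Boolean): Unit = {
    // Both engines must report some IO.
    assert(cometBytes > 0, s"cometBytes not positive: $cometBytes")
    assert(sparkBytes > 0, s"sparkBytes not positive: $sparkBytes")
    if (isSpark41Plus) {
      // Spark 4.1+ can drastically under-count small files, so only
      // require Comet's actual-IO figure to be at least Spark's.
      assert(cometBytes >= sparkBytes, s"$cometBytes < $sparkBytes")
    } else {
      // Spark <= 4.0: keep the tight 0.7-1.3 ratio.
      val ratio = cometBytes.toDouble / sparkBytes.toDouble
      assert(ratio >= 0.7 && ratio <= 1.3, s"ratio $ratio out of [0.7, 1.3]")
    }
  }

  def main(args: Array[String]): Unit = {
    assertCometBytesReadInRange(1000L, 900L, isSpark41Plus = false) // ratio ~1.11
    assertCometBytesReadInRange(9000L, 900L, isSpark41Plus = true)  // loose check
    println("ok")
  }
}
```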
   
   ## How are these changes tested?
   
   - Full `CometTaskMetricsSuite` (5 tests) passes on **Spark 4.1** locally (previously 3 failures and 2 passes).
   - Full `CometTaskMetricsSuite` (5 tests) passes on **Spark 4.0** locally (V1 path unchanged).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
