ding-young commented on issue #16367:
URL: https://github.com/apache/datafusion/issues/16367#issuecomment-3000103132

   ### Need for a Custom Batch Writer?
   
   #### 1. `concat_batches` before writing?
   I tried a quick local test where, instead of writing one batch at a time 
using the current `IPCStreamWriter`, I concatenated multiple batches using 
`concat_batches` before writing them. In my local environment, this didn't make 
a noticeable difference in compression ratio.
   Maybe that's because compression is applied per column buffer (i.e., values 
of the same column are already grouped together), or perhaps because each 
record batch already holds 8192 rows, which may be enough to fill the 
compressor's window. Even so, and even though concatenating batches adds some 
memory copy overhead, it might still improve I/O throughput or reduce the 
number of system calls, so I think it's worth investigating further.
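
   As a rough illustration of the system-call point, here is a minimal sketch 
using only the standard library. It is not the Arrow IPC format: `write_frame` 
and its 8-byte length prefix are a hypothetical framing, `CountingWriter` is a 
stand-in that counts `write` calls instead of real system calls, and the 
payloads are plain byte vectors rather than record batches.

```rust
use std::io::{self, Write};

/// Wraps a writer and counts `write` calls and bytes written.
/// Assumption: one `write` call stands in for one system call on an
/// unbuffered file.
struct CountingWriter<W: Write> {
    inner: W,
    calls: usize,
    bytes: usize,
}

impl<W: Write> CountingWriter<W> {
    fn new(inner: W) -> Self {
        Self { inner, calls: 0, bytes: 0 }
    }
}

impl<W: Write> Write for CountingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = self.inner.write(buf)?;
        self.calls += 1;
        self.bytes += n;
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

/// Hypothetical framing: an 8-byte little-endian length prefix, then the payload.
/// The frame is assembled in memory first so it goes out in a single `write_all`.
fn write_frame<W: Write>(w: &mut W, payload: &[u8]) -> io::Result<()> {
    let mut frame = Vec::with_capacity(8 + payload.len());
    frame.extend_from_slice(&(payload.len() as u64).to_le_bytes());
    frame.extend_from_slice(payload);
    w.write_all(&frame)
}

/// Returns (write calls, bytes written) when each batch gets its own frame.
fn write_individually(batches: &[Vec<u8>]) -> io::Result<(usize, usize)> {
    let mut w = CountingWriter::new(Vec::new());
    for batch in batches {
        write_frame(&mut w, batch)?;
    }
    Ok((w.calls, w.bytes))
}

/// Returns (write calls, bytes written) when all batches are joined into one frame.
fn write_concatenated(batches: &[Vec<u8>]) -> io::Result<(usize, usize)> {
    let mut w = CountingWriter::new(Vec::new());
    let joined: Vec<u8> = batches.concat(); // this is the extra memory copy
    write_frame(&mut w, &joined)?;
    Ok((w.calls, w.bytes))
}
```

   With sixteen payloads of 8192 bytes each, `write_individually` reports 16 
writes and 131200 bytes, while `write_concatenated` reports 1 write and 131080 
bytes. The byte savings are tiny; the drop in write calls is the part that 
could matter, and only a real benchmark would show whether it does.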
   
   #### 2. Comet's implementation
   I looked into why Comet introduced a custom batch writer and reviewed the 
related PR. The main reasons their implementation improved performance were:
   
   (a) Their previous approach duplicated the schema in every batch, which the 
new implementation avoids.
   (b) The new format doesn't use FlatBuffers encoding, so it has no alignment 
or metadata overhead.
   
   In our case, though, since `IPCStreamWriter` already writes the schema only 
once when the writer is created, we probably won’t see the same benefits from 
(a). 
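
   To make the two sources of overhead concrete, here is a back-of-the-envelope 
sketch in plain Rust. All function names and byte counts are hypothetical; it 
models neither Comet's format nor the actual Arrow IPC encoding, only the 
arithmetic behind (a) schema duplication and (b) alignment padding.

```rust
/// Rounds `len` up to the next multiple of `alignment` (must be a power of two).
fn padded_len(len: usize, alignment: usize) -> usize {
    debug_assert!(alignment.is_power_of_two());
    (len + alignment - 1) & !(alignment - 1)
}

/// (a) Total bytes when the schema is repeated in front of every batch.
fn schema_per_batch(schema_len: usize, batch_lens: &[usize]) -> usize {
    batch_lens.iter().map(|&b| schema_len + b).sum()
}

/// (a) Total bytes when the schema is written once, at the start of the stream.
fn schema_once(schema_len: usize, batch_lens: &[usize]) -> usize {
    schema_len + batch_lens.iter().sum::<usize>()
}

/// (b) Padding bytes added when every buffer is rounded up to `alignment`.
fn padding_overhead(buffer_lens: &[usize], alignment: usize) -> usize {
    buffer_lens
        .iter()
        .map(|&len| padded_len(len, alignment) - len)
        .sum()
}
```

   For example, with an assumed 200-byte schema and three 1000-byte batches, 
`schema_per_batch` gives 3600 bytes and `schema_once` gives 3200. With buffers 
of 1, 9, and 64 bytes and 8-byte alignment, `padding_overhead` gives 14 bytes. 
Both overheads are fixed per batch or per buffer, so they weigh most when 
batches or buffers are small.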
   
   I haven’t had a chance to look closely at the Vortex side yet. If I come 
across any interesting experimental results or ideas worth sharing, I’ll follow 
up later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

