Re: [I] Optimized spill file format [datafusion]

2025-07-11 Thread via GitHub
alamb commented on issue #14078: URL: https://github.com/apache/datafusion/issues/14078#issuecomment-3063721375 > Yes, I think so. Of course, there's still room to seek further performance optimizations, but for now: Indeed, we can always make the code better :) -- This is an a

Re: [I] Optimized spill file format [datafusion]

2025-07-11 Thread via GitHub
alamb commented on issue #14078: URL: https://github.com/apache/datafusion/issues/14078#issuecomment-3063722047 Thanks again @ding-young

Re: [I] Optimized spill file format [datafusion]

2025-07-11 Thread via GitHub
alamb closed issue #14078: Optimized spill file format URL: https://github.com/apache/datafusion/issues/14078

Re: [I] Optimized spill file format [datafusion]

2025-07-10 Thread via GitHub
ding-young commented on issue #14078: URL: https://github.com/apache/datafusion/issues/14078#issuecomment-3059889330 > So shall we close this issue as complete now? Yes, I think so. Of course, there's still room to seek further performance optimizations, but for now: - Validati

Re: [I] Optimized spill file format [datafusion]

2025-07-09 Thread via GitHub
alamb commented on issue #14078: URL: https://github.com/apache/datafusion/issues/14078#issuecomment-3052679388 > It seems like comet has removed their custom BatchReader/Writer and switched back to arrow IPC reader/writer (see [PR#1703](https://github.com/apache/datafusion-comet/pull/170

Re: [I] Optimized spill file format [datafusion]

2025-07-09 Thread via GitHub
ding-young commented on issue #14078: URL: https://github.com/apache/datafusion/issues/14078#issuecomment-3051639268 It seems like comet has removed their custom BatchReader/Writer (see [PR#1703](https://github.com/apache/datafusion-comet/pull/1703/files)).

Re: [I] Optimized spill file format [datafusion]

2025-04-28 Thread via GitHub
alamb commented on issue #14078: URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2836303429 20% better -- not bad!

Re: [I] Optimized spill file format [datafusion]

2025-04-24 Thread via GitHub
getChan commented on issue #14078: URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2829344168 Update: skipping validation is now applied when reading spill files, by #15454.

Re: [I] Optimized spill file format [datafusion]

2025-04-16 Thread via GitHub
alamb commented on issue #14078: URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2810394103 > The tricky part to implement is array encoding like REE or bit-packing for integer arrays. Maybe we can find some reusable code in Arrow Parquet writer implementation or use som
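
For readers unfamiliar with the two encodings named in the quoted comment above, here is a minimal std-only sketch of run-end encoding (REE) and fixed-width bit-packing for integer arrays. It is an illustration only: the function names (`ree_encode`, `ree_decode`, `bit_pack`, `bit_unpack`) are hypothetical, and this is not the code used by DataFusion, arrow-rs, or the Parquet writer.

```rust
// Toy sketch of two integer-array encodings. All names here are hypothetical
// and chosen for illustration; nothing below is an arrow-rs or DataFusion API.

/// Run-end encode: returns (run_ends, values), where run_ends[i] is the
/// exclusive end index of the i-th run in the original array.
pub fn ree_encode(input: &[i64]) -> (Vec<u32>, Vec<i64>) {
    let mut run_ends = Vec::new();
    let mut values = Vec::new();
    for (i, &v) in input.iter().enumerate() {
        if values.last() != Some(&v) {
            // A new run starts at index i, so the previous run ends here.
            if i > 0 {
                run_ends.push(i as u32);
            }
            values.push(v);
        }
    }
    if !input.is_empty() {
        run_ends.push(input.len() as u32);
    }
    (run_ends, values)
}

/// Expand (run_ends, values) back into the original array.
pub fn ree_decode(run_ends: &[u32], values: &[i64]) -> Vec<i64> {
    let mut out = Vec::new();
    let mut start = 0u32;
    for (&end, &v) in run_ends.iter().zip(values) {
        out.extend(std::iter::repeat(v).take((end - start) as usize));
        start = end;
    }
    out
}

/// Pack each value into `width` bits (1..=32), least significant bits first.
pub fn bit_pack(values: &[u32], width: u32) -> Vec<u8> {
    assert!((1..=32).contains(&width));
    let mut out = Vec::new();
    let mut acc: u64 = 0; // pending bits not yet flushed to `out`
    let mut nbits: u32 = 0; // number of valid bits in `acc`
    for &v in values {
        assert!(width == 32 || v < (1u32 << width), "value does not fit");
        acc |= (v as u64) << nbits;
        nbits += width;
        while nbits >= 8 {
            out.push(acc as u8);
            acc >>= 8;
            nbits -= 8;
        }
    }
    if nbits > 0 {
        out.push(acc as u8); // final partial byte, zero-padded
    }
    out
}

/// Unpack `count` values of `width` bits each from `bytes`.
pub fn bit_unpack(bytes: &[u8], width: u32, count: usize) -> Vec<u32> {
    assert!((1..=32).contains(&width));
    let mask: u64 = (1u64 << width) - 1;
    let mut out = Vec::with_capacity(count);
    let mut acc: u64 = 0;
    let mut nbits: u32 = 0;
    let mut iter = bytes.iter();
    while out.len() < count {
        while nbits < width {
            let b = *iter.next().expect("not enough bytes");
            acc |= (b as u64) << nbits;
            nbits += 8;
        }
        out.push((acc & mask) as u32);
        acc >>= width;
        nbits -= width;
    }
    out
}
```

REE pays off when the data has long runs of repeated values, while bit-packing pays off when values fit in far fewer bits than their declared type. Whether either would help for spill files depends on the encode and decode cost relative to disk throughput, which this sketch does not measure.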

Re: [I] Optimized spill file format [datafusion]

2025-04-15 Thread via GitHub
2010YOUY01 commented on issue #14078: URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2804443324 https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/umami.pdf Here are some advanced German techniques: this paper discussed 1. Implementation of a hash-based spillin

Re: [I] Optimized spill file format [datafusion]

2025-02-27 Thread via GitHub
alamb commented on issue #14078: URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2689518371 BTW https://github.com/apache/arrow-rs/pull/7120 is complete, so we will be able to disable validation for spill files with the next arrow release

Re: [I] Optimized spill file format [datafusion]

2025-01-12 Thread via GitHub
alamb commented on issue #14078: URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2585752275 FYI I am working with @totoroyyb on the arrow IPC work, in case anyone is interested or has time to help: - https://github.com/apache/arrow-rs/pull/6938#issuecomment-2585751118

Re: [I] Optimized spill file format [datafusion]

2025-01-11 Thread via GitHub
andygrove commented on issue #14078: URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2585364461 > It would be nice to have the option (likely not enabled by default) for the spill files to be compressed. It's almost trivial I think with the current implementation.

Re: [I] Optimized spill file format [datafusion]

2025-01-11 Thread via GitHub
Omega359 commented on issue #14078: URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2585363202 It would be nice to have the option (likely not enabled by default) for the spill files to be compressed. It's almost trivial I think with the current implementation.

Re: [I] Optimized spill file format [datafusion]

2025-01-11 Thread via GitHub
tustvold commented on issue #14078: URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2585194217 > DataFusion itself should not go to any significant trouble / effort to protect against the threat model of someone having enough control over the local file system to make ar

Re: [I] Optimized spill file format [datafusion]

2025-01-11 Thread via GitHub
alamb commented on issue #14078: URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2585193813 @2010YOUY01 the idea of avoiding Row->Column->Row conversions is a (very) good one

Re: [I] Optimized spill file format [datafusion]

2025-01-11 Thread via GitHub
alamb commented on issue #14078: URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2585193584 > I feel I ought to point out though that in order for it to be sound to read a file without validation, DF needs to be sure nobody else could have written/modified it. In

Re: [I] Optimized spill file format [datafusion]

2025-01-11 Thread via GitHub
tustvold commented on issue #14078: URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2585169856 Spilling the row format makes some sense to me, although I suspect IPC will outperform it, presuming a fast enough disk. I feel I ought to point out though that in order

Re: [I] Optimized spill file format [datafusion]

2025-01-10 Thread via GitHub
2010YOUY01 commented on issue #14078: URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2585043689 Although we're currently spilling column-wise record batches, I think this will change to row-wise batches in the future. It would be better to benchmark and optimize spillin

Re: [I] Optimized spill file format [datafusion]

2025-01-10 Thread via GitHub
alamb commented on issue #14078: URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2584278401 As a data point, @totoroyyb reports a 100x faster reading of Arrow IPC data without validation on https://github.com/apache/arrow-rs/issues/6933