alamb commented on issue #14078:
URL: https://github.com/apache/datafusion/issues/14078#issuecomment-3063721375
> Yes, I think so. Of course, there's still room to seek further performance
optimizations, but for now:
Indeed - we can always make the code better :)
--
This is an a
alamb commented on issue #14078:
URL: https://github.com/apache/datafusion/issues/14078#issuecomment-3063722047
Thanks again @ding-young
alamb closed issue #14078: Optimized spill file format
URL: https://github.com/apache/datafusion/issues/14078
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-
ding-young commented on issue #14078:
URL: https://github.com/apache/datafusion/issues/14078#issuecomment-3059889330
> So shall we close this issue as complete now?
Yes, I think so. Of course, there's still room to seek further performance
optimizations, but for now:
- Validati
alamb commented on issue #14078:
URL: https://github.com/apache/datafusion/issues/14078#issuecomment-3052679388
> It seems like Comet has removed their custom BatchReader/Writer and
switched back to arrow IPC reader/writer (see
[PR#1703](https://github.com/apache/datafusion-comet/pull/170
ding-young commented on issue #14078:
URL: https://github.com/apache/datafusion/issues/14078#issuecomment-3051639268
It seems like Comet has removed their custom BatchReader/Writer (see
[PR#1703](https://github.com/apache/datafusion-comet/pull/1703/files)).
alamb commented on issue #14078:
URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2836303429
20% better -- not bad!
getChan commented on issue #14078:
URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2829344168
Update: validation is now skipped when reading spill files, as of #15454.
alamb commented on issue #14078:
URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2810394103
> The tricky part to implement is array encoding like REE or bit-packing for
integer arrays. Maybe we can find some reusable code in Arrow Parquet writer
implementation or use som
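For readers unfamiliar with the two encodings named in the quote above, here is a minimal sketch of run-end encoding (REE) and fixed-width bit-packing for integer arrays. It is plain Python for illustration only: the function names are hypothetical, and it is not the Arrow, Parquet, or DataFusion implementation. As in Arrow's run-end encoded layout, each run stores the exclusive logical end index of the run.

```python
def run_end_encode(values):
    """Return (run_ends, run_values); run_ends[i] is the exclusive end index of run i."""
    run_ends, run_values = [], []
    for i, v in enumerate(values):
        if run_values and run_values[-1] == v:
            run_ends[-1] = i + 1          # extend the current run
        else:
            run_values.append(v)          # start a new run
            run_ends.append(i + 1)
    return run_ends, run_values


def run_end_decode(run_ends, run_values):
    """Expand (run_ends, run_values) back into the original list."""
    out, start = [], 0
    for end, v in zip(run_ends, run_values):
        out.extend([v] * (end - start))
        start = end
    return out


def bit_pack(values, width):
    """Pack non-negative ints, each < 2**width, into bytes (little-endian bit order)."""
    acc = 0
    for i, v in enumerate(values):
        if not 0 <= v < (1 << width):
            raise ValueError(f"value {v} does not fit in {width} bits")
        acc |= v << (i * width)
    nbytes = (len(values) * width + 7) // 8
    return acc.to_bytes(nbytes, "little")


def bit_unpack(data, width, count):
    """Inverse of bit_pack: recover `count` ints of `width` bits each."""
    acc = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(acc >> (i * width)) & mask for i in range(count)]
```

REE pays off when a column has long runs of repeated values; bit-packing pays off when every value fits in far fewer bits than the native integer width.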
2010YOUY01 commented on issue #14078:
URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2804443324
https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/umami.pdf
Here are some advanced German techniques: this paper discusses
1. Implementation of a hash-based spillin
alamb commented on issue #14078:
URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2689518371
- BTW, https://github.com/apache/arrow-rs/pull/7120 is complete, so we will be
able to disable validation for spill files with the next arrow release
alamb commented on issue #14078:
URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2585752275
FYI I am working with @totoroyyb on the arrow IPC work, in case anyone is
interested or has time to help:
- https://github.com/apache/arrow-rs/pull/6938#issuecomment-2585751118
andygrove commented on issue #14078:
URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2585364461
> It would be nice to have the option (likely not enabled by default) for
the spill files to be compressed. It's almost trivial I think with the current
implementation.
Omega359 commented on issue #14078:
URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2585363202
It would be nice to have the option (likely not enabled by default) for the
spill files to be compressed. It's almost trivial I think with the current
implementation.
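A rough sketch of what an opt-in compression switch could look like, assuming a one-byte codec tag in front of each spilled payload. The tag values and the function names `encode_spill` / `decode_spill` are hypothetical, and zlib stands in for whatever codec would really be chosen; this is not DataFusion's or Arrow IPC's on-disk format.

```python
import zlib

# Hypothetical one-byte codec tags, written as the first byte of each payload.
RAW, ZLIB = 0, 1


def encode_spill(payload: bytes, compress: bool = False) -> bytes:
    """Prefix the payload with a codec tag; compress only when asked to."""
    if compress:
        # A low compression level favors speed, which matters for temporary files.
        return bytes([ZLIB]) + zlib.compress(payload, level=1)
    return bytes([RAW]) + payload


def decode_spill(blob: bytes) -> bytes:
    """Read the codec tag and undo whatever encode_spill did."""
    tag, body = blob[0], blob[1:]
    if tag == ZLIB:
        return zlib.decompress(body)
    if tag == RAW:
        return body
    raise ValueError(f"unknown codec tag {tag}")
```

Keeping compression off by default matches the suggestion above: the reader can handle both cases from the tag alone, and the extra CPU cost is only paid when someone opts in.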
tustvold commented on issue #14078:
URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2585194217
> DataFusion itself should not go to any significant trouble / effort to
protect against the threat model of someone having enough control over the
local file system to make ar
alamb commented on issue #14078:
URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2585193813
@2010YOUY01 the idea of avoiding Row->Column->Row conversions is a (very)
good one
alamb commented on issue #14078:
URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2585193584
> I feel I ought to point out though that in order for it to be sound to
read a file without validation, DF needs to be sure nobody else could have
written/modified it.
In
tustvold commented on issue #14078:
URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2585169856
Spilling the row format makes some sense to me, although I suspect IPC will
outperform it, presuming a fast enough disk.
I feel I ought to point out though that in order
2010YOUY01 commented on issue #14078:
URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2585043689
Although we're currently spilling column-wise record batches, I think this
will change to row-wise batches in the future. It would be better to benchmark
and optimize spillin
alamb commented on issue #14078:
URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2584278401
As a data point, @totoroyyb reports 100x faster reading of Arrow IPC data
without validation on https://github.com/apache/arrow-rs/issues/6933
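To illustrate where the time goes, and why skipping validation is only safe for files nobody else could have modified, here is a toy sketch of a string column stored as offsets plus a data buffer. The layout and the names `write_column` / `read_column` are assumptions made for this example; this is not the Arrow IPC format or the arrow-rs API. The validated path checks offsets and UTF-8 for every value; the unvalidated path trusts the file.

```python
import struct


def write_column(strings):
    """Serialize strings as: u32 count, (count + 1) u32 offsets, then UTF-8 data."""
    encoded = [s.encode("utf-8") for s in strings]
    offsets = [0]
    for e in encoded:
        offsets.append(offsets[-1] + len(e))
    header = struct.pack("<I", len(encoded))
    return header + struct.pack(f"<{len(offsets)}I", *offsets) + b"".join(encoded)


def read_column(buf, validate=True):
    """Return the values as byte slices; optionally validate offsets and UTF-8 first."""
    (n,) = struct.unpack_from("<I", buf, 0)
    offsets = struct.unpack_from(f"<{n + 1}I", buf, 4)
    data = buf[4 + 4 * (n + 1):]
    if validate:
        if offsets[0] != 0 or offsets[-1] > len(data):
            raise ValueError("offsets out of bounds")
        for a, b in zip(offsets, offsets[1:]):
            if a > b:
                raise ValueError("offsets not monotonic")
            data[a:b].decode("utf-8")  # raises UnicodeDecodeError (a ValueError) if invalid
    # Without validation we slice blindly and trust whoever wrote the file.
    return [data[a:b] for a, b in zip(offsets, offsets[1:])]
```

With a corrupted offset, `read_column(buf)` raises, while `read_column(buf, validate=False)` quietly returns wrong values. In Python that is merely garbage output; in Rust the equivalent unchecked read can be undefined behavior, which is the soundness concern raised elsewhere in this thread.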
20 matches