ctsk commented on PR #16463:
URL: https://github.com/apache/datafusion/pull/16463#issuecomment-2994387124
@Dandandan @2010YOUY01 Neither of those snippets covers
ByteViewArrays. Is that a bug?
ctsk commented on PR #16463:
URL: https://github.com/apache/datafusion/pull/16463#issuecomment-2994386331
@Dandandan I believe that heuristic does not make sense in this
context. The gc is introduced here mainly to reduce the size
of the data buffer vector of StringVi
Dandandan commented on PR #16463:
URL: https://github.com/apache/datafusion/pull/16463#issuecomment-2990166497
We have quite a few implementations of gc-ing arrays. I am wondering whether,
in this case, performance for smaller tables could be improved by the heuristic
used here:
https://
ctsk commented on PR #16463:
URL: https://github.com/apache/datafusion/pull/16463#issuecomment-2990163724
I will do both of these things later today. I am concerned about the
performance impact for smaller-scale tasks. I suspect many users of datafusion
are not doing such large joins
2010YOUY01 commented on PR #16463:
URL: https://github.com/apache/datafusion/pull/16463#issuecomment-2990142949
Thank you! This is great. I have some minor suggestions:
1. The issue explains the motivation for the change clearly. I recommend adding
the same rationale to the code commen
ctsk commented on PR #16463:
URL: https://github.com/apache/datafusion/pull/16463#issuecomment-2989084548
At SF=100, this PR is 10% faster:
```
Benchmark tpch_sf100.json
```
ctsk commented on PR #16463:
URL: https://github.com/apache/datafusion/pull/16463#issuecomment-2989059712
Benchmark results:
```
Comparing main and fix_build-side-gc
Benchmark tpch_sf10.json
```
ctsk opened a new pull request, #16463:
URL: https://github.com/apache/datafusion/pull/16463
## Which issue does this PR close?
- Closes #16206.
## What changes are included in this PR?
- A utility function to garbage collect (gc) all view-type columns of a
batch
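
To illustrate the idea of garbage collecting the view-type columns of a batch, here is a minimal sketch using only the Rust standard library. It is a toy model, not the code from this PR and not the arrow-rs API: `View`, `ViewColumn`, `Column`, and `gc_batch` are hypothetical names, and the model ignores details of the real layout such as inlining of short values. It only shows the core step: copy the bytes that the views still reference into one fresh buffer, so that large, mostly unreferenced buffers can be dropped.

```rust
/// Toy stand-in for a view: where the bytes of one value live.
/// (Hypothetical type; real Arrow views also inline short values.)
#[derive(Clone, Copy, Debug)]
struct View {
    buffer: usize, // index into `ViewColumn::buffers`
    offset: usize, // start of the value inside that buffer
    len: usize,    // length of the value in bytes
}

/// Toy stand-in for a StringView-style column.
#[derive(Clone, Debug)]
struct ViewColumn {
    views: Vec<View>,
    buffers: Vec<Vec<u8>>,
}

impl ViewColumn {
    /// Bytes of the value at row `i`.
    fn value(&self, i: usize) -> &[u8] {
        let v = self.views[i];
        &self.buffers[v.buffer][v.offset..v.offset + v.len]
    }

    /// Build a compacted copy: a single data buffer that holds only the
    /// bytes that some view still references.
    fn gc(&self) -> ViewColumn {
        let total: usize = self.views.iter().map(|v| v.len).sum();
        let mut data = Vec::with_capacity(total);
        let mut views = Vec::with_capacity(self.views.len());
        for v in &self.views {
            let offset = data.len();
            data.extend_from_slice(&self.buffers[v.buffer][v.offset..v.offset + v.len]);
            views.push(View { buffer: 0, offset, len: v.len });
        }
        ViewColumn { views, buffers: vec![data] }
    }
}

/// Toy stand-in for the columns of a batch.
enum Column {
    View(ViewColumn),
    Int(Vec<i64>),
}

/// Garbage collect every view-type column; copy other columns unchanged.
fn gc_batch(columns: &[Column]) -> Vec<Column> {
    columns
        .iter()
        .map(|c| match c {
            Column::View(v) => Column::View(v.gc()),
            Column::Int(xs) => Column::Int(xs.clone()),
        })
        .collect()
}
```

In this model, calling `gc_batch` on a batch whose view column points into several oversized buffers returns a batch whose view column has exactly one buffer, sized to the sum of the referenced value lengths.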