Re: [PR] GC string views on hash join build side [datafusion]

2025-06-22 Thread via GitHub
ctsk commented on PR #16463: URL: https://github.com/apache/datafusion/pull/16463#issuecomment-2994387124 @Dandandan @2010YOUY01 Neither of those snippets covers ByteViewArrays. Is that a bug?

Re: [PR] GC string views on hash join build side [datafusion]

2025-06-22 Thread via GitHub
ctsk commented on PR #16463: URL: https://github.com/apache/datafusion/pull/16463#issuecomment-2994386331 @Dandandan I believe that that heuristic does not make sense in this context. The reason why the gc is introduced here is mainly to reduce the size of the data buffer vector of StringVi

Re: [PR] GC string views on hash join build side [datafusion]

2025-06-20 Thread via GitHub
Dandandan commented on PR #16463: URL: https://github.com/apache/datafusion/pull/16463#issuecomment-2990166497 We have quite a few implementations of gc-ing arrays. I am wondering whether, in this case, performance for smaller tables could be improved by the heuristic used here: https://

Re: [PR] GC string views on hash join build side [datafusion]

2025-06-20 Thread via GitHub
ctsk commented on PR #16463: URL: https://github.com/apache/datafusion/pull/16463#issuecomment-2990163724 I will do both of these things later today. I am concerned about the performance impact for smaller-scale tasks. I suspect many users of datafusion are not doing such large joins

Re: [PR] GC string views on hash join build side [datafusion]

2025-06-20 Thread via GitHub
2010YOUY01 commented on PR #16463: URL: https://github.com/apache/datafusion/pull/16463#issuecomment-2990142949 Thank you! This is great. I have some minor suggestions: 1. The issue explains the motivation for the change clearly; I recommend adding the same rationale to the code commen

Re: [PR] GC string views on hash join build side [datafusion]

2025-06-19 Thread via GitHub
ctsk commented on PR #16463: URL: https://github.com/apache/datafusion/pull/16463#issuecomment-2989084548 At SF=100, this PR is 10% faster: ``` Benchmark tpch_sf100.json ┏━━┳┳━┳━━━┓ ┃ Q

Re: [PR] GC string views on hash join build side [datafusion]

2025-06-19 Thread via GitHub
ctsk commented on PR #16463: URL: https://github.com/apache/datafusion/pull/16463#issuecomment-2989059712 Benchmark results: ``` Comparing main and fix_build-side-gc Benchmark tpch_sf10.json ┏━━┳┳━

[PR] GC string views on hash join build side [datafusion]

2025-06-19 Thread via GitHub
ctsk opened a new pull request, #16463: URL: https://github.com/apache/datafusion/pull/16463 ## Which issue does this PR close? - Closes #16206. ## What changes are included in this PR? - A utility function to garbage collect (gc) all view-type columns of a batch
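
For readers unfamiliar with the idea discussed in this thread, here is a minimal sketch of what garbage collecting a view-type array means. It is a toy model, not the code in the PR and not arrow's real memory layout: `ToyViewArray`, `View`, and the method names are hypothetical and exist only for illustration. The assumption is that each view points at a byte range in one of several shared data buffers, so a handful of surviving rows can keep large buffers alive until the referenced bytes are copied into a fresh, compact buffer.

```rust
/// Toy model of a view-type string array (hypothetical, NOT arrow's layout):
/// each view points at a byte range inside one of several shared data buffers.
#[derive(Debug, Clone, PartialEq)]
struct View {
    buffer: usize, // index into `buffers`
    offset: usize, // start of the string inside that buffer
    len: usize,    // length of the string in bytes
}

#[derive(Debug, Clone)]
struct ToyViewArray {
    buffers: Vec<Vec<u8>>,
    views: Vec<View>,
}

impl ToyViewArray {
    /// Resolve the i-th view to the string it points at.
    fn value(&self, i: usize) -> &str {
        let v = &self.views[i];
        std::str::from_utf8(&self.buffers[v.buffer][v.offset..v.offset + v.len]).unwrap()
    }

    /// Total bytes kept alive by the data buffers, referenced or not.
    fn buffer_bytes(&self) -> usize {
        self.buffers.iter().map(Vec::len).sum()
    }

    /// Copy only the referenced bytes into a single fresh buffer and
    /// rewrite every view so that it points into that buffer.
    fn gc(&self) -> ToyViewArray {
        let mut compact = Vec::new();
        let mut views = Vec::with_capacity(self.views.len());
        for v in &self.views {
            let offset = compact.len();
            compact.extend_from_slice(&self.buffers[v.buffer][v.offset..v.offset + v.len]);
            views.push(View { buffer: 0, offset, len: v.len });
        }
        ToyViewArray { buffers: vec![compact], views }
    }
}

fn main() {
    // Two data buffers, but only two short strings are still referenced,
    // as might happen after a filter ran upstream.
    let array = ToyViewArray {
        buffers: vec![b"alpha-beta-gamma-delta".to_vec(), b"epsilon-zeta".to_vec()],
        views: vec![
            View { buffer: 0, offset: 6, len: 4 }, // "beta"
            View { buffer: 1, offset: 8, len: 4 }, // "zeta"
        ],
    };
    let compacted = array.gc();
    println!(
        "{} buffers / {} bytes before, {} buffer / {} bytes after",
        array.buffers.len(),
        array.buffer_bytes(),
        compacted.buffers.len(),
        compacted.buffer_bytes()
    );
}
```

The trade-off raised in the thread shows up even in this toy version: `gc` pays for a copy of every referenced byte, which shrinks the buffer list for a large build side but may be wasted work on small inputs.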