zhuqi-lucas commented on PR #15348:
URL: https://github.com/apache/datafusion/pull/15348#issuecomment-2743243018

   > Thank you for the work on better Utf8View support. I tried one sort 
benchmark with sort-preserving merging on a single `Utf8View` column, but it 
gets slower:
   > 
   > Reproducer
   > 
   > ```
   > cargo run --profile release-nonlto --bin dfbench -- sort-tpch -p /Users/yongting/Code/datafusion/benchmarks/data/tpch_sf10 -q 3
   > ```
   > 
   > main: 8s pr: 10s
   > 
   > According to the flamegraph, an extra overhead of 
`libsystem_platform.dylib_platform_memcmp` showed up inside 
`SortPreservingMergeStream`. It's not obvious why; I'll try to help figure it 
out later.
   > 
   > 
[flamegraphs.zip](https://github.com/user-attachments/files/19388551/flamegraphs.zip)
   
   Thank you @2010YOUY01 for the review. I think I know what causes the 
regression in the reproducer above:
   
   1. The q3 sort benchmark is a special case: it sorts by `l_comment`, which 
is always a long string (more than 12 bytes), and many values share the same 
prefix. Their 4-byte view prefixes are therefore equal as well, so the 
comparison falls through to the last step and compares the full strings in the 
data buffer, which causes the regression.
   2. You can try sorting a more typical column, where most strings are short 
(under 12 bytes). Even for strings longer than 12 bytes, the 4-byte view 
prefix is still used to speed up the comparison. For example, change q3 to the 
following SQL, which orders by an ordinary string column:
   
   ```sql
   SELECT l_shipmode, l_comment, l_partkey
           FROM lineitem
           ORDER BY l_shipmode;
   ```
   
   This query should show the performance improvement.
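   
   To illustrate why shared prefixes hurt, here is a minimal, self-contained 
sketch of a prefix-first comparison. It is not the arrow-rs or DataFusion 
implementation: the `View` enum and the helpers `make_view`, `prefix`, 
`full_bytes`, and `compare_views` are hypothetical names, and the layout is 
simplified (a single data buffer, no packed 16-byte views). It only assumes 
that strings of at most 12 bytes are inlined and that longer strings keep a 
4-byte prefix in the view.
   
   ```rust
   use std::cmp::Ordering;

   /// Simplified stand-in for one Utf8View entry (hypothetical, not the arrow-rs type).
   enum View {
       /// Strings of at most 12 bytes live entirely inside the view (zero padded).
       Inline { len: u8, data: [u8; 12] },
       /// Longer strings keep a 4-byte prefix in the view and the full bytes in a buffer.
       Buffered { prefix: [u8; 4], offset: usize, len: usize },
   }

   /// Build a view for `s`, appending long strings to the shared `buffer`.
   fn make_view(s: &str, buffer: &mut Vec<u8>) -> View {
       let bytes = s.as_bytes();
       if bytes.len() <= 12 {
           let mut data = [0u8; 12];
           data[..bytes.len()].copy_from_slice(bytes);
           View::Inline { len: bytes.len() as u8, data }
       } else {
           let mut prefix = [0u8; 4];
           prefix.copy_from_slice(&bytes[..4]);
           let offset = buffer.len();
           buffer.extend_from_slice(bytes);
           View::Buffered { prefix, offset, len: bytes.len() }
       }
   }

   /// First four bytes of the string, readable without touching the buffer.
   fn prefix(v: &View) -> [u8; 4] {
       match v {
           View::Inline { data, .. } => [data[0], data[1], data[2], data[3]],
           View::Buffered { prefix, .. } => *prefix,
       }
   }

   /// Full string bytes; for long strings this dereferences into the buffer.
   fn full_bytes<'a>(v: &'a View, buffer: &'a [u8]) -> &'a [u8] {
       match v {
           View::Inline { len, data } => &data[..*len as usize],
           View::Buffered { offset, len, .. } => &buffer[*offset..*offset + *len],
       }
   }

   /// Compare two views. The bool reports whether the slow full comparison was needed.
   fn compare_views(a: &View, b: &View, buffer: &[u8]) -> (Ordering, bool) {
       match prefix(a).cmp(&prefix(b)) {
           // Equal prefixes decide nothing: fall back to a byte-wise comparison.
           Ordering::Equal => (full_bytes(a, buffer).cmp(full_bytes(b, buffer)), true),
           // Different prefixes decide the order without reading the buffer.
           ord => (ord, false),
       }
   }
   ```
   
   With short, distinct values such as `l_shipmode`, `compare_views` is 
usually decided by the prefix alone. With long values that begin with the same 
four bytes, as is common in `l_comment`, every comparison pays for the prefix 
check and then still has to compare the full strings.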
   
   
   Finally, I think we should create a follow-up ticket to investigate and 
improve the regression case. It would be valuable to fix. 
Thanks!
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

