tshauck opened a new issue, #11834: URL: https://github.com/apache/datafusion/issues/11834
### Is your feature request related to a problem or challenge?

We are working to add complete StringView support in DataFusion, which permits potentially much faster processing of string data. See https://github.com/apache/datafusion/issues/10918 for more background.

Today, most DataFusion string functions support DataType::Utf8 and DataType::LargeUtf8, and when called with a StringView argument DataFusion will cast the argument back to DataType::Utf8, which is expensive.

To realize the full speed of StringView, we need to ensure that all string functions support DataType::Utf8View directly.

### Describe the solution you'd like

Update the function to support DataType::Utf8View directly.

### Describe alternatives you've considered

The typical steps are:
1. Write some tests showing the function doesn't support `Utf8View` (see the tests in [`string_view.slt`](https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/string_view.slt)) to ensure the arguments are not being cast
2. Change the `Signature` of the function to accept `Utf8View` in addition to `Utf8`/`LargeUtf8`
3. Update the implementation of the function to operate on `Utf8View`

Example PRs
* Update to use an arrow kernel that already supports StringView: https://github.com/apache/datafusion/pull/11787
* Change the implementation to support StringView directly: https://github.com/apache/datafusion/pull/11676
* Change implementation (option 2): https://github.com/apache/datafusion/pull/11556

### Additional context

The documentation of string functions can be found here: https://datafusion.apache.org/user-guide/sql/scalar_functions.html#string-functions

To test a function with StringView in `datafusion-cli`, you can use an example like this (replacing `starts_with` with the relevant function):

```sql
> create table foo as values (arrow_cast('foo', 'Utf8View'), arrow_cast('bar', 'Utf8View'));
0 row(s) fetched.
Elapsed 0.043 seconds.

> select starts_with(column1, column2) from foo;
+--------------------------------------+
| starts_with(foo.column1,foo.column2) |
+--------------------------------------+
| false                                |
+--------------------------------------+
1 row(s) fetched.
Elapsed 0.015 seconds.
```

To check whether the function uses `Utf8View` natively, run `EXPLAIN` and verify that the plan contains no `CAST`. In this example, `CAST(column1@0 AS Utf8)` indicates that the function is not using `Utf8View` natively:

```sql
> explain select starts_with(column1, column2) from foo;
+---------------+------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                         |
+---------------+------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: starts_with(CAST(foo.column1 AS Utf8), CAST(foo.column2 AS Utf8))                                                |
|               |   TableScan: foo projection=[column1, column2]                                                                               |
| physical_plan | ProjectionExec: expr=[starts_with(CAST(column1@0 AS Utf8), CAST(column2@1 AS Utf8)) as starts_with(foo.column1,foo.column2)] |
|               |   MemoryExec: partitions=1, partition_sizes=[1]                                                                              |
|               |                                                                                                                              |
+---------------+------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.006 seconds.
```

It is often good to test with a constant as well (likewise, there should be no cast):

```sql
> explain select starts_with(column1, 'foo') from foo;
+---------------+----------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                     |
+---------------+----------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: starts_with(CAST(foo.column1 AS Utf8), Utf8("foo"))                                          |
|               |   TableScan: foo projection=[column1]                                                                    |
| physical_plan | ProjectionExec: expr=[starts_with(CAST(column1@0 AS Utf8), foo) as starts_with(foo.column1,Utf8("foo"))] |
|               |   MemoryExec: partitions=1, partition_sizes=[1]                                                          |
|               |                                                                                                          |
+---------------+----------------------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.002 seconds.
```
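The `EXPLAIN` check described above can also be done mechanically on the plan text. The following is a minimal sketch using only the Rust standard library; `utf8_casts` is a hypothetical helper invented for this example (it is not part of DataFusion), and it assumes the plan prints casts in the `CAST(<expr> AS <type>)` form shown in the output above.

```rust
/// Returns every `CAST(... AS Utf8)` or `CAST(... AS LargeUtf8)` fragment
/// found in the text of an `EXPLAIN` plan. An empty result suggests that no
/// string argument was cast away from `Utf8View`.
///
/// NOTE: hypothetical helper for illustration only, not a DataFusion API.
/// It works purely on the plan text and assumes the `CAST(<expr> AS <type>)`
/// format seen in the plans above.
pub fn utf8_casts(plan: &str) -> Vec<String> {
    let mut found = Vec::new();
    let mut rest = plan;
    while let Some(start) = rest.find("CAST(") {
        let tail = &rest[start..];
        // Walk forward to the parenthesis that closes this CAST,
        // tracking nesting so that `CAST(f(x) AS Utf8)` is handled.
        let mut depth = 0usize;
        let mut end = None;
        for (i, c) in tail.char_indices() {
            match c {
                '(' => depth += 1,
                ')' => {
                    depth -= 1;
                    if depth == 0 {
                        end = Some(i);
                        break;
                    }
                }
                _ => {}
            }
        }
        // An unbalanced plan string: stop scanning.
        let Some(end) = end else { break };
        let expr = &tail[..=end];
        if expr.ends_with(" AS Utf8)") || expr.ends_with(" AS LargeUtf8)") {
            found.push(expr.to_string());
        }
        // Resume just past this "CAST(" so nested casts are visited too.
        rest = &tail["CAST(".len()..];
    }
    found
}
```

Passing the `logical_plan` line from the first `EXPLAIN` output above yields two fragments, one per argument. A plan for a function that handles `Utf8View` natively should yield none.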

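Step 2 in the list above (widening the `Signature`) is what makes the `CAST` disappear from the plan. As a rough mental model only, the effect can be sketched with a toy enum. This is not DataFusion's actual `Signature` type or coercion logic: `StrType` and `cast_target` are hypothetical names, and casting to the first accepted type is an assumption made to keep the example small.

```rust
/// The three string types discussed in this issue.
/// A stand-in for the real arrow `DataType`, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StrType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Toy model of argument coercion (hypothetical, not DataFusion code).
///
/// Returns `None` when the function accepts `arg` as-is, so no cast is
/// planned. Otherwise returns the type the argument would be cast to,
/// which this sketch assumes to be the first accepted type.
pub fn cast_target(accepted: &[StrType], arg: StrType) -> Option<StrType> {
    if accepted.contains(&arg) {
        None
    } else {
        accepted.first().copied()
    }
}
```

With an accepted list of `[Utf8, LargeUtf8]`, a `Utf8View` argument is cast to `Utf8`, matching the `CAST(... AS Utf8)` seen in the plans above. Once `Utf8View` is added to the accepted list, `cast_target` returns `None` and the argument reaches the function unchanged, which is why step 3 (a native `Utf8View` implementation) has to accompany the signature change.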