2010YOUY01 commented on PR #12444:
URL: https://github.com/apache/datafusion/pull/12444#issuecomment-2350899067
> I am in general somewhat lukewarm on adding optimizations that make some
queries faster and some slower (as it then becomes a tradeoff, and different
users might have different tradeoffs).
>
> It would be great to figure out how to avoid this tradeoff (I left one
suggestion)
I think this regression is fixable in the long term (by making ASCII check
more efficient, currently especially for `StringView` ASCII check is not the
most efficient way), but it's a good idea to be more conservative and skip
ASCII validation for small prefix for now.
I applied this suggestion and benched again and I think there is no
noticeable ASCII check overhead:
Result:
`substr_before` is current main already with `StringView` optimization to
avoid copy
`susbtr_after` is this PR with additional ASCII fast path
```
group
substr_after substr_before
-----
------------ -------------
LONGER THAN 12/substr_large_string [size=1024, count=64, strlen=128]
1.00 74.1±1.13µs ? ?/sec 2.65 196.4±1.32µs ? ?/sec
LONGER THAN 12/substr_large_string [size=4096, count=64, strlen=128]
1.00 290.6±1.16µs ? ?/sec 2.68 779.1±17.07µs ? ?/sec
LONGER THAN 12/substr_string [size=1024, count=64, strlen=128]
1.00 72.9±0.25µs ? ?/sec 2.91 212.2±13.48µs ? ?/sec
LONGER THAN 12/substr_string [size=4096, count=64, strlen=128]
1.00 285.0±1.72µs ? ?/sec 2.99 852.6±67.06µs ? ?/sec
LONGER THAN 12/substr_string_view [size=1024, count=64, strlen=128]
1.00 29.7±0.17µs ? ?/sec 5.61 166.5±24.98µs ? ?/sec
LONGER THAN 12/substr_string_view [size=4096, count=64, strlen=128]
1.00 117.8±0.92µs ? ?/sec 5.29 623.4±29.53µs ? ?/sec
SHORTER THAN 12/substr_large_string [size=1024, strlen=12]
1.00 59.0±0.67µs ? ?/sec 1.15 67.8±1.30µs ? ?/sec
SHORTER THAN 12/substr_large_string [size=4096, strlen=12]
1.00 228.5±2.10µs ? ?/sec 1.26 289.0±25.86µs ? ?/sec
SHORTER THAN 12/substr_string [size=1024, strlen=12]
1.00 55.3±0.46µs ? ?/sec 1.06 58.5±3.18µs ? ?/sec
SHORTER THAN 12/substr_string [size=4096, strlen=12]
1.00 214.8±1.59µs ? ?/sec 1.04 222.4±4.55µs ? ?/sec
SHORTER THAN 12/substr_string_view [size=1024, strlen=12]
1.00 18.2±0.09µs ? ?/sec 1.27 23.0±0.49µs ? ?/sec
SHORTER THAN 12/substr_string_view [size=4096, strlen=12]
1.00 73.5±1.79µs ? ?/sec 1.44 105.8±11.82µs ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=1024, count=6,
strlen=128] 1.00 75.9±0.40µs ? ?/sec 1.04 78.8±3.79µs
? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=4096, count=6,
strlen=128] 1.00 297.4±2.70µs ? ?/sec 1.01 299.3±8.54µs
? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_string [size=1024, count=6, strlen=128]
1.00 77.8±0.24µs ? ?/sec 1.07 83.4±10.36µs ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_string [size=4096, count=6, strlen=128]
1.04 300.9±1.48µs ? ?/sec 1.00 289.1±3.56µs ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=1024, count=6,
strlen=128] 1.06 33.3±0.63µs ? ?/sec 1.00 31.5±0.15µs
? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=4096, count=6,
strlen=128] 1.00 129.8±2.23µs ? ?/sec 1.01 130.8±13.20µs
? ?/sec
```
> The other thing I keep thinking is how can we avoid this 'is_ascii' check
at runtime (so things get faster regardless). Maybe it is time to consider
starting to propage the is_ascii flag on the arrays themselves
>
> The parquet reader, for example, knows when it has only ascii data
I think it's a good idea.
I'm curious (and also to justify the extra complexity), is your (InfluxDB)
real workload dominated by String data? I saw somewhere Databricks and Tableau
said their production workload has >50% string data, many are the substitute
for UDT, and also uncleaned raw data, for such case it should be worth the
effort
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]