alamb commented on issue #6906:
URL: https://github.com/apache/datafusion/issues/6906#issuecomment-2355857689
In terms of implementation, what I suggest is:
1. Do a POC implementaiton: wire up just enough `StringView, don't worry
about GC, basic unit tests
2. Verify it makes the clickbench query faster
3. Flesh out testing, documentation, add support for StringArrary, etc
4. Merge and profit (bonus points for blogging about it)
For the POC here is the reproducer I recommend:
### Step 1. Get `hits_partitioned` using `bench.sh`:
```shell
cd benchmarks
./bench.sh data clickbench_partitioned
```
### Step 2: Prepare a script with reproducer query:
```sql
set datafusion.execution.parquet.schema_force_view_types = true;
SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\\.)?([^/]+)/.*$', '\\1')
AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer")
FROM hits_partitioned
WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC
LIMIT 25;
```
```shell
andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2/benchmarks/data$ cat
q28.sql
set datafusion.execution.parquet.schema_force_view_types = true;
SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\\.)?([^/]+)/.*$', '\\1')
AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer")
FROM hits_partitioned
WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC
LIMIT 25;
```
### Step 3: Run script (with release build of `datafusion-cli`):
```shell
datafusion-cli -f q28.sql
```
* `set datafusion.execution.parquet.schema_force_view_types = true;` -->
`Elapsed 18.431 seconds.`
* `set datafusion.execution.parquet.schema_force_view_types = false;` -->
`Elapsed 6.427 seconds.`
The goal is to get `set datafusion.execution.parquet.schema_force_view_types
= true;` to be the same (or better) than when it is false
If you look at the
[flamegraph-string-view.svg](https://github.com/user-attachments/assets/e89c5566-c635-467b-9b28-24b7d2d91bfa),
you can see most of the time is spent doing GroupsAccumulator

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]