alamb commented on PR #11627:
URL: https://github.com/apache/datafusion/pull/11627#issuecomment-2259204056
I have been testing this PR more. On my 8-core test machine on GCP, I
am running the following query:
```sql
set datafusion.execution.target_partitions = 90;
SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh") FROM hits
GROUP BY "WatchID", "ClientIP" ORDER BY c DESC LIMIT 10;
```
The actual command:
```shell
./datafusion-cli -c 'set datafusion.execution.target_partitions = 90;
SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh") FROM hits
GROUP BY "WatchID", "ClientIP" ORDER BY c DESC LIMIT 10;'
```
On this branch, I reliably see it use 8GB peak memory and take around 10
seconds:
8GB max
10 row(s) fetched.
Elapsed 10.073 seconds.
Elapsed 9.880 seconds.
Elapsed 9.939 seconds.
When running the same command on main, I reliably see it use 12GB peak
memory and take around 14 seconds:
12GB peak
Elapsed 14.069 seconds.
Elapsed 14.018 seconds.
Elapsed 14.078 seconds.
I therefore conclude (again) that this branch is a substantial improvement
for high cardinality aggregates on many cores, and I think we
should merge it.
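
As a rough sanity check on the numbers above, here is a small Python sketch that averages the three runs for each build and computes the relative improvement. The variable names are illustrative only, and the peak memory values are the rounded figures quoted above, not precise measurements.

```python
from statistics import mean

# Elapsed times (seconds) copied from the runs reported above.
branch_secs = [10.073, 9.880, 9.939]
main_secs = [14.069, 14.018, 14.078]

# Peak memory (GB) as quoted above; these are rounded figures.
branch_peak_gb = 8
main_peak_gb = 12

branch_mean = mean(branch_secs)
main_mean = mean(main_secs)

# How many times faster the branch is than main, on average.
speedup = main_mean / branch_mean
# Fraction of wall-clock time and peak memory saved relative to main.
time_saved = 1 - branch_mean / main_mean
memory_saved = 1 - branch_peak_gb / main_peak_gb

print(f"branch mean: {branch_mean:.3f}s, main mean: {main_mean:.3f}s")
print(f"speedup: {speedup:.2f}x, time saved: {time_saved:.0%}")
print(f"peak memory saved: {memory_saved:.0%}")
```

With these inputs the branch comes out at about 1.4x faster than main, with roughly a third less peak memory. Three runs per build is a small sample, so this is only an approximate summary.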