Hi, I am trying to simulate the "Split Distinct Aggregation" [1] with the data from Taxi Ride. I am using the following query:
SELECT dayOfTheYear, COUNT(DISTINCT driverId) FROM TaxiRide GROUP BY dayOfTheYear

I am comparing the different optimization methods for it. I started with (1) no optimization, then used (2) "table.exec.mini-batch.size" = TRUE with "table.optimizer.agg-phase-strategy" = "ONE_PHASE", then changed to (3) "table.optimizer.agg-phase-strategy" = "TWO_PHASE", and finally used (4) "table.optimizer.distinct-agg.split.enabled" = TRUE.

What does the sentence "COUNT DISTINCT is not good at reducing records if the value of distinct key (i.e. user_id) is sparse." on the website [1] mean? In my case the distinct key is driverId. Should I change the data source so that the driverId column has a lot of null values?

I am asking because when I tested this query with optimizations (1) and (2), I got ~20K r/s on each operator (parallelism of 8) with the workload set to ~200K r/s. This is almost the total workload. Then I changed only the optimization to (3), TWO_PHASE, and the maximum throughput reached only 4K. I think the problem is in the data that the DISTINCT query is consuming. So, how should I prepare the data to see the split distinct optimization take effect?

Thanks,
Felipe

[1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/tuning/streaming_aggregation_optimization.html#split-distinct-aggregation

--
-- Felipe Gutierrez
-- skype: felipe.o.gutierrez
-- https://felipeogutierrez.blogspot.com
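
To make the quoted sentence concrete, here is a minimal sketch in plain Python (not Flink code). It assumes that "sparse" refers to a distinct key with many different values relative to the number of rows, rather than to null values, and it models local pre-aggregation very roughly as deduplicating (dayOfTheYear, driverId) pairs inside each mini-batch. The function names, batch size, and driver counts are hypothetical and chosen only for illustration.

```python
import random


def make_records(n, num_days, num_drivers, seed=42):
    """Generate n synthetic (dayOfTheYear, driverId) pairs (hypothetical data)."""
    rng = random.Random(seed)
    return [(rng.randrange(num_days), rng.randrange(num_drivers)) for _ in range(n)]


def local_agg_output(records, batch_size):
    """Rough model of local pre-aggregation for COUNT(DISTINCT ...).

    For each mini-batch, only the distinct (group, key) pairs still have to
    be forwarded downstream, so the sum of those counts approximates how
    much data survives the local phase.
    """
    forwarded = 0
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        forwarded += len(set(batch))
    return forwarded


n, batch_size = 100_000, 1_000

# Few distinct drivers: the same driverId repeats many times per batch.
dense = make_records(n, num_days=1, num_drivers=50)
# Many distinct drivers: almost every record carries a new driverId.
sparse = make_records(n, num_days=1, num_drivers=1_000_000)

dense_out = local_agg_output(dense, batch_size)
sparse_out = local_agg_output(sparse, batch_size)

print(f"few driverIds:  {n} records in, {dense_out} forwarded")
print(f"many driverIds: {n} records in, {sparse_out} forwarded")
```

Under these assumptions, the run with few driver IDs forwards only a small fraction of its input, while the run with many driver IDs forwards nearly everything, so the local phase adds work without reducing the data much.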