kosiew opened a new issue, #16339: URL: https://github.com/apache/datafusion/issues/16339
### Describe the bug

DataFusion's COUNT and COUNT DISTINCT aggregate functions produce incorrect results when operating on dictionary arrays that contain null values. The functions appear to be counting dictionary keys rather than properly handling the null values referenced by those keys.

Two specific issues have been identified:

1. COUNT with dictionary arrays: When a dictionary array has keys that reference null values, COUNT incorrectly counts those null references as valid values instead of ignoring them.
2. COUNT DISTINCT with all-null dictionary arrays: When a dictionary array contains only keys that reference null values, COUNT DISTINCT should return 0 but may return incorrect results.

### To Reproduce

Issue 1 - COUNT with mixed null/non-null dictionary values:

```rust
use arrow::array::{DictionaryArray, Int32Array, StringArray};
use arrow::datatypes::Int32Type;
use std::sync::Arc;

// Create dictionary with values ["a", null, "c"]
let values = StringArray::from(vec![Some("a"), None, Some("c")]);
// Keys [0, 1, 2, 0, 1] reference: "a", null, "c", "a", null
let keys = Int32Array::from(vec![0, 1, 2, 0, 1]);
let dict_array = DictionaryArray::<Int32Type>::try_new(keys, Arc::new(values))?;

// COUNT should return 3 (only non-null values: "a", "c", "a")
// But may incorrectly count the null references
```

Issue 2 - COUNT DISTINCT with all-null dictionary values:

```rust
// Create dictionary where all keys reference null values
let dict_values = StringArray::from(vec![None, Some("abc")]);
let dict_indices = Int32Array::from(vec![0; 5]); // All keys point to null
let dict_array = DictionaryArray::<Int32Type>::try_new(dict_indices, Arc::new(dict_values))?;

// COUNT DISTINCT should return 0 since all referenced values are null
```

### Expected behavior

COUNT: Should only count non-null values in dictionary arrays by properly dereferencing dictionary keys to their actual values and ignoring null references.

COUNT DISTINCT: Should return 0 when all dictionary keys reference null values, and should properly count only distinct non-null values when there's a mix of null and non-null references.

Both functions should handle dictionary arrays by:

- Dereferencing dictionary keys to their actual values
- Applying null-checking logic to the dereferenced values, not the keys
- Following the same null-handling semantics as regular arrays

### Additional context

This issue affects the correctness of aggregate queries on dictionary-encoded columns, which are commonly used in analytical workloads for memory efficiency. The bug could lead to incorrect query results in production environments.

The issue is present in the core aggregation logic for both regular COUNT and COUNT DISTINCT operations when processing DictionaryArray inputs. The functions need to properly handle the indirection layer that dictionary encoding introduces.
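
To make the expected semantics concrete, here is a minimal sketch of the "dereference first, then null-check" rule described above. It is not DataFusion's or Arrow's code: `DictColumn` and its methods are hypothetical names, and plain `Vec`s stand in for Arrow's `DictionaryArray` so the example depends only on the standard library.

```rust
use std::collections::HashSet;

/// Simplified stand-in for a dictionary-encoded string column.
/// Hypothetical model for illustration; not Arrow's `DictionaryArray`.
struct DictColumn<'a> {
    /// One entry per row; `None` models a null key.
    keys: Vec<Option<usize>>,
    /// The dictionary itself; entries may be null.
    values: Vec<Option<&'a str>>,
}

impl<'a> DictColumn<'a> {
    /// Resolve a row to its logical value: null if the key is null
    /// or if the key points at a null dictionary entry.
    /// Panics if a key is out of bounds for `values`.
    fn logical(&self, row: usize) -> Option<&'a str> {
        self.keys[row].and_then(|k| self.values[k])
    }

    /// COUNT semantics: number of rows whose logical value is non-null.
    fn count(&self) -> usize {
        (0..self.keys.len())
            .filter(|&row| self.logical(row).is_some())
            .count()
    }

    /// COUNT DISTINCT semantics: number of distinct non-null logical values.
    fn count_distinct(&self) -> usize {
        (0..self.keys.len())
            .filter_map(|row| self.logical(row))
            .collect::<HashSet<_>>()
            .len()
    }
}
```

Under this model, the data from Issue 1 (keys `[0, 1, 2, 0, 1]` over values `["a", null, "c"]`) gives a count of 3 and a distinct count of 2, and the data from Issue 2 (five keys all pointing at a null entry) gives 0 for both. A real fix would presumably apply the same rule to the array's logical nulls rather than to the key buffer's validity alone.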