kosiew opened a new issue, #16339:
URL: https://github.com/apache/datafusion/issues/16339

   ### Describe the bug
   
   DataFusion's COUNT and COUNT DISTINCT aggregate functions produce
   incorrect results when operating on dictionary arrays that contain
   null values. The functions appear to be counting dictionary keys
   rather than properly handling the null values referenced by those
   keys.
   
   Two specific issues have been identified:
   
   COUNT with dictionary arrays: When a dictionary array has keys that
   reference null values, COUNT incorrectly counts those null references
   as valid values instead of ignoring them.
   
   COUNT DISTINCT with all-null dictionary arrays: When a dictionary
   array contains only keys that reference null values, COUNT DISTINCT
   should return 0 but may return incorrect results.
   
   
   ### To Reproduce
   
   Issue 1 - COUNT with mixed null/non-null dictionary values:
   
   ```rust
   use arrow::array::{DictionaryArray, Int32Array, StringArray};
   use arrow::datatypes::Int32Type;
   use std::sync::Arc;
   
   // Create dictionary with values ["a", null, "c"]
   let values = StringArray::from(vec![Some("a"), None, Some("c")]);
   // Keys [0, 1, 2, 0, 1] reference: "a", null, "c", "a", null
   let keys = Int32Array::from(vec![0, 1, 2, 0, 1]);
   let dict_array =
       DictionaryArray::<Int32Type>::try_new(keys, Arc::new(values))?;
   
   // COUNT should return 3 (only non-null values: "a", "c", "a")
   // But may incorrectly count the null references 
   ```
   
   Issue 2 - COUNT DISTINCT with all-null dictionary values:
   
   ```rust
   // Create dictionary where all keys reference null values
   let dict_values = StringArray::from(vec![None, Some("abc")]);
   // All keys point to index 0, which is null
   let dict_indices = Int32Array::from(vec![0; 5]);
   let dict_array =
       DictionaryArray::<Int32Type>::try_new(dict_indices, Arc::new(dict_values))?;
   
   // COUNT DISTINCT should return 0 since all referenced values are null   
   ```
   
   ### Expected behavior
   
   COUNT: Should only count non-null values in dictionary arrays by
   properly dereferencing dictionary keys to their actual values and
   ignoring null references.
   
   COUNT DISTINCT: Should return 0 when all dictionary keys reference
   null values, and should properly count only distinct non-null values
   when there's a mix of null and non-null references.
   
   Both functions should handle dictionary arrays by:
   
   - Dereferencing dictionary keys to their actual values
   - Applying null-checking logic to the dereferenced values, not the keys
   - Following the same null-handling semantics as regular arrays
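   
   The three steps above can be sketched without Arrow at all. The
   following is a minimal illustration of the expected semantics, not
   DataFusion's implementation: the dictionary is modelled as plain
   slices, and the helper names `logical_values`, `count` and
   `count_distinct` are hypothetical.
   
   ```rust
   use std::collections::HashSet;
   
   /// Resolve each key to its logical value. A row is null if the key
   /// itself is null or if the dictionary entry it points to is null.
   /// (Hypothetical helper; plain slices stand in for a DictionaryArray.)
   fn logical_values<'a>(
       keys: &[Option<usize>],
       values: &[Option<&'a str>],
   ) -> Vec<Option<&'a str>> {
       keys.iter()
           .map(|&key| key.and_then(|k| values[k]))
           .collect()
   }
   
   /// COUNT: number of rows whose dereferenced value is non-null.
   fn count(keys: &[Option<usize>], values: &[Option<&str>]) -> usize {
       logical_values(keys, values).into_iter().flatten().count()
   }
   
   /// COUNT DISTINCT: number of distinct non-null dereferenced values.
   fn count_distinct(keys: &[Option<usize>], values: &[Option<&str>]) -> usize {
       logical_values(keys, values)
           .into_iter()
           .flatten()
           .collect::<HashSet<_>>()
           .len()
   }
   ```
   
   With the data from Issue 1 (values `["a", null, "c"]`, keys
   `[0, 1, 2, 0, 1]`), `count` yields 3 and `count_distinct` yields 2.
   With the data from Issue 2 (every key pointing at a null entry), both
   yield 0.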
   
   ### Additional context
   
   This issue affects the correctness of aggregate queries on
   dictionary-encoded columns, which are commonly used in analytical
   workloads for memory efficiency. The bug could lead to incorrect query
   results in production environments.
   
   The issue is present in the core aggregation logic for both regular
   COUNT and COUNT DISTINCT operations when processing DictionaryArray
   inputs. The functions need to properly handle the indirection layer
   that dictionary encoding introduces.  
   
   



