Jefffrey commented on code in PR #19618:
URL: https://github.com/apache/datafusion/pull/19618#discussion_r2675768751


##########
datafusion/functions-aggregate/src/string_agg.rs:
##########
@@ -384,14 +384,13 @@ impl Accumulator for SimpleStringAggAccumulator {
     }
 
     fn evaluate(&mut self) -> Result<ScalarValue> {
-        let result = if self.has_value {
-            ScalarValue::LargeUtf8(Some(std::mem::take(&mut 
self.accumulated_string)))
+        if self.has_value {
+            Ok(ScalarValue::LargeUtf8(Some(
+                self.accumulated_string.clone(),

Review Comment:
   I guess this approach is fine for now



##########
datafusion/expr-common/src/accumulator.rs:
##########
@@ -58,17 +58,37 @@ pub trait Accumulator: Send + Sync + Debug {
     /// running sum.
     fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()>;
 
-    /// Returns the final aggregate value, consuming the internal state.
+    /// Returns the final aggregate value.
     ///
     /// For example, the `SUM` accumulator maintains a running sum,
     /// and `evaluate` will produce that running sum as its output.
     ///
-    /// This function should not be called twice, otherwise it will
-    /// result in potentially non-deterministic behavior.
-    ///
     /// This function gets `&mut self` to allow for the accumulator to build
     /// arrow-compatible internal state that can be returned without copying
-    /// when possible (for example distinct strings)
+    /// when possible (for example distinct strings).
+    ///
+    /// # Window Frame Queries
+    ///
+    /// When used in a window context without [`Self::supports_retract_batch`],
+    /// `evaluate()` may be called multiple times on the same accumulator 
instance
+    /// (once per row in the partition). In this case, implementations **must 
not**
+    /// consume or modify internal state. Use references or clones to preserve 
state:
+    ///
+    /// ```ignore
+    /// // GOOD: Preserves state for subsequent calls
+    /// fn evaluate(&mut self) -> Result<ScalarValue> {
+    ///     calculate_result(&self.values)  // Use reference
+    /// }
+    ///
+    /// // BAD: Consumes state, breaks window queries
+    /// fn evaluate(&mut self) -> Result<ScalarValue> {
+    ///     calculate_result(std::mem::take(&mut self.values))
+    /// }
+    /// ```
+    ///
+    /// For efficient sliding window calculations, consider implementing
+    /// [`Self::retract_batch`] which allows DataFusion to incrementally
+    /// update state rather than calling `evaluate()` repeatedly.

Review Comment:
   There's some things to correct here:
   
   - They are still allowed to modify the internal state, technically; 
`percentile_cont` and `median` use `select_nth_unstable_by` which modifies the 
state but doesn't consume it
   - Implementing `retract_batch` has nothing to do with efficiency; it just 
allows support for more types of window frames
   
   I think something like this would be more accurate:
   
   ```rust
   /// ## Correctness
   ///
   /// This function must not consume the internal state, as it is also used in 
window
   /// aggregate functions where it can be executed multiple times depending on 
the
   /// current window frame. Consuming the internal state can cause the next 
invocation
   /// to have incorrect results.
   ///
   /// - Even if this accumulator doesn't implement [`retract_batch`] it may 
still be used
   ///   in window aggregate functions where the window frame is
   ///   `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`
   ///
   /// It is fine to modify the state (e.g. re-order elements within internal 
state vec) so long
   /// as this doesn't cause an incorrect computation on the next call of 
evaluate.
   ///
   /// [`retract_batch`]: Self::retract_batch
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to