kosiew commented on PR #17085:
URL: https://github.com/apache/datafusion/pull/17085#issuecomment-3389122318
<html><head></head><body><h1>Why <code inline="">AccumulatorArgs</code>
Differs from Other Function Args</h1>
<p>Your review raises an excellent question about API consistency. The
answer lies in <strong>when and how</strong> these functions resolve their
input fields.</p>
<hr>
<h2>Scalar & Window Functions: Pre-computed Fields</h2>
<p>Both scalar and window functions receive <strong>pre-computed</strong>
<code inline="">FieldRef</code>s at creation time:</p>
<pre><code class="language-rust">// From create_udwf_window_expr
(windows/mod.rs:178-180)
let input_fields: Vec<_> = args
.iter()
.map(|arg| arg.return_field(input_schema)) // Computed once at planning
.collect::<Result<_>>()?;
</code></pre>
<p>These are then passed to <code inline="">ExpressionArgs</code>, <code
inline="">PartitionEvaluatorArgs</code>, or <code
inline="">ScalarFunctionArgs</code>.<br>
The schema itself is never exposed because <strong>the fields have already
been computed</strong>.</p>
<hr>
<h2>Aggregate Functions: Runtime Schema Access</h2>
<p>Aggregates, by contrast, receive the <strong>physical schema</strong> and
compute fields <strong>on demand</strong>:</p>
<pre><code class="language-rust">// From
AggregateFunctionExpr::create_accumulator (aggregate.rs:429-432)
let schema = self.args_schema();
let acc_args = self.make_acc_args(schema.as_ref()); // Schema passed here
self.fun.accumulator(acc_args)
</code></pre>
<hr>
<h1>Why Can't Aggregates Use Pre-computed Fields?</h1>
<p>The expressions need <strong>schema access at runtime</strong>.
Consider:</p>
<pre><code>SUM(a + b)
</code></pre>
<p>The expression <code inline="">a + b</code> references <strong>two
columns</strong> from the physical schema, but produces <strong>one logical
argument</strong> for the accumulator.<br>
The <code inline="">PhysicalExpr</code> for <code inline="">a + b</code>
needs to:</p>
<ol>
<li>
<p>Resolve columns <code inline="">a</code> and <code inline="">b</code>
from the physical schema</p>
</li>
<li>
<p>Evaluate the addition</p>
</li>
<li>
<p>Pass the result to the accumulator</p>
</li>
</ol>
<p>If we only passed pre-computed fields (like scalar/window functions do),
the <code inline="">PhysicalExpr</code> couldn’t resolve its column references
— it needs the full <code inline="">Schema</code>.</p>
<hr>
<h3>From the patch (aggregate.rs:391-395):</h3>
<blockquote>
<p>Physical expressions may reference columns that are not one-to-one with
the aggregate arguments (for example, <code inline="">SUM(a + b)</code>
references both <code inline="">a</code> and <code inline="">b</code>).
Replacing the schema with the synthesized argument fields would prevent those
expressions from resolving their column references.</p>
</blockquote>
<hr>
<h2>What the Patch Actually Changes</h2>
<p>The patch <strong>doesn't change runtime behavior</strong> — it:</p>
<ol>
<li>
<p><strong>Clarifies documentation:</strong> Makes explicit that <code
inline="">schema</code> is the physical input schema</p>
</li>
<li>
<p><strong>Adds convenience methods:</strong> <code
inline="">input_field()</code> and <code inline="">input_fields()</code> wrap
the common pattern of calling <code
inline="">expr.return_field(&schema)</code></p>
</li>
<li>
<p><strong>Aligns ergonomics:</strong> Matches the convenience of <code
inline="">ScalarFunctionArgs.arg_fields</code> and <code
inline="">PartitionEvaluatorArgs.input_fields()</code>, without losing the
schema access that aggregate expressions require</p>
</li>
</ol>
<hr>
<h1>Comparison Table</h1>
Aspect | Scalar / Window | Accumulator (Before) | Accumulator (After)
-- | -- | -- | --
Fields | Pre-computed at planning | Computed on-demand | Computed on-demand
Schema | Hidden (not needed) | Exposed (confusing) | Exposed (documented)
Field access | Direct: args.arg_fields[i] | Manual:
exprs[i].return_field(schema)? | Helper: input_field(i)?
Why different? | Simple 1:1 args | Expressions need schema for column
resolution | Same (clarified)
<hr>
<h2>Why This Fix Is Correct</h2>
<p>The patch acknowledges that <code inline="">AccumulatorArgs</code>
<strong>must remain unique</strong> because aggregates operate at the physical
execution layer with complex expressions.<br>
Rather than force-fitting them into the scalar/window model (which would
break column resolution), it:</p>
<ul>
<li>
<p>Preserves the necessary schema access</p>
</li>
<li>
<p>Documents <strong>why</strong> it’s different</p>
</li>
<li>
<p>Adds ergonomic helpers for the common case</p>
</li>
</ul>
<hr>
<p><strong>This is the right approach:</strong><br>
Clarify intent and align ergonomics where possible, while preserving the
functionality that makes aggregates work correctly.</p></body></html>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]