On the PR review, there were questions about adding a new aggregating
class, and whether or not Aggregator[IN,BUF,OUT] could be used. I added a
proof of concept solution based on enhancing Aggregator to the pull-req:
https://github.com/apache/spark/pull/25024/
I wrote up my findings on the PR but
I submitted a PR for this:
https://github.com/apache/spark/pull/25024
On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson wrote:
> I describe some of the details here:
> https://issues.apache.org/jira/browse/SPARK-27296
>
> The short version of the story is that aggregating data structures (UDTs)
> u
Not that I know of. We did do some work to make it work faster in the case of
lower cardinality: https://issues.apache.org/jira/browse/SPARK-17949
On Wed, Mar 27, 2019 at 4:40 PM, Erik Erlandson < eerla...@redhat.com > wrote:
>
> BTW, if this is known, is there an existing JIRA I should link to
BTW, if this is known, is there an existing JIRA I should link to?
On Wed, Mar 27, 2019 at 4:36 PM Erik Erlandson wrote:
>
> At a high level, some candidate strategies are:
> 1. "fix" the logic in ScalaUDAF (possibly in conjunction with mods to UDAF
> trait itself) so that the update method can
They are unfortunately all pretty substantial (which is why this problem
exists) ...
On Wed, Mar 27, 2019 at 4:36 PM, Erik Erlandson < eerla...@redhat.com > wrote:
>
> At a high level, some candidate strategies are:
>
> 1. "fix" the logic in ScalaUDAF (possibly in conjunction with mods to UDAF
At a high level, some candidate strategies are:
1. "fix" the logic in ScalaUDAF (possibly in conjunction with mods to UDAF
trait itself) so that the update method can do the right thing.
2. Expose TypedImperativeAggregate to users for defining their own, since
it already does the right thing.
3. As
Yes this is known and an issue for performance. Do you have any thoughts on
how to fix this?
On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson wrote:
> I describe some of the details here:
> https://issues.apache.org/jira/browse/SPARK-27296
>
> The short version of the story is that aggregating dat
I describe some of the details here:
https://issues.apache.org/jira/browse/SPARK-27296
The short version of the story is that aggregating data structures (UDTs)
used by UDAFs are serialized to a Row object, and de-serialized, for every
row in a data frame.
Cheers,
Erik