Re: UDAFs have an inefficiency problem

2019-09-30 Thread Erik Erlandson
On the PR review, there were questions about adding a new aggregating class, and whether or not Aggregator[IN,BUF,OUT] could be used. I added a proof of concept solution based on enhancing Aggregator to the pull-req: https://github.com/apache/spark/pull/25024/ I wrote up my findings on the PR but

Re: UDAFs have an inefficiency problem

2019-07-05 Thread Erik Erlandson
I submitted a PR for this: https://github.com/apache/spark/pull/25024 On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson wrote: > I describe some of the details here: > https://issues.apache.org/jira/browse/SPARK-27296 > > The short version of the story is that aggregating data structures (UDTs) > u

Re: UDAFs have an inefficiency problem

2019-03-27 Thread Reynold Xin
Not that I know of. We did do some work to make it work faster in the case of lower cardinality: https://issues.apache.org/jira/browse/SPARK-17949 On Wed, Mar 27, 2019 at 4:40 PM, Erik Erlandson < eerla...@redhat.com > wrote: > > BTW, if this is known, is there an existing JIRA I should link to

Re: UDAFs have an inefficiency problem

2019-03-27 Thread Erik Erlandson
BTW, if this is known, is there an existing JIRA I should link to? On Wed, Mar 27, 2019 at 4:36 PM Erik Erlandson wrote: > > At a high level, some candidate strategies are: > 1. "fix" the logic in ScalaUDAF (possibly in conjunction with mods to UDAF > trait itself) so that the update method can

Re: UDAFs have an inefficiency problem

2019-03-27 Thread Reynold Xin
They are unfortunately all pretty substantial (which is why this problem exists) ... On Wed, Mar 27, 2019 at 4:36 PM, Erik Erlandson < eerla...@redhat.com > wrote: > > At a high level, some candidate strategies are: > > 1. "fix" the logic in ScalaUDAF (possibly in conjunction with mods to UDAF

Re: UDAFs have an inefficiency problem

2019-03-27 Thread Erik Erlandson
At a high level, some candidate strategies are: 1. "fix" the logic in ScalaUDAF (possibly in conjunction with mods to UDAF trait itself) so that the update method can do the right thing. 2. Expose TypedImperativeAggregate to users for defining their own, since it already does the right thing. 3. As

Re: UDAFs have an inefficiency problem

2019-03-27 Thread Reynold Xin
Yes this is known and an issue for performance. Do you have any thoughts on how to fix this? On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson wrote: > I describe some of the details here: > https://issues.apache.org/jira/browse/SPARK-27296 > > The short version of the story is that aggregating dat

UDAFs have an inefficiency problem

2019-03-27 Thread Erik Erlandson
I describe some of the details here: https://issues.apache.org/jira/browse/SPARK-27296 The short version of the story is that aggregating data structures (UDTs) used by UDAFs are serialized to a Row object, and de-serialized, for every row in a data frame. Cheers, Erik