Sounds like a bug. If you can reproduce it on 1.6.3 (currently being voted on), then please open a JIRA.
On Thu, Nov 3, 2016 at 8:05 AM, Donald Matthews <drm.t...@gmail.com> wrote:

> While upgrading a program from Spark 1.5.2 to Spark 1.6.2, I've run into a
> HiveContext GROUP BY that no longer works reliably.
> The GROUP BY results are not always fully aggregated; instead, I get lots
> of duplicate + triplicate sets of group values.
>
> I've come up with a workaround that works for me, but the behaviour in
> question seems like a Spark bug, and since I don't see anything matching
> this in the Spark JIRA or on this list, I thought I should check with this
> list to see if it's a known issue or if it might be worth creating a ticket
> for.
>
> Input: A single table with 24 fields that I want to group on, and a few
> other fields that I want to aggregate.
>
> Statement: similar to
>
> hiveContext.sql("""
>   SELECT a, b, c, ..., x, count(y) AS yc, sum(z1) AS z1s, sum(z2) AS z2s
>   FROM inputTable
>   GROUP BY a, b, c, ..., x
> """)
>
> Checking the data for one sample run, I see that the input table has about
> 1.1M rows, with 18157 unique combinations of those 24 grouped values.
>
> Expected output: A table of 18157 rows.
>
> Observed output: A table of 28006 rows. Looking just at unique
> combinations of those grouped fields, I see that while 10125 rows are
> unique as expected, there are 6215 duplicate rows and 1817 triplicate rows.
>
> This is not quite 100% repeatable. That is, I'll see the issue repeatedly
> one day, but the next day, with the same input data, the GROUP BY will work
> correctly.
>
> For now it seems that I have a workaround: if I presort the input table on
> those grouped fields, the GROUP BY works correctly. But of course I
> shouldn't have to do that.
>
> Does this sort of GROUP BY issue seem familiar to anyone?
>
> /drm
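
For reference, below is a minimal Scala sketch of the presort workaround as Donald describes it, written against the Spark 1.6 HiveContext API. The table name `inputTable` and the column names `a`, `b`, `c`, `x`, `y`, `z1`, `z2` are placeholders standing in for the 24 grouping fields and the aggregated fields in the original program; the sketch only mirrors the reported workaround and is not a fix for the underlying issue.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object GroupByPresortWorkaround {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GroupByPresortWorkaround"))
    val hiveContext = new HiveContext(sc)

    // Placeholder columns: the real program groups on 24 fields (a..x).
    val groupCols = Seq("a", "b", "c", "x")

    // Workaround: presort the input on the grouping columns before aggregating.
    val presorted = hiveContext.table("inputTable")
      .sort(groupCols.head, groupCols.tail: _*)
    presorted.registerTempTable("inputTableSorted")

    // The same GROUP BY, now run against the presorted temp table.
    val result = hiveContext.sql("""
      SELECT a, b, c, x, count(y) AS yc, sum(z1) AS z1s, sum(z2) AS z2s
      FROM inputTableSorted
      GROUP BY a, b, c, x
    """)

    // Sanity check from the thread: row count should match the number of
    // distinct grouping-key combinations (18157 in the sample run).
    println(s"result rows: ${result.count()}")
  }
}
```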