Sounds like a bug. If you can reproduce it on 1.6.3 (currently being voted on), then please open a JIRA.
On Thu, Nov 3, 2016 at 8:05 AM, Donald Matthews <drm.t...@gmail.com> wrote:

> While upgrading a program from Spark 1.5.2 to Spark 1.6.2, I've run into a
> HiveContext GROUP BY that no longer works reliably.
> The GROUP BY results are not always fully aggregated; instead, I get lots
> of duplicate + triplicate sets of group values.
>
> I've come up with a workaround that works for me, but the behaviour in
> question seems like a Spark bug, and since I don't see anything matching
> this in the Spark JIRA or on this list, I thought I should check with this
> list to see if it's a known issue or if it might be worth creating a ticket
> for.
>
> Input: A single table with 24 fields that I want to group on, and a few
> other fields that I want to aggregate.
>
> Statement: similar to
>
> hiveContext.sql("""
>   SELECT a, b, c, ..., x, count(y) AS yc, sum(z1) AS z1s, sum(z2) AS z2s
>   FROM inputTable
>   GROUP BY a, b, c, ..., x
> """)
>
> Checking the data for one sample run, I see that the input table has about
> 1.1M rows, with 18157 unique combinations of those 24 grouped values.
>
> Expected output: A table of 18157 rows.
>
> Observed output: A table of 28006 rows. Looking just at unique
> combinations of those grouped fields, I see that while 10125 rows are
> unique as expected, there are 6215 duplicate rows and 1817 triplicate rows.
>
> This is not quite 100% repeatable. That is, I'll see the issue repeatedly
> one day, but the next day, with the same input data, the GROUP BY will work
> correctly.
>
> For now it seems that I have a workaround: if I presort the input table on
> those grouped fields, the GROUP BY works correctly. But of course I
> shouldn't have to do that.
>
> Does this sort of GROUP BY issue seem familiar to anyone?
>
> /drm
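
For reference, below is a minimal Scala sketch of the presort workaround as Donald describes it, written against the Spark 1.6 HiveContext API. The table name `inputTable` and the column names `a`, `b`, `c`, `x`, `y`, `z1`, `z2` are placeholders standing in for the 24 grouping fields and the aggregated fields in the original program; the sketch only mirrors the reported workaround and is not a fix for the underlying issue.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object GroupByPresortWorkaround {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GroupByPresortWorkaround"))
    val hiveContext = new HiveContext(sc)

    // Placeholder columns: the real program groups on 24 fields (a..x).
    val groupCols = Seq("a", "b", "c", "x")

    // Workaround: presort the input on the grouping columns before aggregating.
    val presorted = hiveContext.table("inputTable")
      .sort(groupCols.head, groupCols.tail: _*)
    presorted.registerTempTable("inputTableSorted")

    // The same GROUP BY, now run against the presorted temp table.
    val result = hiveContext.sql("""
      SELECT a, b, c, x, count(y) AS yc, sum(z1) AS z1s, sum(z2) AS z2s
      FROM inputTableSorted
      GROUP BY a, b, c, x
    """)

    // Sanity check from the thread: row count should match the number of
    // distinct grouping-key combinations (18157 in the sample run).
    println(s"result rows: ${result.count()}")
  }
}
```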