While upgrading a program from Spark 1.5.2 to Spark 1.6.2, I've run into a HiveContext GROUP BY that no longer works reliably. The GROUP BY results are not always fully aggregated; instead, I get lots of duplicate + triplicate sets of group values.
I've come up with a workaround that works for me, but the behaviour in question looks like a Spark bug, and since I don't see anything matching it in the Spark Jira or in this list's archives, I thought I should check here to see whether it's a known issue or worth creating a ticket for.

Input: a single table with 24 fields that I want to group on, plus a few other fields that I want to aggregate.

Statement: similar to

    hiveContext.sql("""
      SELECT a, b, c, ..., x,
             count(y) AS yc,
             sum(z1)  AS z1s,
             sum(z2)  AS z2s
      FROM inputTable
      GROUP BY a, b, c, ..., x
    """)

Checking the data for one sample run, the input table has about 1.1M rows, with 18157 unique combinations of those 24 grouped values.

Expected output: a table of 18157 rows.

Observed output: a table of 28006 rows. Looking just at unique combinations of the grouped fields, 10125 key combinations appear once as expected, but 6215 appear twice and 1817 appear three times (10125 + 2*6215 + 3*1817 = 28006).

This is not quite 100% repeatable: I'll see the issue repeatedly one day, but the next day, with the same input data, the GROUP BY works correctly.

For now I have a workaround: if I presort the input table on the grouped fields, the GROUP BY produces correct results (rough sketch in the P.S. below). But of course I shouldn't have to do that.

Does this sort of GROUP BY issue seem familiar to anyone?

/drm
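P.S. In case it helps to see the shape of the workaround, here is roughly what I mean by presorting. I've sketched it against the DataFrame API rather than the SQL string purely so I can leave the 24 column names elided; "groupCols" is just a stand-in for the real list, and the assert at the end is only a sanity check, not part of the job.

    import org.apache.spark.sql.functions.{col, count, sum}

    // Grouping columns a..x, elided here just as in the query above.
    val groupCols = Seq("a", "b", "c" /*, ..., "x" */)

    // Workaround: sort the input on the grouping columns before aggregating.
    val presorted = hiveContext.table("inputTable")
      .sort(groupCols.map(col): _*)

    val grouped = presorted
      .groupBy(groupCols.map(col): _*)
      .agg(count("y").as("yc"), sum("z1").as("z1s"), sum("z2").as("z2s"))

    // Sanity check: every result row should be a distinct key combination.
    val distinctKeys = grouped.select(groupCols.map(col): _*).distinct().count()
    assert(distinctKeys == grouped.count(),
      s"still seeing duplicate groups: ${grouped.count()} rows vs $distinctKeys distinct keys")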