After upgrading someone's Spark/Hive program from Spark 1.5.2 to Spark 1.6.2, I've run into a GROUP BY that does not work reliably in the newer version: the results are not always fully aggregated. Instead, I get lots of duplicate and triplicate sets of group values. This seems like a Hive bug to me, but before I create a Jira ticket I thought I should check with the list.

Input: A single table with 24 fields that I want to group on, and a few other fields that I want to aggregate.

Statement: similar to

    SELECT a, b, c, ..., x, count(y) AS yc, sum(z1) AS z1s, sum(z2) AS z2s
    FROM inputTable
    GROUP BY a, b, c, ..., x

Checking the data for one sample run, I see that the input table has about 1.1M rows, with 18157 unique combinations of those 24 grouped values.

Expected output: A table of 18157 rows.

Observed output: A table of 28006 rows. Looking just at the combinations of the grouped fields, I see that while 10125 combinations appear once as expected, 6215 appear twice and 1817 appear three times (10125 + 2*6215 + 3*1817 = 28006).

This is not 100% repeatable. That is, I'll see the issue repeatedly one day, but the next day, with the same input data, the GROUP BY will work correctly. For what it's worth, I have captured parquets of the input and output at a time when this issue kicked in.

For now it seems that I have a workaround: if I presort the input table by the grouped fields, the GROUP BY works correctly. But of course I shouldn't have to do that.

I was unable to find any recent reports of this sort of thing in the Hive Jira or on this list, which is odd. Does this misbehaviour seem familiar to anyone?

/drm
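
For anyone who wants to run the same kind of check on their own output, here is a minimal sketch of the duplicate count described above. It is plain Python rather than Spark, the function name and the toy rows are hypothetical, and it assumes the output rows have been collected as dicts keyed by column name. The last few lines only reconcile the counts reported in this post.

```python
from collections import Counter

def multiplicity_histogram(rows, key_fields):
    """Map 'times a key combination appears' to 'number of such combinations'."""
    key_counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return Counter(key_counts.values())

# Hypothetical toy GROUP BY output in which one key was not fully aggregated.
output_rows = [
    {"a": 1, "b": "x", "yc": 3},
    {"a": 1, "b": "x", "yc": 2},  # same (a, b) again: a duplicate group
    {"a": 2, "b": "y", "yc": 5},
]
hist = multiplicity_histogram(output_rows, ["a", "b"])
print(sorted(hist.items()))  # [(1, 1), (2, 1)]

# Reconcile the counts reported above: multiplicity -> number of combinations.
reported = {1: 10125, 2: 6215, 3: 1817}
distinct = sum(reported.values())                 # distinct key combinations
total = sum(m * n for m, n in reported.items())   # rows in the observed output
print(distinct, total)  # 18157 28006
```

A correctly aggregated result would have every key combination at multiplicity 1, so the histogram would contain only the entry for 1.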