After upgrading someone's Spark/Hive program from Spark 1.5.2 to Spark
1.6.2, I've run into a GROUP BY that does not work reliably in the newer
version: the GROUP BY results are not always fully aggregated. Instead, I
get lots of duplicate + triplicate sets of group values. It seems like a
Hive bug to me, but before filing a Jira ticket I thought I should check
with the list.

Input:  A single table with 24 fields that I want to group on, and a few
other fields that I want to aggregate.

Statement: similar to
SELECT a,b,c, ..., x, count(y) as yc, sum(z1) as z1s, sum(z2) as z2s
FROM inputTable
GROUP BY a,b,c, ..., x

Checking the data for one sample run, I see that the input table has about
1.1M rows, with 18157 unique combinations of values for those 24 grouping
fields.
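
For concreteness, those two counts can be checked with queries along these
lines (field list abbreviated with ... as in the statement above):

SELECT count(*) FROM inputTable;

SELECT count(*)
FROM (SELECT DISTINCT a,b,c, ..., x FROM inputTable) t;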

Expected output: A table of 18157 rows.

Observed output: A table of 28006 rows. Looking just at unique combinations
of those grouped fields, 10125 combinations appear exactly once as expected,
while 6215 appear twice and 1817 appear three times. (The arithmetic is
consistent: 10125 + 6215 + 1817 = 18157 distinct combinations, and
10125 + 2*6215 + 3*1817 = 28006 output rows, so every expected group is
present, but some are split across two or three partially aggregated rows.)
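
A breakdown like the one above can be obtained by grouping the output on
the same fields and counting how many times each combination appears,
roughly (outputTable is an illustrative name for the stored GROUP BY
result):

SELECT times_seen, count(*) AS num_combinations
FROM (SELECT a,b,c, ..., x, count(*) AS times_seen
      FROM outputTable
      GROUP BY a,b,c, ..., x) t
GROUP BY times_seen;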

This is not 100% repeatable: I'll see the issue repeatedly one day, but the
next day, with the same input data, the GROUP BY will work correctly.
Anyway, for what it's worth, I have captured Parquet files of the input +
output from a run where the issue kicked in.

For now it seems that I have a workaround: if I presort the input table by
the grouped fields, the GROUP BY works correctly. But of course I shouldn't
have to do that.
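
Concretely, the presort workaround is along these lines (inputSorted is an
illustrative name; a full ORDER BY may be more sorting than strictly
needed, but it is the simplest way to state it):

CREATE TABLE inputSorted AS
SELECT * FROM inputTable
ORDER BY a,b,c, ..., x;

-- same GROUP BY as before, now against the presorted table
SELECT a,b,c, ..., x, count(y) as yc, sum(z1) as z1s, sum(z2) as z2s
FROM inputSorted
GROUP BY a,b,c, ..., x;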

I was unable to find any recent reports of this sort of thing in the Hive
Jira or on this list, which is odd. Does this misbehaviour seem familiar to
anyone?

/drm
