While upgrading a program from Spark 1.5.2 to Spark 1.6.2, I've run into a
HiveContext GROUP BY that no longer works reliably.
The GROUP BY results are not always fully aggregated; instead, I get lots
of duplicate and triplicate rows for the same combinations of group values.

I've come up with a workaround that works for me, but the behaviour in
question looks like a Spark bug, and since I don't see anything matching it
in the Spark JIRA or in the list archives, I wanted to check here whether
it's a known issue or whether it might be worth creating a ticket for.

Input:  A single table with 24 fields that I want to group on, and a few
other fields that I want to aggregate.

Statement: similar to hiveContext.sql("""
SELECT a,b,c, ..., x, count(y) as yc, sum(z1) as z1s, sum(z2) as z2s
FROM inputTable
GROUP BY a,b,c, ..., x
""")

Checking the data for one sample run, I see that the input table has about
1.1M rows, with 18157 unique combinations of values for those 24 grouping fields.

Expected output: A table of 18157 rows.

Observed output: A table of 28006 rows. Looking just at unique combinations
of those grouped fields, I see that while 10125 group keys appear once as
expected, 6215 appear twice and 1817 appear three times
(10125 + 2*6215 + 3*1817 = 28006).
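For concreteness, the kind of check I mean looks roughly like this (the
column names are placeholders standing in for the full list of 24 grouping
fields, and "result" is assumed to be the DataFrame returned by the
statement above):

// Re-group the GROUP BY output on the same key columns and see how many
// times each key combination appears. a, b, c, x stand in for all 24 fields.
val keyCounts = result.groupBy("a", "b", "c", "x").count()

// Distribution of repetition counts: count=1 rows are unique keys,
// count=2 are duplicates, count=3 are triplicates.
keyCounts.groupBy("count").count().show()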

This is not quite 100% repeatable. That is, I'll see the issue repeatedly
one day, but the next day with the same input data the GROUP BY will work
correctly.

For now it seems that I have a workaround: if I presort the input table on
those grouped fields, the GROUP BY works correctly. But of course I
shouldn't have to do that.
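Roughly, the workaround looks like this (again, a, b, c, x are placeholders
for the full list of 24 grouping fields; this is just a sketch of the shape
of it, not the exact code):

// Presort the input on the grouping fields, register the sorted DataFrame
// as a temp table, and run the same GROUP BY against it.
val sortedInput = hiveContext.table("inputTable").sort("a", "b", "c", "x")
sortedInput.registerTempTable("inputTableSorted")

val result = hiveContext.sql("""
  SELECT a, b, c, x, count(y) as yc, sum(z1) as z1s, sum(z2) as z2s
  FROM inputTableSorted
  GROUP BY a, b, c, x
""")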

Does this sort of GROUP BY issue seem familiar to anyone?

/drm
