Hi,

I'm seeing some strange behaviour with Spark 1.5. I have a DataFrame
that I built by loading and joining some Hive tables stored in S3.

The DataFrame is cached in memory via df.cache.

What I'm seeing is that the counts I get when I do a group by on a
column are different from what I get when I filter/select and count.

df.select("outcome").groupBy("outcome").count.show
outcome | count
--------|------
'A'     |   100
'B'     |   200

df.filter("outcome = 'A'").count
# 50

df.filter(df("outcome") === "A").count
# 50

I expect the count of rows matching 'A' in the groupBy result to match
the count I get when filtering. Any ideas what might be happening?

Thanks,

Michael
