Hi, I'm seeing some strange behaviour with Spark 1.5. I have a DataFrame built by loading and joining some Hive tables stored in S3.
The DataFrame is cached in memory using df.cache. What I'm seeing is that the counts I get from a groupBy on a column are different from what I get when I filter/select and count:

df.select("outcome").groupBy("outcome").count.show

outcome | count
--------+------
'A'     |   100
'B'     |   200

df.filter("outcome = 'A'").count        // 50
df.filter(df("outcome") === "A").count  // 50

I expect the count of rows that match 'A' in the groupBy result to match the count when filtering. Any ideas what might be happening?

Thanks,
Michael