Thanks. Sorry, I cannot share the data, and I am not sure how significant it would
be for you anyway.
I am reproducing the issue on a smaller piece of the content to see whether I can
find a reason for the inconsistency.
val res2 = data.filter($"closed" === $"ever_closed")
  .groupBy("product", "band", "aget", "vine", "time", "yyyymm")
  .agg(count($"account_id").as("N"),
       sum($"balance").as("balance"),
       sum($"spend").as("spend"),
       sum($"payment").as("payment"))
  .persist()
Then I collect the distinct values of "vine" (which is StringType) from both data
and res2, and res2 is missing a lot of values:
val t1 = res2.select("vine").distinct.collect
scala> t1.size
res10: Int = 617
val t_real = data.select("vine").distinct.collect
scala> t_real.size
res9: Int = 639
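Once both arrays are collected, a plain-Scala set difference pinpoints exactly which values dropped out. This is only a sketch with small stand-in sets; with the real collected rows the sets would come from `t_real.map(_.getString(0)).toSet` and `t1.map(_.getString(0)).toSet`:

```scala
// Stand-ins for the distinct "vine" values collected from each DataFrame.
val realVines = Set("A", "B", "C", "D")  // from data.select("vine").distinct
val resVines  = Set("A", "C")            // from res2.select("vine").distinct

// Values present in the source but absent after the groupBy/agg:
val missing = realVines diff resVines
println(missing.mkString(", "))
```

Printing the missing values (rather than just the counts 617 vs 639) may reveal a pattern, e.g. whether the dropped keys share a partition or an encoding quirk.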
From: Xiao Li [mailto:[email protected]]
Sent: Thursday, October 22, 2015 12:45 PM
To: Ellafi, Saif A.
Cc: user
Subject: Re: Spark groupby and agg inconsistent and missing data
Hi, Saif,
Could you post your code here? It might help others reproduce the errors and
give you a correct answer.
Thanks,
Xiao Li
2015-10-22 8:27 GMT-07:00 <[email protected]>:
Hello everyone,
I am running some analytics experiments on a 4-server standalone cluster from the
spark shell, mostly involving a large dataset with groupBy and aggregations.
I group by 6 columns and return various aggregated results in a dataframe. The
groupBy fields are of two types: most are StringType and the rest are LongType.
The data source is a dataframe read from a split JSON file. Once the data is
persisted, the result is consistent; but if I unload the memory and reload the
data, the groupBy action returns different results, with missing data.
Could I be missing something? This is rather serious for my analytics, and I am
not sure how to properly diagnose the situation.
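For what it is worth, the groupBy/agg step itself can be mimicked with plain Scala collections, where grouping and summing the same rows always yields the same groups; in the sketch below the record fields are hypothetical stand-ins for the real columns, mirroring count($"account_id").as("N") and sum($"balance"):

```scala
// Plain-Scala analogue of the groupBy/agg step (illustrative only;
// field names are hypothetical stand-ins for the real columns).
case class Rec(product: String, vine: String, balance: Double)

val rows = Seq(
  Rec("p1", "A", 10.0),
  Rec("p1", "A",  5.0),
  Rec("p2", "B",  7.0)
)

// Group on the key columns, then count rows and sum balances per group.
val agg = rows.groupBy(r => (r.product, r.vine)).map {
  case (key, rs) => (key, rs.size, rs.map(_.balance).sum)
}
```

Since grouping over identical input is deterministic here, results that change across reloads would suggest the reloaded input itself differs, which is one thing worth checking.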
Thanks,
Saif