Thanks. Sorry, I cannot share the data, and I am not sure how significant it would
be for you anyway.
I am reproducing the issue on a smaller piece of the content to see whether I can
find a reason for the inconsistency.
val res2 = data.filter($"closed" === $"ever_closed")
  .groupBy("product", "band", "aget", "vine", "time", "yyyymm")
  .agg(count($"account_id").as("N"),
       sum($"balance").as("balance"),
       sum($"spend").as("spend"),
       sum($"payment").as("payment"))
  .persist()
Then I collect the distinct values of "vine" (which is StringType) from both data
and res2, and res2 is missing a lot of values:
val t1 = res2.select("vine").distinct.collect
scala> t1.size
res10: Int = 617
val t_real = data.select("vine").distinct.collect
scala> t_real.size
res9: Int = 639
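Once both arrays are collected, a plain-Scala set difference pinpoints exactly which values dropped out. This is only a sketch with small stand-in sets; with the real collected rows the sets would come from `t_real.map(_.getString(0)).toSet` and `t1.map(_.getString(0)).toSet`:

```scala
// Stand-ins for the distinct "vine" values collected from each DataFrame.
val realVines = Set("A", "B", "C", "D")  // from data.select("vine").distinct
val resVines  = Set("A", "C")            // from res2.select("vine").distinct

// Values present in the source but absent after the groupBy/agg:
val missing = realVines diff resVines
println(missing.mkString(", "))
```

Printing the missing values (rather than just the counts 617 vs 639) may reveal a pattern, e.g. whether the dropped keys share a partition or an encoding quirk.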
From: Xiao Li [mailto:[email protected]]
Sent: Thursday, October 22, 2015 12:45 PM
To: Ellafi, Saif A.
Cc: user
Subject: Re: Spark groupby and agg inconsistent and missing data
Hi, Saif,
Could you post your code here? It might help others reproduce the errors and
give you a correct answer.
Thanks,
Xiao Li
2015-10-22 8:27 GMT-07:00 <[email protected]>:
Hello everyone,
I am running some analytics experiments on a 4-server standalone cluster from the
spark shell, mostly involving a large dataset with groupBy and aggregations.
I group by 6 columns and return various aggregated results in a dataframe. The
groupBy fields are of two types: most are StringType and the rest are LongType.
The data source is a dataframe read from a split JSON file. Once the data is
persisted, the result is consistent; but if I unload the memory and reload the
data, the groupBy action returns different results, with missing data.
Could I be missing something? This is rather serious for my analytics, and I am
not sure how to properly diagnose the situation.
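For what it is worth, the groupBy/agg step itself can be mimicked with plain Scala collections, where grouping and summing the same rows always yields the same groups; in the sketch below the record fields are hypothetical stand-ins for the real columns, mirroring count($"account_id").as("N") and sum($"balance"):

```scala
// Plain-Scala analogue of the groupBy/agg step (illustrative only;
// field names are hypothetical stand-ins for the real columns).
case class Rec(product: String, vine: String, balance: Double)

val rows = Seq(
  Rec("p1", "A", 10.0),
  Rec("p1", "A",  5.0),
  Rec("p2", "B",  7.0)
)

// Group on the key columns, then count rows and sum balances per group.
val agg = rows.groupBy(r => (r.product, r.vine)).map {
  case (key, rs) => (key, rs.size, rs.map(_.balance).sum)
}
```

Since grouping over identical input is deterministic here, results that change across reloads would suggest the reloaded input itself differs, which is one thing worth checking.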
Thanks,
Saif