Hi Maciek,

I've tested several variants for summing "fieldToSum":

First, RDD-style code (assuming a case class A with a fieldToSum: Long field and import spark.implicits._ in scope):
df.as[A].map(_.fieldToSum).reduce(_ + _)
df.as[A].rdd.map(_.fieldToSum).sum()
df.as[A].map(_.fieldToSum).rdd.sum()
All around 30 seconds. "reduce" and "sum" seem to have the same performance, 
for this use case at least.
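(For reference, a minimal sketch of how such wall-clock timings can be taken; the helper below is illustrative, not my exact harness:)

def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block  // force the Spark action to run eagerly
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
  result
}
time("rdd-style")(df.as[A].map(_.fieldToSum).reduce(_ + _))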

Then with sql.functions.sum:
import org.apache.spark.sql.functions.sum
df.agg(sum("fieldToSum")).collect().head.getAs[Long](0)
0.24 seconds, super fast.

Finally, Dataset with column selection:
df.as[A].select($"fieldToSum".as[Long]).reduce(_ + _)
0.18 seconds, super fast again. (Presumably the select lets Catalyst prune down to the single column, so only a Long is decoded per row rather than a whole A.)

(I've also tried replacing my sums and reduces with counts, as you advised, but 
the performance is unchanged. Apparently, the summing itself takes negligible 
time compared to accessing the data.)
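(Concretely, the count variants look like this — a sketch, mirroring the sums above:)

df.as[A].map(_.fieldToSum).count()
df.as[A].rdd.map(_.fieldToSum).count()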

It seems that we need to use the SQL interface to reach the highest level of 
performance, which somewhat breaks the promise of Dataset (preserving type 
safety while getting the Catalyst and Tungsten performance of DataFrames).
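To illustrate the trade-off (a sketch; the misspelled field name is deliberate): the fast variants rely on string column names, which the compiler cannot check, so a typo only surfaces at runtime:

df.as[A].select($"fieldToSm".as[Long]).reduce(_ + _)  // AnalysisException at runtime
df.as[A].map(_.fieldToSm).reduce(_ + _)               // rejected at compile time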

As for direct access to Row, it seems to have become much slower from 1.6 to 2.0. 
I guess that's because DataFrame is now Dataset[Row], and thus uses the same 
encoding/decoding mechanism as any other case class.
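As I understand it (a sketch of the mechanism, assuming Spark 2.0 semantics):

// 1.6: DataFrame.map returned an RDD directly, so the lambda saw Rows
// with no extra encoding step on the result side.
// 2.0: DataFrame is Dataset[Row]; the same call now needs an implicit
// Encoder[Long] (from spark.implicits._), and each internal binary row
// is decoded into an external Row before the lambda runs:
df.map(row => row.getAs[Long]("fieldToSum")).reduce(_ + _)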

Best regards,

Julien

> On 27 Aug 2016, at 22:32, Maciej Bryński <mac...@brynski.pl> wrote:
> 
> 
> 2016-08-27 15:27 GMT+02:00 Julien Dumazert <julien.dumaz...@gmail.com>:
> df.map(row => row.getAs[Long]("fieldToSum")).reduce(_ + _)
> 
> I think reduce and sum have very different performance. 
> Did you try sql.functions.sum?
> Or if you want to benchmark access to the Row object, then the count() 
> function would be a better idea.
> 
> Regards,
> -- 
> Maciek Bryński
