s1, s2) => s1 ++= s2)
Best regards, Alexander
-----Original Message-----
From: Ulanov, Alexander
Sent: Monday, May 11, 2015 11:59 AM
To: Olivier Girardot; Michael Armbrust
Cc: Reynold Xin; dev@spark.apache.org
Subject: RE: DataFrame distinct vs RDD distinct
Hi,
Could you suggest alternative way
DataFrame distinct vs RDD distinct
I'll try to reproduce what has been reported to me first :) and I'll let
you know. Thanks !
On Thu, May 7, 2015 at 21:16, Michael Armbrust wrote:
I'd happily merge a PR that changes the distinct implementation to be more
like Spark core, assuming it includes benchmarks that show better
performance for both the "fits in memory case" and the "too big for memory
case".
On Thu, May 7, 2015 at 2:23 AM, Olivier Girardot <
o.girar...@lateral-thoug
OK, but for the moment this seems to be killing performance on some
computations...
I'll try to give you precise figures comparing RDD and DataFrame.
Olivier.
On Thu, May 7, 2015 at 10:08, Reynold Xin wrote:
In 1.5, we will most likely just rewrite distinct in SQL to either use the
Aggregate operator which will benefit from all the Tungsten optimizations,
or have a Tungsten version of distinct for SQL/DataFrame.
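
To illustrate the idea of expressing distinct as an aggregation, here is a minimal sketch on plain Scala collections, not Spark code. It groups by every column and keeps only the grouping keys. The `Row` case class and the `DistinctAsAggregate` object are hypothetical names used only for this illustration; the count kept per group stands in for an aggregation buffer and is discarded at the end.

```scala
import scala.collection.mutable

object DistinctAsAggregate {
  // Hypothetical row type, for illustration only.
  final case class Row(user: String, country: String)

  // Distinct expressed as "group by all columns, then keep the keys".
  // The per-key count plays the role of an aggregation buffer and is dropped.
  def distinct(rows: Seq[Row]): List[Row] = {
    val groups = mutable.LinkedHashMap.empty[Row, Int]
    rows.foreach(r => groups.update(r, groups.getOrElse(r, 0) + 1))
    groups.keys.toList // first-seen order of the grouping keys
  }
}
```

A hash-based aggregation like this only needs one buffer entry per distinct key, which is why a distinct built on an aggregate operator can benefit from whatever optimizations that operator receives.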
On Thu, May 7, 2015 at 1:32 AM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:
Hi everyone,
there seem to be different implementations of the "distinct" feature in
DataFrames and RDDs, and some performance issues with the DataFrame distinct
API.
In RDD.scala:
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] =
withScope { map(x => (x, null)).reduceBy
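
The quoted snippet is cut off above. For readers who want the gist of the pair-then-reduce approach, here is a rough sketch on plain Scala collections, not the Spark source: each element becomes a (value, null) pair, pairs are reduced by key, and only the keys are kept. The `reduceByKey` helper below is a hypothetical local stand-in for the RDD method of the same name, and `RddStyleDistinct` is an illustrative name.

```scala
object RddStyleDistinct {
  // Hypothetical local stand-in for reduceByKey: merges the values of equal keys.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(f(_, v)).getOrElse(v))
    }

  // Pair each element with a null placeholder, reduce by key, keep the keys.
  def distinct[T](xs: Seq[T]): Set[T] =
    reduceByKey(xs.map(x => (x, null)))((a, _) => a).keySet
}
```

In a distributed setting the reduce step combines duplicates within each partition before shuffling, so only one record per key per partition crosses the network; the local sketch cannot show that part.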