Re: Java 8 vs Scala

2015-07-16 Thread Marius Danciu
If you take time to actually learn Scala starting from its fundamental concepts AND, quite importantly, get familiar with general functional programming concepts, you'd immediately realize the things you'd really miss going back to Java (8). On Fri, Jul 17, 2015 at 8:14 AM Wojciech Pituła wrote:

DataFrame from RDD[Row]

2015-07-16 Thread Marius Danciu
Hi, This is an ugly solution because it requires pulling out a row: val rdd: RDD[Row] = ... ctx.createDataFrame(rdd, rdd.first().schema) Is there a better alternative to get a DataFrame from an RDD[Row], since toDF won't work as Row is not a Product? Thanks, Marius
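
A minimal sketch of the alternative (with hypothetical column names and types): build the StructType by hand and pass it to createDataFrame, so no row has to be pulled out of the RDD.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val sc = new SparkContext(new SparkConf().setAppName("rows-to-df").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Hypothetical schema; in practice it mirrors whatever the rows actually contain.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))

val rdd = sc.parallelize(Seq(Row(1, "a"), Row(2, "b")))

// Passing the schema explicitly avoids pulling a row out with rdd.first().
val df = sqlContext.createDataFrame(rdd, schema)
df.show()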

Re: Optimizations

2015-07-03 Thread Marius Danciu
. Then run a map operation to perform the > join and whatever else you need to do. This will remove a shuffle stage but > you will still have to collect the joined RDD and broadcast it. All depends > on the size of your data if it’s worth it or not. > > From: Marius Danciu > Date:
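
A minimal sketch of the broadcast approach described above, with placeholder RDD names and value types: the smaller side is collected and broadcast, and the join plus any follow-up work run in a single map stage with no shuffle.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("broadcast-join").setMaster("local[*]"))

// Placeholder data sets; "small" is the side assumed to fit in memory.
val large = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
val small = sc.parallelize(Seq((1, 10.0), (3, 30.0)))

// Collect the small side and broadcast it, so no shuffle is needed for the join.
val smallMap = sc.broadcast(small.collectAsMap())

// The join and the downstream per-partition work happen in one map stage.
val joined = large.mapPartitions { iter =>
  iter.flatMap { case (k, v) =>
    smallMap.value.get(k).map(w => (k, (v, w))) // inner-join semantics
  }
}

joined.collect().foreach(println)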

Optimizations

2015-07-03 Thread Marius Danciu
Hi all, If I have something like: rdd.join(...).mapPartitionsToPair(...) It looks like mapPartitionsToPair runs in a different stage than the join. Is there a way to piggyback this computation inside the join stage? ... such that each result partition after the join is passed to the mapPartitionsToPair function

Re: Spark partitioning question

2015-05-05 Thread Marius Danciu
Turned out that it was sufficient to do repartitionAndSortWithinPartitions ... so far so good ;) On Tue, May 5, 2015 at 9:45 AM Marius Danciu wrote: > Hi Imran, > > Yes that's what MyPartitioner does. I do see (using traces from > MyPartitioner) that the key is partitioned o
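
A minimal sketch of that call, with a simple modulo partitioner standing in for MyPartitioner: repartitionAndSortWithinPartitions routes each record to the partition chosen by the partitioner and sorts by key within each partition in one shuffle.

import org.apache.spark.{Partitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("repartition-sort").setMaster("local[*]"))

// Hypothetical stand-in for MyPartitioner: routes keys by a simple modulo.
class ModPartitioner(parts: Int) extends Partitioner {
  def numPartitions: Int = parts
  def getPartition(key: Any): Int = math.abs(key.hashCode) % parts
}

val pairs = sc.parallelize(Seq((5, "e"), (1, "a"), (3, "c"), (2, "b")))

// One shuffle: data lands in the partitions chosen by the partitioner,
// already sorted by key within each partition.
val partitionedAndSorted =
  pairs.repartitionAndSortWithinPartitions(new ModPartitioner(2))

partitionedAndSorted.glom().collect().foreach(p => println(p.mkString(", ")))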

Re: Spark partitioning question

2015-05-04 Thread Marius Danciu
the same, but most probably close enough, and avoids doing > another expensive shuffle). If you can share a bit more information on > your partitioner, and what properties you need for your "f", that might > help. > > thanks, > Imran > > > On Tue, Apr 28, 2015

Re: Spark partitioning question

2015-04-28 Thread Marius Danciu
need to sort and repartition, try using > repartitionAndSortWithinPartitions to do it in one shot. > > Thanks, > Silvio > > From: Marius Danciu > Date: Tuesday, April 28, 2015 at 8:10 AM > To: user > Subject: Spark partitioning question > >

Spark partitioning question

2015-04-28 Thread Marius Danciu
Hello all, I have the following Spark (pseudo)code: rdd = mapPartitionsWithIndex(...) .mapPartitionsToPair(...) .groupByKey() .sortByKey(comparator) .partitionBy(myPartitioner) .mapPartitionsWithIndex(...) .mapPartitionsToPair( *f* ) The input data
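
A hypothetical Scala rendering of the pipeline above (the mapPartitionsToPair steps become pair-producing mapPartitions transformations, and a HashPartitioner stands in for myPartitioner). Note that partitionBy after sortByKey shuffles the data again, so the sorted order within the new partitions is not guaranteed, which is what the repartitionAndSortWithinPartitions suggestion in the replies addresses.

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("partitioning-question").setMaster("local[*]"))
val myPartitioner = new HashPartitioner(4) // stand-in for the custom partitioner

val input = sc.parallelize(1 to 20, 4)

val result = input
  .mapPartitionsWithIndex { (idx, iter) => iter.map(x => (x % 5, s"p$idx-$x")) } // key the records
  .groupByKey()
  .sortByKey()                 // total sort by key across partitions
  .partitionBy(myPartitioner)  // shuffles again; per-partition sort order is not preserved
  .mapPartitionsWithIndex { (idx, iter) =>
    // f: whatever per-partition processing needs the grouped keys
    iter.map { case (k, vs) => (idx, k, vs.size) }
  }

result.collect().foreach(println)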

Re: Shuffle question

2015-04-22 Thread Marius Danciu
Thank you Iulian! That's precisely what I discovered today. Best, Marius On Wed, Apr 22, 2015 at 3:31 PM Iulian Dragoș wrote: > On Tue, Apr 21, 2015 at 2:38 PM, Marius Danciu > wrote: > >> Hello anyone, >> >> I have a question regarding the sort shuffle. Roughly

Re: Shuffle question

2015-04-22 Thread Marius Danciu
Anyone? On Tue, Apr 21, 2015 at 3:38 PM Marius Danciu wrote: > Hello anyone, > > I have a question regarding the sort shuffle. Roughly I'm doing something > like: > > rdd.mapPartitionsToPair(f1).groupByKey().mapPartitionsToPair(f2) > > The problem is that in

Shuffle question

2015-04-21 Thread Marius Danciu
Hello anyone, I have a question regarding the sort shuffle. Roughly I'm doing something like: rdd.mapPartitionsToPair(f1).groupByKey().mapPartitionsToPair(f2) The problem is that in f2 I don't see the keys being sorted. The keys are Java Comparable, not scala.math.Ordered or scala.math.Ordering
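
A minimal sketch, with a hypothetical key class, of bridging a Java Comparable key into the scala.math.Ordering that Spark's sorting operators need. groupByKey alone makes no ordering guarantee, so an explicit sort (sortByKey, or repartitionAndSortWithinPartitions as below) is required before f2 can rely on sorted keys.

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical key type that only implements java.lang.Comparable.
case class MyKey(id: Int) extends Comparable[MyKey] {
  override def compareTo(other: MyKey): Int = id.compareTo(other.id)
}

// Bridge the Java Comparable into the scala.math.Ordering that Spark's
// sorting operators require.
implicit val myKeyOrdering: Ordering[MyKey] = new Ordering[MyKey] {
  def compare(a: MyKey, b: MyKey): Int = a.compareTo(b)
}

val sc = new SparkContext(new SparkConf().setAppName("sorted-keys").setMaster("local[*]"))
val pairs = sc.parallelize(Seq(MyKey(3) -> "c", MyKey(1) -> "a", MyKey(2) -> "b"))

// groupByKey alone does not sort; sort explicitly instead.
val sortedWithinPartitions =
  pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))

sortedWithinPartitions.mapPartitions { iter =>
  // f2 now sees keys in sorted order within each partition.
  iter.map { case (k, v) => s"${k.id} -> $v" }
}.collect().foreach(println)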