Re: Union of RDDs without the overhead of Union

2016-02-02 Thread Rishi Mishra
Agree with Koert that UnionRDD should have only narrow dependencies. However, a union of two RDDs does increase the number of tasks to be executed (rdd1.partitions + rdd2.partitions). If your two RDDs have the same number of partitions, you can also use zipPartitions, which results in fewer tasks, he
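
The task-count point above can be illustrated with a toy model. This is a minimal sketch in plain Python, not Spark code: an "RDD" is modeled as a list of partitions, and the helper names `union` and `zip_partitions` are hypothetical stand-ins for the behavior described in the message (one task per partition is assumed).

```python
from itertools import chain

def union(parts_a, parts_b):
    """Toy model of a union: every input partition is kept as-is,
    so the result has len(parts_a) + len(parts_b) partitions."""
    return [list(p) for p in parts_a] + [list(p) for p in parts_b]

def zip_partitions(parts_a, parts_b):
    """Toy model of zipPartitions with a concatenating function:
    partition i of the result is partition i of A followed by
    partition i of B, so the partition count stays the same."""
    if len(parts_a) != len(parts_b):
        raise ValueError("both inputs need the same number of partitions")
    return [list(chain(a, b)) for a, b in zip(parts_a, parts_b)]

# Hypothetical data: two "RDDs" with 2 partitions each.
rdd1 = [[1, 2], [3, 4]]
rdd2 = [[5, 6], [7, 8]]

unioned = union(rdd1, rdd2)
zipped = zip_partitions(rdd1, rdd2)

print(len(unioned))  # 4 partitions, i.e. 4 tasks in this model
print(len(zipped))   # 2 partitions, i.e. 2 tasks in this model
```

Both results hold the same elements; only the grouping into partitions differs, which is why the zipped variant needs fewer tasks in this model.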

Re: Union of RDDs without the overhead of Union

2016-02-02 Thread Koert Kuipers
i am surprised union introduces a stage. UnionRDD should have only narrow dependencies.

On Tue, Feb 2, 2016 at 11:25 PM, Koert Kuipers wrote:
> well the "hadoop" way is to save to a/b and a/c and read from a/* :)
>
> On Tue, Feb 2, 2016 at 11:05 PM, Jerry Lam wrote:
>
>> Hi Spark users and deve

Re: Union of RDDs without the overhead of Union

2016-02-02 Thread Koert Kuipers
well the "hadoop" way is to save to a/b and a/c and read from a/* :)

On Tue, Feb 2, 2016 at 11:05 PM, Jerry Lam wrote:
> Hi Spark users and developers,
>
> anyone knows how to union two RDDs without the overhead of it?
>
> say rdd1.union(rdd2).saveAsTextFile(..)
> This requires a stage to union th
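
The directory layout suggested here can be sketched with the local filesystem. This is only an analogy in plain Python, not Spark or Hadoop code: the helpers `save_lines` and `read_lines`, the `part-00000` file name, and the sample data are all assumptions made for illustration.

```python
import glob
import os
import tempfile

def save_lines(directory, lines):
    """Write one dataset into its own output directory as a single part file."""
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, "part-00000"), "w") as f:
        for line in lines:
            f.write(line + "\n")

def read_lines(pattern):
    """Read every part file from every directory matching the glob pattern."""
    result = []
    for directory in sorted(glob.glob(pattern)):
        for path in sorted(glob.glob(os.path.join(directory, "part-*"))):
            with open(path) as f:
                result.extend(line.rstrip("\n") for line in f)
    return result

root = tempfile.mkdtemp()

# Each dataset is saved separately, with no union step in between.
save_lines(os.path.join(root, "a", "b"), ["x", "y"])
save_lines(os.path.join(root, "a", "c"), ["z"])

# A later reader sees both outputs through one glob, a/*.
combined = read_lines(os.path.join(root, "a", "*"))
print(combined)  # ['x', 'y', 'z']
```

The design choice is that the combination happens at read time through the path pattern, so neither write has to know about the other.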

Union of RDDs without the overhead of Union

2016-02-02 Thread Jerry Lam
Hi Spark users and developers, does anyone know how to union two RDDs without the overhead of the union? Say rdd1.union(rdd2).saveAsTextFile(..). This requires a stage to union the 2 RDDs before saveAsTextFile (2 stages). Is there a way to skip the union step but have the contents of the two RDDs saved to the s