Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Punit Naik
Hi, I wanted to change the functioning of the "zipWithIndex" function for Spark RDDs so that the output of the function is, just as an example, "(data, prev_index + data.length)" instead of "(data, prev_index + 1)". How can I do this? -- Thank You Regards Punit Naik
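The indexing scheme asked for above can be sketched in plain Scala (no Spark; names are illustrative, not an actual RDD API): instead of incrementing the index by 1 per element, each element's index is the running total of the lengths of all preceding elements.

```scala
object CumulativeZipWithIndex {
  // Pairs each string with the total length of the strings before it,
  // i.e. (data, prev_index + data.length) accumulated from the left.
  def zipWithCumulativeOffset(data: Seq[String]): Seq[(String, Int)] = {
    val offsets = data.scanLeft(0)(_ + _.length) // running sum of lengths
    data.zip(offsets)                            // zip drops the trailing total
  }
}
```

For `Seq("ab", "cde", "f")` this yields offsets 0, 2, 5: each element starts where the previous one's data ended.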

Re: Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Punit Naik
…asInstanceOf[ZippedWithIndexRDDPartition]
firstParent[T].iterator(split.prev, context).zipWithIndex.map { x => (x._1, split.startIndex + x._2) }
You can modify the second component of the tuple to take data.length into account. On Tue, Jun 28, 2016 at …
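A hypothetical plain-Scala stand-in for adapting the per-partition mapping quoted above (the `ZippedWithIndexRDD` machinery is simulated with nested sequences; names are assumptions): each partition's `startIndex` becomes the total data length of all earlier partitions, and within a partition the offset grows by each element's length rather than by 1.

```scala
object PartitionedOffsets {
  def withOffsets(partitions: Seq[Seq[String]]): Seq[Seq[(String, Int)]] = {
    // startIndex per partition: cumulative data length of preceding partitions
    val starts = partitions.scanLeft(0)((acc, p) => acc + p.map(_.length).sum)
    partitions.zip(starts).map { case (part, start) =>
      // within the partition, offsets accumulate element lengths from start
      part.zip(part.scanLeft(start)(_ + _.length))
    }
  }
}
```

In a real RDD the per-partition start offsets would have to be computed with a first pass (as `zipWithIndex` already does with counts), since partition data lengths are not known up front.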

Re: Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Punit Naik
29-Jun-2016 6:31 AM, "Ted Yu" wrote: Since data.length is variable, I am not sure whether mixing data.length and the index makes sense. Can you describe your use case in a bit more detail? Thanks. On Tue, Jun 28, 2016 at 11:34 AM, Punit Naik …

Spark Terasort Help

2016-07-08 Thread Punit Naik
…the one which is above is the latest one which is failing. Can anyone help me in designing the configuration or setting some properties which will not result in executors failing and will let the terasort complete? -- Thank You Regards Punit Naik

repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Punit Naik
ortByKey" after "repartitionAndSortWithinPartitions" be faster now that the individual partitions are sorted? -- Thank You Regards Punit Naik

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Punit Naik
…values per key, so not really secondary sort by itself. For secondary sort also check out: https://github.com/tresata/spark-sorted On Thu, Jul 14, 2016 at 1:09 PM, Punit Naik wrote: Hi guys, In my spark/scala code I am impleme…

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Punit Naik
…repartitionAndSortWithinPartitions you do not get much benefit from running sortByKey after repartitionAndSortWithinPartitions (because all the data will get shuffled again). On Thu, Jul 14, 2016 at 1:59 PM, Punit Naik wrote: Hi Koert, I have…

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Punit Naik
Can we increase the sorting speed of an RDD by doing a secondary sort first? On Thu, Jul 14, 2016 at 11:52 PM, Punit Naik wrote: Okay. Can't I supply the same partitioner I used for "repartitionAndSortWithinPartitions" as an argument to "sortByKey"? On…

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Punit Naik
I meant to say that first we can sort the individual partitions and then sort them again by merging: sort of a divide-and-conquer mechanism. Does sortByKey take care of all this internally? On Fri, Jul 15, 2016 at 12:08 AM, Punit Naik wrote: Can we increase the sorting speed of RDD by doing…

Re: repartitionAndSortWithinPartitions HELP

2016-07-15 Thread Punit Naik
Okay, that clears my doubt! Thanks a lot. On 15-Jul-2016 7:43 PM, "Koert Kuipers" wrote: Spark's shuffle mechanism takes care of this kind of optimization internally when you use the sort-based shuffle (which is the default). On Thu, Jul 14, 2016 at 2:57 PM, Punit Naik wrote: …
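The divide-and-conquer idea discussed in this thread (sort partitions individually, then produce a global order by merging the sorted runs) is what the sort-based shuffle does internally. A minimal plain-Scala sketch of that merge step, not Spark's actual implementation:

```scala
object MergeSortedRuns {
  // k-way merge of already-sorted runs into one globally sorted sequence
  def merge(runs: Seq[Seq[Int]]): Seq[Int] = {
    val active = scala.collection.mutable.ArrayBuffer(
      runs.map(_.iterator.buffered).filter(_.hasNext): _*)
    val out = scala.collection.mutable.ArrayBuffer.empty[Int]
    while (active.nonEmpty) {
      // pick the run whose next (head) element is smallest
      val i = active.indices.minBy(j => active(j).head)
      out += active(i).next()
      if (!active(i).hasNext) active.remove(i)
    }
    out.toSeq
  }
}
```

Because each run is consumed through an iterator, only the heads need comparing at any time, which is why merging pre-sorted partitions is cheaper than re-sorting everything.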

Partition RDD based on K-Means Clusters

2016-09-15 Thread Punit Naik
…so that the number of partitions created is equal to the number of clusters (2 in this case) and each partition has all the elements belonging to a certain cluster in it. -- Thank You Regards Punit Naik
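In Spark this would typically be a custom `Partitioner` with `numPartitions` equal to the number of clusters, whose `getPartition` returns the cluster id of the key. A plain-Scala sketch of just the assignment logic (1-D points and names are hypothetical; no Spark involved):

```scala
object ClusterPartition {
  // index of the closest centroid for a 1-D point
  def nearest(centroids: Seq[Double], x: Double): Int =
    centroids.indices.minBy(i => math.abs(centroids(i) - x))

  // one "partition" per cluster: every element lands in the group
  // of its nearest centroid, mirroring getPartition(key) = clusterId
  def partitionByCluster(centroids: Seq[Double],
                         data: Seq[Double]): Map[Int, Seq[Double]] =
    data.groupBy(x => nearest(centroids, x))
}
```

With an RDD, the same idea is usually expressed by keying each element with its predicted cluster and calling `partitionBy` with the custom partitioner.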

Spark to HBase Fast Bulk Upload

2016-09-19 Thread Punit Naik
Hi Guys, I have a huge dataset (~1 TB) which has about a billion records. I have to transfer it to an HBase table. What is the fastest way of doing it? -- Thank You Regards Punit Naik
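The usual fast path here is to write HFiles from Spark and bulk-load them into HBase rather than issuing individual Puts; HFiles require rows in byte-lexicographic rowkey order, so the data must be sorted by key first. A plain-Scala sketch of that ordering step only (names hypothetical, HBase APIs omitted):

```scala
object RowKeyOrder {
  // byte-wise unsigned lexicographic comparison, the order HBase
  // uses for rowkeys
  def compareKeys(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val cmp = (a(i) & 0xff) - (b(i) & 0xff) // unsigned byte compare
      if (cmp != 0) return cmp
      i += 1
    }
    a.length - b.length // shorter key sorts first on a shared prefix
  }

  def sortByRowKey(rows: Seq[(Array[Byte], String)]): Seq[(Array[Byte], String)] =
    rows.sortWith((x, y) => compareKeys(x._1, y._1) < 0)
}
```

In a real job, the sort would be done distributed (e.g. a range-partitioned sort of the RDD), the output written via HBase's HFile output format, and the resulting files handed to HBase's bulk-load tool.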

Executor Lost error

2016-10-03 Thread Punit Naik
…--conf "spark.driver.maxResultSize=16g" --conf "spark.driver.cores=10" --conf "spark.driver.memory=10g". Can anyone tell me any more configs to circumvent this "executor lost" and "executor lost failure" error? -- Thank You Regards Punit Naik
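Executor losses on YARN are often the cluster manager killing containers that exceed their memory allotment, so raising the off-heap overhead and the network timeout is a common first step. An illustrative spark-submit fragment, with era-appropriate (Spark 1.x/2.0) config keys and placeholder values that would need tuning for the actual job:

```shell
spark-submit \
  --conf "spark.yarn.executor.memoryOverhead=4096" \  # off-heap headroom (MB) per executor
  --conf "spark.network.timeout=600s" \               # tolerate long GC pauses before marking lost
  --conf "spark.executor.heartbeatInterval=60s" \     # must stay well below spark.network.timeout
  ...
```

The executor stderr and the YARN NodeManager logs usually say whether the container was killed for exceeding memory limits, which distinguishes an overhead problem from a genuine crash.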