Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Punit Naik
Hi, I wanted to change the functioning of the "zipWithIndex" function for Spark RDDs so that the output of the function is, just as an example, "(data, prev_index + data.length)" instead of "(data, prev_index + 1)". How can I do this? -- Thank You Regards Punit Naik
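The indexing scheme asked for above can be sketched in plain Scala (no Spark; names are illustrative, not an actual RDD API): instead of incrementing the index by 1 per element, each element's index is the running total of the lengths of all preceding elements.

```scala
object CumulativeZipWithIndex {
  // Pairs each string with the total length of the strings before it,
  // i.e. (data, prev_index + data.length) accumulated from the left.
  def zipWithCumulativeOffset(data: Seq[String]): Seq[(String, Int)] = {
    val offsets = data.scanLeft(0)(_ + _.length) // running sum of lengths
    data.zip(offsets)                            // zip drops the trailing total
  }
}
```

For `Seq("ab", "cde", "f")` this yields offsets 0, 2, 5: each element starts where the previous one's data ended.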

Re: Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Punit Naik
…asInstanceOf[ZippedWithIndexRDDPartition]
firstParent[T].iterator(split.prev, context).zipWithIndex.map { x => (x._1, split.startIndex + x._2) }
You can modify the second component of the tuple to take data.length into account. On Tue, Jun 28, 2016 at …
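A hypothetical plain-Scala stand-in for adapting the per-partition mapping quoted above (the `ZippedWithIndexRDD` machinery is simulated with nested sequences; names are assumptions): each partition's `startIndex` becomes the total data length of all earlier partitions, and within a partition the offset grows by each element's length rather than by 1.

```scala
object PartitionedOffsets {
  def withOffsets(partitions: Seq[Seq[String]]): Seq[Seq[(String, Int)]] = {
    // startIndex per partition: cumulative data length of preceding partitions
    val starts = partitions.scanLeft(0)((acc, p) => acc + p.map(_.length).sum)
    partitions.zip(starts).map { case (part, start) =>
      // within the partition, offsets accumulate element lengths from start
      part.zip(part.scanLeft(start)(_ + _.length))
    }
  }
}
```

In a real RDD the per-partition start offsets would have to be computed with a first pass (as `zipWithIndex` already does with counts), since partition data lengths are not known up front.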

Re: Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Punit Naik
29-Jun-2016 6:31 AM, "Ted Yu" wrote: Since data.length is variable, I am not sure whether mixing data.length and the index makes sense. Can you describe your use case in a bit more detail? Thanks. On Tue, Jun 28, 2016 at 11:34 AM, Punit Naik …

Spark Terasort Help

2016-07-08 Thread Punit Naik
…the one which is above is the latest one which is failing. Can anyone help me in designing the configuration or setting some properties which will not result in executors failing and will let the terasort complete? -- Thank You Regards Punit Naik

repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Punit Naik
ortByKey" after "repartitionAndSortWithinPartitions" be faster now that the individual partitions are sorted? -- Thank You Regards Punit Naik

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Punit Naik
…values per key, so not really secondary sort by itself. For secondary sort also check out: https://github.com/tresata/spark-sorted On Thu, Jul 14, 2016 at 1:09 PM, Punit Naik wrote: Hi guys, In my spark/scala code I am impleme…

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Punit Naik
…repartitionAndSortWithinPartitions you do not get much benefit from running sortByKey after repartitionAndSortWithinPartitions (because all the data will get shuffled again). On Thu, Jul 14, 2016 at 1:59 PM, Punit Naik wrote: Hi Koert, I have…

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Punit Naik
Can we increase the sorting speed of an RDD by doing a secondary sort first? On Thu, Jul 14, 2016 at 11:52 PM, Punit Naik wrote: Okay. Can't I supply the same partitioner I used for "repartitionAndSortWithinPartitions" as an argument to "sortByKey"? On…

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Punit Naik
I meant to say that first we can sort the individual partitions and then sort them again by merging: sort of a divide-and-conquer mechanism. Does sortByKey take care of all this internally? On Fri, Jul 15, 2016 at 12:08 AM, Punit Naik wrote: Can we increase the sorting speed of RDD by doing…

Re: repartitionAndSortWithinPartitions HELP

2016-07-15 Thread Punit Naik
Okay, that clears my doubt! Thanks a lot. On 15-Jul-2016 7:43 PM, "Koert Kuipers" wrote: Spark's shuffle mechanism takes care of this kind of optimization internally when you use the sort-based shuffle (which is the default). On Thu, Jul 14, 2016 at 2:57 PM, Punit Naik wrote: …
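The divide-and-conquer idea discussed in this thread (sort partitions individually, then produce a global order by merging the sorted runs) is what the sort-based shuffle does internally. A minimal plain-Scala sketch of that merge step, not Spark's actual implementation:

```scala
object MergeSortedRuns {
  // k-way merge of already-sorted runs into one globally sorted sequence
  def merge(runs: Seq[Seq[Int]]): Seq[Int] = {
    val active = scala.collection.mutable.ArrayBuffer(
      runs.map(_.iterator.buffered).filter(_.hasNext): _*)
    val out = scala.collection.mutable.ArrayBuffer.empty[Int]
    while (active.nonEmpty) {
      // pick the run whose next (head) element is smallest
      val i = active.indices.minBy(j => active(j).head)
      out += active(i).next()
      if (!active(i).hasNext) active.remove(i)
    }
    out.toSeq
  }
}
```

Because each run is consumed through an iterator, only the heads need comparing at any time, which is why merging pre-sorted partitions is cheaper than re-sorting everything.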

Partition RDD based on K-Means Clusters

2016-09-15 Thread Punit Naik
…so that the number of partitions created is equal to the number of clusters (2 in this case) and each partition has all the elements belonging to a certain cluster in it. -- Thank You Regards Punit Naik
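In Spark this would typically be a custom `Partitioner` with `numPartitions` equal to the number of clusters, whose `getPartition` returns the cluster id of the key. A plain-Scala sketch of just the assignment logic (1-D points and names are hypothetical; no Spark involved):

```scala
object ClusterPartition {
  // index of the closest centroid for a 1-D point
  def nearest(centroids: Seq[Double], x: Double): Int =
    centroids.indices.minBy(i => math.abs(centroids(i) - x))

  // one "partition" per cluster: every element lands in the group
  // of its nearest centroid, mirroring getPartition(key) = clusterId
  def partitionByCluster(centroids: Seq[Double],
                         data: Seq[Double]): Map[Int, Seq[Double]] =
    data.groupBy(x => nearest(centroids, x))
}
```

With an RDD, the same idea is usually expressed by keying each element with its predicted cluster and calling `partitionBy` with the custom partitioner.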

Spark to HBase Fast Bulk Upload

2016-09-19 Thread Punit Naik
Hi Guys, I have a huge dataset (~1 TB) which has about a billion records. I have to transfer it to an HBase table. What is the fastest way of doing it? -- Thank You Regards Punit Naik
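The usual fast path here is to write HFiles from Spark and bulk-load them into HBase rather than issuing individual Puts; HFiles require rows in byte-lexicographic rowkey order, so the data must be sorted by key first. A plain-Scala sketch of that ordering step only (names hypothetical, HBase APIs omitted):

```scala
object RowKeyOrder {
  // byte-wise unsigned lexicographic comparison, the order HBase
  // uses for rowkeys
  def compareKeys(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val cmp = (a(i) & 0xff) - (b(i) & 0xff) // unsigned byte compare
      if (cmp != 0) return cmp
      i += 1
    }
    a.length - b.length // shorter key sorts first on a shared prefix
  }

  def sortByRowKey(rows: Seq[(Array[Byte], String)]): Seq[(Array[Byte], String)] =
    rows.sortWith((x, y) => compareKeys(x._1, y._1) < 0)
}
```

In a real job, the sort would be done distributed (e.g. a range-partitioned sort of the RDD), the output written via HBase's HFile output format, and the resulting files handed to HBase's bulk-load tool.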

Executor Lost error

2016-10-03 Thread Punit Naik
…--conf "spark.driver.maxResultSize=16g" --conf "spark.driver.cores=10" --conf "spark.driver.memory=10g". Can anyone tell me any more configs to circumvent this "executor lost" and "executor lost failure" error? -- Thank You Regards Punit Naik
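Executor losses on YARN are often the cluster manager killing containers that exceed their memory allotment, so raising the off-heap overhead and the network timeout is a common first step. An illustrative spark-submit fragment, with era-appropriate (Spark 1.x/2.0) config keys and placeholder values that would need tuning for the actual job:

```shell
spark-submit \
  --conf "spark.yarn.executor.memoryOverhead=4096" \  # off-heap headroom (MB) per executor
  --conf "spark.network.timeout=600s" \               # tolerate long GC pauses before marking lost
  --conf "spark.executor.heartbeatInterval=60s" \     # must stay well below spark.network.timeout
  ...
```

The executor stderr and the YARN NodeManager logs usually say whether the container was killed for exceeding memory limits, which distinguishes an overhead problem from a genuine crash.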