Hi
I wanted to change the behaviour of the "zipWithIndex" function for Spark
RDDs so that the output of the function is, just as an example, "(data,
prev_index + data.length)" instead of "(data, prev_index + 1)".
How can I do this?
--
Thank You
Regards
Punit Naik
> val split = splitIn.asInstanceOf[ZippedWithIndexRDDPartition]
> firstParent[T].iterator(split.prev, context).zipWithIndex.map { x =>
>   (x._1, split.startIndex + x._2)
> }
> You can modify the second component of the tuple to take data.length into
> account.
>
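A minimal sketch of that idea outside of Spark's internals, assuming the records are Strings and reading "(data, prev_index + data.length)" as "each element is indexed by the total length of everything before it" (the helper name here is made up, not a Spark API), could look like:

  import org.apache.spark.rdd.RDD

  // Hypothetical helper: index each element by the cumulative length of the
  // elements that precede it, instead of by its position.
  def zipWithCumulativeLength(rdd: RDD[String]): RDD[(String, Long)] = {
    // Pass 1: total length contributed by each partition.
    val partitionSums: Array[Long] =
      rdd.mapPartitions(it => Iterator(it.map(_.length.toLong).sum)).collect()

    // Starting offset of a partition = sum of the lengths of all earlier partitions.
    val startOffsets: Array[Long] = partitionSums.scanLeft(0L)(_ + _)

    // Pass 2: walk each partition, advancing the index by data.length per record.
    rdd.mapPartitionsWithIndex { (pid, it) =>
      var offset = startOffsets(pid)
      it.map { data =>
        val indexed = (data, offset) // index = total length of all preceding elements
        offset += data.length        // the next element's index advances by data.length
        indexed
      }
    }
  }

Like zipWithIndex itself, this evaluates the RDD twice (once to size the partitions, once to emit the indices), so caching the input first helps.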
On 29-Jun-2016 6:31 AM, "Ted Yu" wrote:
> Since the data.length is variable, I am not sure whether mixing data.length
> and the index makes sense.
>
> Can you describe your use case in a bit more detail?
>
> Thanks
>
> On Tue, Jun 28, 2016 at 11:34 AM, Punit Naik
The one which is above is the latest one, which is failing.
Can anyone help me in designing the configuration or setting some properties
which will not result in executors failing and will let the terasort complete?
--
Thank You
Regards
Punit Naik
Would a "sortByKey" after "repartitionAndSortWithinPartitions" be faster now
that the individual partitions are sorted?
--
Thank You
Regards
Punit Naik
> ...values per key, so not
> really secondary sort by itself.
>
> for secondary sort also check out:
> https://github.com/tresata/spark-sorted
>
>
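For reference, the usual composite-key route to a secondary sort is to partition on the primary key only and let repartitionAndSortWithinPartitions order by the full (primary, secondary) key. A rough sketch, assuming (String, Int) keys and a made-up partitioner class:

  import org.apache.spark.Partitioner
  import org.apache.spark.rdd.RDD

  // Partition on the primary key only, so all records for a primary key land
  // in the same partition regardless of their secondary key.
  class PrimaryKeyPartitioner(partitions: Int) extends Partitioner {
    override def numPartitions: Int = partitions
    override def getPartition(key: Any): Int = {
      val (primary, _) = key.asInstanceOf[(String, Int)]
      math.abs(primary.hashCode) % partitions
    }
  }

  // Keys are (primary, secondary); sorting within partitions uses the tuple
  // ordering, so values come out grouped by primary key and ordered by secondary key.
  def secondarySort(rdd: RDD[((String, Int), String)], partitions: Int): RDD[((String, Int), String)] =
    rdd.repartitionAndSortWithinPartitions(new PrimaryKeyPartitioner(partitions))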
> On Thu, Jul 14, 2016 at 1:09 PM, Punit Naik
> wrote:
>
>> Hi guys
>>
>> In my Spark/Scala code I am implementing ...
> if you already did repartitionAndSortWithinPartitions you do not get much benefit from running
> sortByKey after repartitionAndSortWithinPartitions (because all the data
> will get shuffled again)
>
>
> On Thu, Jul 14, 2016 at 1:59 PM, Punit Naik
> wrote:
>
>> Hi Koert
>>
>> I have
Can we increase the sorting speed of an RDD by doing a secondary sort first?
On Thu, Jul 14, 2016 at 11:52 PM, Punit Naik wrote:
> Okay. Can't I supply the same partitioner I used for
> "repartitionAndSortWithinPartitions" as an argument to "sortByKey"?
>
I meant to say that first we can sort the individual partitions and then
sort them again by merging. Sort of a divide and conquer mechanism.
Does sortByKey take care of all this internally?
On Fri, Jul 15, 2016 at 12:08 AM, Punit Naik wrote:
> Can we increase the sorting speed of an RDD by doing a secondary sort first?
Okay that clears my doubt! Thanks a lot.
On 15-Jul-2016 7:43 PM, "Koert Kuipers" wrote:
spark's shuffle mechanism takes care of this kind of optimization
internally when you use the sort-based shuffle (which is the default).
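For context, sortByKey is essentially a range partition plus a sort of each partition during the shuffle, which is exactly the divide-and-conquer scheme described above. A rough sketch of that equivalence, assuming a pair RDD with an ordered key:

  import org.apache.spark.RangePartitioner
  import org.apache.spark.rdd.RDD

  // Roughly what rdd.sortByKey() does internally: range-partition the keys so
  // the partitions themselves are ordered, then sort inside each partition
  // while the shuffle runs.
  def globalSortSketch(rdd: RDD[(String, Int)]): RDD[(String, Int)] = {
    val partitioner = new RangePartitioner(rdd.partitions.length, rdd)
    rdd.repartitionAndSortWithinPartitions(partitioner)
  }

So a separate pre-sort of the partitions before sortByKey does not save any work; the sort-based shuffle already performs it.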
On Thu, Jul 14, 2016 at 2:57 PM, Punit Naik wrote:
... a partitioner so that the number of partitions created is equal to the
number of clusters (2 in this case) and each partition has all the elements
belonging to a certain cluster in it.
--
Thank You
Regards
Punit Naik
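For the clustering question above, a minimal sketch of such a partitioner, assuming the RDD is keyed by an integer cluster id (the class name is made up for illustration):

  import org.apache.spark.Partitioner

  // One partition per cluster: records with the same cluster id always map to
  // the same partition, so each partition holds exactly one cluster.
  class ClusterPartitioner(numClusters: Int) extends Partitioner {
    override def numPartitions: Int = numClusters
    override def getPartition(key: Any): Int = key.asInstanceOf[Int] % numClusters
  }

  // Usage with an RDD[(Int, Record)] keyed by cluster id (0 or 1 here):
  // val byCluster = keyedRdd.partitionBy(new ClusterPartitioner(2))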
Hi Guys
I have a huge dataset (~ 1TB) which has about a billion records. I have to
transfer it to an HBase table. What is the fastest way of doing it?
--
Thank You
Regards
Punit Naik
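For the HBase question, one common pattern is to write each partition through a BufferedMutator rather than issuing one Put at a time; a minimal sketch follows, with the table name, column family, and qualifier as placeholders. At ~1 TB, the HBase bulk-load path (writing HFiles with HFileOutputFormat2 and handing them to the region servers) is usually faster still, but it needs more setup:

  import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
  import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
  import org.apache.hadoop.hbase.util.Bytes
  import org.apache.spark.rdd.RDD

  // Sketch: write an RDD of (rowKey, value) pairs to an existing HBase table,
  // batching mutations per partition instead of one round trip per record.
  def writeToHBase(rdd: RDD[(String, String)]): Unit =
    rdd.foreachPartition { records =>
      val conf = HBaseConfiguration.create() // picks up hbase-site.xml from the classpath
      val connection = ConnectionFactory.createConnection(conf)
      val mutator = connection.getBufferedMutator(TableName.valueOf("my_table"))
      try {
        records.foreach { case (rowKey, value) =>
          val put = new Put(Bytes.toBytes(rowKey))
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
          mutator.mutate(put)
        }
        mutator.flush()
      } finally {
        mutator.close()
        connection.close()
      }
    }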
--conf "spark.driver.maxResultSize=16g"
--conf "spark.driver.cores=10"
--conf "spark.driver.memory=10g"
Can anyone tell me any more configs to circumvent these "executor lost" and
"executor lost failure" errors?
--
Thank You
Regards
Punit Naik
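For what it's worth, "executor lost" errors usually point at executor-side memory pressure rather than driver settings, so the knobs commonly tuned alongside the driver configs above are the executor memory, the memory overhead (the YARN-specific name is shown here), and the network timeout. The values below are placeholders to adjust, not recommendations:

--conf "spark.executor.memory=8g"
--conf "spark.executor.cores=4"
--conf "spark.yarn.executor.memoryOverhead=2048"
--conf "spark.network.timeout=600s"
--conf "spark.shuffle.io.maxRetries=10"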