serially, my tests are running fine with a shared SparkContext. That raises the
question: is anyone running suites in parallel with a shared SparkContext? If so,
I'd love to learn how, as it would cut down my test time considerably.
Thanks,
Ameet
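
For illustration only (not from this thread): since a SparkContext accepts job
submissions from multiple threads, one way to hold a single context that several
ScalaTest suites could share might look like the sketch below. The names and the
master URL are assumptions.

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.Suite

// Hypothetical sketch: one SparkContext per test JVM, reused by every suite.
object SharedSparkContext {
  lazy val sc: SparkContext =
    new SparkContext(new SparkConf().setMaster("local[4]").setAppName("shared-test-context"))
}

// Suites mix this in instead of creating their own context.
trait SharedSparkContext { self: Suite =>
  def sc: SparkContext = SharedSparkContext.sc
}
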
On Wed, Feb 19, 2014 at 11:26 AM, Ameet Kini
val rdd2 = rdd.partitionBy(my partitioner).reduceByKey(some function)
I see that rdd2's partitions are not internally sorted. Can someone confirm
that this is expected behavior? And if so, the only way to get partitions
internally sorted is to follow it with something like this
val rdd2 = rdd.par
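
The last code line above is cut off in the archive. Purely as a sketch, and
assuming the keys have an Ordering in scope, sorting each partition internally
after the shuffle might look like this (myPartitioner and someFunction stand in
for the "my partitioner" / "some function" placeholders above):

// Sketch only: sort each partition's records by key, keeping the partitioning.
val rdd2 = rdd
  .partitionBy(myPartitioner)
  .reduceByKey(someFunction)
  .mapPartitions(
    iter => iter.toSeq.sortBy(_._1).iterator,   // materializes one partition in memory
    preservesPartitioning = true)
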
val hrdd = sc.hadoopRDD(..)
val res = hrdd.partitionBy(myCustomPartitioner)
              .reduceByKey(..)
              .mapPartitionsWithIndex( some code to save those partitions )
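
A rough sketch of what the "some code to save those partitions" step might look
like, reusing the names above; the output directory, the text format, the
local-file approach, and myReduceFunction are assumptions for illustration only:

import java.io.{File, PrintWriter}

// Hypothetical sketch: each task writes its own partition to a local file
// on the worker it runs on.
val saved = hrdd
  .partitionBy(myCustomPartitioner)
  .reduceByKey(myReduceFunction)
  .mapPartitionsWithIndex { (idx, iter) =>
    val dir = new File("/tmp/partitions")
    dir.mkdirs()
    val out = new PrintWriter(new File(dir, s"part-$idx"))
    try iter.foreach { case (k, v) => out.println(s"$k\t$v") }
    finally out.close()
    Iterator.single(idx)          // return the partition index as a marker
  }
saved.count()                     // action to force the writes to run
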
I'm getting OutOfMemoryErrors on the read side of the partitionBy shuffle. My
custom partitioner generates over 20,000 partitions, so there are 20,000
or.scala:176)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:679)
Thanks,
Ameet
On Wed, Apr 9, 2014 at 10:48 PM, Ameet Kini wrote:
A typo - I meant section 2.1.2.5 "ulimit and nproc" of
https://hbase.apache.org/book.html
Ameet
On Fri, Apr 11, 2014 at 10:32 AM, Ameet Kini wrote:
>
> Turns out that my ulimit settings were too low. I bumped them up and the job
> completes successfully. Here's what
I don't think there is a setup() or cleanup() in Spark, but you can usually
achieve the same thing with mapPartitions, putting the "setup" code at the top
of the function you pass to mapPartitions and the "cleanup" code at the end.
The reason this usually works is that in Hadoop map/reduce, each map
task runs over an input split
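
A rough sketch of that pattern (DbConnection and its methods are hypothetical
stand-ins for whatever per-task resource needs opening and closing):

// Sketch of setup/cleanup via mapPartitions: the resource is opened and
// closed once per partition (i.e., once per task), not once per record.
val results = rdd.mapPartitions { iter =>
  val conn = DbConnection.open()           // "setup", runs once per partition
  val out = iter.map(record => conn.lookup(record)).toList
  // toList forces the lazy iterator before closing; fine if a partition fits in memory
  conn.close()                             // "cleanup", runs once per partition
  out.iterator
}
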
I'd like to override the logic of comparing keys for equality in
groupByKey. Kinda like how combineByKey allows you to pass in the combining
logic for "values", I'd like to do the same for keys.
My code looks like this:
val res = rdd.groupByKey(myPartitioner)
Here, rdd is of type RDD[(MyKey, MyValue)]
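
One common workaround, sketched here as an illustration rather than anything
from this thread: wrap the keys in a type whose equals/hashCode implement the
desired comparison, group on the wrapper, and unwrap afterwards. MyKey, MyValue,
and normalize are hypothetical, and the partitioner must also route wrap-equal
keys to the same partition.

// Sketch only: customize key equality by wrapping MyKey.
case class KeyWrapper(key: MyKey) {
  private def canonical = normalize(key)   // your notion of "equal" keys
  override def hashCode: Int = canonical.hashCode
  override def equals(other: Any): Boolean = other match {
    case KeyWrapper(k) => normalize(k) == canonical
    case _             => false
  }
}

val res = rdd
  .map { case (k, v) => (KeyWrapper(k), v) }    // wrap keys
  .groupByKey(myPartitioner)                    // groups using wrapper equality
  .map { case (wk, vs) => (wk.key, vs) }        // unwrap
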