Re: Equally weighted partitions in Spark

2014-05-15 Thread Syed A. Hashmi
I took a stab at it and wrote a partitioner that I intend to contribute back to the main repo some time later. The partitioner takes a parameter which governs the minimum number of keys per partition, and once all partition h
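The partitioner described above is not in the Spark repo, so the sketch below is purely illustrative: the class name `MinKeysPartitioner` and the `minKeysPerPartition` parameter are assumptions, and the routing logic is a plain hash rather than the key-counting scheme the message hints at.

```scala
import org.apache.spark.Partitioner

// Hypothetical sketch of the partitioner described in the message.
// A real implementation would track how many distinct keys each
// partition has received and only open a new partition once existing
// ones hold at least minKeysPerPartition keys.
class MinKeysPartitioner(partitions: Int, minKeysPerPartition: Int)
    extends Partitioner {
  require(partitions > 0, "need at least one partition")

  override def numPartitions: Int = partitions

  // Placeholder routing: hash keys across partitions, mapping
  // negative hash codes into the valid partition range.
  override def getPartition(key: Any): Int = {
    val h = key.hashCode % numPartitions
    if (h < 0) h + numPartitions else h
  }
}
```

It would be used like any custom partitioner, e.g. `pairRdd.partitionBy(new MinKeysPartitioner(8, 100))` on a `(K, V)` RDD.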

Re: Equally weighted partitions in Spark

2014-05-02 Thread Syed A. Hashmi
You can override the default partitioner with a range partitioner, which distributes data in roughly equal-sized partitions. On Thu, May 1, 2014 at 11:14 PM, deenar.toraskar wrote: > Yes > > On a
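A minimal sketch of the suggestion, assuming `sc` is an existing `SparkContext`; `partitionBy` is only defined on pair RDDs, so the data is keyed first:

```scala
import org.apache.spark.RangePartitioner

// RangePartitioner samples the keys and picks range boundaries so
// that the resulting partitions come out roughly equal in size.
val pairs = sc.parallelize(1 to 1000).map(i => (i, i.toString))
val ranged = pairs.partitionBy(new RangePartitioner(4, pairs))
```

Because the boundaries are chosen from a sample of the actual key distribution, skewed keys are spread more evenly than with the default hash partitioning.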

Re: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread Syed A. Hashmi
From the gist of it, it seems like you need to override the default partitioner to control how your data is distributed among partitions. Take a look at the different Partitioners available (Default, Range, Hash); if none of these gets you the desired result, you might want to provide your own. On Fri, Ma
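Providing your own partitioner means subclassing `Partitioner`; the grouping rule below (even keys to one partition, odd to the other) is an invented example to show the shape of the API, not anything from the thread:

```scala
import org.apache.spark.Partitioner

// Toy custom Partitioner: route even integer keys to partition 0 and
// odd keys to partition 1. Real partitioners encode whatever grouping
// the computation needs.
class ParityPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int =
    if (key.asInstanceOf[Int] % 2 == 0) 0 else 1
}
```

Applying it with `rdd.partitionBy(new ParityPartitioner)` guarantees that all keys satisfying the same predicate land in the same partition, which is the usual reason to go custom.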

Re: GC overhead limit exceeded

2014-03-28 Thread Syed A. Hashmi
Default is MEMORY_ONLY ... if you explicitly persist an RDD, you have to explicitly unpersist it if you want to free memory during the job. On Thu, Mar 27, 2014 at 11:17 PM, Sai Prasanna wrote: > Oh sorry, that was a mistake, the default level is MEMORY_ONLY !! > My doubt was, between two differe

Re: GC overhead limit exceeded

2014-03-27 Thread Syed A. Hashmi
Which storage scheme are you using? I am guessing it is MEMORY_ONLY. For large datasets, MEMORY_AND_DISK or MEMORY_AND_DISK_SER work better. You can call unpersist on an RDD to remove it from the cache, though. On Thu, Mar 27, 2014 at 11:57 AM, Sai Prasanna wrote: > No i am running on 0.8.1. > Yes i
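The persist/unpersist pattern from this thread looks roughly like the following; `sc` and the input path are assumptions for the sketch:

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical input path for illustration.
val lengths = sc.textFile("hdfs:///data/input.txt").map(_.length)

// MEMORY_AND_DISK spills partitions that do not fit in memory to disk
// instead of recomputing them; the _SER variants store serialized
// objects, trading CPU for a smaller memory footprint.
lengths.persist(StorageLevel.MEMORY_AND_DISK)

val total = lengths.reduce(_ + _)   // first action materializes the cache
val count = lengths.count()         // reuses the cached partitions

lengths.unpersist()                 // free the cached blocks mid-job
```

With plain MEMORY_ONLY, partitions that do not fit are dropped and recomputed on access, which is often the hidden cause of GC pressure on large datasets.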

Re: question about partitions

2014-03-24 Thread Syed A. Hashmi
RDD.coalesce should be fine for rebalancing data across all RDD partitions. Coalesce is pretty handy in situations where you have sparse data and want to compact it (e.g. data after applying a strict filter), or when you already know the number of partitions that is optimal for your cluster.
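A short sketch of the filter-then-compact case, assuming an existing `SparkContext` `sc`; the numbers are arbitrary:

```scala
// A strict filter leaves most of the original 100 partitions nearly
// empty.
val filtered = sc.parallelize(1 to 1000000, 100).filter(_ % 1000 == 0)

// coalesce merges partitions without a shuffle, which is cheap but
// only reduces the partition count.
val compact = filtered.coalesce(10)

// Pass shuffle = true to fully rebalance the data (or to increase
// the partition count), at the cost of a shuffle.
val rebalanced = filtered.coalesce(10, shuffle = true)
```

Without `shuffle = true`, coalesce only merges existing partitions, so badly skewed data stays skewed; the shuffle variant is the one that actually rebalances.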