Re: Using Spark on Data size larger than Memory size

2014-06-11 Thread Allen Chang
Thanks. We've run into timeout issues at scale as well. We were able to work around them by setting the following JVM options: -Dspark.akka.askTimeout=300 -Dspark.akka.timeout=300 -Dspark.worker.timeout=300 NOTE: these JVM options *must* be set on worker nodes (and not just the driver/master) for
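
A minimal sketch of the driver-side half of this, assuming Spark 1.x (the app name is hypothetical); per the note above, the same settings must also reach each worker JVM, e.g. through SPARK_JAVA_OPTS in conf/spark-env.sh on every worker:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("large-dataset-job")       // hypothetical app name
      .set("spark.akka.askTimeout", "300")   // seconds to wait on Akka ask replies
      .set("spark.akka.timeout", "300")      // general Akka communication timeout
      .set("spark.worker.timeout", "300")    // master marks workers dead after this
    val sc = new SparkContext(conf)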

Re: Using Spark on Data size larger than Memory size

2014-06-11 Thread Surendranauth Hiraman
My team has been using DISK_ONLY. The challenge with this approach is knowing when to unpersist if your job creates a lot of intermediate data. The "right solution" would be to mark a transient RDD as being capable of spilling to disk, rather than having to persist it to force this behavior. Hopefully
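
A minimal sketch of the DISK_ONLY pattern described above, with a hypothetical input path and key function:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(new SparkConf().setAppName("disk-only-sketch"))
    val input = sc.textFile("hdfs:///data/events")   // hypothetical path

    // Persist intermediate data straight to disk rather than caching in memory.
    val intermediate = input.map(line => (line.split(",")(0), line))
      .persist(StorageLevel.DISK_ONLY)

    val counts = intermediate.groupByKey().count()

    // The hard part the post mentions: unpersist only after every downstream
    // use of the intermediate RDD has finished.
    intermediate.unpersist()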

Re: Using Spark on Data size larger than Memory size

2014-06-10 Thread Allen Chang
Thanks for the clarification. What is the proper way to configure RDDs when your aggregate data size exceeds your available working memory size? In particular, in addition to typical operations, I'm performing cogroups, joins, and coalesces/shuffles. I see that the default storage level for RDD
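
For reference, a hedged sketch of one answer: MEMORY_AND_DISK keeps what fits in memory and spills the remaining partitions to disk rather than dropping and recomputing them, which suits shuffle-heavy jobs like these (the data here is a toy stand-in):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(new SparkConf().setAppName("storage-level-sketch"))
    val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))   // toy stand-ins for the
    val right = sc.parallelize(Seq(("a", 3)))             // real keyed datasets

    // Partitions that do not fit in memory spill to disk instead of being lost.
    val joined = left.cogroup(right).persist(StorageLevel.MEMORY_AND_DISK)
    joined.count()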

Re: Using Spark on Data size larger than Memory size

2014-06-07 Thread Vibhor Banga
Aaron, Thank you for your response and clarifying things. -Vibhor On Sun, Jun 1, 2014 at 11:40 AM, Aaron Davidson wrote: > There is no fundamental issue if you're running on data that is larger > than cluster memory size. Many operations can stream data through, and thus > memory usage is independent

Re: Using Spark on Data size larger than Memory size

2014-06-06 Thread Andrew Ash
If an individual partition becomes too large to fit in memory, then the usual approach would be to repartition into more partitions, so each one is smaller. Hopefully then it would fit. On Jun 6, 2014 5:47 PM, "Roger Hoover" wrote: > Andrew, > > Thank you. I'm using mapPartitions() but as you say,
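
A sketch of that repartitioning move (the path and partition counts are hypothetical, assuming an existing SparkContext sc):

    // 200 partitions proved too coarse in this scenario; 1000 makes each
    // partition roughly 5x smaller, at the cost of a shuffle.
    val rdd = sc.textFile("hdfs:///data/big", 200)   // hypothetical input
    val finer = rdd.repartition(1000)

    // repartition(n) is coalesce(n, shuffle = true); coalesce without the
    // shuffle can only reduce the partition count.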

Re: Using Spark on Data size larger than Memory size

2014-06-06 Thread Roger Hoover
Andrew, Thank you. I'm using mapPartitions() but as you say, it requires that every partition fit in memory. This will work for now but may not always, so I was wondering about another way. Thanks, Roger On Thu, Jun 5, 2014 at 5:26 PM, Andrew Ash wrote: > Hi Roger, > > You should be able

Re: Using Spark on Data size larger than Memory size

2014-06-05 Thread Andrew Ash
Hi Roger, You should be able to sort within partitions using the rdd.mapPartitions() method, and that shouldn't require holding all data in memory at once. It does require holding the entire partition in memory, though. Do you need to avoid ever holding an entire partition in memory at once? As far as
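
A minimal sketch of the per-partition sort being described (toy data, assuming an existing SparkContext sc); the iter.toArray call is exactly the caveat above: the whole partition, though never the whole dataset, is held in memory:

    val rdd = sc.parallelize(1 to 100, numSlices = 4)   // toy data, 4 partitions

    // Sort each partition independently; partition boundaries are unchanged.
    val sortedWithin = rdd.mapPartitions(
      iter => iter.toArray.sortBy(identity).iterator,
      preservesPartitioning = true)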

Re: Using Spark on Data size larger than Memory size

2014-06-05 Thread Roger Hoover
I think it would be very handy to be able to specify that you want sorting during a partitioning stage. On Thu, Jun 5, 2014 at 4:42 PM, Roger Hoover wrote: > Hi Aaron, > > When you say that sorting is being worked on, can you elaborate a little > more please? > > In particular, I want to sort the

Re: Using Spark on Data size larger than Memory size

2014-06-05 Thread Roger Hoover
Hi Aaron, When you say that sorting is being worked on, can you elaborate a little more please? In particular, I want to sort the items within each partition (not globally) without necessarily bringing them all into memory at once. Thanks, Roger On Sat, May 31, 2014 at 11:10 PM, Aaron Davidson

Re: Using Spark on Data size larger than Memory size

2014-05-31 Thread Aaron Davidson
There is no fundamental issue if you're running on data that is larger than cluster memory size. Many operations can stream data through, and thus memory usage is independent of input data size. Certain operations require an entire *partition* (not dataset) to fit in memory, but there are not many
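
A small illustration of the distinction being drawn, with hypothetical paths and an assumed SparkContext sc: map streams one record at a time, so its memory use does not grow with input size, unlike the per-partition sort sketched earlier in the thread:

    // Streams through: each record is read, transformed, and written without
    // accumulating state, so memory use is independent of total input size.
    sc.textFile("hdfs:///data/input")
      .map(_.length)
      .saveAsTextFile("hdfs:///data/lengths")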

Re: Using Spark on Data size larger than Memory size

2014-05-31 Thread Mayur Rustagi
Clearly there will be an impact on performance, but frankly it depends on what you are trying to achieve with the dataset. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga wrote: > Some

Re: Using Spark on Data size larger than Memory size

2014-05-30 Thread Vibhor Banga
Any input would be really helpful. Thanks, -Vibhor On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga wrote: > Hi all, > > I am planning to use Spark with HBase, where I generate an RDD by reading > data from an HBase table. > > I want to know, in the case when the size of the HBase table grows larger >
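
For context, one common way to build such an RDD in this era of Spark, sketched with a hypothetical table name and an assumed SparkContext sc:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")   // hypothetical table

    // Each record is one HBase row: (row key, column data).
    val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])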