RE: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-16 Thread Cheng, Hao
n jira. From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, September 17, 2015 12:28 AM To: Pete Robbins Cc: Dev Subject: Re: Unable to acquire memory errors in HiveCompatibilitySuite SparkEnv for the driver was created in SparkContext. The default parallelism field is set to the num

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-16 Thread Reynold Xin
SparkEnv for the driver was created in SparkContext. The default parallelism field is set to the number of slots (max number of active tasks). Maybe we can just use the default parallelism to compute that in local mode. On Wednesday, September 16, 2015, Pete Robbins wrote: > so forcing the Shuff
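A rough sketch of that idea, with illustrative names only: ask the SparkContext for its slot count in local mode instead of asking the JVM for its physical core count.

  // Sketch only, not Spark's actual code.
  def estimateTaskSlots(sc: org.apache.spark.SparkContext, numCores: Int): Int =
    if (numCores > 0) numCores
    else if (sc.isLocal) sc.defaultParallelism     // slot count for local[N]
    else Runtime.getRuntime.availableProcessors()  // current fallback behaviour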

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-16 Thread Pete Robbins
So forcing the ShuffleMemoryManager to assume 32 cores, and therefore calculate a page size of 1MB, passes the tests. How can we determine the correct value to use in getPageSize rather than Runtime.getRuntime.availableProcessors()? On 16 September 2015 at 10:17, Pete Robbins wrote: > I see what y
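For readers following along: the heuristic under discussion appears to divide the execution-memory pool by the core count and a safety factor, then round up to a power of two clamped between 1MB and 64MB. A sketch under that assumption, using the maxMemory value Pete reports further down the thread, reproduces both page sizes:

  // Assumed shape of the heuristic -- illustration only, not the real code.
  def pageSizeFor(maxMemory: Long, cores: Int): Long = {
    val minPage = 1L << 20                    // 1MB floor
    val maxPage = 64L << 20                   // 64MB ceiling
    val safetyFactor = 16
    val target = maxMemory / cores / safetyFactor
    val rounded = java.lang.Long.highestOneBit(target - 1) << 1  // next power of two
    math.min(maxPage, math.max(minPage, rounded))
  }

  pageSizeFor(515396075L, 8)    // 4194304 (4MB), what an 8-core box gets
  pageSizeFor(515396075L, 32)   // 1048576 (1MB), the forced 32-core case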

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-16 Thread Pete Robbins
I see what you are saying. Full stack trace:
java.io.IOException: Unable to acquire 4194304 bytes of memory
  at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
  at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalS

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-16 Thread Reynold Xin
Can you paste the entire stack trace of the error? In your original email you only included the last function call. Maybe I'm missing something here, but I still think the bad heuristic is the issue. Some operators pre-reserve memory before running anything in order to avoid starvation. For examp
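A minimal sketch of that pre-reservation pattern, with hypothetical names rather than Spark's actual classes: the operator claims its first page before touching any rows, so starvation surfaces immediately as the "Unable to acquire" error rather than partway through a query.

  trait MemoryReserver { def tryToAcquire(numBytes: Long): Long }

  class PreReservingOperator(mem: MemoryReserver, pageSize: Long) {
    // Grab one page up front; fail fast if the per-task share is already exhausted.
    private val granted = mem.tryToAcquire(pageSize)
    require(granted >= pageSize, s"Unable to acquire $pageSize bytes of memory")

    def processRow(row: Any): Unit = ()  // real work would go here
  }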

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-16 Thread Pete Robbins
ok so let me try again ;-) I don't think that the page size calculation matters apart from hitting the allocation limit earlier if the page size is too large. If a task is going to need X bytes, it is going to need X bytes. In this case, for at least one of the tasks, X > maxmemory/no_active_task
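Rough numbers, using figures that appear elsewhere in the thread (~491MB of execution memory and a peak of ~30 active tasks):

  val maxMemory = 515396075L                 // execution pool reported later in the thread
  val activeTasks = 30                       // peak concurrency Pete observed
  val perTaskCap = maxMemory / activeTasks   // ~17MB ceiling per task
  // A task already holding a few pages that then asks for one more 4194304-byte
  // page can exceed that ceiling even though the pool as a whole has free memory.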

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-15 Thread Reynold Xin
It is exactly the issue here, isn't it? We are using memory / N, where N should be the maximum number of active tasks. In the current master, we use the number of cores to approximate the number of tasks -- but it turned out to be a bad approximation in tests because it is set to 32 to increase co

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-15 Thread Pete Robbins
Oops... I meant to say "The page size calculation is NOT the issue here" On 16 September 2015 at 06:46, Pete Robbins wrote: > The page size calculation is the issue here as there is plenty of free > memory, although there is maybe a fair bit of wasted space in some pages. > It is that when we ha

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-15 Thread Pete Robbins
The page size calculation is the issue here as there is plenty of free memory, although there is maybe a fair bit of wasted space in some pages. It is that, when we have a lot of tasks, each is only allowed to reach 1/n of the available memory, and several of the tasks bump into that limit. With task
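Very roughly, the 1/n cap behaves like the simplified sketch below. This is not the real ShuffleMemoryManager code, which as I understand it also tries to guarantee each task at least 1/(2n) of the pool before giving up.

  // With n active tasks, no task is ever granted more than maxMemory / n in total.
  def grant(requested: Long, alreadyHeld: Long,
            maxMemory: Long, activeTasks: Int): Long = {
    val cap = maxMemory / activeTasks
    math.max(0L, math.min(requested, cap - alreadyHeld))
  }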

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-15 Thread Reynold Xin
Maybe we can change the heuristic in the memory calculation to use SparkContext.defaultParallelism if it is in local mode. On Tue, Sep 15, 2015 at 10:28 AM, Pete Robbins wrote: > Yes and at least there is an override by setting spark.sql.test.master to > local[8] , in fact local[16] worked on my 8 c
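defaultParallelism is attractive here because in local mode it already reflects the slot count: for a local[N] master it defaults to N unless spark.default.parallelism is set explicitly. A quick check, roughly:

  val sc = new org.apache.spark.SparkContext("local[32]", "parallelism-check")
  sc.defaultParallelism   // 32 -- the slot count, not the machine's core count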

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-15 Thread Pete Robbins
Yes, and at least there is an override by setting spark.sql.test.master to local[8]; in fact local[16] worked on my 8 core box. I'm happy to use this as a workaround, but the hard-coded 32 will make build/tests fail on a clean checkout if you only have 8 cores. On 15 September 2015 at 17:40, M
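For anyone hitting the same failure: the override works because TestHive reads the master URL from a system property (see the snippet Marcelo quotes below), so setting it before TestHive is first initialized is enough, e.g. by passing -Dspark.sql.test.master=local[8] to the test JVM, assuming the build forwards spark.* system properties to forked test JVMs as it normally does.

  // Programmatic equivalent of -Dspark.sql.test.master=local[8];
  // must run before TestHive is first referenced.
  System.setProperty("spark.sql.test.master", "local[8]")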

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-15 Thread Marcelo Vanzin
That test explicitly sets the number of executor cores to 32. object TestHive extends TestHiveContext( new SparkContext( System.getProperty("spark.sql.test.master", "local[32]"), On Mon, Sep 14, 2015 at 11:22 PM, Reynold Xin wrote: > Yea I think this is where the heuristics is faili
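The quoted definition, reformatted for readability and trimmed to the relevant part:

  object TestHive
    extends TestHiveContext(
      new SparkContext(
        System.getProperty("spark.sql.test.master", "local[32]")
        // ... remaining SparkContext arguments elided in the quote
      ))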

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-15 Thread Pete Robbins
This is the culprit: https://issues.apache.org/jira/browse/SPARK-8406 "2. Make `TestHive` use a local mode `SparkContext` with 32 threads to increase parallelism The major reason for this is that, the original parallelism of 2 is too low to reproduce the data loss issue. Also, higher concu

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-15 Thread Pete Robbins
Ok so it looks like the max number of active tasks reaches 30. I'm not setting anything as it is a clean environment with clean spark code checkout. I'll dig further to see why so many tasks are active. Cheers, On 15 September 2015 at 07:22, Reynold Xin wrote: > Yea I think this is where the he

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-14 Thread Reynold Xin
Yea I think this is where the heuristic is failing -- it uses 8 cores to approximate the number of active tasks, but the tests somehow are using 32 (maybe because it explicitly sets it to that, or you set it yourself? I'm not sure which one) On Mon, Sep 14, 2015 at 11:06 PM, Pete Robbins wrote:

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-14 Thread Pete Robbins
Reynold, thanks for replying.
getPageSize parameters: maxMemory=515396075, numCores=0
Calculated values: cores=8, default=4194304
So am I getting a large page size as I only have 8 cores? On 15 September 2015 at 00:40, Reynold Xin wrote: > Pete - can you do me a favor? > > > https://github.com
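Those values are consistent with a divide-by-cores-and-safety-factor heuristic:

  515396075L / 8 / 16   // = 4026531, just under 4MB
  // rounded up to the next power of two -> 4194304, the "default" printed above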

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-14 Thread Reynold Xin
Pete - can you do me a favor? https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/ShuffleMemoryManager.scala#L174 Print the parameters that are passed into the getPageSize function, and check their values. On Mon, Sep 14, 2015 at 4:32 PM, Reynold Xin wrote:
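For example, a throwaway trace line inside getPageSize, using the names that show up in Pete's reply above:

  println(s"getPageSize: maxMemory=$maxMemory numCores=$numCores cores=$cores default=$default")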

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-14 Thread Reynold Xin
Is this on latest master / branch-1.5? Out of the box we reserve only 16% (0.2 * 0.8) of the memory for execution (e.g. aggregate, join) / shuffle sorting. With a 3GB heap, that's 480MB. So each task gets 480MB / 32 = 15MB, and each operator reserves at least one page for execution. If your page s
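Working the arithmetic through -- it matches the maxMemory value Pete reports above:

  val heap = 3L * 1024 * 1024 * 1024          // 3GB heap
  val execution = (heap * 0.2 * 0.8).toLong   // = 515396075, the maxMemory Pete printed
  val perTask = execution / 32                // ~15MB per task with 32 slots
  perTask / 4194304                           // only ~3 default-sized 4MB pages fit under the cap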