Yeah, I think this is where the heuristic is failing -- it uses 8 cores to approximate the number of active tasks, but the tests are somehow using 32 (maybe because the test suite explicitly sets it to that, or you set it yourself? I'm not sure which).
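For what it's worth, here's a rough standalone sketch of what the getPageSize heuristic appears to be doing, based on the numbers quoted in this thread: divide the available execution memory by the core count and a safety factor, round up to the next power of two, and clamp. This is not the actual Spark source -- the 1MB/64MB bounds and the safety factor of 16 are my guesses, chosen because they reproduce the default=4194304 that Pete reported below.

object PageSizeSketch {
  // Round n up to the next power of two (result >= 2).
  private def nextPowerOf2(n: Long): Long =
    java.lang.Long.highestOneBit(math.max(n - 1, 1)) << 1

  def getPageSize(maxMemory: Long, numCores: Int): Long = {
    val minPageSize  = 1L << 20   // 1MB lower bound -- assumed
    val maxPageSize  = 64L << 20  // 64MB upper bound -- assumed
    val safetyFactor = 16         // assumed
    val cores =
      if (numCores > 0) numCores else Runtime.getRuntime.availableProcessors()
    val size = nextPowerOf2(maxMemory / cores / safetyFactor)
    math.min(maxPageSize, math.max(minPageSize, size))
  }

  def main(args: Array[String]): Unit = {
    // Pete's values from the quoted mail below: maxMemory=515396075, cores=8
    println(getPageSize(maxMemory = 515396075L, numCores = 8)) // prints 4194304 (4MB)
  }
}

So with only 8 cores and ~491MB of execution memory, the heuristic settles on a 4MB page, even though 32 concurrent tasks are actually sharing that memory.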
On Mon, Sep 14, 2015 at 11:06 PM, Pete Robbins <robbin...@gmail.com> wrote:

> Reynold, thanks for replying.
>
> getPageSize parameters: maxMemory=515396075, numCores=0
> Calculated values: cores=8, default=4194304
>
> So am I getting a large page size because I only have 8 cores?
>
> On 15 September 2015 at 00:40, Reynold Xin <r...@databricks.com> wrote:
>
>> Pete - can you do me a favor?
>>
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/ShuffleMemoryManager.scala#L174
>>
>> Print the parameters that are passed into the getPageSize function, and
>> check their values.
>>
>> On Mon, Sep 14, 2015 at 4:32 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Is this on latest master / branch-1.5?
>>>
>>> Out of the box we reserve only 16% (0.2 * 0.8) of the memory for
>>> execution (e.g. aggregate, join) / shuffle sorting. With a 3GB heap, that's
>>> about 480MB. So each task gets 480MB / 32 = 15MB, and each operator reserves
>>> at least one page for execution. If your page size is 4MB, it only takes 3
>>> operators to use up its memory.
>>>
>>> The thing is, page size is dynamically determined -- and in your case it
>>> should be smaller than 4MB.
>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/ShuffleMemoryManager.scala#L174
>>>
>>> Maybe there is a place in the maven tests where we explicitly set the
>>> page size (spark.buffer.pageSize) to 4MB? If yes, we need to find it and
>>> just remove it.
>>>
>>> On Mon, Sep 14, 2015 at 4:16 AM, Pete Robbins <robbin...@gmail.com> wrote:
>>>
>>>> I keep hitting errors running the tests on 1.5, such as:
>>>>
>>>> - join31 *** FAILED ***
>>>>   Failed to execute query using catalyst:
>>>>   Error: Job aborted due to stage failure: Task 9 in stage 3653.0
>>>>   failed 1 times, most recent failure: Lost task 9.0 in stage 3653.0 (TID
>>>>   123363, localhost): java.io.IOException: Unable to acquire 4194304 bytes
>>>>   of memory
>>>>     at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
>>>>
>>>> This is using the command:
>>>> build/mvn -Pyarn -Phadoop-2.2 -Phive -Phive-thriftserver test
>>>>
>>>> I don't see these errors in any of the amplab Jenkins builds. Do those
>>>> builds have any configuration/environment that I may be missing? My build
>>>> is running with whatever defaults are in the top-level pom.xml, e.g. -Xmx3G.
>>>>
>>>> I can make these tests pass by setting spark.shuffle.memoryFraction=0.6
>>>> in the HiveCompatibilitySuite rather than the default 0.2 value.
>>>>
>>>> Trying to analyze what is going on with the test, it appears to be related
>>>> to the number of active tasks, which rises to 32, so the
>>>> ShuffleMemoryManager allows less memory per task even though most of those
>>>> tasks do not have any memory allocated to them.
>>>>
>>>> Has anyone seen issues like this before?
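For anyone hitting this later, here is a quick, self-contained back-of-the-envelope check of the arithmetic in Reynold's reply above. The values come straight from this thread (the 3GB heap, the 0.2 * 0.8 fractions, 32 active tasks, 4MB pages); nothing here is read from a real SparkConf.

object ShuffleMemoryEstimate {
  def main(args: Array[String]): Unit = {
    val heapBytes             = 3L * 1024 * 1024 * 1024 // -Xmx3G from the top-level pom.xml
    val shuffleMemoryFraction = 0.2                     // spark.shuffle.memoryFraction default
    val safetyFraction        = 0.8                     // the other factor in the 0.2 * 0.8 above
    val activeTasks           = 32                      // observed in the failing tests
    val pageSize              = 4L * 1024 * 1024        // 4194304 bytes from the IOException

    val executionMemory = (heapBytes * shuffleMemoryFraction * safetyFraction).toLong
    val perTaskMemory   = executionMemory / activeTasks
    val pagesPerTask    = perTaskMemory / pageSize

    println(s"execution memory     : $executionMemory bytes (~${executionMemory >> 20} MiB)")
    println(s"per task (32 active) : $perTaskMemory bytes (~${perTaskMemory >> 20} MiB)")
    println(s"4MB pages per task   : $pagesPerTask")
  }
}

Note that 3GB * 0.2 * 0.8 works out to exactly the maxMemory=515396075 that Pete's log line reported, so the numbers are consistent: roughly 15MB per task when 32 are active, meaning three 4MB page allocations are enough to hit the "Unable to acquire 4194304 bytes of memory" failure.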