I wonder if it's worth logging a warning on executor startup if the heap size is >32GB? Given that this has come up a few times it might be good to embed this community knowledge into the product instead of email threads.
On Thu, Oct 16, 2014 at 7:47 AM, Akshat Aranya <aara...@gmail.com> wrote: > I just want to pitch in and say that I ran into the same problem with > running with 64GB executors. For example, some of the tasks take 5 minutes > to execute, out of which 4 minutes are spent in GC. I'll try out smaller > executors. > > On Mon, Oct 6, 2014 at 6:35 PM, Otis Gospodnetic < > otis.gospodne...@gmail.com> wrote: > >> Hi, >> >> The other option to consider is using G1 GC, which should behave better >> with large heaps. But pointers are not compressed in heaps > 32 GB in >> size, so you may be better off staying under 32 GB. >> >> Otis >> -- >> Monitoring * Alerting * Anomaly Detection * Centralized Log Management >> Solr & Elasticsearch Support * http://sematext.com/ >> >> >> On Mon, Oct 6, 2014 at 8:08 PM, Mingyu Kim <m...@palantir.com> wrote: >> >>> Ok, cool. This seems to be general issues in JVM with very large heaps. >>> I agree that the best workaround would be to keep the heap size below 32GB. >>> Thanks guys! >>> >>> Mingyu >>> >>> From: Arun Ahuja <aahuj...@gmail.com> >>> Date: Monday, October 6, 2014 at 7:50 AM >>> To: Andrew Ash <and...@andrewash.com> >>> Cc: Mingyu Kim <m...@palantir.com>, "user@spark.apache.org" < >>> user@spark.apache.org>, Dennis Lawler <dlaw...@palantir.com> >>> Subject: Re: Larger heap leads to perf degradation due to GC >>> >>> We have used the strategy that you suggested, Andrew - using many >>> workers per machine and keeping the heaps small (< 20gb). >>> >>> Using a large heap resulted in workers hanging or not responding >>> (leading to timeouts). The same dataset/job for us will fail (most often >>> due to akka disassociated or fetch failures errors) with 10 cores / 100 >>> executors, 60 gb per executor while succceed with 1 core / 1000 executors / >>> 6gb per executor. >>> >>> When the job does succceed with more cores per executor and larger heap >>> it is usually much slower than the smaller executors (the same 8-10 min job >>> taking 15-20 min to complete) >>> >>> The unfortunate downside of this has been, we have had some large >>> broadcast variables which may not fit into memory (and unnecessarily >>> duplicated) when using the smaller executors. >>> >>> Most of this is anecdotal but for the most part we have had more success >>> and consistency with more executors with smaller memory requirements. >>> >>> On Sun, Oct 5, 2014 at 7:20 PM, Andrew Ash <and...@andrewash.com> wrote: >>> >>>> Hi Mingyu, >>>> >>>> Maybe we should be limiting our heaps to 32GB max and running multiple >>>> workers per machine to avoid large GC issues. >>>> >>>> For a 128GB memory, 32 core machine, this could look like: >>>> >>>> SPARK_WORKER_INSTANCES=4 >>>> SPARK_WORKER_MEMORY=32 >>>> SPARK_WORKER_CORES=8 >>>> >>>> Are people running with large (32GB+) executor heaps in production? >>>> I'd be curious to hear if so. >>>> >>>> Cheers! >>>> Andrew >>>> >>>> On Thu, Oct 2, 2014 at 1:30 PM, Mingyu Kim <m...@palantir.com> wrote: >>>> >>>>> This issue definitely needs more investigation, but I just wanted to >>>>> quickly check if anyone has run into this problem or has general guidance >>>>> around it. We’ve seen a performance degradation with a large heap on a >>>>> simple map task (I.e. No shuffle). We’ve seen the slowness starting around >>>>> from 50GB heap. (I.e. spark.executor.memoty=50g) And, when we checked the >>>>> CPU usage, there were just a lot of GCs going on. >>>>> >>>>> Has anyone seen a similar problem? >>>>> >>>>> Thanks, >>>>> Mingyu >>>>> >>>> >>>> >>> >> >