We are seeing similar issues; did you find a solution? One option would be to increase the number of partitions if you're doing lots of object creation.
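To make the partition-count suggestion concrete, here is a small Spark-free sketch (pure Python, illustrative only; `keys_per_partition` is a made-up helper, and Python's built-in `hash` stands in for Spark's HashPartitioner) of how raising the shuffle partition count shrinks the number of distinct keys, and hence the hash-map growth, that any single reduce task must absorb:

```python
# Illustrative sketch, no Spark required: under hash partitioning, the
# distinct keys landing on any one reduce task shrink roughly in
# proportion to the partition count.
from collections import Counter

def keys_per_partition(keys, num_partitions):
    """Max distinct keys any single partition receives under hash partitioning."""
    counts = Counter(hash(k) % num_partitions for k in set(keys))
    return max(counts.values())

keys = range(100_000)  # pretend these are the distinct grouping keys
for p in (8, 64, 400):
    print(p, keys_per_partition(keys, p))
```

In Spark itself the equivalent knobs are the `numPartitions` argument accepted by operations like `groupByKey`/`reduceByKey`, or the `spark.default.parallelism` configuration setting.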
On Thu, Feb 12, 2015 at 7:26 PM, fightf...@163.com <fightf...@163.com> wrote:
> Hi, Patrick
>
> Really glad to get your reply.
> Yes, we are doing group-by operations in our job. We understand that
> growTable is commonly hit when processing large data sets.
>
> The actual question is: do we have any way to tune the initialCapacity
> ourselves, specifically for our application? Does Spark provide any
> config for achieving that?
>
> We know this is tricky to get working. We just want to know how it
> could be resolved, or whether there is some other approach we have not
> covered.
>
> Looking forward to your kind advice.
>
> Thanks,
> Sun.
>
> ------------------------------
> fightf...@163.com
>
>
> *From:* Patrick Wendell <pwend...@gmail.com>
> *Date:* 2015-02-12 16:12
> *To:* fightf...@163.com
> *CC:* user <u...@spark.apache.org>; dev <dev@spark.apache.org>
> *Subject:* Re: Re: Sort Shuffle performance issues about using
> AppendOnlyMap for large data sets
> The map will start with a capacity of 64, but will grow to accommodate
> new data. Are you using the groupBy operator in Spark or are you using
> Spark SQL's group by? This usually happens if you are grouping or
> aggregating in a way that doesn't sufficiently condense the data
> created from each input partition.
>
> - Patrick
>
> On Wed, Feb 11, 2015 at 9:37 PM, fightf...@163.com <fightf...@163.com>
> wrote:
> > Hi,
> >
> > We still have no adequate solution for this issue. Any analysis or
> > hints would be appreciated.
> >
> > Thanks,
> > Sun.
> >
> > ________________________________
> > fightf...@163.com
> >
> >
> > From: fightf...@163.com
> > Date: 2015-02-09 11:56
> > To: user; dev
> > Subject: Re: Sort Shuffle performance issues about using AppendOnlyMap
> > for large data sets
> > Hi,
> > The problem still exists. Would any experts take a look at this?
> >
> > Thanks,
> > Sun.
> >
> > ________________________________
> > fightf...@163.com
> >
> >
> > From: fightf...@163.com
> > Date: 2015-02-06 17:54
> > To: user; dev
> > Subject: Sort Shuffle performance issues about using AppendOnlyMap for
> > large data sets
> > Hi, all
> > Recently we hit performance issues when using Spark 1.2.0 to read
> > data from HBase and do some summary work.
> > Our scenario is: read large data sets from HBase (maybe 100 GB+),
> > form an hbaseRDD, transform it to a SchemaRDD, group by and aggregate
> > the data into much smaller summary data sets, then load the results
> > into HBase (Phoenix).
> >
> > Our major issue is that aggregating the large data sets into the
> > summary data sets takes too long (1 hour+), which seems unreasonably
> > slow. We have the heap dump attached, and stack traces from jstack
> > like the following:
> >
> > From the stack traces and dump file we can see that processing large
> > data sets causes the AppendOnlyMap to grow frequently, leading to a
> > huge map entry size. We have read the source code of
> > org.apache.spark.util.collection.AppendOnlyMap and found that the map
> > is initialized with a capacity of 64. That seems too small for our
> > use case.
> >
> > So the question is: has anyone encountered such issues before? How
> > was it resolved? We cannot find any JIRA issues for this problem; if
> > someone has seen one, please kindly let us know.
> >
> > More specifically: is there any way for users to define the map
> > capacity in Spark? If so, please tell us how to achieve that.
> >
> > Best Thanks,
> > Sun.
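A rough, Spark-free sketch of why a fixed initial capacity of 64 hurts at this scale: if the table doubles whenever a load factor is exceeded (the 0.7 load factor and the `grow_count` helper below are assumptions for illustration, not taken from the Spark source), a partition holding tens of millions of keys triggers on the order of twenty growTable rehashes, each of which copies every existing entry:

```python
# Back-of-the-envelope sketch: how many grow/rehash cycles a hash table
# needs when it starts at capacity 64 and doubles each time it fills past
# a load factor (0.7 here is an illustrative assumption).
def grow_count(n_keys, initial_capacity=64, load_factor=0.7):
    capacity, grows = initial_capacity, 0
    while n_keys > capacity * load_factor:
        capacity *= 2
        grows += 1
    return grows

# Tens of millions of keys per partition mean ~20 full rehashes of an
# ever-larger table, each copying every entry inserted so far.
print(grow_count(50_000_000))  # -> 21
```

This is why either condensing the data before the shuffle or spreading the keys over more partitions helps: both keep `n_keys` per map small, so the table rarely grows.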
> >
> > Thread 22432: (state = IN_JAVA)
> >  - org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87,
> >    line=224 (Compiled frame; information may be imprecise)
> >  - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable()
> >    @bci=1, line=38 (Interpreted frame)
> >  - org.apache.spark.util.collection.AppendOnlyMap.incrementSize()
> >    @bci=22, line=198 (Compiled frame)
> >  - org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object,
> >    scala.Function2) @bci=201, line=145 (Compiled frame)
> >  - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object,
> >    scala.Function2) @bci=3, line=32 (Compiled frame)
> >  - org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator)
> >    @bci=141, line=205 (Compiled frame)
> >  - org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator)
> >    @bci=74, line=58 (Interpreted frame)
> >  - org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext)
> >    @bci=169, line=68 (Interpreted frame)
> >  - org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext)
> >    @bci=2, line=41 (Interpreted frame)
> >  - org.apache.spark.scheduler.Task.run(long) @bci=77, line=56
> >    (Interpreted frame)
> >  - org.apache.spark.executor.Executor$TaskRunner.run() @bci=310,
> >    line=196 (Interpreted frame)
> >  - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker)
> >    @bci=95, line=1145 (Interpreted frame)
> >  - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5,
> >    line=615 (Interpreted frame)
> >  - java.lang.Thread.run() @bci=11, line=744 (Interpreted frame)
> >
> >
> > Thread 22431: (state = IN_JAVA)
> >  [identical stack to thread 22432, also in AppendOnlyMap.growTable()]
> >
> >
> > fightf...@163.com
> >
> > 1 attachment: dump.png (42K)
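Patrick's point upthread, that this usually happens when the grouping "doesn't sufficiently condense the data", can be illustrated with a Spark-free sketch: an aggregation with a map-side combiner (reduceByKey-style) shuffles at most one record per key per map task, while groupByKey-style grouping shuffles every input row, and it is that full volume of rows that drives AppendOnlyMap growth on the receiving side (the row counts below are made up for illustration):

```python
# Sketch of map-side combining: why reduceByKey-style aggregation shrinks
# the shuffled data while groupByKey-style grouping does not.
from collections import defaultdict

rows = [(i % 100, 1) for i in range(10_000)]   # 10,000 rows over 100 keys

# groupByKey-style: every input row is shuffled as-is.
shuffled_without_combine = len(rows)

# reduceByKey-style: partially aggregate per key before the shuffle,
# so at most one record per key leaves this map task.
combined = defaultdict(int)
for k, v in rows:
    combined[k] += v
shuffled_with_combine = len(combined)

print(shuffled_without_combine, shuffled_with_combine)   # 10000 vs 100
```

In Spark terms, preferring `reduceByKey`/`aggregateByKey` over `groupByKey` (when the aggregation allows it) applies exactly this condensation before the shuffle.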