Another data point on the 1.1.0 FetchFailures:

This SQL query runs fine on 1.0.2 but fails on 1.1.0 with the exceptions
mentioned earlier in this thread: "SELECT stringCol,
SUM(doubleCol) FROM parquetTable GROUP BY stringCol"
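
In case anyone wants to reproduce it, the job boils down to roughly the
following against the Spark 1.1 SQL API. This is only a sketch: the Parquet
path, app name, and context setup are illustrative, not taken from the
actual job.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  // Illustrative setup; the real job's configuration is not shown here.
  val sc = new SparkContext(new SparkConf().setAppName("fetch-failure-repro"))
  val sqlContext = new SQLContext(sc)

  // Register the Parquet data as a temporary table (path is hypothetical).
  sqlContext.parquetFile("hdfs:///path/to/parquetTable")
    .registerTempTable("parquetTable")

  // The aggregation whose shuffle hits the FetchFailures on 1.1.0.
  sqlContext.sql(
    "SELECT stringCol, SUM(doubleCol) FROM parquetTable GROUP BY stringCol"
  ).collect()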

The FetchFailure exception names the remote block manager that failed to
produce the shuffle output.  I enabled GC logging on the executors and
repeated the run, and the CoarseGrainedExecutorBackend JVM is just pounding
through back-to-back full GCs:

943.047: [Full GC [PSYoungGen: 5708288K->5536188K(8105472K)] [ParOldGen:
20971043K->20971202K(20971520K)] 26679331K->26507390K(29076992K)
[PSPermGen: 52897K->52897K(57344K)], 48.4514680 secs] [Times: user=602.38
sys=4.43, real=48.44 secs]
991.591: [Full GC [PSYoungGen: 5708288K->5591884K(8105472K)] [ParOldGen:
20971202K->20971044K(20971520K)] 26679490K->26562928K(29076992K)
[PSPermGen: 52897K->52897K(56832K)], 51.8109380 secs] [Times: user=645.44
sys=5.03, real=51.81 secs]
1043.431: [Full GC [PSYoungGen: 5708288K->5606238K(8105472K)] [ParOldGen:
20971044K->20971100K(20971520K)] 26679332K->26577339K(29076992K)
[PSPermGen: 52908K->52908K(56320K)], 85.9367800 secs] [Times: user=1074.29
sys=9.49, real=85.92 secs]
1129.419: [Full GC [PSYoungGen: 5708288K->5634246K(8105472K)] [ParOldGen:
20971100K->20971471K(20971520K)] 26679388K->26605717K(29076992K)
[PSPermGen: 52912K->52912K(55808K)], 52.2114100 secs] [Times: user=652.29
sys=4.94, real=52.21 secs]
1181.671: [Full GC [PSYoungGen: 5708288K->5656389K(8105472K)] [ParOldGen:
20971471K->20971125K(20971520K)] 26679759K->26627514K(29076992K)
[PSPermGen: 52961K->52961K(55296K)], 65.3284620 secs] [Times: user=818.58
sys=6.71, real=65.31 secs]
1247.034: [Full GC [PSYoungGen: 5708288K->5672356K(8105472K)] [ParOldGen:
20971125K->20971417K(20971520K)] 26679413K->26643774K(29076992K)
[PSPermGen: 52982K->52982K(54784K)], 91.2656940 secs] [Times: user=1146.94
sys=9.83, real=91.25 secs]
1338.318: [Full GC [PSYoungGen: 5708288K->5683177K(8105472K)] [ParOldGen:
20971417K->20971364K(20971520K)] 26679705K->26654541K(29076992K)
[PSPermGen: 52982K->52982K(54784K)], 68.9840690 secs] [Times: user=866.72
sys=7.31, real=68.97 secs]
1407.319: [Full GC [PSYoungGen: 5708288K->5691352K(8105472K)] [ParOldGen:
20971364K->20971041K(20971520K)] 26679652K->26662394K(29076992K)
[PSPermGen: 52985K->52985K(54272K)], 58.2522860 secs] [Times: user=724.33
sys=5.74, real=58.24 secs]
1465.572: [Full GC [PSYoungGen: 5708288K->5691382K(8105472K)] [ParOldGen:
20971041K->20971041K(20971520K)] 26679329K->26662424K(29076992K)
[PSPermGen: 52986K->52986K(54272K)], 17.8034740 secs] [Times: user=221.43
sys=0.72, real=17.80 secs]
1483.377: [Full GC [PSYoungGen: 5708288K->5691383K(8105472K)] [ParOldGen:
20971041K->20971041K(20971520K)] 26679329K->26662424K(29076992K)
[PSPermGen: 52987K->52987K(54272K)], 64.3194300 secs] [Times: user=800.32
sys=6.65, real=64.31 secs]
1547.700: [Full GC [PSYoungGen: 5708288K->5692228K(8105472K)] [ParOldGen:
20971041K->20971029K(20971520K)] 26679329K->26663257K(29076992K)
[PSPermGen: 52991K->52991K(53760K)], 54.8107170 secs] [Times: user=681.07
sys=5.41, real=54.80 secs]
1602.519: [Full GC [PSYoungGen: 5708288K->5695801K(8105472K)] [ParOldGen:
20971029K->20971401K(20971520K)] 26679317K->26667203K(29076992K)
[PSPermGen: 52993K->52993K(53760K)], 71.7970690 secs] [Times: user=896.22
sys=7.61, real=71.79 secs]
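
(In case it is useful: GC logging was turned on for the executors roughly
like this. A sketch only; spark.executor.extraJavaOptions is the standard
Spark 1.x setting and the flags are the usual HotSpot ones, but the rest of
the job's configuration is omitted.)

  import org.apache.spark.SparkConf

  // Sketch: pass standard HotSpot GC-logging flags to every executor JVM.
  val conf = new SparkConf()
    .set("spark.executor.extraJavaOptions",
         "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")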



I repeated the job, this time taking jmap -histo snapshots as it went along.
The last histogram I was able to capture before the JVM locked up (getting a
histogram from a JVM in a GC storm is very difficult) is here:

 num     #instances         #bytes  class name
----------------------------------------------
   1:      31437598     2779681704  [B
   2:      62794123     1507058952  scala.collection.immutable.$colon$colon
   3:      31387645     1506606960  org.apache.spark.sql.catalyst.expressions.Cast
   4:      31387645     1506606960  org.apache.spark.sql.catalyst.expressions.SumFunction
   5:      31387645     1255505800  org.apache.spark.sql.catalyst.expressions.Literal
   6:      31387645     1255505800  org.apache.spark.sql.catalyst.expressions.Coalesce
   7:      31387645     1255505800  org.apache.spark.sql.catalyst.expressions.MutableLiteral
   8:      31387645     1255505800  org.apache.spark.sql.catalyst.expressions.Add
   9:      31391224     1004519168  java.util.HashMap$Entry
  10:      31402978      756090664  [Ljava.lang.Object;
  11:      31395785      753498840  java.lang.Double
  12:      31387645      753303480  [Lorg.apache.spark.sql.catalyst.expressions.AggregateFunction;
  13:      31395808      502332928  org.apache.spark.sql.catalyst.expressions.GenericRow
  14:      31387645      502202320  org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToDouble$5
  15:           772      234947960  [Ljava.util.HashMap$Entry;
  16:           711      106309792  [I
  17:        106747       13673936  <methodKlass>
  18:        106747       12942856  <constMethodKlass>
  19:          8186        9037880  <constantPoolKlass>
  20:          8186        8085000  <instanceKlassKlass>
  21:        222494        5339856  scala.Tuple2
  22:        213731        5129544  java.lang.Long
  23:          6609        4754144  <constantPoolCacheKlass>
  24:         36254        3348992  [C
  25:            73        2393232  [Lscala.concurrent.forkjoin.ForkJoinTask;
  26:         88787        2130888  org.apache.spark.storage.ShuffleBlockId
  27:          4826        1812216  <methodDataKlass>
  28:          8688        1020352  java.lang.Class
  29:         15957         869016  [[I
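
(The histograms were collected roughly as follows, on the executor host named
in the FetchFailure. The pid lookup is just a sketch and <pid> is a
placeholder:

  jps | grep CoarseGrainedExecutorBackend   # find the executor's pid
  jmap -histo <pid>                         # heap class histogram

Plain -histo, rather than -histo:live, avoids forcing yet another full GC on
an already struggling heap.)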


For me, at least, the symptom is heavy GC storms on the executors.  Is anyone
seeing these FetchFailures on a consistent basis without also seeing heavy
GC?

Hope this helps with debugging.

Andrew


On Mon, Sep 22, 2014 at 3:36 AM, David Rowe <davidr...@gmail.com> wrote:

> I've run into this with large shuffles - I assumed that there was
> contention between the shuffle output files and the JVM for memory.
> Whenever we start getting these fetch failures, it corresponds with high
> load on the machines the blocks are being fetched from, and in some cases
> complete unresponsiveness (no ssh etc). Setting the timeout higher, or the
> JVM heap lower (as a percentage of total machine memory) seemed to help.
>
>
>
> On Mon, Sep 22, 2014 at 8:02 PM, Christoph Sawade <
> christoph.saw...@googlemail.com> wrote:
>
> > Hey all. We also had the same problem described by Nishkam, in almost the
> > same big data setting. We fixed the fetch failure by increasing the
> timeout
> > for acks in the driver:
> >
> > set("spark.core.connection.ack.wait.timeout", "600") // 10 minutes
> timeout
> > for acks between nodes
> >
> > Cheers, Christoph
> >
> > 2014-09-22 9:24 GMT+02:00 Hortonworks <zzh...@hortonworks.com>:
> >
> > > Actually I met similar issue when doing groupByKey and then count if
> the
> > > shuffle size is big e.g. 1tb.
> > >
> > > Thanks.
> > >
> > > Zhan Zhang
> > >
> > > Sent from my iPhone
> > >
> > > > On Sep 21, 2014, at 10:56 PM, Nishkam Ravi <nr...@cloudera.com>
> wrote:
> > > >
> > > > Thanks for the quick follow up Reynold and Patrick. Tried a run with
> > > > significantly higher ulimit, doesn't seem to help. The executors have
> > > 35GB
> > > > each. Btw, with a recent version of the branch, the error message is
> > > "fetch
> > > > failures" as opposed to "too many open files". Not sure if they are
> > > > related.  Please note that the workload runs fine with head set to
> > > 066765d.
> > > > In case you want to reproduce the problem: I'm running slightly
> > modified
> > > > ScalaPageRank (with KryoSerializer and persistence level
> > > > memory_and_disk_ser) on a 30GB input dataset and a 6-node cluster.
> > > >
> > > > Thanks,
> > > > Nishkam
> > > >
> > > > On Sun, Sep 21, 2014 at 10:32 PM, Patrick Wendell <
> pwend...@gmail.com>
> > > > wrote:
> > > >
> > > >> Ah I see it was SPARK-2711 (and PR1707). In that case, it's possible
> > > >> that you are just having more spilling as a result of the patch and
> so
> > > >> the filesystem is opening more files. I would try increasing the
> > > >> ulimit.
> > > >>
> > > >> How much memory do your executors have?
> > > >>
> > > >> - Patrick
> > > >>
> > > >> On Sun, Sep 21, 2014 at 10:29 PM, Patrick Wendell <
> pwend...@gmail.com
> > >
> > > >> wrote:
> > > >>> Hey the numbers you mentioned don't quite line up - did you mean PR
> > > 2711?
> > > >>>
> > > >>> On Sun, Sep 21, 2014 at 8:45 PM, Reynold Xin <r...@databricks.com>
> > > >> wrote:
> > > >>>> It seems like you just need to raise the ulimit?
> > > >>>>
> > > >>>>
> > > >>>> On Sun, Sep 21, 2014 at 8:41 PM, Nishkam Ravi <nr...@cloudera.com
> >
> > > >> wrote:
> > > >>>>
> > > >>>>> Recently upgraded to 1.1.0. Saw a bunch of fetch failures for one
> > of
> > > >> the
> > > >>>>> workloads. Tried tracing the problem through change set analysis.
> > > Looks
> > > >>>>> like the offending commit is 4fde28c from Aug 4th for PR1707.
> > Please
> > > >> see
> > > >>>>> SPARK-3633 for more details.
> > > >>>>>
> > > >>>>> Thanks,
> > > >>>>> Nishkam
> > > >>
> > >
> > >
> >
>
