Another data point on the 1.1.0 FetchFailures: the following SQL command succeeds on 1.0.2 but fails on 1.1.0 with the exceptions mentioned earlier in this thread:

  SELECT stringCol, SUM(doubleCol) FROM parquetTable GROUP BY stringCol
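For context, a minimal sketch of the kind of driver program that would run this query, assuming a Parquet file at a hypothetical HDFS path and the Spark 1.1 SQLContext API (the report above only gives the query, not the surrounding setup):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  object GroupBySumRepro {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("GroupBySumRepro"))
      val sqlContext = new SQLContext(sc)

      // Hypothetical location; the thread does not say where the table lives.
      val table = sqlContext.parquetFile("hdfs:///data/parquetTable")
      table.registerTempTable("parquetTable")

      // The aggregation that succeeds on 1.0.2 but hits FetchFailures on 1.1.0.
      sqlContext.sql(
        "SELECT stringCol, SUM(doubleCol) FROM parquetTable GROUP BY stringCol"
      ).collect().foreach(println)

      sc.stop()
    }
  }

The GROUP BY forces a shuffle of the partial aggregates, which is the stage where the fetch failures show up.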
The FetchFailure exception identifies the remote block manager that failed to produce the shuffle data. I enabled GC logging and repeated the run, and the CoarseGrainedExecutorBackend JVM is just pounding through back-to-back full GCs:

943.047: [Full GC [PSYoungGen: 5708288K->5536188K(8105472K)] [ParOldGen: 20971043K->20971202K(20971520K)] 26679331K->26507390K(29076992K) [PSPermGen: 52897K->52897K(57344K)], 48.4514680 secs] [Times: user=602.38 sys=4.43, real=48.44 secs]
991.591: [Full GC [PSYoungGen: 5708288K->5591884K(8105472K)] [ParOldGen: 20971202K->20971044K(20971520K)] 26679490K->26562928K(29076992K) [PSPermGen: 52897K->52897K(56832K)], 51.8109380 secs] [Times: user=645.44 sys=5.03, real=51.81 secs]
1043.431: [Full GC [PSYoungGen: 5708288K->5606238K(8105472K)] [ParOldGen: 20971044K->20971100K(20971520K)] 26679332K->26577339K(29076992K) [PSPermGen: 52908K->52908K(56320K)], 85.9367800 secs] [Times: user=1074.29 sys=9.49, real=85.92 secs]
1129.419: [Full GC [PSYoungGen: 5708288K->5634246K(8105472K)] [ParOldGen: 20971100K->20971471K(20971520K)] 26679388K->26605717K(29076992K) [PSPermGen: 52912K->52912K(55808K)], 52.2114100 secs] [Times: user=652.29 sys=4.94, real=52.21 secs]
1181.671: [Full GC [PSYoungGen: 5708288K->5656389K(8105472K)] [ParOldGen: 20971471K->20971125K(20971520K)] 26679759K->26627514K(29076992K) [PSPermGen: 52961K->52961K(55296K)], 65.3284620 secs] [Times: user=818.58 sys=6.71, real=65.31 secs]
1247.034: [Full GC [PSYoungGen: 5708288K->5672356K(8105472K)] [ParOldGen: 20971125K->20971417K(20971520K)] 26679413K->26643774K(29076992K) [PSPermGen: 52982K->52982K(54784K)], 91.2656940 secs] [Times: user=1146.94 sys=9.83, real=91.25 secs]
1338.318: [Full GC [PSYoungGen: 5708288K->5683177K(8105472K)] [ParOldGen: 20971417K->20971364K(20971520K)] 26679705K->26654541K(29076992K) [PSPermGen: 52982K->52982K(54784K)], 68.9840690 secs] [Times: user=866.72 sys=7.31, real=68.97 secs]
1407.319: [Full GC [PSYoungGen: 5708288K->5691352K(8105472K)] [ParOldGen: 20971364K->20971041K(20971520K)] 26679652K->26662394K(29076992K) [PSPermGen: 52985K->52985K(54272K)], 58.2522860 secs] [Times: user=724.33 sys=5.74, real=58.24 secs]
1465.572: [Full GC [PSYoungGen: 5708288K->5691382K(8105472K)] [ParOldGen: 20971041K->20971041K(20971520K)] 26679329K->26662424K(29076992K) [PSPermGen: 52986K->52986K(54272K)], 17.8034740 secs] [Times: user=221.43 sys=0.72, real=17.80 secs]
1483.377: [Full GC [PSYoungGen: 5708288K->5691383K(8105472K)] [ParOldGen: 20971041K->20971041K(20971520K)] 26679329K->26662424K(29076992K) [PSPermGen: 52987K->52987K(54272K)], 64.3194300 secs] [Times: user=800.32 sys=6.65, real=64.31 secs]
1547.700: [Full GC [PSYoungGen: 5708288K->5692228K(8105472K)] [ParOldGen: 20971041K->20971029K(20971520K)] 26679329K->26663257K(29076992K) [PSPermGen: 52991K->52991K(53760K)], 54.8107170 secs] [Times: user=681.07 sys=5.41, real=54.80 secs]
1602.519: [Full GC [PSYoungGen: 5708288K->5695801K(8105472K)] [ParOldGen: 20971029K->20971401K(20971520K)] 26679317K->26667203K(29076992K) [PSPermGen: 52993K->52993K(53760K)], 71.7970690 secs] [Times: user=896.22 sys=7.61, real=71.79 secs]

I repeated the job, this time taking jmap -histos as it went along.
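For anyone who wants to capture the same kind of trace: these are standard HotSpot GC log lines, and one way to get them from the executors is through the executor extra Java options. A minimal sketch (the exact flags used for the run above aren't stated here; these are stock HotSpot options that produce timestamped entries in this format):

  import org.apache.spark.SparkConf

  // Sketch only: enable verbose GC logging on the executors.
  val conf = new SparkConf()
    .setAppName("executor-gc-logging")
    .set("spark.executor.extraJavaOptions",
      "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")

The class histograms below were taken with jmap -histo against the executor JVM on a worker node while the job ran.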
The last histo I was able to get before the JVM locked up (getting a histo from a JVM in the middle of a GC storm is very difficult) is here:

 num     #instances         #bytes  class name
----------------------------------------------
   1:      31437598     2779681704  [B
   2:      62794123     1507058952  scala.collection.immutable.$colon$colon
   3:      31387645     1506606960  org.apache.spark.sql.catalyst.expressions.Cast
   4:      31387645     1506606960  org.apache.spark.sql.catalyst.expressions.SumFunction
   5:      31387645     1255505800  org.apache.spark.sql.catalyst.expressions.Literal
   6:      31387645     1255505800  org.apache.spark.sql.catalyst.expressions.Coalesce
   7:      31387645     1255505800  org.apache.spark.sql.catalyst.expressions.MutableLiteral
   8:      31387645     1255505800  org.apache.spark.sql.catalyst.expressions.Add
   9:      31391224     1004519168  java.util.HashMap$Entry
  10:      31402978      756090664  [Ljava.lang.Object;
  11:      31395785      753498840  java.lang.Double
  12:      31387645      753303480  [Lorg.apache.spark.sql.catalyst.expressions.AggregateFunction;
  13:      31395808      502332928  org.apache.spark.sql.catalyst.expressions.GenericRow
  14:      31387645      502202320  org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToDouble$5
  15:           772      234947960  [Ljava.util.HashMap$Entry;
  16:           711      106309792  [I
  17:        106747       13673936  <methodKlass>
  18:        106747       12942856  <constMethodKlass>
  19:          8186        9037880  <constantPoolKlass>
  20:          8186        8085000  <instanceKlassKlass>
  21:        222494        5339856  scala.Tuple2
  22:        213731        5129544  java.lang.Long
  23:          6609        4754144  <constantPoolCacheKlass>
  24:         36254        3348992  [C
  25:            73        2393232  [Lscala.concurrent.forkjoin.ForkJoinTask;
  26:         88787        2130888  org.apache.spark.storage.ShuffleBlockId
  27:          4826        1812216  <methodDataKlass>
  28:          8688        1020352  java.lang.Class
  29:         15957         869016  [[I

For me, at least, the symptom is large GC storms on the executors. Is anyone seeing these FetchFailures on a consistent basis without also seeing heavy GC? Hope this helps with debugging.

Andrew

On Mon, Sep 22, 2014 at 3:36 AM, David Rowe <davidr...@gmail.com> wrote:
> I've run into this with large shuffles - I assumed that there was contention between the shuffle output files and the JVM for memory. Whenever we start getting these fetch failures, it corresponds with high load on the machines the blocks are being fetched from, and in some cases complete unresponsiveness (no ssh etc). Setting the timeout higher, or the JVM heap lower (as a percentage of total machine memory), seemed to help.
>
> On Mon, Sep 22, 2014 at 8:02 PM, Christoph Sawade <christoph.saw...@googlemail.com> wrote:
> > Hey all. We also hit the same problem described by Nishkam, in almost the same big-data setting. We fixed the fetch failure by increasing the timeout for acks in the driver:
> >
> > set("spark.core.connection.ack.wait.timeout", "600") // 10 minutes timeout for acks between nodes
> >
> > Cheers, Christoph
> >
> > 2014-09-22 9:24 GMT+02:00 Hortonworks <zzh...@hortonworks.com>:
> > > Actually I hit a similar issue when doing groupByKey and then count, if the shuffle size is big, e.g. 1 TB.
> > >
> > > Thanks.
> > >
> > > Zhan Zhang
> > >
> > > Sent from my iPhone
> > >
> > > > On Sep 21, 2014, at 10:56 PM, Nishkam Ravi <nr...@cloudera.com> wrote:
> > > >
> > > > Thanks for the quick follow-up, Reynold and Patrick. Tried a run with a significantly higher ulimit; it doesn't seem to help. The executors have 35GB each. Btw, with a recent version of the branch, the error message is "fetch failures" as opposed to "too many open files". Not sure if they are related.
> > > > Please note that the workload runs fine with head set to 066765d. In case you want to reproduce the problem: I'm running a slightly modified ScalaPageRank (with KryoSerializer and persistence level memory_and_disk_ser) on a 30GB input dataset and a 6-node cluster.
> > > >
> > > > Thanks,
> > > > Nishkam
> > > >
> > > > On Sun, Sep 21, 2014 at 10:32 PM, Patrick Wendell <pwend...@gmail.com> wrote:
> > > >
> > > >> Ah, I see - it was SPARK-2711 (and PR1707). In that case, it's possible that you are just seeing more spilling as a result of the patch, and so the filesystem is opening more files. I would try increasing the ulimit.
> > > >>
> > > >> How much memory do your executors have?
> > > >>
> > > >> - Patrick
> > > >>
> > > >> On Sun, Sep 21, 2014 at 10:29 PM, Patrick Wendell <pwend...@gmail.com> wrote:
> > > >>> Hey, the numbers you mentioned don't quite line up - did you mean PR 2711?
> > > >>>
> > > >>> On Sun, Sep 21, 2014 at 8:45 PM, Reynold Xin <r...@databricks.com> wrote:
> > > >>>> It seems like you just need to raise the ulimit?
> > > >>>>
> > > >>>> On Sun, Sep 21, 2014 at 8:41 PM, Nishkam Ravi <nr...@cloudera.com> wrote:
> > > >>>>
> > > >>>>> Recently upgraded to 1.1.0. Saw a bunch of fetch failures for one of the workloads. Tried tracing the problem through change-set analysis. Looks like the offending commit is 4fde28c from Aug 4th, for PR1707. Please see SPARK-3633 for more details.
> > > >>>>>
> > > >>>>> Thanks,
> > > >>>>> Nishkam
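To pull together the mitigations suggested in the quoted replies, here is a minimal sketch of how they might be applied from a driver program. The ack timeout value is the one Christoph quotes; the executor memory value is purely illustrative (the thread only says to keep the JVM heap smaller relative to total machine memory), and none of this is a confirmed fix for SPARK-3633:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("fetch-failure-mitigations")
    // 10-minute timeout for acks between nodes, as in Christoph's reply.
    .set("spark.core.connection.ack.wait.timeout", "600")
    // Illustrative value only: run executors with a heap that is a smaller
    // fraction of total machine memory, per David's observation.
    .set("spark.executor.memory", "24g")
  val sc = new SparkContext(conf)

The ulimit increase that Reynold and Patrick suggest is an OS-level change (raising the open-file limit for the user that runs the executors), not a Spark configuration setting.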