Hey Andrew,

Yeah, that would be preferable. Definitely worth investigating both, but the regression is more pressing at the moment.
- Patrick

On Mon, Jul 14, 2014 at 10:02 PM, Andrew Ash <and...@andrewash.com> wrote:
> I don't believe mine is a regression, but it is related to thread safety on
> Hadoop Configuration objects. Should I start a new thread?
>
> On Jul 15, 2014 12:55 AM, "Patrick Wendell" <pwend...@gmail.com> wrote:
>
>> Andrew, is your issue also a regression from 1.0.0 to 1.0.1? The
>> immediate priority is addressing regressions between these two
>> releases.
>>
>> On Mon, Jul 14, 2014 at 9:05 PM, Andrew Ash <and...@andrewash.com> wrote:
>> > I'm not sure either of those PRs will fix the concurrent adds to
>> > Configuration issue I observed. I've got a stack trace and writeup I'll
>> > share in an hour or two (traveling today).
>> >
>> > On Jul 14, 2014 9:50 PM, "scwf" <wangf...@huawei.com> wrote:
>> >
>> >> Hi Cody,
>> >> I met this issue a few days ago and posted a PR for it
>> >> (https://github.com/apache/spark/pull/1385).
>> >> It's very strange: if I synchronize on conf it deadlocks, but it is
>> >> fine when I synchronize on initLocalJobConfFuncOpt.
>> >>
>> >>> Here's the entire jstack output.
>> >>>
>> >>> On Mon, Jul 14, 2014 at 4:44 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>> >>>
>> >>> Hey Cody,
>> >>>
>> >>> This jstack seems truncated; would you mind giving the entire stack
>> >>> trace? For the second thread, for instance, we can't see where the
>> >>> lock is being acquired.
>> >>>
>> >>> - Patrick
>> >>>
>> >>> On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
>> >>> <cody.koenin...@mediacrossing.com> wrote:
>> >>> > Hi all, just wanted to give a heads up that we're seeing a
>> >>> > reproducible deadlock with Spark 1.0.1 on Hadoop 2.3.0-mr1-cdh5.0.2.
>> >>> >
>> >>> > If JIRA is a better place for this, apologies in advance; figured
>> >>> > talking about it on the mailing list was friendlier than randomly
>> >>> > (re)opening JIRA tickets.
>> >>> > I know Gary had mentioned some issues with 1.0.1 on the mailing
>> >>> > list; once we got a thread dump I wanted to follow up.
>> >>> >
>> >>> > The thread dump shows the deadlock occurs in the synchronized block
>> >>> > of code that was changed in HadoopRDD.scala for the SPARK-1097 issue.
>> >>> >
>> >>> > Relevant portions of the thread dump are summarized below; we can
>> >>> > provide the whole dump if it's useful.
>> >>> >
>> >>> > Found one Java-level deadlock:
>> >>> > =============================
>> >>> > "Executor task launch worker-1":
>> >>> >   waiting to lock monitor 0x00007f250400c520 (object 0x00000000fae7dc30,
>> >>> >   a org.apache.hadoop.conf.Configuration),
>> >>> >   which is held by "Executor task launch worker-0"
>> >>> > "Executor task launch worker-0":
>> >>> >   waiting to lock monitor 0x00007f2520495620 (object 0x00000000faeb4fc8,
>> >>> >   a java.lang.Class),
>> >>> >   which is held by "Executor task launch worker-1"
>> >>> >
>> >>> > "Executor task launch worker-1":
>> >>> >   at org.apache.hadoop.conf.Configuration.reloadConfiguration(Configuration.java:791)
>> >>> >   - waiting to lock <0x00000000fae7dc30> (a org.apache.hadoop.conf.Configuration)
>> >>> >   at org.apache.hadoop.conf.Configuration.addDefaultResource(Configuration.java:690)
>> >>> >   - locked <0x00000000faca6ff8> (a java.lang.Class for org.apache.hadoop.conf.Configuration)
>> >>> >   at org.apache.hadoop.hdfs.HdfsConfiguration.<clinit>(HdfsConfiguration.java:34)
>> >>> >   at org.apache.hadoop.hdfs.DistributedFileSystem.<clinit>(DistributedFileSystem.java:110)
>> >>> >   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>> >>> >   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>> >>> >   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>> >>> >   at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
>> >>> >   at java.lang.Class.newInstance0(Class.java:374)
>> >>> >   at java.lang.Class.newInstance(Class.java:327)
>> >>> >   at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
>> >>> >   at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
>> >>> >   at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2364)
>> >>> >   - locked <0x00000000faeb4fc8> (a java.lang.Class for org.apache.hadoop.fs.FileSystem)
>> >>> >   at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
>> >>> >   at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
>> >>> >   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
>> >>> >   at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
>> >>> >   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
>> >>> >   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
>> >>> >   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
>> >>> >   at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
>> >>> >   at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
>> >>> >   at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
>> >>> >   at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>> >>> >   at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>> >>> >   at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
>> >>> >
>> >>> > ...elided...
>> >>> >
>> >>> > "Executor task launch worker-0" daemon prio=10 tid=0x0000000001e71800
>> >>> > nid=0x2d97 waiting for monitor entry [0x00007f24d2bf1000]
>> >>> >    java.lang.Thread.State: BLOCKED (on object monitor)
>> >>> >   at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2362)
>> >>> >   - waiting to lock <0x00000000faeb4fc8> (a java.lang.Class for org.apache.hadoop.fs.FileSystem)
>> >>> >   at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
>> >>> >   at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
>> >>> >   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
>> >>> >   at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
>> >>> >   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
>> >>> >   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
>> >>> >   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
>> >>> >   at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
>> >>> >   at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
>> >>> >   at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
>> >>> >   at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>> >>> >   at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>> >>> >   at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
>> >>
>> >> --
>> >> Best Regards
>> >> Fei Wang
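For anyone following along: the dump above is the classic two-monitor cycle. Worker-1 takes the FileSystem class lock in loadFileSystems and then, through HdfsConfiguration's static initializer, blocks waiting for the shared Configuration instance's monitor; worker-0 (presumably holding that Configuration monitor via the synchronized block added for SPARK-1097) blocks waiting for the same FileSystem class lock. Here's a minimal, self-contained sketch of that cycle, using two plain Objects as hypothetical stand-ins for the real monitors, plus the same detection jstack performs. The threads are daemons so the JVM can still exit while they are stuck:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class DeadlockSketch {
    // Hypothetical stand-ins for the two monitors in the jstack output:
    static final Object confMonitor = new Object(); // the shared Configuration instance
    static final Object fsClassLock = new Object(); // the FileSystem class lock

    // Acquire `first`, pause so the other thread can grab its own first
    // lock, then try to acquire `second` -- opposite orders form a cycle.
    static void spin(Object first, Object second, String name) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                try { Thread.sleep(200); } catch (InterruptedException ignored) { }
                synchronized (second) { }
            }
        }, name);
        t.setDaemon(true); // daemon threads don't block JVM exit
        t.start();
    }

    public static void main(String[] args) throws InterruptedException {
        spin(confMonitor, fsClassLock, "worker-0");
        spin(fsClassLock, confMonitor, "worker-1");
        Thread.sleep(1000); // let both threads block on each other

        // The same check jstack runs before printing
        // "Found one Java-level deadlock"
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] deadlocked = mx.findDeadlockedThreads();
        System.out.println(deadlocked != null && deadlocked.length == 2); // prints "true"
    }
}
```

This also suggests why scwf's observation makes sense: synchronizing on a single dedicated object (initLocalJobConfFuncOpt) rather than on the shared Configuration keeps Spark's lock disjoint from the monitors Hadoop takes internally, so no ordering cycle can form. Giving each task its own Configuration copy would break the cycle for the same reason, though whether that's viable here depends on the broadcast-conf design being discussed.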