Hi all,

I am having difficulty determining why my Samza task is failing. It generally fails within 10 minutes of starting. When I examine the YARN log, I see the following exception on some, but not all, containers:

java.rmi.server.ExportException: Port already in use: 40029; nested exception is:
    java.net.BindException: Address already in use
    at sun.rmi.transport.tcp.TCPTransport.listen(TCPTransport.java:341)
    at sun.rmi.transport.tcp.TCPTransport.exportObject(TCPTransport.java:249)
    at sun.rmi.transport.tcp.TCPEndpoint.exportObject(TCPEndpoint.java:411)
    at sun.rmi.transport.LiveRef.exportObject(LiveRef.java:147)
    at sun.rmi.server.UnicastServerRef.exportObject(UnicastServerRef.java:208)
    at java.rmi.server.UnicastRemoteObject.exportObject(UnicastRemoteObject.java:383)
    at java.rmi.server.UnicastRemoteObject.exportObject(UnicastRemoteObject.java:346)
    at javax.management.remote.rmi.RMIJRMPServerImpl.export(RMIJRMPServerImpl.java:118)
    at javax.management.remote.rmi.RMIJRMPServerImpl.export(RMIJRMPServerImpl.java:95)
    at javax.management.remote.rmi.RMIConnectorServer.start(RMIConnectorServer.java:404)
    at org.apache.samza.metrics.JmxServer.<init>(JmxServer.scala:89)
    at org.apache.samza.metrics.JmxServer.<init>(JmxServer.scala:43)
    at org.apache.samza.container.SamzaContainer$$anonfun$main$2.apply(SamzaContainer.scala:66)
    at org.apache.samza.container.SamzaContainer$$anonfun$main$2.apply(SamzaContainer.scala:66)
    at org.apache.samza.container.SamzaContainer$.safeMain(SamzaContainer.scala:91)
    at org.apache.samza.container.SamzaContainer$.main(SamzaContainer.scala:66)
    at org.apache.samza.container.SamzaContainer.main(SamzaContainer.scala)
Caused by: java.net.BindException: Address already in use
    at java.net.PlainSocketImpl.socketBind(Native Method)
    at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:387)
    at java.net.ServerSocket.bind(ServerSocket.java:375)
    at java.net.ServerSocket.<init>(ServerSocket.java:237)
    at java.net.ServerSocket.<init>(ServerSocket.java:128)
    at sun.rmi.transport.proxy.RMIDirectSocketFactory.createServerSocket(RMIDirectSocketFactory.java:45)
    at sun.rmi.transport.proxy.RMIMasterSocketFactory.createServerSocket(RMIMasterSocketFactory.java:345)
    at sun.rmi.transport.tcp.TCPEndpoint.newServerSocket(TCPEndpoint.java:666)
    at sun.rmi.transport.tcp.TCPTransport.listen(TCPTransport.java:330)
    ... 16 more

Aside from this, I am also seeing a lot of garbage collection messages, but no OutOfMemoryError:

2016-06-10T09:52:48.335+0000: 1.150: [GC (System.gc()) 115345K->9996K(1005056K), 0.0097432 secs]
2016-06-10T09:52:48.345+0000: 1.160: [Full GC (System.gc()) 9996K->9414K(1005056K), 0.0291634 secs]
2016-06-10T09:52:50.032+0000: 2.846: [GC (Allocation Failure) 271558K->18186K(1005056K), 0.0094124 secs]
2016-06-10T09:52:50.592+0000: 3.406: [GC (Allocation Failure) 280330K->13998K(1005056K), 0.0029778 secs]
2016-06-10T09:52:51.036+0000: 3.850: [GC (Allocation Failure) 276142K->11966K(1005056K), 0.0029768 secs]
2016-06-10T09:52:51.437+0000: 4.252: [GC (Allocation Failure) 274110K->11942K(1005056K), 0.0033398 secs]
2016-06-10T09:52:53.367+0000: 6.182: [GC (Metadata GC Threshold) 114006K->14958K(1037312K), 0.0050709 secs]
2016-06-10T09:52:53.372+0000: 6.187: [Full GC (Metadata GC Threshold) 14958K->8364K(1037312K), 0.0251294 secs]
2016-06-10T09:52:54.438+0000: 7.252: [GC (Allocation Failure) 302764K->14676K(1005056K), 0.0073039 secs]
2016-06-10T09:52:54.952+0000: 7.766: [GC (Allocation Failure) 309076K->18076K(1034752K), 0.0105222 secs]
2016-06-10T09:52:55.471+0000: 8.285: [GC (Allocation Failure) 342684K->14556K(1036288K), 0.0067447 secs]
2016-06-10T09:52:55.970+0000: 8.784: [GC (Allocation Failure) 339164K->15956K(1037312K), 0.0038883 secs]
2016-06-10T09:52:56.473+0000: 9.287: [GC (Allocation Failure) 342612K->13940K(1037312K), 0.0034693 secs]
2016-06-10T09:52:56.958+0000: 9.773: [GC (Allocation Failure) 340596K->15608K(1037824K), 0.0049325 secs]
2016-06-10T09:52:57.452+0000: 10.266: [GC (Allocation Failure) 343288K->18659K(1037824K), 0.0155791 secs]
2016-06-10T09:52:58.000+0000: 10.814: [GC (Allocation Failure) 346339K->19508K(1036800K), 0.0154724 secs]
2016-06-10T09:52:58.528+0000: 11.342: [GC (Allocation Failure) 346164K->23116K(1037312K), 0.0033848 secs]
2016-06-10T09:52:58.999+0000: 11.813: [GC (Allocation Failure) 349772K->26455K(1038336K), 0.0081673 secs]
2016-06-10T09:52:59.488+0000: 12.302: [GC (Allocation Failure) 354647K->27086K(1037824K), 0.0046321 secs]
2016-06-10T09:52:59.937+0000: 12.751: [GC (Allocation Failure) 355278K->30694K(1038336K), 0.0032607 secs]
2016-06-10T09:53:00.375+0000: 13.189: [GC (Allocation Failure) 358886K->34214K(1037824K), 0.0053298 secs]
2016-06-10T09:53:00.819+0000: 13.634: [GC (Allocation Failure) 362406K->34860K(1038848K), 0.0064049 secs]
2016-06-10T09:53:01.261+0000: 14.075: [GC (Allocation Failure) 364588K->38492K(1038848K), 0.0053598 secs]
2016-06-10T09:53:01.708+0000: 14.522: [GC (Allocation Failure) 368220K->40955K(1039360K), 0.0054838 secs]
2016-06-10T09:53:02.156+0000: 14.971: [GC (Allocation Failure) 371707K->42472K(1039360K), 0.0097664 secs]
2016-06-10T09:53:02.602+0000: 15.416: [GC (Allocation Failure) 373224K->46177K(1040384K), 0.0057107 secs]
2016-06-10T09:53:03.052+0000: 15.866: [GC (Allocation Failure) 378465K->46599K(1039872K), 0.0063725 secs]
2016-06-10T09:53:03.497+0000: 16.311: [GC (Allocation Failure) 378887K->50183K(1040384K), 0.0095219 secs]
2016-06-10T09:53:03.945+0000: 16.759: [GC (Allocation Failure) 382983K->52824K(1040384K), 0.0044979 secs]
2016-06-10T09:53:04.392+0000: 17.206: [GC (Allocation Failure) 385624K->54387K(1040896K), 0.0053487 secs]
2016-06-10T09:53:04.841+0000: 17.656: [GC (Allocation Failure) 388211K->56947K(1040896K), 0.0025053 secs]
2016-06-10T09:53:05.303+0000: 18.117: [GC (Allocation Failure) 390771K->59701K(1041408K), 0.0054432 secs]
2016-06-10T09:53:05.757+0000: 18.571: [GC (Allocation Failure) 394037K->60303K(1040896K), 0.0053059 secs]
2016-06-10T09:53:06.208+0000: 19.022: [GC (Allocation Failure) 394639K->63759K(1041408K), 0.0024715 secs]
2016-06-10T09:53:06.661+0000: 19.475: [GC (Allocation Failure) 398607K->66585K(1041408K), 0.0039495 secs]
2016-06-10T09:53:07.109+0000: 19.924: [GC (Allocation Failure) 401433K->67278K(1041408K), 0.0056410 secs]
2016-06-10T09:53:07.563+0000: 20.377: [GC (Allocation Failure) 402126K->70798K(1041408K), 0.0031031 secs]
2016-06-10T09:53:08.038+0000: 20.853: [GC (Allocation Failure) 405646K->73652K(1041408K), 0.0057616 secs]

Can anyone help?

Regards,
Jack