Can you attach (1) the full log file for the AM and (2) the full log file for the container?
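On the thread-dump question quoted below, here is a minimal Python sketch of the "locate the PID, then run jstack" steps. It assumes the container's Java command line contains the YARN container ID (commonly true because of log or working-directory paths, but not guaranteed), that `ps` and `jstack` are on the PATH, and that it runs as the same user as the JVM. The function names are hypothetical helpers, not part of Samza or YARN.

```python
import os
import subprocess
from typing import List


def find_container_pids(ps_output: str, container_id: str) -> List[int]:
    """Return PIDs of java processes whose command line mentions container_id.

    ps_output is expected in the `ps -eo pid,args` layout: PID, then the command.
    """
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skips the header row and malformed lines
        pid, args = int(parts[0]), parts[1]
        executable = args.split()[0]
        # Keep only the JVM itself, not the bash launcher scripts around it.
        if container_id in args and os.path.basename(executable) == "java":
            pids.append(pid)
    return pids


def dump_container_threads(container_id: str) -> str:
    """Run ps and jstack on this host; return the concatenated thread dumps."""
    ps = subprocess.run(
        ["ps", "-ww", "-eo", "pid,args"],
        capture_output=True, text=True, check=True,
    )
    dumps = []
    for pid in find_container_pids(ps.stdout, container_id):
        result = subprocess.run(
            ["jstack", str(pid)], capture_output=True, text=True, check=True
        )
        dumps.append(result.stdout)
    return "\n".join(dumps)
```

On the node that YARN reports for the container, calling `dump_container_threads("container_e39_1557265340810_0003_01_000002")` (the container ID from the quoted AM log) would return the dump as text, which can then be saved and attached.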
> Btw, how do I pull a thread-dump of the stuck container?

It should not be any different from taking a thread dump of any other Java process. Log in to the machine where YARN shows the container as running, locate its process ID, and run jstack on that PID:
<https://confluence.atlassian.com/doc/generating-a-thread-dump-externally-182158040.html#GeneratingaThreadDumpExternally-GeneratingthreaddumpsonLinux>

> java.net.ConnectException

Since this is a ConnectException, can you rule out network issues? Can the AM host and the container host communicate?

On Fri, May 10, 2019 at 9:47 AM Malcolm McFarland <mmcfarl...@cavulus.com> wrote:

> Hey all,
>
> Logs are working, the AM process is running. I haven't hit a "known good
> version" yet; the deploy seems to be hitting this wall each time, which,
> once again, seems to fail slightly differently each time. Looking at the
> node manager logs, I am seeing this line being repeated:
>
> 2019-05-10 05:48:07,705 WARN [org.apache.samza.util.Util$:74] Error getting
> response from Job coordinator server. received IOException: class
> java.net.ConnectException. Retrying...
>
> There's no other information in the log about what is going on. Does
> anybody have ideas on this?
>
> Btw, how do I pull a thread-dump of the stuck container?
>
> Cheers,
> Malcolm
>
>
> On Tue, May 7, 2019 at 10:48 PM Jagadish Venkatraman <jagadish1...@gmail.com>
> wrote:
>
> > Malcolm,
> >
> > Did the AM-process come up? If so, can you attach its entire log-file?
> >
> > "> everything will launch fine one time, and then it will do this
> > RUNNING-but-no-Samza thing the next."
> >
> > IIUC, you believe your container is not making progress. If the issue
> > recurs, can you attach a thread-dump & log-file(s) of the "stuck"
> > container?
> >
> > "> my logs are showing that Samza is not actually starting inside of the
> > container"
> >
> > Can you confirm that logging is actually working? eg: have you verified
> > there is only one log4j binding in your class-path?
> >
> > Did anything change on your end? eg: did you upgrade to a new Samza
> > version/ app-version/yarn-version?
> >
> > Can you roll-back to a known-good version to better isolate the issue?
> >
> > Best,
> > Jagadish
> >
> > On Tue, May 7, 2019 at 3:54 PM Malcolm McFarland <mmcfarl...@cavulus.com>
> > wrote:
> >
> > > As a followup to this, here's what I see when the Samza app tries to
> > > start; it actually seems to be getting to the run-container script,
> > > and then stops:
> > >
> > >
> > > Kafka version : 0.11.0.2
> > > Kafka commitId : 73be1e1168f91ee2
> > > Error registering AppInfo mbean
> > > Started coordinator stream writer.
> > > sent SetConfig message with key = samza.autoscaling.server.url and value = http://ba6ecb67825e:34205/
> > > Stopping the coordinator stream producer.
> > > Stopping coordinator stream producer.
> > > Stopping producer for system: kafka
> > > Closing the Kafka producer with timeoutMillis = 9223372036854775807 ms.
> > > Webapp is started at (rpc http://ba6ecb67825e:35629/, tracking http://ba6ecb67825e:34151/, coordinator http://ba6ecb67825e:34205/)
> > > Starting YarnContainerManager.
> > > Upper bound of the thread pool size is 500
> > > yarn.client.max-cached-nodemanagers-proxies : 0
> > > Got AM register response. The YARN RM supports container requests with max-mem: 8192, max-cpu: 32
> > > Finished starting YarnContainerManager
> > > Starting the Samza task manager
> > > Resource Request created for 0 on ANY_HOST at 1557268807252
> > > Requesting resources on ANY_HOST for container 0
> > > Making a request for ANY_HOST
> > > Starting the container allocator thread
> > > Received new token for : ip-10-60-31-121.us-west-2.compute.internal:8032
> > > Container allocated from RM on ip-10-60-31-121.us-west-2.compute.internal
> > > Container allocated from RM on ip-10-60-31-121.us-west-2.compute.internal
> > > Host affinity not enabled. Saving the samzaResource container_e39_1557265340810_0003_01_000002 in the buffer for ANY_HOST
> > > Returning a buffered resource: container_e39_1557265340810_0003_01_000002 for ANY_HOST from preferred-host buffer.
> > > Returning a buffered resource: container_e39_1557265340810_0003_01_000002 for ANY_HOST from preferred-host buffer.
> > > Cancelling request SamzaResourceRequest{numCores=4, memoryMB=8192, preferredHost='ANY_HOST', requestID='1507e2c5-e437-409b-821c-ef505ee19b85', containerID=0, requestTimestampMs=1557268807252}
> > > Found available resources on ANY_HOST. Assigning request for container_id 0 with timestamp 1557268807252 to resource container_e39_1557265340810_0003_01_000002
> > > Received launch request for 0 on hostname ip-10-60-31-121.us-west-2.compute.internal
> > > Got available container ID (0) for container: Container: [ContainerId: container_e39_1557265340810_0003_01_000002, NodeId: ip-10-60-31-121.us-west-2.compute.internal:8032, NodeHttpAddress: ip-10-60-31-121.us-west-2.compute.internal:8088, Resource: <memory:8192, vCores:4>, Priority: 1, Token: Token { kind: ContainerToken, service: 10.60.31.121:8032 }, ]
> > > In runContainer in util: fwkPath= ;cmdPath=./__package/;jobLib=
> > > Container ID 0 using command ./__package//bin/run-container.sh
> > >
> > > Cheers,
> > > Malcolm
> > >
> > >
> > > On Tue, May 7, 2019 at 3:22 PM Malcolm McFarland <mmcfarl...@cavulus.com>
> > > wrote:
> > >
> > > > Hey folks,
> > > >
> > > > We're having some trouble running Samza under YARN. The YARN
> > > > containers are launching fully into the RUNNING state, and I can see
> > > > in the node manager logs that the containers are running, but my logs
> > > > are showing that Samza is not actually starting inside of the
> > > > container. What's really curious is that this is intermittent;
> > > > everything will launch fine one time, and then it will do this
> > > > RUNNING-but-no-Samza thing the next.
> > > >
> > > > I've been trying to get into the AM UI to see what's going on, but I
> > > > see the following error when I try accessing it:
> > > >
> > > > Problem accessing /proxy/application_1557265340810_0002/. Reason:
> > > > Cannot assign requested address (Bind failed)
> > > > Caused by:
> > > > java.net.BindException: Cannot assign requested address (Bind failed)
> > > >
> > > > Has anybody seen this issue with the AM web interface? Also, are there
> > > > any other ways that I could introspect the YARN container to try and
> > > > deduce what's happening?
> > > >
> > > > Cheers,
> > > > Malcolm
> > > >
> > > >
> > > > --
> > > > Malcolm McFarland
> > > > Cavulus
> > > >
> > > >
> > > > This correspondence is from HealthPlanCRM, LLC, d/b/a Cavulus. Any
> > > > unauthorized or improper disclosure, copying, distribution, or use of
> > > > the contents of this message is prohibited. The information contained
> > > > in this message is intended only for the personal and confidential use
> > > > of the recipient(s) named above. If you have received this message in
> > > > error, please notify the sender immediately and delete the original
> > > > message.
> > > >
> > >
> > > --
> > > Malcolm McFarland
> > > Cavulus
> > > 1-800-760-6915
> > > mmcfarl...@cavulus.com
> > >
> >
> > --
> > Jagadish V,
> > Graduate Student,
> > Department of Computer Science,
> > Stanford University
> >
>
> --
> Malcolm McFarland
> Cavulus
>

--
Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University
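
For the connectivity question above (can the container host reach the AM host?), a small Python sketch along these lines could be run from the container host. It only attempts a TCP connection, so it separates "name does not resolve" from "connection refused" from other failures, and says nothing about what the service behind the port is doing. The function name `can_connect` is a hypothetical helper, not a Samza or YARN API.

```python
import socket
from typing import Tuple


def can_connect(host: str, port: int, timeout: float = 3.0) -> Tuple[bool, str]:
    """Try a TCP connection to host:port and report (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.gaierror as exc:
        # The hostname could not be resolved from this machine.
        return False, f"name resolution failed: {exc}"
    except ConnectionRefusedError as exc:
        # The host is reachable but nothing is listening on that port.
        return False, f"connection refused: {exc}"
    except OSError as exc:
        # Timeouts, unreachable networks, and other socket errors.
        return False, f"connection failed: {exc}"
```

As an example, the AM log quoted above reports the coordinator at http://ba6ecb67825e:34205/, so `can_connect("ba6ecb67825e", 34205)` on the container host would check that address; substitute the host and port from your own AM log.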