Can you attach:

(1) the full log file for the AM, and
(2) the full log file for the container?

> Btw, how do I pull a thread-dump of the stuck container?

It should be no different from taking a thread dump of any other Java
process.

Log in to the machine where YARN shows the container as running, locate
the container's process ID, and run jstack on that PID:
<https://confluence.atlassian.com/doc/generating-a-thread-dump-externally-182158040.html#GeneratingaThreadDumpExternally-GeneratingthreaddumpsonLinux>
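If it helps, here is a rough Python sketch of those steps. It is only an illustration under a few assumptions: the YARN container ID shows up somewhere on the JVM's command line (it usually does, through the log or working directory paths), `ps` and `jstack` are on the PATH, and the script runs as the same user that owns the JVM. The function names are mine, not part of Samza or YARN.

```python
import subprocess


def find_pids(container_id, ps_output):
    """Return PIDs of JVMs whose command line mentions container_id.

    ps_output is the text printed by `ps -eo pid,args`.
    """
    pids = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue
        pid, cmd = parts
        # Skip the header row and anything whose executable is not java
        # (for example the bash launcher scripts YARN starts first).
        if not pid.isdigit() or not cmd.split()[0].endswith("java"):
            continue
        if container_id in cmd:
            pids.append(int(pid))
    return pids


def thread_dump(pid):
    """Run jstack against one PID and return its output as text."""
    result = subprocess.run(
        ["jstack", str(pid)], capture_output=True, text=True, check=True
    )
    return result.stdout


def dump_container(container_id):
    """Collect a thread dump for every JVM matching container_id."""
    ps = subprocess.run(
        ["ps", "-eo", "pid,args"], capture_output=True, text=True, check=True
    )
    return {pid: thread_dump(pid) for pid in find_pids(container_id, ps.stdout)}


if __name__ == "__main__":
    import sys

    for pid, dump in dump_container(sys.argv[1]).items():
        print(f"=== PID {pid} ===")
        print(dump)
```

You would run it on the node manager host with the container ID as the only argument, for example container_e39_1557265340810_0003_01_000002. Taking two or three dumps a few seconds apart makes it easier to tell a stuck thread from a busy one.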

> java.net.ConnectException

Since this is a ConnectException, can you rule out network issues? Can the
AM host and container host communicate?
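One quick way to check is to attempt a plain TCP connection from the container host to the coordinator address that the AM logs at startup (the "coordinator http://..." entry on the "Webapp is started at" line). The snippet below is a minimal sketch of that check; `can_connect` is a name I made up, and it only tests name resolution and TCP reachability, not whether the job coordinator answers requests correctly.

```python
import socket
from urllib.parse import urlparse


def can_connect(url, timeout=3.0):
    """Try to open a TCP connection to the host and port in url.

    Returns (True, "connected") on success, or (False, reason) if name
    resolution fails, the connection is refused, or the attempt times out.
    """
    parsed = urlparse(url)
    host = parsed.hostname
    if host is None:
        return False, "no host in URL"
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # Covers DNS failures, refused connections, and timeouts.
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    import sys

    ok, detail = can_connect(sys.argv[1])
    print("reachable" if ok else "NOT reachable", "-", detail)
```

Run it on the container host with the coordinator URL from the AM log as the argument. If the host name in that URL does not resolve from the container host, the check fails at the DNS step, which would also be consistent with a ConnectException in the container.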





On Fri, May 10, 2019 at 9:47 AM Malcolm McFarland <mmcfarl...@cavulus.com>
wrote:

> Hey all,
>
> Logs are working, the AM process is running. I haven't hit a "known good
> version" yet; the deploy seems to be hitting this wall each time, which,
> once again, seems to fail slightly differently each time. Looking at the
> node manager logs, I am seeing this line being repeated:
>
> 2019-05-10 05:48:07,705 WARN [org.apache.samza.util.Util$:74] Error getting
> response from Job coordinator server. received IOException: class
> java.net.ConnectException. Retrying...
>
> There's no other information in the log about what is going on. Does
> anybody have ideas on this?
>
> Btw, how do I pull a thread-dump of the stuck container?
>
> Cheers,
> Malcolm
>
>
> On Tue, May 7, 2019 at 10:48 PM Jagadish Venkatraman <
> jagadish1...@gmail.com>
> wrote:
>
> > Malcolm,
> >
> > Did the AM-process come up? If so, can you attach its entire log-file?
> >
> > "> everything will launch fine one time, and then it will do this
> > RUNNING-but-no-Samza thing the next."
> >
> > IIUC, you believe your container is not making progress. If the issue
> > recurs, can you attach a thread-dump & log-file(s) of the "stuck"
> > container?
> >
> > "> my logs are showing that Samza is not actually starting inside of the
> > container"
> >
> > Can you confirm that logging is actually working? eg: have you verified
> > there is only one log4j binding in your class-path?
> >
> > Did anything change on your end? eg: did you upgrade to a new Samza
> > version/ app-version/yarn-version?
> >
> > Can you roll-back to a known-good version to better isolate the issue?
> >
> > Best,
> > Jagadish
> >
> > On Tue, May 7, 2019 at 3:54 PM Malcolm McFarland <mmcfarl...@cavulus.com>
> > wrote:
> >
> > > As a followup to this, here's what I see when the Samza app tries to start;
> > > it actually seems to be getting to the run-container script, and then
> > > stops:
> > >
> > >
> > > Kafka version : 0.11.0.2
> > > Kafka commitId : 73be1e1168f91ee2
> > > Error registering AppInfo mbean
> > > Started coordinator stream writer.
> > > sent SetConfig message with key = samza.autoscaling.server.url and value =
> > > http://ba6ecb67825e:34205/
> > > Stopping the coordinator stream producer.
> > > Stopping coordinator stream producer.
> > > Stopping producer for system: kafka
> > > Closing the Kafka producer with timeoutMillis = 9223372036854775807 ms.
> > > Webapp is started at (rpc http://ba6ecb67825e:35629/, tracking
> > > http://ba6ecb67825e:34151/, coordinator http://ba6ecb67825e:34205/)
> > > Starting YarnContainerManager.
> > > Upper bound of the thread pool size is 500
> > > yarn.client.max-cached-nodemanagers-proxies : 0
> > > Got AM register response. The YARN RM supports container requests with
> > > max-mem: 8192, max-cpu: 32
> > > Finished starting YarnContainerManager
> > > Starting the Samza task manager
> > > Resource Request created for 0 on ANY_HOST at 1557268807252
> > > Requesting resources on  ANY_HOST for container 0
> > > Making a request for ANY_HOST
> > > Starting the container allocator thread
> > > Received new token for : ip-10-60-31-121.us-west-2.compute.internal:8032
> > > Container allocated from RM on ip-10-60-31-121.us-west-2.compute.internal
> > > Container allocated from RM on ip-10-60-31-121.us-west-2.compute.internal
> > > Host affinity not enabled. Saving the samzaResource
> > > container_e39_1557265340810_0003_01_000002 in the buffer for ANY_HOST
> > > Returning a buffered resource: container_e39_1557265340810_0003_01_000002
> > > for ANY_HOST from preferred-host buffer.
> > > Returning a buffered resource: container_e39_1557265340810_0003_01_000002
> > > for ANY_HOST from preferred-host buffer.
> > > Cancelling request SamzaResourceRequest{numCores=4, memoryMB=8192,
> > > preferredHost='ANY_HOST', requestID='1507e2c5-e437-409b-821c-ef505ee19b85',
> > > containerID=0, requestTimestampMs=1557268807252}
> > > Found available resources on ANY_HOST. Assigning request for container_id 0
> > > with timestamp 1557268807252 to resource
> > > container_e39_1557265340810_0003_01_000002
> > > Received launch request for 0 on hostname
> > > ip-10-60-31-121.us-west-2.compute.internal
> > > Got available container ID (0) for container: Container: [ContainerId:
> > > container_e39_1557265340810_0003_01_000002, NodeId:
> > > ip-10-60-31-121.us-west-2.compute.internal:8032, NodeHttpAddress:
> > > ip-10-60-31-121.us-west-2.compute.internal:8088, Resource: <memory:8192,
> > > vCores:4>, Priority: 1, Token: Token { kind: ContainerToken, service:
> > > 10.60.31.121:8032 }, ]
> > > In runContainer in util: fwkPath= ;cmdPath=./__package/;jobLib=
> > > Container ID 0 using command ./__package//bin/run-container.sh
> > >
> > > Cheers,
> > > Malcolm
> > >
> > >
> > > On Tue, May 7, 2019 at 3:22 PM Malcolm McFarland <mmcfarl...@cavulus.com>
> > > wrote:
> > >
> > > > Hey folks,
> > > >
> > > > We're having some trouble running Samza under YARN. The YARN
> > > > containers are launching fully into the RUNNING state, and I can see
> > > > in the node manager logs that the containers are running, but my logs
> > > > are showing that Samza is not actually starting inside of the
> > > > container. What's really curious is that this is intermittent;
> > > > everything will launch fine one time, and then it will do this
> > > > RUNNING-but-no-Samza thing the next.
> > > >
> > > > I've been trying to get into the AM UI to see what's going on, but I
> > > > see the following error when I try accessing it:
> > > >
> > > > Problem accessing /proxy/application_1557265340810_0002/. Reason:
> > > >     Cannot assign requested address (Bind failed)
> > > > Caused by:
> > > > java.net.BindException: Cannot assign requested address (Bind failed)
> > > >
> > > > Has anybody seen this issue with the AM web interface? Also, are there
> > > > any other ways that I could introspect the YARN container to try and
> > > > deduce what's happening?
> > > >
> > > > Cheers,
> > > > Malcolm
> > > >
> > > >
> > > > --
> > > > Malcolm McFarland
> > > > Cavulus
> > > >
> > > >
> > > > This correspondence is from HealthPlanCRM, LLC, d/b/a Cavulus. Any
> > > > unauthorized or improper disclosure, copying, distribution, or use of
> > > > the contents of this message is prohibited. The information contained
> > > > in this message is intended only for the personal and confidential use
> > > > of the recipient(s) named above. If you have received this message in
> > > > error, please notify the sender immediately and delete the original
> > > > message.
> > > >
> > >
> > >
> > > --
> > > Malcolm McFarland
> > > Cavulus
> > > 1-800-760-6915
> > > mmcfarl...@cavulus.com
> > >
> > >
> > >
> >
> >
> > --
> > Jagadish V,
> > Graduate Student,
> > Department of Computer Science,
> > Stanford University
> >
>
>
> --
> Malcolm McFarland
> Cavulus
>
>
>


-- 
Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University
