Hey Jeremiah,

That's basically what I mean by "all at once"; i.e., a script that submits
multiple tasks in sequence without breaks. I'd be surprised if YARN
couldn't handle that, but I just thought I'd ask.

It really looks like it's a problem with the way our infrastructure is
configured. I think I need to find which logs to look at in order to trace
the startup/localization steps. If anybody has advice on this, I'd
appreciate it!
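
For reference, the thread dump quoted below shows the main thread sleeping in a retry loop while reading a URL over HTTP (HttpUtil.read, called from StreamAppender.getConfig). Here is a minimal Python sketch of a reachability check that could be run from a container host to see whether that kind of request can succeed at all. The `probe` function, its parameters, and the backoff values are illustrative assumptions and not part of Samza; the URL to test would be whatever coordinator address the AM log reports.

```python
import time
import urllib.error
import urllib.request


def probe(url, attempts=5, initial_delay=0.5, max_delay=8.0, timeout=3.0,
          opener=urllib.request.urlopen, sleep=time.sleep):
    """Try to GET `url`, backing off exponentially between failed attempts.

    Returns (True, http_status) as soon as the server answers, or
    (False, last_error) once all attempts are used up.
    `opener` and `sleep` are injectable so the loop can be tested offline.
    """
    delay = initial_delay
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return True, response.status
        except urllib.error.HTTPError as exc:
            # The server answered, so the network path works even though
            # the status code is an error.
            return True, exc.code
        except OSError as exc:
            # URLError, ConnectionRefusedError and socket timeouts are all
            # OSError subclasses: the host or port could not be reached.
            last_error = exc
            if attempt < attempts:
                sleep(delay)
                delay = min(delay * 2, max_delay)
    return False, last_error
```

For example, `probe("http://ba6ecb67825e:34205/")` would test the coordinator URL that appears in the AM log further down. If that returns `(False, ...)` from the container host but succeeds from the AM host, that would suggest a networking or hostname-resolution problem rather than something in Samza itself.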

Cheers,
Malcolm

On Mon, May 20, 2019 at 9:49 PM Malcolm McFarland <mmcfarl...@cavulus.com>
wrote:
>
> Hey Jagadish,
>
> Thanks for the tip. I'm including the stacktrace for a running, stuck
> process. Usually when I try to restart a single process manually, it
> works fine. This brings up the question: are there any
> resource-contention issues with submitting multiple jobs to YARN at
> once?
>
> Also, I was using the same tarball to launch multiple Samza tasks
> (but with different inputs and config). Can YARN handle this, or does
> each running application need to have its own tarball? I ask this
> because when I simulated using different applications by using
> tarballs with different names, the chance of the applications
> launching on the first try seemed to improve.
>
> Cheers,
> Malcolm
>
>
> ====== STACK TRACE FOR STUCK PROCESS ======
>
> "Attach Listener" #8 daemon prio=9 os_prio=0 tid=0x00007ffb6c753800
> nid=0x3a5 waiting on condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "Service Thread" #7 daemon prio=9 os_prio=0 tid=0x00007ffb6c0c2800
> nid=0x2c5 runnable [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "C1 CompilerThread1" #6 daemon prio=9 os_prio=0 tid=0x00007ffb6c0bd000
> nid=0x2c4 waiting on condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "C2 CompilerThread0" #5 daemon prio=9 os_prio=0 tid=0x00007ffb6c0ba000
> nid=0x2c3 waiting on condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "Signal Dispatcher" #4 daemon prio=9 os_prio=0 tid=0x00007ffb6c0b8000
> nid=0x2c2 runnable [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "Finalizer" #3 daemon prio=8 os_prio=0 tid=0x00007ffb6c091800
> nid=0x2c1 in Object.wait() [0x00007ffb5cdfc000]
>    java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   - waiting on <0x00000000e3000af0> (a java.lang.ref.ReferenceQueue$Lock)
>   at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144)
>   - locked <0x00000000e3000af0> (a java.lang.ref.ReferenceQueue$Lock)
>   at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:165)
>   at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:216)
>
> "Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007ffb6c08f000
> nid=0x2c0 in Object.wait() [0x00007ffb5cefd000]
>    java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   - waiting on <0x00000000e3008660> (a java.lang.ref.Reference$Lock)
>   at java.lang.Object.wait(Object.java:502)
>   at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
>   - locked <0x00000000e3008660> (a java.lang.ref.Reference$Lock)
>   at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)
>
> "main" #1 prio=5 os_prio=0 tid=0x00007ffb6c01b800 nid=0x2bc waiting on
> condition [0x00007ffb75876000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>   at java.lang.Thread.sleep(Native Method)
>   at org.apache.samza.util.ExponentialSleepStrategy$RetryLoopState.sleep(ExponentialSleepStrategy.scala:113)
>   at org.apache.samza.util.ExponentialSleepStrategy.run(ExponentialSleepStrategy.scala:99)
>   at org.apache.samza.util.HttpUtil$.read(HttpUtil.scala:42)
>   at org.apache.samza.util.HttpUtil.read(HttpUtil.scala)
>   at org.apache.samza.logging.log4j.StreamAppender.getConfig(StreamAppender.java:272)
>   at org.apache.samza.logging.log4j.StreamAppender.setupSystem(StreamAppender.java:283)
>   at org.apache.samza.logging.log4j.StreamAppender.activateOptions(StreamAppender.java:154)
>   at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:307)
>   at org.apache.log4j.xml.DOMConfigurator.parseAppender(DOMConfigurator.java:295)
>   at org.apache.log4j.xml.DOMConfigurator.findAppenderByName(DOMConfigurator.java:176)
>   at org.apache.log4j.xml.DOMConfigurator.findAppenderByReference(DOMConfigurator.java:191)
>   at org.apache.log4j.xml.DOMConfigurator.parseChildrenOfLoggerElement(DOMConfigurator.java:523)
>   at org.apache.log4j.xml.DOMConfigurator.parseRoot(DOMConfigurator.java:492)
>   - locked <0x00000000e3009558> (a org.apache.log4j.spi.RootLogger)
>   at org.apache.log4j.xml.DOMConfigurator.parse(DOMConfigurator.java:1006)
>   at org.apache.log4j.xml.DOMConfigurator.doConfigure(DOMConfigurator.java:872)
>   at org.apache.log4j.xml.DOMConfigurator.doConfigure(DOMConfigurator.java:778)
>   at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
>   at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
>   at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:73)
>   - locked <0x00000000e3009788> (a org.slf4j.impl.Log4jLoggerFactory)
>   at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:253)
>   at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:265)
>   at org.apache.samza.runtime.AbstractApplicationRunner.<clinit>(AbstractApplicationRunner.java:50)
>
> On Mon, May 13, 2019 at 8:01 PM Jagadish Venkatraman
> <jagadish1...@gmail.com> wrote:
>
> > Can you attach
> >
> > (1) the full log file for the AM?
> > (2) the full log file for the container
> >
> > > Btw, how do I pull a thread-dump of the stuck container?
> >
> > Should not be any different than a thread dump for a Java process.
> >
> > Log in to the machine YARN shows the container as running, locate its
> > process-id and use jstack on the PID
> > <https://confluence.atlassian.com/doc/generating-a-thread-dump-externally-182158040.html#GeneratingaThreadDumpExternally-GeneratingthreaddumpsonLinux>
> >
> > > java.net.ConnectException
> >
> > Since this is a ConnectException, can you rule out network issues? Can the
> > AM host and container host communicate?
> >
> > On Fri, May 10, 2019 at 9:47 AM Malcolm McFarland <mmcfarl...@cavulus.com>
> > wrote:
> >
> > > Hey all,
> > >
> > > Logs are working, the AM process is running. I haven't hit a "known good
> > > version" yet; the deploy seems to be hitting this wall each time, which,
> > > once again, seems to fail slightly differently each time. Looking at the
> > > node manager logs, I am seeing this line being repeated:
> > >
> > > 2019-05-10 05:48:07,705 WARN [org.apache.samza.util.Util$:74] Error getting
> > > response from Job coordinator server. received IOException: class
> > > java.net.ConnectException. Retrying...
> > >
> > > There's no other information in the log about what is going on. Does
> > > anybody have ideas on this?
> > >
> > > Btw, how do I pull a thread-dump of the stuck container?
> > >
> > > Cheers,
> > > Malcolm
> > >
> > > On Tue, May 7, 2019 at 10:48 PM Jagadish Venkatraman
> > > <jagadish1...@gmail.com> wrote:
> > >
> > > > Malcolm,
> > > >
> > > > Did the AM-process come up? If so, can you attach its entire log-file?
> > > >
> > > > "> everything will launch fine one time, and then it will do this
> > > > RUNNING-but-no-Samza thing the next."
> > > >
> > > > IIUC, you believe your container is not making progress. If the issue is
> > > > recurs, can you attach a thread-dump & log-file(s) of the "stuck"
> > > > container?
> > > >
> > > > "> my logs are showing that Samza is not actually starting inside of the
> > > > container"
> > > >
> > > > Can you confirm that logging is actually working? eg: have you verified
> > > > there is only one log4j binding in your class-path?
> > > >
> > > > Did anything change on your end? eg: did you upgrade to a new Samza
> > > > version/ app-version/yarn-version?
> > > >
> > > > Can you roll-back to a known-good version to better isolate the issue?
> > > >
> > > > Best,
> > > > Jagadish
> > > >
> > > > On Tue, May 7, 2019 at 3:54 PM Malcolm McFarland <mmcfarl...@cavulus.com>
> > > > wrote:
> > > >
> > > > > As a followup to this, here's what I see when the Samza app tries to
> > > > > start; it actually seems to be getting to the run-container script,
> > > > > and then stops:
> > > > >
> > > > > Kafka version : 0.11.0.2
> > > > > Kafka commitId : 73be1e1168f91ee2
> > > > > Error registering AppInfo mbean
> > > > > Started coordinator stream writer.
> > > > > sent SetConfig message with key = samza.autoscaling.server.url and value = http://ba6ecb67825e:34205/
> > > > > Stopping the coordinator stream producer.
> > > > > Stopping coordinator stream producer.
> > > > > Stopping producer for system: kafka
> > > > > Closing the Kafka producer with timeoutMillis = 9223372036854775807 ms.
> > > > > Webapp is started at (rpc http://ba6ecb67825e:35629/, tracking http://ba6ecb67825e:34151/, coordinator http://ba6ecb67825e:34205/)
> > > > > Starting YarnContainerManager.
> > > > > Upper bound of the thread pool size is 500
> > > > > yarn.client.max-cached-nodemanagers-proxies : 0
> > > > > Got AM register response. The YARN RM supports container requests with max-mem: 8192, max-cpu: 32
> > > > > Finished starting YarnContainerManager
> > > > > Starting the Samza task manager
> > > > > Resource Request created for 0 on ANY_HOST at 1557268807252
> > > > > Requesting resources on  ANY_HOST for container 0
> > > > > Making a request for ANY_HOST
> > > > > Starting the container allocator thread
> > > > > Received new token for : ip-10-60-31-121.us-west-2.compute.internal:8032
> > > > > Container allocated from RM on ip-10-60-31-121.us-west-2.compute.internal
> > > > > Container allocated from RM on ip-10-60-31-121.us-west-2.compute.internal
> > > > > Host affinity not enabled. Saving the samzaResource container_e39_1557265340810_0003_01_000002 in the buffer for ANY_HOST
> > > > > Returning a buffered resource: container_e39_1557265340810_0003_01_000002 for ANY_HOST from preferred-host buffer.
> > > > > Returning a buffered resource: container_e39_1557265340810_0003_01_000002 for ANY_HOST from preferred-host buffer.
> > > > > Cancelling request SamzaResourceRequest{numCores=4, memoryMB=8192, preferredHost='ANY_HOST', requestID='1507e2c5-e437-409b-821c-ef505ee19b85', containerID=0, requestTimestampMs=1557268807252}
> > > > > Found available resources on ANY_HOST. Assigning request for container_id 0 with timestamp 1557268807252 to resource container_e39_1557265340810_0003_01_000002
> > > > > Received launch request for 0 on hostname ip-10-60-31-121.us-west-2.compute.internal
> > > > > Got available container ID (0) for container: Container: [ContainerId: container_e39_1557265340810_0003_01_000002, NodeId: ip-10-60-31-121.us-west-2.compute.internal:8032, NodeHttpAddress: ip-10-60-31-121.us-west-2.compute.internal:8088, Resource: <memory:8192, vCores:4>, Priority: 1, Token: Token { kind: ContainerToken, service: 10.60.31.121:8032 }, ]
> > > > > In runContainer in util: fwkPath= ;cmdPath=./__package/;jobLib=
> > > > > Container ID 0 using command ./__package//bin/run-container.sh
> > > > >
> > > > > Cheers,
> > > > > Malcolm
> > > > >
> > > > > On Tue, May 7, 2019 at 3:22 PM Malcolm McFarland
> > > > > <mmcfarl...@cavulus.com> wrote:
> > > > >
> > > > > > Hey folks,
> > > > > >
> > > > > > We're having some trouble running Samza under YARN. The YARN
> > > > > > containers are launching fully into the RUNNING state, and I can see
> > > > > > in the node manager logs that the containers are running, but my logs
> > > > > > are showing that Samza is not actually starting inside of the
> > > > > > container. What's really curious is that this is intermittent;
> > > > > > everything will launch fine one time, and then it will do this
> > > > > > RUNNING-but-no-Samza thing the next.
> > > > > >
> > > > > > I've been trying to get into the AM UI to see what's going on, but I
> > > > > > see the following error when I try accessing it:
> > > > > >
> > > > > > Problem accessing /proxy/application_1557265340810_0002/. Reason:
> > > > > >      Cannot assign requested address (Bind failed)
> > > > > > Caused by:
> > > > > > java.net.BindException: Cannot assign requested address (Bind failed)
> > > > > >
> > > > > > Has anybody seen this issue with the AM web interface? Also, are there
> > > > > > any other ways that I could introspect the YARN container to try and
> > > > > > deduce what's happening?
> > > > > >
> > > > > > Cheers,
> > > > > > Malcolm
> > > > > >
> > > > > > --
> > > > > > Malcolm McFarland
> > > > > > Cavulus
> > > > > >
> > > > > > This correspondence is from HealthPlanCRM, LLC, d/b/a Cavulus. Any
> > > > > > unauthorized or improper disclosure, copying, distribution, or use of
> > > > > > the contents of this message is prohibited. The information contained
> > > > > > in this message is intended only for the personal and confidential use
> > > > > > of the recipient(s) named above. If you have received this message in
> > > > > > error, please notify the sender immediately and delete the original
> > > > > > message.
> > > > >
> > > > > --
> > > > > Malcolm McFarland
> > > > > Cavulus
> > > > > 1-800-760-6915
> > > > > mmcfarl...@cavulus.com
> > > >
> > > > --
> > > > Jagadish V,
> > > > Graduate Student,
> > > > Department of Computer Science,
> > > > Stanford University
> > >
> > > --
> > > Malcolm McFarland
> > > Cavulus
> >
> > --
> > Jagadish V,
> > Graduate Student,
> > Department of Computer Science,
> > Stanford University
>
> --
> Malcolm McFarland
> Cavulus
> 1-800-760-6915
> mmcfarl...@cavulus.com



-- 
Malcolm McFarland
Cavulus


