Apart from /etc/hosts and /bin/hostname, the only other relevant place might be /etc/resolv.conf, which you could modify to point to, e.g., a dnsmasq instance.
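One way to see which of these files actually matters is to ask the JVM directly. As far as we know, in OpenJDK on Linux `InetAddress.getLocalHost()` takes the name from the native `gethostname()` call, not from the `/bin/hostname` binary. It then resolves that name through the system resolver, which consults `/etc/hosts` and the servers in `/etc/resolv.conf` according to `/etc/nsswitch.conf`. The class below is a minimal, hypothetical diagnostic. The name `HostCheck` and its command-line arguments are our own and not part of Samza or YARN. It prints the local host name, canonical name and address the JVM sees. If given a host and port, such as the AM's HTTP endpoint discussed further down the thread, it also attempts a plain TCP connect with an assumed 3-second timeout.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.UnknownHostException;

/**
 * Hypothetical diagnostic (not part of Samza or YARN): prints what the JVM
 * resolves as the local host and, optionally, probes TCP reachability of a
 * host:port passed on the command line.
 */
public class HostCheck {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts and unresolvable host names.
            return false;
        }
    }

    public static void main(String[] args) {
        try {
            // The name comes from gethostname(); the address comes from resolving
            // that name via the system resolver (e.g. /etc/hosts, then DNS).
            InetAddress local = InetAddress.getLocalHost();
            System.out.println("hostname:  " + local.getHostName());
            System.out.println("canonical: " + local.getCanonicalHostName());
            System.out.println("address:   " + local.getHostAddress());
        } catch (UnknownHostException e) {
            System.out.println("local host name does not resolve: " + e.getMessage());
        }

        // Optional probe, e.g. against the AM's dynamically allocated HTTP port.
        if (args.length == 2) {
            String host = args[0];
            int port = Integer.parseInt(args[1]);
            System.out.println(host + ":" + port + " reachable: "
                    + canConnect(host, port, 3000));
        }
    }
}
```

Compile it with `javac HostCheck.java` and run `java HostCheck` inside a container, or `java HostCheck <host> <port>` to add the connectivity probe. Comparing the output on the AM's host with the output on another host should show whether the container-ID hostname resolves to a routable address.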
On Fri, May 31, 2019 at 2:43 PM Malcolm McFarland <mmcfarl...@cavulus.com> wrote:
> Hey Rayman,
>
> The ops group and I went through the configuration today and observed the YARN containers as they were coming up. We seem to have found the root of the problem, and I'm putting this out there for anybody else who's trying to do something similar on AWS ECS:
>
> The ECS container instances set their hostname to the container ID on startup (e.g., 717b6f75aaf8), and this looks like it's interfering with the YARN container startup process. This *seems* to be corroborated by the fact that containers that start on the same host as their AM appear to start fine (i.e., they can locally resolve their IP address correctly), but containers starting on other hosts don't seem to be. We were *not* having this problem on Fargate, and my only guess is that, given Fargate's intended use case as a replicated-services-in-the-cloud environment, AWS sets the hostname for Fargate-bound Docker containers on launch (e.g., ip-10-#-#-#.us-west-#.internal.local or whatever). (As a side note, we probably would have stuck with Fargate and not run into this problem, but Fargate instances are only allowed 10GB of disk space, and this wasn't enough for YARN's VM requirements.)
>
> I've been fishing around for a way to get Samza to resolve the hostname to something more publicly available. I've thus far tried a) changing the /etc/hosts file, and b) replacing the /bin/hostname binary in the container with a static script, but neither of these options seems to have an effect on Java's DNS resolution. Two further options I can think of are:
>
> - find some place in the Samza configuration where the hostname can be set explicitly; or
> - change just the right piece of information in the system so that java.net.InetAddress will resolve the localhost to something other than what's returned from /bin/hostname (I'm guessing it uses gethostname() on Ubuntu, could be wrong).
>
> Any ideas?
>
> Cheers,
> Malcolm McFarland
> Cavulus
>
>
> This correspondence is from HealthPlanCRM, LLC, d/b/a Cavulus. Any unauthorized or improper disclosure, copying, distribution, or use of the contents of this message is prohibited. The information contained in this message is intended only for the personal and confidential use of the recipient(s) named above. If you have received this message in error, please notify the sender immediately and delete the original message.
>
>
> On Fri, May 31, 2019 at 9:27 AM rayman preet <rayman7...@gmail.com> wrote:
>
> > Yes, I think your hunch is right. Each container queries the AM over HTTP to obtain the jobModel that it is supposed to run. The AM runs an HTTP server, usually on a dynamically allocated free port on the machine it's running on. So it's possible that a firewall rule blocks the container when it tries to reach this port on the AM's machine?
> >
> > --
> > thanks
> > rayman
> >
> > On Thu, May 30, 2019 at 5:30 PM Malcolm McFarland <mmcfarl...@cavulus.com> wrote:
> >
> > > Thanks for the image, appreciate you taking the effort to do that! I'm still hitting this wall. The AM will launch the container, the container will go from "accepted" to "running", but there will be no output from the container (I'm piping all of the Samza, org.apache, org.kafka, and our own application's logging output to a Kafka topic).
> > > During these periods, the container will hang out at ~100MB/8GB memory usage and stall. There's no error output when this happens; it just kind of stops. My suspicion is that our Ops group has a firewall rule up that's interfering with this, or maybe just isn't white-listing a port correctly, and if I could identify where the application is stalling, it'd probably help to narrow down the possibilities.
> > >
> > > Cheers,
> > > Malcolm McFarland
> > > Cavulus
> > >
> > >
> > > On Thu, May 30, 2019 at 1:39 PM rayman preet <rayman7...@gmail.com> wrote:
> > >
> > > > I uploaded the image here:
> > > > https://www.dropbox.com/s/rv57v165ysp12c5/samza%20flow.png?dl=0
> > > >
> > > > Are you still running into this issue?
> > > > Is there anything in the container's log that shows any exceptions/errors?
> > > >
> > > > On Wed, May 22, 2019 at 10:15 PM Malcolm McFarland <mmcfarl...@cavulus.com> wrote:
> > > >
> > > > > Hey rayman,
> > > > >
> > > > > What it looks like is that the AM has started and the container has started, but, e.g., here are the last messages I see in the Samza logs:
> > > > >
> > > > > 2019-05-23T05:10:45.048Z INFO Making a request for ANY_HOST
> > > > > 2019-05-23T05:10:45.057Z INFO Starting the container allocator thread
> > > > > 2019-05-23T05:10:47.098Z INFO Received new token for : <valid_host>:8032
> > > > > 2019-05-23T05:10:47.102Z INFO Container allocated from RM on <same_valid_host>
> > > > > 2019-05-23T05:10:47.105Z INFO Container allocated from RM on <same_valid_host>
> > > > >
> > > > > At this point, it seems to stall, and no more output is produced.
> > > > >
> > > > > Also, I couldn't see your diagram (it's possible my company's email filters attachments); can I see that on the web anywhere?
> > > > >
> > > > > Cheers,
> > > > > Malcolm
> > > > >
> > > > > On Wed, May 22, 2019 at 4:30 PM rayman preet <rayman7...@gmail.com> wrote:
> > > > >
> > > > > > Hi Malcolm,
> > > > > >
> > > > > > This figure (attached) gives an overview of the flow. Is this something you were looking for?
> > > > > >
> > > > > > Also, by "don't fully start up" do you mean that applications are missing some containers (but the ApplicationMaster is running)?
> > > > > > Or is the application missing entirely?
> > > > > >
> > > > > > --
> > > > > > thanks
> > > > > > rayman
> > > > > > [image: Samza Job Launch Sequence.png]
> > > > > >
> > > > > > On Tue, May 21, 2019 at 3:58 PM Malcolm McFarland <mmcfarl...@cavulus.com> wrote:
> > > > > >
> > > > > > > Hey Folks,
> > > > > > >
> > > > > > > I'm still trying to pin down why these applications are sometimes not starting. Everything looks fine in the YARN web UI and in the immediately available logs, but the applications don't always fully start up. Does anybody have a rundown of how to trace the Samza startup process on a YARN cluster, from Accepted status, to localization, to the application master startup, to the actual application's startup?
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Malcolm
> > > > > > >
> > > > > > > --
> > > > > > > Malcolm McFarland
> > > > > > > Cavulus
> > > > >
> > > > >
> > > > > --
> > > > > Malcolm McFarland
> > > > > Cavulus
> > > > > 1-800-760-6915
> > > > > mmcfarl...@cavulus.com
> > > >
> > > >
> > > > --
> > > > thanks
> > > > rayman
>
--
thanks
rayman