Thank you Greg, I'll check if this was the cause for my TMs to disappear.

On Mon, Jul 11, 2016 at 11:34 AM, Greg Hogan <c...@greghogan.com> wrote:
> The OOM killer doesn't give a warning, so you'll need to run dmesg or look
> in /var/log/messages or similar. The following reports that Debian flavors
> may use /var/log/syslog.
>
> http://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer
>
> On Sun, Jul 10, 2016 at 11:55 PM, Saliya Ekanayake <esal...@gmail.com>
> wrote:
>
>> Greg,
>>
>> where did you see the OOM log as shown in this mail thread? In my case
>> neither the TaskManagers nor the JobManager reports an error like this.
>>
>> On Sun, Jul 10, 2016 at 8:45 PM, Greg Hogan <c...@greghogan.com> wrote:
>>
>>> These symptoms sound similar to what I was experiencing in the
>>> following thread. Flink can have some unexpected memory usage which can
>>> result in an OOM kill by the kernel, and this becomes more pronounced as
>>> the cluster size grows.
>>> https://www.mail-archive.com/dev@flink.apache.org/msg06346.html
>>>
>>> On Fri, Jul 8, 2016 at 12:46 PM, Saliya Ekanayake <esal...@gmail.com>
>>> wrote:
>>>
>>>> I checked, but JVMs didn't crash. No puppet or other services like that.
>>>>
>>>> One thing I found is that things work OK when I have a smaller number
>>>> of slaves. For example, here I was trying to run on 16 nodes giving 2 TMs
>>>> each. Then I reduced it to 4 nodes each with 2 TMs, which worked.
>>>>
>>>> On Fri, Jul 8, 2016 at 12:31 PM, Robert Metzger <rmetz...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>> from the TaskManager logs, I cannot see anything suspicious.
>>>>> It's a bit weird that the TaskManager logs just end, without any
>>>>> shutdown messages. Usually the TMs log some shutdown messages when they
>>>>> are stopping.
>>>>> Also, if they were still running, I would expect some error
>>>>> messages from akka about the connection status.
>>>>> So the only thing I can conclude is that one of the TMs was killed by
>>>>> the OS or the JVM crashed. Did you check if that happened?
>>>>>
>>>>> Do you have any service like puppet that is controlling processes?
>>>>>
>>>>> On Thu, Jul 7, 2016 at 5:46 PM, Saliya Ekanayake <esal...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I see two logs (attached), but there's only 1 TaskManager process.
>>>>>> Also, the Web console says it can find only 1 TM.
>>>>>>
>>>>>> However, I see this part in the JM log, which shows there was a second
>>>>>> TM at one point, but it was unregistered. Any thoughts?
>>>>>>
>>>>>> --------------------------
>>>>>>
>>>>>> - Registered TaskManager at j-002 (akka.tcp://flink@172.16.0.2:42888/user/taskmanager) as 1c65415701f19978c8a8cdc75c993717. Current number of registered hosts is 1. Current number of alive task slots is 12.
>>>>>>
>>>>>> 2016-07-07 11:32:40,363 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@172.16.0.2:42888] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
>>>>>>
>>>>>> 2016-07-07 11:32:42,722 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at j-002 (akka.tcp://flink@172.16.0.2:37373/user/taskmanager) as 9c4ec66f5acbc19f7931fcae8345cd4e. Current number of registered hosts is 2. Current number of alive task slots is 24.
>>>>>>
>>>>>> 2016-07-07 11:33:15,316 WARN Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@172.16.0.2:42888]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /172.16.0.2:42888
>>>>>>
>>>>>> 2016-07-07 11:33:15,320 INFO org.apache.flink.runtime.jobmanager.JobManager - Task manager akka.tcp://flink@172.16.0.2:42888/user/taskmanager terminated.
>>>>>> 2016-07-07 11:33:15,320 INFO org.apache.flink.runtime.instance.InstanceManager - Unregistered task manager akka.tcp://flink@172.16.0.2:42888/user/taskmanager. Number of registered task managers 1. Number of available slots 12.
>>>>>>
>>>>>> On Thu, Jul 7, 2016 at 4:27 AM, Ufuk Celebi <u...@apache.org> wrote:
>>>>>>
>>>>>>> No, that should suffice. Can you check whether there are any task
>>>>>>> manager logs for the second TM on that machine
>>>>>>> (taskmanager-X-j-011.log where X is the TM number)? If yes, the task
>>>>>>> manager process does start up and there is another problem. If not,
>>>>>>> the task manager seems not to start at all.
>>>>>>>
>>>>>>> – Ufuk
>>>>>>>
>>>>>>> On Thu, Jul 7, 2016 at 7:34 AM, Saliya Ekanayake <esal...@gmail.com>
>>>>>>> wrote:
>>>>>>> > I tried to run more than one task manager per node by duplicating the
>>>>>>> > slave IPs. At startup it says, for example,
>>>>>>> >
>>>>>>> > [INFO] 1 instance(s) of taskmanager are already running on j-011.
>>>>>>> > Starting taskmanager daemon on host j-011.
>>>>>>> >
>>>>>>> > but I only see 1 task manager process running.
>>>>>>> >
>>>>>>> > Is there anything else I need to do?
>>>>>>> >
>>>>>>> > On Sun, Jul 3, 2016 at 11:28 AM, Ufuk Celebi <u...@apache.org> wrote:
>>>>>>> >>
>>>>>>> >> Yes, exactly.
>>>>>>> >>
>>>>>>> >> On Sat, Jul 2, 2016 at 6:28 PM, Saliya Ekanayake <esal...@gmail.com>
>>>>>>> >> wrote:
>>>>>>> >> > Thank you, yes, it can be done externally, if not supported within
>>>>>>> >> > Flink.
>>>>>>> >> >
>>>>>>> >> > So the way to spawn multiple task managers would be to list the same
>>>>>>> >> > slave machines N times as necessary in the slaves file?
>>>>>>> >> >
>>>>>>> >> > On Sat, Jul 2, 2016 at 11:22 AM, Ufuk Celebi <u...@apache.org> wrote:
>>>>>>> >> >>
>>>>>>> >> >> No, not inside of Flink.
>>>>>>> >> >> That sounds like something the OS or resource manager should
>>>>>>> >> >> handle.
>>>>>>> >> >>
>>>>>>> >> >> On Sat, Jul 2, 2016 at 5:12 PM, Saliya Ekanayake <esal...@gmail.com>
>>>>>>> >> >> wrote:
>>>>>>> >> >> > That's great, so is there support to pin task managers to sockets
>>>>>>> >> >> > as well?
>>>>>>> >> >> >
>>>>>>> >> >> > On Sat, Jul 2, 2016 at 11:08 AM, Ufuk Celebi <u...@apache.org> wrote:
>>>>>>> >> >> >>
>>>>>>> >> >> >> Regarding 2) if you don't manually configure something else, that
>>>>>>> >> >> >> should always happen.
>>>>>>> >> >> >>
>>>>>>> >> >> >> Yes, you can run more than one task manager per node depending on
>>>>>>> >> >> >> the process isolation you want. Within a task manager, there are
>>>>>>> >> >> >> multiple threads for each slot. For example, if you have 2 task
>>>>>>> >> >> >> managers with 2 slots each and submit a job with parallelism 4,
>>>>>>> >> >> >> each task manager will execute 2 subtasks in separate threads.
>>>>>>> >> >> >>
>>>>>>> >> >> >> On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake <esal...@gmail.com>
>>>>>>> >> >> >> wrote:
>>>>>>> >> >> >> > Hi Ufuk,
>>>>>>> >> >> >> >
>>>>>>> >> >> >> > Looking at the document you sent, it seems only 1 task manager per
>>>>>>> >> >> >> > node exists, and within that you have multiple slots. Is it
>>>>>>> >> >> >> > possible to run more than 1 task manager per node? Also, within a
>>>>>>> >> >> >> > task manager, is the parallelism done through threads or
>>>>>>> >> >> >> > processes?
>>>>>>> >> >> >> >
>>>>>>> >> >> >> > Thank you,
>>>>>>> >> >> >> > Saliya
>>>>>>> >> >> >> >
>>>>>>> >> >> >> > On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake <esal...@gmail.com>
>>>>>>> >> >> >> > wrote:
>>>>>>> >> >> >> >>
>>>>>>> >> >> >> >> Thank you, I'll check these.
>>>>>>> >> >> >> >>
>>>>>>> >> >> >> >> In 2.) you said they are likely to exchange through memory. Is
>>>>>>> >> >> >> >> there a case where they wouldn't?
>>>>>>> >> >> >> >>
>>>>>>> >> >> >> >> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <u...@apache.org>
>>>>>>> >> >> >> >> wrote:
>>>>>>> >> >> >> >>>
>>>>>>> >> >> >> >>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake <esal...@gmail.com>
>>>>>>> >> >> >> >>> wrote:
>>>>>>> >> >> >> >>> > 1. What parameters are available to control parallelism within
>>>>>>> >> >> >> >>> > a node?
>>>>>>> >> >> >> >>>
>>>>>>> >> >> >> >>> Task Manager processing slots:
>>>>>>> >> >> >> >>>
>>>>>>> >> >> >> >>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
>>>>>>> >> >> >> >>>
>>>>>>> >> >> >> >>> > 2. Does Flink support shared memory-based messaging within a
>>>>>>> >> >> >> >>> > node (without doing TCP calls)?
>>>>>>> >> >> >> >>>
>>>>>>> >> >> >> >>> Yes, local exchanges happen via memory and not TCP. For example,
>>>>>>> >> >> >> >>> if you have a map-reduce, map subtask 1 and reduce subtask 1 are
>>>>>>> >> >> >> >>> likely to exchange data locally.
>>>>>>> >> >> >> >>>
>>>>>>> >> >> >> >>> > 3. Is there support for Infiniband interconnect?
>>>>>>> >> >> >> >>>
>>>>>>> >> >> >> >>> No, not that I'm aware of.
>>>>>>> >> >> >> >>>
>>>>>>> >> >> >> >>> – Ufuk

--
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
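
For anyone checking the same thing: the dmesg / syslog lookup Greg describes can be scripted. The following is only a minimal sketch, assuming the kernel log contains a line of the form "Killed process <pid> (<name>)", which is what Linux kernels commonly emit after an OOM kill; the exact wording varies between kernel versions. The helper name find_oom_kills and the way the log file is passed on the command line are illustrative choices, not anything from Flink or from this thread.

```python
import re
import sys

# Assumed message shape: "... Killed process 1234 (java) total-vm:...".
# Kernel versions differ in the surrounding text, so only this part is matched.
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_text):
    """Return a list of (pid, process_name) for every OOM-kill line found."""
    return [(int(m.group(1)), m.group(2)) for m in OOM_KILL.finditer(log_text)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass a saved kernel log, e.g. the output of `dmesg` written to a file,
    # or /var/log/messages (/var/log/syslog on Debian flavors).
    with open(sys.argv[1], errors="replace") as f:
        for pid, name in find_oom_kills(f.read()):
            print(f"OOM killer terminated pid {pid} ({name})")
```

If a "java" process shows up in the output around the time a TaskManager log stops, that would be consistent with the TM having been killed by the kernel rather than shutting down on its own.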