Hi Ovidiu, Checking the /var/log/messages based on Greg's response revealed TMs were killed due to out of memory. Here's the node architecture. Each node has 128GB of RAM. I was trying to run 2 TMs per node binding each to 12 cores (or 1 socket). The total number of nodes were 16. I finally, managed to get it working with the following (non-default) settings.
taskmanager.heap.mb: 12288 taskmanager.numberOfTaskSlots: 12 akka.ask.timeout: 1000 s taskmanager.network.numberOfBuffers: 36864 Note, the number of buffers value, this had to be higher (twice in this case) than what's suggested in Flink (#slots-per-TM^2 * #TMs * 4, which would be 12*12*32*4 = 18432). Otherwise, it would throw me the not enough buffers error. Thank you, Saliya [image: Inline image 2] On Tue, Jul 12, 2016 at 7:39 AM, Ovidiu-Cristian MARCU < ovidiu-cristian.ma...@inria.fr> wrote: > Hi, > > Can you post your configuration parameters (exclude default settings) and > cluster description? > > Best, > Ovidiu > > On 11 Jul 2016, at 17:49, Saliya Ekanayake <esal...@gmail.com> wrote: > > Thank you Greg, I'll check if this was the cause for my TMs to disappear. > > On Mon, Jul 11, 2016 at 11:34 AM, Greg Hogan <c...@greghogan.com> wrote: > >> The OOM killer doesn't give warning so you'll need to call dmesg or look >> in /var/log/messages or similar. The following reports that Debian flavors >> may use /var/log/syslog. >> >> http://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer >> >> On Sun, Jul 10, 2016 at 11:55 PM, Saliya Ekanayake <esal...@gmail.com> >> wrote: >> >>> Greg, >>> >>> where did you see the OOM log as shown in this mail thread? In my case >>> none of the TaskManagers nor JobManger reports an error like this. >>> >>> On Sun, Jul 10, 2016 at 8:45 PM, Greg Hogan <c...@greghogan.com> wrote: >>> >>>> These symptoms sounds similar to what I was experiencing in the >>>> following thread. Flink can have some unexpected memory usage which can >>>> result in an OOM kill by the kernel, and this becomes more pronounced as >>>> the cluster size grows. >>>> https://www.mail-archive.com/dev@flink.apache.org/msg06346.html >>>> >>>> On Fri, Jul 8, 2016 at 12:46 PM, Saliya Ekanayake <esal...@gmail.com> >>>> wrote: >>>> >>>>> I checked, but JVMs didn't crash. No puppet or other services like >>>>> that. >>>>> >>>>> One thing I found is that things work OK when I have a smaller number >>>>> of slaves. For example, here I was trying to run on 16 nodes giving 2 TMs >>>>> each. Then I reduced it to 4 nodes each with 2 TMs, which worked. >>>>> >>>>> >>>>> >>>>> On Fri, Jul 8, 2016 at 12:31 PM, Robert Metzger <rmetz...@apache.org> >>>>> wrote: >>>>> >>>>>> Hi, >>>>>> from the TaskManager logs, I can not see anything suspicious. >>>>>> Its a bit weird that the TaskManager logs just end, without any >>>>>> shutdown messages. Usually the TMs log some shut down stuff when they are >>>>>> stopping. >>>>>> Also, if they would be still running, I would expect some error >>>>>> messages from akka about the connection status. >>>>>> So the only thing I conclude is that one of the TMs was killed by the >>>>>> OS or the JVM crashed. Did you check if that happened? >>>>>> >>>>>> Do you have any service like puppet that is controlling processes? >>>>>> >>>>>> >>>>>> On Thu, Jul 7, 2016 at 5:46 PM, Saliya Ekanayake <esal...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> I see two logs (attached), but there's only 1 TaskManger process. >>>>>>> Also, the Web console says it can find only 1 TM. >>>>>>> >>>>>>> However, I see this part in JM log, which shows there was a second >>>>>>> TM at one point, but it was unregistered. Any thoughts? >>>>>>> >>>>>>> -------------------------- >>>>>>> >>>>>>> - Registered TaskManager at j-002 (akka.tcp:// >>>>>>> flink@172.16.0.2:42888/user/taskmanager) as >>>>>>> 1c65415701f19978c8a8cdc75c993717. Current number of registered hosts is >>>>>>> 1. >>>>>>> Current number of alive task slots is 12. >>>>>>> >>>>>>> 2016-07-07 11:32:40,363 WARN akka.remote.ReliableDeliverySupervisor >>>>>>> - Association with remote system [akka.tcp://flink@172.16.0.2:42888] >>>>>>> has failed, address is now gated for [5000] ms. Reason is: >>>>>>> [Disassociated]. >>>>>>> >>>>>>> 2016-07-07 11:32:42,722 INFO >>>>>>> org.apache.flink.runtime.instance.InstanceManager - Registered >>>>>>> TaskManager >>>>>>> at j-002 (akka.tcp://flink@172.16.0.2:37373/user/taskmanager) as >>>>>>> 9c4ec66f5acbc19f7931fcae8345cd4e. Current number of registered hosts is >>>>>>> 2. >>>>>>> Current number of alive task slots is 24. >>>>>>> >>>>>>> 2016-07-07 11:33:15,316 WARN Remoting - Tried to associate with >>>>>>> unreachable remote address [akka.tcp://flink@172.16.0.2:42888]. >>>>>>> Address is now gated for 5000 ms, all messages to this address will be >>>>>>> delivered to dead letters. Reason: Connection refused: / >>>>>>> 172.16.0.2:42888 >>>>>>> >>>>>>> 2016-07-07 11:33:15,320 INFO >>>>>>> org.apache.flink.runtime.jobmanager.JobManager - Task manager >>>>>>> akka.tcp:// >>>>>>> flink@172.16.0.2:42888/user/taskmanager terminated. >>>>>>> 2016-07-07 11:33:15,320 INFO >>>>>>> org.apache.flink.runtime.instance.InstanceManager - Unregistered task >>>>>>> manager akka.tcp://flink@172.16.0.2:42888/user/taskmanager. Number >>>>>>> of registered task managers 1. Number of available slots 12. >>>>>>> >>>>>>> >>>>>>> On Thu, Jul 7, 2016 at 4:27 AM, Ufuk Celebi <u...@apache.org> wrote: >>>>>>> >>>>>>>> No that should suffice. Can you check whether there are any task >>>>>>>> manager logs for the second TM on that machine >>>>>>>> (taskmanager-X-j-011.log where X is the TM number)? If yes, the task >>>>>>>> manager process does start up and there is another problem. If not, >>>>>>>> the task managers seems not to start even. >>>>>>>> >>>>>>>> – Ufuk >>>>>>>> >>>>>>>> On Thu, Jul 7, 2016 at 7:34 AM, Saliya Ekanayake <esal...@gmail.com> >>>>>>>> wrote: >>>>>>>> > I tried to run more than one task manager per node by duplicating >>>>>>>> the slave >>>>>>>> > IPs. At startup it says for example, >>>>>>>> > >>>>>>>> > [INFO] 1 instance(s) of taskmanager are already running on j-011. >>>>>>>> > Starting taskmanager daemon on host j-011. >>>>>>>> > >>>>>>>> > but I only see 1 task manager process running. >>>>>>>> > >>>>>>>> > Is there anything else I need to do? >>>>>>>> > >>>>>>>> > On Sun, Jul 3, 2016 at 11:28 AM, Ufuk Celebi <u...@apache.org> >>>>>>>> wrote: >>>>>>>> >> >>>>>>>> >> Yes, exactly. >>>>>>>> >> >>>>>>>> >> On Sat, Jul 2, 2016 at 6:28 PM, Saliya Ekanayake < >>>>>>>> esal...@gmail.com> >>>>>>>> >> wrote: >>>>>>>> >> > Thank you, yes, it can be done externally, if not supported >>>>>>>> within >>>>>>>> >> > Flink. >>>>>>>> >> > >>>>>>>> >> > So the way to spawn multiple task managers would be to list >>>>>>>> the same >>>>>>>> >> > slave >>>>>>>> >> > machines N times as necessary in the slaves file? >>>>>>>> >> > >>>>>>>> >> > On Sat, Jul 2, 2016 at 11:22 AM, Ufuk Celebi <u...@apache.org> >>>>>>>> wrote: >>>>>>>> >> >> >>>>>>>> >> >> No, not inside of Flink. That sounds like something like the >>>>>>>> OS or >>>>>>>> >> >> resource manager should handle. >>>>>>>> >> >> >>>>>>>> >> >> On Sat, Jul 2, 2016 at 5:12 PM, Saliya Ekanayake < >>>>>>>> esal...@gmail.com> >>>>>>>> >> >> wrote: >>>>>>>> >> >> > That's great, so is there support to pin task managers to >>>>>>>> sockets as >>>>>>>> >> >> > well? >>>>>>>> >> >> > >>>>>>>> >> >> > On Sat, Jul 2, 2016 at 11:08 AM, Ufuk Celebi < >>>>>>>> u...@apache.org> wrote: >>>>>>>> >> >> >> >>>>>>>> >> >> >> Regarding 2) if you don't manually configure something >>>>>>>> else, that >>>>>>>> >> >> >> should happen always. >>>>>>>> >> >> >> >>>>>>>> >> >> >> Yes, you can run more than one task manager per node >>>>>>>> depending on >>>>>>>> >> >> >> the >>>>>>>> >> >> >> process isolation you want. Within a task manager, there >>>>>>>> are >>>>>>>> >> >> >> multiple >>>>>>>> >> >> >> threads for each slot. For example, if you have 2 task >>>>>>>> managers with >>>>>>>> >> >> >> 2 >>>>>>>> >> >> >> slots each and submit a job with parallelism 4, each task >>>>>>>> manager >>>>>>>> >> >> >> will >>>>>>>> >> >> >> execute 2 sub tasks in separate Threads. >>>>>>>> >> >> >> >>>>>>>> >> >> >> >>>>>>>> >> >> >> On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake < >>>>>>>> esal...@gmail.com> >>>>>>>> >> >> >> wrote: >>>>>>>> >> >> >> > Hi Ufuk, >>>>>>>> >> >> >> > >>>>>>>> >> >> >> > Looking at the document you sent it seems only 1 task >>>>>>>> manager per >>>>>>>> >> >> >> > node >>>>>>>> >> >> >> > exist >>>>>>>> >> >> >> > and within that you have multiple slots. Is it possible >>>>>>>> to run >>>>>>>> >> >> >> > more >>>>>>>> >> >> >> > than >>>>>>>> >> >> >> > 1 >>>>>>>> >> >> >> > task manager per node? Also, within a task manager is the >>>>>>>> >> >> >> > parallelism >>>>>>>> >> >> >> > done >>>>>>>> >> >> >> > through threads or processes? >>>>>>>> >> >> >> > >>>>>>>> >> >> >> > Thank you, >>>>>>>> >> >> >> > Saliya >>>>>>>> >> >> >> > >>>>>>>> >> >> >> > On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake >>>>>>>> >> >> >> > <esal...@gmail.com> >>>>>>>> >> >> >> > wrote: >>>>>>>> >> >> >> >> >>>>>>>> >> >> >> >> Thank you, I'll check these. >>>>>>>> >> >> >> >> >>>>>>>> >> >> >> >> In 2.) you said they are likely to exchange through >>>>>>>> memory. Is >>>>>>>> >> >> >> >> there >>>>>>>> >> >> >> >> a >>>>>>>> >> >> >> >> case why they wouldn't? >>>>>>>> >> >> >> >> >>>>>>>> >> >> >> >> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi < >>>>>>>> u...@apache.org> >>>>>>>> >> >> >> >> wrote: >>>>>>>> >> >> >> >>> >>>>>>>> >> >> >> >>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake >>>>>>>> >> >> >> >>> <esal...@gmail.com> >>>>>>>> >> >> >> >>> wrote: >>>>>>>> >> >> >> >>> > 1. What parameters are available to control >>>>>>>> parallelism within >>>>>>>> >> >> >> >>> > a >>>>>>>> >> >> >> >>> > node? >>>>>>>> >> >> >> >>> >>>>>>>> >> >> >> >>> Task Manager processing slots: >>>>>>>> >> >> >> >>> >>>>>>>> >> >> >> >>> >>>>>>>> >> >> >> >>> >>>>>>>> >> >> >> >>> >>>>>>>> >> >> >> >>> >>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots >>>>>>>> >> >> >> >>> >>>>>>>> >> >> >> >>> > 2. Does Flink support shared memory-based messaging >>>>>>>> within a >>>>>>>> >> >> >> >>> > node >>>>>>>> >> >> >> >>> > (without >>>>>>>> >> >> >> >>> > doing TCP calls)? >>>>>>>> >> >> >> >>> >>>>>>>> >> >> >> >>> Yes, local exchanges happen via memory and not TCP, >>>>>>>> for example >>>>>>>> >> >> >> >>> if >>>>>>>> >> >> >> >>> you >>>>>>>> >> >> >> >>> have a map-reduce, map subtask 1 and reduce subtask 1 >>>>>>>> are likely >>>>>>>> >> >> >> >>> to >>>>>>>> >> >> >> >>> exchange data locally. >>>>>>>> >> >> >> >>> >>>>>>>> >> >> >> >>> > 3. Is there support for Infiniband interconnect? >>>>>>>> >> >> >> >>> >>>>>>>> >> >> >> >>> No, not that I'm aware of. >>>>>>>> >> >> >> >>> >>>>>>>> >> >> >> >>> – Ufuk >>>>>>>> >> >> >> >> >>>>>>>> >> >> >> >> >>>>>>>> >> >> >> >> >>>>>>>> >> >> >> >> >>>>>>>> >> >> >> >> -- >>>>>>>> >> >> >> >> Saliya Ekanayake >>>>>>>> >> >> >> >> Ph.D. Candidate | Research Assistant >>>>>>>> >> >> >> >> School of Informatics and Computing | Digital Science >>>>>>>> Center >>>>>>>> >> >> >> >> Indiana University, Bloomington >>>>>>>> >> >> >> >> >>>>>>>> >> >> >> > >>>>>>>> >> >> >> > >>>>>>>> >> >> >> > >>>>>>>> >> >> >> > -- >>>>>>>> >> >> >> > Saliya Ekanayake >>>>>>>> >> >> >> > Ph.D. Candidate | Research Assistant >>>>>>>> >> >> >> > School of Informatics and Computing | Digital Science >>>>>>>> Center >>>>>>>> >> >> >> > Indiana University, Bloomington >>>>>>>> >> >> >> > >>>>>>>> >> >> > >>>>>>>> >> >> > >>>>>>>> >> >> > >>>>>>>> >> >> > >>>>>>>> >> >> > -- >>>>>>>> >> >> > Saliya Ekanayake >>>>>>>> >> >> > Ph.D. Candidate | Research Assistant >>>>>>>> >> >> > School of Informatics and Computing | Digital Science Center >>>>>>>> >> >> > Indiana University, Bloomington >>>>>>>> >> >> > >>>>>>>> >> > >>>>>>>> >> > >>>>>>>> >> > >>>>>>>> >> > >>>>>>>> >> > -- >>>>>>>> >> > Saliya Ekanayake >>>>>>>> >> > Ph.D. Candidate | Research Assistant >>>>>>>> >> > School of Informatics and Computing | Digital Science Center >>>>>>>> >> > Indiana University, Bloomington >>>>>>>> >> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > -- >>>>>>>> > Saliya Ekanayake >>>>>>>> > Ph.D. Candidate | Research Assistant >>>>>>>> > School of Informatics and Computing | Digital Science Center >>>>>>>> > Indiana University, Bloomington >>>>>>>> > >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Saliya Ekanayake >>>>>>> Ph.D. Candidate | Research Assistant >>>>>>> School of Informatics and Computing | Digital Science Center >>>>>>> Indiana University, Bloomington >>>>>>> >>>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> Saliya Ekanayake >>>>> Ph.D. Candidate | Research Assistant >>>>> School of Informatics and Computing | Digital Science Center >>>>> Indiana University, Bloomington >>>>> >>>>> >>>> >>> >>> >>> -- >>> Saliya Ekanayake >>> Ph.D. Candidate | Research Assistant >>> School of Informatics and Computing | Digital Science Center >>> Indiana University, Bloomington >>> >>> >> > > > -- > Saliya Ekanayake > Ph.D. Candidate | Research Assistant > School of Informatics and Computing | Digital Science Center > Indiana University, Bloomington > > > -- Saliya Ekanayake Ph.D. Candidate | Research Assistant School of Informatics and Computing | Digital Science Center Indiana University, Bloomington