Re: Standalone cluster instability

2018-09-19 Thread Piotr Nowojski
Hi, the JobManager is not responsible for restarting a TaskManager, and has no means to do so, when the TaskManager process is killed (it would need to ssh into the machine and restart it…). I don’t know for sure, but from your description of the problem I presume that Flink’s bash startup scripts do not contain

Re: Standalone cluster instability

2018-08-16 Thread Piotr Nowojski
Hi, I’m not aware of any such rules of thumb. Memory consumption is highly application- and workload-specific. It depends on how much you allocate in your user code and how much memory you keep in state (in the case of the heap state backend). Basically, just as with most Java applications, you hav

Re: Standalone cluster instability

2018-08-16 Thread Shailesh Jain
Thank you for your help, Piotrek. I think it was a combination of a. other processes taking up available memory and b. Flink processes consuming all the memory allocated to them, that resulted in the kernel running out of memory. Are there any heuristics or best practices which you (or anyone in the c

Re: Standalone cluster instability

2018-08-14 Thread Piotr Nowojski
Hi, Good that we are more or less on track with this problem :) But the problem here is not that the heap size is too small, but that your kernel is running out of memory and starts killing processes. Either: 1. some other process is using the available memory 2. Increase memory allocation on your

Re: Standalone cluster instability

2018-08-13 Thread Shailesh Jain
Hi Piotrek, Thanks for your reply. I checked through the syslogs for that time, and I see this: Aug 8 13:20:52 smoketest kernel: [1786160.856662] Out of memory: Kill process 2305 (java) score 468 or sacrifice child Aug 8 13:20:52 smoketest kernel: [1786160.859091] Killed process 2305 (java) tot
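
For anyone else checking syslog for the same symptom, here is a minimal sketch that pulls the OOM-killer entries out of a log. The regular expression is matched only to the "Out of memory: Kill process ..." line quoted above; the wording differs between kernel versions, so treat the pattern as an assumption to adjust. The function name find_oom_kills is hypothetical, not part of Flink or any library.

```python
import re

# Pattern modelled on the kernel line quoted in this message:
#   "... Out of memory: Kill process 2305 (java) score 468 or sacrifice child"
# Other kernel versions word this differently (assumption: adjust as needed).
OOM_RE = re.compile(
    r"Out of memory: Kill process (?P<pid>\d+) "
    r"\((?P<name>[^)]+)\) score (?P<score>\d+)"
)


def find_oom_kills(lines):
    """Return a list of (pid, process_name, score) for each OOM-killer line."""
    kills = []
    for line in lines:
        match = OOM_RE.search(line)
        if match:
            kills.append(
                (int(match.group("pid")), match.group("name"), int(match.group("score")))
            )
    return kills
```

It accepts any iterable of lines, so an open syslog file can be passed in directly; where that file lives depends on the distribution.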

Re: Standalone cluster instability

2018-08-10 Thread Piotr Nowojski
Hi, Please post the full TaskManager logs, including stderr and stdout. (Have you checked the stderr/stdout for any messages?) I can think of a couple of reasons: 1. process segfault 2. process killed by OS 3. OS failure 1. Should be visible by some message in stderr/stdout file and can be caused by

Re: Standalone cluster instability

2018-08-09 Thread Shailesh Jain
Hi, I hit a similar issue yesterday: the task manager died suspiciously, with no errors in the task manager logs, but I see the following exceptions in the job manager logs: 2018-08-05 18:03:28,322 ERROR akka.remote.Remoting - Association to [akka.tcp://fli

Re: Standalone cluster instability

2018-03-26 Thread Alexander Smirnov
Hi Piotr, I didn't find anything special in the logs before the failure. Here are the logs, please take a look: https://drive.google.com/drive/folders/1zlUDMpbO9xZjjJzf28lUX-bkn_x7QV59?usp=sharing The configuration is: 3 task managers: qafdsflinkw011.scl qafdsflinkw012.scl qafdsflinkw013.scl -

Re: Standalone cluster instability

2018-03-21 Thread Piotr Nowojski
Hi, Does the issue really happen after 48 hours? Is there any indication of a failure in the TaskManager log? If you are still unable to solve the problem, please provide full TaskManager and JobManager logs. Piotrek > On 21 Mar 2018, at 16:00, Alexander Smirnov > wrote: > > One more ques

Re: Standalone cluster instability

2018-03-21 Thread Alexander Smirnov
One more question - I see a lot of lines like the following in the logs: [2018-03-21 00:30:35,975] ERROR Association to [akka.tcp:// fl...@qafdsflinkw811.nn.five9lab.com:35320] with UID [1500204560] irrecoverably failed. Quarantining address. (akka.remote.Remoting) [2018-03-21 00:34:15,208] WARN Ass

Standalone cluster instability

2018-03-21 Thread Alexander Smirnov
Hello, I've assembled a standalone cluster of 3 task managers and 3 job managers (and 3 ZK) following the instructions at https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/deployment/cluster_setup.html and https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/jobmanager_hig