Re: 1.5.1

2018-09-17 Thread Juho Autio
For the record, this hasn't been a problem for us any more. Successfully running Flink 1.6.0. We set "web.timeout: 12" in flink-conf.yaml, but so far I've gathered that this setting doesn't have anything to do with heartbeat timeouts (?). Most likely the heartbeat timeouts were caused by some

Re: 1.5.1

2018-08-15 Thread Gary Yao
Hi Juho, the main thread of the RPC endpoint should never be blocked. Blocking on that thread is considered an implementation error. Unfortunately, without logs it is difficult to tell what the exact problem is. If you are able to reproduce heartbeat timeouts on your test staging environment, can

Re: 1.5.1

2018-08-15 Thread Juho Autio
Gary, I found another mail thread about a similar issue: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Testing-on-Flink-1-5-tp19565p19647.html Specifically I found this: > we are observing Akka.ask.timeout error for few of our jobs (JM's logs[2]), we tried to increase this pa

Re: 1.5.1

2018-08-15 Thread Juho Autio
Vishal, from which version did you upgrade to 1.5.1? Maybe from 1.5.0 (release)? Knowing that might help narrow down the source of this. On Wed, Aug 15, 2018 at 11:38 AM Juho Autio wrote: > Thanks Gary.. > > What could be blocking the RPC threads? Slow checkpointing? > > In production we're s

Re: 1.5.1

2018-08-15 Thread Juho Autio
Thanks Gary.. What could be blocking the RPC threads? Slow checkpointing? In production we're still using a self-built Flink package 1.5-SNAPSHOT, flink commit 8395508b0401353ed07375e22882e7581d46ac0e, and the jobs are stable. Now with 1.5.2 the same jobs are failing due to heartbeat timeouts ev

Re: 1.5.1

2018-08-14 Thread Gary Yao
Hi Juho, It seems in your case the JobMaster did not receive a heartbeat from the TaskManager in time [1]. Heartbeat requests and answers are sent over the RPC framework, and RPCs of one component (e.g., TaskManager, JobMaster, etc.) are dispatched by a single thread. Therefore, the reasons for he
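
To illustrate the point about a single dispatching thread, here is a minimal sketch in plain Python. It is not Flink code: the one-worker executor merely stands in for an endpoint's main thread, and `heartbeat_answered`, `block_seconds`, and `timeout_seconds` are hypothetical names chosen for this example. It shows how a heartbeat queued behind a blocking call can miss its deadline even though the component is otherwise healthy.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def heartbeat_answered(block_seconds, timeout_seconds):
    """Return True if a heartbeat queued behind a blocking call is answered in time."""
    # One worker thread stands in for the endpoint's single main thread (assumption).
    with ThreadPoolExecutor(max_workers=1) as main_thread:
        # A hypothetical long-running call that occupies the main thread.
        main_thread.submit(time.sleep, block_seconds)
        # The heartbeat request is queued behind it on the same thread.
        heartbeat = main_thread.submit(lambda: "pong")
        try:
            heartbeat.result(timeout=timeout_seconds)
            return True
        except FutureTimeout:
            # The answer did not arrive before the deadline.
            return False
```

With no blocking work the heartbeat is answered immediately; when the blocking call outlasts the timeout, the caller sees a timeout although nothing has crashed.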

Re: 1.5.1

2018-08-13 Thread Juho Autio
I also have jobs failing on a daily basis with the error "Heartbeat of TaskManager with id timed out". I'm using Flink 1.5.2. Could anyone suggest how to debug possible causes? I already set these in flink-conf.yaml, but I'm still getting failures: heartbeat.interval: 1 heartbeat.timeout: 10

Re: 1.5.1

2018-07-22 Thread Vishal Santoshi
According to the UI it seems that " org.apache.flink.util.FlinkException: The assigned slot 208af709ef7be2d2dfc028ba3bbf4600_10 was removed. " was the cause of a pipe restart. As to the TM it is an artifact of the new job allocation regime which will exhaust all slots on a TM rather than distrib

Re: 1.5.1

2018-07-22 Thread Gary Yao
Hi, The first exception should be only logged on info level. It's expected to see this exception when a TaskManager unregisters from the ResourceManager. Heartbeats can be configured via heartbeat.interval and heartbeat.timeout [1]. The default timeout is 50s, which should be a generous value. It
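
As a rough aid for checking these two settings, the sketch below parses `key: value` lines in the style of flink-conf.yaml and reports the effective heartbeat values. It is a standalone helper, not part of Flink; `parse_flink_conf` and `check_heartbeat` are hypothetical names. The values are assumed to be in milliseconds, the 50000 default follows the 50s figure mentioned above, and the 10000 default interval is an assumption to be verified against the documentation of the Flink version in use.

```python
def parse_flink_conf(text):
    """Parse simple 'key: value' lines as found in flink-conf.yaml."""
    conf = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        conf[key.strip()] = value.strip()
    return conf


def check_heartbeat(conf, default_interval_ms=10000, default_timeout_ms=50000):
    """Return (interval, timeout, timeout > interval) using assumed defaults."""
    # Unknown or misspelled keys are simply not found, so the defaults apply.
    interval = int(conf.get("heartbeat.interval", default_interval_ms))
    timeout = int(conf.get("heartbeat.timeout", default_timeout_ms))
    return interval, timeout, timeout > interval
```

Note that this helper silently ignores a misspelled key and falls back to the default, so it is worth comparing the reported values with what was intended.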

Re: 1.5.1

2018-07-05 Thread Chesnay Schepler
Release notes: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12343053 I'm currently building the release artifacts, if everything goes smoothly it should be released next week. On 05.07.2018 16:16, Vishal Santoshi wrote: We are planning to go to 1.5.0 next