Hi,

there is no known limitation in the strict sense, but you might run out of DFS space or job manager memory if you keep around a huge number of checkpoints. I wonder what reason you might have for wanting such a huge number of retained checkpoints? Usually keeping one checkpoint should do the job, maybe a couple more if you are very afraid of corruption that goes beyond your DFS's capabilities to handle. Is there a specific reason for this, or maybe a misconception about what increasing the number of retained checkpoints is good for?
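To put rough numbers on the space and memory concern, here is a minimal back-of-the-envelope sketch. The function name and both per-checkpoint sizes are hypothetical values chosen for illustration, not measurements from this job, and the estimate assumes every retained checkpoint is stored in full; if checkpoints share files (for example with incremental checkpoints), it overestimates the DFS usage.

```python
def retained_checkpoint_footprint(num_retained, avg_checkpoint_bytes,
                                  metadata_bytes_per_checkpoint):
    """Return a rough (dfs_bytes, job_manager_bytes) upper bound.

    Assumes each retained checkpoint is a full, independent copy on the DFS
    and that the job manager holds a fixed amount of metadata per checkpoint.
    Both assumptions are simplifications for illustration.
    """
    dfs_bytes = num_retained * avg_checkpoint_bytes
    job_manager_bytes = num_retained * metadata_bytes_per_checkpoint
    return dfs_bytes, job_manager_bytes


MIB = 1024 ** 2
GIB = 1024 ** 3

# Hypothetical sizes: 500 MiB of state and 1 MiB of metadata per checkpoint.
for retained in (1, 2880):
    dfs, jm = retained_checkpoint_footprint(retained, 500 * MIB, 1 * MIB)
    print(f"num-retained={retained}: "
          f"DFS ~{dfs / GIB:.2f} GiB, job manager ~{jm / MIB:.0f} MiB")
```

Plugging in the real average checkpoint size from the web UI should show quickly whether 2880 retained checkpoints fit comfortably or not.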
Best,
Stefan

> On 10 Jan 2018, at 08:54, Piotr Nowojski <pi...@data-artisans.com> wrote:
>
> Hi,
>
> Increasing akka’s timeouts is rarely a solution for any problem - it either
> does not help or just masks the issue, making it less visible. But yes, it is
> possible to bump the limits:
> https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/config.html#distributed-coordination-via-akka
>
> I don’t think that state.checkpoints.num-retained was designed to handle such
> a large number of retained checkpoints, so there may be some known or unknown
> limitations. Stefan, do you know something in this regard?
>
> In parallel, as with any other akka timeout, you should track down the root
> cause. This one warning line doesn’t tell much. Where does it come from?
> Client log? Job manager log? Task manager log? Please search on the opposite
> side of the timed-out connection for a possible root cause of the timeout,
> including:
> - possible errors/exceptions/warnings
> - long GC pauses or other blocking operations (possibly long unnatural gaps
>   in the logs)
> - machine health (CPU usage, disk usage, network connections)
>
> Piotrek
>
>> On 9 Jan 2018, at 16:38, Jose Miguel Tejedor Fernandez
>> <jose.fernan...@rovio.com> wrote:
>>
>> Hello,
>>
>> I have several stream jobs running (v. 1.3.1) in production which always
>> fail after a fixed period of around 30h of execution. This is the WARN
>> trace before failing:
>>
>> Association with remote system
>> [akka.tcp://fl...@ip-10-1-51-134.cloud-internal.acme.com:39876] has failed,
>> address is now gated for [5000] ms. Reason: [Association failed with
>> [akka.tcp://fl...@ip-10-1-51-134.cloud-internal.acme.com:39876]] Caused by:
>> [No response from remote for outbound association. Handshake timed out after
>> [20000 ms].
>>
>> The main change made to the job configuration was to increase
>> state.checkpoints.num-retained from 1 to 2880. I am using asynchronous
>> RocksDB to snapshot the state. (I attach some screenshots with the
>> checkpoint conf from the web UI.)
>>
>> Is my assumption correct that the increase of checkpoints.num-retained is
>> causing the problem? Is there any known issue regarding this?
>> Besides, is there any way to increase the Akka handshake timeout from the
>> current 20000 ms to a higher value? I thought it might be convenient to
>> increase the timeout to 1 minute instead.
>>
>> BR
>>
>>
>> <Screen Shot 2018-01-09 at 17.35.25.png><Screen Shot 2018-01-09 at
>> 17.35.18.png><Screen Shot 2018-01-09 at 17.35.00.png>
>
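
For the "long unnatural gaps in the logs" check that Piotr suggests above, a small script can do the scanning. This is only a sketch: the function names are made up for illustration, and it assumes each log line starts with a timestamp of the form `yyyy-MM-dd HH:mm:ss,SSS`; if your logging pattern differs, the regular expression needs adjusting. Lines without a timestamp (such as stack trace continuations) are skipped.

```python
import re
from datetime import datetime, timedelta

# Assumed timestamp prefix, e.g. "2018-01-09 16:38:00,123".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if there is none."""
    match = TIMESTAMP.match(line)
    if match is None:
        return None
    base = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return base + timedelta(milliseconds=int(match.group(2)))


def find_log_gaps(lines, threshold=timedelta(seconds=10)):
    """Return (previous, current, gap) for consecutive timestamped lines
    whose distance is at least `threshold`."""
    gaps = []
    previous = None
    for line in lines:
        current = parse_timestamp(line)
        if current is None:
            continue  # continuation line without its own timestamp
        if previous is not None and current - previous >= threshold:
            gaps.append((previous, current, current - previous))
        previous = current
    return gaps


# Made-up sample lines; in practice, pass an open log file instead.
sample = [
    "2018-01-09 16:38:00,100 INFO  first message",
    "2018-01-09 16:38:01,200 INFO  second message",
    "    at some.stack.Frame",
    "2018-01-09 16:38:31,200 WARN  third message",
]
for start, end, gap in find_log_gaps(sample):
    print(f"{gap} of silence between {start} and {end}")
```

Running it over the job manager and task manager logs around the time of the warning can show whether one side went quiet, for example during a long GC pause, before the handshake timed out.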