Re: Stream job failed after increasing number retained checkpoints

Stefan Richter Wed, 10 Jan 2018 00:50:46 -0800

Hi,

There is no known limitation in the strict sense, but you might run out of DFS 
space or job manager memory if you keep a huge number of checkpoints around. I 
wonder what reason you have for wanting such a huge number of 
retained checkpoints? Usually keeping one checkpoint should do the job, maybe a 
couple more if you are very worried about corruption that goes beyond your DFS's 
ability to handle it. Is there a specific reason for that, or maybe a 
misconception about what increasing the number of retained checkpoints is good for?
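
To make the space concern concrete, here is a rough back-of-the-envelope 
sketch in Python. It assumes every retained checkpoint is a full, independent 
snapshot (no state shared between checkpoints) and that the job manager keeps 
a fixed amount of metadata per retained checkpoint. The function name and the 
sizes are hypothetical placeholders, not values measured from your job.

```python
def estimate_retained_footprint(num_retained, avg_checkpoint_bytes,
                                metadata_bytes_per_checkpoint):
    """Return (dfs_bytes, job_manager_bytes) for the retained checkpoints.

    Assumption: each checkpoint is a full copy of the state, so the
    footprint grows linearly with the number of retained checkpoints.
    """
    dfs_bytes = num_retained * avg_checkpoint_bytes
    job_manager_bytes = num_retained * metadata_bytes_per_checkpoint
    return dfs_bytes, job_manager_bytes


if __name__ == "__main__":
    MIB = 1024 ** 2
    GIB = 1024 ** 3
    # Hypothetical sizes: 100 MiB of state and 50 KiB of metadata per checkpoint.
    for retained in (1, 2880):
        dfs, jm = estimate_retained_footprint(retained, 100 * MIB, 50 * 1024)
        print(f"num-retained={retained}: "
              f"DFS ~{dfs / GIB:.2f} GiB, job manager ~{jm / MIB:.2f} MiB")
```

With real numbers from the checkpoint statistics in the web UI plugged in, 
this gives a first idea of whether 2880 retained checkpoints is affordable at 
all.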


Best,
Stefan 

> On 10.01.2018 at 08:54, Piotr Nowojski <pi...@data-artisans.com> wrote:
> 
> Hi,
> 
> Increasing Akka's timeouts is rarely a solution to any problem: it either 
> does not help or just masks the issue, making it less visible. But yes, it is 
> possible to bump the limits: 
> https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/config.html#distributed-coordination-via-akka
> 
> I don't think that state.checkpoints.num-retained was designed to handle such 
> large numbers of retained checkpoints, so maybe there are some known/unknown 
> limitations. Stefan, do you know something in this regard?
> 
> In parallel, as with any other Akka timeout, you should 
> track down its root cause. This one warning line doesn't tell us much. 
> Where does it come from? Client log? Job manager log? Task manager log? 
> Please search on the opposite side of the timed-out connection for a possible 
> root cause of the timeout, including:
> - possible errors/exceptions/warnings
> - long GC pauses or other blocking operations (possibly long unnatural gaps 
> in the logs)
> - machine health (CPU usage, disk usage, network connections)
> 
> Piotrek
> 
>> On 9 Jan 2018, at 16:38, Jose Miguel Tejedor Fernandez 
>> <jose.fernan...@rovio.com> wrote:
>> 
>> Hello,
>> 
>> I have several stream jobs running (v. 1.3.1) in production which always 
>> fail after a fixed period of around 30h of execution. This is the 
>> WARN trace before the failure:
>> 
>> Association with remote system 
>> [akka.tcp://fl...@ip-10-1-51-134.cloud-internal.acme.com:39876] has failed, 
>> address is now gated for [5000] ms. Reason: [Association failed with 
>> [akka.tcp://fl...@ip-10-1-51-134.cloud-internal.acme.com:39876]] Caused by: 
>> [No response from remote for outbound association. Handshake timed out after 
>> [20000 ms].
>> 
>> The main change made to the job configuration was to increase 
>> state.checkpoints.num-retained from 1 to 2880. I am using RocksDB with 
>> asynchronous snapshots to persist the state. (I attach some screenshots with 
>> the checkpoint configuration from the web UI.)
>> 
>> Could my assumption be correct that the increase of 
>> state.checkpoints.num-retained is causing the problem? Is there any known 
>> issue regarding this?
>> Besides, is there any way to increase the Akka handshake timeout from the 
>> current 20000 ms to a higher value? I am considering increasing the 
>> timeout to 1 minute instead.
>> 
>> BR
>> 
>> 
>> <Screen Shot 2018-01-09 at 17.35.25.png><Screen Shot 2018-01-09 at 
>> 17.35.18.png><Screen Shot 2018-01-09 at 17.35.00.png>
>
