Hi Max,

The ingestion covers end-of-day (EOD) processing from a Kafka source, so we 
receive a lot of data between 5pm and 8pm and none outside that window. The 
checkpoint is just storing the Kafka offset for restart.

It sounds like there could be an open buffer during the period of no data. I 
would have thought it would be cleared soon after data starts flowing again, 
though, and wouldn't lead to an increase in checkpoint size over a number of days.

Unless we are missing something in Beam and aren't actually triggering a new 
bundle start at any point, which would explain why the buffer continues to grow 
and is never flushed?

I am going to try to recreate this on a very simple test pipeline.

For reference, we are using Flink 1.8.0 and Apache Beam 2.16 at the moment.

Many thanks,

Steve


Stephen Hesketh | Client Analytics Technology
S +44 (0)7968 039848
+ [email protected] 
250 Bishopsgate | London | EC2M 4AA
The information classification of this email is Confidential unless otherwise 
stated. 


-----Original Message-----
From: Maximilian Michels [mailto:[email protected]] 
Sent: 22 April 2020 20:38
To: [email protected]; Hesketh, Stephen (Technology, NatWest Markets)
Subject: Re: Apache beam job on Flink checkpoint size growing over time



Hi Steve,

The Flink Runner buffers data as part of the checkpoint. This was
originally due to a limitation of Flink where we weren't able to end the
bundle before we persisted the state for a checkpoint. The cause is how
checkpoint barriers are emitted; I'll spare you the details*.

Does the data ingestion completely stop at some point? I'm asking because
the buffer is only flushed when a new bundle is started. So you might be
persisting data which could have already been flushed out.
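
To make that concrete, here is a toy model in plain Java. It is not Beam or
Flink Runner code, and every class and method name in it is hypothetical. It
assumes only what is described above: elements produced during a bundle are
held in a buffer, a checkpoint persists whatever is in the buffer at that
moment, and the buffer is flushed only when a new bundle starts.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: NOT the Flink Runner implementation. All names are hypothetical. */
public class BundleBufferModel {
    private final List<String> buffer = new ArrayList<>();
    private final List<String> flushed = new ArrayList<>();

    /** Assumption: starting a new bundle is the only point where the buffer is flushed. */
    public void startBundle() {
        flushed.addAll(buffer);
        buffer.clear();
    }

    /** Elements produced while a bundle is open are held back in the buffer. */
    public void processElement(String element) {
        buffer.add(element);
    }

    /** A checkpoint persists whatever is buffered; returns the number of elements persisted. */
    public int checkpoint() {
        return buffer.size();
    }

    /** Number of elements that have left the buffer so far. */
    public int flushedCount() {
        return flushed.size();
    }

    public static void main(String[] args) {
        BundleBufferModel op = new BundleBufferModel();
        op.startBundle();
        for (int i = 0; i < 3; i++) {
            op.processElement("record-" + i);
        }
        // Input goes idle: no new bundle starts, so each checkpoint still carries the records.
        System.out.println("idle checkpoint size: " + op.checkpoint());
        System.out.println("idle checkpoint size: " + op.checkpoint());
        // Data flows again: the next bundle start flushes the buffer.
        op.startBundle();
        System.out.println("after new bundle: " + op.checkpoint());
    }
}
```

In this model every checkpoint taken while the input is idle keeps persisting
the same buffered elements, and the buffered count only drops to zero once the
next bundle starts.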

Cheers,
Max

*Since Flink version 1.7 it is actually possible to flush all bundle
data before we send checkpoint barriers out, but that may also affect
checkpoint barrier alignment, so we opted for keeping the buffering
on checkpoints.

On 22.04.20 18:02, [email protected] wrote:
> One of our *Apache Beam* jobs running through
> the *FlinkRunner* is *experiencing odd behaviour with checkpoint
> size*. The state backend is file-based. The job receives traffic once a
> day for a period of an hour and then is idle until it receives more data.
> 
>  
> 
> The checkpoint slowly increments in size as we process more data.
> However, the size of the checkpoint does not decrease significantly once
> data has stopped being consumed for that day.
> 
> We thought it could potentially be a bottleneck in the database sink;
> however, the same behaviour is present if we remove the sink and simply
> dump the data.
> 
> The behaviour seems to resemble a stepped graph, e.g.
> 
> · checkpoint = *120KB* (starting size checkpoint)
> · checkpoint = *409MB* (starts receiving data)
> · checkpoint = *850MB* (processing the backlog data)
> · checkpoint = *503MB* (finished processing data)
> · checkpoint = *1.2GB* (begins processing new data and backlog)
> · checkpoint = *700MB* (finished processing data)
> · checkpoint = *700MB* (new starting size for checkpoint)
> · ...
> 
>  
> 
> Has anyone seen this behaviour before? Is this a known issue with Flink
> checkpointing when using Apache Beam?
> 
> Thanks,
> 
> Steve
> 
>  
> 
>  
> 
> *Stephen Hesketh | Client Analytics Technology*
> 
> The information classification of this email is Confidential unless
> otherwise stated.
> 
>  
> 
> 
> This communication and any attachments are confidential and intended
> solely for the addressee. If you are not the intended recipient please
> advise us immediately and delete it. Unless specifically stated in the
> message or otherwise indicated, you may not duplicate, redistribute or
> forward this message and any attachments are not intended for
> distribution to, or use by any person or entity in any jurisdiction or
> country where such distribution or use would be contrary to local law or
> regulation. NatWest Markets Plc  or any affiliated entity ("NatWest
> Markets") accepts no responsibility for any changes made to this message
> after it was sent.
> Unless otherwise specifically indicated, the contents of this
> communication and its attachments are for information purposes only and
> should not be regarded as an offer or solicitation to buy or sell a
> product or service, confirmation of any transaction, a valuation,
> indicative price or an official statement. Trading desks may have a
> position or interest that is inconsistent with any views expressed in
> this message. In evaluating the information contained in this message,
> you should know that it could have been previously provided to other
> clients and/or internal NatWest Markets personnel, who could have
> already acted on it.
> NatWest Markets cannot provide absolute assurances that all electronic
> communications (sent or received) are secure, error free, not corrupted,
> incomplete or virus free and/or that they will not be lost,
> mis-delivered, destroyed, delayed or intercepted/decrypted by others.
> Therefore NatWest Markets disclaims all liability with regards to
> electronic communications (and the contents therein) if they are
> corrupted, lost destroyed, delayed, incomplete, mis-delivered,
> intercepted, decrypted or otherwise misappropriated by others.
> Any electronic communication that is conducted within or through NatWest
> Markets systems will be subject to being archived, monitored and
> produced to regulators and in litigation in accordance with NatWest
> Markets’ policy and local laws, rules and regulations. Unless expressly
> prohibited by local law, electronic communications may be archived in
> countries other than the country in which you are located, and may be
> treated in accordance with the laws and regulations of the country of
> each individual included in the entire chain.
> Copyright NatWest Markets Plc. All rights reserved. See
> https://www.nwm.com/disclaimer for further risk disclosure.


