Making job fail on Checkpoint Expired?

2020-04-02 Thread Robin Cassan
Hi all, I am wondering if there is a way to make a Flink job fail (not cancel it) when one or several checkpoints have failed due to being expired (taking longer than the timeout)? I am using Flink 1.9.2 and have set `setTolerableCheckpointFailureNumber(1)`, which doesn't do the trick. Looking…
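
For reference, a minimal sketch (not taken from the thread) of the checkpoint settings being discussed, assuming a standard StreamExecutionEnvironment; the interval, timeout, and pipeline are placeholder values. Note that on Flink 1.9/1.10 an expired checkpoint was not counted against the tolerable-failure threshold, which is the behavior questioned here (see FLINK-17043, referenced later in this thread).

    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointFailureConfig {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoint every 60 s; a checkpoint taking longer than 10 minutes expires.
            env.enableCheckpointing(60_000L);
            CheckpointConfig checkpointConfig = env.getCheckpointConfig();
            checkpointConfig.setCheckpointTimeout(600_000L);

            // Fail the job on the first counted checkpoint failure (0 tolerated failures).
            // On Flink 1.9/1.10, expired checkpoints were not counted against this
            // threshold, which is why setting it alone did not help in this thread.
            checkpointConfig.setTolerableCheckpointFailureNumber(0);

            env.fromElements(1, 2, 3).print();  // placeholder pipeline so the sketch runs
            env.execute("checkpoint-failure-sketch");
        }
    }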

Re: Making job fail on Checkpoint Expired?

2020-04-03 Thread Robin Cassan
…job to fail when checkpoint expired? Best, Congxian. Timo Walther wrote (Thu, Apr 2, 2020, 11:23 PM): Hi Robin, this is a very good observation and maybe even unintended behavior. Maybe Arvid in CC is more familiar with the checkpointing? …

Re: Making job fail on Checkpoint Expired?

2020-04-09 Thread Robin Cassan
…affect step 3; [1] https://issues.apache.org/jira/browse/FLINK-17043 Best, Congxian. Robin Cassan wrote (Fri, Apr 3, 2020, 8:35 PM): Hi Congxian, thanks for confirming! The reason I want this behavior is because we are currently investigating…

Quick survey on checkpointing performance

2020-04-15 Thread Robin Cassan
Hi all, We are currently experiencing long checkpointing times on S3 and are wondering how abnormal it is compared to other loads and setups. Could some of you share a few stats about your running architecture so we can compare? Here are our stats: Architecture: 28 TM on Kubernetes, 4 slots per TM…

Protection against huge values in RocksDB List State

2020-05-14 Thread Robin Cassan
Hi all! I cannot seem to find any setting to limit the number of records appended in a RocksDBListState that is used when we use SessionWindows with a ProcessFunction. It seems that, for each incoming element, the new element will be appended to the value with the RocksDB `merge` operator, without…
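
A sketch of the pattern in question, using a recent DataStream API. The Event type, field names, and session gap are illustrative only. With window(...).process(...), every element of a session is buffered in the RocksDB ListState until the window fires; a common mitigation (not necessarily what the replies suggested) is incremental aggregation with aggregate(), so only a small accumulator is stored per key and session:

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.functions.AggregateFunction;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class SessionWindowStateSketch {

        // Hypothetical event type, used only for illustration.
        public static class Event {
            public String userId = "";
            public long timestamp;
            public long value;

            public Event() {}

            public Event(String userId, long timestamp, long value) {
                this.userId = userId;
                this.timestamp = timestamp;
                this.value = value;
            }
        }

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            DataStream<Event> events = env
                    .fromElements(new Event("a", 1_000L, 2), new Event("a", 2_000L, 3))
                    .assignTimestampsAndWatermarks(
                            WatermarkStrategy.<Event>forMonotonousTimestamps()
                                    .withTimestampAssigner((e, ts) -> e.timestamp));

            // Incremental aggregation: only the Long accumulator is kept per key/session,
            // regardless of how many events the session contains.
            events
                    .keyBy(e -> e.userId)
                    .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
                    .aggregate(new AggregateFunction<Event, Long, Long>() {
                        @Override public Long createAccumulator() { return 0L; }
                        @Override public Long add(Event e, Long acc) { return acc + e.value; }
                        @Override public Long getResult(Long acc) { return acc; }
                        @Override public Long merge(Long a, Long b) { return a + b; }
                    })
                    .print();

            env.execute("session-window-state-sketch");
        }
    }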

Re: Protection against huge values in RocksDB List State

2020-05-15 Thread Robin Cassan
…040/java/src/main/java/org/rocksdb/RocksDB.java#L1382 Best, Yun Tang. Robin Cassan wrote (Friday, May 15, 2020, 0:37, to user, subject: Protection against huge values in RocksDB List State): Hi all! …

Re: Protection against huge values in RocksDB List State

2020-05-19 Thread Robin Cassan
…you do not have many key-by keys, operator state is a good choice as that is on-heap and lightweight. Best, Yun Tang. Robin Cassan wrote (Friday, May 15, 2020, 20:59, to Yun Tang): …

[DISCUSSION] Using `DataStreamUtils.reinterpretAsKeyedStream` on a pre-partitioned Kafka topic

2019-11-29 Thread Robin Cassan
…could pre-partition our data in Kafka so that Flink can avoid shuffling data before creating the Session Windows? Cheers, Robin
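
For context, a hedged sketch of the API the thread title refers to. DataStreamUtils.reinterpretAsKeyedStream (an experimental API) tells Flink to treat an already-partitioned stream as keyed without a network shuffle; it is only correct if the upstream partitioning matches Flink's own key-group assignment, which is exactly the difficulty discussed here. The String record type and the CSV key extraction are placeholders.

    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.DataStreamUtils;
    import org.apache.flink.streaming.api.datastream.KeyedStream;

    public class ReinterpretSketch {

        // Treats a stream that is assumed to be pre-partitioned by key (for example a
        // Kafka topic written with the same partitioning Flink would apply) as a
        // KeyedStream, skipping the keyBy() shuffle. Correctness depends entirely on
        // the upstream partitioning matching Flink's key-group assignment.
        public static KeyedStream<String, String> reinterpret(DataStream<String> prePartitioned) {
            return DataStreamUtils.reinterpretAsKeyedStream(
                    prePartitioned,
                    new KeySelector<String, String>() {
                        @Override
                        public String getKey(String value) {
                            return value.split(",")[0];  // key = first CSV column (illustrative)
                        }
                    });
        }
    }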

Re: [DISCUSSION] Using `DataStreamUtils.reinterpretAsKeyedStream` on a pre-partitioned Kafka topic

2019-12-03 Thread Robin Cassan
…am/experimental.html [2] https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/runtime/state/KeyGroupRangeAssignment.html Best, Congxian. Robin Cassan wrote (Sat, Nov 30, 2019, 12:17 AM): Hi all! …
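
The reply points to KeyGroupRangeAssignment, the utility that defines how Flink maps a key to a key group and then to a subtask. A small sketch of how one could inspect that mapping; the max parallelism, parallelism, and key are placeholder values, and the idea of mirroring this scheme on the Kafka producer side is an assumption of this illustration rather than something spelled out in the snippet.

    import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

    public class KeyGroupSketch {
        public static void main(String[] args) {
            int maxParallelism = 128; // Flink's default max parallelism for small jobs
            int parallelism = 4;      // parallelism of the downstream keyed operator
            String key = "user-42";   // hypothetical key

            // The key group and subtask index Flink itself would assign to this key.
            // A producer that wants Flink to skip the shuffle would have to write the
            // record to a partition that is read by exactly this subtask.
            int keyGroup = KeyGroupRangeAssignment.assignToKeyGroup(key, maxParallelism);
            int subtask = KeyGroupRangeAssignment.assignKeyToParallelOperator(key, maxParallelism, parallelism);

            System.out.println("key group = " + keyGroup + ", subtask index = " + subtask);
        }
    }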

Cleaning old incremental checkpoint files

2021-07-29 Thread Robin Cassan
Hi all! We've happily been running a Flink job in production for a year now, with the RocksDB state backend and incremental retained checkpointing on S3. We often release new versions of our jobs, which means we cancel the running one and submit another while restoring the previous jobId's last retained checkpoint…
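
A minimal sketch of the setup this thread describes, using the API of that era (around Flink 1.13); the S3 path, checkpoint interval, and pipeline are placeholders. The comment captures why old incremental checkpoint files are hard to clean up, which is the question being asked.

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RetainedIncrementalCheckpoints {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // RocksDB state backend with incremental checkpoints (second constructor argument).
            env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/checkpoints", true));

            env.enableCheckpointing(15 * 60 * 1000L);
            // Keep the last completed checkpoint when the job is cancelled so the next
            // submission can restore from it. With incremental checkpoints, that retained
            // checkpoint may still reference SST files created by much older checkpoints,
            // which is why deleting old checkpoint directories by hand is risky.
            env.getCheckpointConfig().enableExternalizedCheckpoints(
                    ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

            env.fromElements(1, 2, 3).print();  // placeholder pipeline
            env.execute("retained-incremental-checkpoints-sketch");
        }
    }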

Re: Cleaning old incremental checkpoint files

2021-09-01 Thread Robin Cassan
…still in use. I know this is only a partial answer to your question. I'll try to find out more details and extend my answer later. On Thu, Jul 29, 2021 at 2:31 PM Robin Cassan <robin.cas...@contentsquare.com> wrote: Hi all! …

Re: Cleaning old incremental checkpoint files

2021-09-07 Thread Robin Cassan
…would let the new job get decoupled from older checkpoints. Do you think that could resolve your case? Best, Yun Tang. Robin Cassan wrote (Wednesday, September 1, 2021, 17:38, to Robert Metzger, cc user): …

Applying backpressure to limit state memory consumption

2022-05-19 Thread Robin Cassan
Hey all! I have a conceptual question on the DataStream API: when using an in-memory state backend (like the HashMapStateBackend), how can you ensure that the hashmap won't grow uncontrollably until OutOfMemory happens? In my case, I would be consuming from a Kafka topic, into a SessionWindow. The…

Re: Applying backpressure to limit state memory consumption

2022-05-23 Thread Robin Cassan
…objects on the Java heap; however, state size is limited by available memory within the cluster." If the size of your window state is really huge, you should choose another state backend. Hope my reply helps you. Best, Yuan …

Upgrading a job in rolling mode

2022-06-29 Thread Robin Cassan
Hi all! We are running a Flink cluster on Kubernetes and deploying a single job on it through "flink run". Whenever we want to modify the jar, we cancel the job and run the "flink run" command again, with the new jar and the retained checkpoint URL from the first run. This works well, but this adds…

Re: Upgrading a job in rolling mode

2022-06-29 Thread Robin Cassan
…coordinator, failover handling, and operator coordinator. But, IMO, Job RollingUpgrade is a worthwhile thing to do. Best, Weihua. On Wed, Jun 29, 2022 at 3:52 PM Robin Cassan <robin.cas...@contentsquare.com> wrote: Hi all! We are running…

Ignoring state's offset when restoring checkpoints

2022-07-08 Thread Robin Cassan via user
Hello all! We have a need where, for specific recovery cases, we would need to manually reset our Flink Kafka consumer offset to a given date but have the Flink job restore its state. As I understand, restoring from a checkpoint necessarily sets the group's offset to the one that was in the checkpoint…

Re: Ignoring state's offset when restoring checkpoints

2022-07-11 Thread Robin Cassan via user
Thanks a lot Alexander and Tzu-Li for your answers, this helps a lot! Cheers, Robin. On Fri, Jul 8, 2022 at 17:40, Tzu-Li (Gordon) Tai wrote: Hi Robin, apart from what Alexander suggested, I think you could also try the following first: let the job use a "new" Kafka source, which you…
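
The quoted reply is truncated. One common pattern that matches its opening, sketched here purely as an assumption and not as Gordon's full answer, is to give the Kafka source a new operator uid so the restored checkpoint carries no offset state for it, start it from an explicit timestamp, and restore with --allowNonRestoredState. The broker address, topic, group id, uid, and timestamp below are placeholders.

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ResetOffsetsKeepState {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            KafkaSource<String> source = KafkaSource.<String>builder()
                    .setBootstrapServers("kafka:9092")
                    .setTopics("events")
                    .setGroupId("my-consumer-group")
                    // Start from a wall-clock timestamp instead of the offsets stored in state.
                    .setStartingOffsets(OffsetsInitializer.timestamp(1_657_000_000_000L))
                    .setValueOnlyDeserializer(new SimpleStringSchema())
                    .build();

            DataStream<String> stream = env
                    .fromSource(source, WatermarkStrategy.noWatermarks(), "events-source")
                    // A *new* uid means the restored checkpoint has no state for this operator,
                    // so its old offsets are ignored (restore with --allowNonRestoredState).
                    .uid("kafka-source-v2");

            stream.print();
            env.execute("reset-offsets-keep-state");
        }
    }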

New licensing for Akka

2022-09-07 Thread Robin Cassan via user
Hi all! It seems Akka has announced a licensing change: https://www.lightbend.com/blog/why-we-are-changing-the-license-for-akka If I understand correctly, this could end up increasing costs a lot for companies using Flink in production. Do you know if the Flink developers have any initial reaction…

Re: New licensing for Akka

2022-09-07 Thread Robin Cassan via user
Thanks a lot for your answers, this is reassuring! Cheers. On Wed, Sep 7, 2022 at 13:12, Chesnay Schepler wrote: Just to squash concerns, we will make sure this license change will not affect Flink users in any way. On 07/09/2022 11:14, Robin Cassan via user wrote: …

Limiting backpressure during checkpoints

2022-10-13 Thread Robin Cassan via user
Hello all, hope you're well :) We are attempting to build a Flink job with minimal and stable latency (as much as possible) that consumes data from Kafka. Currently our main limitation happens when our job checkpoints the RocksDB state: backpressure is applied on the stream, causing latency. I am wondering…

Re: Limiting backpressure during checkpoints

2022-10-24 Thread Robin Cassan via user
…In the long term, I think we probably need to separate the compaction process from the internal db and control/schedule the compaction process ourselves (compaction takes a good amount of CPU and reduces TPS). Best, Yuan. On Thu, Oct 13, 2022 at 11…

Reduce checkpoint-induced latency by tuning the amount of resource available for checkpoints

2022-12-15 Thread Robin Cassan via user
Hello all! We are trying to bring our Flink job closer to real-time processing and currently our main issue is latency that happens during checkpoints. Our job uses RocksDB with periodic checkpoints, which are a few hundred GBs every 15 minutes. We are trying to reduce the checkpointing duration by…
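
A hedged sketch of standard knobs for this kind of setup, in the spirit of the incremental-checkpointing and configuration docs linked in the replies. The keys are regular Flink options; the values are placeholders, and whether they actually reduce the checkpoint-induced latency of this particular job is exactly what the thread is probing.

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointTuningSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Incremental RocksDB checkpoints: upload only newly created SST files
            // instead of the full state on every checkpoint.
            conf.setString("state.backend", "rocksdb");
            conf.setBoolean("state.backend.incremental", true);

            // More threads per task for transferring checkpoint files to object storage.
            conf.setInteger("state.backend.rocksdb.checkpoint.transfer.thread.num", 8);

            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
            env.enableCheckpointing(15 * 60 * 1000L);

            env.fromElements(1, 2, 3).print();  // placeholder pipeline
            env.execute("checkpoint-tuning-sketch");
        }
    }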

Re: Reduce checkpoint-induced latency by tuning the amount of resource available for checkpoints

2022-12-19 Thread Robin Cassan via user
…https://issues.apache.org/jira/browse/FLINK-28699 [2] https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/datastream/fault-tolerance/checkpointing/#state-backend-incremental [3] https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/config…

Re: Reduce checkpoint-induced latency by tuning the amount of resource available for checkpoints

2022-12-19 Thread Robin Cassan via user
…help to make the incremental checkpoint size small and stable, which could make the CPU more stable. [1] https://issues.apache.org/jira/browse/FLINK-28699 [2] https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/datastream/fault…

Kubernetes operator: config for taskmanager.memory.process.size ignored

2023-06-14 Thread Robin Cassan via user
Hello all! I am using the Flink Kubernetes operator and I would like to set the value for `taskmanager.memory.process.size`. I set the desired value in the FlinkDeployment resource specs (here, I want 55gb), however it looks like the value that is effectively passed to the taskmanager is the same…

Re: Kubernetes operator: config for taskmanager.memory.process.size ignored

2023-06-14 Thread Robin Cassan via user
…So basically the spec is a config shorthand, there is no reason to override it as you won't get a different behaviour at the end of the day. Gyula. On Wed, Jun 14, 2023 at 11:55 AM Robin Cassan via user <user@flink.apache.org> wrote: Hello all! …

Re: Kubernetes operator: config for taskmanager.memory.process.size ignored

2023-06-14 Thread Robin Cassan via user
…fraction. Gyula. On Wed, Jun 14, 2023 at 2:46 PM Robin Cassan <robin.cas...@contentsquare.com> wrote: Thanks Gyula for your answer! I'm wondering about your claim: "In Flink Kubernetes the process is the pod so pod memory is always…