Re: How to read a savepoint fast without exploding the memory

2025-02-12 Thread Jean-Marc Paulin
ith relatively small states. >>> RocksDB commenting out the remove() is a bit surprisingly high but since >>> it's causing correctness issues under some circumstances I would abandon >>> that. >>> >>> > I probably need more time to ensure we definite

Re: How to read a savepoint fast without exploding the memory

2025-02-12 Thread Jean-Marc Paulin
> > I probably need more time to ensure we definitely see all the keys, but > that looks very good. > > Yeah, correctness is key here so waiting on your results. > > BR, > G > > > On Wed, Feb 12, 2025 at 11:56 AM Jean-Marc Paulin wrote: > >> Hi Gabor, >

Re: How to read a savepoint fast without exploding the memory

2025-02-12 Thread Jean-Marc Paulin
: > Yeah, I think this is more like the 1.x and 2.x incompatibility. > I've just opened the PR against 1.20 which you can cherry-pick here [1]. > > [1] https://github.com/apache/flink/pull/26145 > > BR, > G > > > On Tue, Feb 11, 2025 at 4:19 PM Jean-Marc Paulin wrote

Re: How to read a savepoint fast without exploding the memory

2025-02-12 Thread Jean-Marc Paulin
: > Yeah, I think this is more like the 1.x and 2.x incompatibility. > I've just opened the PR against 1.20 which you can cherry-pick here [1]. > > [1] https://github.com/apache/flink/pull/26145 > > BR, > G > > > On Tue, Feb 11, 2025 at 4:19 PM Jean-Marc Paulin wr

Re: How to read a savepoint fast without exploding the memory

2025-02-11 Thread Jean-Marc Paulin
xecution time: 25M27.126737S >>> - New execution time: 1M19.602042S >>> In short: ~95% performance gain. >>> >>> G >>> >>> >>> On Thu, Feb 6, 2025 at 9:06 AM Gabor Somogyi >>> wrote: >>> >>>> In short, when yo

Re: How to read a savepoint fast without exploding the memory

2025-02-11 Thread Jean-Marc Paulin
:06 AM Gabor Somogyi >> wrote: >> >>> In short, when you don't care about >>> multiple KeyedStateReaderFunction.readKey calls then you're on the safe >>> side. >>> >>> G >>> >>> On Wed, Feb 5, 2025 at 6:27 P

Re: How to read a savepoint fast without exploding the memory

2025-02-05 Thread Jean-Marc Paulin
s in this area. It's still a question whether the >>> tested patch is the right approach >>> but at least we've touched the root cause. >>> >>> The next step on my side is to have a deep dive and understand all the >>> aspects why remove is t

Re: How to read a savepoint fast without exploding the memory

2025-02-05 Thread Jean-Marc Paulin
>> 2025-02-04 13:39:14,704 INFO o.a.f.e.s.r.FlinkTestStateReaderJob >> [] - Execution time: PT6.930659S >> >> Don't need to mention that the bigger is the processor API. >> >> G >> >> >> On Tue, Feb 4, 2025 at 1:40 PM Jean-Marc

Re: How to read a savepoint fast without exploding the memory

2025-02-04 Thread Jean-Marc Paulin
on a custom written operator, which opens the state > info for you. > > The only drawback is that you must know the keyBy range... this can be > problematic but if you can do it it's a win :) > > G > > > On Tue, Feb 4, 2025 at 12:16 PM Jean-Marc Paulin wrote: > >

Re: How to read a savepoint fast without exploding the memory

2025-02-04 Thread Jean-Marc Paulin
h > and intended to eliminate the slowness. Since we don't know what exactly > causes the slowness the new Frocksdb-8.10.0 can also be an improvement. > > All in all it will take some time to sort this out. > > [1] https://issues.apache.org/jira/browse/FLINK-37109 > >

How to read a savepoint fast without exploding the memory

2025-02-04 Thread Jean-Marc Paulin
What would be the best approach to read a savepoint and minimise the memory consumption? We just need to transform it into something else for investigation. Our Flink 1.20 streaming job is using the HashMap backend, and is spread over 6 task slots in 6 pods (under k8s). Savepoints are saved on S3. A s

Re: Q: How to best configure checkpoint to ensure they do not fill-up the storage?

2025-01-03 Thread Jean-Marc Paulin
> latest checkpoint instead of a savepoint (even though it is the latest), > and all the checkpoints will be properly cleaned up. Did you trigger the > savepoint periodically? And did you clean that up manually? > > > Best, > Zakelly > > On Thu, Jan 2, 2025 at 3:50 PM Jean-Ma

Re: Q: How to best configure checkpoint to ensure they do not fill-up the storage?

2025-01-01 Thread Jean-Marc Paulin
-application/checkpoints increasing? And have you set the state >> TTL? >> >> >> Best, >> Zakelly >> >> On Tue, Dec 31, 2024 at 7:58 PM Jean-Marc Paulin >> wrote: >> >>> Hi, >>> >>> We are on Flink 1.20/Java17 running in a

Q: How to best configure checkpoint to ensure they do not fill-up the storage?

2024-12-31 Thread Jean-Marc Paulin
Hi, We are on Flink 1.20/Java17 running in a k8s environment, with checkpoints enabled on S3 and the following checkpoint options: execution.checkpointing.dir: s3://flink-application/checkpoints execution.checkpointing.externalized-checkpoint-retention: DELETE_ON_CANCELLATION executio

How to programmatically register a 3rd party serializer (Flink 1.20)

2024-12-05 Thread Jean-Marc Paulin
I am trying to use 3rd party serializers as per https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/fault-tolerance/serialization/third_party_serializers/ but the code sample does not compile ``` Configuration config = new Configuration(); // register the class of the se

Meaning behind the pekko.framesize configuration property and how to reduce its size

2024-10-03 Thread Jean-Marc Paulin
Hi, We are running a Flink 1.19 application under heavy load, and our job manager failed to restart with exceptions like "Transient association error (association remains live)org.apache.pekko.remote.OversizedPayloadException: Discarding oversized payload sent to Actor[pekko.ssl.tcp://fl...@a

Error class RestoreMode not found after upgrading from Flink 1.19 to 1.20

2024-08-23 Thread Jean-Marc Paulin
Hi, We did further testing and verification on this and it seems to only be an issue in HA with ZooKeeper. In that scenario I believe we resume from a checkpoint. I suppose the solution would be to cancel the job based on Flink 1.19 and then resubmit it with Flink 1.20. Yes, we recompiled our app

Error class RestoreMode not found after upgrading from Flink 1.19 to 1.20

2024-08-22 Thread Jean-Marc Paulin
Hi, we have an error when Flink 1.20 resumes a job from a savepoint that was written with Flink 1.19 (we are in an upgrade scenario here). We hit the error `Caused by: java.lang.ClassNotFoundException: org.apache.flink.runtime.jobgraph.RestoreMode`, see below. So I get that `org.apache.flink.

Re: Failed to resume from HA when the checkpoint has been deleted.

2024-06-11 Thread Jean-Marc Paulin
scenario. But maybe there isn't any. Best regards JM From: Zhanghao Chen Sent: Tuesday, June 11, 2024 03:56 To: Jean-Marc Paulin ; user@flink.apache.org Subject: [EXTERNAL] Re: Failed to resume from HA when the checkpoint has been deleted. Hi, In this case

Failed to resume from HA when the checkpoint has been deleted.

2024-06-10 Thread Jean-Marc Paulin
Hi, We have a Flink 1.19 streaming job, with HA enabled (ZooKeeper) and checkpoints/savepoints in S3. We had an outage and now the jobmanager keeps restarting. We think it is because it reads the job ID to be restarted from ZooKeeper, but because we lost our S3 storage as part of the outage it cannot f

Saw a java.lang.ClassNotFoundException: com.facebook.presto.hive.s3.PrestoS3FileSystem$UnrecoverableS3OperationException

2024-05-09 Thread Jean-Marc Paulin
Hi, We use S3 as our datastore for checkpoint/savepoints, and following an S3 error we saw that exception: ``` java.io.IOException: GET operation failed: Could not transfer error message at org.apache.flink.runtime.blob.BlobClient.getInternal(BlobClient.java:231) at org.apache.

RE: Flink 1.18: Unable to resume from a savepoint with error InvalidPidMappingException

2024-04-23 Thread Jean-Marc Paulin
___ From: Yanfei Lei Sent: Monday, April 22, 2024 03:28 To: Jean-Marc Paulin Cc: user@flink.apache.org Subject: [EXTERNAL] Re: Flink 1.18: Unable to resume from a savepoint with error InvalidPidMappingException Hi JM, Yes, `InvalidPidMappingException` occurs because the tra

Flink 1.18: Unable to resume from a savepoint with error InvalidPidMappingException

2024-04-19 Thread Jean-Marc Paulin
Hi, we use Flink 1.18 with the Kafka sink, and we enabled `EXACTLY_ONCE` on one of our Kafka sinks. We set the transaction timeout to 15 minutes. When we try to restore from a savepoint well after that 15-minute window, Flink enters a RESTARTING loop. We see the error: ``` { "exception": {

Q: Not all the task slots are used. Are we missing a setting somewhere?

2024-02-23 Thread Jean-Marc Paulin
Hi, We used to run with 3 task managers with numberOfTaskSlots = 2, so altogether we had 6 task slots and our application used them all. Trying to increase throughput, we increased the number of task managers to 6, so now we have 12 task slots altogether. However, our application still only

Is the kafka-connector doc missing a dependency on flink-connector-base

2023-12-04 Thread Jean-Marc Paulin
Hi, Trying to update the Kafka connector in my project and I am missing a class. Is the doc missing a dependency on flink-connector-base? <dependency> <groupId>org.apache.flink</groupId> <artifactId>flink-connector-base</artifactId> <scope>compile</scope> </dependency> I added it and it works. I think that's required but I would have expected this i