Flink Kubernetes Operator Scale Issue

2023-04-27 Thread Talat Uyarer via user
Hi All,

We are using the Flink Kubernetes Operator in production with 3k+ jobs in
standalone mode. After about 2.5k jobs the operator becomes slow: when we
submit a job now, it takes 10+ minutes before the job starts running. Does
anyone run at a similar or larger scale?

Now we run the operator as a single pod. Does the operator support multiple
pods if I increase the replicas?

Do you have any suggestions on where I should start looking to debug this?

Thanks


Re: Flink Kubernetes Operator Scale Issue

2023-04-27 Thread Gyula Fóra
Hi!

It’s currently not possible to run the operator in parallel by simply
adding more replicas. However, there are a few things you can do to scale
both vertically and horizontally.

First of all, you can run multiple operators, each watching a different set
of namespaces, to partition the load.

The operator also supports watching CRs with a certain label selector, which
would allow you to partition the load horizontally with custom CR labels if
necessary.

You can also try increasing the reconciler parallelism of the operator to
use more threads and reconcile more CRs in parallel. If you increase this,
you might need to increase the heap size as well.

Let me know if this helps!

Gyula

On Thu, 27 Apr 2023 at 09:15, Talat Uyarer via user 
wrote:

> Hi All,
>
> We are using the Flink Kubernetes Operator in production with 3k+ jobs in
> standalone mode. After about 2.5k jobs the operator becomes slow: when we
> submit a job now, it takes 10+ minutes before the job starts running. Does
> anyone run at a similar or larger scale?
>
> Now we run the operator as a single pod. Does the operator support multiple
> pods if I increase the replicas?
>
> Do you have any suggestions on where I should start looking to debug this?
>
> Thanks
>


Re: [Discussion] - Release major Flink version to support JDK 17 (LTS)

2023-04-27 Thread Jing Ge via user
Thanks Tamir for the information. According to the latest comment on the
FLINK-24998 ticket, this bug should be gone when using the latest JDK 17. I
was wondering whether that means there are no more issues blocking us from
releasing a major Flink version with Java 17 support. Did I miss something?

Best regards,
Jing

On Thu, Apr 27, 2023 at 8:18 AM Tamir Sagi 
wrote:

> More details about the JDK bug here
> https://bugs.openjdk.org/browse/JDK-8277529
>
> Related Jira ticket
> https://issues.apache.org/jira/browse/FLINK-24998
>
> --
> *From:* Jing Ge via user 
> *Sent:* Monday, April 24, 2023 11:15 PM
> *To:* Chesnay Schepler 
> *Cc:* Piotr Nowojski ; Alexis Sarda-Espinosa <
> sarda.espin...@gmail.com>; Martijn Visser ;
> d...@flink.apache.org ; user 
> *Subject:* Re: [Discussion] - Release major Flink version to support JDK
> 17 (LTS)
>
>
> *EXTERNAL EMAIL*
>
>
> Thanks Chesnay for working on this. Would you like to share more info
> about the JDK bug?
>
> Best regards,
> Jing
>
> On Mon, Apr 24, 2023 at 11:39 AM Chesnay Schepler 
> wrote:
>
> As it turns out Kryo isn't a blocker; we ran into a JDK bug.
>
> On 31/03/2023 08:57, Chesnay Schepler wrote:
>
>
> https://github.com/EsotericSoftware/kryo/wiki/Migration-to-v5#migration-guide
>
> Kryo themselves state that v5 likely can't read v2 data.
>
> However, both versions can be on the classpath without conflicts, as v5
> offers a versioned artifact that includes the version in the package.
>
> It probably wouldn't be difficult to migrate a savepoint to Kryo v5,
> purely from a read/write perspective.
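>
> Purely as an illustration (not something that exists in Flink today), such
> a migration step could read each serialized value with Kryo v2 and
> immediately re-write it with Kryo v5 from the versioned artifact. The
> sketch below uses made-up class names, assumes the relocated
> com.esotericsoftware.kryo.kryo5 package of the kryo5 jar, and glosses over
> Flink's own registrations and serializer configuration, which a real
> migration would have to replicate:
>
> import com.esotericsoftware.kryo.Kryo;            // Kryo v2, as bundled by Flink
> import com.esotericsoftware.kryo.io.Input;        // v2 I/O
> import com.esotericsoftware.kryo.kryo5.io.Output; // v5 I/O from the relocated kryo5 artifact
>
> public class KryoV2ToV5Migrator {
>
>     /** Reads one object serialized by Kryo v2 and re-serializes it with Kryo v5. */
>     public static byte[] migrate(byte[] v2Bytes) {
>         Kryo kryoV2 = new Kryo();
>         kryoV2.setRegistrationRequired(false);
>         Object value = kryoV2.readClassAndObject(new Input(v2Bytes));
>
>         com.esotericsoftware.kryo.kryo5.Kryo kryoV5 =
>                 new com.esotericsoftware.kryo.kryo5.Kryo();
>         kryoV5.setRegistrationRequired(false);
>         try (Output out = new Output(4096, -1)) {
>             kryoV5.writeClassAndObject(out, value);
>             return out.toBytes();
>         }
>     }
> }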
>
> The bigger question is how we expose this new Kryo version in the API. If
> we stick to the versioned jar we need to either duplicate all current
> Kryo-related APIs or find a better way to integrate other serialization
> stacks.
> On 30/03/2023 17:50, Piotr Nowojski wrote:
>
> Hey,
>
> > 1. The Flink community agrees that we upgrade Kryo to a later version,
> which means breaking all checkpoint/savepoint compatibility and releasing a
> Flink 2.0 with Java 17 support added and Java 8 and Flink Scala API support
> dropped. This is probably the quickest way, but would still mean that we
> expose Kryo in the Flink APIs, which is the main reason why we haven't been
> able to upgrade Kryo at all.
>
> This sounds pretty bad to me.
>
> Has anyone looked into what it would take to provide a smooth migration
> from Kryo2 -> Kryo5?
>
> Best,
> Piotrek
>
> On Thu, Mar 30, 2023 at 16:54, Alexis Sarda-Espinosa 
> wrote:
>
> Hi Martijn,
>
> just to be sure, if all state-related classes use a POJO serializer, Kryo
> will never come into play, right? Given FLINK-16686 [1], I wonder how many
> users actually have jobs with Kryo and RocksDB, but even if there aren't
> many, that still leaves those who don't use RocksDB for
> checkpoints/savepoints.
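>
> To make that check concrete: with generic types disabled, a job fails at
> submission time as soon as any type would fall back to the Kryo-based
> generic serializer, so you can verify that only the POJO serializer is in
> play. A minimal sketch, with made-up type names, purely for illustration:
>
> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>
> public class PojoOnlyJob {
>
>     // A valid Flink POJO: public class, public no-arg constructor,
>     // public fields (or getters/setters) of supported types.
>     public static class SensorReading {
>         public String sensorId;
>         public long timestamp;
>         public double value;
>
>         public SensorReading() {}
>     }
>
>     public static void main(String[] args) throws Exception {
>         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>
>         // Fail at submission time if any type would fall back to the
>         // Kryo-based generic serializer.
>         env.getConfig().disableGenericTypes();
>
>         env.fromElements(new SensorReading())
>                 .keyBy(r -> r.sensorId)
>                 .print();
>
>         env.execute("pojo-only-sketch");
>     }
> }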
>
> If Kryo were to stay in the Flink APIs in v1.X, is it impossible to let
> users choose between v2/v5 jars by separating them like log4j2 jars?
>
> [1] https://issues.apache.org/jira/browse/FLINK-16686
>
> Regards,
> Alexis.
>
> On Thu, Mar 30, 2023 at 14:26, Martijn Visser <
> martijnvis...@apache.org> wrote:
>
> Hi all,
>
> I also saw a thread on this topic from Clayton Wohl [1] on this topic,
> which I'm including in this discussion thread to avoid that it gets lost.
>
> From my perspective, there's two main ways to get to Java 17:
>
> 1. The Flink community agrees that we upgrade Kryo to a later version,
> which means breaking all checkpoint/savepoint compatibility and releasing a
> Flink 2.0 with Java 17 support added and Java 8 and Flink Scala API support
> dropped. This is probably the quickest way, but would still mean that we
> expose Kryo in the Flink APIs, which is the main reason why we haven't been
> able to upgrade Kryo at all.
> 2. There's a contributor who makes a contribution that bumps Kryo, but
> either a) automagically reads in all old checkpoints/savepoints in using
> Kryo v2 and writes them to new snapshots using Kryo v5 (like is mentioned
> in the Kryo migration guide [2][3] or b) provides an offline tool that
> allows users that are interested in migrating their snapshots manually
> before starting from a newer version. That potentially could prevent the
> need to introduce a new Flink major version. In both scenarios, ideally the
> contributor would also help with avoiding the exposure of Kryo so that we
> will be in a better shape in the future.
>
> It would be good to get the opinion of the community for either of these
> two options, or potentially for another one that I haven't mentioned. If it
> appears that there's an overall agreement on the direction, I would propose
> that a FLIP gets created which describes the entire process.
>
> Looking forward to the thoughts of others, including the Users (therefore
> including the User ML).
>
> Best regards,
>
> Martijn
>
> [1]  https://lists.apache.org/thread/qcw8wy9dv8szxx9bh49nz7jnth22p1v2
> [2] https://lists.apache.org/thr

Re: Flink SQL State

2023-04-27 Thread Yaroslav Tkachenko
Hi Giannis,

I'm curious, what tool did you use for this analysis (what the screenshot
shows)? Is it something custom?

Thank you.

On Wed, Apr 26, 2023 at 10:38 PM Giannis Polyzos 
wrote:

> This is really helpful,
>
> Thanks
>
> On Thu, Apr 27, 2023 at 5:46 AM Yanfei Lei  wrote:
>
>> Hi Giannis,
>>
>> Except for the “default” Column Family (CF), all other CFs represent state
>> in the RocksDB state backend; the name of a CF is the name of a
>> StateDescriptor.
>>
>> - deduplicate-state is a value state, you can find it in
>> DeduplicateFunctionBase.java and
>> MiniBatchDeduplicateFunctionBase.java, they are used for
>> deduplication.
>> - _timer_state/event_user-timers, _timer_state/event_timers ,
>> _timer_state/processing_timers and _timer_state/processing_user-timers
>>  are created by internal time service, which can be found in
>> InternalTimeServiceManagerImpl.java. Here is a blog post[1] on best
>> practices for using timers.
>> - timer, next-index, left and right can be found in
>> TemporalRowTimeJoinOperator.java, TemporalRowTimeJoinOperator
>> implements the logic of temporal join, this post[2] might be helpful
>> in understanding what happened to temporal join.
>>
>> [1]
>> https://www.alibabacloud.com/help/en/realtime-compute-for-apache-flink/latest/datastream-timer
>> [2]
>> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/joins/#temporal-joins
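>>
>> To make the name mapping concrete, here is a rough DataStream-style sketch
>> (not the actual SQL operator code) showing how a StateDescriptor name and a
>> registered timer end up as RocksDB column families:
>>
>> import org.apache.flink.api.common.state.ValueState;
>> import org.apache.flink.api.common.state.ValueStateDescriptor;
>> import org.apache.flink.api.common.typeinfo.Types;
>> import org.apache.flink.configuration.Configuration;
>> import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
>> import org.apache.flink.util.Collector;
>>
>> public class DedupSketch extends KeyedProcessFunction<String, String, String> {
>>
>>     private transient ValueState<Boolean> seen;
>>
>>     @Override
>>     public void open(Configuration parameters) {
>>         // The descriptor name ("deduplicate-state") becomes the RocksDB
>>         // column family name when the RocksDB state backend is used.
>>         seen = getRuntimeContext().getState(
>>                 new ValueStateDescriptor<>("deduplicate-state", Types.BOOLEAN));
>>     }
>>
>>     @Override
>>     public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
>>         if (seen.value() == null) {
>>             seen.update(true);
>>             out.collect(value);
>>             // Registering a timer is what fills the _timer_state/... column
>>             // families (assumes event-time timestamps are assigned upstream).
>>             ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 60_000L);
>>         }
>>     }
>>
>>     @Override
>>     public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
>>         seen.clear(); // drop the entry once the timer fires
>>     }
>> }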
>>
>> Giannis Polyzos  wrote on Wed, Apr 26, 2023 at 23:19:
>> >
>> > I have two input kafka topics - a compacted one (with upsert-kafka) and
>> a normal one.
>> > When I perform a temporal join I notice the following state being
>> created in rocksdb and was hoping someone could help me better understand
>> what everything means
>> >
>> >
>> > > deduplicate-state: does it refer to duplicate keys found by the
>> kafka-upsert-connector?
>> > > timers: what timer and _timer_state/event_timers refer to and whats
>> their difference? Is it to keep track on when the join results need to be
>> materialised or state to be expired?
>> > > next-index: what does it refer to?
>> > > left: also I'm curious why the left cf has 407 entries. Is it records
>> that are being buffered because there is no match on the right table?
>> >
>> > Thanks
>>
>>
>>
>> --
>> Best,
>> Yanfei
>>
>


Re: Flink SQL State

2023-04-27 Thread Giannis Polyzos
Correct, it's some custom code I put together to investigate what gets
written in RocksDB.
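
For anyone curious, the basic idea can be reproduced with a few lines against
the RocksDB Java API: list the column families of a local RocksDB directory
(e.g. an unpacked checkpoint or the backend's working directory) and count
the entries per column family. A simplified sketch with a made-up class name;
paths are illustrative, and column families created with a merge operator
(e.g. list state) may need matching ColumnFamilyOptions to open:

import org.rocksdb.ColumnFamilyDescriptor;
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.RocksIterator;

import java.util.ArrayList;
import java.util.List;

public class RocksDbStateInspector {

    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        String path = args[0]; // local RocksDB dir, e.g. an unpacked checkpoint

        // One column family per Flink StateDescriptor, plus "default".
        List<byte[]> cfNames = RocksDB.listColumnFamilies(new Options(), path);

        List<ColumnFamilyDescriptor> descriptors = new ArrayList<>();
        for (byte[] name : cfNames) {
            descriptors.add(new ColumnFamilyDescriptor(name, new ColumnFamilyOptions()));
        }

        List<ColumnFamilyHandle> handles = new ArrayList<>();
        try (DBOptions dbOptions = new DBOptions();
             RocksDB db = RocksDB.openReadOnly(dbOptions, path, descriptors, handles)) {
            for (int i = 0; i < handles.size(); i++) {
                long entries = 0;
                try (RocksIterator it = db.newIterator(handles.get(i))) {
                    for (it.seekToFirst(); it.isValid(); it.next()) {
                        entries++;
                    }
                }
                System.out.printf("%-50s %d entries%n", new String(cfNames.get(i)), entries);
            }
        }
    }
}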

On Thu, Apr 27, 2023 at 6:06 PM Yaroslav Tkachenko 
wrote:

> Hi Giannis,
>
> I'm curious, what tool did you use for this analysis (what the screenshot
> shows)? Is it something custom?
>
> Thank you.
>
> On Wed, Apr 26, 2023 at 10:38 PM Giannis Polyzos 
> wrote:
>
>> This is really helpful,
>>
>> Thanks
>>
>> On Thu, Apr 27, 2023 at 5:46 AM Yanfei Lei  wrote:
>>
>>> Hi Giannis,
>>>
>>> Except for the “default” Column Family (CF), all other CFs represent state
>>> in the RocksDB state backend; the name of a CF is the name of a
>>> StateDescriptor.
>>>
>>> - deduplicate-state is a value state, you can find it in
>>> DeduplicateFunctionBase.java and
>>> MiniBatchDeduplicateFunctionBase.java, they are used for
>>> deduplication.
>>> - _timer_state/event_user-timers, _timer_state/event_timers ,
>>> _timer_state/processing_timers and _timer_state/processing_user-timers
>>>  are created by internal time service, which can be found in
>>> InternalTimeServiceManagerImpl.java. Here is a blog post[1] on best
>>> practices for using timers.
>>> - timer, next-index, left and right can be found in
>>> TemporalRowTimeJoinOperator.java, TemporalRowTimeJoinOperator
>>> implements the logic of temporal join, this post[2] might be helpful
>>> in understanding what happened to temporal join.
>>>
>>> [1]
>>> https://www.alibabacloud.com/help/en/realtime-compute-for-apache-flink/latest/datastream-timer
>>> [2]
>>> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/joins/#temporal-joins
>>>
>>> Giannis Polyzos  wrote on Wed, Apr 26, 2023 at 23:19:
>>> >
>>> > I have two input kafka topics - a compacted one (with upsert-kafka)
>>> and a normal one.
>>> > When I perform a temporal join I notice the following state being
>>> created in rocksdb and was hoping someone could help me better understand
>>> what everything means
>>> >
>>> >
>>> > > deduplicate-state: does it refer to duplicate keys found by the
>>> kafka-upsert-connector?
>>> > > timers: what timer and _timer_state/event_timers refer to and whats
>>> their difference? Is it to keep track on when the join results need to be
>>> materialised or state to be expired?
>>> > > next-index: what does it refer to?
>>> > > left: also I'm curious why the left cf has 407 entries. Is it
>>> records that are being buffered because there is no match on the right
>>> table?
>>> >
>>> > Thanks
>>>
>>>
>>>
>>> --
>>> Best,
>>> Yanfei
>>>
>>


Re: Flink SQL State

2023-04-27 Thread Yaroslav Tkachenko
Got it! Any chance you can open-source some of that? I think it can be
extremely useful for the community.

Thank you.

On Thu, Apr 27, 2023 at 8:08 AM Giannis Polyzos 
wrote:

> Correct, it's some custom code I put together to investigate what gets
> written in RocksDB.
>
> On Thu, Apr 27, 2023 at 6:06 PM Yaroslav Tkachenko 
> wrote:
>
>> Hi Giannis,
>>
>> I'm curious, what tool did you use for this analysis (what the screenshot
>> shows)? Is it something custom?
>>
>> Thank you.
>>
>> On Wed, Apr 26, 2023 at 10:38 PM Giannis Polyzos 
>> wrote:
>>
>>> This is really helpful,
>>>
>>> Thanks
>>>
>>> On Thu, Apr 27, 2023 at 5:46 AM Yanfei Lei  wrote:
>>>
 Hi Giannis,

 Except for the “default” Column Family (CF), all other CFs represent state
 in the RocksDB state backend; the name of a CF is the name of a
 StateDescriptor.

 - deduplicate-state is a value state, you can find it in
 DeduplicateFunctionBase.java and
 MiniBatchDeduplicateFunctionBase.java, they are used for
 deduplication.
 - _timer_state/event_user-timers, _timer_state/event_timers ,
 _timer_state/processing_timers and _timer_state/processing_user-timers
  are created by internal time service, which can be found in
 InternalTimeServiceManagerImpl.java. Here is a blog post[1] on best
 practices for using timers.
 - timer, next-index, left and right can be found in
 TemporalRowTimeJoinOperator.java, TemporalRowTimeJoinOperator
 implements the logic of temporal join, this post[2] might be helpful
 in understanding what happened to temporal join.

 [1]
 https://www.alibabacloud.com/help/en/realtime-compute-for-apache-flink/latest/datastream-timer
 [2]
 https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/joins/#temporal-joins

 Giannis Polyzos  wrote on Wed, Apr 26, 2023 at 23:19:
 >
 > I have two input kafka topics - a compacted one (with upsert-kafka)
 and a normal one.
 > When I perform a temporal join I notice the following state being
 created in rocksdb and was hoping someone could help me better understand
 what everything means
 >
 >
 > > deduplicate-state: does it refer to duplicate keys found by the
 kafka-upsert-connector?
 > > timers: what timer and _timer_state/event_timers refer to and whats
 their difference? Is it to keep track on when the join results need to be
 materialised or state to be expired?
 > > next-index: what does it refer to?
 > > left: also I'm curious why the left cf has 407 entries. Is it
 records that are being buffered because there is no match on the right
 table?
 >
 > Thanks



 --
 Best,
 Yanfei

>>>


Re: Flink SQL State

2023-04-27 Thread Giannis Polyzos
Will definitely do, as it's going to be part of a wider Flink course / book
(haven't decided on the format yet) that I'm putting together,
but I can share it before that if you want.

On Thu, Apr 27, 2023 at 6:11 PM Yaroslav Tkachenko 
wrote:

> Got it! Any chance you can open-source some of that? I think it can be
> extremely useful for the community.
>
> Thank you.
>
> On Thu, Apr 27, 2023 at 8:08 AM Giannis Polyzos 
> wrote:
>
>> Correct, it's some custom code I put together to investigate what gets
>> written in RocksDB.
>>
>> On Thu, Apr 27, 2023 at 6:06 PM Yaroslav Tkachenko 
>> wrote:
>>
>>> Hi Giannis,
>>>
>>> I'm curious, what tool did you use for this analysis (what the
>>> screenshot shows)? Is it something custom?
>>>
>>> Thank you.
>>>
>>> On Wed, Apr 26, 2023 at 10:38 PM Giannis Polyzos 
>>> wrote:
>>>
 This is really helpful,

 Thanks

 On Thu, Apr 27, 2023 at 5:46 AM Yanfei Lei  wrote:

> Hi Giannis,
>
> Except for the “default” Column Family (CF), all other CFs represent state
> in the RocksDB state backend; the name of a CF is the name of a
> StateDescriptor.
>
> - deduplicate-state is a value state, you can find it in
> DeduplicateFunctionBase.java and
> MiniBatchDeduplicateFunctionBase.java, they are used for
> deduplication.
> - _timer_state/event_user-timers, _timer_state/event_timers ,
> _timer_state/processing_timers and _timer_state/processing_user-timers
>  are created by internal time service, which can be found in
> InternalTimeServiceManagerImpl.java. Here is a blog post[1] on best
> practices for using timers.
> - timer, next-index, left and right can be found in
> TemporalRowTimeJoinOperator.java, TemporalRowTimeJoinOperator
> implements the logic of temporal join, this post[2] might be helpful
> in understanding what happened to temporal join.
>
> [1]
> https://www.alibabacloud.com/help/en/realtime-compute-for-apache-flink/latest/datastream-timer
> [2]
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/joins/#temporal-joins
>
> Giannis Polyzos  wrote on Wed, Apr 26, 2023 at 23:19:
> >
> > I have two input kafka topics - a compacted one (with upsert-kafka)
> and a normal one.
> > When I perform a temporal join I notice the following state being
> created in rocksdb and was hoping someone could help me better understand
> what everything means
> >
> >
> > > deduplicate-state: does it refer to duplicate keys found by the
> kafka-upsert-connector?
> > > timers: what timer and _timer_state/event_timers refer to and
> whats their difference? Is it to keep track on when the join results need
> to be materialised or state to be expired?
> > > next-index: what does it refer to?
> > > left: also I'm curious why the left cf has 407 entries. Is it
> records that are being buffered because there is no match on the right
> table?
> >
> > Thanks
>
>
>
> --
> Best,
> Yanfei
>



Apache Flink Kubernetes Operator 1.4.0

2023-04-27 Thread rania duni
Hello!

As a newcomer to Flink and Kubernetes, I am seeking detailed instructions
to help me properly configure and deploy this Flink example (
https://github.com/apache/flink-kubernetes-operator/tree/main/examples/autoscaling)
on a Minikube environment. Can you give me specific configurations and
steps that I need to follow? Maybe it is a naive question, but I am new to
all this.


Re: [Discussion] - Release major Flink version to support JDK 17 (LTS)

2023-04-27 Thread Martijn Visser
Scala 2.12.7 doesn't compile on Java 17, see
https://issues.apache.org/jira/browse/FLINK-25000.

On Thu, Apr 27, 2023 at 3:11 PM Jing Ge  wrote:

> Thanks Tamir for the information. According to the latest comment of the
> task FLINK-24998, this bug should be gone while using the latest JDK 17. I
> was wondering whether it means that there are no more issues to stop us
> releasing a major Flink version to support Java 17? Did I miss something?
>
> Best regards,
> Jing
>
> On Thu, Apr 27, 2023 at 8:18 AM Tamir Sagi 
> wrote:
>
>> More details about the JDK bug here
>> https://bugs.openjdk.org/browse/JDK-8277529
>>
>> Related Jira ticket
>> https://issues.apache.org/jira/browse/FLINK-24998
>>
>> --
>> *From:* Jing Ge via user 
>> *Sent:* Monday, April 24, 2023 11:15 PM
>> *To:* Chesnay Schepler 
>> *Cc:* Piotr Nowojski ; Alexis Sarda-Espinosa <
>> sarda.espin...@gmail.com>; Martijn Visser ;
>> d...@flink.apache.org ; user 
>> *Subject:* Re: [Discussion] - Release major Flink version to support JDK
>> 17 (LTS)
>>
>>
>> *EXTERNAL EMAIL*
>>
>>
>> Thanks Chesnay for working on this. Would you like to share more info
>> about the JDK bug?
>>
>> Best regards,
>> Jing
>>
>> On Mon, Apr 24, 2023 at 11:39 AM Chesnay Schepler 
>> wrote:
>>
>> As it turns out Kryo isn't a blocker; we ran into a JDK bug.
>>
>> On 31/03/2023 08:57, Chesnay Schepler wrote:
>>
>>
>> https://github.com/EsotericSoftware/kryo/wiki/Migration-to-v5#migration-guide
>>
>> Kryo themselves state that v5 likely can't read v2 data.
>>
>> However, both versions can be on the classpath without conflicts, as v5
>> offers a versioned artifact that includes the version in the package.
>>
>> It probably wouldn't be difficult to migrate a savepoint to Kryo v5,
>> purely from a read/write perspective.
>>
>> The bigger question is how we expose this new Kryo version in the API. If
>> we stick to the versioned jar we need to either duplicate all current
>> Kryo-related APIs or find a better way to integrate other serialization
>> stacks.
>> On 30/03/2023 17:50, Piotr Nowojski wrote:
>>
>> Hey,
>>
>> > 1. The Flink community agrees that we upgrade Kryo to a later version,
>> which means breaking all checkpoint/savepoint compatibility and releasing a
>> Flink 2.0 with Java 17 support added and Java 8 and Flink Scala API support
>> dropped. This is probably the quickest way, but would still mean that we
>> expose Kryo in the Flink APIs, which is the main reason why we haven't been
>> able to upgrade Kryo at all.
>>
>> This sounds pretty bad to me.
>>
>> Has anyone looked into what it would take to provide a smooth migration
>> from Kryo2 -> Kryo5?
>>
>> Best,
>> Piotrek
>>
>> On Thu, Mar 30, 2023 at 16:54, Alexis Sarda-Espinosa 
>> wrote:
>>
>> Hi Martijn,
>>
>> just to be sure, if all state-related classes use a POJO serializer, Kryo
>> will never come into play, right? Given FLINK-16686 [1], I wonder how many
>> users actually have jobs with Kryo and RocksDB, but even if there aren't
>> many, that still leaves those who don't use RocksDB for
>> checkpoints/savepoints.
>>
>> If Kryo were to stay in the Flink APIs in v1.X, is it impossible to let
>> users choose between v2/v5 jars by separating them like log4j2 jars?
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-16686
>>
>> Regards,
>> Alexis.
>>
>> On Thu, Mar 30, 2023 at 14:26, Martijn Visser <
>> martijnvis...@apache.org> wrote:
>>
>> Hi all,
>>
>> I also saw a thread on this topic from Clayton Wohl [1] on this topic,
>> which I'm including in this discussion thread to avoid that it gets lost.
>>
>> From my perspective, there's two main ways to get to Java 17:
>>
>> 1. The Flink community agrees that we upgrade Kryo to a later version,
>> which means breaking all checkpoint/savepoint compatibility and releasing a
>> Flink 2.0 with Java 17 support added and Java 8 and Flink Scala API support
>> dropped. This is probably the quickest way, but would still mean that we
>> expose Kryo in the Flink APIs, which is the main reason why we haven't been
>> able to upgrade Kryo at all.
>> 2. There's a contributor who makes a contribution that bumps Kryo, but
>> either a) automagically reads in all old checkpoints/savepoints in using
>> Kryo v2 and writes them to new snapshots using Kryo v5 (like is mentioned
>> in the Kryo migration guide [2][3] or b) provides an offline tool that
>> allows users that are interested in migrating their snapshots manually
>> before starting from a newer version. That potentially could prevent the
>> need to introduce a new Flink major version. In both scenarios, ideally the
>> contributor would also help with avoiding the exposure of Kryo so that we
>> will be in a better shape in the future.
>>
>> It would be good to get the opinion of the community for either of these
>> two options, or potentially for another one that I haven't mentioned. If it
>> appears that there's an overall agreement on the direction, I would propose
>> that a FLIP gets creat

Re: [Discussion] - Release major Flink version to support JDK 17 (LTS)

2023-04-27 Thread Thomas Weise
Is the intention to bump the Flink major version and only support Java 17+?
If so, can Scala not be upgraded at the same time?

Thanks,
Thomas


On Thu, Apr 27, 2023 at 4:53 PM Martijn Visser 
wrote:

> Scala 2.12.7 doesn't compile on Java 17, see
> https://issues.apache.org/jira/browse/FLINK-25000.
>
> On Thu, Apr 27, 2023 at 3:11 PM Jing Ge  wrote:
>
> > Thanks Tamir for the information. According to the latest comment of the
> > task FLINK-24998, this bug should be gone while using the latest JDK 17.
> I
> > was wondering whether it means that there are no more issues to stop us
> > releasing a major Flink version to support Java 17? Did I miss something?
> >
> > Best regards,
> > Jing
> >
> > On Thu, Apr 27, 2023 at 8:18 AM Tamir Sagi 
> > wrote:
> >
> >> More details about the JDK bug here
> >> https://bugs.openjdk.org/browse/JDK-8277529
> >>
> >> Related Jira ticket
> >> https://issues.apache.org/jira/browse/FLINK-24998
> >>
> >> --
> >> *From:* Jing Ge via user 
> >> *Sent:* Monday, April 24, 2023 11:15 PM
> >> *To:* Chesnay Schepler 
> >> *Cc:* Piotr Nowojski ; Alexis Sarda-Espinosa <
> >> sarda.espin...@gmail.com>; Martijn Visser ;
> >> d...@flink.apache.org ; user <
> user@flink.apache.org>
> >> *Subject:* Re: [Discussion] - Release major Flink version to support JDK
> >> 17 (LTS)
> >>
> >>
> >> *EXTERNAL EMAIL*
> >>
> >>
> >> Thanks Chesnay for working on this. Would you like to share more info
> >> about the JDK bug?
> >>
> >> Best regards,
> >> Jing
> >>
> >> On Mon, Apr 24, 2023 at 11:39 AM Chesnay Schepler 
> >> wrote:
> >>
> >> As it turns out Kryo isn't a blocker; we ran into a JDK bug.
> >>
> >> On 31/03/2023 08:57, Chesnay Schepler wrote:
> >>
> >>
> >>
> https://github.com/EsotericSoftware/kryo/wiki/Migration-to-v5#migration-guide
> >>
> >> Kryo themselves state that v5 likely can't read v2 data.
> >>
> >> However, both versions can be on the classpath without conflicts, as v5
> >> offers a versioned artifact that includes the version in the package.
> >>
> >> It probably wouldn't be difficult to migrate a savepoint to Kryo v5,
> >> purely from a read/write perspective.
> >>
> >> The bigger question is how we expose this new Kryo version in the API.
> If
> >> we stick to the versioned jar we need to either duplicate all current
> >> Kryo-related APIs or find a better way to integrate other serialization
> >> stacks.
> >> On 30/03/2023 17:50, Piotr Nowojski wrote:
> >>
> >> Hey,
> >>
> >> > 1. The Flink community agrees that we upgrade Kryo to a later version,
> >> which means breaking all checkpoint/savepoint compatibility and
> releasing a
> >> Flink 2.0 with Java 17 support added and Java 8 and Flink Scala API
> support
> >> dropped. This is probably the quickest way, but would still mean that we
> >> expose Kryo in the Flink APIs, which is the main reason why we haven't
> been
> >> able to upgrade Kryo at all.
> >>
> >> This sounds pretty bad to me.
> >>
> >> Has anyone looked into what it would take to provide a smooth migration
> >> from Kryo2 -> Kryo5?
> >>
> >> Best,
> >> Piotrek
> >>
> >> On Thu, Mar 30, 2023 at 16:54, Alexis Sarda-Espinosa <
> sarda.espin...@gmail.com>
> >> wrote:
> >>
> >> Hi Martijn,
> >>
> >> just to be sure, if all state-related classes use a POJO serializer,
> Kryo
> >> will never come into play, right? Given FLINK-16686 [1], I wonder how
> many
> >> users actually have jobs with Kryo and RocksDB, but even if there aren't
> >> many, that still leaves those who don't use RocksDB for
> >> checkpoints/savepoints.
> >>
> >> If Kryo were to stay in the Flink APIs in v1.X, is it impossible to let
> >> users choose between v2/v5 jars by separating them like log4j2 jars?
> >>
> >> [1] https://issues.apache.org/jira/browse/FLINK-16686
> >>
> >> Regards,
> >> Alexis.
> >>
> >> On Thu, Mar 30, 2023 at 14:26, Martijn Visser <
> >> martijnvis...@apache.org> wrote:
> >>
> >> Hi all,
> >>
> >> I also saw a thread on this topic from Clayton Wohl [1] on this topic,
> >> which I'm including in this discussion thread to avoid that it gets
> lost.
> >>
> >> From my perspective, there's two main ways to get to Java 17:
> >>
> >> 1. The Flink community agrees that we upgrade Kryo to a later version,
> >> which means breaking all checkpoint/savepoint compatibility and
> releasing a
> >> Flink 2.0 with Java 17 support added and Java 8 and Flink Scala API
> support
> >> dropped. This is probably the quickest way, but would still mean that we
> >> expose Kryo in the Flink APIs, which is the main reason why we haven't
> been
> >> able to upgrade Kryo at all.
> >> 2. There's a contributor who makes a contribution that bumps Kryo, but
> >> either a) automagically reads in all old checkpoints/savepoints in using
> >> Kryo v2 and writes them to new snapshots using Kryo v5 (like is
> mentioned
> >> in the Kryo migration guide [2][3] or b) provides an offline tool that
> >> allows users that are interested in migrating their snapshots manually
> >> before 

Re: Apache Flink Kubernetes Operator 1.4.0

2023-04-27 Thread Hang Ruan
Hi, rania,

I think the quick-start guide [1] will be helpful for you. Further
information can be found in the operator documentation [2].

Best,
Hang

[1]
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/try-flink-kubernetes-operator/quick-start/
[2] https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/

rania duni  wrote on Fri, Apr 28, 2023 at 01:10:

> Hello!
>
> As a newcomer to Flink and Kubernetes, I am seeking detailed instructions
> to help me properly configure and deploy this Flink example (
> https://github.com/apache/flink-kubernetes-operator/tree/main/examples/autoscaling)
> on a Minikube environment. Can you give me specific configurations and
> steps that I need to follow? Maybe it is a naive question, but I am new to
> all this.
>


Re: Can I setup standby taskmanagers while using reactive mode?

2023-04-27 Thread Wei Hou via user
Thank you for all your responses! I think Gyula is right: simply doing MAX -
some_offset is not ideal, as it can make the standby TM useless.
It is difficult for the scheduler to determine whether a pod has been lost
or scaled down when autoscaling is enabled, which affects its decision to
utilize standby TMs. We would probably need to monitor the HPA events in
order to get this information.
I will wait and see whether a solution for this emerges in the future.


On Wed, Apr 26, 2023 at 7:20 AM Gyula Fóra  wrote:

> I think the behaviour is going to get a little weird because this would
> actually defeat the purpose of the standby TM.
> MAX - some offset will decrease once you lose a TM so in this case we
> would scale down to again have a spare (which we never actually use.)
>
> Gyula
>
> On Wed, Apr 26, 2023 at 4:02 PM Chesnay Schepler 
> wrote:
>
>> Reactive mode doesn't support standby taskmanagers. As you said it
>> always uses all available resources in the cluster.
>>
>> I can see it being useful though to not always scale to MAX but (MAX -
>> some_offset).
>>
>> I'd suggest to file a ticket.
>>
>> On 26/04/2023 00:17, Wei Hou via user wrote:
>> > Hi Flink community,
>> >
>> > We are trying to use Flink’s reactive mode with Kubernetes HPA for
>> autoscaling, however since the reactive mode will always use all available
>> resources, it causes a problem when we need standby task managers for fast
>> failure recovery: the job will always use these extra standby task managers
>> as active task manager to process data.
>> >
>> > I wonder if you have any suggestion on this, should we avoid using
>> Flink reactive mode together with standby task managers?
>> >
>> > Best,
>> > Wei
>> >
>> >
>>
>>


Re: Can I setup standby taskmanagers while using reactive mode?

2023-04-27 Thread Gyula Fóra
You could also check out the Autoscaler logic in the Flink Kubernetes
Operator (
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/
)
On the current main branch and in the upcoming 1.5.0 release, the mechanism
is pretty nice and solid :)

It works with the native integration so you can also set standby TMs with a
simple config.

Cheers,
Gyula

On Fri, Apr 28, 2023 at 7:31 AM Wei Hou  wrote:

> Thank you for all your responses! I think Gyula is right, simply do a MAX -
> some_offset is not ideal as it can make the standby TM useless.
> It is difficult for the scheduler to determine whether a pod has been lost
> or scaled down when we enable autoscaling, which affects its decision to
> utilize standby TMs. We probably need to monitor the HPA events in order to
> get this information.
> I will wait to see if there is a solution for this problem in the future.
>
>
> On Wed, Apr 26, 2023 at 7:20 AM Gyula Fóra  wrote:
>
>> I think the behaviour is going to get a little weird because this would
>> actually defeat the purpose of the standby TM.
>> MAX - some offset will decrease once you lose a TM so in this case we
>> would scale down to again have a spare (which we never actually use.)
>>
>> Gyula
>>
>> On Wed, Apr 26, 2023 at 4:02 PM Chesnay Schepler 
>> wrote:
>>
>>> Reactive mode doesn't support standby taskmanagers. As you said it
>>> always uses all available resources in the cluster.
>>>
>>> I can see it being useful though to not always scale to MAX but (MAX -
>>> some_offset).
>>>
>>> I'd suggest to file a ticket.
>>>
>>> On 26/04/2023 00:17, Wei Hou via user wrote:
>>> > Hi Flink community,
>>> >
>>> > We are trying to use Flink’s reactive mode with Kubernetes HPA for
>>> autoscaling, however since the reactive mode will always use all available
>>> resources, it causes a problem when we need standby task managers for fast
>>> failure recover: The job will always use these extra standby task managers
>>> as active task manager to process data.
>>> >
>>> > I wonder if you have any suggestion on this, should we avoid using
>>> Flink reactive mode together with standby task managers?
>>> >
>>> > Best,
>>> > Wei
>>> >
>>> >
>>>
>>>