[ANNOUNCE] Apache Flink Stateful Functions 3.2.0 released

2022-02-01 Thread Till Rohrmann
The Apache Flink community is very happy to announce the release of Apache Flink Stateful Functions 3.2.0. Stateful Functions is an API that simplifies building distributed stateful applications. It's based on functions with persistent state that can interact dynamically with strong consistency gu

Re: [ANNOUNCE] Apache Flink ML 2.0.0 released

2022-01-10 Thread Till Rohrmann
This is really great news. Thanks a lot for all the work Dong, Yun, Zhipeng and others! Cheers, Till On Fri, Jan 7, 2022 at 2:36 PM David Morávek wrote: > Great job! <3 Thanks Dong and Yun for managing the release and big thanks > to everyone who has contributed! > > Best, > D. > > On Fri, Jan

Re: [DISCUSS] Deprecate MapR FS

2022-01-05 Thread Till Rohrmann
+1 for dropping the MapR FS. Cheers, Till On Wed, Jan 5, 2022 at 10:11 AM Martijn Visser wrote: > Hi everyone, > > Thanks for your input. I've checked the MapR implementation and it has no > annotation at all. Given the circumstances that we thought that MapR was > already dropped, I would prop

Re: [DISCUSS] Drop Gelly

2022-01-03 Thread Till Rohrmann
I haven't seen any changes or requests to/for Gelly in ages. Hence, I would assume that it is not really used and can be removed. +1 for dropping Gelly. Cheers, Till On Mon, Jan 3, 2022 at 2:20 PM Martijn Visser wrote: > Hi everyone, > > Flink is bundled with Gelly, a Graph API library [1]. Th

Re: [ANNOUNCE] Apache Flink Stateful Functions 3.1.1 released

2021-12-23 Thread Till Rohrmann
Thanks a lot for being our release manager and swiftly addressing the log4j CVE Igal! Cheers, Till On Wed, Dec 22, 2021 at 5:41 PM Igal Shilman wrote: > The Apache Flink community is very happy to announce the release of Apache > Flink Stateful Functions (StateFun) 3.1.1. > > This is a bugfix r

Re: [DISCUSS] Changing the minimal supported version of Hadoop

2021-12-23 Thread Till Rohrmann
If there are no users strongly objecting to dropping Hadoop support for < 2.8, then I am +1 for this since otherwise we won't gain a lot as Xintong said. Cheers, Till On Wed, Dec 22, 2021 at 10:33 AM David Morávek wrote: > Agreed, if we drop the CI for lower versions, there is actually no point

Re: [ANNOUNCE] Apache Flink 1.14.2 / 1.13.5 / 1.12.7 / 1.11.6 released

2021-12-17 Thread Till Rohrmann
Thanks a lot for driving these releases Chesnay! This is super helpful for the community. For the synchronization problem, I guess we just have to wait a bit longer. Cheers, Till On Fri, Dec 17, 2021 at 7:39 AM Leonard Xu wrote: > I guess this is related to publishers everywhere are updating t

Re: [DISCUSS] Change some default config values of blocking shuffle

2021-12-14 Thread Till Rohrmann
As part of this FLIP, does it make sense to also extend the documentation for the sort shuffle [1] to include a tuning guide? I am thinking of a more in depth description of what things you might observe and how to influence them with the configuration options. [1] https://nightlies.apache.org/fli

Re: [DISCUSS] Change some default config values of blocking shuffle

2021-12-03 Thread Till Rohrmann
Thanks for starting this discussion Yingjie, How will our tests be affected by these changes? Will Flink require more resources and, thus, will it risk destabilizing our testing infrastructure? I would propose to create a FLIP for these changes since you propose to change the default behaviour. I

Re: [ANNOUNCE] Open source of remote shuffle project for Flink batch data processing

2021-11-30 Thread Till Rohrmann
Great news, Yingjie. Thanks a lot for sharing this information with the community and kudos to all the contributors of the external shuffle service :-) Cheers, Till On Tue, Nov 30, 2021 at 2:32 PM Yingjie Cao wrote: > Hi dev & users, > > We are happy to announce the open source of remote shuffl

Re: [DISCUSS] Drop Zookeeper 3.4

2021-11-26 Thread Till Rohrmann
x27;m not sure about ZK compatibility, but we'd also upgrade Curator to > 5.x, which doesn't support ookeeperK 3.4 anymore. > > On 25/11/2021 21:56, Till Rohrmann wrote: > > Should we ask on the user mailing list whether anybody is still using > > ZooKeeper 3.4 and

Re: Reactive mode in 1.13

2021-11-03 Thread Till Rohrmann
a lot working fine now. Also you also explain how to pass > parameters to the job. In the session cluster I am passing arguments using > api. > > > > Here how can I pass the arguments to the job? > > > > > > Regards, > > Ravi Sankar Reddy. > > > &g

Re: Reactive mode in 1.13

2021-11-02 Thread Till Rohrmann
am.readOrdinaryObject(ObjectInputStream.java:2186) > ~[?:1.8.0_312] > > at > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666) > ~[?:1.8.0_312] > > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2404) > ~[?:1.8.0_312

Re: Reactive mode in 1.13

2021-11-02 Thread Till Rohrmann
Hi Ravi, The reactive mode shouldn't do things differently compared to a normal application cluster deployment. Maybe you can show us exactly how you submit a job, the contents of the bundled jar, how you build the fat jar and the logs of the failed Flink run. Moving this discussion to the user M

Re: Issue in deploying Flink application in AKS 1.21

2021-11-01 Thread Till Rohrmann
We are getting the attached exception when running the > application written in Flink version 1.13.2. AKS is 1.21. Please let us > know if you need additional information. > > > > Regards, > > Sundar. > > *From:* Till Rohrmann > *Sent:* Monday, 1 Nov

Re: Issue in deploying Flink application in AKS 1.21

2021-11-01 Thread Till Rohrmann
Hi Deepa, I cannot see the attached image. Maybe you can post the description of the problem. I am also moving this discussion to the user ML. Cheers, Till On Mon, Nov 1, 2021 at 9:05 AM Sekaran, Sreedeepa < sreedeepa.seka...@westpac.com.au> wrote: > Hi Team, > > > > We are working with Flink

Re: [ANNOUNCE] Apache Flink 1.13.3 released

2021-10-22 Thread Till Rohrmann
Thanks Chesnay and Martijn for managing this release and to everyone who contributed to it. Cheers, Till On Fri, Oct 22, 2021 at 11:04 AM Yangze Guo wrote: > Thank Chesnay, Martijn, and everyone involved! > > Best, > Yangze Guo > > On Fri, Oct 22, 2021 at 4:25 PM Yun Tang wrote: > > > > Thanks

Re: [ANNOUNCE] Flink mailing lists archive service has migrated to Apache Archive service

2021-09-30 Thread Till Rohrmann
Thanks for the hint with the managed search engines Matthias. I think this is quite helpful. Cheers, Till On Wed, Sep 15, 2021 at 4:27 PM Matthias Pohl wrote: > Thanks Leonard for the announcement. I guess that is helpful. > > @Robert is there any way we can change the default setting to someth

Re: Job leader ... lost leadership with version 1.13.2

2021-09-02 Thread Till Rohrmann
Forwarding the discussion back to the user mailing list. On Thu, Sep 2, 2021 at 12:25 PM Till Rohrmann wrote: > The stack trace looks ok. This happens whenever the leader loses > leadership and this can have different reasons. What's more interesting is > what happens before

Re: Checkpointing failure, subtasks get stuck

2021-09-02 Thread Till Rohrmann
Hi Xiangyu, Can you provide us with more information about your job, which state backend you are using and how you've configured the checkpointing? Can you also provide some information about the problematic checkpoints (e.g. alignment time, async/sync duration) that you find on the checkpoint det

Re: Job leader ... lost leadership with version 1.13.2

2021-09-02 Thread Till Rohrmann
Hi Xiangyu, Do you have the logs of the problematic test run available? Ideally, we can enable the DEBUG log level to get some more information. I think this information would be needed to figure out the problem. Cheers, Till On Thu, Sep 2, 2021 at 11:47 AM Xiangyu Su wrote: > Hello Everyone,

Re: [ANNOUNCE] Apache Flink Stateful Functions 3.1.0 released

2021-09-01 Thread Till Rohrmann
Great news! Thanks a lot for all your work on the new release :-) Cheers, Till On Wed, Sep 1, 2021 at 9:07 AM Johannes Moser wrote: > Congratulations, great job. 🎉 > > On 31.08.2021, at 17:09, Igal Shilman wrote: > > The Apache Flink community is very happy to announce the release of Apache >

Re: 【Announce】Zeppelin 0.10.0 is released, Flink on Zeppelin Improved

2021-08-26 Thread Till Rohrmann
Cool, thanks for letting us know Jeff. Hopefully, many users use Zeppelin together with Flink. Cheers, Till On Thu, Aug 26, 2021 at 4:47 AM Leonard Xu wrote: > Thanks Jeff for the great work ! > > Best, > Leonard > > 在 2021年8月25日,22:48,Jeff Zhang 写道: > > Hi Flink users, > > We (Zeppelin commun

Re: AdaptiveScheduler stopped without exception

2021-08-24 Thread Till Rohrmann
Hi Yanjie, The observed exception in the logs is just a side effect of the shut down procedure. It is a bug that shutting down the Dispatcher will result in a fatal exception coming from the ApplicationDispatcherBootstrap. I've created a ticket in order to fix it [1]. The true reason for stopping

Re: [ANNOUNCE] Apache Flink 1.11.4 released

2021-08-10 Thread Till Rohrmann
This is great news. Thanks a lot for being our release manager Godfrey and also to everyone who made this release possible. Cheers, Till On Tue, Aug 10, 2021 at 11:09 AM godfrey he wrote: > The Apache Flink community is very happy to announce the release of Apache > Flink 1.11.4, which is the f

Re: [ANNOUNCE] Apache Flink 1.12.5 released

2021-08-10 Thread Till Rohrmann
This is great news. Thanks a lot for being our release manager Jingsong and also to everyone who made this release possible. Cheers, Till On Tue, Aug 10, 2021 at 10:57 AM Jingsong Lee wrote: > The Apache Flink community is very happy to announce the release of Apache > Flink 1.12.5, which is th

Re: [ANNOUNCE] Apache Flink 1.13.2 released

2021-08-09 Thread Till Rohrmann
Thanks Yun Tang for being our release manager and the great work! Also thanks a lot to everyone who contributed to this release. Cheers, Till On Mon, Aug 9, 2021 at 9:48 AM Yu Li wrote: > Thanks Yun Tang for being our release manager and everyone else who made > the release possible! > > Best R

Re: Flink job failure during yarn node termination

2021-08-03 Thread Till Rohrmann
Hi Rainie, It looks to me as if Yarn is causing this problem. Which Yarn node are you terminating? Have you configured your Yarn cluster to be highly available in case you are terminating the ResourceManager? Flink should retry the operation of starting a new container in case it fails. If this i

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-22 Thread Till Rohrmann
(I > understand that normally, as user code is not a JVM-blocking activity such > as a GC, it should have no impact on heartbeats, but from experience, it > really does) > > > > Cheers, > > Arnaud > > > > > > *De :* Gen Luo > *Envoyé :* jeudi 22 juillet 20

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-21 Thread Till Rohrmann
about long pauses of unresponsiveness of Flink. [1] https://issues.apache.org/jira/browse/FLINK-23216 [2] https://issues.apache.org/jira/browse/FLINK-22483 Cheers, Till On Wed, Jul 21, 2021 at 4:58 AM Yang Wang wrote: > Thanks @Till Rohrmann for starting this discussion > > Firstly,

[DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-16 Thread Till Rohrmann
Hi everyone, Since Flink 1.5 we have the same heartbeat timeout and interval default values that are defined as heartbeat.timeout: 50s and heartbeat.interval: 10s. These values were mainly chosen to compensate for lengthy GC pauses and blocking operations that were executed in the main threads of

Re: savepoint failure

2021-07-14 Thread Till Rohrmann
lism settings have not changed. >> >> >> >> On Tue, Jul 13, 2021 at 12:11 PM Dan Hill wrote: >> >>> Hey. I just hit a similar error in production when trying to >>> savepoint. We also use protobufs. >>> >>> Has anyone found a b

Re: Job Recovery Time on TM Lost

2021-07-09 Thread Till Rohrmann
y helpful to find the lost container quickly. In our inner >>> flink version, we optimize it by task's report and jobmaster's probe. When >>> a task fails because of the connection, it reports to the jobmaster. The >>> jobmaster will try to confirm the liveness

Re: Job Recovery Time on TM Lost

2021-07-06 Thread Till Rohrmann
double check should be good > enough. But since I'm not experienced with an unstable environment, I can't > tell whether that's also enough for it. > > On Mon, Jul 5, 2021 at 5:59 PM Till Rohrmann wrote: > >> I think for RPC communication there are retry strategie

Re: Job Recovery Time on TM Lost

2021-07-05 Thread Till Rohrmann
ations for > fault tolerance of heartbeat, unless we also introduce some retry > strategies for netty connections. > > > On Fri, Jul 2, 2021 at 9:34 PM Till Rohrmann wrote: > >> Could you share the full logs with us for the second experiment, Lu? I >> cannot tell

Re: Job Recovery Time on TM Lost

2021-07-02 Thread Till Rohrmann
ote: >>> >>>> Thanks TIll and Yang for help! Also Thanks Till for a quick fix! >>>> >>>> I did another test yesterday. In this test, I intentionally throw >>>> exception from the source operator: >>>> ``` >>>> if (ru

Re: Job Recovery Time on TM Lost

2021-07-01 Thread Till Rohrmann
nt. > As a result, Flink will not deploy tasks onto the dead TaskManagers. > > > I think most of the time cost in Phase 1 might be cancelling the tasks on > the dead TaskManagers. > > > Best, > Yang > > > Till Rohrmann 于2021年7月1日周四 下午4:49写道: > >> The

Re: Job Recovery Time on TM Lost

2021-07-01 Thread Till Rohrmann
The analysis of Gen is correct. Flink currently uses its heartbeat as the primary means to detect dead TaskManagers. This means that Flink will take at least `heartbeat.timeout` time before the system recovers. Even if the cancellation happens fast (e.g. by having configured a low akka.ask.timeout)

Re: Add control mode for flink

2021-06-11 Thread Till Rohrmann
Thanks for starting this discussion. I do see the benefit of dynamically configuring your Flink job and the cluster running it. Some of the use cases which were mentioned here are already possible. E.g. adjusting the log level dynamically can be done by configuring an appropriate logging backend an

Re: How to gracefully handle job recovery failures

2021-06-11 Thread Till Rohrmann
Hi Li, Roman is right about Flink's behavior and what you can do about it. The idea behind its current behavior is the following: If Flink cannot recover a job, it is very hard for it to tell whether it is due to an intermittent problem or a permanent one. No matter how often you retry, you can al

Re: [DISCUSS] FLIP-148: Introduce Sort-Merge Based Blocking Shuffle to Flink

2021-06-08 Thread Till Rohrmann
Great :-) On Tue, Jun 8, 2021 at 1:11 PM Yingjie Cao wrote: > Hi Till, > > Thanks for the suggestion. The blog post is already on the way. > > Best, > Yingjie > > Till Rohrmann 于2021年6月8日周二 下午5:30写道: > >> Thanks for the update Yingjie. Would it make sense to

Re: [DISCUSS] FLIP-148: Introduce Sort-Merge Based Blocking Shuffle to Flink

2021-06-08 Thread Till Rohrmann
Thanks for the update Yingjie. Would it make sense to write a short blog post about this feature including some performance improvement numbers? I think this could be interesting to our users. Cheers, Till On Mon, Jun 7, 2021 at 4:49 AM Jingsong Li wrote: > Thanks Yingjie for the great effort!

Re: recover from svaepoint

2021-06-03 Thread Till Rohrmann
tead of > >> this.producerIdAndEpoch = ProducerIdAndEpoch.NONE; > > > On Thu, Jun 3, 2021 at 12:11 AM Till Rohrmann > wrote: > >> Thanks for the update. Skimming over the code it looks indeed that we are >> overwriting the values of the static value ProducerIdAndEpoch.NONE.

Re: recover from svaepoint

2021-06-03 Thread Till Rohrmann
more familiar could double check this part of the code. Concerning the required changing of the UID of an operator Piotr, is this a known issue and documented somewhere? I find this rather surprising from a user's point of view. Cheers, Till On Thu, Jun 3, 2021 at 8:53 AM Till Rohrmann

Re: recover from svaepoint

2021-06-02 Thread Till Rohrmann
enqueueRequest(handler); return handler.result; }, State.INITIALIZING); } On Wed, Jun 2, 2021 at 3:55 PM Piotr Nowojski wrote: > Hi, > > I think there is no generic way. If this error has happened indeed after > starting a second job from the same savepoint, or som

Re: How to check is Kafka partition "idle" in emitRecordsWithTimestamps

2021-06-02 Thread Till Rohrmann
ssueKey FLINK-18450 Preview comment > issues.apache.org > > Thanks, > Alexey > -- > *From:* Till Rohrmann > *Sent:* Tuesday, June 1, 2021 6:24 AM > *To:* Alexey Trenikhun > *Cc:* Flink User Mail List > *Subject:* Re: How

Re: S3 + Parquet credentials issue

2021-06-01 Thread Till Rohrmann
Hi Angelo, what Svend has written is very good advice. Additionally, you could give us a bit more context by posting the exact stack trace and the exact configuration you use to deploy the Flink cluster. To me this looks like a configuration/setup problem in combination with AWS. Cheers, Till On

Re: Reading Flink states from svaepoint uning State Processor API

2021-06-01 Thread Till Rohrmann
Hi Min, Usually, you should be able to provide type information and thereby a serializer via the StateDescriptors which you create to access the state. If this is not working, then you need to give us a bit more context to understand what's not working. I am also pulling in Seth who is the origin

Re: savepoint fail

2021-06-01 Thread Till Rohrmann
Responded as part of the following discussion https://lists.apache.org/x/thread.html/re85a1fb3f17b5ef1af2844fc1a07b6d2f5e6237bf4c33059e13890ee@%3Cuser.flink.apache.org%3E. Let's continue the discussion there. Cheers, Till On Mon, May 31, 2021 at 11:02 AM 周瑞 wrote: > HI: > When "sink.seman

Re: Got exception when running the localhost cluster

2021-06-01 Thread Till Rohrmann
Hi Lingfeng, Youngwoo is right. Flink currently officially supports Java 8 and Java 11. Cheers, Till On Mon, May 31, 2021 at 9:00 AM Youngwoo Kim (김영우) wrote: > Hi Lingfeng, > > I believe Java 8 or 11 is appropriate for the Flink cluster at this point. > I'm not sure that Flink 1.13 supports J

Re: Flink kafka

2021-06-01 Thread Till Rohrmann
Responded as part of the following discussion https://lists.apache.org/x/thread.html/re85a1fb3f17b5ef1af2844fc1a07b6d2f5e6237bf4c33059e13890ee@%3Cuser.flink.apache.org%3E. Let's continue the discussion there. Cheers, Till On Sun, May 30, 2021 at 2:32 PM 周瑞 wrote: > > 程序用于测试 flink kafka exactly

Re: classloader.resolve-order is not honored when submitting job to a remote cluster

2021-06-01 Thread Till Rohrmann
Hi Tao, I think this looks like a bug to me. Could it be that this problem is covered by [1, 2]? Maybe you can review this PR and check whether it solves the problem. If yes, then let's quickly get it in. [1] https://issues.apache.org/jira/browse/FLINK-21445 [2] https://github.com/apache/flink/pu

Re: Error about "Rejected TaskExecutor registration at the ResourceManger"

2021-06-01 Thread Till Rohrmann
Hi Kai, The rejection you are seeing should not be serious. The way this can happen is the following: If Yarn restarts the application master, Flink will try to recover previously started containers. If this is not possible or Yarn only tells about a subset of the previously allocated containers,

Re: Flink Metrics Naming

2021-06-01 Thread Till Rohrmann
Hi Mason, The idea is that a metric is not uniquely identified by its name alone but instead by its path. The groups in which it is defined specify this path (similar to directories). That's why it is valid to specify two metrics with the same name if they reside in different groups. I think Prom

Re: How to check is Kafka partition "idle" in emitRecordsWithTimestamps

2021-06-01 Thread Till Rohrmann
Hi Alexey, looking at KafkaTopicPartitionStatus, it looks that it does not contain this information. In a nutshell, what you probably have to do is to aggregate the watermarks across all partitions and then pause the consumption of a partition if its watermark advances too much wrt to the minimum

Re: recover from svaepoint

2021-06-01 Thread Till Rohrmann
The error message says that we are trying to reuse a transaction id that is currently being used or has expired. I am not 100% sure how this can happen. My suspicion is that you have resumed a job multiple times from the same savepoint. Have you checked that there is no other job which has been re

Re: Getting Abstract method Error with flink 1.13.0 and 1.12.0

2021-06-01 Thread Till Rohrmann
Hi Dipanjan, this type of question is best sent to Flink's user mailing list because there are a lot more people using Flink who could help you. The dev mailing list is intended to be used for development discussions. Cheers, Till On Tue, Jun 1, 2021 at 1:31 PM Dipanjan Mazumder wrote: > Hi ,

Re: How to use or configure flink checkpointing with siddhi internal state

2021-06-01 Thread Till Rohrmann
Hi Dipanjan, I am assuming that you are using the flink-siddhi library [1]. I am not an expert but it looks as if the AbstractSiddhiOperator overrides the snapshotState [2] method to store the Siddhi state in Flink. [1] https://github.com/haoch/flink-siddhi [2] https://github.com/haoch/flink-sidd

Re: [ANNOUNCE] Apache Flink 1.13.1 released

2021-05-31 Thread Till Rohrmann
Thanks for the great work Dawid and to everyone who has contributed to this release. Cheers, Till On Mon, May 31, 2021 at 10:25 AM Yangze Guo wrote: > Thanks, Dawid for the great work, thanks to everyone involved. > > Best, > Yangze Guo > > On Mon, May 31, 2021 at 4:14 PM Youngwoo Kim (김영우) >

Re: Issue with using siddhi extension function with flink

2021-05-21 Thread Till Rohrmann
Hi Dipanjan, Please double check whether the libraries are really contained in the job jar you are submitting because if the library is contained in this jar, then it should be on the classpath and you should be able to load it. Cheers, Till On Thu, May 20, 2021 at 3:43 PM Dipanjan Mazumder wro

Re: SIGSEGV error

2021-05-18 Thread Till Rohrmann
;bucket-states",new > JavaSerializer()); > > Hope this is helpful. > > Yours sincerely > Josh > > > > Till Rohrmann 于2021年5月18日周二 下午2:54写道: > >> Hi Joshua, >> >> could you try whether the job also fails when not using the gzip format? >>

Re: SIGSEGV error

2021-05-17 Thread Till Rohrmann
gt;>> be triggered now by newer versions of FLink. >>> >>> I found this on StackOverflow, which looks like it could be related: >>> https://stackoverflow.com/questions/38326183/jvm-crashed-in-java-util-zip-zipfile-getentry >>> Can you try the suggested option

Re: Issues while writing data to a parquet sink

2021-05-17 Thread Till Rohrmann
Hi Adi, To me, this looks like a version conflict of some kind. Maybe you use different Avro versions for your user program and on your Flink cluster. Could you check that you don't have conflicting versions on your classpath? It would also be helpful to have a minimal example that allows reproduc

Re: Flink 1.13.0 reactive mode: Job stop and cannot restore from checkpoint

2021-05-17 Thread Till Rohrmann
One small addition: The old mapping looks to use the SubtaskStateMapper.RANGE whereas the new mapping looks to use the SubtaskStateMapper.ROUND_ROBIN. On Mon, May 17, 2021 at 11:56 AM Till Rohrmann wrote: > Hi ChangZhuo Chen, > > This looks like a bug in Flink. Could you provide us

Re: Flink 1.13.0 reactive mode: Job stop and cannot restore from checkpoint

2021-05-17 Thread Till Rohrmann
Hi ChangZhuo Chen, This looks like a bug in Flink. Could you provide us with the logs of the run and more information about your job? In particular, how does your topology look like? My suspicion is the following: You have an operator with two inputs. One input is keyed whereas the other input is

Re: Task Local Recovery with mountable disks in the cloud

2021-05-10 Thread Till Rohrmann
t; ticket do? > > Thanks, > Sonam > -- > *From:* Till Rohrmann > *Sent:* Monday, April 26, 2021 10:24 AM > *To:* dev > *Cc:* user@flink.apache.org ; Sonam Mandal < > soman...@linkedin.com> > *Subject:* Re: Task Local Recovery with mou

Re: Flink auto-scaling feature and documentation suggestions

2021-05-06 Thread Till Rohrmann
Yes, exposing an API to adjust the parallelism of individual operators is definitely a good step towards the auto-scaling feature which we will consider. The missing piece is persisting this information so that in case of recovery you don't recover with a completely different parallelism. I also a

Re: Flink auto-scaling feature and documentation suggestions

2021-05-06 Thread Till Rohrmann
Hi Vishal, thanks a lot for all your feedback on the new reactive mode. I'll try to answer your questions. 0. In order to avoid confusion let me quickly explain a bit of terminology: The reactive mode is the new feature that allows Flink to react to newly available resources and to make use of th

Re: Presence of Jars in Flink reg security

2021-05-05 Thread Till Rohrmann
Hi Prasanna, in the latest Flink version (1.13.0) I couldn't find these dependencies. Which version of Flink are you looking at? What you could check is whether one of these dependencies is contained in one of Flink's shaded dependencies [1]. [1] https://github.com/apache/flink-shaded Cheers, Ti

Re: Flink(1.12.2/scala 2.11) HA with Zk in kubernetes standalone mode deployment is not working

2021-05-03 Thread Till Rohrmann
Somewhere the system retrieves the address x.x.x.x:43092 which cannot be connected to. Can you check that this points towards a valid Flink process? Maybe it is some leftover information in the ZooKeeper from a previous run? Maybe you can check what's written in the Znodes for /leader/resource_mana

Re: [ANNOUNCE] Apache Flink 1.13.0 released

2021-05-03 Thread Till Rohrmann
This is great news. Thanks a lot for being our release managers Dawid and Guowei! And also thanks to everyone who has made this release possible :-) Cheers, Till On Mon, May 3, 2021 at 5:46 PM vishalovercome wrote: > This is a very big release! Many thanks to the flink developers for their > co

Re: How to implement a window that emits events at regular intervals and on specific events

2021-04-30 Thread Till Rohrmann
only > set this once at the first event. But when testing it does work as I want > and fires every ten seconds and the fires and purges only after no events > have been received for 2 minutes (as specified in the SessionWindow). Is > the processingTimeTimer being updated every

Re: KafkaSourceBuilder causing invalid negative offset on checkpointing

2021-04-30 Thread Till Rohrmann
apache.flink.streaming.api.operators.SourceOperator.snapshotState(SourceOperator.java:288) > at > org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:205) > ... 20 more > > > > -- > *From:*

Re: [ANNOUNCE] Apache Flink 1.12.3 released

2021-04-29 Thread Till Rohrmann
Great to hear. Thanks a lot for being our release manager Arvid and to everyone who has contributed to this release! Cheers, Till On Thu, Apr 29, 2021 at 4:11 PM Arvid Heise wrote: > Dear all, > > The Apache Flink community is very happy to announce the release of Apache > Flink 1.12.3, which i

Re: Key by Kafka partition / Kinesis shard

2021-04-29 Thread Till Rohrmann
eknosrc.com >> >> >> >> >> >> >> <https://www.avast.com/sig-email?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=webmail> >> Virus-free. >> www.avast.com >> <https://www.avast.com/sig-email?utm_

Re: KafkaSourceBuilder causing invalid negative offset on checkpointing

2021-04-29 Thread Till Rohrmann
ssionWindows.withGap(Time.seconds(60))).aggregate(new > SensorEventAggregator) > > L > > > -- > *From:* Till Rohrmann > *Sent:* Thursday, April 29, 2021 09:16 > *To:* Lars Skjærven ; Becket Qin < > becket@gmail.com> > *Cc:* user@fl

Re: How to implement a window that emits events at regular intervals and on specific events

2021-04-29 Thread Till Rohrmann
ueStates in both > cases? > > On Thu, 29 Apr 2021 at 10:25, Till Rohrmann wrote: > >> Hi Tim, >> >> I think you could use Flink's trigger API [1] to implement a trigger >> which fires when it sees a certain event or after some time. >> >> [1] &g

Re: Best practice for packaging and deploying Flink jobs on K8S

2021-04-29 Thread Till Rohrmann
ython > code on a Flink cluster. > > Thanks, > Sumeet > > > > On Thu, Apr 29, 2021 at 12:37 PM Till Rohrmann > wrote: > >> Hi Sumeet, >> >> Is there a problem with the documented approaches on how to submit the >> Python program (not working) o

Re: Taskmanager killed often after migrating to flink 1.12

2021-04-29 Thread Till Rohrmann
will update you. > > Regards > Sambaran > > On Wed, Apr 28, 2021 at 5:33 PM Till Rohrmann > wrote: > >> Hi Sambaran, >> >> could you also share the cause why the checkpoints could not be discarded >> with us? >> >> With Flink 1.10, we introduced

Re: KafkaSourceBuilder causing invalid negative offset on checkpointing

2021-04-29 Thread Till Rohrmann
Hi Lars, I think this is a duplicate message. Let's continue the discussion on your original message. Cheers, Till On Wed, Apr 28, 2021 at 8:50 PM Lars Skjærven wrote: > Hello, > I ran into an issue when using the new KafkaSourceBuilder (running Flink > 1.12.2, scala 2.12.13, on ververica plat

Re: Queryable State unavailable after Kubernetes HA State cleanup

2021-04-29 Thread Till Rohrmann
Hi Sandeep, I don't fully understand the problematic scenario yet. What exactly is the HA state maintained by Kubernetes in S3? Queryable state works by asking for the current state of an operator. If you use asQueryableState, then you create a reducing state which appends all stream elements. Th

Re: How to implement a window that emits events at regular intervals and on specific events

2021-04-29 Thread Till Rohrmann
Hi Tim, I think you could use Flink's trigger API [1] to implement a trigger which fires when it sees a certain event or after some time. [1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html#triggers Cheers, Till On Wed, Apr 28, 2021 at 5:25 PM Tim Josefs

Re: Flink Resuming From Checkpoint With "-s" FAILURE

2021-04-29 Thread Till Rohrmann
Hi Zachary, How did you configure the Kafka connector to commit the offsets (periodically, on checkpoint)? One explanation for the graphs you showed is that you enabled periodic committing of the offsets. If this automatic commit happens between two checkpoints and you later fall back to the earli

Re: Key by Kafka partition / Kinesis shard

2021-04-29 Thread Till Rohrmann
Hi Yegor, If you want to use Flink's keyed windowing logic, then you need to insert a keyBy/shuffle operation because Flink currently cannot simply use the partitioning of the Kinesis shards. The reason is that Flink needs to group the keys into the correct key groups in order to support rescaling

Re: KafkaSourceBuilder causing invalid negative offset on checkpointing

2021-04-29 Thread Till Rohrmann
Hi Lars, The KafkaSourceBuilder constructs the new KafkaSource which has not been fully hardened in 1.12.2. In fact, it should not be documented yet. I think you are running into an instability/bug of. The new Kafka source should be hardened a lot more in the 1.13.0 release. Could you tell us exa

Re: Best practice for packaging and deploying Flink jobs on K8S

2021-04-29 Thread Till Rohrmann
Hi Sumeet, Is there a problem with the documented approaches on how to submit the Python program (not working) or are you asking in general? Given the documentation, I would assume that you can configure the requirements.txt via `set_python_requirements`. I am also pulling in Dian who might be ab

Re: Exception handling

2021-04-28 Thread Till Rohrmann
Hi Jacob, one of the contracts Flink has is that if a UDF throws an exception then this means that it has failed and that it needs recovery. Hence, it is the responsibility of the user to make sure that tolerable exceptions do not bubble up. If you have dirty input data then it might make sense to

Re: Taskmanager killed often after migrating to flink 1.12

2021-04-28 Thread Till Rohrmann
Hi Sambaran, could you also share the cause why the checkpoints could not be discarded with us? With Flink 1.10, we introduced a stricter memory model for the TaskManagers. That could be a reason why you see more TaskManagers being killed by the underlying resource management system. You could ma

Re: Task Local Recovery with mountable disks in the cloud

2021-04-26 Thread Till Rohrmann
Hi Sonam, sorry for the late reply. We were a bit caught in the midst of the feature freeze for the next major Flink release. In general, I think it is a very good idea to disaggregate the local state storage to make it reusable across TaskManager failures. However, it is also not trivial to do.

Re: when should `FlinkYarnSessionCli` be included for parsing CLI arguments?

2021-04-26 Thread Till Rohrmann
I should always be the first choice. > Could you help me to understand why the FlinkYarnSessionCli can be > activated? > > > Best, > Yangze Guo > > On Mon, Apr 26, 2021 at 4:48 PM Till Rohrmann > wrote: > > > > Hi Tony, > > > > I think you are ri

Re: when should `FlinkYarnSessionCli` be included for parsing CLI arguments?

2021-04-26 Thread Till Rohrmann
Hi Tony, I think you are right that Flink's cli does not behave super consistent at the moment. Case 2. should definitely work because `-t yarn-application` should overwrite what is defined in the Flink configuration. The problem seems to be that we don't resolve the configuration wrt the specifie

Re: How to know if task-local recovery kicked in for some nodes?

2021-04-12 Thread Till Rohrmann
hts on a good minimum state size we should experiment > with to check recovery time differences between the two modes? > > Thanks, > Sonam > -- > *From:* dhanesh arole > *Sent:* Wednesday, April 7, 2021 3:43:11 AM > *To:* Till Rohrmann &

Re: Task manager local state data after crash / recovery

2021-04-09 Thread Till Rohrmann
Hi Dhanesh, The way local state works in Flink currently is the following: The user configures a `taskmanager.state.local.root-dirs` or the tmp directory is used where Flink creates a "localState" directory. This is the base directory for all local state. Within this directory a TaskManager create

Re: Flink: Exception from container-launch exitCode=2

2021-04-09 Thread Till Rohrmann
I actually think that the logging problem is caused by Hadoop 2.7.3 which pulls in the slf4j-log4j12-1.7.10.jar. This binding is then used but there is no proper configuration file for log4j because Flink actually uses log4j2. Cheers, Till On Fri, Apr 9, 2021 at 12:05 PM Till Rohrmann wrote

Re: Flink: Exception from container-launch exitCode=2

2021-04-09 Thread Till Rohrmann
Hi Yik San, to me it looks as if there is a problem with the job and the deployment. Unfortunately, the logging seems to not have worked. Could you check that you have a valid log4j.properties file in your conf directory. Cheers, Till On Wed, Apr 7, 2021 at 4:57 AM Yik San Chan wrote: > *The q

Re: Flink does not cleanup some disk memory after submitting jar over rest

2021-04-09 Thread Till Rohrmann
Hi, What you could also do is to create several heap dumps [1] whenever you submit a new job. This could allow us to analyze whether there is something increasing the heap memory consumption. Additionally, you could try to upgrade your cluster to Flink 1.12.2 since we fixed some problems Maciek me

Re: Why does flink-quickstart-scala suggests adding connector dependencies in the default scope, while Flink Hive integration docs suggest the opposite

2021-04-09 Thread Till Rohrmann
cluster" part - > does it always require a cluster restart whenever the /lib directory > changes? > > Thanks. > > Best, > Yik San > > On Fri, Apr 9, 2021 at 1:07 AM Till Rohrmann wrote: > >> Hi Yik San, >> >> for future reference, I copy my an

Re: Why does flink-quickstart-scala suggests adding connector dependencies in the default scope, while Flink Hive integration docs suggest the opposite

2021-04-08 Thread Till Rohrmann
Hi Yik San, for future reference, I copy my answer from the SO here: The reason for this difference is that for Hive it is recommended to start the cluster with the respective Hive dependencies. The documentation [1] states that it's best to put the dependencies into the lib directory before you

Re: Flink 1.13 and CSV (batch) writing

2021-04-08 Thread Till Rohrmann
Hi Flavio, I tried to execute the code snippet you have provided and I could not reproduce the problem. Concretely I am running this code: final EnvironmentSettings envSettings = EnvironmentSettings.newInstance() .useBlinkPlanner() .inStreamingMode() .build(); final TableEnvironment

Re: Reducing Task Manager Count Greatly Increases Savepoint Restore

2021-04-08 Thread Till Rohrmann
Hi Kevin, when decreasing the TaskManager count I assume that you also decrease the parallelism of the Flink job. There are three aspects which can then cause a slower recovery. 1) Each Task gets a larger key range assigned. Therefore, each TaskManager has to download more data in order to restar

  1   2   3   4   5   6   7   8   9   10   >