RE: How to specify the number of TaskManagers in Yarn Cluster using Per-Job Mode

2020-08-17 Thread 范超
Thanks, Yangze, for providing these links. I'll try it! -Original Message- From: Yangze Guo [mailto:karma...@gmail.com] Sent: Tuesday, August 18, 2020 12:57 To: 范超 Cc: user (user@flink.apache.org) Subject: Re: How to specify the number of TaskManagers in Yarn Cluster using Per-Job Mode The number of TM mainly dep

Re: How to specify the number of TaskManagers in Yarn Cluster using Per-Job Mode

2020-08-17 Thread Yangze Guo
The number of TMs mainly depends on the parallelism and the job graph. Flink now allows you to set the maximum number of slots (slotmanager.number-of-slots.max[1]). There is also a plan to support setting the minimum number of slots[2]. [1] https://ci.apache.org/projects/flink/flink-docs-master/ops/confi
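
As a rough illustration of how parallelism translates into a TaskManager count, here is a minimal Python sketch. It assumes default slot sharing (so the job needs as many slots as its highest operator parallelism) and that every TaskManager offers the same number of slots (`taskmanager.numberOfTaskSlots`). The function name and the optional cap, which stands in for `slotmanager.number-of-slots.max`, are illustrative only and not part of any Flink API.

```python
import math
from typing import Optional


def estimate_taskmanagers(job_parallelism: int,
                          slots_per_tm: int,
                          max_slots: Optional[int] = None) -> int:
    """Estimate how many TaskManagers a per-job cluster would request.

    Assumption: with default slot sharing, the job needs one slot per
    parallel instance of its widest operator.
    """
    if job_parallelism < 1 or slots_per_tm < 1:
        raise ValueError("parallelism and slots per TaskManager must be >= 1")

    needed_slots = job_parallelism
    if max_slots is not None:
        # Assumption: an upper bound on slots caps what gets requested.
        needed_slots = min(needed_slots, max_slots)

    # Each TaskManager contributes slots_per_tm slots; round up.
    return math.ceil(needed_slots / slots_per_tm)


if __name__ == "__main__":
    print(estimate_taskmanagers(job_parallelism=12, slots_per_tm=4))
```

Under these assumptions, a parallelism of 12 with 4 slots per TaskManager gives 3 TaskManagers. Which machines those containers land on is still decided by Yarn, not by Flink.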

RE: How to specify the number of TaskManagers in Yarn Cluster using Per-Job Mode

2020-08-17 Thread 范超
Thanks Yangze 1. Did you run into any problems when deploying on Yarn or running the Flink job? My job works well. 2. Why do you need to start the TMs on all three machines? From the cluster's perspective, I wonder whether the processing load can be balanced across the 3 machines. 3. Flink can control how many TM to sta

Re: Re: [DISCUSS] FLIP-134: DataStream Semantics for Bounded Input

2020-08-17 Thread Yun Gao
Hi, Many thanks for bringing up this discussion! One more question: do the BATCH and STREAMING modes also decide the shuffle types and operators? I'm asking because even in blocking mode, it should also benefit from keeping some edges pipelined if the resources

Re: Re: [DISCUSS] Removing deprecated methods from DataStream API

2020-08-17 Thread Yun Gao
+1 for removing the methods that have been deprecated for a while & have alternative methods. One specific thing: if we remove DataStream#split, do we consider enabling side-output in more operators in the future? Currently it should be only available in ProcessFunctions, but not availabl

Flink checkpoint recovery time

2020-08-17 Thread Zhinan Cheng
Hi all, I am working on measuring the failure recovery time of Flink and I want to decompose the recovery time into different parts, say the time to detect the failure, the time to restart the job, and the time to restore from the checkpoint. I found that I can measure the downtime during failure
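
The decomposition described in this message can be sketched as simple timestamp arithmetic. The Python snippet below is only an illustration: the four timestamps and their names are hypothetical, and the thread does not say how to obtain them from a real Flink deployment.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RecoveryTimeline:
    """Hypothetical timestamps (in seconds) for one failure and recovery."""
    failure_at: float    # the failure actually happens
    detected_at: float   # the failure is noticed
    restarted_at: float  # the job has been restarted
    restored_at: float   # state is restored and processing resumes

    def __post_init__(self) -> None:
        points = [self.failure_at, self.detected_at,
                  self.restarted_at, self.restored_at]
        if points != sorted(points):
            raise ValueError("timestamps must be in chronological order")

    def breakdown(self) -> Dict[str, float]:
        """Split the total recovery time into its three phases."""
        return {
            "detection": self.detected_at - self.failure_at,
            "restart": self.restarted_at - self.detected_at,
            "restore": self.restored_at - self.restarted_at,
            "total": self.restored_at - self.failure_at,
        }


if __name__ == "__main__":
    timeline = RecoveryTimeline(100.0, 102.5, 110.0, 115.0)
    for phase, seconds in timeline.breakdown().items():
        print(f"{phase}: {seconds:.1f}s")
```

The three phases always add up to the total, so a measured end-to-end downtime can be checked against the individual phase measurements.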

RE: How to specify the number of TaskManagers in Yarn Cluster using Per-Job Mode

2020-08-17 Thread 范超
Thanks Yangze The reason I'm not deploying a standalone cluster is that Kafka, Kudu, Hadoop, and ZooKeeper are running on these machines, so using Yarn to manage resources is probably the best choice for me for now. If Flink cannot control how many TMs to start, could anyone provide me some b

Re: How to specify the number of TaskManagers in Yarn Cluster using Per-Job Mode

2020-08-17 Thread Yangze Guo
Hi, Flink can control how many TMs to start, but where to start the TMs depends on Yarn. Did you run into any problems when deploying on Yarn or running the Flink job? Why do you need to start the TMs on all three machines? Best, Yangze Guo On Tue, Aug 18, 2020 at 11:25 AM 范超 wrote: > > Thanks Yangze

Re: How to specify the number of TaskManagers in Yarn Cluster using Per-Job Mode

2020-08-17 Thread Yangze Guo
Hi, I think that is only related to the Yarn scheduling strategy. AFAIK, Flink cannot control it. You could check the RM log to figure out why it did not schedule the containers to all three machines. BTW, if you have specific requirements to start with all three machines, how about dep

RE: How to specify the number of TaskManagers in Yarn Cluster using Per-Job Mode

2020-08-17 Thread 范超
Thanks Yangze The NodeManager is started on all 3 machines. I just don't know why the three machines are not each running a Flink TaskManager, or how to achieve this -Original Message- From: Yangze Guo [mailto:karma...@gmail.com] Sent: Tuesday, August 18, 2020 10:10 To: 范超 Cc: user (user@flink.apache.org) Subject: Re: How to

Re: How to specify the number of TaskManagers in Yarn Cluster using Per-Job Mode

2020-08-17 Thread Yangze Guo
Hi, Did you start the NodeManager on all three machines? If so, could you check that all the NMs are correctly connected to the ResourceManager? Best, Yangze Guo On Tue, Aug 18, 2020 at 10:01 AM 范超 wrote: > > Hi, Dev and Users > I have 3 machines, each with 8 cores and 16GB of memory. > Following is my R

Re: Performance Flink streaming kafka consumer sink to s3

2020-08-17 Thread Vijayendra Yadav
Hi, Do you think there can be any issue with Flink's performance with payload record sizes from 400 KB up to 1 MB? My Spark streaming seems to be doing better. Are there any recommended configurations or increasing parallelism to improve Flink streaming using flink kafka connect? Regards, Vijay On F

How to specify the number of TaskManagers in Yarn Cluster using Per-Job Mode

2020-08-17 Thread 范超
Hi, Dev and Users I have 3 machines, each with 8 cores and 16GB of memory. Following is my Resource Manager screenshot; the cluster has 36GB in total. I set the parallelism to 3 or even up to 12, but the TaskManagers always run on two nodes, not all three machines; the third node does not start

Re: Re: How to get the evaluation result of a time-based window aggregation in time after a new event falling into the window?

2020-08-17 Thread Theo Diefenthal
Hi Chengcheng Zhang, I think your request is related to this feature request from two years ago here [1], with me asking about the status one year ago [2]. You might want to upvote this so we can hope that it gets some more attention in future. Today, it is possible to write your own DataStr

Re: [DISCUSS] Removing deprecated methods from DataStream API

2020-08-17 Thread Kostas Kloudas
+1 for removing them. From a quick look, most of them (not all) were deprecated a long time ago. Cheers, Kostas On Mon, Aug 17, 2020 at 9:37 PM Dawid Wysakowicz wrote: > > @David Yes, my idea was to remove any use of fold method and all related > classes including WindowedStream#fold > >

Re: [DISCUSS] Removing deprecated methods from DataStream API

2020-08-17 Thread Dawid Wysakowicz
@David Yes, my idea was to remove any use of the fold method and all related classes including WindowedStream#fold @Klou Good idea to also remove the deprecated enableCheckpointing() & StreamExecutionEnvironment#readFile and the like. I did another pass over some of the classes and thought we could also

Re: [DISCUSS] FLIP-134: DataStream Semantics for Bounded Input

2020-08-17 Thread Kostas Kloudas
Hi Kurt and David, Thanks a lot for the insightful feedback! @Kurt: For the topic of checkpointing with Batch Scheduling, I totally agree with you that it requires a lot more work and careful thinking on the semantics. This FLIP was written under the assumption that if the user wants to have chec

Re: coordination of sinks

2020-08-17 Thread Fabian Hueske
Hi Marco, You cannot really synchronize data that is being emitted via different streams (without bringing them together in an operator). I see two options: 1) emit the event to create the partition and the data to be written into the partition to the same stream. Flink guarantees that records d

RE: JobManager refusing connections when running many jobs in parallel?

2020-08-17 Thread Hailu, Andreas
Interesting – what is the JobManager submission bounded by? Does it only allow a certain number of submissions per second, or is there a number of threads it accepts? // ah From: Robert Metzger Sent: Tuesday, August 11, 2020 4:46 AM To: Hailu, Andreas [Engineering] Cc: user@flink.apache.org;

Re: [DISCUSS] FLIP-134: DataStream Semantics for Bounded Input

2020-08-17 Thread David Anderson
Kostas, I'm pleased to see some concrete details in this FLIP. I wonder if the current proposal goes far enough in the direction of recognizing the need some users may have for "batch" and "bounded streaming" to be treated differently. If I've understood it correctly, the section on scheduling al

Re: [DISCUSS] Removing deprecated methods from DataStream API

2020-08-17 Thread David Anderson
I assume that along with DataStream#fold you would also remove WindowedStream#fold. I'm in favor of going ahead with all of these. David On Mon, Aug 17, 2020 at 10:53 AM Dawid Wysakowicz wrote: > Hi devs and users, > > I wanted to ask you what do you think about removing some of the > deprecat

Re: [DISCUSS] Removing deprecated methods from DataStream API

2020-08-17 Thread Kostas Kloudas
Thanks a lot for starting this Dawid, Big +1 for the proposed clean-up, and I would also add the deprecated methods of the StreamExecutionEnvironment like: enableCheckpointing(long interval, CheckpointingMode mode, boolean force) enableCheckpointing() isForceCheckpointing() readFile(FileInputFor

[DISCUSS] Removing deprecated methods from DataStream API

2020-08-17 Thread Dawid Wysakowicz
Hi devs and users, I wanted to ask what you think about removing some of the deprecated APIs around the DataStream API. The APIs I have in mind are: * RuntimeContext#getAllAccumulators (deprecated in 0.10) * DataStream#fold and all related classes and methods such as FoldFunction,

Re: Tracing and Flink

2020-08-17 Thread bvarga
Hi Aaron, I've recently been looking at this topic and working on a prototype. The approach I am trying is "backward tracing", or data provenance tracing, where we try to explain what inputs and steps have affected the production of an output record. Arvid has summarized the most important aspect