Re: Questions regarding Key Managed state

2020-04-03 Thread Congxian Qiu
Hi Many keys can be in a single state(each state can have multiple key-group, and keys will be assigned to the right key-group) If you write a custom process function that uses a state you created, then there is only one user state in that instance(do not count the underlying state of Flink if th

Re: Correct way to e2e test a Flink application?

2020-04-03 Thread Laurent Exsteens
Hello Robert, thanks a lot, I'll sure have a look at it! Regards, Laurent. On Thu, 2 Apr 2020 at 10:54, Robert Metzger wrote: > Hey Laurent, > Flink developed an internal framework for executing end to end tests from > Java. Here's an example of such a test: > https://github.com/apache/flink/

Sync two DataStreams

2020-04-03 Thread Georgi Stoyanov
Hi, I want to implement a flow where the data from one stream is needed to validate data for second stream when the job is started without a savepoint or checkpoint. Both of them are reading from kafka. I want the data in the first one to be fully read and then to check the events from the s

Re: Anomaly detection Apache Flink

2020-04-03 Thread Marta Paes Moreira
Forgot to mention that you might also want to have a look into Flink CEP [1], Flink's library for Complex Event Processing. It allows you to define and detect event patterns over streams, which can come in pretty handy for anomaly detection. [1] https://ci.apache.org/projects/flink/flink-docs-sta

RE: Anomaly detection Apache Flink

2020-04-03 Thread Nienhuis, Ryan
I would also have a look at the random cut forest algorithm. This is the base algorithm that is used for anomaly detection in several AWS services (Quicksight, Kinesis Data Analytics, etc.). It doesn’t help with getting it working with Flink, but may be a good place to start for an algorithm. h

Re: Questions regarding Key Managed state

2020-04-03 Thread KristoffSC
Thank you for your answers. I have one more question. The Key Managed state for Keyed stream is per key or per operator? For example I have a keyed stream that is processed by MyProcessFunction with parallelism = 3. So I have three instances of MyProcessFuntion. The process function has a KeyMa

Re: [SURVEY] What Change Data Capture tools are you using?

2020-04-03 Thread Jark Wu
Thanks for the tips @Timo :) Best, Jark On Fri, 3 Apr 2020 at 22:11, Timo Walther wrote: > Hi Jark, > > thanks for sharing the results. I think for the databases usage part, > I'm pretty sure that we could also rely on some Gartner landscape > research. Just an idea. > > Thanks for performing t

Re: Confluent schema registry Kafka client for sql-client.sh

2020-04-03 Thread Jark Wu
Hi Ruurtjan, Thanks for reporting this. There is already an issue to track to support this feature! https://issues.apache.org/jira/browse/FLINK-16048 Currently, Flink SQL only supports JSON,CSV and standard AVRO format. The Kafka Avro format is not supported yet, but this's definitely in our roadm

Re: Parallisation of S3 write sink

2020-04-03 Thread David Magalhães
Thanks for your feedback Till. I think in this scenario the best approach is to go into the ThreadPool. On Fri, Apr 3, 2020 at 1:47 PM Till Rohrmann wrote: > Hi David, > > I assume that you have written your own TwoPhaseCommitSink which writes to > S3, right? If that is the case, then it is main

Re: [SURVEY] What Change Data Capture tools are you using?

2020-04-03 Thread Timo Walther
Hi Jark, thanks for sharing the results. I think for the databases usage part, I'm pretty sure that we could also rely on some Gartner landscape research. Just an idea. Thanks for performing the survey! Timo On 03.04.20 10:17, Jark Wu wrote: Hi everyone, Thanks all for the feedbacks! I wo

Confluent schema registry Kafka client for sql-client.sh

2020-04-03 Thread Ruurtjan Pul
Hi there, I'm not sure if this is possible, because it doesn't seem to be documented anywhere, so here we go... I've got a Kafka topic with Avro encoded records that have been produced with a Confluent schema registry Kafka client. Note that these Avro records are slightly different from vanilla

Re: Parallisation of S3 write sink

2020-04-03 Thread Till Rohrmann
Hi David, I assume that you have written your own TwoPhaseCommitSink which writes to S3, right? If that is the case, then it is mainly up to your implementation how it writes files to S3. If your S3 client supports uploading multiple files concurrently, then you should go for it. Async I/O won't

Re: Making job fail on Checkpoint Expired?

2020-04-03 Thread Robin Cassan
Hi Congxian, Thanks for confirming! The reason I want this behavior is because we are currently investigating issues with checkpoints that keep getting timeouts after the job has been running for a few hours. We observed that, after a few timeouts, if the job was being restarted because of a lost

Re: Anomaly detection Apache Flink

2020-04-03 Thread Marta Paes Moreira
Hi, Salvador. You can find some more examples of real-time anomaly detection with Flink in these presentations from Microsoft [1] and Salesforce [2] at Flink Forward. This blogpost [3] also describes how to build that kind of application using Kinesis Data Analytics (based on Flink). Let me know

Parallisation of S3 write sink

2020-04-03 Thread David Magalhães
I have a scenario where multiple small files need to be written on S3. I'm using TwoPhaseCommit sink since I have a specific scenario where I can't use StreamingFileSink. I've notice that because the way the S3 write is done (sequencially), the checkpoint is timining out (10 minutes), because it t

Re: subscribe

2020-04-03 Thread Till Rohrmann
Hi Giriraj, in order to subscribe to Flink's ML you have to send an email to user-subscr...@flink.apache.org. Cheers, Till On Fri, Apr 3, 2020 at 8:30 AM Giriraj Chauhan wrote: > >

Anomaly detection Apache Flink

2020-04-03 Thread Salvador Vigo
Hi there, I am working in an approach to make some experiments related with anomaly detection in real time with Apache Flink. I would like to know if there are already some open issues in the community. The only example I found was the one of Scott Kidder and the

Re: Making job fail on Checkpoint Expired?

2020-04-03 Thread Congxian Qiu
Currently, only checkpoint declined will be counted into `continuousFailureCounter`. Could you please share why do you want the job to fail when checkpoint expired? Best, Congxian Timo Walther 于2020年4月2日周四 下午11:23写道: > Hi Robin, > > this is a very good observation and maybe even unintended beh

Re: [SURVEY] What Change Data Capture tools are you using?

2020-04-03 Thread Jark Wu
Hi everyone, Thanks all for the feedbacks! I would like to share with you the survey report. I started an English survey and a Chinese survey in user@ and user-zh@ mailing list. The results are big different. *English survey:* - Top3 CDC tools: Debezium (66.7%), Oracle GoldenGate (15.6%), Maxw

Re: Issue with single job yarn flink cluster HA

2020-04-03 Thread Dinesh J
Hi Andrey, Sure We will try to use Flink 1.10 to see if HA issues we are facing is fixed and update in this thread. Thanks, Dinesh On Thu, Apr 2, 2020 at 3:22 PM Andrey Zagrebin wrote: > Hi Dinesh, > > Thanks for sharing the logs. There were couple of HA fixes since 1.7, e.g. > [1] and [2]. > I