Re: Flink Kafka EXACTLY_ONCE causing KafkaException ByteArraySerializer is not an instance of Serializer

2020-07-30 Thread Qingsheng Ren
Hi Vikash, It's a bug about classloader used in `abortTransaction()` method in `FlinkKafkaProducer`, Flink version 1.10.0. I think it has been fixed in 1.10.1 and 1.11 according to FLINK-16262. Are you using Flink version 1.10.0? Vikash Dat 于2020年7月30日周四 下午9:26写道: > Has anyone had success with

Re: Unable to recover from checkpoint

2020-07-30 Thread Congxian Qiu
Hi Sivaprasanna For RocksDBStateBackend incremental checkpoint, the latest checkpoint may contain the files of the previous checkpoint(the files in the shared directory), so delete the files belong to the previous checkpoint may lead to FileNotFoundException. Currently, we can only parse the

Re: [DISCUSS] FLIP-133: Rework PyFlink Documentation

2020-07-30 Thread Xingbo Huang
Hi Jincheng, Thanks a lot for bringing up this discussion and the proposal. Big +1 for improving the structure of PyFlink doc. It will be very friendly to give PyFlink users a unified entrance to learn PyFlink documents. Best, Xingbo Dian Fu 于2020年7月31日周五 上午11:00写道: > Hi Jincheng, > > Thanks

Re: [DISCUSS] FLIP-133: Rework PyFlink Documentation

2020-07-30 Thread Dian Fu
Hi Jincheng, Thanks a lot for bringing up this discussion and the proposal. +1 to improve the Python API doc. I have received many feedbacks from PyFlink beginners about the PyFlink doc, e.g. the materials are too few, the Python doc is mixed with the Java doc and it's not easy to find the doc

is it possible one task manager stuck and still fetching data from Kinesis?

2020-07-30 Thread Terry Chia-Wei Wu
We are running Flink 1.10 about 900+ task managers with kinesis as an input stream. The problem we are having now is that only Max Age of kinesis shard is growing and the average age of that kinesis is very low meaning most of shards having very low age. We already checked the data skew issue but i

Re: Does Flink automatically apply any backpressure ?

2020-07-30 Thread Jake
Hi Suraj Puvvada Yes, Flink back pressure depend on the Flink task buffer。process task will sends buffer remaining size to source, source will slow down. https://www.ververica.com/blog/how-flink-handles-backpressure Jake > On J

[DISCUSS] FLIP-133: Rework PyFlink Documentation

2020-07-30 Thread jincheng sun
Hi folks, Since the release of Flink 1.11, users of PyFlink have continued to grow. As far as I know there are many companies have used PyFlink for data analysis, operation and maintenance monitoring business has been put into production(Such as 聚美优品[1](Jumei), 浙江墨芷[2] (Mozhi) etc.). According t

Re: Colocating Compute

2020-07-30 Thread Satyam Shekhar
Hi Dawid, I am currently on Flink v1.10. Do streaming pipelines support unbounded InputFormat in v1.10? My current setup uses SourceFunction for streaming pipeline and InputFormat for batch queries. I see the documentation for Flink v1.11 describe concepts for Split, SourceReader, and SplitEnumer

Re: [Compression] Flink DataStream[GenericRecord] Avrowriters

2020-07-30 Thread Vijayendra Yadav
Hi Ravi, Perfect, This is looking good. Thanks for your help. Regards, Vijay On Thu, Jul 30, 2020 at 5:39 AM Ravi Bhushan Ratnakar < ravibhushanratna...@gmail.com> wrote: > Hi Vijayendra, > > There is an issue with the CustomeAvroWriters.java which i shared with you > earlier, i am sending you

StreamingFileSink: any risk parallelizing active buckets checkpointing?

2020-07-30 Thread Paul Bernier
Hi experts, I am trying to use S3 StreamingFileSink with a high number of active buckets (>1000). I found that checkpointing duration will grow linearly with the number of active buckets, which makes achieving high number of active buckets difficult. One reason for that is the each active bucke

Re: Count of records coming into the Flink App from KDS and at each step through Execution Graph

2020-07-30 Thread Chesnay Schepler
If you do the aggregation in Prometheus I would think that you do not need to reset the counter; but it's been a while since I've used it. Flink will not automatically reset counters. If this is necessary then you will have to manually reset the counter every 5 seconds. The name under which it

Re: Count of records coming into the Flink App from KDS and at each step through Execution Graph

2020-07-30 Thread Vijay Balakrishnan
Hi David, Thx for your reply. To summarize: Use a Counter: counter = getRuntimeContext() .getMetricGroup() .addGroup("MyMetricsKey", "MyMetricsValue") //specify my value for each custom event_name here- I might not know all custom event_names in advance .counter("myCounter"); This MyMetric

Does Flink automatically apply the rebalance operator ?

2020-07-30 Thread Suraj Puvvada
Hello We are testing a simple use case where we read from kafka -> process and write to kafka. I have set parallelism of the job to 3 and parallelism for the process function to 6. When I looked at the job graph in the Flink UI noticed that between the source and process function a rebalance step

Does Flink automatically apply any backpressure ?

2020-07-30 Thread Suraj Puvvada
Hello I am trying to understand if Flink has a mechanism to automatically apply any backpressure by throttling any operators ? For example if I have a Process function that reads from a Kafkaa source and writes to a Kafka sink. If the process function is slow will the kafka source be automatically

Count of records in the Stream for a time window of 5s

2020-07-30 Thread Vijay Balakrishnan
Hi, Trying to get a count of records in the Stream for a time window of 5s. Always getting a count of 1 ?? Sent in 10 records.Expect the count to be 10 at the end. Tried to follow the advise here from Fabian Hueske- https://stackoverflow.com/questions/45606999/how-to-count-the-number-of-records-p

BucketingSink & StreamingFileSink

2020-07-30 Thread Mariano González Núñez
Hi Flink Team, I'm Mariano & I'm working with Apache Flink to process data and sink from Kafka to Azure Datalake (ADLS Gen1). We are having problems with the sink in parquet format in the ADLS Gen1, also don't work with the gen2. We try to do the StreamingFileSink to store in parquet but we can't

Re: [DISCUSS] FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API)

2020-07-30 Thread Flavio Pompermaier
We use runCustomOperation to group a set of operators and into a single functional unit, just to make the code more modular.. It's very comfortable indeed. On Thu, Jul 30, 2020 at 5:20 PM Aljoscha Krettek wrote: > That is good input! I was not aware that anyone was actually using > `runCustomOpe

Re: [DISCUSS] FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API)

2020-07-30 Thread Aljoscha Krettek
That is good input! I was not aware that anyone was actually using `runCustomOperation()`. Out of curiosity, what are you using that for? We have definitely thought about the first two points you mentioned, though. Especially processing-time will make it tricky to define unified execution sema

Re: [DISCUSS] FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API)

2020-07-30 Thread Flavio Pompermaier
I just wanted to be propositive about missing api.. :D On Thu, Jul 30, 2020 at 4:29 PM Seth Wiesman wrote: > +1 Its time to drop DataSet > > Flavio, those issues are expected. This FLIP isn't just to drop DataSet > but to also add the necessary enhancements to DataStream such that it works > wel

user@flink.apache.org

2020-07-30 Thread Sofya T. Irwin
Hi, I'm trying to investigate a SQL job using a time-windowed join that is exhibiting a large, growing state. The join syntax is most similar to the interval join ( https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/joins.html ). A few questions: 1. Am I correct in understa

Re: Flink state reconciliation

2020-07-30 Thread Seth Wiesman
That is doable via the state processor API, though Arvid's idea does sound simpler :) You could read the operator with the rules, change the data as necessary and then rewrite it out as a new savepoint to start the job. On Thu, Jul 30, 2020 at 5:24 AM Arvid Heise wrote: > Another idea: since y

Re: [DISCUSS] FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API)

2020-07-30 Thread Seth Wiesman
+1 Its time to drop DataSet Flavio, those issues are expected. This FLIP isn't just to drop DataSet but to also add the necessary enhancements to DataStream such that it works well on bounded input. On Thu, Jul 30, 2020 at 8:49 AM Flavio Pompermaier wrote: > Just to contribute to the discussion

Flink streaming job logging reserves space

2020-07-30 Thread Maxim Parkachov
Hi everyone, I have a strange issue with flink logging. I use pretty much standard log4 config, which is writing to standard output in order to see it in Flink GUI. Deployment is on YARN with job mode. I can see logs in UI, no problem. On the servers, where Flink YARN containers are running, there

Re: [DISCUSS] FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API)

2020-07-30 Thread Flavio Pompermaier
Just to contribute to the discussion, when we tried to do the migration we faced some problems that could make migration quite difficult. 1 - It's difficult to test because of https://issues.apache.org/jira/browse/FLINK-18647 2 - missing mapPartition 3 - missing DataSet runOperation(CustomUnaryOp

Re: Customization of execution environment

2020-07-30 Thread Flavio Pompermaier
That's fine and it's basically what I do as well..I was arguing that it's bad (IMHO) that you could access the config from the BatchTableEnvironment (via bte.getConfig().getConfiguration()). You legitimately think that you are customizing the env but that's illusory. You should not be able to set p

Re: Flink Kafka EXACTLY_ONCE causing KafkaException ByteArraySerializer is not an instance of Serializer

2020-07-30 Thread Vikash Dat
Has anyone had success with using exactly_once in a kafka producer in flink? As of right now I don't think the code shown in the docs (https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html#kafka-producer) actually works. -- Sent from: http://apache-flink-user-mailing-l

Re: [DISCUSS] FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API)

2020-07-30 Thread Till Rohrmann
+1 for this effort. Great to see that we are making progress towards our goal of a truly unified batch and stream processing engine. Cheers, Till On Thu, Jul 30, 2020 at 2:28 PM Kurt Young wrote: > +1, looking forward to the follow up FLIPs. > > Best, > Kurt > > > On Thu, Jul 30, 2020 at 6:40 P

Re: Colocating Compute

2020-07-30 Thread Dawid Wysakowicz
Hi Satyam, I think you can use the InputSplitAssigner also for streaming pipelines through an InputFormat. You can use StreamExecutionEnvironment#createInput or for SQL you can write your source according to the documentation here: https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/t

Unable to recover from checkpoint

2020-07-30 Thread Sivaprasanna
Hello, We recently ran into an unexpected scenario. Our stateful streaming pipeline uses RocksDB as the backend and has incremental checkpointing enabled. We have RETAIN_ON_CANCELATION enabled so some of the previous cancellation and restarts had left a lot of unattended checkpoint directories whi

Re: [Compression] Flink DataStream[GenericRecord] Avrowriters

2020-07-30 Thread Ravi Bhushan Ratnakar
Hi Vijayendra, There is an issue with the CustomeAvroWriters.java which i shared with you earlier, i am sending you the fixed version, hope this will resolve the issue of reading it from the avro tool. Please use below supported possible string value for codecName null - for nullCodec deflate -

Re: [DISCUSS] FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API)

2020-07-30 Thread Kurt Young
+1, looking forward to the follow up FLIPs. Best, Kurt On Thu, Jul 30, 2020 at 6:40 PM Arvid Heise wrote: > +1 of getting rid of the DataSet API. Is DataStream#iterate already > superseding DataSet iterations or would that also need to be accounted for? > > In general, all surviving APIs shoul

Re: [DISCUSS] FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API)

2020-07-30 Thread Arvid Heise
+1 of getting rid of the DataSet API. Is DataStream#iterate already superseding DataSet iterations or would that also need to be accounted for? In general, all surviving APIs should also offer a smooth experience for switching back and forth. On Thu, Jul 30, 2020 at 9:39 AM Márton Balassi wrote:

Re: Flink state reconciliation

2020-07-30 Thread Arvid Heise
Another idea: since your handling on Flink is idempotent, would it make sense to also periodically send the whole rule set anew? Going further, depending on the number of rules, their size, and the update frequency. Would it be possible to always transfer the complete rule set and just discard the

Re: Customization of execution environment

2020-07-30 Thread Arvid Heise
I'm not entirely sure, if I completely understand the interaction of BTE and ExecEnv, but I'd create it this way Configuration conf = new Configuration(); conf.setInteger(TaskManagerOptions.NUM_TASK_SLOTS, PARALLELISM); ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment(conf);

Re: State Restoration issue with flink 1.10.1

2020-07-30 Thread Yun Tang
Hi I compared the implementation of CepOperator between Flink-1.10.1 and Flink-1.8.2, however they should behave the same as code for map state does not change much. The error you meet might be caused by the change of inputSerializer [1], could you check whether you have introduced any differen

Re: Unable to submit high parallelism job in cluster

2020-07-30 Thread Annemarie Burger
Hi! The problem was indeed a exponentially slow subtask that related to the parallelism, all working now, thanks! -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Unable to submit high parallelism job in cluster

2020-07-30 Thread Arvid Heise
Hi Annemarie, could you please share your topology? If you have a shuffle, your job needs 2 slots per parallelism. So you'd only be able to scale up to 384/2. On Tue, Jul 28, 2020 at 6:32 PM Robert Metzger wrote: > Ah, the good old cloud-11 cluster at DIMA. I used that one as well in 2014 > to

Re: How to stream CSV from S3?

2020-07-30 Thread Arvid Heise
Hi John, I found an example on SO [1] in Scala. [1] https://stackoverflow.com/a/52093079/10299342 On Tue, Jul 28, 2020 at 4:29 PM John Smith wrote: > Hi, is there an example on how RowCsvInputFormat is initialized? > > On Tue, 28 Jul 2020 at 04:00, Jingsong Li wrote: > >> - `env.readCsvFile`

Re: Recommended pattern for implementing a DLQ with Flink+Kafka

2020-07-30 Thread Arvid Heise
Hi Tom, using side outputs is actually the established Flink pattern in that regard. The advantage of side output is that you do not depend on the DLQ concept of the source system, which is incredibly useful if you read from multiple systems. Most commonly, the side-output is then outputted to an

Re: Handle idle kafka source in Flink 1.9

2020-07-30 Thread Arvid Heise
Hi Hemant, sorry for the late reply. You can just create your own watermark assigner and either copy the assigner from Flink 1.11 or take the one that we use in our trainings [1]. [1] https://github.com/ververica/flink-training-troubleshooting/blob/master/src/main/java/com/ververica/flinktrainin

Re: How to retain the column'name when convert a Table to DataStream

2020-07-30 Thread Dawid Wysakowicz
Hi, I am afraid you are facing an issue that was not checked for/was not considered. I think your use case is absolutely valid and should be supported. The problem you are facing as far as I can tell from an initial investigation is that the top-level projection/rename is not being applied. Inter

Re: [DISCUSS] FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API)

2020-07-30 Thread Márton Balassi
Hi All, Thanks for the write up and starting the discussion. I am in favor of unifying the APIs the way described in the FLIP and deprecating the DataSet API. I am looking forward to the detailed discussion of the changes necessary. Best, Marton On Wed, Jul 29, 2020 at 12:46 PM Aljoscha Krettek