Re: Exact-once processing when a job fails

2022-01-05 Thread Caizhi Weng
Hi! Also note that although this eventual consistency seems not good enough, but for 99.99% of the time the job can run smoothly without failure. In this case the records are correct and good. Only in the 0.01% case when the job fails will user see inconsistency for a small period of time (for a c

Converting parquet MessageType to flink RowType

2022-01-05 Thread Meghajit Mazumdar
Hello, We want to read and process Parquet Files using a FileSource and the DataStream API. Currently, as referenced from the documentation

Re: Flink native k8s integration vs. operator

2022-01-05 Thread Thomas Weise
Hi David, Thank you for the reply and context! As for workload types and where native integration might fit: I think that any k8s native solution that satisfies category 3) can also take care of 1) and 2) while the native integration by itself can't achieve that. Existence of [1] might serve as f

Re: Exact-once processing when a job fails

2022-01-05 Thread Caizhi Weng
Hi! Flink guarantees *eventual* consistency for systems without transactions (by transaction I mean a system supporting writing a few records then commit), or with transactions but users prefer latency than consistency. That is to say, everything produced by Flink before a checkpoint is "not secur

Re: extending RichSinkFunction doesn't force to implement any of its methods

2022-01-05 Thread Caizhi Weng
Hi! This is because ProcessFunction#processElement is a must while all methods in SinkFunction are not mandatory (for example you can create a sink which just discards all records by directly implementing SinkFunction). However if you want your sink to be more useful you'll have to see which metho

Re: StaticFileSplitEnumerator - keeping track 9f already processed files

2022-01-05 Thread Caizhi Weng
Hi! Do you mean the pathsAlreadyProcessed set in ContinuousFileSplitEnumerator? This is because ContinuousFileSplitEnumerator has to continuously add new files to splitAssigner, while StaticFileSplitEnumerator does not. The pathsAlreadyProcessed set records the paths already discovered by Continu

Re: Metaspace OOM : class loaders not being GC

2022-01-05 Thread Caizhi Weng
Hi! As far as I remember this is a known issue a few years ago but Flink currently has no solution to this (correct me if I'm wrong). I see that you're running jobs on a yarn session. Could you switch to yarn-per-job mode (where JM and TMs are created and destroyed for each job) for a workaround?

Re: Passing msg and record to the process function

2022-01-05 Thread Caizhi Weng
Hi! The last expression in your try block is if(validationMessages.isEmpty) { (parsedJson.toString(), validationMessages.foreach((msg=>msg.getMessage.toString))) } else { (parsedJson.toString(), "Format is correct...") } The first one produces a (String, Unit) type while the second one produ

StaticFileSplitEnumerator - keeping track 9f already processed files

2022-01-05 Thread Krzysztof Chmielewski
Hi, Why StaticFileSplitEnumerator from FileSource does not track the already processed files similar to how ContinuousFileSplitEnumerator does? I'm thinking about scenario where we have a Bounded FileSource that reads a lot of files using FileSource and stream it's content to Kafka.If there will b

extending RichSinkFunction doesn't force to implement any of its methods

2022-01-05 Thread Siddhesh Kalgaonkar
I have implemented a Cassandra sink and when I am trying to call it from another class via DataStream it is not calling any of the methods. I tried extending other interfaces like ProcessFunction and it is forcing me to implement its methods whereas. when it comes to RichSinkFunction it doesn't for

Passing msg and record to the process function

2022-01-05 Thread Siddhesh Kalgaonkar
I have written a process function where I am parsing the JSON and if it is not according to the expected format it passes as Failure to the process function and I print the records which are working fine. Now, I was trying to print the message and the record in case of Success and Failure. I implem

Re: [DISCUSS] Deprecate MapR FS

2022-01-05 Thread Till Rohrmann
+1 for dropping the MapR FS. Cheers, Till On Wed, Jan 5, 2022 at 10:11 AM Martijn Visser wrote: > Hi everyone, > > Thanks for your input. I've checked the MapR implementation and it has no > annotation at all. Given the circumstances that we thought that MapR was > already dropped, I would prop

Re: [DISCUSS] Deprecate MapR FS

2022-01-05 Thread Martijn Visser
Hi everyone, Thanks for your input. I've checked the MapR implementation and it has no annotation at all. Given the circumstances that we thought that MapR was already dropped, I would propose to immediately remove MapR in Flink 1.15 instead of first marking it as deprecated and removing it in Fli

Re: S3 server side encryption using FileSink

2022-01-05 Thread David Morávek
Hi James, I'm not an expert on s3, but in general this should be a matter of configuring the s3 filesystem implementation that Flink is using (that's what ends up writing the actual files to s3). Flink currently comes with the Hadoop & Presto (also kind of Hadoop based) based implementations. Loo

Re: Pod Disruption in Flink Kubernetes Cluster

2022-01-05 Thread David Morávek
Hi Tianyi, this really depends on your kubernetes setup (eg. if autoscaling is enabled, you're using spot / preemtible instances). In general applications that run on Kubernetes needs be resilient to these kind of failures, Flink is no exception. In case of the failure, Flink needs to restart the

Re: [DISCUSS] Change some default config values of blocking shuffle

2022-01-05 Thread Yun Gao
Very thanks @Yingjie for completing the experiments! Also +1 for changing the default config values. From the experiments, Changing the default config values would largely increase the open box experience of the flink batch, thus it seems worth changing from my side even if it would cause some c