Re: Why do we need a Kubernetes Flink operator again?

2021-10-28 Thread Vijay Bhaskar
yaml and is more friendly to K8s users. > The flink-k8s-operator could integrate with standalone mode [1], but also native K8s mode [2]. > [1] https://github.com/GoogleCloudPlatform/flink-on-k8s-operator > [2] https://github.com/wangyang0918/flink-native-k8s-operator

Re: High availability data clean up

2021-10-21 Thread Vijay Bhaskar
In HA mode the ConfigMaps will be retained after deletion of the deployment: https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/ha/kubernetes_ha/ (refer to the "High availability data clean up" section). On Fri, Oct 22, 2021 at 8:13 AM Yangze Guo wrote: > For application mode, when the jo
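For reference, a minimal sketch of the manual cleanup that section describes. The label selector below is what Flink's Kubernetes HA services are documented to attach to their ConfigMaps, and <cluster-id> is a placeholder for your deployment's cluster id, so verify the labels on your own ConfigMaps first:

    # List the HA ConfigMaps that survive deletion of the deployment (labels assumed, verify first).
    kubectl get configmap -l app=<cluster-id>,configmap-type=high-availability

    # Delete them once the job's HA metadata is no longer needed.
    kubectl delete configmap -l app=<cluster-id>,configmap-type=high-availability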

Re: Why do we need a Kubernetes Flink operator again?

2021-10-21 Thread Vijay Bhaskar
Understood, we have the Kubernetes HA configuration where we specify s3:// or hdfs:// persistent storage, as mentioned here: https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/ha/kubernetes_ha/ Regards Bhaskar On Fri, Oct 22, 2021 at 10:47 AM Vijay Bhaskar wrote: >

Why do we need a Kubernetes Flink operator again?

2021-10-21 Thread Vijay Bhaskar
All, I had used Flink up to last year, on Flink 1.9. At that time we built our own cluster using ZooKeeper and monitored jobs ourselves. Now I am revisiting different applications and found that the community has come up with this native Kubernetes deployment: https://ci.apache.org/projects/flink/flink-docs-mast

Catching SIGINT With flink Jobs

2021-10-16 Thread Vijay Bhaskar
Can we register any method in a custom source and custom sink to catch SIGINT? (I know that we need to follow the procedure to stop a Flink job by saving the state.) Regards Bhaskar

Re: Checkpoint size increasing even if I enable incremental checkpoint

2021-10-12 Thread Vijay Bhaskar
Since the state size is small, you can try the FsStateBackend rather than RocksDB and check once. The rule of thumb is: only if the FsStateBackend performs worse is RocksDB the better choice. Regards Bhaskar On Tue, Oct 12, 2021 at 1:47 PM Yun Tang wrote: > Hi Lei, > > RocksDB state-backend's checkpoint is composited b
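As a rough sketch of how to switch between the two backends for such a comparison (class names as used in the Flink releases of that era, FsStateBackend and RocksDBStateBackend; newer releases rename these, and the checkpoint URI is a placeholder):

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class StateBackendChoice {
        public static void configure(StreamExecutionEnvironment env) throws Exception {
            // Small state: keep it on the heap, checkpoint to a durable filesystem.
            env.setStateBackend(new FsStateBackend("s3://my-bucket/checkpoints"));

            // Large state (tens of GBs): switch to RocksDB with incremental checkpoints.
            // env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/checkpoints", true));
        }
    }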

Re: Flink RocksDB Performance

2021-07-16 Thread Vijay Bhaskar
Yes, absolutely. Unless you need a very large state, on the order of many GBs, RocksDB is not required. RocksDB is only better because the filesystem backend is very bad at large state. In other words, the filesystem backend performs much better than RocksDB up to a few GBs; after that it degrades compared to RocksDB. It's not that

Flink FsStateBackend is giving better performance than RocksDB

2020-07-17 Thread Vijay Bhaskar
Hi, While doing scale testing we observed that the FsStateBackend is outperforming RocksDB. When using RocksDB, off-heap memory keeps growing over a period of time and after a day the pod got terminated with an OOM. Whereas with the same data pattern the FsStateBackend has been running for days without any memory spike an

Re: Trouble with large state

2020-06-22 Thread Vijay Bhaskar
> } > isSkippedElement = false; > evictingWindowState.setCurrentNamespace(window); > evictingWindowState.add(element); >

Re: Trouble with large state

2020-06-19 Thread Vijay Bhaskar
h zero. > >> > >> The application I am replacing has a latency of 36-48 hours, so if I > >> had > >> to fully stop processing to take every snapshot synchronously, it > >> might > >> be seen as totally acceptable, especially for initial bo

Re: Trouble with large state

2020-06-18 Thread Vijay Bhaskar
by much. > > On first glance, the code change to allow RocksDBStateBackend into a > synchronous snapshots mode looks pretty easy. Nevertheless, I was > hoping to do the initial launch of my application without needing to > modify the framework. > > Regards, > > > Jeff He

Re: Trouble with large state

2020-06-18 Thread Vijay Bhaskar
This seems to be an IO bottleneck at your task manager. I have a couple of queries: 1. What's your checkpoint interval? 2. How frequently are you updating the state in RocksDB? 3. How many task managers are you using? 4. How much data does each task manager handle while taking the checkpoint?
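To make the questions concrete, a minimal sketch of the knobs involved (checkpoint interval, minimum pause, timeout, incremental RocksDB checkpoints); the numbers are placeholders to tune, not recommendations:

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointTuning {
        public static void configure(StreamExecutionEnvironment env) throws Exception {
            env.enableCheckpointing(120_000);                                 // checkpoint every 2 minutes
            env.getCheckpointConfig().setMinPauseBetweenCheckpoints(60_000);  // let IO recover between checkpoints
            env.getCheckpointConfig().setCheckpointTimeout(600_000);          // fail slow checkpoints after 10 minutes
            // Incremental checkpoints upload only the newly created RocksDB SST files.
            env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));
        }
    }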

Issue with job status

2020-06-18 Thread Vijay Bhaskar
Hi, I am using Flink 1.9 and facing the issue below. Suppose I have deployed a job and there are not enough slots; the job is then stuck waiting for slots, but the Flink job status shows it as "RUNNING" when actually it is not. To me this looks like a bug. It impacts our production whi

Re: Flink not restoring from checkpoint when job manager fails even with HA

2020-06-08 Thread Vijay Bhaskar
special job id. > If you stick to using this, all runs of this JobGraph would have the same > job id. > Best > Yun Tang > From: Vijay Bhaskar > Sent: Monday, June 8, 2020 12:42 > To: Yun Tang > Cc: Kathula, Sandeep ; us

Re: Flink not restoring from checkpoint when job manager fails even with HA

2020-06-07 Thread Vijay Bhaskar
Hi Yun, If we start using the special Job ID and redeploy the job, then after deployment will it be assigned the special Job ID or a new Job ID? Regards Bhaskar On Mon, Jun 8, 2020 at 9:33 AM Yun Tang wrote: > Hi Sandeep > > In general, Flink assigns a unique job-id to each job and use

Re: Inconsistent Checkpoint API response

2020-05-27 Thread Vijay Bhaskar
Created a JIRA for it: https://issues.apache.org/jira/browse/FLINK-17966 Regards Bhaskar On Wed, May 27, 2020 at 1:28 PM Vijay Bhaskar wrote: > Thanks Yun. In that case it would be good to add a reference to that > documentation in the Flink REST API docs: > https://ci.apache.org/proje

Re: Inconsistent Checkpoint API response

2020-05-27 Thread Vijay Bhaskar
/blob/master/docs/monitoring/checkpoint_monitoring.zh.md > Best > Yun Tang > From: Vijay Bhaskar > Sent: Tuesday, May 26, 2020 15:18 > To: Yun Tang > Cc: user > Subject: Re: Inconsistent Checkpoint API response > Thank

Question on Job Restart strategy

2020-05-26 Thread Vijay Bhaskar
Hi, We are using the fixed-delay restart strategy. I have a fundamental question: why is the restart counter not reset to zero after a streaming job restart is successful? Let's say my maximum number of restarts is 5. My streaming job tried 2 times and the 3rd attempt is successful; why is the counter still 2 and not ze
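For context, a sketch of the two strategies involved: with fixed-delay the attempt count is, as I understand it, accumulated over the life of the job, while the failure-rate strategy only counts failures inside a sliding time window, which behaves closer to a counter that "resets" after a quiet period. Values below are placeholders:

    import java.util.concurrent.TimeUnit;
    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RestartPolicy {
        public static void configure(StreamExecutionEnvironment env) {
            // Fixed delay: at most 5 restart attempts in total, 10 seconds apart.
            env.setRestartStrategy(
                    RestartStrategies.fixedDelayRestart(5, Time.of(10, TimeUnit.SECONDS)));

            // Failure rate: give up only if 5 failures happen within any 5-minute window.
            // env.setRestartStrategy(RestartStrategies.failureRateRestart(
            //         5, Time.of(5, TimeUnit.MINUTES), Time.of(10, TimeUnit.SECONDS)));
        }
    }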

Re: Inconsistent Checkpoint API response

2020-05-26 Thread Vijay Bhaskar
checkpoint, they are actually not the same thing. > The only scenario where they're the same in number is when Flink has just > restored successfully before a new checkpoint completes. > Best > Yun Tang > From: Vijay Bhaska

Re: Inconsistent Checkpoint API response

2020-05-25 Thread Vijay Bhaskar
b.com/apache/flink/blob/8f992e8e868b846cf7fe8de23923358fc6b50721/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1250 > Best > Yun Tang > From: Vijay Bhaskar > Sent: Monday, May 25, 2020 17:01 > To:

Re: Inconsistent Checkpoint API response

2020-05-25 Thread Vijay Bhaskar
> savepoint/checkpoint is restored. [1] > [1] https://github.com/apache/flink/blob/50253c6b89e3c92cac23edda6556770a63643c90/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1285 > Best > Yun Tang >

Inconsistent Checkpoint API response

2020-05-24 Thread Vijay Bhaskar
Hi, I am using Flink retained checkpoints along with the jobs/:jobid/checkpoints API for retrieving the latest retained checkpoint. Regarding the response of the Flink checkpoints API: my job's restart attempts are 5, and in the checkpoint API response, under the "latest" key, the checkpoint file name of both "rest
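For reference, a sketch of the call in question (host, port and job id are placeholders; the jq paths are illustrative of the response shape, in which the "latest" object carries separate "completed", "savepoint" and "restored" entries that need not point at the same checkpoint):

    # Fetch checkpoint statistics for one job from the JobManager REST API.
    curl -s http://<jobmanager>:8081/jobs/<job-id>/checkpoints \
      | jq '{completed: .latest.completed.external_path, restored: .latest.restored.external_path}'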

Re: Stop job with savepoint during graceful shutdown on a k8s cluster

2020-03-16 Thread Vijay Bhaskar
For point (1) above: it's up to the user to choose a proper source and sink to get exactly-once semantics, as per the documentation: https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/guarantees.html If we choose the supported source and sink combinations, duplicates will be avoi

Re: Stop job with savepoint during graceful shutdown on a k8s cluster

2020-03-13 Thread Vijay Bhaskar
Please find answers inline. Our understanding is that to stop a job with a savepoint, all the task managers will persist their state during the savepoint. If a task manager receives a shutdown signal while the savepoint is being taken, does it complete the savepoint before shutdown? [Ans] Why would the task manager be shutd

Re: Start flink job from the latest checkpoint programmatically

2020-03-13 Thread Vijay Bhaskar
There are 2 things you can do. Stopping the Flink job is going to generate a savepoint. You need to save the savepoint directory path in some persistent store (because you are restarting the cluster; otherwise the checkpoint monitoring API would give you the savepoint file details). After spinning up the cluster, read the path
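A sketch of that flow with the Flink CLI (bucket, job id and jar name are placeholders); the savepoint path printed by the stop command is the value to persist:

    # Stop the job with a savepoint; the CLI prints the resulting savepoint path.
    flink stop -p s3://my-bucket/savepoints <job-id>

    # After the new cluster is up, resume from the stored path.
    flink run -s s3://my-bucket/savepoints/savepoint-xxxx-yyyy my-job.jar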

Re: JobGraphs not cleaned up in HA mode

2019-11-28 Thread Vijay Bhaskar
> -- Original Message -- > From: "vino yang"; > Sent: Thursday, November 28, 2019, 7:17 PM > To: "曾祥才"; > Cc: "Vijay Bhaskar"; "User-Flink" <user@flink.apache.org>; > Subject: Re: JobGraphs not cleaned up in HA mode > Hi, >

Re: JobGraphs not cleaned up in HA mode

2019-11-27 Thread Vijay Bhaskar
-- > From: "Vijay Bhaskar"; > Sent: Thursday, November 28, 2019, 3:05 PM > To: "曾祥才"; > Subject: Re: JobGraphs not cleaned up in HA mode > Again it could not find the state store file: "Caused by: > java.io.FileNotFoundException: /flink/ha/submittedJob

Re: JobGraphs not cleaned up in HA mode

2019-11-27 Thread Vijay Bhaskar
is an external persistent store (a NAS directory mounted > to the job manager) > -- Original Message -- > From: "Vijay Bhaskar"; > Sent: Thursday, November 28, 2019, 2:29 PM > To: "曾祥才"; > Cc: "user"; > Subject: Re: JobGraphs n

Re: JobGraphs not cleaned up in HA mode

2019-11-27 Thread Vijay Bhaskar
The following are the mandatory conditions to run in HA: a) You should have a persistent common external store for the job manager and task managers to write the state to. b) You should have a persistent external store for ZooKeeper to store the JobGraph. ZooKeeper refers to the path /flink/checkpoints/su
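A minimal flink-conf.yaml sketch covering both conditions (the ZooKeeper quorum, cluster id and storage paths are placeholders for your environment):

    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
    high-availability.zookeeper.path.root: /flink
    high-availability.cluster-id: /my-cluster
    # (a) shared, persistent store for JobManager/TaskManager state and HA metadata
    high-availability.storageDir: hdfs:///flink/ha/
    # checkpoints also need a shared, persistent location
    state.checkpoints.dir: hdfs:///flink/checkpoints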

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-10-14 Thread Vijay Bhaskar
d by the community. > Cheers, > Till > On Fri, Oct 11, 2019 at 11:39 AM Vijay Bhaskar wrote: >> Apart from these we have another environment, and there the checkpoint worked >> fine in HA mode with a complete cluster restart. But one of the jobs we are >> se

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-10-11 Thread Vijay Bhaskar
using flink 1.6.2 in production. Is this an issue already known before and fixed recently? Regards Bhaskar On Fri, Oct 11, 2019 at 2:08 PM Vijay Bhaskar wrote: > We saw the below logs in production some time ago; after that we stopped > HA. Do you people think HA is enabled properly fr

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-10-11 Thread Vijay Bhaskar
u want to change your job topology, just follow > the general rule to restore from savepoint/checkpoint; do not rely on HA to > do job migration things. > Best > Yun Tang > From: Hao Sun > Sent: Friday, October 11, 2019 8:33 > To: Yun

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-10-10 Thread Vijay Bhaskar
ter-id, it could > recover jobs and checkpoints from ZooKeeper. I think it has been supported > for a long time. Maybe you could have a > try with Flink 1.8 or 1.9. > Best, > Yang > Vijay Bhaskar wrote on Thursday, October 10, 2019, 2:26 PM: >> Thanks Yang and Sean. I ha

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-10-09 Thread Vijay Bhaskar
>> promising enough option that we're going to run in HA for a month or two >> and monitor results before we put in any extra work to customize the >> savepoint start-up behavior. >> On Fri, Sep 27, 2019 at 2:24 AM Vijay Bhaskar >> wrote: >>

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-09-26 Thread Vijay Bhaskar
I don't think HA will help to recover from a cluster crash; for that we should take periodic savepoints, right? Please correct me in case I am wrong. Regards Bhaskar On Fri, Sep 27, 2019 at 11:48 AM Vijay Bhaskar wrote: > Suppose my cluster got crashed and we need to bring up the entire cluste

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-09-26 Thread Vijay Bhaskar
Suppose my cluster crashed and I need to bring the entire cluster back up. Does HA still help to run the cluster from the latest savepoint? Regards Bhaskar On Thu, Sep 26, 2019 at 7:44 PM Sean Hester wrote: > thanks to everyone for all the replies. > i think the original concern here with "ju

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-09-25 Thread Vijay Bhaskar
One way you could do it is to have a separate cluster/job manager program in Kubernetes which actually manages the jobs, so that you can decouple the job control. While restarting the job, make sure to follow the steps below: a) First the job manager takes a savepoint by killing the job and notes d

Re: Recommended approach to debug this

2019-09-22 Thread Vijay Bhaskar
One more suggestion is to run the same job in a regular 2-node cluster and see whether you get the same exception, so that you can narrow down the issue more easily. Regards Bhaskar On Mon, Sep 23, 2019 at 7:50 AM Zili Chen wrote: > Hi Debasish, > > As mentioned by Dian, it is an internal e

Re: Running flink examples

2019-09-18 Thread Vijay Bhaskar
Can you check whether it is able to read the supplied input file properly or not? Regards Bhaskar On Wed, Sep 18, 2019 at 1:07 PM RAMALINGESWARA RAO THOTTEMPUDI < tr...@iitkgp.ac.in> wrote: > Hi Sir, > > I am trying to run the Flink programs, particularly PageRank. > > I have used the following com

Re: externalizing config files for the flink class loader

2019-09-12 Thread Vijay Bhaskar
= overrideConfigFromArgs.withFallback(defaultConfig) // this overrides your default configuration params On Fri, Sep 13, 2019 at 11:38 AM Vijay Bhaskar wrote: > Hi > You can do it this way: > Use Typesafe Config, which provides excellent configuration > methodologies. > You

Re: externalizing config files for the flink class loader

2019-09-12 Thread Vijay Bhaskar
Hi, You can do it this way: use Typesafe Config, which provides excellent configuration methodologies. You supply a default configuration, which is read by your application through Typesafe's reference.conf file. If you want to override any of the defaults, you can supply them as command line argument
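A small sketch of that pattern with Typesafe Config (file name and keys are examples): defaults come from reference.conf on the classpath, and anything given in an override file wins via withFallback:

    import com.typesafe.config.Config;
    import com.typesafe.config.ConfigFactory;
    import java.io.File;

    public class AppConfig {
        public static Config load(String overrideFile) {
            Config defaults = ConfigFactory.load();        // reads reference.conf from the classpath
            if (overrideFile == null) {
                return defaults;
            }
            Config overrides = ConfigFactory.parseFile(new File(overrideFile));
            // Keys present in the override file win; everything else falls back to the defaults.
            return overrides.withFallback(defaults).resolve();
        }
    }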

Re: Checkpointing is not performing well

2019-09-11 Thread Vijay Bhaskar
(for serialization) during checkpoints. > Fabian > On Wed., Sept. 11, 2019 at 09:38, Ravi Bhushan Ratnakar < ravibhushanratna...@gmail.com> wrote: >> What is the upper limit of the checkpoint size of the Flink system? >> Regards, >> Ravi >>

Re: Checkpointing is not performing well

2019-09-10 Thread Vijay Bhaskar
sible > like every 5 seconds. > Thanks, > Ravi > On Tue 10 Sep, 2019, 11:37 Vijay Bhaskar, > wrote: >> For me the task count seems to be huge in number for the mentioned resource >> count. To rule out the possibility of an issue with the state backend can you >>

Re: Checkpointing is not performing well

2019-09-10 Thread Vijay Bhaskar
To me the task count seems huge for the mentioned resource count. To rule out the possibility of an issue with the state backend, you can start writing sink data to a no-op, i.e., a data-ignore sink, and try whether you can run it for a longer duration without any issue. You can then start decreasing the
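For the "ignore sink" test, a sketch using Flink's built-in DiscardingSink (the stream's element type is a placeholder):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.sink.DiscardingSink;

    public class IgnoreSink {
        public static <T> void attach(DataStream<T> stream) {
            // Drops every record, removing sink back-pressure so that
            // state-backend and checkpoint behaviour can be observed in isolation.
            stream.addSink(new DiscardingSink<>()).name("ignore-sink");
        }
    }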

Re: K8s job cluster and cancel and resume from a save point ?

2019-03-12 Thread Vijay Bhaskar
he.flink.runtime.entrypoint.ClusterEntrypoint - > Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with > exit code 1444. > That I think is an issue. A cancelled job is a complete job and thus the > exit code should be 0 for k8s to mark it complete. >

Re: K8s job cluster and cancel and resume from a save point ?

2019-03-12 Thread Vijay Bhaskar
> "hdfs:///tmp/xyz14/savepoint-00-1d4f71345e22", "--job-classname", . > - Restart: > kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml >

Re: K8s job cluster and cancel and resume from a save point ?

2019-03-12 Thread Vijay Bhaskar
th cancel is a single atomic step, > rather than a savepoint *followed* by a cancellation (else why would > that be an option). > Thanks again. > On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar wrote: >> Hi Vishal, >> yarn-cancel doesn't m

Re: K8s job cluster and cancel and resume from a save point ?

2019-03-12 Thread Vijay Bhaskar
000 from the resource manager. > 2019-03-12 08:10:44,477 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader//job_manager_

Re: K8s job cluster and cancel and resume from a save point ?

2019-03-12 Thread Vijay Bhaskar
Hi Vishal, Savepoint with cancellation internally uses the /cancel REST API, which is not a stable API and always exits with 404. The best way to issue it is: a) First issue the savepoint REST API. b) Then issue the /yarn-cancel REST API (as described in http://mail-archives.apache.org/mod_mbox/flink-user/201804
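A sketch of that two-step sequence against the JobManager REST API (host, job id and target directory are placeholders; the savepoint trigger is asynchronous, so the returned trigger id has to be polled before cancelling):

    # a) Trigger a plain savepoint (no cancel-job).
    curl -s -X POST http://<jobmanager>:8081/jobs/<job-id>/savepoints \
      -H 'Content-Type: application/json' \
      -d '{"target-directory": "hdfs:///flink/savepoints", "cancel-job": false}'

    # ... poll /jobs/<job-id>/savepoints/<trigger-id> until the savepoint completes ...

    # b) Then cancel through the yarn-cancel endpoint mentioned above.
    curl -s http://<jobmanager>:8081/jobs/<job-id>/yarn-cancel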

Re: Connection leak with flink elastic Sink

2018-12-14 Thread Vijay Bhaskar
el for logging to see if there is > anything suspicious? > > Cheers, > Gordon > > > On 13 December 2018 at 7:35:33 PM, Vijay Bhaskar (bhaskar.eba...@gmail.com) > wrote: > > Hi Gordon, > We are using flink cluster 1.6.1, elastic search connector version: > flink-connec

Re: Connection leak with flink elastic Sink

2018-12-13 Thread Vijay Bhaskar
13 December 2018 at 5:59:34 PM, Chesnay Schepler (ches...@apache.org) > wrote: > > Specifically which connector are you using, and which Flink version? > > On 12.12.2018 13:31, Vijay Bhaskar wrote: > > Hi > > We are using flink elastic sink which streams at the rate of 1000

Connection leak with flink elastic Sink

2018-12-12 Thread Vijay Bhaskar
Hi, We are using the Flink Elasticsearch sink, which streams at the rate of 1000 events/sec, as described in https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/elasticsearch.html . We are observing a connection leak of Elasticsearch connections: after a few minutes all the open connections are exceedi

Re: ***UNCHECKED*** Re: Checkpointing not working

2018-09-19 Thread Vijay Bhaskar
Can you please check the following document and verify whether you have enough network bandwidth to support a 30-second checkpoint interval's worth of streaming data? https://data-artisans.com/blog/how-to-size-your-apache-flink-cluster-general-guidelines Regards Bhaskar On Wed, Sep 19, 2018 at

Re: Question regarding state cleaning using timer

2018-09-17 Thread Vijay Bhaskar
s://training.data-artisans.com/ > [3] > https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/process_function.html > > On Sep 17, 2018, at 8:50 AM, Vijay Bhaskar > wrote: > > Thanks Hequn. But i want to give random TTL for each partitioned key. How > can

Re: Question regarding state cleaning using timer

2018-09-16 Thread Vijay Bhaskar
Thanks Hequn. But I want to give a random TTL for each partitioned key. How can I achieve it? Regards Bhaskar On Mon, Sep 17, 2018 at 7:30 AM Hequn Cheng wrote: > Hi Bhaskar, > > You need to change nothing if you want to handle multiple keys. Flink will do it > for you. The ValueState is a keyed state.
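A sketch of per-key random TTL with a KeyedProcessFunction and timers (the key/value types and the TTL range are assumptions): every key registers its own processing-time timer and clears its state when that timer fires:

    import java.util.concurrent.ThreadLocalRandom;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    public class RandomTtlFunction extends KeyedProcessFunction<String, String, String> {
        private transient ValueState<String> lastValue;

        @Override
        public void open(Configuration parameters) {
            lastValue = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("last-value", String.class));
        }

        @Override
        public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
            lastValue.update(value);
            // Random TTL between 1 and 10 minutes, chosen independently per key.
            long ttlMillis = ThreadLocalRandom.current().nextLong(60_000, 600_000);
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + ttlMillis);
            out.collect(value);
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
            // Drop this key's state once its (random) TTL expires.
            lastValue.clear();
        }
    }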

Re: ListState - elements order

2018-09-14 Thread Vijay Bhaskar
How would it be to use ValueState with the values as a single string separated by a delimiter, so that order will never be a problem? The only overhead is to split on the delimiter, read the elements and convert them into primitive types where necessary. It is just a workaround; in case it doesn't suit your requirements p
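A sketch of that workaround (the delimiter choice and the String element type are assumptions; ListState is usually still the cleaner option when it fits):

    import org.apache.flink.api.common.state.ValueState;

    public class DelimitedListHelper {
        private static final String DELIMITER = "\u0001";  // assumed never to occur in the data

        // Append one element, preserving insertion order inside a single string.
        public static void append(ValueState<String> state, String element) throws Exception {
            String current = state.value();
            state.update(current == null ? element : current + DELIMITER + element);
        }

        // Read the elements back in the order they were appended.
        public static String[] read(ValueState<String> state) throws Exception {
            String current = state.value();
            return current == null ? new String[0] : current.split(DELIMITER);
        }
    }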