Re: Restore from savepoint with Iterations

2020-05-04 Thread Ashish Pokharel
ad. > > We ran into this when outlinks from web pages caused fan-out/amplification of > the data being iterated, but maybe you hit it with restoring from state. > > — Ken > > >> On May 4, 2020, at 6:41 PM, Ashish Pokharel > <mailto:ashish...@yahoo.com>>

Re: Restore from savepoint with Iterations

2020-05-04 Thread Ashish Pokharel
Could this be FLIP-15 related as well then? > On May 4, 2020, at 9:41 PM, Ashish Pokharel wrote: > > Hi all, > > Hope everyone is doing well! > > I am running into what seems like a deadlock (application stalled) situation > with a Flink streaming job upon restore f

Restore from savepoint with Iterations

2020-05-04 Thread Ashish Pokharel
Hi all, Hope everyone is doing well! I am running into what seems like a deadlock (application stalled) situation with a Flink streaming job upon restore from savepoint. Job has a slowly moving stream (S1) that needs to be “stateful” and a continuous stream (S2) which is “joined” with slow mov

Re: Kafka consumer behavior with same group Id, 2 instances of apps in separate cluster?

2019-09-03 Thread Ashish Pokharel
Thanks Becket, Sorry for delayed response. That’s what I thought as well. I built a hacky custom source today directly using Kafka client which was able to join consumer group etc. which works as I expected but not sure about production readiness for something like that :) The need arises beca

Re: Questions on Unbounded number of keys

2018-07-28 Thread Ashish Pokharel
Andrey, Till, This doesn’t jive with what I have noticed (fully acknowledge that I am still getting hang of the framework). I sent a couple of notes on this in earlier threads. With this very simple processing, I am running into slow creep up of memory with unbounded keys, which eventually en

Permissions to delete Checkpoint on cancel

2018-07-22 Thread Ashish Pokharel
All, We recently moved our Checkpoint directory from HDFS to local SSDs mounted on Data Nodes (we were starting to see perf impacts on checkpoints etc as complex ML apps were spinning up more and more in YARN). This worked great other than the fact that when jobs are being canceled or canceled

Re: IoT Use Case, Problem and Thoughts

2018-07-22 Thread Ashish Pokharel
:28 PM, Ashish Pokharel wrote: > > Hi Till, Fabian, > > Thanks for your responses again. > > Till, you have nailed it. I will comment on them individually. But first, I > feel like I am still not stating it well enough to illustrate the need. May > be I am overthinkin

Re: Memory Leak in ProcessingTimeSessionWindow

2018-07-22 Thread Ashish Pokharel
One more attempt to get some feedback on this. It basically boils down to using High-Level Window API in scenarios where keys are unbounded / infinite but can be expired after certain time. From what we have observed (solution 2 below), some properties of keys are still in state (guessing key it

Re: How to partition within same physical node in Flink

2018-07-04 Thread Ashish Pokharel
Thanks - I will wait for Stefan’s comments before I start digging in. > On Jul 4, 2018, at 4:24 AM, Fabian Hueske wrote: > > Hi Ashish, > > I think we don't want to make it an official public API (at least not at this > point), but maybe you can dig into the internal API and leverage it for yo

Re: IoT Use Case, Problem and Thoughts

2018-06-15 Thread Ashish Pokharel
; > Regarding the savepoints, if you are using the MemoryStateBackend failure at > too large state size is expected since all state is replicated into the > JobManager JVM. > Did you try to use the FsStateBackend? It also holds the state on the > TaskManager heap but backups it to a (

Re: IoT Use Case, Problem and Thoughts

2018-06-13 Thread Ashish Pokharel
Hi Fabian, Thanks for the prompt response and apologies for delayed response. You wrapped up the bottom lines pretty well - if I were to wrap it up I’d say “best possible” recovery on “known" restarts either say manual cancel + start OR framework initiated ones like on operator failures with t

Re: Error running on Hadoop 2.7

2018-03-25 Thread Ashish Pokharel
Hadoop being referenced by the “hadoop" command. > > — Ken > > >> On Mar 22, 2018, at 7:05 PM, Ashish Pokharel > <mailto:ashish...@yahoo.com>> wrote: >> >> Hi All, >> >> Looks like we are out of the woods for now (so we think) - we went wit

Re: Restart hook and checkpoint

2018-03-22 Thread Ashish Pokharel
le or solves the problem) one > could only do local checkpoints and not write to the distributed persistent > storage. That would bring down checkpointing costs and the recovery life > cycle would not need to be radically changed. > > Best, Fabian > > 2018-03-20 22:56 GM

Re: Error running on Hadoop 2.7

2018-03-22 Thread Ashish Pokharel
Hi All, Looks like we are out of the woods for now (so we think) - we went with Hadoop free version and relied on client libraries on edge node. However, I am still not very confident as I started digging into that stack as well and realized what Till pointed out (traces leads to a class that

Re: Restart hook and checkpoint

2018-03-20 Thread Ashish Pokharel
I definitely like the idea of event based checkpointing :) Fabian, I do agree with your point that it is not possible to take a rescue checkpoint consistently. The basis here however is not around the operator that actually failed. It’s to avoid data loss across 100s (probably 1000s of paralle

Re: Restart hook and checkpoint

2018-03-18 Thread Ashish Pokharel
overy+from+Task+Failures> > > Stefan cc'ed might be able to give you some pointers about configuration. > > Best, > Aljoscha > > >> On 6. Mar 2018, at 22:35, Ashish Pokharel > <mailto:ashish...@yahoo.com>> wrote: >> >> Hi Gordon, &

Re: Restart hook and checkpoint

2018-03-06 Thread Ashish Pokharel
Hi Gordon, The issue really is we are trying to avoid checkpointing as datasets are really heavy and all of the states are really transient in a few of our apps (flushed within few seconds). So high volume/velocity and transient nature of state make those app good candidates to just not have ch

Re: Task Manager detached under load

2018-02-05 Thread Ashish Pokharel
ditional statement about something going "horribly > wrong"). Under workload pressure, I reverted to 1.3.2 where everything works > perfectly, but we will try again soon on 1.4. When we do I will post the > actual log output. > > This was on YARN in AWS, with akka.as

Re: Understanding Restart Strategy

2018-01-24 Thread Ashish Pokharel
FYI, I think I have gotten to the bottom this situation. For anyone who might be in situation hopefully my observations will help. In my case, it had nothing to do with Flink Restart Strategy, it was doing it’s thing as expected. Issue really was, Kafka Producer timeout counters. As I mentione

Re: Kafka Producer timeout causing data loss

2018-01-24 Thread Ashish Pokharel
Fabian, Thanks for your feedback - very helpful as usual ! This is sort of becoming a huge problem for us right now because of our Kafka situation. For some reason I missed this detail going through the docs. We are definitely seeing heavy dose of data loss when Kafka timeouts are happening.

Re: Task Manager detached under load

2018-01-24 Thread Ashish Pokharel
I haven’t gotten much further with this. It doesn’t look like GC related - at least GC counters were not that atrocious. However, my main concern was once the load subsides why aren’t TM and JM connecting again? That doesn’t look normal. I could definitely tell JM was listening on the port and f

Re: kafka consumer client seems not auto commit offset

2017-11-15 Thread Ashish Pokharel
Gordon, Tony, Thought I would chime in real quick as I have tested this a few different ways in the last month (not sure if this will be helpful but thought I’d throw it out there). I actually haven’t noticed issue auto committing with any of those configs using Kafka property auto.offset.reset

Metric Registry Warnings

2017-11-11 Thread Ashish Pokharel
All, Hopefully this is a quick one. I enabled Graphite reporter in my App and I started to see the following warning messages all over the place: 2017-11-07 20:54:09,288 WARN org.apache.flink.runtime.metrics.MetricRegistry - Error while registering metric. java.lang.IllegalArgume

Re: Kafka Consumer fetch-size/rate and Producer queue timeout

2017-11-11 Thread Ashish Pokharel
Hi Gordon, Any further thoughts on this? I forgot to mention I am using Flink 1.3.2 and our Kafka is 0.10. We are in the process of upgrading Kafka but won’t be in Prod for at least couple of months. Thanks, Ashish > On Nov 8, 2017, at 9:35 PM, Ashish Pokharel wrote: > >

Re: Kafka Consumer fetch-size/rate and Producer queue timeout

2017-11-08 Thread Ashish Pokharel
ot cause is still along the lines of the producer > side. > > Would you happen to have any logs that maybe shows any useful information on > the producer side? > I think we might have a better chance of finding out what is going on by > digging there. > Also, which Flink versi

Kafka Consumer fetch-size/rate and Producer queue timeout

2017-11-05 Thread Ashish Pokharel
All, I am starting to notice a strange behavior in a particular streaming app. I initially thought it was a Producer issue as I was seeing timeout exceptions (records expiring in queue. I did try to modify request.timeout.ms, linger.ms etc to help with the issue if it were caused by a sudden bu

Re: Capacity Planning For Large State in YARN Cluster

2017-10-29 Thread Ashish Pokharel
is that switching to the RocksDBStateBackend could already > solve your problems. If this should not be the case, then please let me know > again. > > Cheers, > Till > > On Fri, Oct 27, 2017 at 5:27 AM, Ashish Pokharel <mailto:ashish...@yahoo.com>> wrote: >

Capacity Planning For Large State in YARN Cluster

2017-10-26 Thread Ashish Pokharel
Hi Everyone, We have hit a roadblock moving an app at Production scale and was hoping to get some guidance. Application is pretty common use case in stream processing but does require maintaining large number of keyed states. We are processing 2 streams - one of which is a daily burst of stream