>
> We ran into this when outlinks from web pages caused fan-out/amplification of
> the data being iterated, but maybe you hit it with restoring from state.
>
> — Ken
>
>
>> On May 4, 2020, at 6:41 PM, Ashish Pokharel <ashish...@yahoo.com>
Could this be FLIP-15 related as well then?
> On May 4, 2020, at 9:41 PM, Ashish Pokharel wrote:
>
> Hi all,
>
> Hope everyone is doing well!
>
> I am running into what seems like a deadlock (application stalled) situation
> with a Flink streaming job upon restore f
Hi all,
Hope everyone is doing well!
I am running into what seems like a deadlock (application stalled) situation
with a Flink streaming job upon restore from savepoint. The job has a slowly moving
stream (S1) that needs to be “stateful” and a continuous stream (S2) which is
“joined” with the slow mov
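For anyone reading the archive, a minimal sketch of this kind of two-stream topology, assuming the DataStream API (Flink 1.8+ for KeyedCoProcessFunction); the Event class, field names and wiring below are illustrative placeholders, not taken from the original job:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

// S1 updates keyed state slowly; S2 is continuously enriched against that state.
public class SlowStreamEnrichment
        extends KeyedCoProcessFunction<String, SlowStreamEnrichment.Event, SlowStreamEnrichment.Event, String> {

    // Hypothetical event type: a key plus a payload.
    public static class Event {
        public String key;
        public String payload;
    }

    private transient ValueState<String> reference;

    @Override
    public void open(Configuration parameters) {
        reference = getRuntimeContext().getState(
                new ValueStateDescriptor<>("reference", String.class));
    }

    @Override
    public void processElement1(Event s1, Context ctx, Collector<String> out) throws Exception {
        reference.update(s1.payload);            // slow stream: refresh per-key state
    }

    @Override
    public void processElement2(Event s2, Context ctx, Collector<String> out) throws Exception {
        String ref = reference.value();          // continuous stream: read the latest per-key state
        out.collect(s2.payload + "|" + (ref == null ? "n/a" : ref));
    }

    // Wiring: key both inputs by the same field so state is shared per key.
    public static DataStream<String> wire(DataStream<Event> s1, DataStream<Event> s2) {
        return s1.connect(s2)
                 .keyBy(e -> e.key, e -> e.key)
                 .process(new SlowStreamEnrichment());
    }
}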
Thanks Becket,
Sorry for the delayed response. That’s what I thought as well. I built a hacky
custom source today directly using the Kafka client, which was able to join the
consumer group etc. It works as I expected, but I am not sure about production
readiness for something like that :)
The need arises beca
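For reference, a rough sketch of what such a "hacky" source could look like - wrapping the plain Kafka client in a SourceFunction so it joins a consumer group itself. Topic, group and broker names are placeholders, offsets are committed by the Kafka client rather than tied to Flink checkpoints, and poll(Duration) assumes a reasonably recent kafka-clients version - hence the production-readiness caveat:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Joins a Kafka consumer group directly; offsets live in Kafka (auto-commit),
// not in Flink state, so there is no exactly-once story here.
public class GroupManagedKafkaSource implements SourceFunction<String> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<String> ctx) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092");        // placeholder
        props.setProperty("group.id", "my-consumer-group");           // placeholder
        props.setProperty("enable.auto.commit", "true");
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // group-managed assignment
            while (running) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    synchronized (ctx.getCheckpointLock()) {
                        ctx.collect(record.value());
                    }
                }
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}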
Andrey, Till,
This doesn’t jibe with what I have noticed (I fully acknowledge that I am still
getting the hang of the framework). I sent a couple of notes on this in earlier
threads.
With this very simple processing, I am running into a slow creep-up of memory
with unbounded keys, which eventually en
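One remedy that is often suggested for unbounded key spaces like this is state TTL (Flink 1.6+), sketched below with arbitrary names and timeouts - not necessarily what was done in this job:

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Per-key state expires 6 hours (arbitrary example) after the last write,
// so an unbounded key space does not grow state forever.
public class TtlKeyedFunction extends RichFlatMapFunction<String, String> {

    private transient ValueState<Long> lastSeen;

    @Override
    public void open(Configuration parameters) {
        StateTtlConfig ttl = StateTtlConfig
                .newBuilder(Time.hours(6))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();

        ValueStateDescriptor<Long> desc = new ValueStateDescriptor<>("lastSeen", Long.class);
        desc.enableTimeToLive(ttl);
        lastSeen = getRuntimeContext().getState(desc);
    }

    @Override
    public void flatMap(String value, Collector<String> out) throws Exception {
        lastSeen.update(System.currentTimeMillis());   // any real bookkeeping would go here
        out.collect(value);
    }
}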
All,
We recently moved our Checkpoint directory from HDFS to local SSDs mounted on
Data Nodes (we were starting to see perf impacts on checkpoints etc as complex
ML apps were spinning up more and more in YARN). This worked great other than
the fact that when jobs are being canceled or canceled
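For context, pointing checkpoints at a local mount instead of HDFS boils down to a state backend URI change along these lines (the path is a hypothetical SSD mount; note that node-local checkpoints are not reachable from other nodes during recovery):

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LocalCheckpointDirExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Hypothetical local-SSD mount; previously this would have been an hdfs:// URI.
        env.setStateBackend(new FsStateBackend("file:///mnt/ssd/flink-checkpoints"));
        env.enableCheckpointing(60_000);   // e.g. checkpoint every 60 seconds
        // ... build the job graph, then env.execute() ...
    }
}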
:28 PM, Ashish Pokharel wrote:
>
> Hi Till, Fabian,
>
> Thanks for your responses again.
>
> Till, you have nailed it. I will comment on them individually. But first, I
> feel like I am still not stating it well enough to illustrate the need. Maybe
> I am overthinkin
One more attempt to get some feedback on this. It basically boils down to using
the high-level Window API in scenarios where keys are unbounded / infinite but can
be expired after a certain time. From what we have observed (solution 2 below),
some properties of keys are still in state (guessing key it
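The timer-based alternative that usually comes up for "unbounded keys that can be expired" looks roughly like the sketch below (assuming Flink 1.6+ for timer deletion; timeout value and names are arbitrary). The window-API variants discussed in the thread are not shown:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Drops per-key state once a key has been idle for IDLE_TIMEOUT_MS.
public class ExpiringKeyedFunction extends KeyedProcessFunction<String, String, String> {

    private static final long IDLE_TIMEOUT_MS = 30 * 60 * 1000;   // arbitrary example
    private transient ValueState<Long> cleanupTimer;

    @Override
    public void open(Configuration parameters) {
        cleanupTimer = getRuntimeContext().getState(
                new ValueStateDescriptor<>("cleanupTimer", Long.class));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        // Push the cleanup deadline out every time the key is seen.
        Long previous = cleanupTimer.value();
        if (previous != null) {
            ctx.timerService().deleteProcessingTimeTimer(previous);
        }
        long deadline = ctx.timerService().currentProcessingTime() + IDLE_TIMEOUT_MS;
        ctx.timerService().registerProcessingTimeTimer(deadline);
        cleanupTimer.update(deadline);

        out.collect(value);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
        cleanupTimer.clear();   // the key went idle: release everything held for it
    }
}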
Thanks - I will wait for Stefan’s comments before I start digging in.
> On Jul 4, 2018, at 4:24 AM, Fabian Hueske wrote:
>
> Hi Ashish,
>
> I think we don't want to make it an official public API (at least not at this
> point), but maybe you can dig into the internal API and leverage it for yo
> Regarding the savepoints, if you are using the MemoryStateBackend, failure at
> too large a state size is expected since all state is replicated into the
> JobManager JVM.
> Did you try to use the FsStateBackend? It also holds the state on the
> TaskManager heap but backs it up to a (
Hi Fabian,
Thanks for the prompt response and apologies for my delayed reply.
You wrapped up the bottom lines pretty well - if I were to wrap it up I’d say
“best possible” recovery on “known” restarts, either, say, manual cancel + start
OR framework-initiated ones like on operator failures with t
Hadoop being referenced by the “hadoop” command.
>
> — Ken
>
>
>> On Mar 22, 2018, at 7:05 PM, Ashish Pokharel <ashish...@yahoo.com> wrote:
>>
>> Hi All,
>>
>> Looks like we are out of the woods for now (so we think) - we went wit
le or solves the problem) one
> could only do local checkpoints and not write to the distributed persistent
> storage. That would bring down checkpointing costs and the recovery life
> cycle would not need to be radically changed.
>
> Best, Fabian
>
> 2018-03-20 22:56 GM
Hi All,
Looks like we are out of the woods for now (so we think) - we went with the
Hadoop-free version and relied on client libraries on the edge node.
However, I am still not very confident, as I started digging into that stack as
well and realized what Till pointed out (traces lead to a class that
I definitely like the idea of event based checkpointing :)
Fabian, I do agree with your point that it is not possible to take a rescue
checkpoint consistently. The basis here however is not around the operator that
actually failed. It’s to avoid data loss across 100s (probably 1000s of
paralle
overy+from+Task+Failures>
>
> Stefan cc'ed might be able to give you some pointers about configuration.
>
> Best,
> Aljoscha
>
>
>> On 6. Mar 2018, at 22:35, Ashish Pokharel <ashish...@yahoo.com> wrote:
>>
>> Hi Gordon,
Hi Gordon,
The issue really is that we are trying to avoid checkpointing, as the datasets are
really heavy and all of the states are really transient in a few of our apps
(flushed within a few seconds). So the high volume/velocity and transient nature of
the state make those apps good candidates to just not have ch
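In code terms this amounts to simply never enabling checkpointing and relying on a restart strategy, so a failed job comes back with empty state rather than restoring - a rough sketch, with arbitrary restart values:

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class NoCheckpointJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpointing is deliberately never enabled: the keyed state is transient,
        // so after a failure the job restarts with empty state instead of restoring.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));
        // ... sources, transient keyed state, sinks, then env.execute("no-checkpoint job") ...
    }
}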
ditional statement about something going "horribly
> wrong"). Under workload pressure, I reverted to 1.3.2 where everything works
> perfectly, but we will try again soon on 1.4. When we do I will post the
> actual log output.
>
> This was on YARN in AWS, with akka.as
FYI,
I think I have gotten to the bottom of this situation. For anyone who might be in
this situation, hopefully my observations will help.
In my case, it had nothing to do with the Flink Restart Strategy; it was doing its
thing as expected. The issue really was the Kafka Producer timeout counters. As I
mentione
Fabian,
Thanks for your feedback - very helpful as usual !
This is sort of becoming a huge problem for us right now because of our Kafka
situation. For some reason I missed this detail going through the docs. We are
definitely seeing a heavy dose of data loss when Kafka timeouts are happening.
I haven’t gotten much further with this. It doesn’t look GC related - at
least the GC counters were not that atrocious. However, my main concern was: once
the load subsides, why aren’t the TM and JM connecting again? That doesn’t look
normal. I could definitely tell the JM was listening on the port and f
Gordon, Tony,
Thought I would chime in real quick as I have tested this a few different ways
in the last month (not sure if this will be helpful but thought I’d throw it
out there). I actually haven’t noticed an issue auto-committing with any of those
configs using the Kafka property auto.offset.reset
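For readers hitting the same question, auto.offset.reset is passed through the consumer Properties and only kicks in when the group has no committed offset - a sketch assuming the 0.10 connector discussed in this thread, with placeholder broker/topic/group names:

import java.util.Properties;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class ConsumerConfigExample {

    public static FlinkKafkaConsumer010<String> buildConsumer() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092");    // placeholder
        props.setProperty("group.id", "my-group");                 // placeholder
        // Only consulted when the group has no committed offset for a partition.
        props.setProperty("auto.offset.reset", "latest");

        FlinkKafkaConsumer010<String> consumer =
                new FlinkKafkaConsumer010<>("my-topic", new SimpleStringSchema(), props);
        consumer.setStartFromGroupOffsets();              // the default start position
        // With checkpointing enabled, offsets are committed back to Kafka on checkpoints.
        consumer.setCommitOffsetsOnCheckpoints(true);
        return consumer;
    }
}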
All,
Hopefully this is a quick one. I enabled the Graphite reporter in my app and
started to see the following warning messages all over the place:
2017-11-07 20:54:09,288 WARN org.apache.flink.runtime.metrics.MetricRegistry
- Error while registering metric.
java.lang.IllegalArgume
Hi Gordon,
Any further thoughts on this?
I forgot to mention that I am using Flink 1.3.2 and our Kafka is 0.10. We are in
the process of upgrading Kafka but it won’t be in Prod for at least a couple of months.
Thanks, Ashish
> On Nov 8, 2017, at 9:35 PM, Ashish Pokharel wrote:
>
>
ot cause is still along the lines of the producer
> side.
>
> Would you happen to have any logs that maybe show any useful information on
> the producer side?
> I think we might have a better chance of finding out what is going on by
> digging there.
> Also, which Flink versi
All,
I am starting to notice a strange behavior in a particular streaming app. I
initially thought it was a Producer issue, as I was seeing timeout exceptions
(records expiring in queue). I did try to modify request.timeout.ms, linger.ms
etc to help with the issue if it were caused by a sudden bu
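For reference, the knobs mentioned here go into the producer Properties of the Kafka sink - a sketch assuming the 0.10 connector, with placeholder names and example values only, not recommendations:

import java.util.Properties;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class ProducerConfigExample {

    public static FlinkKafkaProducer010<String> buildProducer() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092");   // placeholder
        props.setProperty("request.timeout.ms", "120000");  // client-side request/batch expiry window
        props.setProperty("linger.ms", "50");                // small batching delay
        props.setProperty("retries", "5");
        props.setProperty("batch.size", "65536");

        FlinkKafkaProducer010<String> producer =
                new FlinkKafkaProducer010<>("my-topic", new SimpleStringSchema(), props);
        // false (the default) means an expired/failed record fails the sink and triggers
        // recovery instead of being logged and silently dropped.
        producer.setLogFailuresOnly(false);
        producer.setFlushOnCheckpoint(true);
        return producer;
    }
}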
is that switching to the RocksDBStateBackend could already
> solve your problems. If this should not be the case, then please let me know
> again.
>
> Cheers,
> Till
>
> On Fri, Oct 27, 2017 at 5:27 AM, Ashish Pokharel <ashish...@yahoo.com> wrote:
>
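For completeness, switching to RocksDB is a one-line state backend change (requires the flink-statebackend-rocksdb dependency; the checkpoint URI is a placeholder):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbBackendExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Placeholder checkpoint URI; 'true' turns on incremental checkpoints so only
        // changed RocksDB files are uploaded on each checkpoint.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));
        env.enableCheckpointing(60_000);
        // ... job definition, then env.execute() ...
    }
}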
Hi Everyone,
We have hit a roadblock moving an app at Production scale and were hoping to get
some guidance. The application is a pretty common use case in stream processing but
does require maintaining a large number of keyed states. We are processing 2
streams - one of which is a daily burst of stream