Re: Frequent exceptions killing streaming job

2016-02-26 Thread Nick Dimiduk
Sorry I wasn't clear. No, the lock contention is not in Flink.

Re: Frequent exceptions killing streaming job

2016-02-26 Thread Stephan Ewen
Was the contended lock part of Flink's runtime, or the application code? If it was part of the Flink Runtime, can you share what you found?

Re: Frequent exceptions killing streaming job

2016-02-25 Thread Nick Dimiduk
For what it's worth, I dug into the TM logs and found that this exception was not the root cause, merely a symptom of other backpressure building in the flow (actually, lock contention in another part of the stack). While Flink was helpful in finding and bubbling up this stack to the UI, it was ultimately …

Re: Frequent exceptions killing streaming job

2016-01-20 Thread Robert Metzger
Hey Nick, I had a discussion with Stephan Ewen on how we could resolve the issue. I filed a JIRA with our suggested approach: https://issues.apache.org/jira/browse/FLINK-3264
By handling this directly in the KafkaConsumer, we would avoid fetching data we cannot handle anyway (discarding in the …
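
To make that concrete: a minimal sketch of a job wired for fallback-to-latest with the 0.8 connector, assuming auto.offset.reset is honored for out-of-range fetches (roughly the behavior the JIRA proposes). Topic, group id, and addresses below are placeholders.

import java.util.Properties;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class OffsetResetSketch {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    Properties props = new Properties();
    props.setProperty("zookeeper.connect", "localhost:2181"); // required by the 0.8 consumer
    props.setProperty("bootstrap.servers", "localhost:9092");
    props.setProperty("group.id", "my-consumer-group");       // placeholder
    // 0.8-era values are "smallest"/"largest" (newer clients use
    // "earliest"/"latest"): when the requested offset has already been
    // deleted by retention, jump to the newest available record instead
    // of failing the fetch.
    props.setProperty("auto.offset.reset", "largest");

    DataStream<String> stream = env.addSource(
        new FlinkKafkaConsumer08<>("my-topic", new SimpleStringSchema(), props));
    stream.print();
    env.execute("offset-reset-sketch");
  }
}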

Re: Frequent exceptions killing streaming job

2016-01-17 Thread Nick Dimiduk
On Sunday, January 17, 2016, Stephan Ewen wrote:
> I agree, real time streams should never go down.
Glad to hear that :)
[snip]
> Both should be supported.
Agreed.
> Since we interpret streaming very broadly (also including analysis of historic streams or timely data), the "backpressure …

Re: Frequent exceptions killing streaming job

2016-01-17 Thread Stephan Ewen
Hi Nick! I agree, real time streams should never go down. Whether you want to allow the stream processor to temporarily fall behind (back pressure on an event spike) and catch up a bit later, or whether you want to be always at the edge of real time and drop messages, is use-case specific. Both should be supported. Since we interpret streaming very broadly (also including analysis of historic streams or timely data), the "backpressure …

Re: Frequent exceptions killing streaming job

2016-01-16 Thread Nick Dimiduk
This goes back to the idea that streaming applications should never go down. I'd much rather consume at max capacity and knowingly drop some portion of the incoming pipe than have the streaming job crash. Of course, once the job itself is robust, I still need the runtime to be robust -- YARN vs …
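
One crude way to express that drop-rather-than-crash policy inside the job itself is a rate-capped filter placed ahead of the expensive operators. A sketch only: the class name and per-second budget are assumptions, and the cap applies per parallel subtask.

import org.apache.flink.api.common.functions.FilterFunction;

// Deliberate load shedding: keep at most maxPerSecond records per second
// (per parallel subtask) and knowingly drop the rest.
public class LoadSheddingFilter implements FilterFunction<String> {
  private final long maxPerSecond;
  private long windowStart;
  private long count;

  public LoadSheddingFilter(long maxPerSecond) {
    this.maxPerSecond = maxPerSecond;
  }

  @Override
  public boolean filter(String record) {
    long now = System.currentTimeMillis();
    if (now - windowStart >= 1000L) { // start a new one-second budget window
      windowStart = now;
      count = 0L;
    }
    return ++count <= maxPerSecond;   // beyond the budget, drop the record
  }
}

// Usage: stream.filter(new LoadSheddingFilter(10_000))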

Re: Frequent exceptions killing streaming job

2016-01-16 Thread Stephan Ewen
@Robert: Is it possible to add a "fallback" strategy to the consumer? Something like "if offsets cannot be found, use latest"? I would make this an optional feature to activate. I would think it is quite surprising to users if records start being skipped in certain situations. But I can see that …
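
To illustrate the shape such an optional fallback could take, a hypothetical sketch; the enum, method, and class names are invented for this example and are not Flink API. The default fails loudly, so records are never skipped unless the policy is explicitly chosen.

// Hypothetical sketch; none of these names exist in Flink.
public class OffsetFallbackSketch {

  enum OffsetFallback { FAIL, USE_EARLIEST, USE_LATEST }

  static long resolveStartOffset(Long committed, long earliest, long latest,
                                 OffsetFallback fallback) {
    // A committed offset is usable only if the broker still retains it.
    if (committed != null && committed >= earliest && committed <= latest) {
      return committed;
    }
    switch (fallback) {
      case USE_LATEST:   return latest;   // stay near real time, skip the gap
      case USE_EARLIEST: return earliest; // replay whatever is still retained
      default:
        throw new IllegalStateException(
            "Committed offset " + committed + " outside retained range ["
                + earliest + ", " + latest + "]");
    }
  }

  public static void main(String[] args) {
    // Retention deleted everything below offset 5000; consumer committed 1200.
    System.out.println(
        resolveStartOffset(1200L, 5000L, 9000L, OffsetFallback.USE_LATEST)); // 9000
  }
}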

Re: Frequent exceptions killing streaming job

2016-01-16 Thread Robert Metzger
Hi Nick, I'm sorry you ran into the issue. Is it possible that Flink's Kafka consumer falls so far behind in the topic that the offsets it's requesting are invalid? For that, Kafka's retention time has to be pretty short. Skipping records under load is something currently not supported by Flink …
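
A back-of-the-envelope sketch of that scenario; every number below is an assumption for illustration.

public class RetentionMath {
  public static void main(String[] args) {
    long recordsPerSec = 50_000;       // assumed ingest rate of the topic
    long retentionSec = 2 * 60 * 60;   // assumed 2h log retention
    long consumerLagSec = 3 * 60 * 60; // consumer fell 3h behind under load

    // Records deleted by retention ahead of the consumer's position:
    long gap = recordsPerSec * (consumerLagSec - retentionSec);
    System.out.println("Next requested offset trails the earliest retained one by "
        + gap + " records -> OffsetOutOfRange");
  }
}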